The Basics of R


Informal R Workshop
Day 1 - July 5, 2016

R. Chris Fraley - University of Illinois at Urbana-Champaign

What is R and Why Should you use it?

R is an open-source software package for mathematical and statistical computing.

Installing R

https://www.r-project.org/

Download the binaries for the base package

Direct link to a USA CRAN mirror (choose R for Windows): http://ftp.ussg.iu.edu/CRAN/

Overview of the basic environment

Console Window

Script Window

Some people dislike the native R interface. RStudio is a popular, free alternative.

Some basic operations

Simple math - R as a calculator


	1 + 2 * 3
	1 - (4/2)
	2^4

R has built-in functions. The general structure of a function is functionname(arg1, arg2, ...), where the arguments (or parameters) can be further specified. You can learn more about a function using ?functionname. You can also create your own functions. We'll discuss that later.

Some simple examples


	exp(5)
	sqrt(4)
	round(pi, digits=4)
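
As a small illustration of the functionname(arg1, arg2, ...) structure, the sketch below calls round() with its second argument matched by position and then by name. The object names a and b are made up for this example.

```r
# The second argument of round() can be given by position or by name
a = round(pi, 4)            # digits matched by position
b = round(pi, digits = 4)   # digits matched by name
a                           # 3.1416
identical(a, b)             # TRUE

# args() shows a function's argument list; ?round opens its help page
args(round)
```

Naming arguments tends to make code easier to read once a function takes more than one or two of them.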

Vectors

Vectors are a one-dimensional set of values that are all of the same type (e.g., number, string). Part of the power of R is the way it is able to perform vector operations efficiently.


	c(1,2,3,4,5)	# concatenate
	1:5

Simple operations on vectors


	sum(1:5)
	length(1:5)
	mean(1:5)
	sd(1:5)
	var(1:5)
	min(1:5)
	max(1:5)
	diff(1:5)

There are different classes of vectors (numeric, integers, logical, character, datetime, factors). We will deal with these later.
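
A quick way to see these classes is class(). The sketch below also shows what happens when types are mixed in one vector: because all values must share a type, R converts them to a common one. The object name mixed is made up for this example.

```r
class(c(1.5, 2))         # "numeric"
class(1:5)               # "integer"
class(c(TRUE, FALSE))    # "logical"
class(c("a", "b"))       # "character"

# Mixing types forces everything to one type (here, character)
mixed = c(1, "a", TRUE)
mixed                    # "1" "a" "TRUE"
class(mixed)             # "character"
```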

Objects

R stores variables, scalars, matrices, etc. as objects. They have properties and can be manipulated. They exist in the R environment--a workspace in which you can create objects and manipulate them.

Note: Assignment is via = or <-


	x = 3
	x = 1:5


	mean(x)
	m = mean(x)

Can see 'contents' of object by typing the name or print(x)

R is 'case sensitive'. x is not the same object as X. print() is not the same command as Print()

Data Modes

Common data modes are numeric and string/text/character

numeric


	x = 4

character (text, string)


	h = "Hello"

You can find the data mode of an object using mode() or can use str() to find type/structure/content of an object
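
Applied to the two objects created above, mode() and str() might be used like this:

```r
x = 4
h = "Hello"

mode(x)   # "numeric"
mode(h)   # "character"

str(x)    # num 4
str(h)    # chr "Hello"
```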

Using Scripts vs. the Console

The R Console is for doing one thing at a time. It is old school. As an alternative, you can write out all commands/code in a Script window. Easier to save and modify and share. File > New Script

You can run an entire script or parts of it by selecting the code and pressing Ctrl-R

Can use # to comment your code

Workspace

All objects we have created exist in our workspace

See contents of workspace via ls()

Remove an object with rm()

Can Save and Load your workspace. Doing so makes it easy to package objects and results together.

Note: Can also save (or load) your History: The commands you've used in a session.
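
The workspace commands above can also be run from code. The sketch below is one possible sequence, using made-up objects a and b and a temporary file so that nothing is written to your working directory. It saves a single object with save(); save.image() is the base R function that saves the whole workspace.

```r
# Hypothetical objects for illustration
a = 1:3
b = "text"

ls()          # lists the objects in the workspace, including a and b
rm(b)         # b is removed; a remains

# Save one object to a temporary file, remove it, then load it back
f = tempfile(fileext = ".RData")
save(a, file = f)
rm(a)
load(f)       # restores a under its original name
a
```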

Moving Beyond the Basics in Basic R

Vectors

Vectors are the primary way in which we store data for variables--whether those are empirical observations or simulated ones. Thus, it is necessary to be familiar with the way vectors work.

The power of vectorization


	x = 1:5
	x + 5
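
The same idea extends to operations between two vectors and to functions applied to a whole vector at once. A short sketch, where y is a made-up second vector:

```r
x = 1:5
y = c(10, 20, 30, 40, 50)

x + y      # element by element: 11 22 33 44 55
x * 2      # 2 4 6 8 10
x^2        # 1 4 9 16 25
sqrt(x)    # square root of each element
```

No loop is needed; the operation is applied to every element.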

Other vector creation tricks


	seq(from = -3, to = 3, by = .05)
	seq(from = -3, to = 3, length = 10)

	rep(0, times = 5)
	rep(1:3, times = 5)
	rep(1:3, each = 5)

Indexing in R


	x[10]	# NA, because x has only 5 elements
	x[2:4]
	x[c(1,4,5)]

Change values


	x[2] = NA

Remove


	x[-2]	# won't save without assignment
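
To make the point about assignment concrete, here is a small sketch with a made-up vector. A negative index only returns a copy without that element; the original changes only when the result is assigned back.

```r
x = c(10, 20, 30, 40, 50)

x[-2]        # shows 10 30 40 50, but x itself is unchanged
length(x)    # 5

x = x[-2]    # assign the result to actually drop the 2nd element
length(x)    # 4
```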

Logical Operations

Returns TRUE if the condition holds; FALSE otherwise.


	x == y	# is equal to
	x != y	# is not equal to
	x > y	# greater than
	x < y	# less than
	x >= y	# greater than or equal to
	x <= y	# less than or equal to
	x & y	# and
	x | y	# or

Which values of x are less than .05? Returns a vector of logicals


	x = seq(0,.10, length=10)
	x < .05

Which values of x are less than .05? Returns a vector of index values. Super useful.


	which(x < .05)

Useful summary operations


	sum(x < .05)	# how many elements tested true?
	any(x < .05)	# did any of them test true?
	all(x < .05)	# did all of them test true?
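
Putting the logical tools together, here is a sketch with a made-up vector of p-values (the name p and its values are invented for illustration):

```r
p = c(.01, .20, .04, .50, .03)

sig = p < .05    # TRUE FALSE TRUE FALSE TRUE
which(sig)       # 1 3 5
p[sig]           # the values themselves: 0.01 0.04 0.03

sum(sig)         # 3, because TRUE counts as 1 and FALSE as 0
any(sig)         # TRUE
all(sig)         # FALSE
```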

Missing Values

Many functions in R will return NA or stop with an error if there are missing values. You have to know in advance how to deal with this problem. And, for better or worse, different functions check for missing data differently.


	NA	# not available
	x = NA
	is.na(x)
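
As an illustration of how functions differ in handling missing data, the sketch below uses two made-up vectors. mean() takes an na.rm argument, whereas cor() takes a use argument.

```r
x = c(1, 2, NA, 4)
y = c(2, 4, 6, 9)

sum(is.na(x))                     # 1 missing value

mean(x)                           # NA
mean(x, na.rm = TRUE)             # mean of the non-missing values

cor(x, y)                         # NA
cor(x, y, use = "complete.obs")   # uses only the complete pairs
```

It is worth checking the help page (?functionname) for the relevant argument before running an analysis on data with missing values.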

Matrices

A matrix is a data structure of common data type (e.g., numbers) with rows and columns. Matrices are commonly used in simulation work.


	x = matrix(1:6, nrow=3, ncol=2, byrow=TRUE)

Show properties or dimensions


	str(x)		# structure
	dim(x)		# dimensions rows by cols
	length(x)	# total entries

Reference entries


	x[2,2]				# What is the value of the 2nd row, 2nd col?
	x[2,2] = 55			# change that value
	x[ ,2]				# show all rows of second col
	x[c(1,3), ]			# show 1st and third row of all cols
	x[1, ] = c(55,55)		# replace 1st row with new vector

Show or create diagonal of square matrix


	diag(x)
	diag(x) = 1

Add rows or cols via row bind and column bind functions


	x = rbind(x, c(44,44))
	x = cbind(x, 1)			# the 1 repeats here

Give rows and cols names


	colnames(x) = c("Anxiety","Avoidance","Depression")

This allows one, if desired, to reference the variable by name rather than number. Helpful for large data sets where you want to know something about a variable but don't know the col number without looking it up.


	x[,"Anxiety"]
	mean(x[,"Anxiety"])

Basic Matrix Operations


	x = matrix(1:4, 2, 2)
	x + 5
	t(x)			# transpose matrix
	x%*%t(x)		# matrix multiplication
	solve(x)		# find inverse of a square matrix
	diag(x)			# find the diagonal of a square matrix
	svd(x)			# singular value decomposition of matrix
	eigen(x)		# compute eigenvalues/vectors for matrix
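
One thing that trips people up is that * multiplies matrices element by element, while %*% does true matrix multiplication. The sketch below contrasts the two and checks the inverse from solve(); the name xinv is made up for this example.

```r
x = matrix(1:4, 2, 2)

x * x        # element-wise: each entry squared
x %*% x      # matrix multiplication

# A matrix times its inverse should give the identity matrix
xinv = solve(x)
round(x %*% xinv, 10)
```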

Data Frames

Like a matrix, but can hold multiple types of data (e.g., numeric, characters). You will mostly work with data frames when using empirical data. They are the most natural analog to a spreadsheet in Excel or SPSS.

Can convert a matrix into a data frame


	x.df = as.data.frame(x)
	x.df

Note: Variable labels are added by default


	dimnames(x.df)

These names can be changed if you wish


	colnames(x.df) = c("X1","X2")	# one name per column; x.df has two columns

You can reference a single variable from a data frame easily with names


	x.df$X1

Add variables to a dataframe


	x.df$X3 = c(4,4)	# one value per row; x.df has two rows

Some people find referencing a variable by first denoting the data structure in which it is contained cumbersome. (I like it, but typically use short names for my dataframes, such as "x" or "data".) An alternative is to "attach" a data frame so R treats it as the environment in which operations are performed.


	attach(x.df)
	mean(X1)

Now you can reference X1 directly by name rather than x.df$X1

If there are variables with names in the data frame that overlap with those in the global environment, the global variables have precedence. R will warn you of this.

You must "detach" the data frame when you're done or you'll create chaos.


	detach(x.df)
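
If you want the convenience of bare variable names without having to remember to detach, base R's with() is one alternative. It is not part of the workshop notes above; the sketch uses a made-up data frame called toy.df.

```r
# Hypothetical data frame for illustration
toy.df = data.frame(X1 = c(1, 2), X2 = c(3, 4))

# Evaluate an expression "inside" the data frame; nothing to detach afterwards
with(toy.df, mean(X1))        # 1.5
with(toy.df, X1 + X2)         # 4 6
```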

Importing Data

R can import data from a variety of sources. It is simplest, in my opinion, to create a comma-delimited file (csv) from any source (e.g., Excel) and import that into R.

But, you can also use libraries/packages to import data directly from SPSS or Excel files too. Read more here: http://www.r-tutor.com/r-introduction/data-frame/data-import

Import a txt/csv file from the interwebs

In this example, the data file has variable names in the first row (header=TRUE) and the entries are separated by commas (sep=",")


	mydata <- read.table("http://yourpersonality.net/R Workshop/example.csv",
	header=TRUE, sep=",")

Import txt file from local computer

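Note the doubled backslashes: inside an R string a single backslash is an escape character, so each one must be written twice. Forward slashes also work in Windows paths.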


	mydata = read.csv("C:\\Users\\rcfraley.UOFI\\Dropbox\\mydata.csv")  
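
If you do not have a csv file handy, you can try the import step with a round trip through a temporary file. This is only a sketch: the data frame demo, its column names, and its values are all made up.

```r
# Write a small made-up data set to a temporary csv file
tmp = tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:3, y = c(2.5, 3.1, 4.0)), tmp, row.names = FALSE)

# Read it back in; read.csv assumes header=TRUE and sep=","
demo = read.csv(tmp)
str(demo)      # check that the variables came in with the expected types
head(demo)     # look at the first few rows
```

Checking str() right after an import is a cheap way to catch numbers that were read in as text.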

Import a file from SPSS or Excel

To read a file in from SPSS, you have to first load the "foreign" library (it is included with standard R installations). Excel files require a different package, such as readxl. We will discuss libraries in more depth later; this is a placeholder.


	library("foreign")
	data<-read.spss("C:\\Users\\rcfraley.UOFI\\Dropbox\\someSPSSfile.sav")
	data<-data.frame(data)

See data in a spreadsheet-like way (the V must be capitalized)


	View(mydata)

See and edit data in a spreadsheet-like way


	fix(mydata)

Basic Data Analysis

Some Basic Data Analytic Functions and Examples in R

Summary Statistics (mean, med, max, min)


	summary(mydata)

Correlation


	cor(mydata$x, mydata$y)

Correlation matrix for selected variables


	cor(mydata[,c(4,5,6,7)])

t-test


	t.test(mydata$y ~ mydata$condition)

or


	t.test(y ~ condition, data=mydata)

more here: http://www.statmethods.net/stats/ttest.html

ANOVA (simple one-way)

more here: http://www.statmethods.net/stats/anova.html


	summary(aov(mydata$y ~ mydata$condition))

or


	summary(aov(y ~ condition, data=mydata))

Regression


	lm(y ~ x, data=mydata)
	summary(lm(y ~ x, data=mydata))

Multiple Regression


	lm(y ~ x1 + x2, data=mydata)


	lm(y ~ x1 + x2 + x1*x2, data=mydata)	# with interaction; same model as y ~ x1*x2
	summary(lm(y ~ x1 + x2 + x1*x2, data=mydata))
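
In R's formula syntax, x1*x2 already expands to x1 + x2 + x1:x2, so the long and short forms of the interaction model fit the same thing. The sketch below checks this on simulated data; the data frame sim, the seed, and the coefficients used to generate y are all invented for illustration.

```r
# Simulated data (made up for this example)
set.seed(2)
sim = data.frame(x1 = rnorm(50), x2 = rnorm(50))
sim$y = 1 + sim$x1 + sim$x2 + 0.5 * sim$x1 * sim$x2 + rnorm(50)

fit.long  = lm(y ~ x1 + x2 + x1*x2, data = sim)
fit.short = lm(y ~ x1*x2, data = sim)

coef(fit.long)                               # intercept, x1, x2, x1:x2
all.equal(coef(fit.long), coef(fit.short))   # same estimates either way
```

Saving the fitted model as an object (here fit.long) also lets you pull out pieces later with coef() or summary() instead of refitting.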

Standardize Variables


	scale(mydata$x)

If you want to save the standardized results, save the results as a new object.


	mydata$zx = scale(mydata$x)

Standardize multiple variables quickly using the apply function (applies a function to rows (1) or cols (2) of a matrix/frame)


	apply(mydata[4:7],2,scale)
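
To see what scale() is doing, the sketch below compares it with the z-score formula computed by hand. The vector v is made up for this example.

```r
v = c(2, 4, 6, 8, 10)

z = scale(v)                     # returns a one-column matrix
by.hand = (v - mean(v)) / sd(v)  # subtract the mean, divide by the SD

all.equal(as.numeric(z), by.hand)
mean(z)                          # 0 (up to rounding)
sd(z)                            # 1
```

Because scale() returns a matrix, wrapping it in as.numeric() can be handy when you want a plain vector.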

Basic Plots

Histogram


	hist(mydata$y)

Scatterplot


	plot(mydata$x, mydata$y)

Adjust various plotting parameters: http://www.statmethods.net/advgraphs/parameters.html

Add some labels


	plot(mydata$x, mydata$y, ylab="Y axis label", xlab="X axis label", 
	main="Main graph label", pch=15)

Save a high-resolution graph for publication purposes (journals often want a tiff image file submitted separately and not embedded in your manuscript)


	tiff("figure_1_example.tiff", width = 10000, 
	height = 10000, res = 1000)

	plot(mydata$x, mydata$y, ylab="Subjective Well-being", 
	xlab="Coffee Consumption", main=" ", pch=15, cex.lab=1.3)

	dev.off()

Subsetting Data or Selecting Cases

Subset of data for which condition is 0


	mydata[which(mydata$condition==0), ]

Find mean of y (third variable) for this subset


	mean(mydata[which(mydata$condition==0),3])

Create a new object to make it easier


	z = mydata[which(mydata$condition==0),]
	z$y
	mean(z$y)

Can have multiple conditions using logicals


	mydata[which(mydata$condition==0 & mydata$ID > 2), ]

Less ugly


	# select cases where condition = 0
	newData = subset(mydata, condition==0)
	mean(newData$y)
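
To try the two subsetting styles side by side without the workshop data file, here is a sketch on a small made-up data frame (toy and its values are invented for illustration):

```r
# Hypothetical data: 6 cases, two conditions
toy = data.frame(ID = 1:6,
                 condition = c(0, 1, 0, 1, 0, 1),
                 y = c(3, 5, 4, 6, 5, 7))

# Same selection written two ways: condition is 0 and ID is above 2
a = toy[which(toy$condition == 0 & toy$ID > 2), ]
b = subset(toy, condition == 0 & ID > 2)

all.equal(a, b)   # both approaches select the same rows
mean(b$y)         # 4.5
```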