Simulation Methods in R


Informal R Workshop
Day 2 - July 6, 2016

R. Chris Fraley - University of Illinois at Urbana-Champaign

Sample and Shuffle

The most basic form of random sampling involves the sample() function. It takes a vector and randomly reorders the values within it. The sample() function is useful for randomization tests, permutations, and gambling.


	x = 1:10
	sample(x)
	

Notice that, by default, sample() draws from the original vector without replacement. The example below samples with replacement, so some values may be drawn more than once (and others not at all).


	sample(x, replace=TRUE)
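
For instance, tabulating one such sample makes the duplicated values easy to see (just an illustration; the table() function is used again later in this handout):


	table(sample(x, replace=TRUE))	# some values drawn more than once, others not at all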

In the next example, we roll a die by taking the first element returned from a random shuffle of the vector 1:6.


	sample(1:6)[1]
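
Equivalently, you can ask sample() for a single value directly via its size argument:


	sample(1:6, 1)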

Distributions

A common form of simulation involves drawing cases or samples at random from distributions with known population parameters. For example, sampling a case from a normal distribution:


	rnorm(n = 1, mean=0, sd=1)

Each time we run that line, we sample one case from a normal distribution with a population mean of 0 and a population SD of 1. (Recall that we don't have to write the argument names; rnorm(1, 0, 1) achieves the same result.)

We can sample more than one case by adjusting the n argument:


	rnorm(n = 100, mean=0, sd=1)

And save a randomly sampled set of scores (1,000 this time) as an object:


	x1 = rnorm(n = 1000, mean=0, sd=1)

Now we can do fun things, like compute the sample mean and SD or create a histogram.


	mean(x1)
	sd(x1)
	hist(x1)

Do this a few times. From a statistical point of view, one thing you should notice is that, even when you sample from a population with known parameters, your sample statistics do not necessarily match those parameters. This is the well-known problem of sampling error, which is the problem that 95% of statistical tools are designed to address.

There are a number of statistical distributions built into R that can be used for sampling. Here are some examples:


	runif()	# sample from a uniform distribution
	rexp()	# sample from an exponential distribution
	rbinom()	# sample from binomial
	rpois()	# Poisson
	rt()	# t-distribution
	rbeta()	# beta distribution
	rchisq()	# chi-square

In addition to randomly sampling from these distributions (the "r" prefix in the function names above), you can obtain cumulative probabilities and densities for these curves by using "p" and "d" as the prefix (e.g., pnorm(), dnorm()).
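
For example, for the standard normal distribution:


	pnorm(1.96)	# cumulative probability of a value at or below 1.96 (about .975)
	dnorm(0)	# height (density) of the normal curve at 0 (about .40)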

Loops

A loop is a strategy for performing an operation a pre-specified number of times. Looping is a valuable way to perform simple or complex operations repeatedly.

Here is an illustration of loops to study the problem of the sampling error of the mean:


	n = 20	# set the sample size
	trials = 10000	# set the number of samples to use
	sample.mean = rep(NA, trials)		# a vector to hold the results

	for(i in 1:trials){
		sample.mean[i] = mean(rnorm(n,0,1))
	}

	mean(sample.mean)
	sd(sample.mean)	# compare to 1/sqrt(n)
	1/sqrt(n)

For loops - the structure, stated more generally


	for(i in begin:end){
		# stuff to do repeatedly
	}

There are other kinds of loops too (e.g., while loops). You can read more about those via web searches.
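
As a minimal sketch, a while loop repeats as long as its condition remains TRUE (the counting-down example here is just an illustration):


	i = 5
	while(i > 0){
		print(i)	# show the current value
		i = i - 1	# count down; the loop stops once i reaches 0
	}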

Loops and Statistical Power Explorations

Let's build an example to study statistical power. But, first, let me elaborate on a key theme: You can extract information from R objects that are the outputs of functions.

To illustrate, let's draw two samples from two populations and perform a t-test to test the hypothesis that they are drawn from a common population.


	sample.1 = rnorm(20, 10.0, 1)
	sample.2 = rnorm(20, 10.5, 1)
	results = t.test(sample.1, sample.2)
	results

Notice that the object results contains lots of information. We can see what things exist in that object using the str() function or the names() function.


	str(results)
	names(results)

There is a specific component called "p.value", and we can extract it by name using the $ operator.


	results$p.value

Because we can extract the p-value (and other important info) from this results object, we can, in theory, conduct thousands of t-tests and store the p-value for each one. This allows us to use simulation methods to study statistical power. Example:


	n = 20
	trials = 10000
	sample.p = rep(NA, trials)
	pop.mean.1 = 10.0
	pop.mean.2 = 10.5
	pop.sd.1 = 1
	pop.sd.2 = 1

	for(i in 1:trials){
		sample.1 = rnorm(n = n, mean=pop.mean.1, sd=pop.sd.1)
		sample.2 = rnorm(n = n, mean=pop.mean.2, sd=pop.sd.2)
		results = t.test(sample.1, sample.2)
		sample.p[i] = results$p.value
	}

Let's unpack some of our findings.


	mean(sample.p)
	sd(sample.p)
	hist(sample.p)

How many p-values were statistically significant?


		sample.p < .05

Let's tabulate those results


		table(sample.p < .05)
		table(sample.p < .05)/trials	# as proportions of trials

or


		sum(sample.p < .05)/trials

In this example we see that the power to detect d = .50 (an r of about .24) with 20 per cell is only about 34%. If the population difference is d = .50, this sample size will allow us to reach the correct conclusion (that the null is false) only about 34% of the time; we will reach the incorrect conclusion about 66% of the time. The research design is "dead in the water," as some might claim.

How many cases should we study if we want, say, 80% power? Plug in different n's and see what happens.
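
As a cross-check on the simulation approach, R's built-in power.t.test() function (not used elsewhere in this handout) gives the analytic answer; for d = .50 and 80% power it suggests roughly 64 per cell:


	power.t.test(delta = .5, sd = 1, sig.level = .05, power = .80)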

Tangent: Seeing Progress

I often like to know which iteration the script is on when running a large loop. Insert this code to get live feedback on the value of i in the for() loop. These lines can be placed just inside the loop, right after the opening brace of the for() statement.


	print(i)
	flush.console()

This prints the current value of i to the console window. Because R will often buffer the output of print() until the loop finishes, we use the flush.console() function to force the display to refresh on each iteration.
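
For example, placed in the power loop from earlier (a minimal sketch; the rest of the loop body is unchanged):


	for(i in 1:trials){
		print(i)	# show the current iteration
		flush.console()	# force the console to update now
		# ... the rest of the loop body goes here ...
	}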

Dance of the p-values

Let's look at the histogram of p-values again from the original example.


	hist(sample.p)

Notice how the p-values are all over the place? Most of them are on the lower end, as they should be, because the null hypothesis is, in fact, false in this simulation.

What happens when the null is true?


	n = 20
	trials = 10000
	sample.p = rep(NA, trials)
	pop.mean.1 = 10.0
	pop.mean.2 = 10.0
	pop.sd.1 = 1
	pop.sd.2 = 1

	for(i in 1:trials){
		sample.1 = rnorm(n = n, mean=pop.mean.1, sd=pop.sd.1)
		sample.2 = rnorm(n = n, mean=pop.mean.2, sd=pop.sd.2)
		results = t.test(sample.1, sample.2)
		sample.p[i] = results$p.value
	}

	hist(sample.p)
	table(sample.p < .05)/trials

First, notice that the Type I error rate is close to 5%, as it should be. Second, notice that the distribution of p-values is essentially uniform: when the null is true, p-values near .03 are just as likely as p-values near .30 or .90. The p-values are all over the place: when the null is true, obtaining a "highly significant" p-value means nothing special.
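
If you want a quick numerical check of the uniformity claim (just an aside; ks.test() is built into R), a Kolmogorov-Smirnov test compares the simulated p-values to a uniform distribution:


	ks.test(sample.p, "punif")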

This uniform-under-the-null behavior provides the foundation of p-curve analyses: using the distribution of p-values between 0 and .05 across multiple tests to assess whether the observed p-value distribution is more compatible with the null hypothesis (which implies a uniform distribution) or the alternative hypothesis (which implies a right-skewed distribution); see http://p-curve.com/. So-called p-hacking of an extreme type might be reflected in a left-skewed distribution.

We can plot the p-curves, which also gives us a chance to cover a few more plotting tricks.

Let's first rerun our simulations, saving sample.p as sample.p.true for the run in which the null hypothesis was true and as sample.p.false for the run in which the null was false.
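
For example, immediately after each run, copy the vector into a new object (plain assignment; the two names are just labels for the two runs):


	# after the run with equal population means (null true)
	sample.p.true = sample.p

	# after the run with different population means (null false)
	sample.p.false = sample.p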


	# Let's create bins for p-values
		slices = seq(0, .05, .01)

	# Let's find the number of significant results
	# across all samples

		total.significant.true = sum(sample.p.true <= .05)
		total.significant.false = sum(sample.p.false <= .05)

	# Let's create vectors to store our results
	# for both the cases where the null was true and false

		results.true = rep(NA, (length(slices)-1) )
		results.false = rep(NA, (length(slices)-1) )

	# Let's loop across the different p-value segments (e.g., 0 to .01)
	# and see how many cases fall within those bins.
	# We'll express that number as a percentage of the total number of
	# significant results in both cases.

		for(i in 1:(length(slices)-1)){
			results.true[i] = sum(sample.p.true > slices[i] &
				sample.p.true <= slices[i+1])/total.significant.true

			results.false[i] = sum(sample.p.false > slices[i] &
				sample.p.false <= slices[i+1])/total.significant.false
		}

And now our plots:


	# Modify the plotting space to place graphs in a 1 x 2 grid

		par(mfrow=c(1,2))

	# type="b"      plot both points and lines
	# axes=FALSE    suppress the default axes
	# axis(2)       add the y-axis
	# axis(1, ...)  add the x-axis and place labels at locations 1 to 5

	plot(results.true,ylim=c(0,1), type="b", main="p-curve when null is true", axes=FALSE, 
	ylab="Proportion of cases")
	axis(2)
	axis(1, at=c(1,2,3,4,5),labels = c(".00-.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05"))

	plot(results.false,ylim=c(0,1), type="b", main="p-curve when null is false", axes = FALSE, 
	ylab="Proportion of cases")
	axis(2)
	axis(1, at=c(1,2,3,4,5),labels = c(".00-.01", ".01-.02", ".02-.03", ".03-.04", ".04-.05"))
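
When you are done, you can reset the plotting window to a single panel:


	par(mfrow=c(1,1))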

Why are some editors/reviewers/critics skeptical of results based on p = .04 or p-values near the threshold of statistical significance?

Run the following script twice. The first time use a population difference of 0. The next time use a population difference of .5.


	n = 200
	trials = 10000
	sample.p = rep(NA, trials)
	pop.mean.1 = 10.0
	pop.mean.2 = 10.5	# swap this between 10.0 and 10.5
	pop.sd.1 = 1
	pop.sd.2 = 1

	for(i in 1:trials){
		sample.1 = rnorm(n = n, mean=pop.mean.1, sd=pop.sd.1)
		sample.2 = rnorm(n = n, mean=pop.mean.2, sd=pop.sd.2)
		results = t.test(sample.1, sample.2)
		sample.p[i] = results$p.value
	}

	hist(sample.p)
	table(sample.p >= .04 & sample.p < .05)/trials

The last line tallies the proportion of p-values that fall between .04 and .05. Notice that, in this example, p-values just under the significance threshold are more likely to occur when the null is true than when it is false!

Statistical Power and Interactions

Next we are going to explore an example by Uri Simonsohn and his colleagues (see http://datacolada.org/17) that was designed to consider power in the context of an interaction in a simple 2x2 design. The point they wanted to make is that many interaction tests in psychology follow this logic: first show that X matters for Y, then show that the effect goes away when we consider M. Thus, there is an interaction between M and X on Y: the effect exists when, say, M = 0 and does not exist when M = 1.

As Simonsohn and colleagues noted, the statistical power to test such hypotheses is often quite low. One of their take-home messages is that you need at least twice as many subjects per cell to show that the effect of X goes away when M is introduced as you need to show that the effect of X is present in the first place.

Let's see why.

We will simulate data for four cells in a 2x2 design. We will use the linear model to analyze the data. Here is the script we will use. We can configure the sample size (per cell), the number of trials to run/simulate, and the population means for each cell. To keep things simple, we will assume the population SD is 1.00 in each case; this is not a necessary assumption, of course. We are assuming in this example that the simple effect of X on Y is equal to Cohen's d = .50 when M = 0; the effect goes away (d = 0) when M = 1. These assumptions can be adjusted by modifying the values of m1 - m4 below.


	# Some basic parameters

		n = 30	# sample size per cell
		trials = 10000	# number of trials

	# Set the population means for each cell
	# X M          
		m1 = 10.5	# 0 0
		m2 = 11.0	# 0 1
		m3 = 11.0	# 1 0
		m4 = 11.0	# 1 1

	# Set up some vectors to hold the results
	
		p1 = rep(NA, trials)	# hold p-values for main effect of x
		p2 = rep(NA, trials)	# hold p-values for main effect of m
		p3 = rep(NA, trials)	# hold p-values for interaction
		psimple = rep(NA, trials)	# hold p-values for simple test

	# Create a loop to simulate the data trials times

	for(i in 1:trials){
		
		# Generate sample values for each cell using
		# pre-determined population means

		y1 = rnorm(n, m1)
		y2 = rnorm(n, m2)
		y3 = rnorm(n, m3)
		y4 = rnorm(n, m4)

		# Create a mini-data matrix for each cell
		# y value, 0 or 1 for X, 0 or 1 for M

		d1 = cbind(y1,rep(0,n),rep(0,n))
		d2 = cbind(y2,rep(0,n),rep(1,n))
		d3 = cbind(y3,rep(1,n),rep(0,n))
		d4 = cbind(y4,rep(1,n),rep(1,n))

		# (Row)bind all the mini-datasets together

		d = rbind(d1,d2,d3,d4)

		# Conduct regression analysis
		# Save results into object results

		results = summary(lm(d[,1] ~ d[,2]*d[,3]))
		
		# Pull out the p-value for each term in the analysis
		p1[i] = results$coef[2,4]
		p2[i] = results$coef[3,4]
		p3[i] = results$coef[4,4]

		# Simulate a simple analysis
		# We only analyze data where M = 0
		# datasets 1 and 3

		d.simple = rbind(d1,d3)
		result.simple = summary(lm(d.simple[,1] ~ d.simple[,2]))
		psimple[i] = result.simple$coef[2,4]
	}

	# Show some results

		sum(p1 < .05)/trials	# % of significant effects of X
		sum(p2 < .05)/trials	# % of significant effects of M
		sum(p3 < .05)/trials	# % of significant X*M 
		sum(psimple < .05)/trials	# % of significant simple effects of X (M = 0 only)

Try using different sample sizes. With n = 32, for example, there is a 50% chance of detecting the effect of X. (And a 28% chance of detecting the interaction.) You have to double the n (n = 64) to get the power to detect the interaction close to 50%. And, if you want 80% power to detect the interaction, you need n = 128 per cell (N = 512).

There are two useful take-home messages from this:

  1. If the research strategy is to show that an effect of X is eliminated by some other factor (e.g., context, manipulation, trait), one needs to have at least twice the sample size to demonstrate this with the same power as one demonstrated the simple effect. Even larger samples are necessary for evaluating a mere reduction in effect.
  2. Most research on interactions in psychology is designed without sufficient power.