Using R to Mine Data from Twitter: An Introduction for Psychological Scientists


Informal R Workshop
Day 5 - Aug 10, 2016

R. Chris Fraley - University of Illinois at Urbana-Champaign

Why Study Twitter Data?

In contemporary culture many people use internet-based technologies to share their thoughts, feelings, and activities. As such, there is a growing interest among psychological scientists in mining these data and using them to study human behavior. Doing so allows us to gauge the pulse of political discourse across the world, study social network structure among people and organizations, and follow real people's lives as they live them.

The purpose of this workshop is to show you how to mine data from Twitter--a popular social media application. Until recently, this kind of work was done only by people with extensive programming experience (people who are unlikely to be psychologists). Fortunately, there are several tools available that make it possible for your typical R-using researcher in psychology to do many creative, interesting, and useful things without a degree in information technology.

In this session I'll show you what you need to do to get started (setting up an authenticated session with Twitter, obtaining the necessary libraries). I'll walk you through two examples that shamelessly capitalize on the fact that we're approaching election season in the US. I'll also show you how to build a database of tweets (small data becomes big data) and automate many aspects of the data mining process.

It is important to note from the outset that this tutorial is not designed to be comprehensive. It is designed to give you some basic skills, build your competence, and give you a taste for what can be done. Let's begin.

Getting Started: Libraries to Install

We will be using two libraries for the bulk of our examples: twitteR and RCurl. twitteR is an amazing package developed by Jeff Gentry that enables R to interface fluidly with the Twitter API. RCurl is a library that facilitates retrieving information from web servers using a variety of protocols.

You can install these packages as follows:


	install.packages(c("twitteR", "RCurl"))

Be sure to load them at the beginning of your R session. All the examples we use below assume these libraries are active.

	library(twitteR)
	library(RCurl)

Create a Twitter Account and Obtain API Credentials

One reason it is possible to scrape data from Twitter reasonably easily is that Twitter has made available an application programming interface (API) that facilitates the process. Using this API, it is possible to extract any tweet that has been posted publicly within the past 5 to 9 days.

The Twitter API allows one to search these tweets in a number of ways. For example, one can retrieve all tweets by a specific user or search for tweets containing specific keywords or hashtags (e.g., #Trump). In addition, it is possible to extract a wide array of data, including the number of times a tweet has been favorited, whether the tweet was a retweet, and who the person is following (and who is following that person).

To use this interface, you must have a valid Twitter account. If you do not already have one, you can create one by visiting the Twitter website. Even if you don't intend to use Twitter for purposes other than data mining, you'll need an account. You will also need to enter your mobile number for verification. You will not be able to extract data using the Twitter API without providing this information.

Once you have a valid Twitter account, go to apps.twitter.com--a special site for developers who wish to interface with Twitter's services and data. (You might be prompted to log in again.)

Create New App

If this is your first time to the site, you should see a message like the one below, "You don't currently have any Twitter Apps." You can create one by clicking the "Create New App" button.

figure01

To create an app, simply provide a name for the app and a brief description of it. The details don't matter too much for our purposes; they matter more if your app is meant to do something beyond mining data (such as interacting with users or publishing tweets via a different interface). The name, however, must be unique; you'll get an error if the name has been taken already.

figure02

Modify App Permissions

Optional. Once your application has been created successfully, choose the Permissions tab and select the option for "Read, Write, and Access direct messages" and then save the changes by pressing the Update Settings button. This will give you the greatest flexibility for later projects.

figure03

Create my Access Token

Finally, click on the Keys and Access Tokens tab. Choose "Create my access token" at the bottom of the page. Doing this will give us information that we will need anytime we attempt to establish a connection to the Twitter API: the Access Token and the Access Token Secret. We will also need the Consumer Key (API Key) and the Consumer Secret (API Secret), which you should also be able to see on the page.

figure04

Please copy-and-paste these four pieces of information to a location on your computer where no one will have access to them. Without this information, you will not be able to interface with the Twitter API. And, importantly, if someone else gets access to these codes, they'll be able to pretend to be you. Keep these codes secret; keep them safe. (The screenshot above shows fake information.)

Authorizing Your Session in R

To use the twitteR functions in R, you will need to "authorize" your connection to the Twitter API service. This allows you to connect to the Twitter database as a valid, authenticated user. Thus, at least once at the beginning of your session, you will need to run the following script (but using your credentials rather than my fake ones):


# Define Parameters for Authorization (handshake)
# Conduct authorization
# Please note that I've hidden my codes for obvious reasons. Please replace
# the fields below with your information.

 consumer_key 	<- 'xxxDq6Gm'
 consumer_secret<- 'xxxoC2Fc'
 access_token 	<- 'xxxIHu47'
 access_secret 	<- 'xxxpCP2W'

 setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

For my own convenience, I keep a simple script file available that I run at the beginning of any session in which I intend to mine data from Twitter. The script contains the commands to load the libraries of interest (see above) and authorize my connection to Twitter (using the lines immediately above).

Basic Data Mining on Twitter Using the searchTwitter() Function

One of the useful functions available in the twitteR package is the searchTwitter() function. This function is designed to search the recent set of tweets that Twitter has made available. You can use this function to search tweets for key words or to retrieve tweets from specific users. Some examples of the ways the function can be used are highlighted in the table below. In each of these examples, I'm running the searchTwitter function and saving the results of that function to an object called 'tweets'.

Code Example and Explanation

  tweets = searchTwitter("trump", n=100, lang="en")
      Search for all tweets containing the term 'trump'. The search is not case sensitive.

  tweets = searchTwitter("trump OR hillary", n=100, lang="en")
      Search for all tweets containing the terms 'trump' or 'hillary'. (The OR must be in all caps.)

  tweets = searchTwitter("trump hillary", n=100, lang="en")
      Search for all tweets containing both the terms 'trump' and 'hillary'.

  tweets = searchTwitter(" \"coffee can\" ", n=100, lang="en")
      Search for the exact phrase "coffee can". (Ignores punctuation.)

  tweets = searchTwitter("trump -hillary", n=100, lang="en")
      Search for tweets containing the term 'trump' but NOT also containing the term 'hillary'.

  tweets = searchTwitter("#trump", n=100, lang="en")
      Search for tweets containing the hashtag #trump.

  tweets = searchTwitter("to:RealDonaldTrump", n=100, lang="en")
      Search for all tweets sent to the screen name 'RealDonaldTrump'.

  tweets = searchTwitter("from:RealDonaldTrump", n=100, lang="en")
      Search for all tweets sent from the screen name 'RealDonaldTrump'.

  tweets = searchTwitter("@RealDonaldTrump", n=100, lang="en")
      Search for all tweets that reference the screen name '@RealDonaldTrump' using the @ tag.

Notice that the results of the query are stored as a list in R. You can see this more clearly by using str(tweets) to see the structure of the tweets object.


> tweets = searchTwitter("hillary", n=1, lang="en")
> str(tweets)
List of 1
 $ :Reference class 'status' [package "twitteR"] with 17 fields
  ..$ text         : chr "RT @stormqvist: Her freeze up & help from \"aid\" almost unnerves me. The more times I watch the creepier it gets @Cernovic"| __truncated__
  ..$ favorited    : logi FALSE
  ..$ favoriteCount: num 0
  ..$ replyToSN    : chr(0) 
  ..$ created      : POSIXct[1:1], format: "2016-08-06 21:30:39"
  ..$ truncated    : logi FALSE
  ..$ replyToSID   : chr(0) 
  ..$ id           : chr "762038201506787328"
  ..$ replyToUID   : chr(0) 
  ..$ statusSource : chr "Mobile Web (M5)"
  ..$ screenName   : chr "HiPatrickH"
  ..$ retweetCount : num 104
  ..$ isRetweet    : logi TRUE
  ..$ retweeted    : logi FALSE
  ..$ longitude    : chr(0) 
  ..$ latitude     : chr(0) 
  ..$ urls         :'data.frame':       1 obs. of  5 variables:
  .. ..$ url         : chr "https://t.co/HXlepD2JVE"
  .. ..$ expanded_url: chr "http://www.dangerandplay.com/2016/08/06/hillary-clinton-stroke-seizure-coughing-fits/"
  .. ..$ display_url : chr "dangerandplay.com/2016/08/06/hil...""| __truncated__
  .. ..$ start_index : num 129
  .. ..$ stop_index  : num 144
  ..and 53 methods, of which 39 are  possibly relevant:
  ..  getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet, getLatitude, getLongitude, getReplyToSID, getReplyToSN,
  ..  getReplyToUID, getRetweetCount, getRetweeted, getRetweeters, getRetweets, getScreenName, getStatusSource, getText,
  ..  getTruncated, getUrls, initialize, setCreated, setFavoriteCount, setFavorited, setId, setIsRetweet, setLatitude,
  ..  setLongitude, setReplyToSID, setReplyToSN, setReplyToUID, setRetweetCount, setRetweeted, setScreenName, setStatusSource,
  ..  setText, setTruncated, setUrls, toDataFrame, toDataFrame#twitterObj


Given that we have not discussed lists up to this point, let us bypass that discussion for now and transform all the retrieved information to a data frame. There is a function dedicated to doing just that in the twitteR package called twListToDF(). Transforming our list output to a dataframe will make it easier for us to organize the results of interest.

Here is an example in which we retrieve the 5 most recent tweets from HillaryClinton and save those results as a dataframe called 'tweets.df':


tweets = searchTwitter("from:HillaryClinton", n=5, lang="en")
tweets.df = twListToDF(tweets)

Here is an example of what the contents of tweets.df may look like:

> tweets.df
                                                                                                                                    text
1 Why the @HoustonChron is urging voters to support Hillary in the "starkest political choice in living memory": https://t.co/phjf19ybXr
2                                               Why do Trump's foreign policy ideas read like a Putin wish list? https://t.co/IrJqXq0Kfg
3                                                                                          Four years ago today. https://t.co/WPM1I9PPN3
4                                                            ... so yeah, it's been quite a week for Donald Trump. https://t.co/UsexFeJjtl
5                                         RT @HFA: Here's the story behind our campaign's first braille button: https://t.co/xbiY2gaPB5.
  favorited favoriteCount replyToSN             created truncated replyToSID                 id replyToUID
1     FALSE          1285        NA 2016-08-06 20:59:02     FALSE         NA 762030244584890368         NA
2     FALSE          1987        NA 2016-08-06 20:09:34     FALSE         NA 762017792921051136         NA
3     FALSE          8678        NA 2016-08-06 19:12:34     FALSE         NA 762003451584745472         NA
4     FALSE          6129        NA 2016-08-06 18:59:46     FALSE         NA 762000226811011072         NA
5     FALSE             0        NA 2016-08-06 18:41:07     FALSE         NA 761995535389691904         NA
                                                                         statusSource     screenName retweetCount isRetweet
1 TweetDeck HillaryClinton          525     FALSE
2 TweetDeck HillaryClinton         1000     FALSE
3                  Twitter Web Client HillaryClinton         5489     FALSE
4 TweetDeck HillaryClinton         3634     FALSE
5 TweetDeck HillaryClinton          650      TRUE
  retweeted longitude latitude
1     FALSE        NA       NA
2     FALSE        NA       NA
3     FALSE        NA       NA
4     FALSE        NA       NA
5     FALSE        NA       NA


Depending on our objectives, we probably only want portions of the data returned to us. For example, we may only want the content of the tweet. Or we might only want to know how many times a tweet (or a set of tweets) was favorited.

Because the results have been saved in a dataframe, each tweet is represented in a row and the various pieces of information that are available for that tweet are represented as columns/variables. We can thus reference each piece (or multiple pieces) in the same ways that we have previously when using dataframes.

For example, if we want to see the tweets (i.e., the text) by itself, we can reference it as tweets.df$text. If we wish to see a vector of all the times the various tweets were favorited, we can do so as tweets.df$favoriteCount.
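
For instance, here is a small sketch of how you might pull out a few of these pieces, assuming tweets.df is the dataframe we created above:


  head(tweets.df$text)            # the text of the first few tweets
  tweets.df$favoriteCount         # favorite counts for all retrieved tweets
  tweets.df[, c("screenName", "created", "retweetCount")]   # several columns at once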

The table below briefly outlines the information that is available. Again, if the returned data is stored as a dataframe, any part of it can be referenced using these terms.

Attribute of Object and Explanation

  text: This is the content of the tweet.
  favorited: (This appears to be deprecated.) TRUE or FALSE.
  favoriteCount: How many times has the tweet been favorited?
  replyToSN: Was this tweet a reply to another account? If so, this is the screen name.
  created: The date/time the tweet was posted, in POSIXct format. Example: 2016-08-06 20:10:50
  truncated: Was the tweet truncated? TRUE or FALSE.
  replyToSID: If this tweet was a reply to another tweet, this contains the ID for the tweet in question.
  id: The numeric ID for the tweet. Every tweet should have a unique ID.
  replyToUID: If the tweet was a reply to another tweet, this contains the ID for the account in question.
  statusSource: From what source did the tweet originate? Examples: iPhone, Android.
  screenName: The user's screen name.
  retweetCount: The number of times the tweet has been retweeted.
  isRetweet: Is the tweet a retweet? TRUE or FALSE.
  retweeted: (This appears to be deprecated.) TRUE or FALSE.
  longitude: Longitude of the origin of the tweet, if the user has geocoordinates enabled.
  latitude: Latitude of the origin of the tweet, if the user has geocoordinates enabled.

An Example

Q: Are Donald Trump's tweets more likely to be liked (favorited) if they mention Hillary than if they do not?

Let's begin by obtaining a sample of recent tweets from Donald Trump. We will do two searches. The first will include the search term 'hillary'. The second will exclude tweets that contain the term 'hillary'.

(Please note that this is merely an example to illustrate the use of R's twitter functions. If you wanted to do a careful study on this matter, you would need to account for the fact that Trump often refers to Hillary as CrookedHillary, which doesn't match the search string 'hillary' exactly.)

After running each search, we'll extract the information to a dataframe.


tweets.with = searchTwitter("from:RealDonaldTrump+hillary", n=500, lang="en")
tweets.with = twListToDF(tweets.with)
tweets.with$text

tweets.without = searchTwitter("-Hillary from:RealDonaldTrump", n=500, lang="en")
tweets.without = twListToDF(tweets.without)
tweets.without$text

(Note: Although I requested 500 of each, only 35 (with) and 88 (without) were available.) We can see the favorite counts for each tweet as follows:


tweets.with$favoriteCount
tweets.without$favoriteCount

Notice that some of the tweets have favorite counts of zero. These are tweets that Trump has retweeted. (I think that favoriting a retweet credits the original tweet rather than the retweeter.) Regardless, the 0's are clearly outliers in these distributions and don't merit inclusion in our analyses for substantive reasons. Let's treat the 0's as missing values. (Alternatively, we could filter out retweets when building our dataframes; see the sketch after the next code block.)


tweets.with$favoriteCount[tweets.with$favoriteCount==0]=NA
tweets.without$favoriteCount[tweets.without$favoriteCount==0]=NA
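
If you prefer the filtering approach mentioned above, here is a minimal sketch that simply drops retweets using the isRetweet column (run this on the dataframes before recoding zeros):


  # A sketch of the alternative: drop retweets entirely rather than recoding 0's as NA
  tweets.with = tweets.with[tweets.with$isRetweet == FALSE, ]
  tweets.without = tweets.without[tweets.without$isRetweet == FALSE, ]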

We can conduct a simple t-test to compare the favorite counts of tweets with and without 'hillary' to see which tweets are more popular. (It would be preferable to use a Poisson regression model for count data. But let's stick with what we know for this example.)


t.test(tweets.with$favoriteCount, tweets.without$favoriteCount)

When I ran this on 8/6/2016, Trump's tweets containing references to 'hillary' were more liked (M = 34245.21, SD = 10080.92) than his tweets that did not contain references to 'hillary' (M = 25988.84, SD = 11462.98)(t[80.9] = 4.001, p < .001, d = .76).
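
If you're curious about the count-model alternative mentioned above, here is a rough sketch using R's built-in glm() function. Treat it as illustrative only (among other things, it ignores the overdispersion that is common in count data like these):


  # A rough sketch of the Poisson regression alternative mentioned above.
  # Stack the favorite counts and code whether each tweet mentions 'hillary'.
  favs = c(tweets.with$favoriteCount, tweets.without$favoriteCount)
  mentions.hillary = c(rep(1, nrow(tweets.with)), rep(0, nrow(tweets.without)))
  summary(glm(favs ~ mentions.hillary, family = poisson))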

A Second Example: Simple Text Analysis

Q: Do Trump's and Hillary's tweets differ in the way they use singular (e.g., "I") vs. plural (e.g., "Us") pronouns?

In the next example we will illustrate some simple text mining and analysis methods. We will define two dictionaries that contain either singular first-person pronouns (e.g., I, me) or plural first-person pronouns (e.g., we, us). These dictionaries are not intended to be inclusive or even representative of the linguistic space; they are merely examples of how the method can be used.

After scraping tweets for Donald Trump and Hillary Clinton, we will cycle through each set of tweets and count the number of tweets that contain words in those dictionaries. We will only note in a yes/no way whether any of the words were present in a tweet; tweets with multiple singular first-person pronouns, for example, do not count more than tweets with one such reference.

The final part of the script simply expresses the proportion of tweets containing each word type for each individual (relative to his or her total number of tweets).

Note: Instead of using the searchTwitter() function, this time I will use a function called userTimeline() which is designed to extract the tweets from a specific user's timeline. This is valuable for our purposes because we are not limited to recent tweets; we can, in principle, extract many more tweets using this method.


# Scrape Trump's recent tweets
  tweets.d = userTimeline('realdonaldtrump', n=500)
  tweets.d <- twListToDF(tweets.d)
  tweets.d = tweets.d[tweets.d[,"isRetweet"]==FALSE,]	# Remove retweets

# Scrape Hillary's recent tweets
  tweets.h = userTimeline('HillaryClinton', n=500)
  tweets.h <- twListToDF(tweets.h)
  tweets.h = tweets.h[tweets.h[,"isRetweet"]==FALSE,]	# Remove retweets

# Create 2 simple dictionaries
  dictionary1 = c(" I "," me "," mine ", " my ")
  dictionary2 = c(" we "," our "," ours ", " us ")

# Initialize some vectors to save our results
# we will use the suffix d for Donald and h for Hillary
  singular.d = rep(0,dim(tweets.d)[1])
  plural.d = rep(0,dim(tweets.d)[1])

  singular.h = rep(0,dim(tweets.h)[1])
  plural.h = rep(0,dim(tweets.h)[1])


# Loop through Trump's tweets and count dictionary word occurrences

  for(i in 1:dim(tweets.d)[1]){
    this.tweet = tweets.d[i,1]

    for(j in 1:length(dictionary1)){
      result = grepl(dictionary1[j], this.tweet, ignore.case=TRUE)
	if(result == TRUE){
	   singular.d[i] = 1
	}
    }
 
    for(j in 1:length(dictionary2)){
      result = grepl(dictionary2[j], this.tweet, ignore.case=TRUE)
	if(result == TRUE){
	  plural.d[i] = 1
	}
    }
  }

# Loop through Hillary's tweets and count dictionary word occurrences

  for(i in 1:dim(tweets.h)[1]){
    this.tweet = tweets.h[i,1]

    for(j in 1:length(dictionary1)){
      result = grepl(dictionary1[j], this.tweet, ignore.case=TRUE)
	if(result == TRUE){
	   singular.h[i] = 1
	}
    }
 
    for(j in 1:length(dictionary2)){
      result = grepl(dictionary2[j], this.tweet, ignore.case=TRUE)
	if(result == TRUE){
	  plural.h[i] = 1
	}
    }
  }

# Results

  dim(tweets.d); dim(tweets.h)

  sum(singular.d)/dim(tweets.d)[1] # proportion of Trump's tweets with singular pronouns
  sum(singular.h)/dim(tweets.h)[1] # proportion of Hillary's tweets with singular pronouns

  sum(plural.d)/dim(tweets.d)[1] # proportion of Trump's tweets with plural pronouns
  sum(plural.h)/dim(tweets.h)[1] # proportion of Hillary's tweets with plural pronouns


The results of such an analysis have the potential to vary, depending on when you're doing the sampling and how many tweets you're able to obtain. When I ran this on Aug 8, 2016, the function extracted 453 tweets from Trump and 222 tweets from Hillary. Using those tweets, Trump had a higher proportion of singular first-person pronouns in his tweets than Hillary (20% vs. 11%). Hillary had a higher proportion of plural first-person pronouns in her tweets than Trump (30% vs. 14%). See if you observe the same pattern.

Pattern Matching

Let me elaborate on some of the new functions we used above. We used the grepl() function to search a text string and determine whether it contains a pattern of interest. There are a large number of such functions in R; I encourage interested readers to explore them. The grepl() function takes as inputs the target pattern (e.g., "worried"), the text to be searched (e.g., this.tweet), and some optional arguments (e.g., ignore.case=TRUE).

Notice that the singular dictionary included spaces surrounding the words (e.g., " I ", " me "). That was done to encourage R to search for the space-i-space pattern rather than the occurrence of the letter i itself. The letter i, as you might imagine, occurs in a large proportion of tweets. Similarly, 'me' can occur frequently too because it is contained in many words, such as means and limes.
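
Note that padding the words with spaces will miss matches at the very beginning or end of a tweet or next to punctuation (e.g., "Follow me!"). A somewhat more robust option, sketched below, is to use word boundaries (\\b) in a regular expression:


  # A sketch of a more robust pattern: \\b marks a word boundary, so the match
  # no longer depends on literal spaces surrounding the word.
  grepl("\\bme\\b", "Follow me!", ignore.case=TRUE)                  # TRUE
  grepl("\\bme\\b", "The meeting starts at noon", ignore.case=TRUE)  # FALSE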

Extensions

This is clearly a simplified example. One of the challenges for most researchers is mining the data from Twitter in the first place. Once we have the data, of course, there are a number of different tools that can be used to do text and linguistic analyses. LIWC, for example, is a commonly used tool in social and personality psychology for text-based analysis. There are also a variety of packages available for text mining in R, including tm (there is a good tutorial on using tm here). Of course, you can also code the text manually--a long-standing tradition in psychology. But, with hundreds of thousands of tweets, that might not be possible.

A Third Example: Using Geocoordinates

Some people have their Twitter accounts configured such that, when they post a tweet, the latitude and longitude of the location from which they tweeted are recorded. When this is the case, you'll see the information contained in the $latitude and $longitude columns of your tweet dataframe. Although few people elect to use this option, many people do indicate their location (e.g., City, State) as part of their profile information. In both of these cases, it is possible to selectively extract tweets that are tied to that location.

To do so, simply include the latitude and longitude coordinates as part of the geocode="latitude,longitude,radius" argument in your search statement. In the example below, I'm searching for tweets that have come from within a 10-mile radius of the Psychology Department at UIUC.


tweets.champaign = searchTwitter("pikachu", n=100, lang="en", geocode="40.1164204,-88.24338290000003,10mi")
tweets.champaign = twListToDF(tweets.champaign)

If you want to know the latitude and longitude of various locations throughout the world, you can easily obtain them via www.gps-coordinates.net/.

By mining Twitter across multiple pre-defined locations, it is possible to examine whether certain themes are more salient in some cities than others. This opens the doors to doing lots of cool work on how attitudes and psychological processes may vary geographically.

A Fourth Example: Getting Friends and Followers

Each person with a Twitter account can elect to follow other people (called "friends"). By doing so, the user can see the tweets that his or her friends are posting in his or her timeline/news feed. Unlike Facebook, these friendships do not need to be mutual. Thus, I can elect to be friends with Brent Roberts (and see his tweets), but he might not elect to be friends with me (and, thus, he won't automatically see my tweets). In Twitter terms, I'm friends with Brent. And, from his perspective, I'm one of his followers (but not one of his friends).

Thus, each account has "friends" (people the user is following) and "followers" (people who are following that account). We can extract information about both the friends and followers of an account using the following methods:


user = getUser('brentwroberts')
friends		= user$getFriends() 
followers	= user$getFollowers() 

In this example, I have extracted information about the user BrentWRoberts and saved it to an object. I've then applied two methods/functions to that object. The first, getFriends(), allows me to extract information about Brent's friends (i.e., the people he follows). The second, getFollowers(), allows me to extract information about Brent's followers (i.e., the people who follow him).

This information is returned as a list. Personally, I find it helpful to work with data frames instead of lists, so we can transform the information to a data frame using the twListToDF() function:


friends.df = twListToDF(friends)
followers.df = twListToDF(followers)

By inspecting these dataframes, we can see that there is a lot of information returned about Brent's friends and followers. We can see, for example, how many friends and followers those friends and followers have.
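
For example, here is a small sketch; the column names are the ones twListToDF() returned in the version of twitteR I was using, so check names(friends.df) against your own output:


  names(friends.df)                    # see which columns are available
  summary(friends.df$followersCount)   # how many followers Brent's friends have
  summary(friends.df$friendsCount)     # how many accounts those friends follow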

As you might expect, this opens the doors to an endless number of research questions that involve social status, network structure, etc. A recent example that I thought was pretty cool was published by Ritter, Preston, and Hernandez (2014). They examined people who followed five Christian public figures and five atheist public figures. By examining the tweets of these two groups of followers, they discovered that the Christians tended to be happier than the atheists. That is, their tweets contained more positive-affect themes.

Here is another example. In this example we are going to study the social network structure of Brent Roberts's friends. This has the potential to be computationally expensive. So, in this example, we will focus only on a subset (6) of Brent's friends, selected at random.

Our goal is to create a friendship matrix that indicates whether each of Brent's friends is also friends with each of the others. A '0' will be used to denote cases where a friendship tie does not exist (Friend R is not friends with Friend C). A '1' will be used to denote cases where a friendship tie does exist (Friend R is also friends with Friend C). Please note that this matrix is not necessarily symmetric because Friend R can follow Friend C even if Friend C does not follow Friend R.


user = getUser('brentwroberts')
friends = twListToDF(user$getFriends()) 
followers = twListToDF(user$getFollowers()) 

k = 6		# number of target's friends to study
n = dim(friends)[1]		# how many friends total?
x = sample(1:n)[1:k]		# sample k friends randomly 

# Create a new data frame that only contains the k friends
  friends.df = friends[x,]

# Initialize a matrix that will contain connections among target's friends
# 0 will mean row not friend with column
# 1 will mean row is friend with column
  friend.mat = matrix(0,k,k)

# Loop through target's k friends
for(i in 1:k){

 # get target friend's screenName
   this.friend = friends.df$screenName[i]

 # Construct a search term. Let's grab the user info for this particular friend
   search.term = paste("'",this.friend,"'",sep="")
   eval(parse(text = paste("this.user = getUser( ",search.term," )  ")))

 # Get the screenNames of this person's friends (i.e., friends of friend)
   this.friends = twListToDF(this.user$getFriends())$screenName

  # Do the target's friends appear in the list of friends for this particular friend?
  # 0 = no, 1 = yes
    for(j in 1:k){
      z = sum(friends.df$screenName[j] == this.friends)
      friend.mat[i,j] = z
    }
 }

# label rows and columns of connection matrix
rownames(friend.mat)=colnames(friend.mat)=friends.df$screenName
friend.mat

Once one has a matrix of this type, it is possible to perform a large number of social network style analyses with it. Here is an example output matrix. I'll admit that I rigged this one with the intention of trying to identify a few cognitive folks (Simons, Kane, and Chabris) and a few personality/ind. diffs. folks (Condon, Penke, and Drob). As you can see, everyone is friends with each other within those subgroups. But there are a few connections across guilds too. There are a variety of social network analytic tools that could be used to reveal and examine these kinds of clusters.


               profsimons Kane_WMC_Lab cfchabris DMCpersonality LarsPenke tuckerdrob
profsimons              0            1         1              0         0          0
Kane_WMC_Lab            1            0         1              0         1          1
cfchabris               1            1         0              0         1          0
DMCpersonality          0            0         0              0         1          1
LarsPenke               0            0         1              1         0          1
tuckerdrob              1            1         1              1         1          0
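
If you want a quick look at the structure of such a matrix, here is a minimal sketch using the igraph package. (This is just an illustration; igraph is not one of the libraries required elsewhere in this tutorial, so install it first if you want to try it.)


  # A minimal sketch: turn the adjacency matrix into a directed graph and plot it
  # (assumes the igraph package has been installed)
  library(igraph)
  g = graph_from_adjacency_matrix(friend.mat, mode="directed")
  plot(g, vertex.label=rownames(friend.mat), edge.arrow.size=0.3)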

Automating Scraping using Task Scheduler in Windows

One of the limitations of the Twitter API is that it only makes tweets from the past 5 to 8 days available. This is a sensible decision on Twitter's part, of course. Big Data can be BIG, and if Twitter allowed anyone with an account to traverse its massive datasets, it would have to absorb the computational expense. Unfortunately for us, this limits the extent to which we can ask questions of historical significance using Twitter. If we want to know, for example, how in-group/out-group language changes as a function of terrorist attacks, we cannot easily do so.

We can work around this limitation, however, if we define the questions ahead of time and collect data prospectively on Twitter. This accomplishes two things. First, it allows us to build a larger body of tweets for analysis. And, second, it allows us to study specific issues, brands, or individuals across time.

One way to do this is to simply run your tweet scraping scripts once a week or once a day--depending on your objectives. I'll show you below how to append the results of your scrapes to a common datafile. This will make it easier to compile lots of data that can then be analyzed at a later date.

Another option is to automate your R scripts so that your computer runs them without you having to manually do so. Again, those results can be automatically appended to a common datafile for later analysis.

I'll review both of these tricks below. Please note that my examples assume you're using a Windows environment. All of this should translate to macOS easily enough, but you might have to use Apple-specific tools instead. For an overview of various ways of scheduling tasks on a Mac, please see this link.

Appending New Scrapes to an Existing File

We are going to use a few tricks to pull this off.

Please note that you can use "real" databases for the purpose of storing data mined from Twitter. For example, the twitteR package is designed to interface with a database program called SQLite. However, I'm going to show you a more primitive approach--saving the mined data to a text file. Hopefully this will make the process easier to use, even if it isn't the most elegant solution from a tech perspective.

Conducting a Time-Limited Search

Our goal is to define our Twitter search so that whatever data we are trying to scrape (e.g., tweets from Hillary) are limited to a specific time frame. In our example we will assume that our intent is to scrape Twitter once a day. As such, we will constrain our search to tweets from the past day.

To do this, we need a simple way to get the current date from R. Fortunately, R has a number of functions that make working with date/time objects easy. The function Sys.Date() will return the current date (local/computer time) in the YYYY-MM-DD format that we need.


d = Sys.Date()
d

R recognizes this object as belonging to the Date class. That means we can perform simple operations on it (such as subtracting a day) without worrying about complications like crossing the first of the month or the first of the year.


d.since = d-1
d.until = d

d.since
d.until

This allows us to construct new search arguments that limit our search to yesterday by using our new variables, d.since and d.until, to constrain the search.

Unfortunately, we can't simply stick d.since and d.until into our search function because searchTwitter() wants the dates to be enclosed in quotes. It can be a bit tricky to enclose a variable in quotes because the quotes imply that the text is to be taken literally. There is a non-intuitive solution to this problem. Namely, we can paste together a new variable, search.params, that contains the information we want to pass on to the searchTwitter() function. By creating this string separately, we can ensure that variable values, like d.since, are substituted into the string rather than being treated literally. We can then pass that set of search parameters to the searchTwitter() function using the eval() and parse() functions. Here is an example. It might not be intuitive (it took me a while to put it together and figure out how to make it work), but it should be simple enough to configure for any number of purposes.


search.params = paste('n=1000', ',', ' lang=', '"', 'en', '"', ', ', ' since=', '"' ,d.since, '"', ',' , ' until=', '"', d.until,'"'   ,sep="")
search.terms = "coffee"
tweets = eval(parse(text = paste("searchTwitter(search.terms,", search.params, ")")))
tweets.df = twListToDF(tweets)

In sum, we have created a search string that searches Twitter's recently available data for a specific term (i.e., coffee) on a specific date. The result is equivalent to running something like tweets = searchTwitter("coffee", n=1000, lang="en", since="2016-08-06", until="2016-08-07"), but with the dates generated dynamically--at the day/time at which the command is run. We have also placed the results in a dataframe. Next, let's discuss how to write that data frame to a local text file.
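
As an aside, depending on your version of twitteR, you may be able to skip the eval()/parse() step by converting the Date objects to "YYYY-MM-DD" strings yourself and passing them directly. Here is a sketch of that alternative:


  # A sketch of a simpler alternative that may work with your version of twitteR:
  # convert the Date objects to character strings and pass them directly.
  tweets = searchTwitter("coffee", n=1000, lang="en",
                         since=as.character(d.since), until=as.character(d.until))
  tweets.df = twListToDF(tweets)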

Writing the Search Results to a Local Text File

Writing a data frame to a text file is trivial in R. Before we do it, however, let's clean and reorganize the output a bit. We will do two things: First, we will remove line breaks from the tweets so we don't have unexpected line breaks in our data file. Second, we will add a new column to our dataframe that represents the date the scraping was done.

The following code uses the gsub() function, which is designed to search for a specific pattern and replace it with a new one. In this case, we are instructing it to search the first column of tweets.df (the tweets themselves) for line breaks (represented as \r or \n) and to replace those with the word RETURN. (We could simply remove them by using " " instead--replacing them with white space. But I like the idea of having an easy way to recreate them later if we need to. Replacing them with the term RETURN gives us an easy way to search and replace at a later date.) We will also replace commas with the word COMMA for similar reasons. In this case, we need to remove the commas because we want to save the data as a comma-delimited text file and need to reserve the comma for that specific function.


tweets.df[,1] = gsub("[\r\n]", " RETURN ", tweets.df[,1])

tweets.df[,1] = gsub("[,]", " COMMA ", tweets.df[,1])

Next, let's add a new column to our dataframe that contains the date/time of our scrape. We will use the Sys.time() function, which returns a date/time string that looks something like this: "2016-08-07 09:30:21 CDT". We can append it to the dataframe using the tricks we discussed before. Because there are multiple rows in our dataframe but a single date/time, the same value is used for each row.

tweets.df$scrape = Sys.time()

Now, let's write the data frame to a text file.

logFile = "C:\\Users\\Chris\\Documents\\R_Twitter\\cron_example_output.txt"
write.table(tweets.df, file=logFile, append=TRUE, sep=",", col.names=FALSE)

A few things to note about this code. First, we are specifying the path on our local computer to which the data should be saved. This should be a text file. The text file needs to exist before you run this the first time. The easiest way to do this is to open an empty file and, without putting anything inside it, saving it as a text file with a name of your choosing. An alternative is to create an empty file and place the variable labels/headers in the top row. That ensures that (a) the file exists so R can append data to it and (b) the file will be easier to import later because it will have headers. You can download an example file here that is good to go; just name it accordingly for your purposes.
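
If you prefer to set this up from R rather than by hand, here is a minimal sketch that creates the (empty) output file the first time the script runs. The path is just the example path used above; change it for your machine.


  # Create the output file if it does not already exist (path is an example)
  logFile = "C:\\Users\\Chris\\Documents\\R_Twitter\\cron_example_output.txt"
  if(!file.exists(logFile)) file.create(logFile)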

Because of the append=TRUE argument, data are appended to that file each time the script runs. Thus, it is not overwriting the data already in the file; it is building onto what is already there.

The sep="," argument allows us to specify our field delimiter--what symbol denotes the end of one variable and the start of the next on each line. Here we have selected a comma, with the intention of building a comma-delimited text file.

Finally, we have set col.names=FALSE to prevent R from placing the variable names or headers in the first row. Under other circumstances, it would be great to do this. But since we'll be appending data across days to the same file, we don't want those headers to appear as cases.

To wrap this up, let's collate all the commands we've used into one place. Let's assume that you've placed these in a script file and have saved it as something like "auto_scrape.R" on your machine. In principle, every time you run this script, you will load the libraries you need, authenticate your access with Twitter, perform the search of interest, place the results into a dataframe, and then append that data into a text file on your machine.



# Load Libraries
# These libraries need to be installed before you can call them.
# You only need to install the libraries once.

	library(twitteR)
	library(RCurl)

# Define Parameters for Authorization (handshake)
# Conduct authorization
# Please note that I've hidden my codes for obvious reasons. Please replace
# the fields below with your information.

 consumer_key 	<- 'xxxDq6Gm'
 consumer_secret<- 'xxxoC2Fc'
 access_token 	<- 'xxxIHu47'
 access_secret 	<- 'xxxpCP2W'

 setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)


# Conduct search/scrape

  # Create date info
  d = Sys.Date()	# get today's date, on your machine
  d.since = d-1		# yesterday's date
  d.until = d		# today's date

  # Construct the search query
  search.params = paste('n=1000', ',', ' lang=', '"', 'en', '"', ', ', ' since=', '"' ,d.since, '"', ',' , ' until=', '"', d.until,'"'   ,sep="")
  search.terms = "coffee"

  # Run the search
  tweets = eval(parse(text = paste("searchTwitter(search.terms,", search.params, ")")))
	
# Transform tweets list into a data frame
  tweets.df = twListToDF(tweets)

# Remove line breaks and commas
  tweets.df[,1] = gsub("[\r\n]", " RETURN ", tweets.df[,1])
  tweets.df[,1] = gsub("[,]", " COMMA ", tweets.df[,1])

# Add scrape time
  tweets.df$scrape = Sys.time()

# Write dataframe to text file
# Specify the full path for your output file on your machine

  logFile = "C:\\Users\\Chris\\Documents\\R_Twitter\\cron_example_output.txt"
  write.table(tweets.df,file=logFile,append=TRUE, sep=",", col.names=FALSE)

If you remember to run this script each day, you're golden. But wouldn't it be better to let your computer be the conscientious one on your behalf?

Automating the Process in Windows

There are a few steps we need to follow to automate this process in Windows.

Create your R Script File

You will need to create the R script that you wish to run. An example of such a script was discussed previously. Make a note of where on your computer the script is located. In one of the steps below you will need to be able to reference the location of the script using its path (e.g., C:\Users\Chris\Documents\R_Twitter\auto_scrape.R).

Make a Note of Where your R.exe Program is Located

Second, you need to know the path for the location of the R.exe file on your computer--the executable for R. On my computer, that happens to be "C:\Program Files\R\R-3.2.4revised\bin\x64\R.exe". It might be in a different location on your machine. Please note that you need the full path for R.exe, not Rgui.exe.

Create a Batch File

You will need to create a file that contains commands for Windows to execute. The file should be a simple text file (not a MS Word file) and, for the purposes of our example, should be called task.bat.

The file should contain nothing more than the following:

@echo off
"LOCATION OF YOUR R.EXE FILE" CMD BATCH "LOCATION OF YOUR R SCRIPT"

Here is an example, using the paths on my machine:

@echo off
"C:\Program Files\R\R-3.2.4revised\bin\x64\R.exe" CMD BATCH "C:\Users\Chris\Documents\R_Twitter\auto_scrape.R"

What we are going to do is instruct Windows to run this batch file according to a specific schedule. The batch file instructs Windows to open the R.exe program and then run the auto_scrape.R script file. Once the commands in the script file have been run, the process will close. Until next time.

Configure Task Scheduler to Run the Batch File At Pre-defined Times

Our final step is to configure a task in the Windows Task Scheduler to run our batch file once a day for a (definite or indefinite) period of time. If you're using a Mac, the process will be pretty much the same, except you're looking for a cron-style scheduling application rather than a Task Scheduler application; adapt the steps as needed to accomplish the same goals.

  1. Find your Task Scheduler app. If you don't know how to access it, do a search for Task Scheduler in your Start menu.
  2. Choose the option called Create Basic Task. It is possible that the details vary from one version of Windows to the next. This will open a task wizard that will allow you to name the task, provide a description, etc.
  3. Select the appropriate options, given your needs. In the example illustrated in the screenshots below I have scheduled a task to run every hour. The task is to run the task.bat file. And, as you saw previously, that file runs the auto_scrape.R script in R.

Once you've configured everything appropriately, you can test it out off-schedule by selecting the task and pressing the Run option. A command window should appear on your screen while the task runs. Once the task is complete, you should be able to check to see if it worked okay by opening your data file (cron_example_output.txt). You should also have a new file in the same directory that reports on the progress of the batch processing (e.g., how long it took to run and what, if any, errors it encountered along the way).

Figure 1. Naming the task and providing a description. figure01
Figure 2. Triggers tab. Create new or edit existing triggers for activating the script. figure02
Figure 3. Actions tab. What happens on the trigger? (Start task.bat) figure03
Figure 4. Conditions tab. Fine-tune the conditions under which the script will run. figure04
Figure 5. Settings tab. Fine-tune. figure05
Figure 6. If I choose to edit the trigger on the Triggers tab, here is an example of what I might see. Here I am instructing the task to run every hour until 8/12/2016. The initial "event" that starts the process occurs when the time hits 10:45 pm on 8/5/2016. figure06

Credits

Everything I learned about mining data from Twitter I learned on the Internet. Here are some incredibly valuable resources that I encourage you to check out.