Sunday 9 November 2014

Sentiment Analysis of Twitter Data

In my last post, I explained how to scrape data from websites. In this post, I will describe how to obtain Twitter data and perform sentiment analysis on it. The method presented here is a standard word-list-based way of doing sentiment analysis, and it can be extended to many other things such as sentiment analysis of news articles, book reviews, movie reviews, etc. Thus, the output obtained from the last post can be used as an input for the R code in this post!

Part 1: Comparing sentiment of tweets from Arvind Kejriwal and Narendra Modi
- Analysis of tweets from a particular Twitter handle

Here, tweets from two interesting political personalities, Arvind Kejriwal and Narendra Modi, are considered.

#Libraries
# --------------------------------------------------------------------------------------------------
library(twitteR)
library(ROAuth)    # provides OAuthFactory
library(plyr)      # provides laply
library(stringr)   # provides str_split
# ---------------------------------------------------------------------------------------------------
#OAuth Credentials and Authentication
# go to dev.twitter.com and create a new application after which you will get a consumerKey and consumerSecret
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- " Enter your consumerKey here "
consumerSecret <- " Enter your consumerSecret here "
twitCred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)
download.file(url="http://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")
twitCred$handshake(cainfo = system.file("CurlSSL", "cacert.pem",
                   package = "RCurl"))           # the OAuth handshake is done once, before you send any requests
registerTwitterOAuth(twitCred)

#Main code
# ----------------------------------------------------------------------------------------------------
tweets <- userTimeline('ArvindKejriwal', 1500,  cainfo="cacert.pem")
# put the Twitter handle here along with the number of tweets you want;
# the Twitter API returns at most the 3200 most recent tweets of a timeline.
# Repeat with 'narendramodi' (and different file names) for the second handle.
tweetsdf1 <- twListToDF(tweets)
write.csv(tweetsdf1, file= "kejri.csv")
# output file containing info such as retweets, favourites, timestamp etc.
tweets=read.csv("kejri.csv", header=TRUE, stringsAsFactors=FALSE)
# sets of positive and negative words to match against
pos_words=scan("positive-words.txt", what="character", comment.char=";")
neg_words=scan("negative-words.txt", what="character", comment.char=";")

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
# sentiment score calculating function
{
  require(stringr)
  # we got a vector of sentences. plyr will handle a list
  # we want a simple array of scores back, so we use laply()
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    # match() returns the position of the matched term or NA
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress )
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
tweets.score=score.sentiment(tweets$text, pos_words,neg_words, .progress='text')
write.csv(tweets.score,"sentiment_kejrii.csv") # this file shows sentiment score for each tweet
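
To see what the scoring rule does, here is a minimal sketch that applies the same cleaning and counting steps to a few made-up sentences. The word lists `toy_pos` and `toy_neg` and the helper `toy_score` are placeholders I have assumed for illustration; they are not the real lexicon files or part of the code above.

```r
# Toy word lists: placeholders for illustration, not the real lexicon files
toy_pos <- c("good", "great", "happy")
toy_neg <- c("bad", "sad", "corrupt")

# Same rule as score.sentiment(): positive matches minus negative matches
toy_score <- function(sentence, pos.words, neg.words) {
  sentence <- gsub("[[:punct:]]", "", sentence)   # drop punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)   # drop control characters
  sentence <- gsub("\\d+", "", sentence)          # drop digits
  words <- unlist(strsplit(tolower(sentence), "\\s+"))
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

toy_score("Great rally today, happy crowd!", toy_pos, toy_neg)       # 2
toy_score("Bad roads and a sad state of affairs", toy_pos, toy_neg)  # -2
toy_score("Meeting at 5 pm", toy_pos, toy_neg)                       # 0
```

A sentence with no listed words scores 0, so a score of 0 can mean either "neutral" or "equal numbers of positive and negative words".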


When we pass the request through OAuth for a particular Twitter handle, we obtain the tweet content along with information such as favorited, favoriteCount, replyToSN, created (time and date), truncated, replyToSID, id, replyToUID, statusSource, screenName, retweetCount, isRetweet, retweeted, longitude and latitude. The score-calculating function gives +1 for each positive word, -1 for each negative word and 0 for each neutral word. The sets of positive and negative words have been obtained from link. So finally, it gives a net sentiment score for each tweet, which can be used for further analyses (time- or day-wise sentiment, retweet/favourite counts over the timeline, etc.).
The following is the sentiment comparison of tweets from @ArvindKejriwal and @narendramodi:
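
As one possible day-wise analysis, the scores can be averaged per calendar day. This is only a sketch: the data frame `scored` below holds invented rows standing in for the `score` column combined with the `created` column, and treating the timestamps as UTC is my assumption.

```r
# Placeholder rows standing in for the scores combined with the 'created' column;
# the values are invented for illustration only
scored <- data.frame(
  created = c("2014-11-07 09:10:00", "2014-11-07 18:45:00",
              "2014-11-08 08:00:00", "2014-11-08 21:30:00"),
  score   = c(2, 0, -1, -1),
  stringsAsFactors = FALSE
)

# read.csv() returns 'created' as text, so parse it before grouping
# (UTC is an assumption here)
scored$day <- as.Date(as.POSIXct(scored$created, tz = "UTC"))

# Mean sentiment score per calendar day
daily <- aggregate(score ~ day, data = scored, FUN = mean)
print(daily)
```

With the real data, `tweets$created` and `tweets.score$score` are in the same row order, so they can be placed side by side in one data frame before aggregating.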

Comparing sentiment of tweets from Modi and Kejriwal
Well, you guessed it right. Modi has a higher proportion of positive tweets, while Kejriwal has a higher proportion of neutral and negative tweets compared to Modi.
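
A comparison like this can be computed by turning each score into a positive, neutral or negative label and taking proportions per handle. The sketch below uses invented score vectors (`scores_a`, `scores_b`) and hypothetical helper names; it does not reproduce the actual numbers behind the chart.

```r
# Invented scores for illustration only; not the actual results behind the chart
scores_a <- c(2, 1, 0, 0, -1, 1, 3, 0)
scores_b <- c(0, -1, -2, 0, 1, 0, -1, 0)

# Map a numeric score to a three-level category using its sign
categorise <- function(scores) {
  factor(sign(scores), levels = c(-1, 0, 1),
         labels = c("negative", "neutral", "positive"))
}

# Share of tweets falling in each category
category_props <- function(scores) {
  cats <- categorise(scores)
  setNames(as.vector(table(cats)) / length(cats), levels(cats))
}

# One row per handle, one column per category
props <- rbind(handle_a = category_props(scores_a),
               handle_b = category_props(scores_b))
print(round(props, 3))
# barplot(props, beside = TRUE, legend.text = TRUE)  # optional grouped bar chart
```

Using proportions rather than raw counts keeps the comparison fair when the two handles have different numbers of tweets.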


Part 2: Analysis of tweets having a particular hashtag
- Analysing #ModiAtMadison

The following lines of code will help you pick up tweets containing a specific hashtag. After this, the sentiment score calculating function can be applied to the tweet text to obtain the sentiments, as in the previous part.

tweets  = searchTwitter("#ModiAtMadison", n=1500, cainfo="cacert.pem")
tweets.text = laply(tweets, function(t) t$getText())

PM Modi’s visit to the United States drew a lot of attention. The whole of India was watching him closely. His speech to Indian-Americans at Madison Square Garden was attended by over 30 Congressmen. Some people criticized him, some praised him. But this sentiment graph shows how Twitter was buzzing about it:

Positive sentiment dominating the tweets for #ModiAtMadison 
