Sentiment Analysis with twitteR and tidytext

A sentiment analysis is a useful way of gauging group opinion on a certain topic at a particular point in time.

Using social media data, let us see how we can use the twitteR library to download tweets from Twitter and run a sentiment analysis to gauge current sentiment on gold prices.

Downloading tweets with twitteR

Before you can download tweets, you need to register an application at apps.twitter.com.

Once the application has been created, you will receive the authentication credentials (consumer key, consumer secret, access token and access secret) needed to connect to the API.

First, we authenticate with our credentials and then download up to 500 tweets matching the search term “gold prices”:

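library(twitteR)

# The four credentials are character strings that must be defined beforehand
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

# Request up to 500 tweets matching "gold prices" posted since 25 June 2018
tweets <- searchTwitter("gold prices", n = 500, since = "2018-06-25")
df <- twListToDF(tweets)  # convert the list of tweets to a data frame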
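
Since the credentials have to come from somewhere, here is a minimal sketch of one way to supply them without hard-coding them in the script: reading them from environment variables. The variable names (TWITTER_CONSUMER_KEY and so on) and the helper read_credentials() are hypothetical and not part of twitteR; only Sys.getenv() from base R is used.

```r
# Hypothetical environment variable names; use whatever names you have set.
credential_names <- c(
  consumer_key    = "TWITTER_CONSUMER_KEY",
  consumer_secret = "TWITTER_CONSUMER_SECRET",
  access_token    = "TWITTER_ACCESS_TOKEN",
  access_secret   = "TWITTER_ACCESS_SECRET"
)

# Hypothetical helper: returns the four credentials as a named list,
# or stops with an error naming any that are missing or empty.
read_credentials <- function(var_names = credential_names) {
  values <- Sys.getenv(var_names, unset = NA)
  names(values) <- names(var_names)
  missing <- names(values)[is.na(values) | values == ""]
  if (length(missing) > 0) {
    stop("Missing credentials: ", paste(missing, collapse = ", "))
  }
  as.list(values)
}

# Usage (requires the variables to be set, for example in ~/.Renviron):
# creds <- read_credentials()
# setup_twitter_oauth(creds$consumer_key, creds$consumer_secret,
#                     creds$access_token, creds$access_secret)
```

Keeping the keys out of the script means it can be shared without exposing them.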

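Next, we collapse the text of all the tweets into a single string and use dplyr to store it as one observation:

str(df)  # inspect the columns returned by twListToDF

# Join the tweet text (not the metadata columns) into one string
wctext <- paste(df$text, collapse = " ")

library(dplyr)
wctext <- tibble(line = 1, text = wctext)
wctext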
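
To see what "one observation" means in practice, here is a small sketch on a made-up vector of two tweets (toy_tweets and toy_text are illustrative names, not objects from the analysis). It needs no Twitter access.

```r
library(dplyr)

# Two made-up tweets standing in for the downloaded tweet text
toy_tweets <- c("Gold prices fall again", "Is gold a safe bet?")

# Collapse them into a single string and store it as a one-row tibble
toy_text <- tibble(line = 1, text = paste(toy_tweets, collapse = " "))

nrow(toy_text)  # the whole corpus now sits in a single row
toy_text$text
```

Collapsing everything into one row is enough when we only care about overall word counts; it does discard which tweet each word came from.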

Conducting sentiment analysis with tidytext

Let us now build the sentiment analysis with tidytext. We first split the text into individual words (tokens) and then remove stop words. Stop words are words that are used very frequently (such as "the" or "and") but add little to a sentiment analysis. Because they are so numerous, leaving them in would drown out the words that actually carry sentiment.

library(tidytext)

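# Split the text into one lowercase word per row
wctext2 <- wctext %>%
  unnest_tokens(word, text)

# stop_words ships with tidytext
data(stop_words)

# Drop every word that appears in the stop word list
wctext2 <- wctext2 %>%
  anti_join(stop_words, by = "word")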
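
The same two steps can be tried on a single made-up sentence, which makes it easier to see what tokenising and the anti-join actually do. The sentence and the names toy and toy_words are illustrative only.

```r
library(dplyr)
library(tidytext)

# A made-up sentence standing in for the collapsed tweet text
toy <- tibble(line = 1,
              text = "The price of gold is falling and the risk is rising")

toy_words <- toy %>%
  unnest_tokens(word, text) %>%          # one lowercase word per row
  anti_join(stop_words, by = "word")     # drop words found in stop_words

toy_words$word
```

Common words such as "the" disappear, while "gold" survives, which is why it has to be handled separately.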

In addition, I have defined my own list of stop words in a text file (such as "trump", "twitter" and "gold") and remove these from the analysis as well. Note that I remove the word "gold" because the sentiment lexicon classifies it as a positive word, which in this case would prevent us from gauging true sentiment on the commodity.

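# Adjust this path to the folder that holds your own stopwords.txt
setwd("C:/Users/michaeljgrogan/Documents/a_documents/computing/data science/datasets")
library(tm)

# readLines returns one element per line of the file
custom_stopwords <- readLines("stopwords.txt")
x <- wctext2$word
x <- removeWords(x, custom_stopwords)
x <- x[x != ""]  # removeWords blanks out matches, so drop the empty strings

wctext2 <- data.frame(wctext2 = x)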
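
As an alternative sketch, the custom stop words can also be dropped with dplyr's filter() instead of tm, which keeps the data in a tibble with a word column. This is not the approach used above; the word list and the names words, custom_stopwords and words_kept are made up for illustration.

```r
library(dplyr)

# Made-up tokens and a made-up custom stop word list
words <- tibble(word = c("gold", "risk", "trump", "fall", "twitter"))
custom_stopwords <- c("trump", "twitter", "gold")

# Keep only the rows whose word is not in the custom list
words_kept <- words %>%
  filter(!word %in% custom_stopwords)

words_kept$word  # "risk" "fall"
```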

We now count how often each word occurs and store the result in a tibble. We then join the counts to the Bing sentiment lexicon, which labels each word as positive or negative.

# Count the occurrences of each word, most frequent first
word_counts <- wctext2 %>%
  count(word = as.character(wctext2), sort = TRUE)

# Keep only words that occur more than once
word_counts_filtered <- word_counts %>% filter(n > 1)
str(word_counts_filtered)

barplot(word_counts_filtered$n, main = "Word Frequency",
        xlab = "Word", names.arg = word_counts_filtered$word)

# Keep only the words found in the Bing lexicon, with their sentiment label
barsentiment <- word_counts_filtered %>%
  inner_join(get_sentiments("bing"), by = "word")

library(ggplot2)

# Plot each word's count, with negative words given a negative sign
barsentiment %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  ylab("Contribution to sentiment") +
  coord_flip()

Let us see what the sentiment analysis looks like:

[Figure: sentiment gold]

We can see that the positive words in our analysis such as "amazing" and "beautiful" do not necessarily describe positive sentiment on gold in financial terms.

However, there are quite a few negative words that do, such as "risk" and "fall". From looking at the above, the general sentiment from a financial perspective appears to be negative on gold.
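
One rough way to put a number on this impression is a net score: the count of positive words minus the count of negative words. The sketch below is not part of the original analysis, and the counts are made up; it only reuses four words mentioned in the text and their Bing labels.

```r
library(dplyr)
library(tidytext)

# Made-up counts for four words mentioned in the text
toy_counts <- tibble(
  word = c("risk", "fall", "amazing", "beautiful"),
  n    = c(5, 4, 2, 1)
)

# Attach the Bing label and give negative words a negative sign
scored <- toy_counts %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(signed_n = ifelse(sentiment == "negative", -n, n))

# A value below zero means negative words outweigh positive ones
net_score <- sum(scored$signed_n)
net_score
```

Such a score treats every lexicon word equally, so it inherits the caveat above: words like "amazing" count as positive even when they say nothing about gold as an investment.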

Gold prices have also been heading downwards over the past 24 hours, which appears to support the sentiment we have gleaned from Twitter.

[Figure: gold futures]

Conclusion

In this tutorial, you have learned:

  • How to download tweets with the twitteR library
  • How to conduct text mining with tidytext
  • How to generate a sentiment analysis

Author: Michael Grogan

Michael Grogan is a machine learning consultant and educator, with a profound passion for statistics and data science.
