tidytext: Word Clouds and Sentiment Analysis in R

The tidytext library is one of the most innovative R packages I’ve come across.

Developed by Julia Silge and David Robinson, tidytext is a cornerstone library for text mining in R: it represents text as tidy data frames, so it works alongside tools such as dplyr.

Here, I conduct a text mining analysis on one of my earlier blog posts, “Big Data Helps You Target The Right Market”.

We will see how text mining can:

  1. Create a “tibble” which can identify the most frequent words in the text
  2. Develop a wordcloud showing a graphical illustration of word frequency
  3. Conduct a sentiment analysis to illustrate positive and negative words in the text

Use of tidytext and word frequency

First, we load the blog post from a text file. We could read it directly from the web, but a local text file keeps the illustration simple for now.

To determine word frequency, we are going to do the following:

  • Convert text into a data frame suitable for analysis with tidytext
  • Remove stop words, i.e. common words that carry little meaning on their own (and, but, etc.), from the text
  • Form a tibble to sort words by frequency in descending order
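  • Filter this tibble to only include words which appear in the text more than once

# Read the file and collapse all lines into a single string
library(stringr)
WordList <- str_split(readLines("textfile.txt"), pattern = " ")
text <- paste(unlist(WordList), collapse = " ")
str(text)

# Store the text in a one-row tibble (data_frame() is deprecated in favour of tibble())
library(dplyr)
text_df <- tibble(line = 1, text = text)
text_df

library(tidytext)

# Split the text into one lowercase word per row
text_df2 <- text_df %>%
  unnest_tokens(word, text)

# Remove stop words using the stop_words data set that ships with tidytext
data(stop_words)

text_df2 <- text_df2 %>%
  anti_join(stop_words, by = "word")

# Count each word, most frequent first
tibble <- text_df2 %>%
  count(word, sort = TRUE)

# Keep only the words that appear more than once
tibblefiltered <- tibble %>% filter(n > 1)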

We now see that we have a tibble which shows the frequency of each word in the text (having eliminated stop words).

tibble

The filtered version, tibblefiltered, keeps only those words which appear more than once in the text (n > 1).
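To try the steps above without a text file, here is a minimal sketch on a made-up sentence. The `sample_text` string and the `word_counts` name are assumptions for illustration only; they are not part of the original analysis.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample text standing in for the blog post
sample_text <- paste(
  "Big data helps you target the right market.",
  "Data about the market helps you find customers."
)

data(stop_words)

# One-row tibble -> one word per row -> drop stop words -> count -> filter
word_counts <- tibble(line = 1, text = sample_text) %>%
  unnest_tokens(word, text) %>%            # lowercases and strips punctuation
  anti_join(stop_words, by = "word") %>%   # remove stop words
  count(word, sort = TRUE) %>%             # frequency, descending
  filter(n > 1)                            # keep words seen more than once

word_counts
```

Chaining the steps in a single pipeline gives the same result as the step-by-step version; the intermediate objects are only useful if you want to inspect each stage.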

Wordcloud

Next, we design our word cloud to illustrate the word frequency graphically, specifying a maximum of 100 words in the cloud:

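# Word cloud of word frequencies
library(wordcloud)
text_df2 %>%
  anti_join(stop_words, by = "word") %>%  # stop words were already removed above; kept as a safeguard
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))  # plot at most 100 words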

[Figure: word cloud of word frequencies in the blog post]
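The layout of a word cloud is randomised, so the picture changes between runs. A small sketch of how the plot could be made repeatable, assuming a handful of made-up words and counts (`words` and `freqs` are hypothetical stand-ins for the counted tibble):

```r
library(wordcloud)

# Hypothetical word frequencies standing in for the counted tibble
words <- c("data", "market", "customers", "target", "analysis")
freqs <- c(12, 9, 6, 4, 2)

set.seed(123)  # fix the random seed so the layout is the same on every run
wordcloud(words, freqs,
          min.freq = 2,          # skip words seen fewer than 2 times
          max.words = 100,       # cap on the number of words plotted
          random.order = FALSE)  # plot words in decreasing order of frequency
```

Note that `min.freq` defaults to 3 in `wordcloud()`, so rarer words are dropped unless you lower it.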

Sentiment Analysis

Now, suppose we wish to classify our words according to whether they are positive or negative. We can join the words to the Bing sentiment lexicon and draw a comparison cloud, which plots the negative and positive words in separate regions:

# Sentiment analysis with reshape2

library(reshape2)
text_df2 %>%
  inner_join(get_sentiments("bing"), by = "word") %>%     # keep words found in the Bing lexicon
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%  # word-by-sentiment matrix of counts
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

[Figure: comparison cloud of negative and positive words]
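A comparison cloud shows which words are positive or negative, but not the overall balance. As a rough sketch, the same join can be used to total the words per sentiment. The `tokens` tibble below is a made-up stand-in for `text_df2`, and `sentiment_totals` is a name chosen for illustration:

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens standing in for text_df2 (one word per row)
tokens <- tibble(word = c("good", "great", "bad", "market", "good"))

sentiment_totals <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words in the lexicon
  count(sentiment)                                      # number of words per sentiment

sentiment_totals
```

With these made-up tokens, three words match as positive and one as negative; "market" is not in the lexicon, so the inner join drops it.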

For more information on how tidytext works, I recommend Text Mining with R by Julia Silge and David Robinson. I found it very insightful in this area.

Usual disclaimers apply: This is not a promotion of the title. I am merely citing this book as a text I found helpful. I have no business relationship with the authors, nor am I being compensated by O’Reilly Media for citing this book.
