The tidytext package is one of the most innovative R libraries I’ve come across. Developed by Julia Silge and David Robinson, it is the cornerstone package for text mining in R.
We will see how text mining can:
- Create a “tibble” which can identify the most frequent words in the text
- Develop a wordcloud showing a graphical illustration of word frequency
- Conduct a sentiment analysis to illustrate positive and negative words in the text
Use of tidytext and word frequency
First, we load the blog post from a text file. We could link to it directly, but a local text file keeps the illustration simple for the time being.
To determine word frequency, we are going to do the following:
- Convert text into a data frame suitable for analysis with tidytext
- Remove stop words (words with no inherent value, such as “and” and “but”) from the text
- Form a tibble to sort words by frequency in descending order
- Filter this tibble to only include words which appear in the text more than once
```r
# Convert the lines of the text file into a single string
library(stringr)
WordList <- str_split(readLines("textfile.txt"), pattern = " ")
text <- paste(unlist(WordList), collapse = " ")
str(text)

# Put the text into a one-row data frame
library(dplyr)
text_df <- tibble(line = 1, text = text)
text_df

# Tokenise into one word per row
library(tidytext)
text_df2 <- text_df %>%
  unnest_tokens(word, text)

# Remove stop words
data(stop_words)
text_df2 <- text_df2 %>%
  anti_join(stop_words)

# Count words in descending order of frequency,
# then keep only those that appear more than once
wordcounts <- text_df2 %>%
  count(word, sort = TRUE)
tibblefiltered <- wordcounts %>%
  filter(n > 1)
```
We now see that we have a tibble which shows the frequency of each word in the text (having eliminated stop words).
Because we filtered on n > 1, only words that appear more than once in the text are included.
Next, we design our word cloud to illustrate the word frequency graphically, specifying a maximum of 100 words in the cloud:
```r
# Word cloud
library(wordcloud)
text_df2 %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
```
Now, suppose we wish to classify our words according to whether they are positive or negative. Here is how we can use a sentiment-based word cloud to classify:
```r
# Sentiment analysis with reshape2
library(reshape2)
text_df2 %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)
```
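For context, get_sentiments("bing") returns a two-column tibble that labels each lexicon word “positive” or “negative”, which is why the inner_join attaches a sentiment to every matching word (and silently drops words the lexicon does not cover). A minimal sketch of that join on a made-up word list:

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts standing in for the real output of count()
words <- tibble(word = c("happy", "terrible", "table"), n = c(3, 2, 5))

words %>%
  inner_join(get_sentiments("bing"), by = "word")
# "happy" and "terrible" come back labelled with their sentiment;
# a neutral word like "table" is not in the lexicon, so it drops out
```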
Now, suppose that we wish to rank our different keywords based on sentiment. How would we do it?
First, let us take our “tibblefiltered” data frame. Essentially, we want to assign each word a sentiment and then plot those sentiments as a bar chart.
```r
barsentiment <- tibblefiltered %>%
  inner_join(get_sentiments("bing"), by = c("word"))
```
I have named the above “barsentiment”, as this is the data we will use to create our bar chart.
Now, we will remove all other data except “barsentiment” from our environment.
```r
# Keep only barsentiment in the environment
rm(list = ls()[!(ls() %in% c("barsentiment"))])
library(ggplot2)
```
Now, let’s generate our bar chart:
```r
barsentiment %>%
  count(sentiment, word, wt = n) %>%
  ungroup() %>%
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col() +
  ylab("Contribution to sentiment") +
  coord_flip()
```
Here is what our bar chart looks like:
You can see that a “Contribution to sentiment” score is given on the bar chart, and this allows us to quantify the sentiment contribution of each word in our dataset.
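The same per-word counts can also be collapsed into a single net score for the whole text, subtracting total negative occurrences from total positive ones (a minimal sketch, reusing the barsentiment data frame built above):

```r
library(dplyr)

barsentiment %>%
  count(sentiment, wt = n) %>%   # total word occurrences per sentiment class
  summarise(net = sum(ifelse(sentiment == "positive", n, -n)))
# a positive net value suggests the text skews positive overall
```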
For more information on how tidytext works, I recommend Text Mining with R by Julia Silge and David Robinson. I found it very insightful in this area and would highly recommend it.
Usual disclaimers apply: This is not a promotion of the title. I am merely citing this book as a text I found helpful. I have no business relationship with the authors, nor am I being compensated by O’Reilly Media for citing this book.