The tidytext package in R is one of the most innovative I’ve come across within the language.
Developed by Julia Silge and David Robinson, tidytext is a cornerstone package for text mining in R.
Here, I conduct a text mining analysis on one of my former blog posts – “Big Data Helps You Target The Right Market”.
We will see how text mining can:
- Create a “tibble” which can identify the most frequent words in the text
- Develop a wordcloud showing a graphical illustration of word frequency
- Conduct a sentiment analysis to illustrate positive and negative words in the text
Use of tidytext and word frequency
First, we’re going to load the blog post from a text file. We could read it directly from its URL, but a local text file keeps the illustration simple for now.
To determine word frequency, we are going to do the following:
- Convert text into a data frame suitable for analysis with tidytext
- Remove stop words (common words such as “and” and “but” that carry little meaning on their own) from the text
- Form a tibble to sort words by frequency in descending order
- Filter this tibble to only include words which appear in the text more than once
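    # Read the file and collapse it into a single string
    library(stringr)
    WordList <- str_split(readLines("textfile.txt"), pattern = " ")
    text <- paste(unlist(WordList), collapse = " ")
    str(text)

    # Wrap the string in a one-row tibble (data_frame() is deprecated)
    library(dplyr)
    text_df <- tibble(line = 1, text = text)
    text_df

    # Split the text into one lowercase word per row
    library(tidytext)
    text_df2 <- text_df %>% unnest_tokens(word, text)

    # Remove stop words
    data(stop_words)
    text_df2 <- text_df2 %>% anti_join(stop_words, by = "word")

    # Count words in descending order, then keep those appearing more than once
    word_counts <- text_df2 %>% count(word, sort = TRUE)
    word_counts_filtered <- word_counts %>% filter(n > 1)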
We now have a tibble which shows the frequency of each word in the text (having eliminated stop words).
Because we filtered on n > 1, the filtered tibble includes only those words which appear more than once in the text.
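To try the same steps without a text file on disk, here is a minimal sketch that runs the pipeline on a short inline string. The sample sentence and the names sample_text, sample_df, sample_counts and sample_filtered are made up for illustration and are not part of the original analysis.

```r
library(dplyr)
library(tidytext)

# Hypothetical sample text standing in for the blog post
sample_text <- "Big data helps you target the right market. Data about the market helps."

# One-row tibble, the shape unnest_tokens() expects
sample_df <- tibble(line = 1, text = sample_text)

sample_counts <- sample_df %>%
  unnest_tokens(word, text) %>%           # one lowercase word per row, punctuation stripped
  anti_join(stop_words, by = "word") %>%  # drop words found in the stop word list
  count(word, sort = TRUE)                # frequency of each word, descending

# Keep only words that appear more than once
sample_filtered <- sample_counts %>% filter(n > 1)
sample_filtered
```

Which words survive depends on the contents of the stop_words dataset shipped with tidytext, so it is worth printing sample_counts as well to see what was removed.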
Next, we design our word cloud to illustrate the word frequency graphically, specifying a maximum of 100 words in the cloud:
    # Word cloud
    library(wordcloud)
    text_df2 %>%
      anti_join(stop_words, by = "word") %>%
      count(word) %>%
      with(wordcloud(word, n, max.words = 100))
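One detail worth knowing is that wordcloud() has a min.freq argument which defaults to 3, so words appearing only once or twice are left out unless you lower it. The sketch below uses a small made-up frequency table (freq_df is hypothetical, not the blog post's data) to show this, along with set.seed() to keep the layout reproducible.

```r
library(wordcloud)

# Hypothetical word counts standing in for the tibble built from the blog post
freq_df <- data.frame(
  word = c("data", "market", "customers", "target", "analysis"),
  n    = c(6, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

set.seed(42)  # the layout is random, so fix the seed for a reproducible plot

with(freq_df, wordcloud(
  word, n,
  min.freq     = 1,     # default is 3, which would drop "target" and "analysis"
  max.words    = 100,   # upper limit on the number of words plotted
  random.order = FALSE  # plot words in decreasing order of frequency
))
```

For a short text such as a single blog post, lowering min.freq can make the difference between a sparse cloud and a useful one.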
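Now, suppose we wish to classify our words according to whether they are positive or negative. Here is how we can do so with a comparison cloud, which matches each word against the Bing sentiment lexicon and plots positive and negative words in separate groups: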
    # Sentiment analysis with reshape2
    library(reshape2)
    text_df2 %>%
      inner_join(get_sentiments("bing"), by = "word") %>%
      count(word, sentiment, sort = TRUE) %>%
      acast(word ~ sentiment, value.var = "n", fill = 0) %>%
      comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)
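To see what is passed to comparison.cloud(), it can help to inspect the intermediate matrix on a tiny example. The sketch below uses a handful of made-up words (words_df is hypothetical) and stops before the plotting step, assuming the Bing lexicon returned by get_sentiments("bing") labels “good” as positive and “bad” as negative.

```r
library(dplyr)
library(tidytext)
library(reshape2)

# Hypothetical tokens; "market" and "data" carry no sentiment in the lexicon
words_df <- tibble(word = c("good", "good", "bad", "market", "data"))

# Keep only words found in the Bing lexicon and count them by sentiment
sentiment_counts <- words_df %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE)

# Reshape to a matrix: one row per word, one column per sentiment, 0 where absent
sentiment_matrix <- sentiment_counts %>%
  acast(word ~ sentiment, value.var = "n", fill = 0)

sentiment_matrix
```

The inner join silently drops any word that is not in the lexicon, so the comparison cloud only ever shows the subset of the text that the lexicon recognises.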
For more information on how tidytext works, I recommend Text Mining with R by Julia Silge and David Robinson. I found it very insightful in this area.
Usual disclaimers apply: This is not a promotion of the title. I am merely citing this book as a text I found helpful. I have no business relationship with the authors, nor am I being compensated by O’Reilly Media for citing this book.