The tidytext package in R is one of the most innovative I’ve come across within the language.
Developed by Julia Silge and David Robinson, tidytext is a cornerstone package for text mining in R.
We will see how text mining in R can be used to:
- Create a “tibble” which can identify the most frequent words in the text
- Develop a wordcloud showing a graphical illustration of word frequency
- Conduct a sentiment analysis to illustrate positive and negative words in the text
Use of tidytext and word frequency
First, we load the blog post from a text file. We could link to it directly, but I am using a text file to keep the illustration simple for now.
To determine word frequency, we are going to do the following:
- Convert text into a data frame suitable for analysis with tidytext
- Remove stop words (common words such as “and” and “but” that carry little meaning on their own) from the text
- Form a tibble to sort words by frequency in descending order
- Filter this tibble to only include words which appear in the text more than once
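# Read the file and collapse it into a single string
library(stringr)
WordList <- str_split(readLines("textfile.txt"), pattern = " ")
text <- paste(unlist(WordList), collapse = " ")
str(text)

# Convert the string to a one-row tibble
library(dplyr)
text_df <- tibble(line = 1, text = text)
text_df

# Split the text into one word per row
library(tidytext)
text_df2 <- text_df %>% unnest_tokens(word, text)

# Remove stop words
data(stop_words)
text_df2 <- text_df2 %>% anti_join(stop_words, by = "word")

# Count words in descending order of frequency, then keep those appearing more than once
word_counts <- text_df2 %>% count(word, sort = TRUE)
word_counts_filtered <- word_counts %>% filter(n > 1)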
We now see that we have a tibble which shows the frequency of each word in the text (having eliminated stop words).
Because we filtered on n > 1, only words that appear more than once in the text are included.
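To make the steps above concrete without needing a text file, here is a minimal sketch that runs the same pipeline on a short made-up string. The sample sentence and the names `sample_text`, `word_counts` and `word_counts_filtered` are assumptions for illustration only; they are not taken from the blog post being analysed.

```r
library(dplyr)
library(tidytext)

# Made-up sample text (an assumption for illustration, not the blog post itself)
sample_text <- "Text mining with tidy data is fun. Tidy text makes text mining simple, and tidy tools help."

# One-row tibble holding the whole text
text_df <- tibble(line = 1, text = sample_text)

word_counts <- text_df %>%
  unnest_tokens(word, text) %>%           # one lowercase word per row, punctuation stripped
  anti_join(stop_words, by = "word") %>%  # drop common words such as "with", "is", "and"
  count(word, sort = TRUE)                # frequency table, most frequent first

# Keep only the words that appear more than once
word_counts_filtered <- word_counts %>%
  filter(n > 1)

print(word_counts_filtered)
```

With this sample, the filtered table should contain only text, tidy and mining, since every other remaining word appears once.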
Next, we design our word cloud to illustrate the word frequency graphically, specifying a maximum of 100 words in the cloud:
# Word cloud
library(wordcloud)
text_df2 %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))
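Word cloud layouts are randomised, so the picture can change from run to run. The following is a minimal sketch, assuming a small made-up frequency table (`toy_counts` is hypothetical and stands in for the counted words), of how the seed and a few `wordcloud()` arguments can make the result more predictable:

```r
library(wordcloud)

# Hypothetical word counts, standing in for the counted tokens from the text
toy_counts <- data.frame(
  word = c("text", "mining", "tidy", "data", "cloud"),
  n    = c(12, 9, 7, 4, 2),
  stringsAsFactors = FALSE
)

set.seed(42)  # fix the random layout so the cloud is reproducible

with(toy_counts, wordcloud(
  words        = word,
  freq         = n,
  min.freq     = 1,      # lower the frequency cutoff so rare words are still drawn
  max.words    = 100,    # cap on the number of words drawn
  random.order = FALSE   # draw words in decreasing order of frequency
))
```

Note that `min.freq` defaults to 3 in `wordcloud()`, so words that appear fewer than three times are not drawn unless the cutoff is lowered as above.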
Now, suppose we wish to classify our words according to whether they are positive or negative. Here is how we can do so with a sentiment-based word cloud, using the Bing sentiment lexicon:
# Sentiment analysis with reshape2
library(reshape2)
text_df2 %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 100)
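It can help to look at the matrix that is handed to `comparison.cloud()` before plotting it. Here is a minimal sketch on a handful of made-up tokens (`toy_words` and `sentiment_matrix` are hypothetical names, and the words are chosen for illustration only):

```r
library(dplyr)
library(tidytext)
library(reshape2)

# Made-up tokens (an assumption for illustration), one word per row
toy_words <- tibble(
  word = c("happy", "happy", "wonderful", "sad", "terrible", "terrible", "table")
)

sentiment_matrix <- toy_words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keeps only words found in the lexicon
  count(word, sentiment, sort = TRUE) %>%              # frequency per word and sentiment
  acast(word ~ sentiment, value.var = "n", fill = 0)   # rows = words, columns = sentiments

print(sentiment_matrix)
```

Each row is a word and each column a sentiment, with zeros filled in where a word does not carry that sentiment. A neutral word such as "table" is dropped by the inner join because it does not appear in the lexicon.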
For more information on how tidytext works, I recommend Text Mining with R by Julia Silge and David Robinson. I found it very insightful in this area.
Usual disclaimers apply: This is not a promotion of the title. I am merely citing this book as a text I found helpful. I have no business relationship with the authors, nor am I being compensated by O’Reilly Media for citing this book.
Hope you enjoyed this tutorial!