Text Mining and Search Analytics Using Python and R

The following example illustrates how the text mining capabilities of Python and R can be used to analyse a text file containing a set of words, split those words into categories, and calculate the frequency of each category. This is of particular relevance to social media analytics, where a user’s search terms are collated and grouped into categories to build a user profile.

While there are more advanced text mining techniques, such as word clouds, the example below is a simplified illustration of how web searches can be sorted into categories, with those categories then ranked by keyword frequency. Let us suppose that a user had conducted the following web searches in a given day:

How to test for serial correlation in R
Weather in New York today
Data analytics
machine learning
Seven countries you must travel to
Building a great business
Top 10 tips for using Linux
How technology is changing the workplace
Social media analytics
Top programming languages in data science
hotel reviews
Advantages and disadvantages of Python
Python vs. R
latest tennis news
skills required to be a data scientist
recipe ideas for dinner
best authors of all time
best days to book flights
where is the stock market trading
Donald Trump
best and worst things about being an entrepreneur
financial planning

Using this information, let us suppose that an advertiser, social media company, or similar entity would like to categorize the above web searches in order to determine which category of advertisements would best suit this user.

There are two main tasks in this regard:

  1. Order the searches into defined categories
  2. Determine the frequency of each category to build a user profile
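Before walking through the full example, the two tasks above can be sketched end to end in a few lines of Python using `collections.Counter`. The keyword-to-category map and the sample searches here are illustrative placeholders, not the full mapping used later:

```python
from collections import Counter

# Hypothetical mapping of search keywords to categories
categories = {
    'python': 'data', 'analytics': 'data',
    'flights': 'travel', 'hotel': 'travel',
    'weather': 'news', 'trump': 'news',
    'market': 'business', 'entrepreneur': 'business',
}

searches = [
    "Advantages and disadvantages of Python",
    "best days to book flights",
    "where is the stock market trading",
]

# Count one category hit per matching keyword in each search
counts = Counter(
    categories[word]
    for search in searches
    for word in search.lower().split()
    if word in categories
)
print(counts.most_common())
```

Each search contributes to the tally of every category whose keyword it contains, and `most_common()` returns the categories ranked by frequency.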

The following example uses Python and R to accomplish the above. I find Python superior to R when it comes to text mining, although I still choose to use R in this example for the category analysis.

When it comes to building a text mining algorithm using Python, there are three main things that we want Python to do for us:

  1. Import the web searches via text file (assume for this instance that searches have been saved as .txt)
  2. Eliminate keywords of non-importance (such as ‘and’, ‘it’, ‘be’, ‘the’, etc.)
  3. Replace chosen keywords with the keyword of their corresponding category (in this example, the four categories are ‘data’, ‘travel’, ‘news’, and ‘business’).

Text Editing With Python

# Read file
with open('filepath.txt', 'r') as file:
    filedata = file.read()

# Eliminate keywords of non-importance
filedata = filedata.replace('How ', ' ')
filedata = filedata.replace('Why ', ' ')
filedata = filedata.replace('of ', ' ')
filedata = filedata.replace('to ', ' ')
filedata = filedata.replace('you ', ' ')
filedata = filedata.replace('all ', ' ')
filedata = filedata.replace('and ', ' ')
filedata = filedata.replace('be ', ' ')
filedata = filedata.replace(' a ', ' ')
filedata = filedata.replace(' for ', ' ')
filedata = filedata.replace(' in ', ' ')
filedata = filedata.replace(' is ', ' ')
filedata = filedata.replace(' the ', ' ')
filedata = filedata.replace(' about ', ' ')
filedata = filedata.replace(' an ', ' ')

# Replace keywords with their category keyword
filedata = filedata.replace('Data', ' data ')
filedata = filedata.replace('Python', ' data ')
filedata = filedata.replace('R', ' data ')
filedata = filedata.replace('machine', ' data ')
filedata = filedata.replace('Linux', ' data ')
filedata = filedata.replace('technology', ' data ')
filedata = filedata.replace('flights', 'travel')
filedata = filedata.replace('countries', 'travel')
filedata = filedata.replace('hotel', 'travel')
filedata = filedata.replace('analytics', 'data')
filedata = filedata.replace('CNN', 'news')
filedata = filedata.replace('weather', 'news')
filedata = filedata.replace('Trump', 'news')
filedata = filedata.replace('market', 'business')
filedata = filedata.replace('entrepreneur', 'business')
filedata = filedata.replace('financial', 'business')

# Write to file
with open('filepath2.txt', 'w') as file:
    file.write(filedata)

When we write to the new file (‘filepath2.txt’ in this case), we see that the chosen keywords have been replaced with their category keyword, and the keywords that we do not want have been eliminated:

test serial correlation data Weather New York today data data data learning Seven travel must travel Building great business Top 10 tips using data data changing workplace Social media data Top programming languages data science travel reviews Advantages disadvantages data data vs. data latest tennis news skills required data scientist recipe ideas dinner best authors time best days book travel news where stock business trading Donald news best worst things being business business planning
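The long chain of replace() calls above can also be expressed more compactly with a stop-word set and a keyword-to-category dictionary. This is a sketch of the same transformation, not the original script, and the word lists below are abbreviated for illustration:

```python
# Words to strip and keyword-to-category substitutions,
# mirroring the replace() chain above (abbreviated here)
stop_words = {'how', 'why', 'of', 'to', 'the', 'and', 'is', 'in'}
category_map = {
    'python': 'data', 'linux': 'data', 'analytics': 'data',
    'flights': 'travel', 'hotel': 'travel',
    'trump': 'news', 'weather': 'news',
    'market': 'business', 'financial': 'business',
}

def categorize(text):
    # Drop stop words, then map remaining keywords onto categories
    words = [w for w in text.lower().split() if w not in stop_words]
    return ' '.join(category_map.get(w, w) for w in words)

print(categorize("Advantages and disadvantages of Python"))
```

Working word by word rather than with string substitution also avoids accidental replacements inside longer words (for example, replace('R', ...) would otherwise also alter every capital R in the text).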

Text Mining and Category Analysis Using R

With the above file, we now want to use R to read the list and count each keyword in the file, producing a frequency for each keyword – this will indicate which category appears most frequently across the web queries.

To do this, we load the stringr library, read the file using readLines, and then sort a frequency table of the words:

library(stringr)
WordList <- str_split(readLines("filepath2.txt"), pattern = " ")
WordTable <- sort(table(unlist(WordList)), decreasing = TRUE)
WordTable[1:100]

In the above, each keyword frequency is calculated and sorted in descending order, and [1:100] selects the top 100 words in the list.

When the above R commands are run, we get the following output: