rvest: Web Scraping Using R

rvest is one of the standard packages for web scraping in R. In the following example, we use rvest to import a sample table from this webpage.

The sample table below contains 100 observations, and we use the following code to import it into the R environment. Note that if we just wish to import the raw cell values, we can run the below. Strictly speaking, html_text returns a flat character vector rather than a two-dimensional matrix:

Part I – Import Web Table as Matrix

#Install the "rvest" package to scrape data (only needed once):
install.packages("rvest")
#Load the library:
library(rvest)
#Load HTML website:
html <- read_html("http://www.michaeljgrogan.com/rvest-web-scraping-using-r/")
#Include relevant HTML nodes using CSS generator:
marketingtable <- html_nodes(html, ".odd .column-4 , .odd .column-3 , .odd .column-2 , .odd .column-1, .even .column-4 , .even .column-3 , .even .column-2 , .even .column-1")
#Determine table length:
length(marketingtable)
#Import table by html_text function:
webtable <- html_text(marketingtable)
Note that while a user can use the SelectorGadget tool to visually select the parts of a webpage to import into R, here we simply specify columns 1 to 4 within the odd and even rows of the table as the selectors for html_nodes, as above.
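To see what such a selector list matches, here is a minimal sketch that runs on a small inline HTML fragment instead of the live page. The markup is an assumption: it imitates a table whose rows carry odd and even classes and whose cells carry column-N classes, which is what the selectors imply. The names sample_html, page, first_column and both_columns are made up for illustration.

```r
library(rvest)

# Hypothetical markup imitating the structure the selectors imply
sample_html <- '<table>
  <tr class="odd"><th class="column-1">Observation</th><th class="column-2">Marketing Spend</th></tr>
  <tr class="even"><td class="column-1">1</td><td class="column-2">9201</td></tr>
  <tr class="odd"><td class="column-1">2</td><td class="column-2">3759</td></tr>
</table>'

# read_html also accepts a literal HTML string
page <- read_html(sample_html)

# One column only: cells with class column-1 inside odd or even rows
first_column <- html_text(html_nodes(page, ".odd .column-1, .even .column-1"))

# Two columns: matched cells come back in document order, row by row
both_columns <- html_text(html_nodes(
  page,
  ".odd .column-1, .odd .column-2, .even .column-1, .even .column-2"
))

print(first_column)
print(both_columns)
```

Because the heading row also carries one of the row classes in this sketch, the heading text is returned together with the data cells, just as in the scraped output.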


As per the below, our table has been imported into R:

> webtable<-html_text(marketingtable)
> webtable
 [1] "Observation"         "Marketing Spend"     "Number of campaigns" "Consumer Rating"     "1"                   "9201"               
 [7] "20"                  "2"                   "2"                   "3759"                "61"                  "6"                  
[13] "3"                   "11702"               "39"                  "8"                   "4"                   "6990"               
[19] "84"                  "9"                   "5"                   "1023"                "44"                  "6"   
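Since the result is a flat vector in which every four consecutive elements form one table row, it can be reshaped into a true matrix with base R. The following is a minimal sketch that uses only the first twelve values printed above; webtable_sample and webmatrix are illustrative names, not part of the original code.

```r
# First 12 elements of the scraped vector, copied from the printed output
webtable_sample <- c("Observation", "Marketing Spend", "Number of campaigns", "Consumer Rating",
                     "1", "9201", "20", "2",
                     "2", "3759", "61", "6")

# Fill row by row: every 4 consecutive elements become one row
webmatrix <- matrix(webtable_sample, ncol = 4, byrow = TRUE)

# Promote the first row to column names, then drop it from the body
colnames(webmatrix) <- webmatrix[1, ]
webmatrix <- webmatrix[-1, , drop = FALSE]

print(webmatrix)
```

The values remain character strings at this point, so the matrix is suitable for inspection rather than arithmetic.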

Part II – Import Web Table as Data Frame

However, to analyse the imported data we will want to convert it into a data frame, i.e. a table that R can run calculations on directly. To accomplish this, we extract each column of the web table separately and combine the columns into a data frame.

#Import rvest library
library(rvest)
#Import table from web page
html <- read_html("http://www.michaeljgrogan.com/rvest-web-scraping-using-r/")
#Structure separate variables according to node
observation <- html_nodes(html, ".odd .column-1, .even .column-1")
marketingspend <- html_nodes(html, ".odd .column-2, .even .column-2")
numberofcampaigns <- html_nodes(html, ".odd .column-3, .even .column-3")
consumerrating <- html_nodes(html, ".odd .column-4, .even .column-4")
#Define separate variables by extracting the text of each node set
observationvalues <- html_text(observation)
marketingspendvalues <- html_text(marketingspend)
numberofcampaignsvalues <- html_text(numberofcampaigns)
consumerratingvalues <- html_text(consumerrating)
#Structure data frame and remove heading
df <- data.frame(observationvalues, marketingspendvalues, numberofcampaignsvalues, consumerratingvalues)
df2 <- df[-1, ]

We see that each column is selected separately (across both the odd and even rows), and the resulting variables are passed to the data.frame function so that each becomes its own column. Given that the original table from the web source has a heading row, we remove it by means of df[-1, ].

The new data frame is now set up in R under the name df2:

observationvalues marketingspendvalues numberofcampaignsvalues consumerratingvalues
                  1                 9201                      20                    2
                  2                 3759                      61                    6
                  3                11702                      39                    8
                  4                 6990                      84                    9
                  5                 1023                      44                    6
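One caveat is that html_text returns character strings, so the columns of such a data frame are not yet numeric. The following is a minimal sketch of the conversion, using a stand-in built from the five rows shown above; df2_sample and df2_numeric are illustrative names. We set stringsAsFactors = FALSE explicitly because versions of R before 4.0 would otherwise turn the strings into factors.

```r
# Stand-in for df2, built from the five rows shown above (values as strings)
df2_sample <- data.frame(
  observationvalues       = c("1", "2", "3", "4", "5"),
  marketingspendvalues    = c("9201", "3759", "11702", "6990", "1023"),
  numberofcampaignsvalues = c("20", "61", "39", "84", "44"),
  consumerratingvalues    = c("2", "6", "8", "9", "6"),
  stringsAsFactors = FALSE
)

# Convert every column from character to numeric, keeping the data frame shape
df2_numeric <- df2_sample
df2_numeric[] <- lapply(df2_sample, as.numeric)

# Inspect the column types and run a calculation on one column
str(df2_numeric)
mean(df2_numeric$marketingspendvalues)  # 6535 for these five rows
```

After the conversion, functions such as mean, summary or lm can be applied to the columns directly.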

Sample Table

Observation   Marketing Spend   Number of campaigns   Consumer Rating