rvest: Web Scraping Using R

rvest is one of the standard packages for web scraping in R. In the following example, we use it to import a sample table from this webpage.

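The sample table below contains 100 observations, and the following code imports it into the R environment. Note that if we just wish to import the raw cell values, without structuring them into a data frame, we can run the code below. Strictly speaking, html_text returns a character vector of cell values rather than a true matrix: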
 

Part I – Import Web Table as Matrix

#Install the "rvest" package to scrape data:
install.packages("rvest")
 
#Load the library:
library(rvest)
 
#Load HTML website:
html <- read_html("http://www.michaeljgrogan.com/rvest-web-scraping-using-r/")

 
#Select the relevant HTML nodes using CSS selectors:
marketingtable <- html_nodes(html, ".odd .column-4 , .odd .column-3 , .odd .column-2 , .odd .column-1, .even .column-4 , .even .column-3 , .even .column-2 , .even .column-1")

 
#Determine table length
length(marketingtable)

#Extract the text of each table cell with the html_text function
html_text(marketingtable)

Note that while the Selector Gadget tool can be used to visually select the parts of a webpage to import into R, here we simply specify the four column classes within the odd and even table rows as the selector for html_nodes above.
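Since html_text returns the cell values as one flat character vector, reshaping them into an actual matrix takes one more step. The following is a minimal sketch of that step, assuming a small inline HTML table with made-up values in place of the live page; the names page, cells and cellmatrix are hypothetical and not part of the original code.

```r
library(rvest)

# Hypothetical stand-in for the scraped page: two rows, four columns, made-up values
page <- read_html('
<table>
  <tr class="odd">
    <td class="column-1">1</td><td class="column-2">100</td>
    <td class="column-3">10</td><td class="column-4">5</td>
  </tr>
  <tr class="even">
    <td class="column-1">2</td><td class="column-2">200</td>
    <td class="column-3">20</td><td class="column-4">7</td>
  </tr>
</table>')

# Same style of selector as in the code above: each column class in odd and even rows
cells <- html_text(html_nodes(page, ".odd .column-1, .odd .column-2, .odd .column-3, .odd .column-4, .even .column-1, .even .column-2, .even .column-3, .even .column-4"))

# The nodes come back in document order (row by row), so fill the matrix by row
cellmatrix <- matrix(cells, ncol = 4, byrow = TRUE)
cellmatrix
```

The values are still character strings at this point; the matrix only restores the row and column layout.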
 

Result

As we can see from the below, our table has been imported into R:

[Image: the imported table as printed in the R console]
 

Part II – Import Web Table as Data Frame

However, to conduct an analysis on the imported data, we will want to convert it into a data frame, i.e. a table that R can run calculations on directly. To accomplish this, we extract each column of the web table separately and combine the results into a data frame.

#Load the rvest library
library(rvest)

#Import the web page
html <- read_html("http://www.michaeljgrogan.com/rvest-web-scraping-using-r/")

#Select the nodes for each column (odd and even rows together)
observation <- html_nodes(html, ".odd .column-1, .even .column-1")
marketingspend <- html_nodes(html, ".odd .column-2, .even .column-2")
numberofcampaigns <- html_nodes(html, ".odd .column-3, .even .column-3")
consumerrating <- html_nodes(html, ".odd .column-4, .even .column-4")

#Extract the text of each column into its own variable
observationvalues <- html_text(observation)
marketingspendvalues <- html_text(marketingspend)
numberofcampaignsvalues <- html_text(numberofcampaigns)
consumerratingvalues <- html_text(consumerrating)

#Structure data frame and remove heading (first row)
df <- data.frame(observationvalues, marketingspendvalues, numberofcampaignsvalues, consumerratingvalues)
df2 <- df[-1, ]
df2

We see that each column of the table, across both odd and even rows, is extracted into its own variable, and these variables are passed to the data.frame function as separate columns. Given that the original table from the web source has a heading row, we remove it by means of df[-1, ].
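Because html_text returns text, the columns of a data frame built this way hold character values (or factors in older versions of R) rather than numbers, so a conversion step is usually needed before any calculation. The following is a minimal sketch of that step, assuming a small stand-in data frame with made-up values whose first row holds the heading text; the values are hypothetical and only the column names are borrowed from the code above.

```r
# Hypothetical stand-in for df: the first row holds the scraped heading text
df <- data.frame(
  observationvalues = c("Observation", "1", "2", "3"),
  consumerratingvalues = c("Consumer Rating", "2", "6", "8"),
  stringsAsFactors = FALSE
)

# Drop the heading row, as in the article
df2 <- df[-1, ]

# Convert every column from character to numeric
df2[] <- lapply(df2, as.numeric)

# The columns can now be used in calculations
str(df2)
mean(df2$consumerratingvalues)
```

Converting after the heading row is removed avoids the NA values that as.numeric would otherwise produce for the heading text.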

The new data frame is now set up in R under the name df2:

[Image: the df2 data frame as printed in the R console]
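As an alternative to selecting each column by hand, rvest also provides html_table, which parses a table node into a data frame in one call. The following is a minimal sketch, assuming a small inline HTML table with made-up values and a header row marked up with th cells; whether the live page's markup meets that assumption has not been checked here, and the names page and scraped are hypothetical.

```r
library(rvest)

# Hypothetical stand-in for the scraped page, with a <th> header row and made-up values
page <- read_html('
<table>
  <tr><th>Observation</th><th>Rating</th></tr>
  <tr><td>1</td><td>2</td></tr>
  <tr><td>2</td><td>6</td></tr>
</table>')

# Pick the first table node and parse it; the <th> row becomes the column names
scraped <- html_table(html_node(page, "table"))
scraped
```

With this approach the heading row is used for column names rather than appearing as a data row, so the df[-1, ] step is not needed.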
 

Sample Table

Observation    Marketing Spend    Number of campaigns    Consumer Rating
19201202
23759616
311702398
415317322
519909953
68455603
7128798710
84205756
919872591
108509607
1114459791
1217074259
134665281
143977557
1516703372
1614538910
17159023010
1813887523
1911706246
208039959
211730789
229656545
237080439
2414023624
253271829
26133253510
276103594
282285569
291445724
3011889279
3115214552
324157238
3319960972
3410177865
359341615
369867499
3714569877
3817187804
3916881817
4019514388
411267353
4215365955
4315524439
4416564369
454414516
4615463328
4713583685
4816949366
493885533
502471466
5115426658
525095737
5373276910
541187232
5518195213
568284268
579421865
582460283
5915282372
608444599
6118366809
621196959
6351594810
6417151599
6510077262
667550696
6713385334
686623771
6911898210
7018020449
7115269323
7217858205
7313261808
743811965
758506231
769264811
776758659
784447517
793732533
808395906
8118228727
82125917610
8319156951
842078842
8511030599
86138148210
8717044476
8812852381
8912565584
9013972866
9111062726
925362717
9316324475
944256464
958015735
9633838310
975034386
9817459398
996382863
10054827510