plyr and dplyr: Data Manipulation in R


The purpose of the plyr and dplyr libraries in R is to manipulate data with ease.

As we’ve seen in a previous post, there are various methods of wrangling and summarising data in R. However, wouldn’t it be great if we had some libraries that can greatly simplify this process for us?

plyr and dplyr help us do just that. Both libraries contain a large number of commands; the purpose of this post is to introduce the select few that you’ll find yourself using most of the time. We will look at the following commands:

  1. arrange: Arrange data in ascending or descending order
  2. filter: Filter data based on certain criteria
  3. slice: Select a portion of a dataset
  4. head and tail: View the first or last observations (base R, included as a useful complement)
  5. mutate: Add variables to a data frame
  6. count: Count the number of observations in a data frame
  7. empty: Check for an empty data frame
  8. ddply: Summarize data

In this example, we will use a food price dataset available from the Humanitarian Data Exchange. First, let’s import the dataset into R:

#IMPORTING CSV FILES AND CONSTRUCTING DATA FRAMES
setwd("directory")
fooddata<-read.csv("filename.csv",encoding="UTF-8")
attach(fooddata)

Next, we load the plyr and dplyr libraries:

#PLYR AND DPLYR
library(plyr)
library(dplyr)

 

1. In the first example, let's see how we can arrange our data in both ascending and descending order. In this case, we wish to arrange fooddata by mp_price and then by mp_commoditysource, in that precise order.

#Order Data
fooddataascending<-arrange(fooddata, mp_price, mp_commoditysource)
fooddatadescending<-arrange(fooddata, desc(mp_price), desc(mp_commoditysource))

[Figure: fooddata sorted in ascending order]

[Figure: fooddata sorted in descending order]
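To make the ordering concrete, here is a minimal sketch on a tiny made-up data frame. The column names follow the dataset above, but the values (including the source labels) are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data; only the column names come from the dataset above
toy <- data.frame(
  mp_price = c(300, 120, 300, 750),
  mp_commoditysource = c("source_b", "source_a", "source_a", "source_b"),
  stringsAsFactors = FALSE
)

# Sort by price first; rows with the same price are then sorted by source
toy_ascending <- arrange(toy, mp_price, mp_commoditysource)

# desc() reverses the ordering of the sort key it wraps
toy_descending <- arrange(toy, desc(mp_price), desc(mp_commoditysource))

toy_ascending$mp_price            # 120 300 300 750
toy_ascending$mp_commoditysource  # "source_a" "source_a" "source_b" "source_b"
toy_descending$mp_price           # 750 300 300 120
```

Note that the second sort key only matters for rows that tie on the first one, which is why the order of the arguments is important.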
 

2. Now, let us suppose that we wish to extract only the rows with a price greater than 500.

#Filter Data
filterfooddata<-data.frame(filter(fooddataascending, mp_price > 500))

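As a small self-contained illustration of how the condition is applied, the sketch below filters a made-up price column (the values are hypothetical). It also shows that filter already returns a data frame when given one, so the data.frame() wrapper is optional.

```r
library(dplyr)

# Hypothetical prices, for illustration only
toy <- data.frame(mp_price = c(120, 500, 501, 750))

# Keep only the rows where the condition is TRUE; 500 itself is excluded by ">"
expensive <- filter(toy, mp_price > 500)
expensive$mp_price      # 501 750
is.data.frame(expensive)  # TRUE

# Several conditions separated by commas are combined with AND
mid_range <- filter(toy, mp_price > 200, mp_price <= 500)
mid_range$mp_price      # 500
```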
 

3. To get the first 50 observations:

#First 50 observations
fooddata50 <- fooddata %>% slice(1:50)

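slice selects rows by position rather than by value. A minimal sketch on a made-up data frame (the id column is hypothetical) shows positive indices keeping rows and negative indices dropping them:

```r
library(dplyr)

# Hypothetical data frame with 100 numbered rows
toy <- data.frame(id = 1:100)

# Positive indices keep the rows at those positions
first_50 <- toy %>% slice(1:50)
nrow(first_50)            # 50

# Negative indices drop the rows at those positions instead
without_first_50 <- toy %>% slice(-(1:50))
without_first_50$id[1]    # 51
```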
 

4. The head and tail functions are not part of plyr or dplyr (they come with base R), but I thought it appropriate to include them here as they are quite a useful complement. In the example below, the head command selects the first 10 observations from the dataset, while the tail command selects the last 10 observations.

head(fooddata50,10)
tail(fooddata50,10)

 

5. To add variables to our dataset, we use the mutate function. In this case, we are creating a new variable (log of mp_price), and adding it to our dataset.

#Add Variables
add_variables<-mutate(fooddata50,logprice=log(mp_price))

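The sketch below applies the same idea to a tiny made-up data frame (the prices are hypothetical). It adds two columns in one call and shows that mutate returns a new data frame rather than modifying the original.

```r
library(dplyr)

# Hypothetical prices, chosen so the logs are easy to check by eye
toy <- data.frame(mp_price = c(1, 10, 100))

toy_log <- mutate(
  toy,
  logprice   = log(mp_price),    # log() is the natural logarithm
  log10price = log10(mp_price)   # base-10 logarithm, for comparison
)

toy_log$log10price  # 0 1 2
names(toy)          # "mp_price": the original data frame is unchanged
```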
 

6. To count the number of observations in our dataset, we use the count command.

#Count Data
count(fooddata50)

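One thing to be aware of: both plyr and dplyr export a function called count, and the one that runs is the one from whichever library was loaded last (dplyr in this post). As far as I'm aware, plyr's version counts distinct rows when no variables are given, so the results can differ. The sketch below calls the dplyr version explicitly on a made-up data frame (the country names and prices are hypothetical) and also shows counting per group.

```r
library(dplyr)

# Hypothetical toy data; only the column names come from the dataset above
toy <- data.frame(
  adm0_name = c("Country A", "Country A", "Country B"),
  mp_price  = c(120, 300, 750),
  stringsAsFactors = FALSE
)

# With no grouping variable, the result is a single row holding the row count
total <- dplyr::count(toy)
total$n       # 3

# With a grouping variable, the result has one row per group
by_country <- dplyr::count(toy, adm0_name)
by_country$n  # 2 1
```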
 

7. The empty command (from plyr) is used to check for empty data frames; it returns TRUE when a data frame has no rows or no columns:

#Check for empty dataframe
empty(fooddata50)

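A typical use is checking whether a subset came back with no rows. Here is a minimal sketch with a made-up data frame (the prices and the 1000 threshold are hypothetical):

```r
library(plyr)

# Hypothetical prices, for illustration only
toy <- data.frame(mp_price = c(120, 300))

# No price exceeds 1000, so this subset has zero rows;
# drop = FALSE keeps the result a data frame even with a single column
no_rows <- toy[toy$mp_price > 1000, , drop = FALSE]

empty(toy)      # FALSE
empty(no_rows)  # TRUE
```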
 

8. ddply allows us to obtain summary statistics for our data. In this example, let us suppose that we wish to obtain the average food price for each country.

#Data Summary
meansummary<-ddply(fooddata, .(adm0_name), summarise, mean_mpprice = mean(mp_price))


We see that food prices have been summarised by country. If we wished, we could also get the sum by substituting sum(mp_price) for mean(mp_price), in a similar manner to the SUMIF function in Excel.
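The sketch below runs the summary on a tiny made-up data frame (the country names and prices are hypothetical) so the result can be checked by hand. It computes the mean and the sum in one ddply call, and then shows what I believe is the equivalent dplyr approach using group_by and summarise. Because both libraries export summarise, the calls are namespaced here to make it explicit which version is used.

```r
library(plyr)
library(dplyr)

# Hypothetical toy data; only the column names come from the dataset above
toy <- data.frame(
  adm0_name = c("Country A", "Country A", "Country B"),
  mp_price  = c(100, 300, 750),
  stringsAsFactors = FALSE
)

# plyr: split by country, summarise each piece, combine the results
plyr_summary <- ddply(
  toy, .(adm0_name), plyr::summarise,
  mean_mpprice = mean(mp_price),
  sum_mpprice  = sum(mp_price)
)
plyr_summary$mean_mpprice  # 200 750
plyr_summary$sum_mpprice   # 400 750

# dplyr: the same grouped summary written as a pipeline
dplyr_summary <- toy %>%
  group_by(adm0_name) %>%
  dplyr::summarise(mean_mpprice = mean(mp_price))
dplyr_summary$mean_mpprice  # 200 750
```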

Author: Michael Grogan

Michael Grogan is a machine learning consultant and educator, with a profound passion for statistics and data science.