The purpose of the plyr and dplyr libraries in R is to manipulate data with ease.
As we’ve seen in a previous post, there are various methods of wrangling and summarising data in R. However, wouldn’t it be great if we had some libraries that can greatly simplify this process for us?
plyr and dplyr help us do just that. Both libraries offer an extensive set of commands; the purpose of this post is to introduce the handful that you'll find yourself using most of the time. We will look at the following commands:
- arrange: Arrange data in ascending or descending order
- filter: Filter data based on certain criteria
- slice: Select a portion of a dataset
- mutate: Add variables to a data frame
- count: Count the number of observations in a data frame
- empty: Check for an empty data frame
- ddply: Summarize data
In this example, we will use a food price dataset available from the Humanitarian Data Exchange. First, let’s import the dataset into R:
#IMPORTING CSV FILES AND CONSTRUCTING DATA FRAMES
setwd("directory")
fooddata<-read.csv("filename.csv",encoding="UTF-8")
attach(fooddata)
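To try the import step without the real file, here is a minimal sketch that writes a tiny CSV to a temporary location and reads it back. The two rows and the names csv_path and toy are made up for illustration; only the column names adm0_name and mp_price mirror the dataset used in this post.

```r
# Hypothetical two-row CSV written to a temporary file (not the real dataset)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("adm0_name,mp_price", "Alpha,120.5", "Beta,640"), csv_path)

# Same read.csv call as above, pointed at the temporary file
toy <- read.csv(csv_path, encoding = "UTF-8", stringsAsFactors = FALSE)

# Inspect the column types that were inferred
str(toy)

# Columns can also be referenced explicitly, without attach()
mean(toy$mp_price)
```

Referring to columns as toy$mp_price avoids relying on attach(), which can be handy when several data frames share column names.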
Next, we load the plyr and dplyr libraries:
#PLYR AND DPLYR
library(plyr)
library(dplyr)
1. In the first example, let's see how we can arrange our data in both ascending and descending order. In this case, we wish to arrange fooddata by mp_price first and then by mp_commoditysource, in that precise order.
#Order Data
fooddataascending<-arrange(fooddata, mp_price, mp_commoditysource)
fooddatadescending<-arrange(fooddata, desc(mp_price), desc(mp_commoditysource))
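To see how the two sort keys interact, here is a small sketch on a made-up data frame (toy and its values are hypothetical; only the column names come from the dataset). The second key only matters where the first one ties.

```r
library(dplyr)

# Hypothetical toy data standing in for the food price dataset
toy <- data.frame(
  mp_commoditysource = c("B", "A", "C", "A"),
  mp_price = c(300, 300, 100, 700)
)

# Ascending: sort by price first, then break ties on the source column
toy_asc <- arrange(toy, mp_price, mp_commoditysource)

# Descending on both keys
toy_desc <- arrange(toy, desc(mp_price), desc(mp_commoditysource))

print(toy_asc)
print(toy_desc)
```

The two rows priced at 300 are the only ones whose order depends on mp_commoditysource.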
2. Now, let us suppose that we only wish to extract prices greater than 500.
#Filter Data
filterfooddata<-data.frame(filter(fooddataascending, mp_price > 500))
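As a quick illustration of how the condition behaves at the boundary, here is a sketch on a hypothetical data frame (toy and its prices are invented). It also shows that several conditions can be passed at once.

```r
library(dplyr)

# Hypothetical prices, including one exactly at the threshold
toy <- data.frame(mp_price = c(120, 500, 501, 950))

# Strictly greater than 500, so the row priced at exactly 500 is dropped
expensive <- filter(toy, mp_price > 500)

# Conditions separated by commas are combined with AND
mid_range <- filter(toy, mp_price > 500, mp_price < 900)

print(expensive)
print(mid_range)
```

filter already returns a data frame when given one, so wrapping the result in data.frame() is optional.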
3. To get the first 50 observations:
#First 50 observations
fooddata50= fooddata %>% slice(1:50)
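slice selects rows by position, so it accepts any set of row indices, not just a leading range. A minimal sketch on a hypothetical ten-row data frame (toy and its id column are invented for illustration):

```r
library(dplyr)

# Hypothetical ten-row data frame
toy <- data.frame(id = 1:10, mp_price = seq(100, 1000, by = 100))

# Keep the first three rows
first_three <- toy %>% slice(1:3)

# Negative indices drop rows instead of keeping them
without_first_three <- toy %>% slice(-(1:3))

print(first_three)
print(without_first_three)
```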
4. The head and tail functions are not part of plyr or dplyr (they ship with base R), but I thought it appropriate to include them here as they are quite a useful complement. Called with a second argument of 10, head selects the first 10 observations from the dataset, while tail selects the last 10.
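A minimal sketch of the two calls, assuming a hypothetical 25-row data frame named toy in place of the food price data:

```r
# Hypothetical 25-row data frame; head and tail need no extra libraries
toy <- data.frame(id = 1:25)

# First 10 observations
first_10 <- head(toy, 10)

# Last 10 observations
last_10 <- tail(toy, 10)

print(first_10)
print(last_10)
```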
5. To add variables to our dataset, we use the mutate function. In this case, we are creating a new variable (log of mp_price), and adding it to our dataset.
#Add Variables
add_variables<-mutate(fooddata50,logprice=log(mp_price))
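Note that log in R is the natural logarithm. The sketch below, on a hypothetical data frame (toy and log10price are names introduced here for illustration), adds both a natural-log and a base-10 column in one mutate call:

```r
library(dplyr)

# Hypothetical prices chosen so the base-10 logs are easy to read
toy <- data.frame(mp_price = c(1, 10, 100))

# Several new variables can be created in a single call
toy_logged <- mutate(
  toy,
  logprice = log(mp_price),      # natural logarithm
  log10price = log10(mp_price)   # base-10 logarithm
)

print(toy_logged)
```

The original mp_price column is kept, and the new columns are appended to the right.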
6. To count the number of observations in our dataset, we use the count command.
#Count Data
count(fooddata50)
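Both plyr and dplyr export a function called count, and whichever library was loaded last is the one a bare count() refers to. The sketch below sidesteps the ambiguity by calling dplyr::count explicitly; the data frame toy and the country labels are hypothetical.

```r
library(dplyr)

# Hypothetical data with a grouping column
toy <- data.frame(
  adm0_name = c("Alpha", "Alpha", "Beta"),
  mp_price = c(1, 2, 3)
)

# Total number of observations: one row with a column named n
total <- dplyr::count(toy)

# Number of observations per country
per_country <- dplyr::count(toy, adm0_name)

print(total)
print(per_country)
```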
7. The empty command is used to check for empty dataframes:
#Check for empty dataframe
empty(fooddata50)
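A typical use is checking whether a subsetting step left anything behind. A minimal sketch, assuming a hypothetical data frame toy and an arbitrary threshold of 1000:

```r
library(plyr)

# Hypothetical data frame with two rows
toy <- data.frame(mp_price = c(100, 600))

# A populated data frame is not empty
empty(toy)       # FALSE

# No price exceeds 1000, so this subset has zero rows
no_rows <- toy[toy$mp_price > 1000, , drop = FALSE]
empty(no_rows)   # TRUE
```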
8. ddply allows us to obtain summary statistics for our data. In this example, let us suppose that we wish to obtain the average food price for each country.
#Data Summary
meansummary<-ddply(fooddata, .(adm0_name), summarise, mean_mpprice = mean(mp_price))
The result contains one row per country along with its average food price. If we wished, we could also get the sum by substituting sum(mp_price) for mean(mp_price), in a similar manner to the SUMIF function in Excel.
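To make the mean and sum variants concrete, here is a sketch on a hypothetical three-row data frame (toy, the country labels, and sum_mpprice are invented for illustration):

```r
library(plyr)

# Hypothetical data: two observations for Alpha, one for Beta
toy <- data.frame(
  adm0_name = c("Alpha", "Alpha", "Beta"),
  mp_price = c(100, 300, 50)
)

# Average price per country
mean_by_country <- ddply(toy, .(adm0_name), summarise,
                         mean_mpprice = mean(mp_price))

# Total price per country, the SUMIF-style variant
sum_by_country <- ddply(toy, .(adm0_name), summarise,
                        sum_mpprice = sum(mp_price))

print(mean_by_country)
print(sum_by_country)
```

Each result has one row per distinct value of adm0_name.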