Non-normal data is a common occurrence when conducting data analysis. When we work with a particular dataset – especially when it comes to using tools such as Ordinary Least Squares – we are making an assumption that the distribution of the data is normal. A normal distribution is a bell shaped curve where the mean = median = mode, and approximately 50% of observations are to the left of the curve while the other 50% are to the right. A normal distribution is similarly to that of the plot below:
In this particular example, we wish to run an OLS regression of income (dependent variable) and population (independent variable). An OLS regression assumes that the relevant data is normally distributed, but we cannot know this right off the bat. In this regard, we will test our data for normality, and in the case of non-normality transform our data using a Box-Cox transformation.
Firstly, let’s set the directory and attach data as standard:
setwd(directory) mydata <- read.csv(directory) attach(mydata)
1. Quantile-Quantile (QQ) Plot: Inspect for non-normal distribution
When we run a qq (quantile-quantile) plot, we see that our data is not normally distributed in that it does not follow the normal regression line.
2. Shapiro-Wilk Test for Normality
Moreover, when we test for a normal distribution using the Shapiro test, we see that our p-value is less than 0.05, meaning that we reject the null hypothesis of normality at the 5% significance level.
3. Run OLS and test for heteroscedasticity
When we run a standard OLS regression on Income vs. Population, we see that our regression tests negative for heteroscedasticity, i.e. the p-value of our Breusch-Pagan test is above 0.05, meaning we cannot reject the null hypothesis of homoscedasticity.
reg1<-lm(Income~Population) summary(reg1) bptest(reg1)
4. Box-Cox Transformation for Non-Normal Data
To transform our data, we use the Box-Cox transformation as below. We wish to obtain a lambda value - that is the value at which the data is transformed into a normal shape.
outliers = boxplot(mydata$Income, plot=FALSE)$out outliers mydata[mydata$Income %in% outliers,] b <- boxplot(mydata$Income) b boxcoxreg1<-boxCox(reg1) summary(boxcoxreg1) library(MASS) bc<-boxcox(reg1,plotit=T) which.max(bc$y) (lambda<-bc$x[which.max(bc$y)])
With our lambda value, we now transform our data.
Our output is as follows, and note that we have also yielded a negative test for heteroscedasticity using the Breusch-Pagan test: