Non-normal data is a common occurrence in data analysis. Many statistical techniques implicitly assume that the data follow a normal distribution, because forecasting and inference become significantly easier under that assumption.
What Is A Normal Distribution And Why Do We Need One?
A normal distribution is a bell-shaped curve in which the mean, median, and mode are equal, and 50% of observations lie below the mean while the other 50% lie above it.
The concept of a normal distribution is very important in statistics.
When data is normally distributed, it becomes easier to make predictions about the data, or inferences about relationships between its variables. Without this assumption, it becomes a lot harder (if not impossible) to make accurate forecasts.
A normal distribution shows a bell-shaped curve similar to the plot below:
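A bell curve like the one in the plot can be reproduced with a short sketch (illustrative only; this uses the theoretical density rather than the article's data):

```r
# Sketch: plot the standard normal density to illustrate the bell shape
x <- seq(-4, 4, length.out = 200)
plot(x, dnorm(x), type = "l",
     main = "Normal Distribution", xlab = "x", ylab = "Density")
abline(v = 0, lty = 2)  # mean = median = mode, all at 0 for the standard normal
```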
In this particular example, we wish to run an OLS regression of income (the dependent variable) on population (the independent variable).
Strictly speaking, OLS does not require the data themselves to be normally distributed; rather, classical inference from OLS assumes normally distributed error terms. While OLS can still be a useful predictor with non-normal data, normality is preferable for this reason. In this regard, we will test our data for normality and, in the case of non-normality, transform our data using a Box-Cox transformation.
We’ll work through this example using R.
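Before the steps below, the workspace needs the packages used later and a data frame containing Income and Population. The filename here is a placeholder, not from the article; substitute your own dataset:

```r
# Setup (assumed): "mydata.csv" is a hypothetical file with Income and
# Population columns, matching the variables used throughout this example.
library(MASS)    # provides boxcox()
library(lmtest)  # provides bptest()
mydata <- read.csv("mydata.csv")
attach(mydata)   # make Income and Population accessible by name
```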
Quantile-Quantile (QQ) Plot: Inspect for non-normal distribution
> boxplot(Income)
> qqnorm(Income); qqline(Income)
We see that our boxplot shows a significant number of values beyond the whiskers, implying that there are a large number of outliers in our data:
Moreover, this implies that the resulting distribution will not be normal. This is confirmed by our QQ plot, in that a significant number of observations deviate from the reference line drawn by qqline:
Shapiro-Wilk Test for Normality
When we test for a normal distribution using the Shapiro test, we see that our p-value is less than 0.05, meaning that we reject the null hypothesis of normality at the 5% significance level.
> shapiro.test(Income)

	Shapiro-Wilk normality test

data:  Income
W = 0.91679, p-value = 7.867e-12
Run OLS and test for heteroscedasticity
When we run a standard OLS regression on Income vs. Population, we see that our regression tests negative for heteroscedasticity, i.e. the p-value of our Breusch-Pagan test is above 0.05, meaning we cannot reject the null hypothesis of homoscedasticity.
> reg1 <- lm(Income ~ Population)
> summary(reg1)

Call:
lm(formula = Income ~ Population)

Residuals:
    Min      1Q  Median      3Q     Max 
-6338.4  -892.5  -233.7   634.6  7744.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.056e+03  1.460e+02   14.08   <2e-16 ***
Population  5.147e-03  1.983e-04   25.96   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1446 on 297 degrees of freedom
Multiple R-squared:  0.6941,	Adjusted R-squared:  0.6931 
F-statistic:   674 on 1 and 297 DF,  p-value: < 2.2e-16

> bptest(reg1)

	studentized Breusch-Pagan test

data:  reg1
BP = 0.85829, df = 1, p-value = 0.3542
Box-Cox Transformation for Non-Normal Data
To transform our data, we use the Box-Cox transformation as below. We wish to obtain a lambda value, i.e. the power to which the data is raised; the lambda chosen is the one that maximises the log-likelihood, under which the transformed data most closely approximates a normal distribution.
> library(MASS)
> boxcoxreg1 <- boxcox(reg1)
> summary(boxcoxreg1)
  Length Class  Mode   
x 100    -none- numeric
y 100    -none- numeric
> bc <- boxcox(reg1, plotit = TRUE)
> title("Lambda and Log-Likelihood")
> which.max(bc$y)
[1] 62
> (lambda <- bc$x[which.max(bc$y)])
[1] 0.4646465
We can also see a graphical illustration of the obtained lambda value and corresponding log-likelihood:
With our lambda value (i.e. the value at which our log-likelihood is maximised), we now transform our data.
> bc_transformed <- lm(I(Income^lambda) ~ Population, data = mydata)
> summary(bc_transformed)

Call:
lm(formula = I(Income^lambda) ~ Population, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.198  -4.293  -0.835   3.458  31.618 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.636e+01  7.178e-01   50.66   <2e-16 ***
Population  2.484e-05  9.745e-07   25.48   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.108 on 297 degrees of freedom
Multiple R-squared:  0.6862,	Adjusted R-squared:  0.6851 
F-statistic: 649.4 on 1 and 297 DF,  p-value: < 2.2e-16

> plot(bc_transformed, which = 1)
> bptest(bc_transformed)

	studentized Breusch-Pagan test

data:  bc_transformed
BP = 0.073392, df = 1, p-value = 0.7865
The above is our new output using the data that has been transformed to approximate a normal distribution. Note that the Breusch-Pagan test again fails to reject homoscedasticity.
When we plot residuals vs. fitted values for the transformed model, we see that the residuals are scattered much more evenly around the zero line, consistent with the improvement in normality we were seeking when the QQ plot first flagged the problem:
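One practical point: because the model was fit on Income^lambda, its predictions are on the transformed scale. A sketch of back-transforming a prediction (the Population value here is hypothetical, and raising to 1/lambda is valid since our lambda is positive):

```r
# Predictions from bc_transformed are on the Income^lambda scale,
# so we undo the power transformation by raising to 1/lambda.
lambda <- 0.4646465  # value obtained from boxcox() above
pred_transformed <- predict(bc_transformed,
                            newdata = data.frame(Population = 500000))
pred_income <- pred_transformed^(1 / lambda)  # back on the original Income scale
```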
To summarise, normally distributed data allows us to be confident that the results we obtain from statistical tests can be generalised to a population. In this regard, it gives us more confidence that the results we obtain are actually of predictive value!
So there you have it – how to transform non-normal data in R! You might also find the guides below helpful in gaining a further understanding of this topic.