Non-Normal Data: Shapiro Test and Box-Cox Transformation

Non-normal data is a common occurrence when conducting data analysis. When we work with a particular dataset, we are making an assumption that the distribution of the data is normal. This is because data becomes significantly easier to forecast when it is assumed that a normal distribution is followed.

What Is A Normal Distribution And Why Do We Need One?

A normal distribution is a bell shaped curve where the mean = median = mode, and approximately 50% of observations are to the left of the curve while the other 50% are to the right.

The concept of a normal distribution is very important in statistics.

When data is normally distributed, it becomes easier to make inferences about predictions regarding the data, or relationships between the different variables in the data. Without this assumption, it becomes a lot harder (if not impossible) to make accurate forecasts.

A normal distribution shows a similar bell curved shape to that of the plot below:

normal distribution

In this particular example, we wish to run an OLS regression of income (dependent variable) and population (independent variable).

While a normal distribution is not, strictly speaking, required for OLS, the technique does assume normality in the error terms. While OLS can still be a useful predictor with non-normal data, having a normal distribution is preferable for this reason. In this regard, we will test our data for normality, and in the case of non-normality transform our data using a Box-Cox transformation.

We’ll work through this example using R.

Quantile-Quantile (QQ) Plot: Inspect for non-normal distribution

> boxplot(Income)
> qqnorm(Income);qqline(Income)

We see that our boxplot shows a significant level of values outside of the box, implying that there are a large number of outliers in our data:


Moreover, this implies that the resulting distribution will not be normal. This is confirmed from our QQ plot, in that a significant number of observations deviate from the regression line:

qq non-normal

Shapiro-Wilk Test for Normality

When we test for a normal distribution using the Shapiro test, we see that our p-value is less than 0.05, meaning that we reject the null hypothesis of normality at the 5% significance level.

> shapiro.test(Income)

        Shapiro-Wilk normality test

data:  Income
W = 0.91679, p-value = 7.867e-12

Run OLS and test for heteroscedasticity

When we run a standard OLS regression on Income vs. Population, we see that our regression tests negative for heteroscedasticity, i.e. the p-value of our Breusch-Pagan test is above 0.05, meaning we cannot reject the null hypothesis of homoscedasticity.

> reg1<-lm(Income~Population)
> summary(reg1)

lm(formula = Income ~ Population)

    Min      1Q  Median      3Q     Max 
-6338.4  -892.5  -233.7   634.6  7744.3 

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.056e+03  1.460e+02   14.08   <2e-16 ***
Population  5.147e-03  1.983e-04   25.96   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1446 on 297 degrees of freedom
Multiple R-squared:  0.6941,    Adjusted R-squared:  0.6931 
F-statistic:   674 on 1 and 297 DF,  p-value: < 2.2e-16

> bptest(reg1)

        studentized Breusch-Pagan test

data:  reg1
BP = 0.85829, df = 1, p-value = 0.3542

Box-Cox Transformation for Non-Normal Data

To transform our data, we use the Box-Cox transformation as below. We wish to obtain a lambda value – that is the value at which the data is transformed into a normal shape.

> library(MASS)
> boxcoxreg1<-boxcox(reg1)
> summary(boxcoxreg1)

  Length Class  Mode   
x 100    -none- numeric
y 100    -none- numeric

> bc<-boxcox(reg1,plotit=T)
> title("Lambda and Log-Likelihood")
> which.max(bc$y)

[1] 62

> (lambda←bc$x[which.max(bc$y)])

[1] 0.4646465

We can also see a graphical illustration of the obtained lambda value and corresponding log-likelihood:

lambda and log-likelihood

With our lambda value (i.e. the value at which our log-likelihood is maximised), we now transform our data.

> bc_transformed<-lm(I(Income^lambda)~Population,data=mydata)
> summary(bc_transformed)

lm(formula = I(Income^lambda) ~ Population, data = mydata)

    Min      1Q  Median      3Q     Max 
-31.198  -4.293  -0.835   3.458  31.618 

             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.636e+01  7.178e-01   50.66   <2e-16 ***
Population  2.484e-05  9.745e-07   25.48   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.108 on 297 degrees of freedom
Multiple R-squared:  0.6862,    Adjusted R-squared:  0.6851 
F-statistic: 649.4 on 1 and 297 DF,  p-value: < 2.2e-16

> plot(bc_transformed,which=1)
> bptest(bc_transformed)

        studentized Breusch-Pagan test

data:  bc_transformed
BP = 0.073392, df = 1, p-value = 0.7865

The above is our new output using the data that has been transformed to approximate a normal distribution. Note that we have also yielded a negative test for heteroscedasticity using the Breusch-Pagan test.

When we plot the residuals vs. fitted graph, we see that it follows a much straighter line than that of what we originally observed in our QQ Plot:

To summarise, normally distributed data allows us to be confident that the results we obtain from statistical tests can be generalised to a population. In this regard, it gives us more confidence that the results we obtain are actually of predictive value!

So there you have it – how to transform non-normal data in R! You might also find the guides below helpful in gaining a further understanding of this topic.

Further Reading