Huber vs. Ridge Regressions: Accounting for Outliers

In a previous tutorial, we saw how we can use Huber and Bisquare weightings to adjust for outliers in a dataset.

These weightings allow us to adjust our regression analysis to give less weight to extreme values.

The previous analysis was illustrated in R, which you can find here.

Now, let’s see how we can construct and interpret such an analysis using Python. The purpose of this analysis is to determine the factors that influence internet usage (as measured in megabytes) across a population. The independent variables used are income, videohours, webpages, gender, and age.

Initial Regression Results

Firstly, we will generate a standard linear regression output using statsmodels. Let’s import our libraries, dataset, and run the analysis:

#Load libraries and set directory
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import os
path="/home/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)
os.getcwd()

#Import CSV and define variables
variables = pd.read_csv('internetoutliers.csv')
usage = variables['usage']
income = variables['income']
videohours = variables['videohours']
webpages = variables['webpages']
gender = variables['gender']
age = variables['age']

fig=sm.qqplot(usage)
plt.show()

#Generate OLS Results
x = np.column_stack((income,videohours,webpages,gender,age))
x = sm.add_constant(x, prepend=True)
results = sm.OLS(usage,x).fit()
print(results.summary())

Firstly, we can see that the QQ plot shows three clear outliers in the dataset, i.e. instances of much higher internet usage than in the rest of the sample:

[Figure: QQ plot of usage]

Here is what our regression analysis looks like:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  usage   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     986.8
Date:                Wed, 25 Jul 2018   Prob (F-statistic):               0.00
Time:                        18:16:43   Log-Likelihood:                -9406.5
No. Observations:                 966   AIC:                         1.883e+04
Df Residuals:                     960   BIC:                         1.885e+04
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -2498.8844    489.346     -5.107      0.000   -3459.195   -1538.574
x1             0.7874      0.047     16.903      0.000       0.696       0.879
x2          1172.0461     30.100     38.939      0.000    1112.977    1231.115
x3            54.1443      3.341     16.208      0.000      47.589      60.700
x4          -122.6890    265.000     -0.463      0.643    -642.736     397.358
x5            87.8120     11.157      7.871      0.000      65.917     109.707
==============================================================================
Omnibus:                      529.892   Durbin-Watson:                   1.881
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            12104.541
Skew:                           2.020   Prob(JB):                         0.00
Kurtosis:                      19.865   Cond. No.                     2.63e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.63e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

#T Critical Value
print(stats.t.ppf(1-0.025, 966))

1.96242277937671

Note that, although a large condition number is reported, running the vif command in R yielded VIF values below 2 for all of the independent variables.

Therefore, multicollinearity is assumed not to be an issue here.
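For reference, a similar check can be run directly in Python using statsmodels’ variance_inflation_factor. Here is a minimal sketch on the x matrix defined above, skipping the constant column that add_constant placed in the first position:

#Optional VIF check in Python (sketch)
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Column 0 of x is the constant, so we report VIFs for columns 1 onwards
vifs = [variance_inflation_factor(x, i) for i in range(1, x.shape[1])]
print(vifs)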

Huber vs. Ridge Regression

Now, we will run a Huber-weighted regression (which will give more weight to less extreme values) and a ridge regression (which will not adjust the values for this purpose).

Let us import HuberRegressor and Ridge from sklearn, and first fit the Ridge model and generate its predictions:

import sklearn
from sklearn import linear_model
from sklearn.linear_model import HuberRegressor, Ridge
#alpha=0.0 applies no regularization penalty, so this Ridge fit serves as an ordinary least squares baseline
#Note: the normalize argument is no longer available in recent scikit-learn releases; standardize x beforehand if using a newer version
rg = Ridge(fit_intercept=True, alpha=0.0, random_state=0, normalize=True)
ridgemodel = rg.fit(x,usage)
ridgepredictions = rg.predict(x)
print(ridgepredictions)

Running the above prints the predictions from the Ridge regression:

[ 4432.7396198   3354.00366108 21582.7015611  21075.32748243
  8682.40120058  4439.82174989  7823.91011603  4903.39412117
 26385.59314104 29162.63684572  8680.99576092 17933.56948042
...
 10480.88538249 13337.89162182 10625.80300619  4603.47200848
  6526.63369436 27038.20836683  8577.8635685  91089.50643952
 99523.28219672 62923.75856318]

Now, we can run the Huber regressions and determine how their predictions compare with Ridge vis-a-vis the actual usage statistics.

An important parameter of the HuberRegressor is epsilon, which controls how sensitive the regression is to outliers: the smaller its value, the more heavily extreme observations are down-weighted.
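To build intuition for epsilon, here is a small illustrative sketch (a hypothetical helper, not part of scikit-learn and not used in the analysis below) of the Huber loss: residuals smaller than epsilon are penalized quadratically, while larger residuals are penalized only linearly, so extreme observations pull the fit less. (scikit-learn’s HuberRegressor applies this threshold to residuals scaled by a robustly estimated scale parameter.)

#Illustrative sketch of the Huber loss (hypothetical helper, not used below)
def huber_loss(residual, epsilon=1.35):
    r = np.abs(residual)
    #Quadratic penalty for small residuals, linear penalty beyond epsilon
    return np.where(r <= epsilon, 0.5 * r**2, epsilon * r - 0.5 * epsilon**2)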

Therefore, five Huber regressions are run, with epsilon values ranging from 1.1 to 3.9.

#Huber 1
hb1 = linear_model.HuberRegressor(epsilon=1.1, max_iter=100, alpha=0.0001, warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel1 = hb1.fit(x,usage)
huberpredictions1 = hb1.predict(x)
print(huberpredictions1)
hb1.coef_

#Huber 2
hb2 = linear_model.HuberRegressor(epsilon=1.8, max_iter=1000, alpha=0.0001, warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel2 = hb2.fit(x,usage)
huberpredictions2 = hb2.predict(x)
print(huberpredictions2)
hb2.coef_

#Huber 3
hb3 = linear_model.HuberRegressor(epsilon=2.5, max_iter=1000, alpha=0.0001, warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel3 = hb3.fit(x,usage)
huberpredictions3 = hb3.predict(x)
print(huberpredictions3)
hb3.coef_

#Huber 4
hb4 = linear_model.HuberRegressor(epsilon=3.2, max_iter=1000, alpha=0.0001, warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel4 = hb4.fit(x,usage)
huberpredictions4 = hb4.predict(x)
print(huberpredictions4)
hb4.coef_

#Huber 5
hb5 = linear_model.HuberRegressor(epsilon=3.9, max_iter=1000, alpha=0.0001, warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel5 = hb5.fit(x,usage)
huberpredictions5 = hb5.predict(x)
print(huberpredictions5)
hb5.coef_
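As an aside, the five fits above can be written more compactly as a loop over the epsilon values. A minimal sketch, using max_iter=1000 throughout:

#Equivalent, more compact form of the five Huber fits above
epsilons = [1.1, 1.8, 2.5, 3.2, 3.9]
hubermodels = []
huberpredictionslist = []
for eps in epsilons:
    hb = HuberRegressor(epsilon=eps, max_iter=1000, alpha=0.0001, warm_start=False, fit_intercept=True, tol=1e-05)
    hubermodels.append(hb.fit(x, usage))
    huberpredictionslist.append(hb.predict(x))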

Now, we are going to measure how much each model’s predictions deviate from actual usage, i.e. (prediction - actual)/actual, and compare these relative deviations at different percentiles.

Let’s firstly try the 50th percentile.

#Accuracy at the 50th percentile
trialridge=(ridgepredictions-usage)/usage
qr=np.quantile(trialridge, 0.5)
qr

trial1=(huberpredictions1-usage)/usage
q1=np.quantile(trial1, 0.5)
q1

trial2=(huberpredictions2-usage)/usage
q2=np.quantile(trial2, 0.5)
q2

trial3=(huberpredictions3-usage)/usage
q3=np.quantile(trial3, 0.5)
q3

trial4=(huberpredictions4-usage)/usage
q4=np.quantile(trial4, 0.5)
q4

trial5=(huberpredictions5-usage)/usage
q5=np.quantile(trial5, 0.5)
q5

deviation=np.array([q1,q2,q3,q4,q5])
#Plot the deviations against the epsilon values used above
t = np.array([1.1, 1.8, 2.5, 3.2, 3.9])
plt.plot(t,deviation)
plt.xlabel('Epsilon')
plt.ylabel('Deviation')
plt.title('Deviation at the 50th percentile')
plt.show()

We see that at the 50th percentile, deviation is quite low at the epsilon value of 1.1, and increases significantly as we increase the value of epsilon.

[Figure: deviation vs. epsilon at the 50th percentile]

Let’s now try this at the 30th and 70th percentile.
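The only change required is the quantile argument passed to np.quantile; for example, for the first Huber model:

#Deviation at the 30th and 70th percentiles (shown for the first Huber model)
q1_30=np.quantile(trial1, 0.3)
q1_70=np.quantile(trial1, 0.7)
print(q1_30, q1_70)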

30th percentile

[Figure: deviation vs. epsilon at the 30th percentile]

70th percentile

[Figure: deviation vs. epsilon at the 70th percentile]

Given that the Huber regression places less weight on the extreme values, we see that for an epsilon value of 1.1 there is more deviation at the 30th percentile than at the 70th percentile.

This indicates that Huber regressions are not necessarily a “one-and-done” method. Since the outliers in this dataset are heavily biased to the upside, it makes sense that down-weighting them results in larger deviations for the lower-valued observations.

If a dataset contains extreme values both far above and far below the bulk of the sample, Huber weights may work better under those circumstances.

Conclusion

In this tutorial, you have learned:

  • What a Huber regression is
  • How to run such a regression in Python
  • The shortcomings of this type of analysis

Many thanks for viewing the tutorial, and if you have any questions feel free to leave them in the comments below.
