In a previous tutorial, we saw how we can use Huber and Bisquare weightings to adjust for outliers in a dataset.

These weightings allow us to adjust our regression analysis to give less weight to extreme values.
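To recap what such a weighting does, here is a minimal illustrative sketch (not taken from the R tutorial) of the Huber and Tukey bisquare weight functions, using the conventional tuning constants 1.345 and 4.685. Small residuals keep full weight; beyond the tuning constant, Huber weights shrink towards zero while bisquare weights reach exactly zero:

```python
import numpy as np

def huber_weight(r, k=1.345):
    """Huber weight: 1 for small residuals, k/|r| beyond the tuning constant k."""
    r = np.abs(r)
    return np.where(r <= k, 1.0, k / r)

def bisquare_weight(r, k=4.685):
    """Tukey bisquare weight: smoothly shrinks to exactly 0 beyond k."""
    r = np.abs(r)
    return np.where(r <= k, (1 - (r / k) ** 2) ** 2, 0.0)

residuals = np.array([0.5, 1.0, 2.0, 5.0, 10.0])
print(huber_weight(residuals))    # extreme residuals get weights below 1
print(bisquare_weight(residuals)) # the most extreme residuals get weight 0
```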

The previous analysis was carried out in R, and you can find it here.

Now, let’s see how we can construct and interpret such an analysis using Python. The purpose of this analysis is to determine the factors that influence internet usage (as measured in megabytes) across a population. The independent variables used are *income*, *videohours*, *web pages*, *gender*, and *age*.

## Initial Regression Results

Firstly, we will generate a standard linear regression output using statsmodels. Let’s import our libraries, dataset, and run the analysis:

```python
# Load libraries and set directory
import numpy as np
import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import os

path = "/home/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)
os.getcwd()

# Import CSV and define variables
variables = pd.read_csv('internetoutliers.csv')
usage = variables['usage']
income = variables['income']
videohours = variables['videohours']
webpages = variables['webpages']
gender = variables['gender']
age = variables['age']

# QQ plot of the dependent variable
fig = sm.qqplot(usage)
plt.show()

# Generate OLS results
x = np.column_stack((income, videohours, webpages, gender, age))
x = sm.add_constant(x, prepend=True)
results = sm.OLS(usage, x).fit()
print(results.summary())
```

From the QQ plot, we can see three clear outliers in the dataset, i.e. instances of much higher internet usage than in the general population:

Here is what our regression analysis looks like:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  usage   R-squared:                       0.837
Model:                            OLS   Adj. R-squared:                  0.836
Method:                 Least Squares   F-statistic:                     986.8
Date:                Wed, 25 Jul 2018   Prob (F-statistic):               0.00
Time:                        18:16:43   Log-Likelihood:                -9406.5
No. Observations:                 966   AIC:                         1.883e+04
Df Residuals:                     960   BIC:                         1.885e+04
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const      -2498.8844    489.346     -5.107      0.000   -3459.195   -1538.574
x1             0.7874      0.047     16.903      0.000       0.696       0.879
x2          1172.0461     30.100     38.939      0.000    1112.977    1231.115
x3            54.1443      3.341     16.208      0.000      47.589      60.700
x4          -122.6890    265.000     -0.463      0.643    -642.736     397.358
x5            87.8120     11.157      7.871      0.000      65.917     109.707
==============================================================================
Omnibus:                      529.892   Durbin-Watson:                   1.881
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            12104.541
Skew:                           2.020   Prob(JB):                         0.00
Kurtosis:                      19.865   Cond. No.                     2.63e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 2.63e+04. This might indicate that there
are strong multicollinearity or other numerical problems.
```

The t critical value at the 5% significance level is:

```
>>> # T critical value
>>> print(stats.t.ppf(1 - 0.025, 966))
1.96242277937671
```

Note that while a large condition number is indicated, the VIF test yielded results below 2 across the independent variables when the “vif” command was run in R.

Therefore, it is assumed that multicollinearity is not present here.
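The VIF check can also be done directly in Python via statsmodels rather than R. Here is a minimal sketch on illustrative data; the random design matrix below merely stands in for the tutorial's actual `x` matrix, which is not reproduced in this snippet:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative design matrix standing in for the tutorial's x:
# a constant in column 0, five regressors after it
rng = np.random.default_rng(0)
x = np.column_stack([np.ones(200), rng.normal(size=(200, 5))])

# VIF for each regressor (skipping the constant in column 0);
# values near 1 indicate little multicollinearity
vifs = [variance_inflation_factor(x, i) for i in range(1, x.shape[1])]
print(vifs)
```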

## Huber vs. Ridge Regression

Now, we will run a Huber-weighted regression (which gives less weight to the extreme values) and a ridge regression (which makes no such adjustment; with alpha set to 0, the ridge penalty vanishes and the fit is effectively ordinary least squares).

Let us import HuberRegressor and Ridge from sklearn, and generate the predictions:

```python
import sklearn
from sklearn import linear_model
from sklearn.linear_model import HuberRegressor, Ridge

# alpha=0.0 means no penalty, so this is effectively an OLS fit.
# Note: the normalize argument was removed in scikit-learn 1.2;
# on recent versions, standardise the features beforehand instead.
rg = Ridge(fit_intercept=True, alpha=0.0, random_state=0, normalize=True)
ridgemodel = rg.fit(x, usage)
ridgepredictions = rg.predict(x)
print(ridgepredictions)
```

Once we have done this, the Ridge regression yields the following predictions:

```
[ 4432.7396198   3354.00366108 21582.7015611  21075.32748243  8682.40120058
  4439.82174989  7823.91011603  4903.39412117 26385.59314104 29162.63684572
  8680.99576092 17933.56948042
 ...
 10480.88538249 13337.89162182 10625.80300619  4603.47200848  6526.63369436
 27038.20836683  8577.8635685  91089.50643952 99523.28219672 62923.75856318]
```

Now, we can generate the Huber regressions and determine how our predictions compare with Ridge vis-a-vis the actual usage statistics.

An important value that we will factor into the HuberRegressor is **epsilon**, which controls how sensitive the regression is to outliers: residuals smaller than epsilon are penalised quadratically, while larger ones are penalised only linearly.
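To make epsilon more concrete, here is a small illustrative sketch (not part of the tutorial's code) of the Huber loss itself. Residuals within ±epsilon are penalised quadratically, larger residuals only linearly, so smaller epsilon values downweight extreme observations more aggressively:

```python
import numpy as np

def huber_loss(r, epsilon):
    """Huber loss: quadratic inside +/- epsilon, linear beyond it."""
    r = np.abs(r)
    return np.where(r <= epsilon, 0.5 * r ** 2, epsilon * r - 0.5 * epsilon ** 2)

# A residual of 5 is penalised far less at epsilon=1.1 than at epsilon=3.9
print(huber_loss(np.array([5.0]), 1.1))  # linear regime: 1.1*5 - 0.5*1.1**2 = 4.895
print(huber_loss(np.array([5.0]), 3.9))  # linear regime: 3.9*5 - 0.5*3.9**2 = 11.895
```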

Therefore, five Huber regressions are run, with epsilon values ranging from 1.1 to 3.9.

```python
# Huber 1
hb1 = HuberRegressor(epsilon=1.1, max_iter=100, alpha=0.0001,
                     warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel1 = hb1.fit(x, usage)
huberpredictions1 = hb1.predict(x)
print(huberpredictions1)
hb1.coef_

# Huber 2
hb2 = HuberRegressor(epsilon=1.8, max_iter=1000, alpha=0.0001,
                     warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel2 = hb2.fit(x, usage)
huberpredictions2 = hb2.predict(x)
print(huberpredictions2)
hb2.coef_

# Huber 3
hb3 = HuberRegressor(epsilon=2.5, max_iter=1000, alpha=0.0001,
                     warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel3 = hb3.fit(x, usage)
huberpredictions3 = hb3.predict(x)
print(huberpredictions3)
hb3.coef_

# Huber 4
hb4 = HuberRegressor(epsilon=3.2, max_iter=1000, alpha=0.0001,
                     warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel4 = hb4.fit(x, usage)
huberpredictions4 = hb4.predict(x)
print(huberpredictions4)
hb4.coef_

# Huber 5
hb5 = HuberRegressor(epsilon=3.9, max_iter=1000, alpha=0.0001,
                     warm_start=False, fit_intercept=True, tol=1e-05)
hubermodel5 = hb5.fit(x, usage)
huberpredictions5 = hb5.predict(x)
print(huberpredictions5)
hb5.coef_
```

Now, we are going to compute each model's relative deviation from actual usage, and examine that deviation at different percentiles of its distribution.

Let’s firstly try the 50th percentile.

```python
# Accuracy at the 50th percentile: median relative deviation
# of each set of predictions from actual usage
trialridge = (ridgepredictions - usage) / usage
qr = np.quantile(trialridge, 0.5)

trial1 = (huberpredictions1 - usage) / usage
q1 = np.quantile(trial1, 0.5)
trial2 = (huberpredictions2 - usage) / usage
q2 = np.quantile(trial2, 0.5)
trial3 = (huberpredictions3 - usage) / usage
q3 = np.quantile(trial3, 0.5)
trial4 = (huberpredictions4 - usage) / usage
q4 = np.quantile(trial4, 0.5)
trial5 = (huberpredictions5 - usage) / usage
q5 = np.quantile(trial5, 0.5)

# Plot the Huber deviations against their actual epsilon values
deviation = np.array([q1, q2, q3, q4, q5])
epsilons = np.array([1.1, 1.8, 2.5, 3.2, 3.9])
plt.plot(epsilons, deviation)
plt.xlabel('Epsilon')
plt.ylabel('Deviation')
plt.title('Deviation')
plt.show()
```

We see that at the 50th percentile, deviation is quite low at the epsilon value of 1.1, and increases significantly as we raise epsilon.

Let’s now try this at the 30th and 70th percentile.

*Deviation at the 30th percentile*

*Deviation at the 70th percentile*

Given that the Huber regression is placing less weight on the extreme values, we see that for an epsilon value of 1.1, we have more deviation at the 30th percentile than we do at the 70th percentile.

This indicates that Huber regressions are not necessarily a “one-and-done” method. Since this dataset’s outliers are heavily biased to the upside, it makes sense that this regression results in higher deviations for lower estimated values.

Huber weights may work better in circumstances where the extreme values lie both well above and well below the majority of the sample.

## Conclusion

In this tutorial, you have learned:

- What a Huber regression is
- How to generate such a regression in Python
- The shortcomings of this type of analysis

Many thanks for viewing the tutorial, and if you have any questions feel free to leave them in the comments below.
