statsmodels and sklearn: Linear and Logistic Regression Modelling in Python

The statsmodels and sklearn libraries are both frequently used to generate regression output. Depending on the extent of the analysis, however, a user will often need to work with more than one of these libraries.

Generally speaking, statsmodels is useful for generating all of the regression output in a single summary table. sklearn, on the other hand, is useful when more flexibility is needed with the output, e.g. extracting individual coefficients, running an auxiliary regression, etc. In the example below, we will go through how these libraries differ when it comes to generating regression output.

Before we get into how the statsmodels and sklearn libraries work, let’s run a very simple linear regression using the numpy and scipy libraries to get a better understanding of regression analysis.

Linear Regression using numpy and scipy

from scipy import stats
import numpy as np

x = [-0.018, -0.008, 0.011, 0.017, -0.008, -0.002]
y = [-0.006, -0.001, 0.015, 0.017, -0.0019, -0.005]

# Fit a simple linear regression of y on x
gradient, intercept, r_value, p_value, std_err = stats.linregress(x, y)

print("Gradient and intercept", gradient, intercept)
Gradient and intercept 0.724084177708 0.00398211223694
 
print("R-squared", r_value**2)
R-squared 0.854175903355
 
print("p-value", p_value)
p-value 0.00839711859611

Let’s interpret the above output.

  • Intercept = 0.00398. This is the value our y (dependent) variable would take if x had a value of 0.
  • Gradient = 0.72408. For every unit increase in x, y increases by 0.72408 (see the worked check after this list).
  • R-Squared = 0.85417. 85.41% of the variation in our y variable is “explained” by the x variable in this model.
  • p-value = 0.0083. The p-value is below 0.05, denoting statistical significance of the regression at the 5% level.
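
As a quick worked check, we can use the fitted line to predict y for a new value of x (the x value below is purely illustrative and not taken from the dataset):

# Predicted y at an illustrative x value using the fitted gradient and intercept
x_new = 0.01
y_hat = intercept + gradient * x_new
print(y_hat)
# approximately 0.00398 + 0.72408 * 0.01 = 0.01122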

Part 1: Run an OLS regression using statsmodels

Here, we will run an analysis on a hypothetical dataset of 49 stocks. The purpose of the regression is to determine the impact of dividends, earnings and debt to equity on stock returns. We are using a cross-sectional dataset in this instance – meaning that all data is collected at a specific point in time.

In this instance, we use the ols_stock2 dataset (see end of post for details). To form a linear regression model, we set up our model in statsmodels as follows:

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import os
path="C:/Users/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)
os.getcwd()

variables = pd.read_csv('ols_stock2.csv')
stock_return = variables['stock_return']
dividend = variables['dividend']
earnings = variables['earnings']
debt_to_equity = variables['debt_to_equity']
marketcap = variables['marketcap']
stock_return_scaled = (stock_return/earnings)*1

x = np.column_stack((dividend,earnings,debt_to_equity))
x = sm.add_constant(x, prepend=True)

results = sm.OLS(stock_return, x).fit()
print(results.summary())

In addition to defining our independent variables, note that we stack them into a single x matrix using the x = np.column_stack((dividend, earnings, debt_to_equity)) command.
 
This allows our model to take multiple predictors into account. We first fit the model through results = sm.OLS(stock_return, x).fit() and then generate a summary of the results with the print(results.summary()) command.
 
Once we run this code, we generate the following output in Python:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.017
Method:                 Least Squares   F-statistic:                     1.277
Date:                Mon, 05 Feb 2018   Prob (F-statistic):              0.294
Time:                        12:29:30   Log-Likelihood:                -228.45
No. Observations:                  49   AIC:                             464.9
Df Residuals:                      45   BIC:                             472.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          3.5733      8.284      0.431      0.668       -13.112    20.259
x1             3.1439      3.068      1.025      0.311        -3.034     9.322
x2            -0.3646      0.255     -1.429      0.160        -0.878     0.149
x3             4.0564      8.698      0.466      0.643       -13.462    21.575
==============================================================================
Omnibus:                       99.858   Durbin-Watson:                   1.855
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2938.634
Skew:                           5.817   Prob(JB):                         0.00
Kurtosis:                      39.110   Cond. No.                         72.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

As we can see, the statsmodels library allows us to generate highly detailed regression output, with additional statistics such as skew, kurtosis, R-Squared and AIC.
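
These statistics can also be accessed individually as attributes of the fitted results object, should you need them on their own (the attribute names below come from the statsmodels OLS results API):

print(results.rsquared)       # R-squared
print(results.rsquared_adj)   # adjusted R-squared
print(results.aic)            # Akaike information criterion
print(results.bic)            # Bayesian information criterion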

From the above, our regression equation is as follows:

stock_return = 4.14320 + 2.23811(dividend) + 0.26361(earnings) + 11.88278(debt_to_equity)
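
If you prefer to pull the coefficients programmatically rather than reading them off the summary table, they are available via results.params; a minimal sketch, using the order in which we stacked the columns:

# results.params holds the coefficients as [const, dividend, earnings, debt_to_equity]
b0, b1, b2, b3 = results.params
print("stock_return = {:.5f} + {:.5f}(dividend) + {:.5f}(earnings) + {:.5f}(debt_to_equity)".format(b0, b1, b2, b3))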

Output Interpretation

Let’s interpret our regression output:

  • If the dividend yield increases by 1%, then the return on the stock increases by 2.24%.
  • If the earnings growth increases by 1%, then the return on the stock increases by 0.26%.
  • If the debt to equity ratio increases by 1%, then the return on the stock increases by 11.88%.
  • The R-Squared value is 0.6439, meaning 64.39% of the variation in stock returns is “explained” by our predictors.
  • The dividend variable is significant at the 5% level, earnings is significant at the 1% level, while debt to equity is significant at all levels.

Detecting heteroscedasticity using statsmodels

One of the issues that we face when running a linear regression is that of heteroscedasticity.

Heteroscedasticity describes a condition in which the variance of the error term is not constant across observations.

Let’s take the following example. Suppose that you are conducting an analysis on incomes across the United States. Chances are, you will find that incomes in New York are significantly higher than in North Dakota.

Because incomes – and the spread of incomes – tend to be much larger in heavily populated areas, the residuals across your observations will not be evenly spread. In this instance, you will find that the variance of your error term gets larger for cities with higher populations:

[Figure: heteroscedasticity – residual variance increasing across observations]
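
To make this concrete, here is a purely illustrative simulation (not the stock data) in which the spread of the error term grows with population, producing exactly the pattern in the figure:

import numpy as np

np.random.seed(0)
population = np.linspace(1e5, 9e6, 50)
# The standard deviation of the error grows with population – the hallmark of heteroscedasticity
errors = np.random.normal(0, 0.001 * population)
income = 30000 + 0.002 * population + errors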

In order to correct for this condition, it is necessary to test for heteroscedasticity using the Breusch-Pagan test, and then remedy for the condition if it exists.

Let’s go back to our stock example.

Remember that our first regression (reg1) was structured as follows:

stock_return ~ dividend + earnings + debt_to_equity

We first import the het_breuschpagan function from statsmodels.stats.diagnostic and then compute the Breusch-Pagan statistic:

from statsmodels.stats.diagnostic import het_breuschpagan

name = ['Lagrange multiplier statistic', 'p-value',
        'f-value', 'f p-value']

# Run the test on the residuals and design matrix of the fitted model
bp = het_breuschpagan(results.resid, results.model.exog)
pd.DataFrame(bp, index=name, columns=['value'])

[Output: Breusch-Pagan test results – p-value below 0.05]

Since our p-value is less than 0.05, we reject the null hypothesis of homoscedasticity, indicating that heteroscedasticity is present.

There are several ways to correct for this. Some of the solutions, such as heteroscedasticity-corrected standard errors, are more technical in nature.
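
For reference, one such correction – heteroscedasticity-robust (White) standard errors – can be requested directly when fitting the model in statsmodels. A minimal sketch, reusing the x and stock_return defined above:

# Refit the same model, reporting heteroscedasticity-robust (HC1) standard errors
results_robust = sm.OLS(stock_return, x).fit(cov_type='HC1')
print(results_robust.summary())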

However, let’s see if we can rescale our data in order to correct for this issue. Remember that the level of earnings growth differs across our stocks. In this regard, it may be the case that returns across stocks are biased towards those with higher earnings.

Let’s see what happens if we scale returns to a constant rate of earnings growth:

stock_return_scaled = (stock_return/earnings)*1

Then, we substitute this variable into our prior statsmodels code:

results = sm.OLS(stock_return_scaled, x).fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.017
Method:                 Least Squares   F-statistic:                     1.277
Date:                Sun, 04 Feb 2018   Prob (F-statistic):              0.294
Time:                        13:33:11   Log-Likelihood:                -228.45
No. Observations:                  49   AIC:                             464.9
Df Residuals:                      45   BIC:                             472.5
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          3.5733      8.284      0.431      0.668       -13.112    20.259
x1             3.1439      3.068      1.025      0.311        -3.034     9.322
x2            -0.3646      0.255     -1.429      0.160        -0.878     0.149
x3             4.0564      8.698      0.466      0.643       -13.462    21.575
==============================================================================
Omnibus:                       99.858   Durbin-Watson:                   1.855
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2938.634
Skew:                           5.817   Prob(JB):                         0.00
Kurtosis:                      39.110   Cond. No.                         72.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We now see that all of our variables have become insignificant, with p-values now well above 0.05, and the R-Squared statistic has dropped to a minuscule 0.078.

Moreover, let’s run our Breusch-Pagan test for heteroscedasticity again and see what we come up with:
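
A minimal sketch of the re-run, assuming results now holds the fit on stock_return_scaled and reusing the name labels defined earlier:

# Breusch-Pagan test on the rescaled regression
bp_scaled = het_breuschpagan(results.resid, results.model.exog)
pd.DataFrame(bp_scaled, index=name, columns=['value'])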

[Output: Breusch-Pagan test results – p-value of 0.1855]

Now, we see that the p-value has increased to 0.1855. This is higher than 0.05, so we can no longer reject the null hypothesis of homoscedasticity.

In this regard, the Breusch-Pagan test indicates that our earlier results were being influenced by heteroscedasticity and were therefore unreliable.

Using sklearn to test for multicollinearity

It is also possible that correlations exist between our predictors. This is a condition known as multicollinearity, and this has the potential to make our results invalid. Therefore, we test for this condition using a Variance Inflation Factor (VIF) test. If our VIF statistic > 10, then there exists an issue with multicollinearity.

To do this, we run a VIF on the reg1 regression model that we created. The Variance Inflation Factor is calculated as follows:

VIF = 1/(1-R^2)

Note that R² here is the statistic obtained from the auxiliary regression – i.e. the one where an independent variable is regressed against the other independent variables.

Let’s illustrate this within our code. Suppose that we wish to regress dividend against earnings and debt to equity. This is our auxiliary regression, and we are obtaining the R-Squared statistic from this and calculating the VIF statistic for the dividend variable:

# Auxiliary regression: regress dividend on the other two predictors
z = np.column_stack((earnings, debt_to_equity))
z = sm.add_constant(z, prepend=True)

from sklearn.metrics import r2_score
from sklearn import linear_model

lm = linear_model.LinearRegression()
model = lm.fit(z, dividend)
predictions = lm.predict(z)
print(predictions)

# R-squared of the auxiliary regression
rsquareddividend = r2_score(dividend, predictions)
rsquareddividend

# VIF for the dividend variable
vifdividend = 1/(1 - rsquareddividend)

You can see there’s quite a lot going on here, so let’s try to simplify things.

We are defining our data stack of earnings and debt to equity (our predictor variables) as a variable z. Once we have done this, we regress the dividend variable against z (our auxiliary regression). From the predictions of this regression we obtain the R-Squared value, and from that we calculate the VIF statistic.

vifdividend
1.2844298745832432

Since the VIF statistic for the dividend variable is less than 10, there is no indication of multicollinearity in this variable. To check the others, we simply make each remaining predictor the dependent variable of its own auxiliary regression and repeat the calculation above. Doing so gives VIF statistics of 1.258 for earnings and 1.568 for debt to equity, so these variables also show no evidence of multicollinearity in our model.
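
As a cross-check, statsmodels also provides a variance_inflation_factor helper that computes the same quantity for each column of a design matrix. A small sketch using the x matrix built earlier (which already contains the constant):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Skip index 0, which is the constant column
for i, col in enumerate(['dividend', 'earnings', 'debt_to_equity'], start=1):
    print(col, variance_inflation_factor(x, i))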

Part 2: Logistic Regression Modelling in Python

Now, let us take a look at how to run logistic regressions in Python.

There are three major types of variables we can encounter:

  • Nominal: One where there is no inherent order to the elements. This type of variable is categorical and is commonly used for grouping, e.g. male = 0, female = 1.
  • Ordinal: A categorical variable for which the elements are ordered. For instance, a ranking scale from 1-5 (with 1 being the lowest and 5 being the highest) is an ordinal variable.
  • Interval: A variable where the elements are ordered and the differences between the elements are meaningful. For instance, speed operates on an interval scale – the difference between 70-80 km/h and 100-110 km/h is 10 km/h in both instances.

A binary logistic regression is one where the dependent variable is nominal. Let’s see how we run and interpret regressions such as these.

Here is the structure of our variables within the binomial_stock dataset (see end of post):

Dependent Variable (Y)

  • Dividends (dummy variable) – if the company currently pays dividends, it is given a value of 1. If the company does not pay dividends, it is given a value of 0.

Independent Variables (X)

  • Years: The number of years of consecutive dividend increases.
  • Earnings Estimates: If the stock beat earnings estimates in each of the past four quarters, it is given a value of 1. Otherwise, it is given a value of 0.

Essentially, we wish to use the input variables years and earnings estimates to determine whether a stock will pay a dividend or not.

Firstly, we will import our relevant libraries, including LogisticRegression and train_test_split from sklearn.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import os
path="C:/Users/michaeljgrogan/Documents/YourDirectory"
os.chdir(path)
os.getcwd()

The code and datasets we will use for this example are logisticregression.ipynb and binomial_stock.csv, available from the MGCodesandStats Github Repository. Then, we generate our regression model, using the dividend variable as our dependent variable, and years and earnings estimates as our independent variables:

variables = pd.read_csv('binomial_stock.csv')
years = variables['years']
dividend = variables['dividend']
earnings_estimates = variables['earnings_estimates']

y = dividend
x = np.column_stack((years,earnings_estimates))
x = sm.add_constant(x, prepend=True)

logit_model=sm.Logit(y,x)
result=logit_model.fit()
print(result.summary())

Upon running the above, we generate the following output:

Optimization terminated successfully.
         Current function value: 0.162588
         Iterations 10
                           Logit Regression Results                           
==============================================================================
Dep. Variable:               dividend   No. Observations:                   70
Model:                          Logit   Df Residuals:                       67
Method:                           MLE   Df Model:                            2
Date:                Sat, 03 Feb 2018   Pseudo R-squ.:                  0.7654
Time:                        21:24:55   Log-Likelihood:                -11.381
converged:                       True   LL-Null:                       -48.520
                                        LLR p-value:                 7.425e-17
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        -10.6858      3.441     -3.105      0.002       -17.431    -3.941
x1             0.8919      0.297      2.999      0.003         0.309     1.475
x2             0.3953      1.100      0.359      0.719        -1.761     2.552
==============================================================================

Possibly complete quasi-separation: A fraction 0.29 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

Of our predictors, we see that years is indicated as statistically significant at the 1% significance level.

Note that the Pseudo R-Squared statistic provided in the output is not necessarily reliable as a measure of fit in logistic regressions. This is because the degree of variation in a nominal dependent variable is a lot smaller than that of an interval one, and therefore a very large number of observations are needed for this statistic to have credence.

Odds Ratios and Probabilities

Now, we need to obtain the odds ratio for the years variable. An odds ratio compares the likelihood of an event happening to the likelihood of it not happening.

For instance, the odds ratio for our years variable is exp(0.8919) = 2.439. This means that for each additional year of consecutive dividend increases, the odds of a company paying a dividend are multiplied by roughly 2.44.

Now, let’s interpret the odds ratio in a way that makes more sense. We can convert the odds ratio to a probability as follows:

oddsratio = np.exp(result.params[1])   # odds ratio for years (the x1 coefficient)
probability = oddsratio/(1 + oddsratio)
probability

Therefore, the odds associated with each additional year of consecutive dividend increases correspond to a probability of roughly 70.93%:

(2.439/(1+2.439)) = 0.7093

Note that while we ran a logistic regression on all our observations, we can also split the data into training and test data using train_test_split.
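
The scores below come from a fit along these lines – a minimal sketch, assuming the x and y defined above (the exact split, and therefore the scores, depend on the random state used in the original analysis):

# Split the data and fit a scikit-learn logistic regression on the training portion
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(x_train, y_train)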

print("Training set score: {:.3f}".format(logreg.score(x_train,y_train)))
Training set score: 0.885

print("Test set score: {:.3f}".format(logreg.score(x_test,y_test)))
Test set score: 0.944

We see that we obtained high training and test scores across our data, indicating that the logistic regression showed a high degree of accuracy in determining whether a stock would pay a dividend or not.

Summary

In this tutorial, you have learned how to:

  • Create and interpret linear regression output using statsmodels
  • Use the Breusch-Pagan test to screen for heteroscedasticity
  • Employ the sklearn library to generate regression output
  • Test for multicollinearity using the Variance Inflation Factor test
  • Generate logistic regression models, interpret odds ratios, and convert to probabilities