Linear and Logistic Regression Modelling in Python

The statsmodels and sklearn libraries are both frequently used to generate regression output. However, depending on the extent of the analysis, a user will often need to switch between the two.

Generally speaking, statsmodels is useful for generating all of the regression output in a single summary table. However, sklearn offers more flexibility when working with the output, e.g. extracting coefficients, running an auxiliary regression, etc.

In the examples below, we will go through how these libraries differ when it comes to generating regression output.

Before we get into how the statsmodels and sklearn libraries work, let’s run a very simple linear regression using the numpy and scipy libraries to get a better understanding of regression analysis.

Introduction to simple linear regression using numpy and scipy

from scipy import stats
import numpy as np

x = [-0.018, -0.008, 0.011, 0.017, -0.008, -0.002]
y = [-0.006, -0.001, 0.015, 0.017, -0.0019, -0.005]

# Fit a simple linear regression of y on x
gradient, intercept, r_value, p_value, std_err = stats.linregress(x, y)

print("Gradient and intercept", gradient, intercept)
Gradient and intercept 0.724084177708 0.00398211223694
 
print("R-squared", r_value**2)
R-squared 0.854175903355
 
print("p-value", p_value)
p-value 0.00839711859611

Let’s interpret the above output.

  • Intercept = 0.00398. This is the predicted value of y (our dependent variable) when x equals 0 (see the sketch after this list).
  • Gradient = 0.72408. For every unit increase in x, y increases by 0.72408.
  • R-Squared = 0.85417. Approximately 85.42% of the variation in our y variable is “explained” by the x variable in this model.
  • p-value = 0.0083. The p-value is below 0.05, denoting statistical significance of the regression at the 5% level.
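
To make this interpretation concrete, the fitted line can be used to predict y for a new value of x. Here is a minimal sketch, assuming the gradient and intercept obtained above (the new x value is purely illustrative):

# Hypothetical new observation of x (illustrative only)
new_x = 0.010

# Predicted y from the fitted line: y = intercept + gradient * x
predicted_y = intercept + gradient * new_x
print(predicted_y)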

OLS (Linear) Regressions

The purpose of this regression is to determine the impact of price, hours driven, and tank capacity on gasoline consumption.

We are using a cross-sectional dataset in this instance – meaning that all data is collected at a specific point in time.

In this instance, we use the gasoline.csv dataset. To form a linear regression model, we set up our model in statsmodels as follows:

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
from sklearn.model_selection import train_test_split
import os
path="/home/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)
os.getcwd()

variables = pd.read_csv('gasoline.csv')
consumption = variables['consumption']
capacity = variables['capacity']
price = variables['price']
hours = variables['hours']

y = consumption
x = np.column_stack((capacity,price,hours))
x = sm.add_constant(x, prepend=True)

x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0)

results = sm.OLS(y_train,x_train).fit()
print(results.summary())

After defining our independent variables, we stack them into a single x variable using the x = np.column_stack((capacity,price,hours)) command. Note also that we split the data and feed the training set into the model, so that we can later generate predictions and assess their accuracy against the test data.

This allows our model to take multiple predictors into account and generate detailed output: we first fit the model through results = sm.OLS(y_train,x_train).fit() and then generate a summary of the results through the print(results.summary()) command.

OLS Regression Results

Once we run this code, we generate the following output in Python:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:            consumption   R-squared:                       0.762
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     27.72
Date:                Sun, 07 Oct 2018   Prob (F-statistic):           2.93e-08
Time:                        16:48:57   Log-Likelihood:                -181.02
No. Observations:                  30   AIC:                             370.0
Df Residuals:                      26   BIC:                             375.6
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        157.9277    531.927      0.297      0.769    -935.463    1251.319
x1            -8.1258      4.359     -1.864      0.074     -17.086       0.834
x2           200.3928     77.056      2.601      0.015      42.001     358.785
x3             0.0685      0.016      4.364      0.000       0.036       0.101
==============================================================================
Omnibus:                        4.677   Durbin-Watson:                   1.160
Prob(Omnibus):                  0.096   Jarque-Bera (JB):                3.245
Skew:                          -0.774   Prob(JB):                        0.197
Kurtosis:                       3.444   Cond. No.                     4.10e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.1e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

As we can see, the statsmodels library allows us to generate highly detailed regression output, with additional statistics such as skew, kurtosis, R-Squared and AIC.

From the above, our regression equation is as follows:

consumption = 157.9277 – 8.1258(capacity) + 200.3928(price) + 0.0685(hours)
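
The same equation can also be reconstructed programmatically rather than by copying numbers from the summary table. Here is a minimal sketch, assuming the results object fitted above (x1, x2 and x3 correspond to capacity, price and hours respectively; the observation used for the prediction is purely hypothetical):

# Fitted coefficients as a pandas Series (const, x1, x2, x3)
print(results.params)

# Predicted consumption for a hypothetical observation
# (order: constant, capacity, price, hours)
illustrative_row = np.array([[1.0, 12.0, 3.5, 14000.0]])
print(results.predict(illustrative_row))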

Output Interpretation

Let’s interpret our regression output:

  • If the capacity increases by 1 unit, then gasoline consumption decreases by 8.1258 units. However, this variable is deemed insignificant at the 5% level.
  • If the price increases by 1 unit, then consumption increases by 200.39 units. This indicates that drivers of less fuel-efficient cars pay more for fuel, but do not drive less as a result of higher prices.
  • When the hours driven increase by 1, fuel consumption increases by 0.0685 units. The small size of this increase could indicate that drivers who drive for more hours tend to have more fuel-efficient cars, which in turn keeps their fuel consumption down.

Detecting heteroscedasticity using statsmodels

One of the issues that we face when running a linear regression is that of heteroscedasticity.

Heteroscedasticity is a term used to describe a condition where the variance of our error term is not constant across observations.

For instance, larger cars often come with more fuel-efficient engines (e.g. diesel), and are typically driven for longer periods. Therefore, if we are surveying consumption across different car sizes, it is possible that the variance of the errors differs systematically with car size:

[Figure: illustration of heteroscedasticity]

In order to correct for this condition, it is necessary to test for heteroscedasticity using the Breusch-Pagan test, and then remedy the condition if it exists.

Let’s go back to our example.

Remember that our regression was structured as follows:

consumption ~ capacity + price + hours

We first import the statsmodels diagnostic tools and then compute the Breusch-Pagan statistic using the het_breuschpagan function from statsmodels.stats.diagnostic:

import statsmodels.stats.diagnostic as smd

name = ['Lagrange multiplier statistic', 'p-value', 
        'f-value', 'f p-value']

# Breusch-Pagan test on the residuals of the fitted OLS model
bp = smd.het_breuschpagan(results.resid, results.model.exog)
bp
pd.DataFrame(name,bp)

And here are our results:

>>> pd.DataFrame(name,bp)
                                      0
3.510484  Lagrange multiplier statistic
0.319405                        p-value
1.148537                        f-value
0.348164                      f p-value

Since our p-value is greater than 0.05, we cannot reject the null hypothesis of homoscedasticity, which indicates that heteroscedasticity is not present.
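
Had the test indicated heteroscedasticity, a common remedy is to re-estimate the model with heteroscedasticity-robust (HC) standard errors. Here is a minimal sketch, assuming the same y_train and x_train used above:

# Re-fit the OLS model with heteroscedasticity-consistent (HC1) standard errors
robust_results = sm.OLS(y_train, x_train).fit(cov_type='HC1')
print(robust_results.summary())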

Using sklearn to test for multicollinearity

It is also possible that correlations exist between our predictors. This is a condition known as multicollinearity, and this has the potential to make our results invalid.

Therefore, we test for this condition using a Variance Inflation Factor (VIF) test. If our VIF statistic > 10, then there exists an issue with multicollinearity.

To do this, we run a VIF test on the regression model that we created above. The Variance Inflation Factor is calculated as follows:

VIF = 1/(1-R^2)

Note that the R-Squared is the statistic obtained from the auxiliary regression, i.e. the one where an independent variable is regressed against the other independent variables.

Let’s illustrate this within our code. Suppose that we wish to regress the hours variable against price and capacity.

This is our auxiliary regression, and we are obtaining the R-Squared statistic from this and calculating the VIF statistic for the hours variable:

# Auxiliary regression: regress hours on the remaining predictors (capacity and price)
z = np.column_stack((capacity,price))
z = sm.add_constant(z, prepend=True)

from sklearn.metrics import r2_score
from sklearn import linear_model
lm = linear_model.LinearRegression()
model = lm.fit(z,hours)
predictions = lm.predict(z)
print(predictions)

# R-Squared of the auxiliary regression
rsquared=r2_score(hours, predictions)
rsquared

# VIF statistic for the hours variable
vif=1/(1-(rsquared))

You can see there’s quite a lot going on here, so let’s try to simplify things.

We are defining our data stack of capacity and price (the other predictor variables) as a variable z. Once we have done this, we then regress the hours variable against z (our auxiliary regression).

>>> print(predictions)
[14072.48585043 13693.61384188 13405.42449666 14307.1723462
 14448.22339131 14119.05060852 13498.18506552 14201.25114267
 13479.56552649 14086.96075169 14736.80572929 12972.99037447
 13383.17742762 14321.41188958 13749.68765919 13523.09588648
 13045.5465611  13863.65549489 15002.80726966 15134.45491176
 15207.30572045 16737.77289153 16038.06935441 16949.71366929
 14769.6944438  15112.93067629 16523.71898949 15988.22366706
 15980.0993119  16144.17223828 16585.55750652 14966.56160176
 15278.25301842 14743.59735905 16206.04597404 16676.8092563
 16222.04086768 16017.46080612 16315.54661668 17007.85980549]

Once we have the predictions from this, we are then obtaining the R-Squared value and then calculating the VIF statistic from that.

vif
0.5198862908709541

Since the VIF statistic for the hours variable is less than 10, this indicates that no multicollinearity exists in this variable. To check for the others, we simply select the relevant variable as the dependent one, and regress against the others.

The VIF readings for the other variables also indicate that no multicollinearity exists in our model.
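
Rather than running each auxiliary regression by hand, the VIF statistics for all predictors can also be computed in a loop. Here is a minimal sketch using the variance_inflation_factor function from statsmodels, assuming the x design matrix (with constant) that we built earlier:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Column 0 of x is the constant; columns 1-3 are capacity, price and hours
for i, name in enumerate(['capacity', 'price', 'hours'], start=1):
    print(name, variance_inflation_factor(x, i))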

Model Predictions

Now, let’s generate some predictions with our model:

y_pred = results.predict(x_test)

>>> print(y_pred)
[1782.161739   1520.78970223 1607.0050501  1436.46031383 1485.46152932
 1345.93190007 1725.92235667 1363.48429628 1631.55660023 1718.88234168]

Using this data, we would now like to calculate the percentage deviation between our predictions and the actual test values.

>>> deviation=(y_pred-y_test)/y_test
>>> deviation
22    0.088011
20   -0.066427
25   -0.040021
4     0.025311
10    0.018137
15   -0.129973
28    0.018844
11   -0.072460
18    0.017180
29    0.006961

We can see that in many instances the deviation between our predictions and actual values is under 10%, which is a good start considering that our dataset as a whole contains only 40 observations. While in a real-world scenario one would need more observations to establish the validity of the above relationships, a linear regression can provide significant insight into the nature of the correlations between variables and their significance.
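
If a single summary error metric is preferred, the mean squared error (and its square root) can also be computed. Here is a minimal sketch using sklearn, assuming the y_test and y_pred values from above:

from sklearn.metrics import mean_squared_error

# Mean squared error between actual and predicted consumption
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)

# Root mean squared error, in the same units as consumption
print("RMSE:", np.sqrt(mse))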

Logistic Regression Modelling in Python

Now, let us take a look at how to run logistic regressions in Python.

There are three major types of variables we can encounter:

  • Nominal: A categorical variable where there is no order to the elements, commonly used for categorizing, e.g. male = 0, female = 1, etc.
  • Ordinal: A categorical variable for which the elements are ordered. For instance, a ranking scale from 1-5 (with 1 being the lowest and 5 being the highest) is an ordinal variable.
  • Interval: A variable where the elements are ordered and the differences between values are meaningful. For instance, speed operates on an interval scale: the difference between 70-80 km/h and 100-110 km/h is 10 km/h in both instances.

What is a binary logistic regression?

A binary logistic regression is one where the dependent variable is nominal. Let’s see how we run and interpret regressions such as these.
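
Under the hood, a logistic regression maps a linear combination of the predictors onto a probability between 0 and 1 via the logistic (sigmoid) function. Here is a minimal sketch of this transformation (the coefficient and input values are purely illustrative):

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number to the interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Illustrative linear predictor: intercept + coefficient * x
intercept, coefficient, x_value = -2.0, 0.8, 3.0
linear_predictor = intercept + coefficient * x_value

# Predicted probability that the dependent variable equals 1
print(sigmoid(linear_predictor))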

Now, let’s take another example.

In this case, assume a stock analyst is trying to determine whether a stock pays a dividend or not, and wishes to classify accordingly.

The dataset we will use for this example is the dividendinfo dataset.

Here is the structure of our variables within the dataset:

Dependent Variable (Y)

  • dividend: If the company currently pays dividends, it is given a value of 1. If the company does not pay dividends, it is given a value of 0.

Independent Variables (X)

  • fcfps: Free cash flow per share (in $)
  • earnings_growth: Earnings growth in the past year (in %)
  • de: Debt to Equity ratio
  • mcap: Market Capitalization of the stock
  • current_ratio: Current Ratio (or Current Assets/Current Liabilities)

Essentially, we wish to use these input variables to determine whether a stock pays a dividend or not.

Firstly, we will import our relevant libraries, including the LogisticRegression and train_test_split from sklearn.

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import os
path="/home/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)
os.getcwd()

Model Generation

We generate our regression model, using the dividend variable as our dependent variable, and the remaining variables listed above as our independent variables:

variables=pd.read_csv('dividendinfo.csv')
dividend=variables['dividend']
freecashflow_pershare=variables['fcfps']
earnings_growth=variables['earnings_growth']
debt_to_equity=variables['de']
mcap=variables['mcap']
current_ratio=variables['current_ratio']

y=dividend
x=np.column_stack((freecashflow_pershare,earnings_growth,debt_to_equity,mcap,current_ratio))
x=sm.add_constant(x,prepend=True)

Now, we are going to split our data into training and test data, and compute scores for each.

x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0)
logreg=LogisticRegression().fit(x_train,y_train)
logreg
print("Training set score: {:.3f}".format(logreg.score(x_train,y_train)))
print("Test set score: {:.3f}".format(logreg.score(x_test,y_test)))

Let’s see what scores we get:

>>> print("Training set score: {:.3f}".format(logreg.score(x_train,y_train)))
Training set score: 0.920
>>> print("Test set score: {:.3f}".format(logreg.score(x_test,y_test)))
Test set score: 0.960

Now, we can use the training data to generate the output of the logistic regression itself:

logit_model=sm.Logit(y_train,x_train)
result=logit_model.fit()
print(result.summary())

Upon running the above, we generate the following output:

Optimization terminated successfully.
         Current function value: 0.111698
         Iterations 10
>>> print(result.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:               dividend   No. Observations:                  150
Model:                          Logit   Df Residuals:                      144
Method:                           MLE   Df Model:                            5
Date:                Sat, 06 Oct 2018   Pseudo R-squ.:                  0.8385
Time:                        15:59:37   Log-Likelihood:                -16.755
converged:                       True   LL-Null:                       -103.76
                                        LLR p-value:                 1.018e-35
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -21.1447      6.084     -3.475      0.001     -33.069      -9.220
x1             1.7848      0.578      3.087      0.002       0.652       2.918
x2             0.1035      0.035      2.952      0.003       0.035       0.172
x3            -0.9209      0.538     -1.713      0.087      -1.975       0.133
x4             0.0246      0.007      3.476      0.001       0.011       0.039
x5             5.2756      1.790      2.948      0.003       1.768       8.783
==============================================================================

Of our predictors, we see that all except the debt-to-equity ratio (x3) are indicated as statistically significant at the 1% significance level.

Note that the Pseudo R-Squared statistic provided in the output is not necessarily reliable as a measure of fit in logistic regressions.

This is because the degree of variation in a nominal dependent variable is a lot smaller than that of an interval one, and therefore a very large number of observations are needed for this statistic to have credence.

Odds Ratios and Probabilities

Now, we need to obtain the odds ratio for the fcfps variable. An odds ratio compares the odds of an event happening to the odds of it not happening.

For instance, the odds ratio for our fcfps (free cash flow per share) variable is exp(1.7848) = 5.958. This means that if a company’s free cash flow per share increases by one unit, the odds of that company paying a dividend increase by a factor of 5.958.

Now, let’s interpret the odds ratio in a way that makes more sense. We can convert the odds ratio to a probability as follows:

oddsratio=np.exp(1.7848)
probability=(oddsratio/(1+oddsratio))
probability

Therefore, we can calculate that for a one-unit increase in free cash flow per share, the implied probability of the company paying a dividend is approximately 85.63%:

(5.958/(1+5.958)) = 0.8563
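
As a sketch, the odds ratios (and their probability equivalents) for all of the coefficients can be obtained directly from the fitted Logit result object created above:

# Exponentiate the fitted Logit coefficients to obtain odds ratios
odds_ratios = np.exp(result.params)
print(odds_ratios)

# Convert each odds ratio to the corresponding probability
print(odds_ratios / (1 + odds_ratios))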

ROC Curves

However, in order to more formally analyse accuracy, we will generate what is called an ROC Curve.

Essentially, we are going to plot the true positive rate against the false positive rate to determine if our logistic regression is a better classifier than just a random guess.

The metric we will use to determine this is called AUC (area under the curve). The higher the AUC, the better. If we have an AUC of 0.5, then this indicates that our classifier is no better than random guessing.

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Compute false positive and true positive rates at each threshold
falsepos,truepos,thresholds=roc_curve(y_test,logreg.decision_function(x_test))

plt.plot(falsepos,truepos,label="ROC")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")

# Mark the threshold closest to zero (the default decision boundary)
cutoff=np.argmin(np.abs(thresholds))
plt.plot(falsepos[cutoff],truepos[cutoff],'o',markersize=10,label="cutoff",fillstyle="none")
plt.legend()
plt.show()

Here is a plot of our ROC curve:

[Figure: ROC curve for the logistic regression classifier]

AUC Value

Now, let’s compute our AUC value:

from sklearn import metrics
metrics.auc(falsepos, truepos)

We see that we obtain an AUC of 0.9932:

>>> metrics.auc(falsepos, truepos)
0.9932088285229203

This means that our area under the curve is 99.32%, and this indicates that our classifier has a very high degree of accuracy in predicting whether a stock pays a dividend or not.

It would appear that the more independent variables we add to our analysis, the higher the AUC. For instance, if we exclude the mcap and current_ratio variables from our analysis, the AUC drops slightly to 94.4%.
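
Here is a sketch of how such a comparison could be run, re-fitting the classifier on only the first three predictors and recomputing the AUC with the same train/test split seed as above:

from sklearn.metrics import roc_auc_score

# Reduced design matrix: fcfps, earnings growth and debt-to-equity only
x_reduced = np.column_stack((freecashflow_pershare, earnings_growth, debt_to_equity))
x_reduced = sm.add_constant(x_reduced, prepend=True)

xr_train, xr_test, yr_train, yr_test = train_test_split(x_reduced, y, random_state=0)

# Re-fit the logistic regression and compute the AUC on the test set
logreg_reduced = LogisticRegression().fit(xr_train, yr_train)
print(roc_auc_score(yr_test, logreg_reduced.decision_function(xr_test)))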

Summary

In this tutorial, you have learned how to:

  • Create and interpret linear regression output using statsmodels
  • Use the Breusch-Pagan test to screen for heteroscedasticity
  • Employ the sklearn library to generate regression output
  • Test for multicollinearity using the Variance Inflation Factor test
  • Generate logistic regression models, interpret odds ratios, and convert to probabilities
  • Interpret accuracy of logistic regression models by using ROC curves