The **statsmodels** and **sklearn** libraries are both frequently used to generate regression output. That said, a user often needs to work with different libraries depending on the extent of the analysis.

Generally speaking, statsmodels is useful for generating all the regression output in a single summary frame. However, sklearn is necessary when more flexibility with the output is needed, e.g. extracting coefficients, running an auxiliary regression, etc.

In the examples below, we will go through how these libraries differ when generating regression output.

Before we get into how the **statsmodels** and **sklearn** libraries work, let’s run a very simple linear regression using the **numpy** and **scipy** libraries to get a better understanding of regression analysis.

## Introduction to simple linear regression using numpy and scipy

```python
from scipy import stats
import numpy as np

x = [-0.018, -0.008, 0.011, 0.017, -0.008, -0.002]
y = [-0.006, -0.001, 0.015, 0.017, -0.0019, -0.005]

gradient, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("Gradient and intercept", gradient, intercept)
print("R-squared", r_value**2)
print("p-value", p_value)
```

```
Gradient and intercept 0.724084177708 0.00398211223694
R-squared 0.854175903355
p-value 0.00839711859611
```

Let’s interpret the above output.

- **Intercept = 0.00398.** This is the value of y (the dependent variable) when x equals 0.
- **Gradient = 0.72408.** For every unit increase in x, y increases by 0.72408.
- **R-Squared = 0.85417.** 85.41% of the variation in the y variable is "explained" by the x variable in this model.
- **p-value = 0.0083.** The p-value is below 0.05, denoting statistical significance of the regression at the 5% level.
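With the fitted gradient and intercept in hand, predicting y for a new x value is a one-liner. A minimal sketch using the coefficients above (the new x value of 0.01 is purely illustrative):

```python
# Fitted coefficients taken from the linregress output above
gradient = 0.724084177708
intercept = 0.00398211223694

# Predict y for a hypothetical new observation x = 0.01
x_new = 0.01
y_new = gradient * x_new + intercept
print(y_new)  # roughly 0.0112
```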

## OLS (Linear) Regressions

The purpose of this regression is to determine the impact of price, hours driven, and tank capacity on gasoline consumption.

We are using a cross-sectional dataset in this instance – meaning that all data is collected at a specific point in time.

In this instance, we use the **gasoline.csv** dataset. To form a linear regression model, we set up our model in statsmodels as follows:

```python
import numpy as np
import statsmodels.api as sm
import pandas as pd
from sklearn.model_selection import train_test_split
import os

path = "/home/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)

variables = pd.read_csv('gasoline.csv')
consumption = variables['consumption']
capacity = variables['capacity']
price = variables['price']
hours = variables['hours']

y = consumption
x = np.column_stack((capacity, price, hours))
x = sm.add_constant(x, prepend=True)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = sm.OLS(y_train, x_train).fit()
print(results.summary())
```

In addition to defining our independent variables, we see that we are stacking our independent variables into a single x variable using the **x = np.column_stack((capacity,price,hours))** command. Additionally, note that we are feeding our **training** data into the model so that we can generate predictions, and then assess the accuracy of those predictions against our **test** data.

This allows our model to take multiple parameters into account and generate detailed output, by first fitting the model through **results = sm.OLS(y_train, x_train).fit()** and then generating a summary of the results through the **print(results.summary())** command.

### OLS Regression Results

Once we run this code, we generate the following output in Python:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:            consumption   R-squared:                       0.762
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     27.72
Date:                Sun, 07 Oct 2018   Prob (F-statistic):           2.93e-08
Time:                        16:48:57   Log-Likelihood:                -181.02
No. Observations:                  30   AIC:                             370.0
Df Residuals:                      26   BIC:                             375.6
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        157.9277    531.927      0.297      0.769    -935.463    1251.319
x1            -8.1258      4.359     -1.864      0.074     -17.086       0.834
x2           200.3928     77.056      2.601      0.015      42.001     358.785
x3             0.0685      0.016      4.364      0.000       0.036       0.101
==============================================================================
Omnibus:                        4.677   Durbin-Watson:                   1.160
Prob(Omnibus):                  0.096   Jarque-Bera (JB):                3.245
Skew:                          -0.774   Prob(JB):                        0.197
Kurtosis:                       3.444   Cond. No.                     4.10e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.1e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
```

As we can see, the statsmodels library allows us to generate highly detailed regression output, with additional statistics such as **skew**, **kurtosis**, **R-Squared** and **AIC**.

From the above, our regression equation is as follows:

*consumption = 157.9277 – 8.1258(capacity) + 200.3928(price) + 0.0685(hours)*

### Output Interpretation

Let’s interpret our regression output:

- If the capacity increases by 1 unit, then gasoline consumption decreases by 8.1258 units. However, this variable is deemed insignificant at the 5% level.
- If the price increases by 1 unit, then consumption increases by 200.39 units. This indicates that drivers of less fuel-efficient cars pay more for fuel, but do not drive less as a result of higher prices.
- When the hours driven increase by 1, fuel consumption increases by 0.0685. This slight increase could indicate that drivers who drive more hours have more fuel-efficient cars, which in turn would lower their spend on fuel.
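To make the equation concrete, we can plug values into the fitted coefficients. The input values below are made up purely for illustration:

```python
# Coefficients from the OLS summary above
const, b_capacity, b_price, b_hours = 157.9277, -8.1258, 200.3928, 0.0685

# Hypothetical inputs: a tank capacity of 50, a price of 1.5, and 10,000 hours driven
capacity, price, hours = 50, 1.5, 10000
consumption = const + b_capacity * capacity + b_price * price + b_hours * hours
print(round(consumption, 4))  # 737.2269
```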

### Detecting heteroscedasticity using statsmodels

One of the issues that we face when running a linear regression is that of heteroscedasticity.

**Heteroscedasticity** is a term used to describe a condition where the variance of the error term is not constant across observations.

For instance, larger cars often come with more fuel-efficient engines (e.g. diesel), and are typically driven for longer periods. Therefore, if we are surveying consumption across different car sizes, it is possible that the error variance in consumption may vary systematically with car size.

In order to correct for this condition, it is necessary to test for heteroscedasticity using the Breusch-Pagan test, and then remedy for the condition if it exists.

Let’s go back to our example.

Remember that our first regression was structured as follows:

consumption ~ capacity + price + hours

We first import the relevant statsmodels module and then compute the Breusch-Pagan statistic using the statsmodels.stats.diagnostic.het_breuschpagan function:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

name = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
bp = het_breuschpagan(results.resid, results.model.exog)
pd.DataFrame(name, bp)
```

And here are our results:

```
>>> pd.DataFrame(name, bp)
                                      0
3.510484  Lagrange multiplier statistic
0.319405                        p-value
1.148537                        f-value
0.348164                      f p-value
```

Since our p-value (0.3194) is greater than 0.05, we cannot reject the null hypothesis of homoscedasticity; in other words, there is no evidence of heteroscedasticity in this model.

### Using sklearn to test for multicollinearity

It is also possible that correlations exist between our predictors. This is a condition known as multicollinearity, and this has the potential to make our results invalid.

Therefore, we test for this condition using a Variance Inflation Factor (VIF) test. If our **VIF statistic > 10**, then there exists an issue with multicollinearity.

To do this, we run a VIF test on the regression model that we created. The Variance Inflation Factor is calculated as follows:

**VIF = 1/(1-R^2)**

Note that the R2 is the statistic obtained from the auxiliary regression – i.e. the one where an independent variable is regressed against the other independent variables.

Let’s illustrate this within our code. Suppose that we wish to regress the hours variable against price and capacity.

This is our auxiliary regression, and we are obtaining the R-Squared statistic from this and calculating the VIF statistic for the hours variable:

```python
from sklearn import linear_model
from sklearn.metrics import r2_score

z = np.column_stack((capacity, price))
z = sm.add_constant(z, prepend=True)

lm = linear_model.LinearRegression()
model = lm.fit(z, hours)
predictions = lm.predict(z)
print(predictions)

rsquared = r2_score(hours, predictions)
vif = 1/(1 - rsquared)
```

You can see there’s quite a lot going on here, so let’s try to simplify things.

We are defining our data stack of capacity and price (our predictor variables) as a variable z. Once we have done this, we regress the hours variable against z (our auxiliary regression).

```
>>> print(predictions)
[14072.48585043 13693.61384188 13405.42449666 14307.1723462  14448.22339131
 14119.05060852 13498.18506552 14201.25114267 13479.56552649 14086.96075169
 14736.80572929 12972.99037447 13383.17742762 14321.41188958 13749.68765919
 13523.09588648 13045.5465611  13863.65549489 15002.80726966 15134.45491176
 15207.30572045 16737.77289153 16038.06935441 16949.71366929 14769.6944438
 15112.93067629 16523.71898949 15988.22366706 15980.0993119  16144.17223828
 16585.55750652 14966.56160176 15278.25301842 14743.59735905 16206.04597404
 16676.8092563  16222.04086768 16017.46080612 16315.54661668 17007.85980549]
```

Once we have the predictions, we obtain the R-Squared value and then calculate the VIF statistic from it. The R-Squared of the auxiliary regression is 0.5199, which gives a VIF of 1/(1 - 0.5199) ≈ 2.08:

```
>>> rsquared
0.5198862908709541
```

Since the VIF statistic for the hours variable is less than 10, this indicates that no multicollinearity exists in this variable. To check for the others, we simply select the relevant variable as the dependent one, and regress against the others.

The VIF readings for the other variables also indicate that no multicollinearity exists in our model.

### Model Predictions

Now, let’s generate some predictions with our model:

```python
y_pred = results.predict(x_test)
```

```
>>> print(y_pred)
[1782.161739   1520.78970223 1607.0050501  1436.46031383 1485.46152932
 1345.93190007 1725.92235667 1363.48429628 1631.55660023 1718.88234168]
```

Using this data, we would now like to calculate the percentage deviation between our predictions and the actual test values:

```
>>> deviation = (y_pred - y_test)/y_test
>>> deviation
22    0.088011
20   -0.066427
25   -0.040021
4     0.025311
10    0.018137
15   -0.129973
28    0.018844
11   -0.072460
18    0.017180
29    0.006961
```

We can see that in many instances, the deviation between our predictions and actual values is under 10%, which is a good start considering that our full dataset contains only 40 observations. While in a real-world scenario one would need more observations to establish the validity of the above relationships, a linear regression can shed significant insight into the nature of correlations between variables and their significance.
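For a more formal error metric, sklearn's mean_squared_error computes the mean squared error directly. A quick sketch with made-up values rather than the gasoline predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual vs predicted consumption values
y_actual = np.array([1600.0, 1500.0, 1700.0])
y_predicted = np.array([1580.0, 1530.0, 1690.0])

mse = mean_squared_error(y_actual, y_predicted)
rmse = np.sqrt(mse)  # root mean squared error, in the original units
print(mse, rmse)
```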

## Logistic Regression Modelling in Python

Now, let us take a look at how to run **logistic regressions** in Python.

There are three major types of variables we can encounter:

- **Nominal:** A variable with no ordering to its elements. This type of variable is categorical, and hence is commonly used for categorizing, e.g. male = 0, female = 1.
- **Ordinal:** A categorical variable whose elements are ordered. For instance, a ranking scale from 1 to 5 (with 1 being the lowest and 5 being the highest) is an ordinal variable.
- **Interval:** A variable whose elements are ordered and where the differences between elements are meaningful. For instance, speed operates on an interval scale: the difference between 70-80 km/h and 100-110 km/h is 10 km/h in both cases.

### What is a binary logistic regression?

A binary logistic regression is one where the dependent variable is **nominal**. Let’s see how we run and interpret regressions such as these.
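Under the hood, a logistic regression models the probability of the positive class via the logistic (sigmoid) function, P(y = 1) = 1/(1 + e^(-Xβ)). A minimal sketch of that mapping:

```python
import numpy as np

def sigmoid(z):
    """Map a linear predictor z (e.g. X @ beta) to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A linear predictor of 0 corresponds to an even 50% probability;
# larger values push the probability towards 1
print(sigmoid(0.0))  # 0.5
print(sigmoid(2.0))  # roughly 0.88
```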

Now, let’s take another example.

In this case, assume a stock analyst is trying to determine whether a stock pays a dividend or not, and wishes to classify accordingly.

The dataset we will use for this example is the **dividendinfo** dataset.

Here is the structure of our variables within the dataset:

**Dependent Variable (Y)**

- Dividends: If the company currently pays dividends, it is given a value of 1. If the company does not pay dividends, it is given a value of 0.

**Independent Variables (X)**

- fcfps: Free cash flow per share (in $)
- earnings_growth: Earnings growth in the past year (in %)
- de: Debt to Equity ratio
- mcap: Market Capitalization of the stock
- current_ratio: Current Ratio (or Current Assets/Current Liabilities)

Essentially, we wish to use the five input variables above to determine whether a stock will pay a dividend or not.

Firstly, we will import our relevant libraries, including the **LogisticRegression** and **train_test_split** from sklearn.

```python
import numpy as np
import statsmodels.api as sm
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import os

path = "/home/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)
```

### Model Generation

We generate our regression model, using the **dividend** variable as our dependent variable, and **fcfps**, **earnings_growth**, **de**, **mcap**, and **current_ratio** as our independent variables:

```python
variables = pd.read_csv('dividendinfo.csv')
dividend = variables['dividend']
freecashflow_pershare = variables['fcfps']
earnings_growth = variables['earnings_growth']
debt_to_equity = variables['de']
mcap = variables['mcap']
current_ratio = variables['current_ratio']

y = dividend
x = np.column_stack((freecashflow_pershare, earnings_growth, debt_to_equity, mcap, current_ratio))
x = sm.add_constant(x, prepend=True)
```

Now, we are going to split our data into training and test data, and compute scores for each.

```python
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
logreg = LogisticRegression().fit(x_train, y_train)

print("Training set score: {:.3f}".format(logreg.score(x_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(x_test, y_test)))
```

Let’s see what scores we get:

```
>>> print("Training set score: {:.3f}".format(logreg.score(x_train, y_train)))
Training set score: 0.920
>>> print("Test set score: {:.3f}".format(logreg.score(x_test, y_test)))
Test set score: 0.960
```

Now, we can use the training data to generate the output of the logistic regression itself:

```python
logit_model = sm.Logit(y_train, x_train)
result = logit_model.fit()
print(result.summary())
```

Upon running the above, we generate the following output:

```
Optimization terminated successfully.
         Current function value: 0.111698
         Iterations 10
>>> print(result.summary())
                           Logit Regression Results
==============================================================================
Dep. Variable:               dividend   No. Observations:                  150
Model:                          Logit   Df Residuals:                      144
Method:                           MLE   Df Model:                            5
Date:                Sat, 06 Oct 2018   Pseudo R-squ.:                  0.8385
Time:                        15:59:37   Log-Likelihood:                -16.755
converged:                       True   LL-Null:                       -103.76
                                        LLR p-value:                 1.018e-35
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const        -21.1447      6.084     -3.475      0.001     -33.069      -9.220
x1             1.7848      0.578      3.087      0.002       0.652       2.918
x2             0.1035      0.035      2.952      0.003       0.035       0.172
x3            -0.9209      0.538     -1.713      0.087      -1.975       0.133
x4             0.0246      0.007      3.476      0.001       0.011       0.039
x5             5.2756      1.790      2.948      0.003       1.768       8.783
==============================================================================
```

Of our predictors, all except debt to equity (x3) are indicated as statistically significant at the 1% significance level.

Note that the **Pseudo R-Squared** statistic provided in the output is not necessarily reliable as a measure of fit in logistic regressions.

This is because the degree of variation in a nominal dependent variable is a lot smaller than that of an interval one, and therefore a very large number of observations are needed for this statistic to have credence.

### Odds Ratios and Probabilities

Now, we need to obtain the odds ratio for the fcfps variable. An odds ratio is a ratio comparing the likelihood of an event happening to the likelihood of the event not happening.

For instance, the odds ratio for our fcfps (free cash flow per share) variable is **exp(1.7848) = 5.958**. This means that if a company's free cash flow per share increases by $1, then the odds of that company paying a dividend are **5.958:1**.

Now, let’s interpret the odds ratio in a way that makes more sense. We can convert the odds ratio to a probability as follows:

```python
probability = oddsratio/(1 + oddsratio)
probability
```

Therefore, we can then calculate that if a company's free cash flow per share increases by $1, the probability of that company paying a dividend is 85.63%:

**(5.958/(1+5.958)) = 0.8563**
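The same conversion can be scripted end-to-end from the coefficient (the value 1.7848 is the fcfps estimate from the Logit summary above):

```python
import numpy as np

coef = 1.7848                # fcfps coefficient from the Logit output
oddsratio = np.exp(coef)     # odds ratio, roughly 5.958
probability = oddsratio / (1 + oddsratio)
print(oddsratio, probability)
```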

### ROC Curves

However, in order to more formally analyse accuracy, we will generate what is called an **ROC Curve**.

Essentially, we are going to plot the true positive rate against the false positive rate to determine if our logistic regression is a better classifier than just a random guess.

The metric we will use to determine this is called **AUC (area under the curve)**. The higher the AUC, the better. If we have an AUC of 0.5, then this indicates that our classifier is no better than random guessing.
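As a quick sanity check of what AUC measures, sklearn's roc_auc_score can be run on toy labels and scores (all values below are made up):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]

# Scores that rank both positives above both negatives: a perfect classifier
scores_perfect = [0.1, 0.3, 0.7, 0.9]

# Constant scores carry no information about the labels: random guessing
scores_random = [0.5, 0.5, 0.5, 0.5]

print(roc_auc_score(y_true, scores_perfect))  # 1.0
print(roc_auc_score(y_true, scores_random))   # 0.5
```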

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

falsepos, truepos, thresholds = roc_curve(y_test, logreg.decision_function(x_test))

plt.plot(falsepos, truepos, label="ROC")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")

# Mark the threshold closest to zero on the curve
cutoff = np.argmin(np.abs(thresholds))
plt.plot(falsepos[cutoff], truepos[cutoff], 'o', markersize=10, label="cutoff", fillstyle="none")
plt.show()
```

Here is a plot of our ROC curve:

### AUC Value

Now, let’s compute our AUC value:

```python
from sklearn import metrics
metrics.auc(falsepos, truepos)
```

We see that we obtain an AUC of 0.9932:

```
>>> metrics.auc(falsepos, truepos)
0.9932088285229203
```

This means that our area under the curve is **99.32%**, and this indicates that our classifier has a very high degree of accuracy in predicting whether a stock pays a dividend or not.

It would also appear that adding more independent variables to the analysis increases the AUC. For instance, if we exclude the **mcap** and **current_ratio** variables from the analysis, the AUC drops slightly to 94.4%.

## Summary

In this tutorial, you have learned how to:

- Create and interpret linear regression output using statsmodels
- Use the Breusch-Pagan test to screen for heteroscedasticity
- Employ the sklearn library to generate regression output
- Test for multicollinearity using the Variance Inflation Factor test
- Generate logistic regression models, interpret odds ratios, and convert to probabilities
- Interpret accuracy of logistic regression models by using ROC curves