The **statsmodels** and **sklearn** libraries are frequently used to generate regression output in Python. That said, a user often needs to work with different libraries depending on the extent of the analysis.

Generally speaking, statsmodels is useful for generating all of the regression output in a single summary frame, while sklearn offers more flexibility with that output, e.g. extracting coefficients, running an auxiliary regression, and so on. In the examples below, we will look at how these libraries differ when generating regression output.

Before we get into how the **statsmodels** and **sklearn** libraries work, let’s run a very simple linear regression using the **numpy** and **scipy** libraries to get a better understanding of regression analysis.

## Linear Regression using numpy and scipy

```python
from scipy import stats
import numpy as np

x = [-0.018, -0.008, 0.011, 0.017, -0.008, -0.002]
y = [-0.006, -0.001, 0.015, 0.017, -0.0019, -0.005]

gradient, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print("Gradient and intercept", gradient, intercept)
print("R-squared", r_value**2)
print("p-value", p_value)
```

```
Gradient and intercept 0.724084177708 0.00398211223694
R-squared 0.854175903355
p-value 0.00839711859611
```

Let’s interpret the above output.

- **Intercept = 0.00398.** This is the predicted value of our y (dependent variable) when x equals 0.
- **Gradient = 0.72408.** For every unit increase in x, y increases by 0.72408.
- **R-Squared = 0.85417.** 85.41% of the variation in our y variable is “explained” by the x variable in this model.
- **p-value = 0.0083.** The p-value is below 0.05, denoting statistical significance of the regression at the 5% level.
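As a quick cross-check on the figures above, the same gradient and intercept can be recovered with numpy's `np.polyfit`, which performs the same least-squares line fit for degree 1:

```python
import numpy as np

# Same six observations as above
x = np.array([-0.018, -0.008, 0.011, 0.017, -0.008, -0.002])
y = np.array([-0.006, -0.001, 0.015, 0.017, -0.0019, -0.005])

# polyfit with degree 1 returns the slope first, then the intercept
gradient, intercept = np.polyfit(x, y, 1)
print(gradient, intercept)  # matches the linregress estimates above
```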

# Part 1: Run an OLS regression using statsmodels

Here, we will run an analysis on a hypothetical dataset of 49 stocks. The purpose of the regression is to determine the impact of dividends, earnings and debt to equity on stock returns. We are using a cross-sectional dataset in this instance, meaning that all data is collected at a specific point in time.

In this instance, we use the **ols_stock2** dataset (see end of post for details). To form a linear regression model, we set up our model in statsmodels as follows:

```python
import os

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

path = "C:/Users/michaeljgrogan/Documents/a_documents/computing/data science/datasets"
os.chdir(path)

variables = pd.read_csv('ols_stock2.csv')
stock_return = variables['stock_return']
dividend = variables['dividend']
earnings = variables['earnings']
debt_to_equity = variables['debt_to_equity']
marketcap = variables['marketcap']

x = np.column_stack((dividend, earnings, debt_to_equity))
x = sm.add_constant(x, prepend=True)

results = sm.OLS(stock_return, x).fit()
print(results.summary())
```

In addition to defining our individual variables, we stack the independent variables into a single x array using the **x = np.column_stack((dividend, earnings, debt_to_equity))** command.

This allows our model to take multiple parameters into account and generate detailed output: we first fit the model through **results = sm.OLS(y, x).fit()** and then generate a summary of the results through the **print(results.summary())** command.

Once we run this code, we generate the following output in Python:

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.017
Method:                 Least Squares   F-statistic:                     1.277
Date:                Mon, 05 Feb 2018   Prob (F-statistic):              0.294
Time:                        12:29:30   Log-Likelihood:                -228.45
No. Observations:                  49   AIC:                             464.9
Df Residuals:                      45   BIC:                             472.5
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          3.5733      8.284      0.431      0.668       -13.112   20.259
x1             3.1439      3.068      1.025      0.311        -3.034    9.322
x2            -0.3646      0.255     -1.429      0.160        -0.878    0.149
x3             4.0564      8.698      0.466      0.643       -13.462   21.575
==============================================================================
Omnibus:                       99.858   Durbin-Watson:                   1.855
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2938.634
Skew:                           5.817   Prob(JB):                         0.00
Kurtosis:                      39.110   Cond. No.                         72.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

As we can see, the statsmodels library allows us to generate highly detailed regression output, with additional statistics such as **skew**, **kurtosis**, **R-Squared** and **AIC**.

From the above, our regression equation is as follows:

**stock_return = 4.14320 + 2.23811(dividend) + 0.26361(earnings) + 11.88278(debt_to_equity)**

## Output Interpretation

Let’s interpret our regression output:

- If the dividend yield increases by 1%, then the return on the stock increases by 2.24%.
- If the earnings growth increases by 1%, then the return on the stock increases by 0.26%.
- If the debt to equity ratio increases by 1%, then the return on the stock increases by 11.88%.
- The R-Squared value is 0.6439, meaning 64.39% of the variation in stock returns is “explained” by our predictors.
- The dividend variable is significant at the 5% level, earnings is significant at the 1% level, while debt to equity is significant at all levels.

## Detecting heteroscedasticity using statsmodels

One of the issues that we face when running a linear regression is that of heteroscedasticity.

**Heteroscedasticity** is a term used to describe a condition where the variance of the error term is not constant across observations.

Let’s take the following example. Suppose that you are conducting an analysis on incomes across the United States. Chances are, you will find that incomes in New York are significantly higher than in North Dakota.

Given that major cities are likely to have higher populations and thus higher incomes, the residuals across your observations will be unevenly distributed. In this instance, you will find that the variance of your error term gets larger across cities with higher populations.

In order to correct for this condition, it is necessary to test for heteroscedasticity using the Breusch-Pagan test, and then remedy the condition if it exists.

Let’s go back to our stock example.

Remember that our first regression (reg1) was structured as follows:

```
stock_return ~ dividend + earnings + debt_to_equity
```

We first import the test from statsmodels and then compute the Breusch-Pagan statistic using the **statsmodels.stats.diagnostic.het_breuschpagan** function:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

name = ['Lagrange multiplier statistic', 'p-value', 'f-value', 'f p-value']
bp = het_breuschpagan(results.resid, results.model.exog)
print(dict(zip(name, bp)))
```

Since our p-value is less than 0.05, this indicates that heteroscedasticity is present, and we reject the null hypothesis of homoscedasticity.

There are several ways to correct for this. Some of the solutions, such as heteroscedasticity-corrected standard errors, are more technical in nature.

However, let’s see if we can rescale our data to correct for this issue. Remember that the level of earnings growth differs across our stocks. In this regard, it may be the case that returns are biased towards stocks with higher earnings.

Let’s see what happens if we scale returns to a constant rate of earnings growth:

```python
stock_return_scaled = stock_return / earnings
```

Then, we substitute this variable into our prior statsmodels code:

```python
results = sm.OLS(stock_return_scaled, x).fit()
print(results.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.017
Method:                 Least Squares   F-statistic:                     1.277
Date:                Sun, 04 Feb 2018   Prob (F-statistic):              0.294
Time:                        13:33:11   Log-Likelihood:                -228.45
No. Observations:                  49   AIC:                             464.9
Df Residuals:                      45   BIC:                             472.5
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          3.5733      8.284      0.431      0.668       -13.112   20.259
x1             3.1439      3.068      1.025      0.311        -3.034    9.322
x2            -0.3646      0.255     -1.429      0.160        -0.878    0.149
x3             4.0564      8.698      0.466      0.643       -13.462   21.575
==============================================================================
Omnibus:                       99.858   Durbin-Watson:                   1.855
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2938.634
Skew:                           5.817   Prob(JB):                         0.00
Kurtosis:                      39.110   Cond. No.                         72.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

We now see that all of our variables have become insignificant, with p-values well above 0.05, and the R-Squared statistic has dropped to a minuscule 0.078.

Moreover, let’s run our Breusch-Pagan test for heteroscedasticity again and see what we come up with:

Now, we see that the p-value has increased to **0.1855**. This is higher than 0.05, so we can no longer reject the null hypothesis of homoscedasticity.

In this regard, the Breusch-Pagan test has confirmed that our prior results were being influenced by heteroscedasticity and were therefore unreliable.

## Using sklearn to test for multicollinearity

It is also possible that correlations exist between our predictors. This is a condition known as multicollinearity, and this has the potential to make our results invalid. Therefore, we test for this condition using a Variance Inflation Factor (VIF) test. If our **VIF statistic > 10**, then there exists an issue with multicollinearity.

To do this, we run a VIF on the reg1 regression model that we created. The Variance Inflation Factor is calculated as follows:

**VIF = 1/(1-R^2)**

Note that R² here is the statistic obtained from the auxiliary regression, i.e. the one where an independent variable is regressed against the other independent variables.

Let’s illustrate this within our code. Suppose that we wish to regress dividend against earnings and debt to equity. This is our auxiliary regression, and we are obtaining the R-Squared statistic from this and calculating the VIF statistic for the dividend variable:

```python
from sklearn import linear_model
from sklearn.metrics import r2_score

# Auxiliary regression: dividend against earnings and debt to equity
z = np.column_stack((earnings, debt_to_equity))
z = sm.add_constant(z, prepend=True)

lm = linear_model.LinearRegression()
model = lm.fit(z, dividend)
predictions = lm.predict(z)

rsquareddividend = r2_score(dividend, predictions)
vifdividend = 1 / (1 - rsquareddividend)
print(vifdividend)
```

You can see there’s quite a lot going on here, so let’s try to simplify things.

We define a stack of earnings and debt to equity (our predictor variables) as the variable z. We then regress the dividend variable on z (our auxiliary regression), obtain the R-Squared value from the resulting predictions, and calculate the VIF statistic from that.

```
vifdividend
1.2844298745832432
```

Since the VIF statistic for the dividend variable is less than 10, this indicates that no multicollinearity exists in this variable. To check the others, we simply select the relevant variable as the dependent one and regress it against the rest. Upon doing this, we find that the VIF statistics for earnings and debt to equity are **1.258** and **1.568** respectively. Therefore, the VIFs for these variables also indicate that no multicollinearity exists in our model.

In the same way, structuring earnings and debt to equity as the respective dependent variables in the auxiliary regression yields the VIF statistic for each one.

# Part 2: Logistic Regression Modelling in Python

Now, let us take a look at how to run **logistic regressions** in Python.

There are three major types of variables we can encounter:

- **Nominal:** One where there is no order to the elements. This type of variable is known as categorical, and hence is commonly used for categorising, e.g. male = 0, female = 1, etc.
- **Ordinal:** A categorical variable for which the elements are ordered. For instance, a ranking scale from 1-5 (with 1 being the lowest and 5 being the highest) is an ordinal variable.
- **Interval:** A variable where the elements are ordered and the differences between them are meaningful. For instance, speed operates on an interval scale: the difference between 70-80 km/h and 100-110 km/h is 10 km/h in both instances.

A binary logistic regression is one where the dependent variable is **nominal**. Let’s see how we run and interpret regressions such as these.

Here is the structure of our variables within the **binomial_stock** dataset (see end of post):

**Dependent Variable (Y)**

- Dividends (dummy variable) – if the company pays dividends currently it is given a ranking of 1. If the company does not pay dividends, it is given a ranking of 0.

**Independent Variables (X)**

- Years: The number of years of consecutive dividend increases.
- Earnings Estimates: If the stock beat earnings for the past four consecutive quarters, it is given a ranking of 1. Otherwise, it is given a ranking of 0.

Essentially, we wish to use the input variables **years** and **earnings estimates** to determine whether a stock will pay a dividend or not.

Firstly, we will import our relevant libraries, including the **LogisticRegression** and **train_test_split** from sklearn.

```python
import os

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

path = "C:/Users/michaeljgrogan/Documents/YourDirectory"
os.chdir(path)
```

The code and datasets we will use for this example are **logisticregression.ipynb** and **binomial_stock.csv**, available from the MGCodesandStats Github Repository. Then, we generate our regression model, using the **dividend** variable as our dependent variable, and **years** and **earnings estimates** as our independent variables:

```python
variables = pd.read_csv('binomial_stock.csv')
years = variables['years']
dividend = variables['dividend']
earnings_estimates = variables['earnings_estimates']

y = dividend
x = np.column_stack((years, earnings_estimates))
x = sm.add_constant(x, prepend=True)

logit_model = sm.Logit(y, x)
result = logit_model.fit()
print(result.summary())
```

Upon running the above, we generate the following output:

```
Optimization terminated successfully.
         Current function value: 0.162588
         Iterations 10
                           Logit Regression Results
==============================================================================
Dep. Variable:               dividend   No. Observations:                   70
Model:                          Logit   Df Residuals:                       67
Method:                           MLE   Df Model:                            2
Date:                Sat, 03 Feb 2018   Pseudo R-squ.:                  0.7654
Time:                        21:24:55   Log-Likelihood:                -11.381
converged:                       True   LL-Null:                       -48.520
                                        LLR p-value:                 7.425e-17
==============================================================================
                 coef    std err          z      P>|z|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        -10.6858      3.441     -3.105      0.002       -17.431   -3.941
x1             0.8919      0.297      2.999      0.003         0.309    1.475
x2             0.3953      1.100      0.359      0.719        -1.761    2.552
==============================================================================

Possibly complete quasi-separation: A fraction 0.29 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
```

Of our predictors, we see that years is indicated as statistically significant at the 1% significance level.

Note that the **Pseudo R-Squared** statistic provided in the output is not necessarily reliable as a measure of fit in logistic regressions. This is because the degree of variation in a nominal dependent variable is a lot smaller than that of an interval one, and therefore a very large number of observations are needed for this statistic to have credence.

## Odds Ratios and Probabilities

Now, we need to obtain the odds ratio for the years variable. An odds ratio compares the likelihood of an event happening to the likelihood of the event not happening.

For instance, the odds ratio for our years variable is **exp(0.8919) = 2.439**. This means that for each additional year of consecutive dividend increases, the odds of the company paying a dividend are multiplied by **2.439**.

Now, let’s interpret the odds ratio in a way that makes more sense. We can convert the odds ratio to a probability as follows:

```python
oddsratio = np.exp(0.8919)
probability = oddsratio / (1 + oddsratio)
print(probability)
```

Therefore, we can calculate that an odds ratio of 2.439 corresponds to a probability of 70.93%:

**(2.439/(1+2.439)) = 0.7093**

Note that while we ran a logistic regression on all our observations, we can also split the data into training and test data using train_test_split.
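The post does not show the split-and-fit step itself; a minimal self-contained sketch follows, with synthetic data standing in for the binomial_stock dataset and hypothetical names (`x_train`, `x_test`, `logreg`) mirroring those used in the score printout:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 70 rows, 2 predictors, binary outcome
rng = np.random.default_rng(0)
x = rng.normal(size=(70, 2))
y = (x[:, 0] + 0.3 * x[:, 1] + rng.normal(scale=0.5, size=70) > 0).astype(int)

# Hold out a test set (default split is 75% train / 25% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
logreg = LogisticRegression().fit(x_train, y_train)

print("Training set score: {:.3f}".format(logreg.score(x_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(x_test, y_test)))
```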

```python
print("Training set score: {:.3f}".format(logreg.score(x_train, y_train)))
print("Test set score: {:.3f}".format(logreg.score(x_test, y_test)))
```

```
Training set score: 0.885
Test set score: 0.944
```

We see that we obtained high training and test scores across our data, indicating that the logistic regression showed a high degree of accuracy in determining whether a stock would pay a dividend or not.

# Summary

In this tutorial, you have learned how to:

- Create and interpret linear regression output using statsmodels
- Use the Breusch-Pagan test to screen for heteroscedasticity
- Employ the sklearn library to generate regression output
- Test for multicollinearity using the Variance Inflation Factor test
- Generate logistic regression models, interpret odds ratios, and convert to probabilities