We use linear models primarily to analyse cross-sectional data; i.e. data collected at one specific point in time across several observations. We can also use such models with time series data, but need to be cautious of issues such as serial correlation.

Data can generally take on one of three formats:

- **Nominal:** A variable classified into categories that have no particular ordering, e.g. male or female, black or white.

- **Ordinal:** A variable classified according to a ranking, e.g. rating your favourite food on a scale where 1 is the lowest and 10 is the highest.

- **Interval:** A variable where the distance between two observations is meaningful and continuous, rather than falling into a category, e.g. temperature, speed.
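In R, these formats map naturally onto data types: nominal variables become unordered factors, ordinal variables become ordered factors, and interval variables stay numeric. A small sketch (the values below are invented for illustration):

```r
# Nominal: categories with no ordering
sector <- factor(c("tech", "energy", "tech"))

# Ordinal: categories with a ranking (1 lowest, 10 highest)
rating <- factor(c(1, 10, 5), levels = 1:10, ordered = TRUE)

# Interval: continuous numeric values
temperature <- c(21.5, 18.2, 25.0)

is.ordered(rating)  # TRUE
is.ordered(sector)  # FALSE
```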

# Linear Modelling: Ordinary Least Squares regressions

Let us first start with Ordinary Least Squares regression, a common type of linear model used when the dependent variable is measured on an interval (continuous) scale. To run this type of regression in R, we use the **lm** command.

As an example, we seek to identify the determinants of the **one-year return** (our dependent variable) in a hypothetical dataset of 49 stocks. Our independent variables are **dividend yield**, **earnings ranking**, and **debt to equity ratio**.

The purpose of Ordinary Least Squares is to reveal the effect on the dependent variable when an independent variable is changed by 1 unit. In other words, when we increase X by 1 unit, by how many units does Y increase or decrease?

The Ordinary Least Squares regression has the following structure:

Dependent Variable = Intercept + B1X1 + B2X2 + ... + BnXn + Error Term

We denote our independent variables by **X**, each multiplied by a coefficient **B** that the regression estimates. The **intercept** is the expected value of the dependent variable when all independent variables are set to 0. We interpret our **error term** as the degree of variation in the dependent variable not explained by our model.
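To see these pieces concretely, here is a minimal sketch fitting **lm** on simulated data (the variables and coefficients below are invented for illustration, not taken from the stock dataset):

```r
# Simulate data with a known structure: intercept 2, slopes 3 and -1
set.seed(42)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- 2 + 3 * x1 - 1 * x2 + rnorm(100, sd = 0.1)  # last term is the error

fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates land close to the true values 2, 3 and -1
```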

We choose the following dependent (Y) and independent variables for this regression:

**Dependent Variable (Y)**

- 1-year stock return in basis points: A 1000 basis point increase equals a 10% stock return in the past year.

**Independent Variables (X)**

- Dividends (dummy variable): We assign a ranking of 1 to companies that currently pay dividends. If dividends are not paid, we assign a ranking of 0.

- Earnings ranking: We rank each stock from 1 to 49 based on its most recent quarterly earnings performance. In other words, we assign a ranking of 1 to the company that increased earnings the most in a given quarter, and a ranking of 49 to the company whose earnings increased the least.

- Debt to Equity: We report the debt-to-equity ratio of each company as standard in percentage form.

## Ordinary Least Squares: Regression and Output

```
reg1 <- lm(stock_return ~ dividend + earnings_ranking + debt_to_equity)
summary(reg1)

Call:
lm(formula = stock_return ~ dividend + earnings_ranking + debt_to_equity)

Residuals:
    Min      1Q  Median      3Q     Max 
-411.55  -65.42    7.60   99.18  286.14 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       5001.194    132.231  37.822   <2e-16 ***
dividend           -45.946     98.521  -0.466    0.643    
earnings_ranking   -97.695      3.156 -30.951   <2e-16 ***
debt_to_equity     -26.950     66.670  -0.404    0.688    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 153.8 on 45 degrees of freedom
Multiple R-squared:  0.9882,  Adjusted R-squared:  0.9875 
F-statistic:  1260 on 3 and 45 DF,  p-value: < 2.2e-16
```

From the output, we see that the earnings_ranking variable is statistically significant at the 5% level of significance (i.e. the 95% confidence level). This suggests that earnings have a material effect on stock returns.

Additionally, the R-Squared value of 0.9882 is very high - meaning that **98.82%** of the variation in stock returns is allegedly "explained" by this model.

We now test for multicollinearity and heteroscedasticity to ensure that the statistical readings of our model are reliable. Suspiciously high t-statistics and R-Squared values, such as those above, can be a sign that these issues are biasing our standard errors.

## Multicollinearity

- Multicollinearity exists when two or more independent variables in the linear model are strongly correlated, to the extent that including both in the regression may distort our OLS estimates.

- To test for multicollinearity, we use the Variance Inflation Factor (VIF) test across our independent variables. A common rule of thumb is that a reading of 5 or greater indicates the presence of multicollinearity.

- Since none of the VIF readings below reach 5, we conclude that our model does not suffer from multicollinearity.

```
library(car)

reg1m <- lm(dividend ~ earnings_ranking + debt_to_equity)
summary(reg1m)
reg2m <- lm(earnings_ranking ~ dividend + debt_to_equity)
summary(reg2m)
reg3m <- lm(debt_to_equity ~ earnings_ranking + dividend)
summary(reg3m)

> vif(reg1m)
earnings_ranking   debt_to_equity 
        2.215381         2.215381 
> vif(reg2m)
      dividend debt_to_equity 
      2.696695       2.696695 
> vif(reg3m)
earnings_ranking         dividend 
               4                4 
```
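As a cross-check, a predictor's VIF can be computed by hand as 1/(1 − R²) from the auxiliary regression of that predictor on the others; this is the same quantity that car's vif() reports when called on the fitted model. A sketch on simulated, deliberately correlated data (variable names here are illustrative, not from the tutorial's dataset):

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)  # deliberately correlated with x1
x3 <- rnorm(200)

# VIF for x1 by hand: regress x1 on the remaining predictors,
# then apply 1 / (1 - R^2)
aux <- lm(x1 ~ x2 + x3)
vif_x1 <- 1 / (1 - summary(aux)$r.squared)
vif_x1  # well above 1, reflecting the built-in correlation
```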

## Heteroscedasticity

- Heteroscedasticity occurs when the variance of the error term is not constant across our observations.

- For instance, let us assume that we have a mixture of small and large-cap stocks across our observations. This means that we have an uneven variance; i.e. returns will move unevenly depending on the size of the company.

- We test for heteroscedasticity using the Breusch-Pagan test at the 5% level of significance. With a p-value of **0.001163**, our test is significant, indicating that heteroscedasticity is present in our model.

```
# Redefine variables
library(lmtest)  # provides bptest()
Y <- cbind(stock_return)
X <- cbind(dividend, earnings_ranking, debt_to_equity)

> bptest(Y ~ X)

        studentized Breusch-Pagan test

data:  Y ~ X
BP = 15.947, df = 3, p-value = 0.001163
```

Heteroscedasticity may bias our standard errors, which in turn affects our readings of whether or not our variables are statistically significant. One possible way to rid our model of heteroscedasticity is by redefining our variables, i.e. scaling our stock returns to assume a constant market capitalisation. Here, we scale all returns to a $100 million capitalisation. For instance, for the first stock in the dataset we scale a return of 691 basis points from a market cap of $185m to $100m: **(691/185)*100 = 373.51.**
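The scaling step can be sketched directly in R (the 691-basis-point return and $185m market cap are the figures from the example above):

```r
# Scale a return to a constant $100m market capitalisation
return_bps <- 691
market_cap <- 185  # in $ millions

scaled <- (return_bps / market_cap) * 100
round(scaled, 2)  # 373.51
```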

With a p-value of **0.06982**, our Breusch-Pagan test becomes insignificant at the 5% level. This indicates that we have corrected for heteroscedasticity:

```
# Redefine variables
Y <- cbind(stock_return_scaled)
X <- cbind(dividend, earnings_ranking, debt_to_equity)

> bptest(Y ~ X)

        studentized Breusch-Pagan test

data:  Y ~ X
BP = 7.0663, df = 3, p-value = 0.06982
```

## Updated Regression Output

```
reg2 <- lm(stock_return_scaled ~ dividend + earnings_ranking + debt_to_equity)
summary(reg2)

Call:
lm(formula = stock_return_scaled ~ dividend + earnings_ranking + debt_to_equity)

Residuals:
    Min      1Q  Median      3Q     Max 
-142.56  -53.63  -15.25   22.21  603.66 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       681.402    103.270   6.598 4.03e-08 ***
dividend         -102.704     76.943  -1.335 0.188653    
earnings_ranking  -10.147      2.465  -4.116 0.000162 ***
debt_to_equity   -182.134     52.068  -3.498 0.001068 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 120.1 on 45 degrees of freedom
Multiple R-squared:  0.3573,  Adjusted R-squared:  0.3144 
F-statistic: 8.338 on 3 and 45 DF,  p-value: 0.0001616
```

## Model Interpretation

Y = 681.402 - 102.704(dividend) - 10.147(earnings_ranking) - 182.134(debt_to_equity)

According to the output:

- A stock that pays a dividend corresponds to a drop of **102.704** basis points in that stock's return.

- A one-place increase in earnings ranking (i.e. a relatively weaker earnings performance) corresponds to a drop of **10.147** basis points in a stock's return.

- When we increase the debt/equity ratio by 1, we observe a drop of **182.134** basis points in a stock's return. On an incremental basis (182.134/100), an increase of 0.01 in the debt/equity ratio corresponds to a drop of roughly 1.82 basis points.

However, note that the dividend variable is statistically insignificant. In general, it can be a bad idea to drop an insignificant variable when it has theoretical relevance to the dependent variable, as is the case here.

If we choose to drop the dividend variable, then we come up with the following output:

```
Call:
lm(formula = stock_return_scaled ~ earnings_ranking + debt_to_equity)

Residuals:
    Min      1Q  Median      3Q     Max 
-138.86  -70.46  -15.50   32.15  619.21 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       605.958     87.161   6.952 1.07e-08 ***
earnings_ranking   -7.907      1.821  -4.342 7.68e-05 ***
debt_to_equity   -213.532     46.845  -4.558 3.81e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 121.1 on 46 degrees of freedom
Multiple R-squared:  0.3318,  Adjusted R-squared:  0.3028 
F-statistic: 11.42 on 2 and 46 DF,  p-value: 9.384e-05
```

## Should We Keep or Discard Insignificant Variables?

In general, the theoretical relevance of a variable should be the deciding factor in whether we keep it in or drop it from a linear model.

Once we drop the dividend variable, the effect of the other variables grows larger in terms of their coefficients. Naturally, there are far more than three independent variables that could affect a stock's return.

If we exclude theoretically important variables, we risk overstating the effect of the remaining variables on the dependent variable. The general consensus is that dropping insignificant variables from a model is not ideal, with the exception of dropping variables for technical reasons such as multicollinearity.

Quora: Why is dropping insignificant predictors from a regression model a bad idea?

# Generalized Linear Modelling: Binomial Logistic Regressions

A binomial logistic regression is a type of **Generalized Linear Model**, run using the **glm** command in R. Its purpose is to regress a dependent variable that can only take on a finite set of values, here 0 or 1. In this instance, we wish to calculate an odds ratio for our regression - i.e. what are the odds that a stock pays a dividend given that another event takes place?

This generalized linear model is used when our dependent variable is a nominal variable. In the previous example using Ordinary Least Squares, the stock return was an interval variable, as it could take on a continuous range of values. Let us now assume that we instead have a nominal (or dummy) variable as our dependent variable. Here is the structure of our variables within the glm model:

**Dependent Variable (Y)**

- Dividends (dummy variable): If the company currently pays dividends, it is given a ranking of 1. If the company does not pay dividends, it is given a ranking of 0.

**Independent Variables (X)**

- Years: The number of years of consecutive dividend increases.

- Earnings Estimates: If the stock beat earnings estimates for the past four consecutive quarters, it is given a ranking of 1. Otherwise, it is given a ranking of 0.

Using the binomial logistic regression, we wish to analyse the likelihood of a company paying a dividend based on its years of consecutive dividend increases and on whether it beat earnings estimates in the past four quarters. Note that in this case, we are not looking to obtain a specific coefficient value but rather an odds ratio: a ratio that tells us the likelihood of a dividend payment associated with any independent variable.
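The call has the same shape as lm, with the family = "binomial" argument added. A minimal sketch on simulated data (the variable names mirror the tutorial's, but the data below is invented):

```r
set.seed(7)
n <- 70
years <- rpois(n, 4)                     # years of consecutive increases
earnings_estimates <- rbinom(n, 1, 0.5)  # beat estimates indicator

# Simulate a 0/1 dividend indicator whose log-odds rise with `years`
log_odds <- -2 + 0.6 * years + 0.4 * earnings_estimates
dividend <- rbinom(n, 1, plogis(log_odds))

fit <- glm(dividend ~ years + earnings_estimates, family = "binomial")
summary(fit)    # note z-statistics instead of t-statistics
exp(coef(fit))  # exponentiated coefficients are odds ratios
```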

## glm Regression Model

```
> logit1 <- glm(dividend ~ years + earnings_estimates, family = "binomial", data = mydata)
> summary(logit1)

Call:
glm(formula = dividend ~ years + earnings_estimates, family = "binomial", 
    data = mydata)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.18463  -0.22497  -0.01288   0.00776   1.96192  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)   
(Intercept)        -10.6858     3.4413  -3.105   0.0019 **
years                0.8919     0.2974   2.999   0.0027 **
earnings_estimates   0.3953     1.1003   0.359   0.7194   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 97.041  on 69  degrees of freedom
Residual deviance: 22.762  on 67  degrees of freedom
AIC: 28.762

Number of Fisher Scoring iterations: 9
```

Note that we do not have an R-Squared statistic in this instance: the usual R-Squared is not defined for a binary dependent variable, so glm output reports deviance and AIC instead. Our "years" variable is statistically significant at the 5% level, whereas the earnings_estimates variable is statistically insignificant.

To calculate an odds ratio, we take the exponent of the coefficient in question, since the coefficients of a binomial logit model are expressed in log-odds terms. As our **years** variable is statistically significant at the 5% level, this is the coefficient that we exponentiate:

```
> exp(0.8919)
[1] 2.439761
```

We have calculated an odds ratio of **2.439:1** for the years variable, which means that each additional year of consecutive dividend increases multiplies the odds of the company paying a dividend by roughly 2.44.

Moreover, we can convert these odds into a probability as follows:

probability = odds / (1 + odds)

In the case of odds of 2.439761, we yield a probability of **70.93%**, where *2.439761/(1+2.439761) = 0.7093*.
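In R, this conversion is a one-liner; plogis gives the same result directly from the log-odds coefficient:

```r
odds <- exp(0.8919)        # odds ratio for the `years` coefficient
prob <- odds / (1 + odds)  # odds-to-probability conversion
round(prob, 4)             # 0.7093

# Equivalently, the logistic CDF applied to the coefficient itself
all.equal(prob, plogis(0.8919))  # TRUE
```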

## Code Scripts and Datasets

Hope you enjoyed this tutorial!

The full code is available by **subscribing to my mailing list**.

Upon subscription, you will receive full access to the code and datasets for my tutorials, as well as a comprehensive course in regression analysis in both Python and R.