Linear Models in R: OLS and Logistic Regressions

We use linear models primarily to analyse cross-sectional data; i.e. data collected at one specific point in time across several observations. We can also use such models with time series data, but need to be cautious of issues such as serial correlation.

Data can generally take on one of three formats:

  • Nominal: A variable that is classified by certain categories that have no particular ordering, e.g. male or female, black or white, etc.
  • Ordinal: Variables that are classified according to a ranking system; e.g. rating your favourite food, with a rating of 1 being the lowest and 10 being the highest.
  • Interval: A variable where the distance between two observations is continuous and not subject to a particular category, e.g. temperature, speed, etc.

Linear Modelling: Ordinary Least Squares regressions

Let us first start with Ordinary Least Squares regressions, which is a common type of linear model used when the dependent variable is either in interval or scale form. To run this type of regression in R, we use the lm command.

Using an example, we seek to identify the determinants of variation on the one-year return of a hypothetical dataset of 49 stocks (our dependent variable). Our independent variables are dividend yield, earnings ranking, and debt to equity ratio.

The purpose of Ordinary Least Squares is to reveal the effect on the dependent variable when the independent variable is changed by 1 unit. For example, when we increase X by 1 unit, by how many units does X increase/decrease?

The Ordinary Least Squares regression has the following structure:

Dependent Variable = Intercept + X1 + X2 + ... + Xn + Error Term

We denote our independent variables by X. The intercept is the minimum value of the dependent variable if all independent variables are set to 0. We interpret our error term as the degree of variation in the dependent variable not explained by our model.

We choose the following dependent (Y) and independent variables for this regression:
Dependent Variable (Y)

  • 1-year stock return in basis points: A 1000 basis point increase equals a 10% stock return in the past year.

Independent Variables (X)

  • Dividends (dummy variable): We assign a ranking of 1 to companies that currently pay dividends. If dividends are not paid, we assign a ranking of 0.
  • Earnings ranking: We rank each stock from 1 to 49 based on their most recent quarterly earnings performance. In other words, we assign a ranking of 1 to the company that increased earnings the most in a given quarter, while we assign a ranking of 49 to the company with the least earnings.
  • Debt to Equity: We report the debt-to-equity ratio of each company as standard in percentage form.


Ordinary Least Squares: Regression and Output

reg1 <- lm(stock_return ~ dividend + earnings_ranking + debt_to_equity)

lm(formula = stock_return ~ dividend + earnings_ranking + debt_to_equity)

    Min      1Q  Median      3Q     Max 
-411.55  -65.42    7.60   99.18  286.14 

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5001.194    132.231  37.822   <2e-16 ***
dividend          -45.946     98.521  -0.466    0.643    
earnings_ranking  -97.695      3.156 -30.951   <2e-16 ***
debt_to_equity    -26.950     66.670  -0.404    0.688    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 153.8 on 45 degrees of freedom
Multiple R-squared:  0.9882,	Adjusted R-squared:  0.9875 
F-statistic:  1260 on 3 and 45 DF,  p-value: < 2.2e-16

When we yield our output, we see the earnings_ranking variable is statistically significant at the 5% level of significance (or 95% confidence interval). This means that earnings have a material effect on stock returns.

Additionally, the R-Squared value of 0.9882 is very high - meaning that 98.82% of the variation in stock returns is allegedly "explained" by this model.

We now test for multicollinearity and heteroscedasticity to ensure that the statistical readings of our model are reliable. If we find high t-statistics and R-Squared values in our output, it means our standard errors may be biased.


  • Multicollinearity exists when two independent variables in the linear model are significantly related, to the extent that inclusion of both variables in a regression model may skew our OLS estimates.
  • To test for multicollinearity, we use the Variance Inflation Factor (VIF) test across our independent variables. We determine that a reading of 5 or greater indicates the presence of multicollinearity.
  • From the below, given that we did not calculate VIF readings of 5 or greater, this indicates that our model does not suffer from multicollinearity.
reg1m <- lm(dividend ~ earnings_ranking + debt_to_equity)
reg2m <- lm(earnings_ranking ~ dividend + debt_to_equity)
reg3m <- lm(debt_to_equity ~ earnings_ranking + dividend)
> vif(reg1m)
earnings_ranking   debt_to_equity 
        2.215381         2.215381
> vif(reg2m)
      dividend debt_to_equity 
      2.696695       2.696695
> vif(reg3m)
earnings_ranking         dividend 
               4                4 


  • Heteroscedasticity occurs when we have an uneven variance across our observations.
  • For instance, let us assume that we have a mixture of small and large-cap stocks across our observations. This means that we have an uneven variance; i.e. returns will move unevenly depending on the size of the company.
  • We test for heteroscedasticity using the Breusch-Pagan test at the 5% level of significance. With a p-value of 0.001163 our test is significant and this indicates that heteroscedasticity is present in our model.
# Redefine variables 
Y <- cbind(stock_return)
X <- cbind(dividend, earnings_ranking, debt_to_equity)

> bptest(Y ~ X)

	studentized Breusch-Pagan test

data:  Y ~ X
BP = 15.947, df = 3, p-value = 0.001163

Heteroscedasticity may cause bias in our standard errors. This affects our readings of whether or not our variables are statistically significant. One possible way to rid our model of heteroscedasticity is by redefining our variables, i.e. scaling our stock returns to assume a constant market capitalisation. We scale our returns to assume that all are based on a $100 million capitalisation. For instance, in the first column we are scaling the stock return of 691 basis points from a market cap of $185m to $100m: (691/185)*100 = 373.51.

linear model

With a p-value of 0.06982, our Breusch-Pagan test becomes insignificant at the 5% level. This indicates that we have corrected for heteroscedasticity:

# Redefine variables 
Y <- cbind(stock_return_scaled)
X <- cbind(dividend, earnings_ranking, debt_to_equity)

> bptest(Y ~ X)

	studentized Breusch-Pagan test

data:  Y ~ X
BP = 7.0663, df = 3, p-value = 0.06982

Updated Regression Output

reg2 <- lm(stock_return_scaled ~ dividend + earnings_ranking + debt_to_equity)

lm(formula = stock_return_scaled ~ dividend + earnings_ranking + 

    Min      1Q  Median      3Q     Max 
-142.56  -53.63  -15.25   22.21  603.66 

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       681.402    103.270   6.598 4.03e-08 ***
dividend         -102.704     76.943  -1.335 0.188653    
earnings_ranking  -10.147      2.465  -4.116 0.000162 ***
debt_to_equity   -182.134     52.068  -3.498 0.001068 ** 
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 120.1 on 45 degrees of freedom
Multiple R-squared:  0.3573,	Adjusted R-squared:  0.3144 
F-statistic: 8.338 on 3 and 45 DF,  p-value: 0.0001616

Model Interpretation

Y = 681.402 - 102.704(dividend) - 10.147(earnings_ranking) - 182.134(debt_to_equity)

According to the output:

  • A stock that pays a dividend corresponds to a -102.704 drop in that stock's return in basis points.
  • A higher earnings ranking (or lower earnings rate in ranking terms) corresponds to a -10.147 drop in a stock's return in basis points.
  • When we increase the debt/equity ratio by 1, we observe a -182.134 drop in a stock's return in basis points. On an incremental basis (-182.134/100), an increase of 0.01 in the debt/equity ratio corresponds to a -1.82 drop in percentage terms in overall stock returns.

However, note that the dividend variable shows statistical insignificance. In general, there are times where it is a bad idea to drop insignificant variables when it has theoretical relevance to the dependent variable, as is the case here.

If we choose to drop the dividend variable, then we come up with the following output:

lm(formula = stock_return_scaled ~ earnings_ranking + debt_to_equity)

    Min      1Q  Median      3Q     Max 
-138.86  -70.46  -15.50   32.15  619.21 

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       605.958     87.161   6.952 1.07e-08 ***
earnings_ranking   -7.907      1.821  -4.342 7.68e-05 ***
debt_to_equity   -213.532     46.845  -4.558 3.81e-05 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 121.1 on 46 degrees of freedom
Multiple R-squared:  0.3318,	Adjusted R-squared:  0.3028 
F-statistic: 11.42 on 2 and 46 DF,  p-value: 9.384e-05

Should We Keep or Discard Insignificant Variables?

In general, theoretical relevance of a variable should always be the deciding factor in deciding whether we should keep or drop said variable from a linear model.

Once we drop the dividend variable, the effect of the other variables grows larger in terms of their coefficients. Naturally, there are far more than three independent variables that could affect a stock's return.

If we exclude theoretically important variables, this could overstate the effect of remaining variables on the dependent variable. The general consensus is that dropping insignificant variables from a model is not ideal. An exception being that we should drop variables for technical reasons such as multicollinearity.

Quora: Why is dropping insignificant predictors from a regression model a bad idea?

Generalized Linear Modelling: Binomial Logistic Regressions

A binomial logistic regression is a type of Generalized Linear Model, run using the glm command in R. The purpose of a binomial logistic regression is to regress a dependent variable that can only take on finite values. In this instance, we wish to calculate an odds ratio for our regression - i.e. what is the probability that a stock will increase its dividend payments based on another event taking place? Here, we use the glm function to construct a binomial logistic regression in R.

This generalized linear regression model is used when we wish to analyse a regression where our dependent variable is a nominal variable. For instance, in the previous example using Ordinary Least Squares we saw that stock returns was an interval variable as it could take on a continuous range. However, let us assume that we now wish to have a nominal (or dummy) variable as our dependent variable. Here is the structure of our variables within the glm model:

Dependent Variable (Y)

- Dividends (dummy variable) - if the company pays dividends currently it is given a ranking of 1. If the company does not pay dividends, it is given a ranking of 0.

Independent Variables (X)

- Years: The number of years of consecutive dividend increases.

- Earnings Estimates: If the stock beat earnings for the past four consecutive quarters, it is given a ranking of 1. Otherwise, it is given a ranking of 0.

Using the binomial logistic regression, we wish to analyse the likelihood of a company paying a dividend based on its years of consecutive dividend increases along with the beating of earnings estimates in the past four quarters. Note that in this case, we are not looking to obtain a specific coefficient value but rather an odds ratio; a ratio that tells us the likelihood of a dividend payment based on any independent variable.

glm Regression Model

> summary(dividend)

glm(formula = dividend ~ years + earnings_estimates, family = "binomial", 
    data = mydata)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.18463  -0.22497  -0.01288   0.00776   1.96192  

                   Estimate Std. Error z value Pr(>|z|)   
(Intercept)        -10.6858     3.4413  -3.105   0.0019 **
years                0.8919     0.2974   2.999   0.0027 **
earnings_estimates   0.3953     1.1003   0.359   0.7194   
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 97.041  on 69  degrees of freedom
Residual deviance: 22.762  on 67  degrees of freedom
AIC: 28.762

Number of Fisher Scoring iterations: 9

Note that we do not have an R-Squared statistic in this instance - given far less variance in our dependent variable an R-Squared statistic would prove unreliable. Our "years" variable is statistically significant at the 5% level, whereas the earnings_estimates variable is statistically insignificant.
To calculate an odds ratio, it is necessary to get the exponent of the independent variable in question - since a binomial logit model will show an exponential rather than a linear trend line. Since our years variable is statistically significant at the 5% level - this is the variable that we will exponentiate:

> exp(0.8919)
[1] 2.439761

We have calculated a log odds ratio of 2.439:1, which means that the odds of a dividend increase are 2.439 times greater than the odds of the dividend staying at the same rate or decreasing.

Moreover, as the following source elaborates in more detail, we can convert these odds into a probability as follows:

exp / (1+exp) = probability

In the case of a log odds ratio of 2.439, we yield a probability of 70.92% chance of a dividend increase, where 2.439761/(1+2.437961) = 0.7092.