Ordinary Least Squares and the lm command in R

Ordinary Least Squares (OLS) is one of the most commonly used forms of regression analysis. We use this model primarily to analyse cross-sectional data; i.e. data collected at one specific point in time across several observations. We can also use Ordinary Least Squares with time series data, but we then need to be cautious of issues such as serial correlation. To run this type of regression in R, we use the lm command (strictly speaking, the lm function).

As an example, we seek to identify the determinants of variation in the one-year return (our dependent variable) across a hypothetical dataset of 49 stocks. Our independent variables are dividend yield, earnings ranking, and debt to equity ratio.

The purpose of Ordinary Least Squares is to reveal the effect on the dependent variable when an independent variable is changed by 1 unit, holding the other independent variables constant. For example, when we increase X by 1 unit, by how many units does Y increase/decrease?
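The "1 unit change" reading can be checked directly in R. The following is a minimal sketch on simulated data (the variables x and y and the numbers used to generate them are assumptions made purely for illustration, not the stock dataset): the difference between two predictions that are 1 unit of x apart equals the estimated slope.

```r
# Simulated data for illustration only (not the article's dataset).
set.seed(1)
x <- runif(30, min = 0, max = 10)
y <- 5 + 2 * x + rnorm(30)

fit <- lm(y ~ x)

# Predict y at x = 3 and x = 4, i.e. one unit apart.
p <- predict(fit, newdata = data.frame(x = c(3, 4)))

slope_from_predictions <- unname(diff(p))        # change in predicted y
slope_from_model       <- unname(coef(fit)["x"]) # estimated coefficient on x

# The two numbers are the same quantity.
print(c(slope_from_predictions, slope_from_model))
```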

The Ordinary Least Squares regression has the following structure:

Dependent Variable = Intercept + β1·X1 + β2·X2 + ... + βn·Xn + Error Term

We denote our independent variables by X and their coefficients by β. The intercept is the expected value of the dependent variable when all independent variables are set to 0. We interpret our error term as the variation in the dependent variable not explained by our model.

We choose the following dependent (Y) and independent variables for this regression:
Dependent Variable (Y)

  • 1-year stock return in basis points: A 1000 basis point increase equals a 10% stock return in the past year.

Independent Variables (X)

  • Dividends (dummy variable): We assign a value of 1 to companies that currently pay dividends. If dividends are not paid, we assign a value of 0.
  • Earnings ranking: We rank each stock from 1 to 49 based on its most recent quarterly earnings performance. In other words, we assign a ranking of 1 to the company that increased earnings the most in the quarter, and a ranking of 49 to the company that increased earnings the least.
  • Debt to Equity: We report the debt-to-equity ratio of each company as standard in percentage form.


Ordinary Least Squares: Regression and Output
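As a rough sketch of how such a regression is run, the code below builds a simulated stand-in for the 49-stock dataset and fits it with lm. The original data is not reproduced here, so the values, the column name stock_return and the coefficients used to generate the returns are all assumptions; the regressor names follow those used later in this article.

```r
# Simulated stand-in for the 49-stock dataset (NOT the original data).
set.seed(42)
n <- 49
stocks <- data.frame(
  dividend         = rbinom(n, size = 1, prob = 0.5), # 1 = pays dividends, 0 = does not
  earnings_ranking = sample(1:n),                     # ranks 1..49 without ties
  debt_to_equity   = round(runif(n, 0.1, 2.5), 2)
)

# Hypothetical data-generating process, chosen only so the example runs.
stocks$stock_return <- 700 - 50 * stocks$dividend -
  10 * stocks$earnings_ranking - 100 * stocks$debt_to_equity +
  rnorm(n, sd = 40)

# Dependent variable goes on the left of ~, independent variables on the right.
fit <- lm(stock_return ~ dividend + earnings_ranking + debt_to_equity,
          data = stocks)

# Coefficients, standard errors, t-statistics, p-values and R-squared.
print(summary(fit))
```

Because the data is simulated, the numbers printed by this sketch will not match the figures discussed in this article.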

When we run the regression, the output shows that the earnings_ranking variable is statistically significant at the 5% level of significance (i.e. at the 95% confidence level). This suggests that earnings ranking has a meaningful association with stock returns.

Additionally, the R-Squared value of 0.9882 is very high, meaning that 98.82% of the variation in stock returns is allegedly “explained” by this model.

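We now test for multicollinearity and heteroscedasticity to ensure that the statistical readings of our model are reliable. An unusually high R-Squared value is a reason for caution: multicollinearity inflates the standard errors of the affected coefficients, while heteroscedasticity biases our standard errors, so in either case our t-statistics and significance readings may be misleading.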


  • Multicollinearity exists when two or more independent variables are highly correlated with each other, to the extent that including them together in a regression model makes our OLS estimates unstable and inflates their standard errors.
  • To test for multicollinearity, we use the Variance Inflation Factor (VIF) across our independent variables. As a common rule of thumb, we treat a reading of 5 or greater as indicating the presence of multicollinearity.
  • In our output, none of the VIF readings is 5 or greater, which indicates that our model does not suffer from multicollinearity.
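The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on the remaining independent variables. The sketch below computes it by hand in base R on simulated predictors (the data and the helper name vif_manual are assumptions for illustration). This article does not state which function produced its VIF readings; vif() from the car package is a common choice and takes the fitted lm object as its argument.

```r
# Simulated predictors for illustration only (NOT the original data).
set.seed(42)
n <- 49
predictors <- data.frame(
  dividend         = rbinom(n, size = 1, prob = 0.5),
  earnings_ranking = sample(1:n),
  debt_to_equity   = runif(n, 0.1, 2.5)
)

# Hypothetical helper: VIF_j = 1 / (1 - R2_j), where R2_j is the R-squared
# from regressing predictor j on all the other predictors.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    aux <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_manual(predictors)
print(vifs)

# Rule of thumb from the text: flag readings of 5 or greater.
print(names(vifs)[vifs >= 5])
```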



  • Heteroscedasticity occurs when the variance of the error term is not constant across our observations.
  • For instance, let us assume that we have a mixture of small and large-cap stocks across our observations. This can produce an uneven variance; i.e. the spread of returns differs depending on the size of the company.
  • We test for heteroscedasticity using the Breusch-Pagan test at the 5% level of significance. With a p-value of 0.001163, we reject the null hypothesis of constant variance, which indicates that heteroscedasticity is present in our model.
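To show what the Breusch-Pagan test does, here is a minimal base R sketch of its studentized (Koenker) form: regress the squared residuals on the independent variables and compare n × R² with a chi-squared distribution. The data is simulated with an error spread that deliberately grows with the regressor, so everything below is an assumption for illustration. In practice, bptest() from the lmtest package, which uses the studentized form by default, is commonly used instead.

```r
# Simulated data with deliberately non-constant error variance
# (illustration only, NOT the original data).
set.seed(7)
n <- 49
earnings_ranking <- sample(1:n)
stock_return <- 700 - 10 * earnings_ranking +
  rnorm(n, sd = 5 * earnings_ranking)   # error spread grows with the ranking

fit <- lm(stock_return ~ earnings_ranking)

# Auxiliary regression: squared residuals on the same regressors.
sq_resid <- residuals(fit)^2
aux <- lm(sq_resid ~ earnings_ranking)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary regression.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1        # number of regressors, excluding intercept
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = bp_df, p_value = p_value))

# Decision at the 5% level: TRUE means we reject constant variance.
print(p_value < 0.05)
```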

Heteroscedasticity may cause bias in our standard errors. This affects our readings of whether or not our variables are statistically significant. One possible way to rid our model of heteroscedasticity is to redefine our variables, e.g. by scaling our stock returns to assume a constant market capitalisation. We scale our returns to assume that all are based on a $100 million capitalisation. For instance, in the first column we are scaling the stock return of 691 basis points from a market cap of $185m to $100m: (691/185)*100 = 373.51.
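The scaling arithmetic can be reproduced in R as follows. The two input values come from the example above; the variable names stock_return and market_cap are assumptions, while stock_return_scaled is the name this article uses for the rescaled variable.

```r
# Values from the worked example in the text.
stock_return <- 691   # one-year return in basis points
market_cap   <- 185   # market capitalisation in $ millions

# Rescale the return to a common $100 million capitalisation.
stock_return_scaled <- stock_return / market_cap * 100

print(round(stock_return_scaled, 2))   # 373.51
```

The same expression works unchanged on whole data frame columns, since R arithmetic is vectorised.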


When we now run our new Ordinary Least Squares regression and Breusch-Pagan test using stock_return_scaled as our new dependent variable, we find that debt to equity also becomes statistically significant at the 5% level. Moreover, with a p-value of 0.06982, our Breusch-Pagan test is no longer significant at the 5% level. We can therefore no longer reject constant variance, which suggests that the rescaling has addressed the heteroscedasticity:


Updated Regression Output


Model Interpretation

Y = 681.402 - 102.704(dividend) - 10.147(earnings_ranking) - 182.134(debt_to_equity)

According to the output:

  • A stock that pays a dividend has a return that is 102.704 basis points lower than that of a stock that does not, holding the other variables constant.
  • Each one-place increase in the earnings ranking number (i.e. a weaker earnings performance) corresponds to a drop of 10.147 basis points in a stock’s return.
  • When we increase the debt/equity ratio by 1, we observe a drop of 182.134 basis points in a stock’s return. On an incremental basis (-182.134/100), an increase of 0.01 in the debt/equity ratio corresponds to a drop of roughly 1.82 basis points in overall stock returns.
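These readings can be verified by plugging values into the fitted equation. The sketch below hard-codes the coefficients reported above; the helper predict_return and the example stock (dividend payer, earnings ranking of 10, debt/equity of 0.50) are hypothetical and chosen only to illustrate the arithmetic.

```r
# Coefficients as reported in the fitted equation above.
coefs <- c(intercept        = 681.402,
           dividend         = -102.704,
           earnings_ranking = -10.147,
           debt_to_equity   = -182.134)

# Hypothetical helper that evaluates the fitted equation.
predict_return <- function(dividend, earnings_ranking, debt_to_equity) {
  unname(coefs["intercept"] +
         coefs["dividend"]         * dividend +
         coefs["earnings_ranking"] * earnings_ranking +
         coefs["debt_to_equity"]   * debt_to_equity)
}

# Hypothetical stock: pays a dividend, earnings ranking 10, debt/equity 0.50.
base   <- predict_return(1, 10, 0.50)
# Same stock with the debt/equity ratio raised by 0.01.
bumped <- predict_return(1, 10, 0.51)

print(base)            # 386.161
print(bumped - base)   # about -1.82 basis points
```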

However, note that the dividend variable is statistically insignificant. In general, it is a bad idea to drop an insignificant variable when it has theoretical relevance to the dependent variable, as is the case here.

If we choose to drop the dividend variable, then we come up with the following output:


Should We Keep or Discard Insignificant Variables?

In general, the theoretical relevance of a variable should be the deciding factor in whether we keep or drop it.

Once we drop the dividend variable, the coefficients of the other variables grow larger in magnitude. Naturally, there are far more than three independent variables that could affect a stock’s return.

If we exclude theoretically important variables, this could overstate the effect of the remaining variables on the dependent variable. The general consensus is that dropping insignificant variables from a model is not ideal. An exception is when we drop variables for technical reasons such as multicollinearity.
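The overstatement can be illustrated with a small simulation. In the sketch below, the data is simulated and debt_to_equity is deliberately made to correlate with dividend; these are assumptions for illustration, not properties of the original dataset. We fit the full model, drop dividend with update(), and compare the coefficients side by side.

```r
# Simulated data for illustration only (NOT the original data).
set.seed(3)
n <- 49
dividend <- rbinom(n, size = 1, prob = 0.5)
# Assumption: dividend payers tend to carry more debt, so the two are correlated.
debt_to_equity <- 1 + 0.8 * dividend + rnorm(n, sd = 0.3)
stock_return <- 700 - 100 * dividend - 150 * debt_to_equity + rnorm(n, sd = 30)

full    <- lm(stock_return ~ dividend + debt_to_equity)
reduced <- update(full, . ~ . - dividend)   # same model without dividend

# Compare the debt_to_equity coefficient with and without dividend in the model.
comparison <- c(full    = unname(coef(full)["debt_to_equity"]),
                reduced = unname(coef(reduced)["debt_to_equity"]))
print(comparison)
```

When the omitted variable is correlated with a remaining one, as constructed here, the remaining coefficient tends to absorb part of the omitted variable's effect.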

Quora: Why is dropping insignificant predictors from a regression model a bad idea?