ARIMA Models: Stock Price Forecasting with Python and R

ARIMA (Autoregressive Integrated Moving Average) is a major tool used in time series analysis to attempt to forecast future values of a variable based on its present value. For this particular example, I use a stock price dataset of Johnson & Johnson (JNJ) from 2006-2016, and use the aforementioned model to conduct price forecasting on this time series.

Background

The purpose of ARIMA is to determine the nature of the relationship between our residuals, which would provide our model with a certain degree of forecasting power. In the first instance, in order to conduct a time series analysis we must express our dataset in terms of logarithms. If our data is expressed solely in price terms, then this does not allow for continuous compounding of returns over time and will give misleading results.

An ARIMA model consists of coordinates (p, d, q):

  • p stands for the number of autoregressive terms, i.e. the number of observations from past time values used to forecast future values. e.g. if the value of p is 2, then this means that two previous time observations in the series are being used to forecast the future trend.
  • d denotes the number of differences needed to make the time series stationary (i.e. one with a constant mean, variance, and autocorrelation). For instance, if d = 1, then it means that a first-difference of the series must be obtained to transform it into a stationary one.
  • q represents the moving average of the previous forecast errors in our model, or the lagged values of the error term. As an example, if q has a value of 1, then this means that we have one lagged value of the error term in the model.

Implementing ARIMA with statsmodels in Python

Here is how to implement an ARIMA model in Python using the pandas and statsmodels libraries.

1. Load Libraries

Firstly, we load our libraries as standard. The library of major importance in this case is statsmodels, since we are using this library to calculate the ACF and PACF statistics, and also to formulate the ARIMA model:

import pandas
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import numpy as np
import math
from statsmodels.tsa.stattools import acf, pacf
import statsmodels.tsa.stattools as ts
from statsmodels.tsa.arima_model import ARIMA

2. Import csv and define “price” variable using pandas

variables = pandas.read_csv('jnj.csv')
price = variables['price']

3. Autocorrelation and Partial Autocorrelation Plots

We use the ACF and PACF (Autocorrelation and Partial Autocorrelation) readings to determine whether our data is stationary upon differencing. The autocorrelation and partial autocorrelation function both measure, to varying degrees, the correlation coefficient between a series and lags of the variable over time. An autoregressive process is when a time series follows a particular pattern in that its present value is in some way correlated to its past value(s). For instance, if we are able to use regression analysis to discern the present value of a variable from using its past value, then we refer to this as an AR(1) process:

Xt = ß0 + ß1X(t-1) + et

However, there are some instances in which the present value of a variable can be determined from the past two or three values, which would incorporate an AR(2) or AR(3) process respectively:

Xt = ß0 + ß1X(t-1) + ß2X(t-2) + et
Xt = ß0 + ß1X(t-1) + ß2X(t-2) + ß3X(t-3) + et

Let us see how we can generate the acf and pacf plots:

lnprice=np.log(price)
lnprice
plt.plot(lnprice)
plt.show()
acf_1 =  acf(lnprice)[1:20]
plt.plot(acf_1)
plt.show()
test_df = pandas.DataFrame([acf_1]).T
test_df.columns = ['Pandas Autocorrelation']
test_df.index += 1
test_df.plot(kind='bar')
pacf_1 =  pacf(lnprice)[1:20]
plt.plot(pacf_1)
plt.show()
test_df = pandas.DataFrame([pacf_1]).T
test_df.columns = ['Pandas Partial Autocorrelation']
test_df.index += 1
test_df.plot(kind='bar')
result = ts.adfuller(lnprice, 1)
result

We see that statsmodels produces the autocorrelation and partial autocorrelation plots:

acf

pacf

Moreover, we have confirmation that our data follows an AR(1) stationary process (one with a constant mean, variance, and autocorrelation), and we see that the price plot now shows a stationary process:

lnprice_diff=lnprice-lnprice.shift()
diff=lnprice_diff.dropna()
acf_1_diff =  acf(diff)[1:20]
test_df = pandas.DataFrame([acf_1_diff]).T
test_df.columns = ['First Difference Autocorrelation']
test_df.index += 1
test_df.plot(kind='bar')
pacf_1_diff =  pacf(diff)[1:20]
plt.plot(pacf_1_diff)
plt.show()

4. ARIMA Model Generation

price_matrix=lnprice.as_matrix()
model = ARIMA(price_matrix, order=(0,1,0))
model_fit = model.fit(disp=0)
print(model_fit.summary())

Using the (0,1,0) configuration, our ARIMA model is generated:
 
arima

As previously mentioned, our data is in logarithmic format. Since we are analysing stock price, this format is necessary to account for compounding returns. However, once we have obtained the forecasts (for seven periods out in this case), then we can obtain the real price forecast by converting the logarithmic figure to an exponent:

predictions=model_fit.predict(122, 127, typ='levels')
predictions
predictionsadjusted=np.exp(predictions)
predictionsadjusted

predictions
 
In this particular instance, we are predicting future values in their own right for illustration purposes – but will backtest in R by segregating data into both training and test data. This is done to determine how accurate the forecasts of the model are by comparing the model output (created from the training data) to the test data.

ARIMA with R

Now, let us see how this model can be implemented using R.

For this particular model, 80% of our data (i.e. the first 96 observations) are used as our training data used to actually build the model, while the remainder of the data will serve as our test data; i.e. this data is used to compare the forecasts from ARIMA to the actual in order to judge model accuracy.

5. Generate ACF and PACF plots

Again, we generate ACF and PACF Plots:

acf(lnstock, lag.max=20)
pacf(lnstock, lag.max=20)

 

 

6. Dickey-Fuller Test

In order to use an ARIMA model, we now wish to conduct a more formal test to determine if our time series is stationary; i.e. do we have a constant mean, variance and autocorrelation across our time series dataset. For this purpose, we use the Dickey-Fuller Test. At the 5% level of significance:
 
H0: Non-stationary series
HA: Stationary series
 

data:  lnstock
Dickey-Fuller = -2.0974, Lag order = 4, p-value = 0.888
alternative hypothesis: stationary

 
With a p-value of 0.888, we cannot reject the null hypothesis of non-stationarity in our series.

However, when the data is first-differenced, we see that our p-value drops below 0.05, and we can therefore now reject the null hypothesis of non-stationarity:

data:  difflnstock
Dickey-Fuller = -5.0751, Lag order = 4, p-value = 0.01
alternative hypothesis: stationary

Moreover, running the auto.arima function in R for both models returns a random walk with drift; i.e. ARIMA(0, 1, 0). On this basis, we choose to specify a (0, 1, 0) ARIMA model for both stocks:
 

7. ARIMA Output

To generate an ARIMA plot and output by letting R itself decide the appropriate parameters using ARIMA, we use the auto.arima function as follows after defining our time series:
 

pricearima <- ts(lnstock, start = c(2006,09), frequency = 12)
fitlnstock<-auto.arima(pricearima)

 
arima
 

Series: pricearima 
ARIMA(0,1,0) with drift         

Coefficients:
       drift
      0.0075
s.e.  0.0044

sigma^2 estimated as 0.00182:  log likelihood=165.38
AIC=-326.76   AICc=-326.63   BIC=-321.65

 
We see that ARIMA ultimately diagnoses a random walk with drift for the stock; meaning that the movements of the stock price are random, but follows a purposeful pattern in the interim. Indeed, the movements of various financial assets have been found to be random, but typically follow a random walk with drift meaning that purposeful patterns play out in the short-term which can be exploited. Note that in an ideal situation, ARIMAX would be employed which forecasts an ARIMA model by also taking explanatory variables into account. However, in situations where we wish to forecast a time series based on its past values alone, then ARIMA is a standard model for doing so.

If we so wish, we can also obtain the exponents of the ln forecasts in order to obtain real price, i.e:

#Forecasted Values From ARIMA
forecastedvalues_ln=forecast(fitlnstock,h=26)
forecastedvalues_ln
plot(forecastedvalues_ln)

forecastedvaluesextracted=as.numeric(forecastedvalues_ln$mean)
finalforecastvalues=exp(forecastedvaluesextracted)
finalforecastvalues

Projecting 26 periods out would give us the following forecast in exp terms:

> finalforecastvalues
[1] 101.4698 102.2290 102.9938 103.7644 104.5408 105.3230 106.1110 106.9049 107.7048 108.5106 109.3225 110.1404 110.9645 111.7947 112.6312 113.4739 114.3229 115.1782 116.0400 116.9082 117.7829 118.6641 119.5520 120.4465 121.3476 122.2556

 

8. Training-Test Validation

> df<-data.frame(price[96:121],finalforecastvalues)
> col_headings<-c("Actual Price","Forecasted Price")
> names(df)<-col_headings
> attach(df)
> percentage_error=((df$`Actual Price`-df$`Forecasted Price`)/(df$`Actual Price`))
> percentage_error

 [1] -0.0074819537 -0.0038130090 -0.0004170119 -0.0433719063 -0.0976804738 -0.0728155807 -0.1013633836 -0.1252634992
 [9] -0.1149107514 -0.1541401644 -0.1308660880 -0.2056309085 -0.2228489342 -0.1383453174 -0.1361083651 -0.1281169831
[17] -0.1178398054 -0.1099174498 -0.0873208758 -0.0575334338 -0.0521664682  0.0152039673  0.0389721026 -0.0092715631
[25] -0.0272381466 -0.0290005779

> mean(percentage_error)

[1] -0.08151102

From the above, we see that there is an average of an 8% deviation between the actual price and the forecasted price provided by ARIMA.

9. Ljung-Box Test

While we could potentially use this model to forecast future values for price, an important test used to validate the findings of the ARIMA model is the Ljung-Box test. Essentially, the test is being used to determine if the residuals of our time series follow a random pattern, or if there is a significant degree of non-randomness.

H0: Residuals follow a random pattern
HA: Residuals do not follow a random pattern

Note that the method for choosing a specific number of lags for Ljung-Box can be quite arbitrary. In this regard, we will run the Ljung-Box test with lags 5, 10, and 15. To run this test in R, we use the following functions:
 

> Box.test(fitlnstock$resid, lag=5, type="Ljung-Box")

Box-Ljung test

data:  fitlnstock$resid
X-squared = 9.2542, df = 5, p-value = 0.09934

> Box.test(fitlnstock$resid, lag=10, type="Ljung-Box")

Box-Ljung test

data:  fitlnstock$resid
X-squared = 12.979, df = 10, p-value = 0.2249

> Box.test(fitlnstock$resid, lag=15, type="Ljung-Box")

Box-Ljung test

data:  fitlnstock$resid
X-squared = 15.395, df = 15, p-value = 0.4234

 
From the above, we see that we have an insignificant p-value at all lags. This means that there is likely a high degree of randomness exhibited by our residuals (consistent with a random walk with drift model) and therefore our ARIMA model is free of autocorrelation.

Disclosure: The author is long JNJ at the time of writing. However, the example above is for illustration purposes only, and in no way constitutes promotion of a security or any other form of investment advice.