When working with a time series, one important thing we wish to determine is whether one series “causes” changes in another.

In other words, is there a strong correlation between one time series and another at a given number of lags?

We can detect this by measuring **cross-correlation**.

For instance, one time series could serve as a lagging indicator, where the effect of a change in one series transfers to the other several periods later. This is quite common in economic data; e.g. an economic shock having an effect on GDP two quarters later.

But how do we measure the lag where this is significant? One very handy way of doing so in R is using the **ccf** (cross-correlation) function.

Running this function allows us to determine the lag at which the correlation between two time series is strongest.
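As a minimal sketch of how this works (using simulated data rather than the gym dataset, with illustrative variable names), we can build a series that follows another with a known lag and let `ccf` recover it:

```r
# Minimal sketch: recover a known lag with ccf (simulated data).
set.seed(42)
x <- rnorm(500)
# y follows x with a 3-period lag, plus noise
y <- c(rep(0, 3), head(x, -3)) + rnorm(500, sd = 0.5)

cc <- ccf(x, y, lag.max = 20, plot = FALSE)
# lag at which the absolute cross-correlation is largest
best <- cc$lag[which.max(abs(cc$acf))]
best  # should be close to -3: x at time t matches y at time t + 3
```

Recall the sign convention: `ccf(x, y)` estimates the correlation between `x[t+k]` and `y[t]`, so a negative lag here means `x` leads `y`.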

Two important checks when running a cross-correlation:

- The time series must be stationary.
- Once a suitable lag has been chosen, we should test for serial correlation and correct for it if necessary.

If you are unfamiliar with these concepts, then please review two of my previous posts on stationarity and serial correlation.

# ccf Plot for Campus Gym Data

The sample dataset I use for this example is the campus gym dataset from Kaggle.

The goal is to determine if there is a correlation between the change in temperature every 10 minutes and the number of people using a campus gym.

When the number of people and the temperature are plotted at 10-minute intervals, a possible trend in both variables is evident, so non-stationarity may be present:

# Stationarity Testing

Let us now formally test for stationarity using the **ADF** and **KPSS** tests:

```r
> adf.test(number_people)

	Augmented Dickey-Fuller Test

data:  number_people
Dickey-Fuller = -21.959, Lag order = 29, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(number_people) : p-value smaller than printed p-value

> adf.test(temperature)

	Augmented Dickey-Fuller Test

data:  temperature
Dickey-Fuller = -11.943, Lag order = 29, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(temperature) : p-value smaller than printed p-value

> kpss.test(number_people)

	KPSS Test for Level Stationarity

data:  number_people
KPSS Level = 2.8567, Truncation lag parameter = 37, p-value = 0.01

Warning message:
In kpss.test(number_people) : p-value smaller than printed p-value

> kpss.test(temperature)

	KPSS Test for Level Stationarity

data:  temperature
KPSS Level = 19.836, Truncation lag parameter = 37, p-value = 0.01

Warning message:
In kpss.test(temperature) : p-value smaller than printed p-value
```

The ADF test indicates **stationarity** (p = 0.01 rejects the unit root), but the KPSS test rejects level stationarity, suggesting the opposite. Note that the two tests have opposite null hypotheses, so a small p-value points in different directions for each. Given this inconsistency, and since the plots still show a trend, the two series are **differenced**:

```r
> adf.test(diffnumberpeople)

	Augmented Dickey-Fuller Test

data:  diffnumberpeople
Dickey-Fuller = -37.114, Lag order = 29, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(diffnumberpeople) : p-value smaller than printed p-value

> adf.test(difftemperature)

	Augmented Dickey-Fuller Test

data:  difftemperature
Dickey-Fuller = -32.268, Lag order = 29, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(difftemperature) : p-value smaller than printed p-value

> kpss.test(diffnumberpeople)

	KPSS Test for Level Stationarity

data:  diffnumberpeople
KPSS Level = 0.001961, Truncation lag parameter = 37, p-value = 0.1

Warning message:
In kpss.test(diffnumberpeople) : p-value greater than printed p-value

> kpss.test(difftemperature)

	KPSS Test for Level Stationarity

data:  difftemperature
KPSS Level = 0.0028682, Truncation lag parameter = 37, p-value = 0.1

Warning message:
In kpss.test(difftemperature) : p-value greater than printed p-value
```

After differencing, the ADF test still shows a p-value below 0.05, and the KPSS test now shows a p-value above 0.05. Both tests therefore agree that the differenced series are stationary.
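For reference, this difference-and-retest workflow can be reproduced end to end on simulated data. The sketch below (variable names are my own; it requires the **tseries** package, which provides `adf.test` and `kpss.test`) applies the same steps to a random walk:

```r
# Sketch of the workflow above on a simulated random walk.
# Requires the tseries package: install.packages("tseries") if needed.
library(tseries)

set.seed(1)
rw <- cumsum(rnorm(1000))   # a random walk: non-stationary by construction

adf.test(rw)    # typically fails to reject the unit root
kpss.test(rw)   # typically rejects level stationarity

d_rw <- diff(rw)            # first difference, as done for the gym data
adf.test(d_rw)  # now rejects the unit root (small p-value)
kpss.test(d_rw) # now fails to reject stationarity
```

Note that `diff` shortens the series by one observation, which is why the differenced gym variables have one fewer data point than the originals.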

Moreover, when we plot the ACF for the two differenced variables, stationarity is indicated:
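An ACF plot like the one above can be generated with base R's `acf` function; a quick sketch on simulated data (names are illustrative):

```r
# Sketch: ACF of a differenced series (simulated data).
set.seed(7)
series <- cumsum(rnorm(500))   # trending, non-stationary series
d_series <- diff(series)       # differenced series

a <- acf(d_series, plot = FALSE)
a$acf[1]  # lag 0 is always exactly 1
# For a stationary differenced series like this one, the remaining
# autocorrelations should mostly fall inside the confidence bands
# that acf() draws when plot = TRUE.
```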

# Cross-Correlation: Output and Plot

Let us now compute and plot the cross-correlation (using the **ccf** function) for the two differenced variables:

```r
> crosscorrelation <- ccf(diffnumberpeople, difftemperature)
> crosscorrelation

Autocorrelations of series ‘X’, by lag

   -41    -40    -39    -38    -37    -36    -35    -34    -33    -32    -31    -30    -29    -28
-0.047 -0.044 -0.068 -0.039 -0.011 -0.036 -0.034 -0.053 -0.023 -0.014 -0.013 -0.012 -0.043 -0.024
   -27    -26    -25    -24    -23    -22    -21    -20    -19    -18    -17    -16    -15    -14
-0.004 -0.009  0.005 -0.037 -0.019 -0.005  0.002  0.001 -0.033 -0.010  0.006  0.011  0.017 -0.001
   -13    -12    -11    -10     -9     -8     -7     -6     -5     -4     -3     -2     -1      0
 0.010  0.016  0.029  0.058  0.023 -0.005  0.025  0.038  0.052  0.026  0.001  0.013  0.047  0.071
     1      2      3      4      5      6      7      8      9     10     11     12     13     14
 0.024 -0.005 -0.007  0.029  0.036  0.023 -0.008 -0.011  0.026  0.019  0.032  0.007 -0.008  0.020
    15     16     17     18     19     20     21     22     23     24     25     26     27     28
 0.021  0.033  0.008 -0.008  0.015  0.024  0.044  0.012 -0.001  0.005  0.009  0.050  0.013  0.000
    29     30     31     32     33     34     35     36     37     38     39     40     41
 0.004 -0.002  0.045  0.018  0.006 -0.013 -0.011  0.046  0.013 -0.001 -0.014 -0.008  0.009
```

From the output, the strongest cross-correlations (in absolute terms) occur at lag **0** (0.071) and lag **-39** (-0.068). So, let's go ahead and run regressions at these lags:

```r
> # No lags
> reg1 <- lm(diffnumberpeople ~ difftemperature)
> summary(reg1)

Call:
lm(formula = diffnumberpeople ~ difftemperature)

Residuals:
    Min      1Q  Median      3Q     Max
-68.001  -4.001  -0.001   3.999  74.988

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.0008798  0.0484658   0.018    0.986
difftemperature 0.9903364  0.0859110  11.527   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.825 on 26063 degrees of freedom
Multiple R-squared:  0.005073,	Adjusted R-squared:  0.005034
F-statistic: 132.9 on 1 and 26063 DF,  p-value: < 2.2e-16

> library(lmtest)
> dwtest(reg1)

	Durbin-Watson test

data:  reg1
DW = 2.4562, p-value = 1
alternative hypothesis: true autocorrelation is greater than 0

> # Lag X(-39)
> yplus39 <- tail(diffnumberpeople, 26025)
> reg2 <- lm(yplus39 ~ difftemperature[1:26025])
> summary(reg2)

Call:
lm(formula = yplus39 ~ difftemperature[1:26025])

Residuals:
    Min      1Q  Median      3Q     Max
-68.001  -4.001  -0.001   3.999  77.999

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)               0.0005438  0.0486618   0.011    0.991
difftemperature[1:26025] -0.1084926  0.0866091  -1.253    0.210

Residual standard error: 7.85 on 26023 degrees of freedom
Multiple R-squared:  6.03e-05,	Adjusted R-squared:  2.187e-05
F-statistic: 1.569 on 1 and 26023 DF,  p-value: 0.2103

> dwtest(reg2)

	Durbin-Watson test

data:  reg2
DW = 2.444, p-value = 1
alternative hypothesis: true autocorrelation is greater than 0
```
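The `tail`/indexing trick used for the lag-39 regression generalizes to any lag. As a sketch, here is a small helper (the function `lag_regression` is my own illustration, not part of the original script), verified on simulated data with a known lag:

```r
# Sketch: regress y on x lagged by k periods (k > 0 means x leads y).
# lag_regression is an illustrative helper, not from the original script.
lag_regression <- function(y, x, k) {
  n <- length(y) - k
  lm(tail(y, n) ~ head(x, n))   # align y[(k+1):N] with x[1:(N-k)]
}

set.seed(3)
x <- rnorm(200)
y <- c(rep(0, 5), head(x, -5)) + rnorm(200, sd = 0.3)  # true lag: 5

fit <- lag_regression(y, x, 5)
coef(fit)[2]  # slope near 1 at the true lag
```

At the wrong lag, the same helper would return a slope near zero, which mirrors the insignificant coefficient we saw at lag -39 above.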

# Conclusion

Having run both regressions and tested for serial correlation (the Durbin-Watson tests show no evidence of positive autocorrelation in either case), we have a positive and significant independent variable at lag 0. However, note that we also have a negative and insignificant independent variable at lag -39.

The output suggests that the **correlation is indeed strongest and significant when no lag is present**.

Of course, it could be the case that when using the cross-correlation function, we observe the strongest correlations at a nonzero lag – that simply did not happen in this instance.

However, when looking at correlations across time series models, it always helps to experiment with correlations across different periods to obtain the strongest predictive model.

## Code Scripts and Datasets

Hope you enjoyed this tutorial!

The full code is available by **subscribing to my mailing list**.

Upon subscription, you will receive full access to the codes and datasets for my tutorials, as well as a comprehensive course in regression analysis in both Python and R.