# Cross-Correlation Function In R (ccf)

When working with time series, one important question is whether one series “causes” changes in another.

In other words, is there a strong correlation between one time series and another at a given number of lags?

We can detect this by measuring cross-correlation.

For instance, one time series could serve as a lagging indicator. This is where the effect of a change in one time series transfers to the other time series several periods later. This is quite common in economic data; e.g. an economic shock having an effect on GDP two quarters later.

But how do we measure the lag where this is significant? One very handy way of doing so in R is using the ccf (cross-correlation) function.

Running this function allows us to determine the lag at which the correlation between two time series is strongest.
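As a quick illustrative sketch (the data here are simulated purely for the example, not taken from the post's dataset), ccf can recover a known lag:

```r
# Simulate y as a 3-period-lagged copy of x plus noise, then recover the lag
set.seed(42)
x <- rnorm(500)
y <- c(rep(0, 3), head(x, -3)) + rnorm(500, sd = 0.5)

cc <- ccf(x, y, lag.max = 10, plot = FALSE)
cc$lag[which.max(abs(cc$acf))]  # strongest correlation at lag -3 by construction
```

Note the sign convention: in R, lag k refers to the correlation between x[t+k] and y[t], so a negative lag means x leads y.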

Two important things we must ensure when running a cross-correlation:

1. Both time series are stationary.
2. Once a suitable lag has been chosen, serial correlation is detected and corrected for if necessary.

If you are unfamiliar with these concepts, then please review my two previous posts on stationarity and serial correlation.

# ccf Plot for Campus Gym Data

The sample dataset I use for this example is the campus gym dataset from Kaggle.

The goal is to determine if there is a correlation between the change in temperature every 10 minutes and the number of people using a campus gym.

When the number of people and the temperature are plotted at 10-minute intervals, a possible trend is evident in both variables, suggesting non-stationarity may be present:
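The plot above was presumably produced along these lines; the file name and column names below are assumptions based on the Kaggle campus gym dataset, not taken from the original script:

```r
# Assumed setup: load the Kaggle campus gym data and plot both series
gym <- read.csv("data.csv")  # path to the downloaded Kaggle file (assumption)
number_people <- gym$number_people
temperature <- gym$temperature

par(mfrow = c(2, 1))
plot(number_people, type = "l", main = "Number of people (10-minute intervals)")
plot(temperature, type = "l", main = "Temperature (10-minute intervals)")
```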

# Stationarity Testing

Let us now formally test for stationarity using the ADF and KPSS tests:

```
> library(tseries)
> adf.test(number_people)

Augmented Dickey-Fuller Test

data:  number_people
Dickey-Fuller = -21.959, Lag order = 29, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(number_people) : p-value smaller than printed p-value
> adf.test(temperature)

Augmented Dickey-Fuller Test

data:  temperature
Dickey-Fuller = -11.943, Lag order = 29, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(temperature) : p-value smaller than printed p-value
> kpss.test(number_people)

KPSS Test for Level Stationarity

data:  number_people
KPSS Level = 2.8567, Truncation lag parameter = 37, p-value = 0.01

Warning message:
In kpss.test(number_people) : p-value smaller than printed p-value
> kpss.test(temperature)

KPSS Test for Level Stationarity

data:  temperature
KPSS Level = 19.836, Truncation lag parameter = 37, p-value = 0.01

Warning message:
In kpss.test(temperature) : p-value smaller than printed p-value
```

While the ADF test indicates stationarity, the KPSS test rejects level stationarity and thus points to non-stationarity. Given this inconsistency, and since the plots still suggest a trend, the two series are differenced:
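The differencing step itself is not shown in the original output; a minimal sketch, assuming base R's diff, would be:

```r
# First-difference both series (variable names match the test output below)
diffnumberpeople <- diff(number_people)
difftemperature <- diff(temperature)
```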

```
> adf.test(diffnumberpeople)

Augmented Dickey-Fuller Test

data:  diffnumberpeople
Dickey-Fuller = -37.114, Lag order = 29, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(diffnumberpeople) : p-value smaller than printed p-value
> adf.test(difftemperature)

Augmented Dickey-Fuller Test

data:  difftemperature
Dickey-Fuller = -32.268, Lag order = 29, p-value = 0.01
alternative hypothesis: stationary

Warning message:
In adf.test(difftemperature) : p-value smaller than printed p-value
> kpss.test(diffnumberpeople)

KPSS Test for Level Stationarity

data:  diffnumberpeople
KPSS Level = 0.001961, Truncation lag parameter = 37, p-value = 0.1

Warning message:
In kpss.test(diffnumberpeople) : p-value greater than printed p-value
> kpss.test(difftemperature)

KPSS Test for Level Stationarity

data:  difftemperature
KPSS Level = 0.0028682, Truncation lag parameter = 37, p-value = 0.1

Warning message:
In kpss.test(difftemperature) : p-value greater than printed p-value
```

We now see that while the ADF test still shows a p-value below 0.05, the KPSS test now shows a p-value above 0.05, so it no longer rejects stationarity. This indicates that the differenced series are stationary.

Moreover, when we plot the ACF for the two differenced variables, stationarity is indicated:
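A sketch of those ACF plots, assuming base R's acf function:

```r
# ACF plots for the differenced series; quickly decaying autocorrelations
# are consistent with stationarity
par(mfrow = c(1, 2))
acf(diffnumberpeople, main = "ACF: diffnumberpeople")
acf(difftemperature, main = "ACF: difftemperature")
```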

# Cross-Correlation: Output and Plot

Let us now plot the cross-correlation (CCF function) for the two differenced variables:

```
> crosscorrelation <- ccf(diffnumberpeople, difftemperature)
> crosscorrelation

Autocorrelations of series ‘X’, by lag

-41    -40    -39    -38    -37    -36    -35    -34    -33    -32    -31    -30    -29    -28
-0.047 -0.044 -0.068 -0.039 -0.011 -0.036 -0.034 -0.053 -0.023 -0.014 -0.013 -0.012 -0.043 -0.024
-27    -26    -25    -24    -23    -22    -21    -20    -19    -18    -17    -16    -15    -14
-0.004 -0.009  0.005 -0.037 -0.019 -0.005  0.002  0.001 -0.033 -0.010  0.006  0.011  0.017 -0.001
-13    -12    -11    -10     -9     -8     -7     -6     -5     -4     -3     -2     -1      0
0.010  0.016  0.029  0.058  0.023 -0.005  0.025  0.038  0.052  0.026  0.001  0.013  0.047  0.071
1      2      3      4      5      6      7      8      9     10     11     12     13     14
0.024 -0.005 -0.007  0.029  0.036  0.023 -0.008 -0.011  0.026  0.019  0.032  0.007 -0.008  0.020
15     16     17     18     19     20     21     22     23     24     25     26     27     28
0.021  0.033  0.008 -0.008  0.015  0.024  0.044  0.012 -0.001  0.005  0.009  0.050  0.013  0.000
29     30     31     32     33     34     35     36     37     38     39     40     41
0.004 -0.002  0.045  0.018  0.006 -0.013 -0.011  0.046  0.013 -0.001 -0.014 -0.008  0.009
```

From the plot, we see that we have the strongest cross-correlations at lag 0 and lag -39. So, let’s go ahead and run regressions on these:

```
> # No lags
> reg1<-lm(diffnumberpeople~difftemperature)
> summary(reg1)

Call:
lm(formula = diffnumberpeople ~ difftemperature)

Residuals:
Min      1Q  Median      3Q     Max
-68.001  -4.001  -0.001   3.999  74.988

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     0.0008798  0.0484658   0.018    0.986
difftemperature 0.9903364  0.0859110  11.527   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.825 on 26063 degrees of freedom
Multiple R-squared:  0.005073,	Adjusted R-squared:  0.005034
F-statistic: 132.9 on 1 and 26063 DF,  p-value: < 2.2e-16

>
> library(lmtest)
> dwtest(reg1)

Durbin-Watson test

data:  reg1
DW = 2.4562, p-value = 1
alternative hypothesis: true autocorrelation is greater than 0

>
> #Lag X(-39)
> yplus39=tail(diffnumberpeople,26025)
> reg2<-lm(yplus39~difftemperature[1:26025])
> summary(reg2)

Call:
lm(formula = yplus39 ~ difftemperature[1:26025])

Residuals:
Min      1Q  Median      3Q     Max
-68.001  -4.001  -0.001   3.999  77.999

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)               0.0005438  0.0486618   0.011    0.991
difftemperature[1:26025] -0.1084926  0.0866091  -1.253    0.210

Residual standard error: 7.85 on 26023 degrees of freedom
Multiple R-squared:  6.03e-05,	Adjusted R-squared:  2.187e-05
F-statistic: 1.569 on 1 and 26023 DF,  p-value: 0.2103

> dwtest(reg2)

Durbin-Watson test

data:  reg2
DW = 2.444, p-value = 1
alternative hypothesis: true autocorrelation is greater than 0
```

# Conclusion

Having run both regressions and used the Durbin-Watson test to check for serial correlation, we find a positive and significant coefficient on the independent variable at lag 0. Note, however, that the coefficient at lag -39 is negative and insignificant.
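Had the Durbin-Watson tests flagged serial correlation instead (they did not here), one standard remedy is to recompute the standard errors with a heteroskedasticity- and autocorrelation-consistent (Newey-West) estimator; a sketch using the sandwich package, which the original analysis did not need to apply:

```r
# Hypothetical fix, only needed if dwtest had detected serial correlation:
# Newey-West (HAC) standard errors for the lag-0 regression
library(lmtest)
library(sandwich)
coeftest(reg1, vcov = NeweyWest(reg1))
```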

The output suggests that the correlation is indeed strongest and significant when no lag is present.

Of course, when using the cross-correlation function we may well observe the strongest correlations at a nonzero lag; that simply was not the case in this instance.

However, when examining correlations between time series, it always helps to experiment with different lags in order to find the strongest predictive relationship.