Panel data, along with cross-sectional and time series data, are the main data types that we encounter when working with regression analysis.

## Types of data

- Cross-Sectional: Data collected at one particular point in time
- Time Series: Data collected across several time periods
- Panel Data: A mixture of both cross-sectional and time series data, i.e. collected at a particular point in time and across several time periods

When it comes to panel data, standard regression analysis often falls short in isolating fixed and random effects.

- Fixed Effects: Effects that are independent of random disturbances, e.g. observations independent of time.
- Random Effects: Effects that include random disturbances.

Let us see how we can use the plm library in R to account for fixed and random effects. You can also view my YouTube tutorial on the topic here.

## Panel Data: Fixed and Random Effects

For this tutorial, we are going to use a dataset of weekly internet usage in MB across 33 weeks across three different companies (A, B, and C).

Let us run our fixed (within) and random (random) effect models:

> #Fixed Effects (within) and Random Effects (random) > library(plm) > plmwithin <- plm(usage~income+videohours+webpages+gender+age, data = mydata, model = "within") > plmrandom <- plm(usage~income+videohours+webpages+gender+age, data = mydata, model = "random") > summary(plmwithin)Oneway (individual) effect Within Model Call: plm(formula = usage ~ income + videohours + webpages + gender + age, data = mydata, model = "within") Balanced Panel: n=33, T=3, N=99 Residuals : Min. 1st Qu. Median 3rd Qu. Max. -7001.684 -1744.973 -18.793 1212.604 6423.064 Coefficients : Estimate Std. Error t-value Pr(>|t|) income 0.60389 0.14897 4.0538 0.0001451 *** videohours 1261.43899 111.13253 11.3508 < 2.2e-16 *** webpages 97.79839 42.06834 2.3248 0.0234299 * gender -1328.98316 812.05060 -1.6366 0.1068685 age 54.15710 36.54555 1.4819 0.1435128 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Total Sum of Squares: 4312700000 Residual Sum of Squares: 692730000 R-Squared: 0.83937 Adj. R-Squared: 0.51719 F-statistic: 63.7531 on 5 and 61 DF, p-value: < 2.22e-16> summary(plmrandom)Oneway (individual) effect Random Effect Model (Swamy-Arora's transformation) Call: plm(formula = usage ~ income + videohours + webpages + gender + age, data = mydata, model = "random") Balanced Panel: n=33, T=3, N=99 Effects: var std.dev share idiosyncratic 11356210 3370 0.912 individual 1093188 1046 0.088 theta: 0.1191 Residuals : Min. 1st Qu. Median 3rd Qu. Max. -8153.757 -1808.112 -90.183 2101.833 8507.735 Coefficients : Estimate Std. Error t-value Pr(>|t|) (Intercept) -1831.20613 1482.37880 -1.2353 0.21982 income 0.51416 0.12419 4.1401 7.627e-05 *** videohours 1265.14090 94.06589 13.4495 < 2.2e-16 *** webpages 95.35198 37.72137 2.5278 0.01316 * gender -1688.46458 708.69203 -2.3825 0.01923 * age 57.27720 30.01488 1.9083 0.05944 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Total Sum of Squares: 5790300000 Residual Sum of Squares: 1031900000 R-Squared: 0.82178 Adj. R-Squared: 0.77198 F-statistic: 85.7674 on 5 and 93 DF, p-value: < 2.22e-16

In the case of the first regression, we are accounting for fixed effects (or internet usage independent of time), while the second is accounting for random effects (including time).

Note that the variables *gender* and *age* which were deemed insigificant in the fixed effects regression are now being deemed significant in the random effects regression.

## Analysing with fixef

Now, let us use *fixef* to isolate the effects of time on internet usage.

> #Effect of time on internet usage > fixef(plmwithin,type="dmean")1 2 3 4 5 6 7 8 9 10 11 3054.6050 -467.6298 -1557.9555 524.9491 -18.3859 -1374.6854 2300.6432 1890.2534 -541.1499 4721.5132 241.6241 12 13 14 15 16 17 18 19 20 21 22 -1438.6650 2208.1437 -2036.5721 -848.0145 508.1098 -269.4162 2335.4139 803.8547 -3633.5643 -1399.4824 2142.2455 23 24 25 26 27 28 29 30 31 32 33 258.5465 -2458.9901 -1724.9200 -3979.0983 70.2551 717.6419 1289.9506 -1763.4756 2241.4846 2945.1578 -4742.3872> summary(fixef(plmwithin, type = "dmean"))Estimate Std. Error t-value Pr(>|t|) 1 3054.605 2860.669 1.0678 0.28561 2 -467.630 2452.521 -0.1907 0.84878 3 -1557.955 2368.244 -0.6579 0.51063 4 524.949 2734.229 0.1920 0.84775 5 -18.386 2861.050 -0.0064 0.99487 6 -1374.685 2719.298 -0.5055 0.61319 7 2300.643 2353.954 0.9774 0.32839 8 1890.253 2598.290 0.7275 0.46692 9 -541.150 2518.805 -0.2148 0.82989 10 4721.513 2619.317 1.8026 0.07146 . 11 241.624 2765.534 0.0874 0.93038 12 -1438.665 2868.905 -0.5015 0.61604 13 2208.144 2651.892 0.8327 0.40503 14 -2036.572 2549.967 -0.7987 0.42448 15 -848.015 2643.324 -0.3208 0.74835 16 508.110 2594.214 0.1959 0.84472 17 -269.416 2463.474 -0.1094 0.91291 18 2335.414 2662.246 0.8772 0.38036 19 803.855 2475.448 0.3247 0.74538 20 -3633.564 2504.424 -1.4509 0.14682 21 -1399.482 2727.735 -0.5131 0.60791 22 2142.245 2477.582 0.8647 0.38723 23 258.546 2702.259 0.0957 0.92378 24 -2458.990 3036.375 -0.8098 0.41803 25 -1724.920 2489.795 -0.6928 0.48844 26 -3979.098 2758.028 -1.4427 0.14910 27 70.255 2595.908 0.0271 0.97841 28 717.642 2494.345 0.2877 0.77357 29 1289.951 2575.964 0.5008 0.61654 30 -1763.476 2458.731 -0.7172 0.47323 31 2241.485 2441.171 0.9182 0.35851 32 2945.158 2725.083 1.0808 0.27980 33 -4742.387 2820.176 -1.6816 0.09265 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Observe that weeks 10 and 33 are significant at the 10% level. For week 10, we see that usage across A, B, and C is higher than usual:

On the other hand, usage for week 33 across A, B, and C is lower than usual:

This could indicate that something is taking place across these particular weeks to cause internet usage to be higher and lower than usual. We see that using *fixef* has allowed us to isolate the effect of time in this regard.

Now, let us extract fixed effects - i.e. effects independent of time.

> #Extracting fixed effects with fixef > twoway <- plm(usage~income+videohours+webpages+gender+age+company,data=mydata,model="within",effect="time") > fixef(twoway,effect="time")a b c -451.3598 -1202.3393 -3451.6416

We see that of the three internet companies, company C has the largest negative coefficient.

Let's subset usage for the three companies and see what is going on:

> #Usage by provider> a <- subset(mydata, company=="a") > b <- subset(mydata, company=="b") > c <- subset(mydata, company=="c")> #Mean usage by provider > mean(a$usage)[1] 11646.94> mean(b$usage)[1] 8851.848> mean(c$usage)[1] 6279.03

We see that mean internet usage for company C is lower than that of A and B.

Moreover, visualising usage across the 33 weeks reveals an interesting insight:

> #Usage visualisations> plot(a$usage,type='l') > plot(b$usage,type='l') > plot(c$usage,type='l')

### Usage A

### Usage B

### Usage C

We see that in the case of C, usage remains low overall but there are certain periods where we see a spike in usage. However, we see these spikes much more frequently for A and B.

In this regard, comparing fixed and random effects has allowed us to isolate the impact of time on usage patterns for C.

## Conclusion

In this tutorial, you have learned:

- The difference between a fixed and random effects model
- How to use the plm library
- How to isolate fixed and random effects in a panel dataset

Many thanks for viewing this tutorial, and if you have any questions please leave them in the comments below.

Hello:

Thank you a lot for your tutorial. Quite useful. Some questions:

– What do the results that you get when you run “summary(fixef(plmwithin, type = “dmean”))” tells you or mean?

– When you say “let us extract fixed effects” and run “fixef(twoway,effect=”time”)”, how would you interpret the results exactly?

– What does “We see that of the three internet companies, company C has the largest negative coefficient.” mean?

Thank you for your tutorial!

Hi Pedro,

Many thanks for your questions.

1) This allows us to isolate the effects of time on internet usage, i.e. what would be the differences in usage if time was constant?

2) and 3) You can see that C has the largest negative coefficient along with the lowest mean usage. This implies that the impact of time leads to a lower mean usage for company C.

Hope this helps.

Best,

Michael

Do you know if plm is used in repeated measurements studies and medical interventions or coaching interventions for linear regression ?

I haven’t used plm on health data directly, but I imagine that it is applicable to the scenario you pointed out. In general, plm can be used when there is a combination of cross-sectional and time series data among the observations in the dataset.