A Tutorial on the Use of Partial Least Squares and Principal Components Analysis for the Identification Problem in the Age-Period-Cohort-Analysis
In the analysis of trends in health outcomes, an ongoing issue is how to separate and estimate the effects of age, time period and cohort. Because these three variables are perfectly collinear by design, the regression coefficients in a general linear model are not unique. In this tutorial, we review why identification is a problem and how it may be tackled using partial least squares and principal component regression, which provide a flexible modelling strategy for Age-Period-Cohort analysis. We show that both methods produce regression coefficients that fulfil the same collinearity constraint as the variables age, time period and cohort. We show that, because the constraint imposed by partial least squares and principal component regression is inherent in the mathematical relation amongst the three variables, the results are more interpretable. We use one dataset from a Taiwanese health screening program to illustrate how partial least squares regression can be used to analyze trends in body height with three continuous variables for age, period and cohort. We then use a second dataset, of hepatocellular carcinoma mortality rates for Taiwanese men, to illustrate how partial least squares regression can be applied to aggregated data. We also use the second dataset to show the relation between partial least squares regression and the intrinsic estimator, a recently proposed method for Age-Period-Cohort analysis. Finally, we show that the inclusion of all indicator variables provides a more consistent approach. R code for our analyses is provided in the appendix.
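
The identification problem and the collinearity constraint described above can be illustrated with a minimal sketch. It is not the R code from the appendix, and it does not run partial least squares or principal component regression. It assumes that cohort is coded as period minus age, uses invented toy values, and applies an orthogonal projection as a stand-in for the property that coefficients built from the predictors' own row space satisfy the same linear relation as the predictors. The names `rows`, `fitted`, `project` and the starting coefficients are hypothetical.

```python
# Toy illustration (invented values): cohort = period - age, so
# age - period + cohort = 0 holds exactly for every observation.
ages = [20, 30, 40, 50]
periods = [1990, 2000, 2010]
rows = [(a, p, p - a) for a in ages for p in periods]  # (age, period, cohort)


def fitted(beta, rows):
    """Linear predictor without intercept for coefficients (age, period, cohort)."""
    return [beta[0] * a + beta[1] * p + beta[2] * c for a, p, c in rows]


# Null vector of the design: v . (age, period, cohort) = 0 for every row.
v = (1.0, -1.0, 1.0)

# An arbitrary coefficient vector (assumed purely for illustration).
beta = (0.5, 0.2, -0.1)

# Adding any multiple of v changes the coefficients but not the fitted
# values: this is the identification problem.
shifted = tuple(b + 3.0 * vi for b, vi in zip(beta, v))
max_diff = max(
    abs(f1 - f2) for f1, f2 in zip(fitted(beta, rows), fitted(shifted, rows))
)


def project(beta, v):
    """Remove the component of beta that lies along the null vector v."""
    scale = sum(b * vi for b, vi in zip(beta, v)) / sum(vi * vi for vi in v)
    return tuple(b - scale * vi for b, vi in zip(beta, v))


# The projected coefficients obey the same relation as the variables:
# b_age - b_period + b_cohort = 0, with unchanged fitted values.
constrained = project(beta, v)
constraint_value = constrained[0] - constrained[1] + constrained[2]
fit_diff = max(
    abs(f1 - f2) for f1, f2 in zip(fitted(beta, rows), fitted(constrained, rows))
)

print("largest change in fitted values after shifting:", max_diff)
print("constrained coefficients:", constrained)
print("b_age - b_period + b_cohort:", constraint_value)
```

Under this coding, every coefficient vector of the form `beta + t * v` fits the data equally well, and the projection selects the one member of that family whose coefficients satisfy the constraint. A different coding of cohort would change the signs in `v` but not the argument.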