Practical Approaches to Principal Component Analysis in the Presence of Missing Values
Alexander Ilin and Tapani Raiko
Helsinki University of Technology, Espoo, Finland.
Principal component analysis (PCA) is a classical data analysis
technique that finds linear transformations of data that retain the maximum amount
of variance. We study the case where some of the data values are missing
and show that this problem has many features that are usually associated
with nonlinear models, such as overfitting and bad locally optimal solutions.
The probabilistic formulation of PCA provides a good foundation for handling
missing values, and we introduce formulas for doing that. In the case of
high-dimensional and very sparse data, overfitting becomes a severe problem and
traditional algorithms for PCA are very slow. We introduce a novel fast
algorithm and extend it to variational Bayesian learning. Different versions
of PCA are compared in artificial experiments, demonstrating the effects of
regularization and modeling of posterior variance. The scalability of the
proposed algorithm is demonstrated by applying it to the Netflix problem.
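
To make the setting concrete, the following is a minimal sketch of fitting a low-rank PCA-style model using only the observed entries of a data matrix. It uses regularized alternating least squares, which is a generic baseline chosen here for brevity and is not the fast algorithm or the variational Bayesian method introduced in this paper. The function name `pca_missing_als`, the parameters `lam` and `n_iters`, and the choice of estimating the bias from observed row means are all assumptions made for illustration.

```python
import numpy as np

def pca_missing_als(X, mask, n_components, lam=1e-3, n_iters=50, seed=0):
    """Fit X ~ A @ S + m[:, None] using only entries where mask is True.

    Illustrative baseline (regularized alternating least squares),
    not the algorithm proposed in the paper.
    """
    rng = np.random.default_rng(seed)
    d, n = X.shape
    # Bias term: mean of the observed entries in each row (an assumption).
    counts = np.maximum(mask.sum(axis=1), 1)
    m = np.where(mask, X, 0.0).sum(axis=1) / counts
    # Centered data; missing positions are set to 0 and never used below.
    Xc = np.where(mask, X - m[:, None], 0.0)
    A = rng.standard_normal((d, n_components))
    S = np.zeros((n_components, n))
    reg = lam * np.eye(n_components)  # small ridge penalty against overfitting
    for _ in range(n_iters):
        # Update each column of S from the rows observed in that column.
        for j in range(n):
            o = mask[:, j]
            Ao = A[o]
            S[:, j] = np.linalg.solve(Ao.T @ Ao + reg, Ao.T @ Xc[o, j])
        # Update each row of A from the columns observed in that row.
        for i in range(d):
            o = mask[i]
            So = S[:, o]
            A[i] = np.linalg.solve(So @ So.T + reg, So @ Xc[i, o])
    return A, S, m
```

Missing values are then reconstructed as `A @ S + m[:, None]`. The ridge penalty `lam` is the simplest form of the regularization discussed above; with very sparse data, a larger value is typically needed to keep the solution from fitting the few observed entries too closely.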