EUSPCA: Exactly uncorrelated sparse principal component analysis
The R package euspca finds a user-specified number of uncorrelated sparse principal components. The main function, euspca, finds the unnormalized loading matrix by minimizing an objective function that depends on:
- the empirical covariance or correlation matrix of the original variables,
- the (i,j)-th components of the loading matrix, and
- a user-specified value that controls the sparsity of the loading matrix.
In this tutorial, we will use the euspca package to find two uncorrelated sparse principal components for the syn dataset.
Synthetic data
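We will use the syn dataset, which contains the covariance matrix for 9 variables $v_1, \dots, v_9$ generated as:
- $v_i = \xi_1 + e_i$ for $i = 1, 2, 3$,
- $v_i = \xi_2 + e_i$ for $i = 4, 5, 6$, and
- $v_i = \xi_3 + e_i$ for $i = 7, 8, 9$,
where the noise terms $e_1, \dots, e_9$ and $\varepsilon$ are independent and each follows $N(0, 1)$, and $\xi_1, \xi_2, \xi_3$ are three hidden factors, where:
- $\xi_1 \sim N(0, 290)$,
- $\xi_2 \sim N(0, 300)$, independent of $\xi_1$, and
- $\xi_3 = 0.3\,\xi_1 + 0.98\,\xi_2 + \varepsilon$, where $\varepsilon$ is independent of $\xi_1$ and $\xi_2$.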
To see what this matrix looks like, use the following code:
data(syn) # load data
print(syn)
## v1 v2 v3 v4 v5 v6 v7 v8 v9
## v1 291 290 290 0 0 0 87.00 87.00 87.00
## v2 290 291 290 0 0 0 87.00 87.00 87.00
## v3 290 290 291 0 0 0 87.00 87.00 87.00
## v4 0 0 0 301 300 300 294.00 294.00 294.00
## v5 0 0 0 300 301 300 294.00 294.00 294.00
## v6 0 0 0 300 300 301 294.00 294.00 294.00
## v7 87 87 87 294 294 294 316.22 315.22 315.22
## v8 87 87 87 294 294 294 315.22 316.22 315.22
## v9 87 87 87 294 294 294 315.22 315.22 316.22
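The printed matrix can be rebuilt from a simple factor model, which makes its block structure easier to read. The following is a minimal base-R sketch, not code from the euspca package. It assumes two independent hidden factors with variances 290 and 300, a third factor equal to 0.3 times the first plus 0.98 times the second plus unit-variance noise, and independent unit-variance noise added to each of the 9 variables. These values are inferred from the printed entries, and the object names are illustrative.

```r
# Assumed model (inferred from the printed matrix, not taken from the package):
# independent sources f1, f2, and eps with variances 290, 300, and 1.
source_var <- diag(c(290, 300, 1))

# Rows map the sources to the three hidden factors:
# factor 1 = f1, factor 2 = f2, factor 3 = 0.3 * f1 + 0.98 * f2 + eps.
A <- rbind(c(1,   0,    0),
           c(0,   1,    0),
           c(0.3, 0.98, 1))
factor_cov <- A %*% source_var %*% t(A)

# Each variable loads on exactly one factor: v1-v3 on factor 1,
# v4-v6 on factor 2, and v7-v9 on factor 3.
B <- kronecker(diag(3), matrix(1, 3, 1))

# Add independent unit-variance noise to every variable.
sigma <- B %*% factor_cov %*% t(B) + diag(9)
dimnames(sigma) <- list(paste0("v", 1:9), paste0("v", 1:9))

print(round(sigma, 2))
```

Under these assumptions the result agrees with the matrix printed by print(syn), for example 291 on the diagonal for v1 and 87 for the covariance between v1 and v7.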
What to expect from the analysis
Ideally, we want the two principal components to capture the two underlying independent hidden factors, one factor per component. The information about these two factors is contained solely in the groups (v1, v2, v3) and (v4, v5, v6), respectively; the remaining variables (v7, v8, v9) covary with both groups and therefore mix the two factors. Additionally, the variables within each of the first two groups are essentially the same (each has variance only 1 larger than its covariance with the other members of its group), so they should be combined with equal weights to capture each factor.
This suggests two sparse linear combinations of the original variables as ideal principal components: one using the variables (v1, v2, v3) with equal weights to capture the first factor, and the other using the variables (v4, v5, v6) with equal weights to capture the second factor.
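The reasoning above can be checked numerically. The sketch below is plain base R and does not call the euspca package. It rebuilds the printed covariance matrix from its group-level structure (the name sigma is a local stand-in for syn), shows that the first two groups have zero cross-covariance, and confirms that the two equal-weight combinations are uncorrelated.

```r
# Group-level covariances read off the printed matrix
# (groups: v1-v3, v4-v6, v7-v9).
G <- matrix(c(290,   0,  87,
                0, 300, 294,
               87, 294, 315.22), nrow = 3, byrow = TRUE)

# Expand to 9 x 9 and add 1 to each diagonal entry (variable-specific noise).
sigma <- kronecker(G, matrix(1, 3, 3)) + diag(9)
dimnames(sigma) <- list(paste0("v", 1:9), paste0("v", 1:9))

# The first two groups do not covary at all.
print(sigma[1:3, 4:6])

# Equal-weight, unit-length combinations of each group.
w_first  <- c(1, 1, 1, 0, 0, 0, 0, 0, 0) / sqrt(3)
w_second <- c(0, 0, 0, 1, 1, 1, 0, 0, 0) / sqrt(3)
W <- rbind(w_first, w_second)

# Covariance matrix of the two candidate components:
# a zero off-diagonal entry means they are uncorrelated.
pc_cov <- W %*% sigma %*% t(W)
print(round(pc_cov, 2))
```

The off-diagonal entries of pc_cov are zero because every covariance between a variable in (v1, v2, v3) and a variable in (v4, v5, v6) is zero.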
Analysis
We apply euspca to the syn dataset, requesting two components (k=2) with lamb=1000, using the following code:
euspca_syn = euspca(syn, is.data.mat=FALSE, k=2, lamb=1000, scale=FALSE, track=NULL)
The resulting loadings, normalized so that each row has unit length, are:
round(euspca_syn$loadings,3)
## v1 v2 v3 v4 v5 v6 v7 v8 v9
## [1,] 0.000 0.000 0.000 0.577 0.577 0.577 0 0 0
## [2,] -0.577 -0.577 -0.577 0.000 0.000 0.000 0 0 0
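The first row is the loading vector for the first sparse principal component, and the second row is the loading vector for the second. Many entries are exactly 0 because the method enforces sparsity, which encourages simpler and more interpretable components. Moreover, the loading vectors match the expectations described earlier: the first component weights v4, v5, and v6 equally, and the second weights v1, v2, and v3 equally. The negative sign on the second row does not change the interpretation, since a loading vector is determined only up to sign.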
To see a summary of the derived principal components, use:
print(euspca_syn)
## 2 uncorrelated sparse PCs
## % of explained var. : 65.04
## % of non-zero loadings : 33.33
##
## Correlation of PCs
## PC1 PC2
## PC1 1 0
## PC2 0 1
## Max. abs. cor. : 0
We can see that about 65% of the total variance in the data is explained by these two components, and that they are exactly uncorrelated: the maximum absolute correlation between them is 0.
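The summary figures can be cross-checked by hand from the reported loadings. The sketch below uses base R only and makes one assumption: that the explained variance is the sum of the component variances divided by the total variance (the trace of the covariance matrix). The package may compute this quantity differently, although the two definitions agree when the components are uncorrelated. The names sigma and loadings are local stand-ins for syn and euspca_syn$loadings.

```r
# Rebuild the printed covariance matrix from its group-level structure.
G <- matrix(c(290,   0,  87,
                0, 300, 294,
               87, 294, 315.22), nrow = 3, byrow = TRUE)
sigma <- kronecker(G, matrix(1, 3, 3)) + diag(9)

# Reported loadings: 0.577 is 1 / sqrt(3) rounded to three decimals.
loadings <- rbind(c( 0,  0,  0, 1, 1, 1, 0, 0, 0),
                  c(-1, -1, -1, 0, 0, 0, 0, 0, 0)) / sqrt(3)

# Covariance and correlation of the two components.
pc_cov <- loadings %*% sigma %*% t(loadings)
pc_cor <- cov2cor(pc_cov)

# Assumed definition: component variances over total variance, in percent.
explained_pct <- 100 * sum(diag(pc_cov)) / sum(diag(sigma))

# Share of loading entries that are non-zero, in percent.
nonzero_pct <- 100 * mean(loadings != 0)

print(round(explained_pct, 2))
print(round(nonzero_pct, 2))
print(round(pc_cor, 2))
```

Under this assumption, the computed values agree with the printed summary: 65.04% explained variance, 33.33% non-zero loadings (6 of 18 entries), and zero correlation between the components.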