
EUSPCA: Exactly uncorrelated sparse principal component analysis

The R package euspca finds $k$ uncorrelated sparse principal components. The main function, euspca, finds the unnormalized loading matrix $\mathbf V \in \mathbb R^{k \times p}$ by minimizing the following objective:

$$ - \mbox{tr} (\mathbf V \mathbf \Sigma_n^2 \mathbf V^T ) + \lambda \textstyle\sum_{ij} | v_{ij} | ~~\mbox{subject to} ~ \mathbf V \mathbf \Sigma_n \mathbf V^T = \mathbf{I}, $$

where:

  • $\mathbf \Sigma_n \in \mathbb R^{p\times p}$ is the empirical covariance or correlation matrix of the original $p$ variables,

  • $v_{ij}$ is the $(i,j)$-th component of $\mathbf V$, and

  • $\lambda$ is a user-specified value that controls the sparsity of $\mathbf V$.
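
To make the objective concrete, the following base R sketch evaluates the penalized objective and the size of the constraint violation for a given candidate matrix. The helper names `objective_value` and `constraint_gap` are hypothetical and are not part of the euspca package; the toy $\mathbf \Sigma_n$ and $\mathbf V$ are chosen only so that the arithmetic is easy to follow.

```r
# Hypothetical helpers (not part of euspca): evaluate the objective
# -tr(V Sigma^2 V^T) + lambda * sum|v_ij| and the constraint V Sigma V^T = I.
objective_value <- function(V, Sigma, lambda) {
  -sum(diag(V %*% Sigma %*% Sigma %*% t(V))) + lambda * sum(abs(V))
}

constraint_gap <- function(V, Sigma) {
  # Largest absolute deviation of V Sigma V^T from the identity
  max(abs(V %*% Sigma %*% t(V) - diag(nrow(V))))
}

# Toy example: p = 2 variables, k = 1 component
Sigma <- diag(c(4, 1))
V <- matrix(c(0.5, 0), nrow = 1)   # 0.5^2 * 4 = 1, so the constraint holds

obj <- objective_value(V, Sigma, lambda = 2)  # -(0.25 * 16) + 2 * 0.5
gap <- constraint_gap(V, Sigma)
```

Here the trace term rewards variance captured by the rows of $\mathbf V$, while the $\lambda$ term charges for every nonzero entry, so a larger $\lambda$ pushes more entries to zero.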

In this tutorial, we will use the euspca package to find uncorrelated sparse principal components for the syn dataset.

Installation

To get started, load the euspca package into your R session:

library(euspca)

Synthetic data

We will use the syn dataset, which contains the covariance matrix for 9 variables:

  • $\xi_i = \eta_1 + \epsilon_i, ~ i=1,2,3,$

  • $\xi_i = \eta_2 + \epsilon_i, ~ i=4,5,6,$

  • $\xi_i = \eta_3 + \epsilon_i, ~ i=7,8,9,$ and

  • $\epsilon_i, ~ i=1,\ldots,9$, are independent and each follows $N(0,1)$,

where:

  • $\eta_i, ~ i=1,2,3$, are three hidden factors, where:
    • $\eta_1 \sim N(0,290)$,
    • $\eta_2 \sim N(0,300)$, and
    • $\eta_3 = 0.3 \eta_1 + 0.98 \eta_2 + \epsilon$, where $\epsilon \sim N(0,1)$ is independent of $\eta_1$ and $\eta_2$.
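
The population covariance matrix implied by this model can be derived directly in base R, which is a useful check on the definitions. The sketch below is an illustration only: the names `B`, `cov_eta`, `G`, and `sigma` are made up here, and the snippet does not use the euspca package.

```r
# Express (eta1, eta2, eta3) in terms of the independent factors (eta1, eta2);
# eta3 = 0.3 * eta1 + 0.98 * eta2 + noise with variance 1.
B <- rbind(c(1, 0),
           c(0, 1),
           c(0.3, 0.98))
cov_eta <- B %*% diag(c(290, 300)) %*% t(B) + diag(c(0, 0, 1))

# Each hidden factor drives a group of three observed variables (9 x 3 matrix)
G <- kronecker(diag(3), matrix(1, 3, 1))

# Observed covariance: factor part plus independent N(0, 1) errors
sigma <- G %*% cov_eta %*% t(G) + diag(9)
dimnames(sigma) <- list(paste0("v", 1:9), paste0("v", 1:9))
print(round(sigma, 2))
```

For example, $\mbox{Var}(\eta_3) = 0.3^2 \cdot 290 + 0.98^2 \cdot 300 + 1 = 315.22$, so each of $\xi_7, \xi_8, \xi_9$ has variance $316.22$.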

To see what this matrix looks like, use the following code:

data(syn) # load data
print(syn)
##     v1  v2  v3  v4  v5  v6     v7     v8     v9
## v1 291 290 290   0   0   0  87.00  87.00  87.00
## v2 290 291 290   0   0   0  87.00  87.00  87.00
## v3 290 290 291   0   0   0  87.00  87.00  87.00
## v4   0   0   0 301 300 300 294.00 294.00 294.00
## v5   0   0   0 300 301 300 294.00 294.00 294.00
## v6   0   0   0 300 300 301 294.00 294.00 294.00
## v7  87  87  87 294 294 294 316.22 315.22 315.22
## v8  87  87  87 294 294 294 315.22 316.22 315.22
## v9  87  87  87 294 294 294 315.22 315.22 316.22

What to expect from the analysis

Ideally, we want the two principal components to capture the underlying independent factors, $\eta_1$ and $\eta_2$, one each. The information about $\eta_1$ and $\eta_2$ is solely contained in the groups $(\xi_1,\xi_2,\xi_3)$ and $(\xi_4,\xi_5,\xi_6)$, respectively. Additionally, the variables within each group are essentially the same, so they should be combined with equal weights to capture each factor.

This suggests two sparse linear combinations of the original variables as ideal principal components: one using the variables $(\xi_1,\xi_2,\xi_3)$ with equal weights to capture the factor $\eta_1$, and the other using the variables $(\xi_4,\xi_5,\xi_6)$ with equal weights to capture the factor $\eta_2$.

Analysis

We apply euspca to the syn dataset using the following code:

euspca_syn = euspca(syn, is.data.mat = FALSE, k = 2, lamb = 1000, scale = FALSE, track = NULL)

The resulting loadings, normalized so that each row has unit length, are:

round(euspca_syn$loadings,3)
##          v1     v2     v3    v4    v5    v6 v7 v8 v9
## [1,]  0.000  0.000  0.000 0.577 0.577 0.577  0  0  0
## [2,] -0.577 -0.577 -0.577 0.000 0.000 0.000  0  0  0

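The first row is the loading vector for the first sparse principal component, and the second row is for the second. Many entries are exactly 0 because of the sparsity penalty, which encourages simpler and more interpretable components. Moreover, the loading vectors match the ideal components described above: each uses one group of three variables with equal weights (0.577 in absolute value), the first row on (v4, v5, v6) and the second on (v1, v2, v3). The negative sign in the second row does not matter, since a loading vector is determined only up to sign.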

To see a summary of the derived principal components, use:

print(euspca_syn)
## 2 uncorrelated sparse PCs 
## % of explained var. : 65.04 
## % of non-zero loadings : 33.33 
## 
## Correlation of PCs 
##     PC1 PC2
## PC1   1   0
## PC2   0   1
## Max. abs. cor. : 0

We can see that about 65% (65.04%) of the total variance in the data is explained by these two components, that one third of the loadings are non-zero, and that the components are uncorrelated, with a maximum absolute correlation of 0.
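
As a sanity check, the figures in the summary can be reproduced in base R without the package. The sketch below rebuilds the covariance matrix from the factor model, takes the printed loadings with 0.577 read as $1/\sqrt{3}$, and assumes that the explained variance is the sum of the component variances divided by the total variance. That definition is an assumption that happens to agree with the printed 65.04%; it is not confirmed here against the package internals. The names `W`, `pc_cov`, `explained`, and `nonzero` are illustrative.

```r
# Rebuild the 9 x 9 covariance matrix from the factor model
B <- rbind(c(1, 0), c(0, 1), c(0.3, 0.98))
cov_eta <- B %*% diag(c(290, 300)) %*% t(B) + diag(c(0, 0, 1))
G <- kronecker(diag(3), matrix(1, 3, 1))
sigma <- G %*% cov_eta %*% t(G) + diag(9)

# Normalized loadings as printed above (0.577 is 1/sqrt(3), rounded)
W <- rbind(c( 0,  0,  0, 1, 1, 1, 0, 0, 0),
           c(-1, -1, -1, 0, 0, 0, 0, 0, 0)) / sqrt(3)

pc_cov <- W %*% sigma %*% t(W)   # covariance matrix of the two components
pc_cor <- cov2cor(pc_cov)        # their correlation matrix

# Assumed definition: sum of component variances over total variance
explained <- 100 * sum(diag(pc_cov)) / sum(diag(sigma))
nonzero <- 100 * mean(W != 0)    # share of non-zero loadings

round(explained, 2)  # 65.04
round(nonzero, 2)    # 33.33
```

The component variances are 901 and 871, and the total variance is 2724.66, which gives $1772 / 2724.66 \approx 65.04\%$. The off-diagonal entry of `pc_cov` is zero because the covariance between the groups (v1, v2, v3) and (v4, v5, v6) is zero.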