
Expectation-Maximization (EM) algorithm part II (EM-PCA)

In this section, we will explore another practical application of the EM algorithm: speeding up the computation of PCA. This section assumes readers have already read the introduction to the EM algorithm in Part I.

Classical PCA

In classical PCA, given an input matrix $\mathbf{X} \in \mathbb{R}^{N \times M}$ (in genetic settings, $N$ can be the number of individuals and $M$ the number of SNPs), the goal is to find an orthonormal matrix $\mathbf{W} \in \mathbb{R}^{M \times K}$ containing $K$ orthonormal vectors, together with the corresponding scores (weights) along each vector, $\mathbf{z}_i \in \mathbb{R}^K$, such that each individual $\mathbf{x}_i \in \mathbb{R}^M$ can be reconstructed as $\hat{\mathbf{x}}_i = \mathbf{W}\mathbf{z}_i$ with minimum error. Mathematically, we are trying to minimize:

$$J(\mathbf{W},\mathbf{Z}) = \frac{1}{N} \left\| \mathbf{X} - \mathbf{Z}\mathbf{W}^T \right\|_F^2$$

under the constraint that $\mathbf{W}$ is an orthonormal matrix, so the PC scores for each individual are uncorrelated. Moreover, it can be proved by induction that the optimal $\mathbf{W}$ consists of the eigenvectors corresponding to the top-$K$ largest eigenvalues of the sample covariance, which implies it captures the top-$K$ directions of variance when projecting $\mathbf{X}$ into the lower-dimensional orthonormal subspace.
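As a concrete illustration, here is a minimal NumPy sketch of classical PCA via an eigendecomposition of the sample covariance. The toy data, dimensions, and variable names are my own, not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 200, 10, 3  # individuals, features (e.g. SNPs), components

# toy data from a low-rank-plus-noise model, then column-centered
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, M)) + 0.1 * rng.normal(size=(N, M))
X = X - X.mean(axis=0)

# eigendecomposition of the sample covariance; top-K eigenvectors form W
cov = X.T @ X / N
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
W = eigvecs[:, ::-1][:, :K]             # top-K eigenvectors, shape (M, K)

Z = X @ W                               # scores, shape (N, K)
X_hat = Z @ W.T                         # reconstruction
err = np.linalg.norm(X - X_hat, "fro") ** 2 / N
```

Since the residual lies in the discarded eigendirections, `err` equals the sum of the $M-K$ smallest eigenvalues of the sample covariance, which is small here because the data are nearly rank $K$.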

EM PCA

If we treat the dimensionality reduction method as a probabilistic model, then the score matrix $\mathbf{Z}$ becomes a random variable. Suppose we make the following assumptions:

  • The underlying latent variables have a Gaussian distribution
  • There is a linear relationship between the latent and observed variables
  • The noise in the observed dimensions is isotropic Gaussian (covariance proportional to an identity matrix)

Then we can set up the model as

$$\begin{aligned} \mathbf{x} &= \mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon} \\ P(\mathbf{z}) &= \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}) \\ P(\mathbf{x} \mid \mathbf{z}) &= \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2\mathbf{I}) \end{aligned}$$

Notice here we can assume $\boldsymbol{\mu}_0 = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$ without losing generality: if they are not, we can always reparameterize with $\mathbf{W}' = \mathbf{W}\mathbf{U}$ for a suitable $\mathbf{U}$ such that $\mathbf{U}^{-1}\mathbf{z} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, absorbing the latent mean into $\boldsymbol{\mu}$.

Then the marginal probability of $\mathbf{x}$ can be expressed as

$$p(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu},\ \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I})$$
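This marginal can be sanity-checked by Monte Carlo: sample from the generative model and compare the empirical mean and covariance of $\mathbf{x}$ with $\boldsymbol{\mu}$ and $\mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}$. A quick sketch with arbitrary toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, sigma = 5, 2, 0.3  # observed dim, latent dim, noise std (toy values)
W = rng.normal(size=(M, K))
mu = rng.normal(size=M)

n = 200_000
Z = rng.normal(size=(n, K))            # z ~ N(0, I)
eps = sigma * rng.normal(size=(n, M))  # isotropic Gaussian noise
X = Z @ W.T + mu + eps                 # x = Wz + mu + eps

emp_cov = np.cov(X, rowvar=False)              # empirical covariance of x
model_cov = W @ W.T + sigma**2 * np.eye(M)     # WW^T + sigma^2 I
```

With enough samples, `emp_cov` matches `model_cov` up to Monte Carlo error.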

To see this, notice that

$$\begin{aligned} \mathbb{E}[\mathbf{x}] &= \mathbb{E}[\boldsymbol{\mu} + \mathbf{W}\mathbf{z} + \boldsymbol{\epsilon}] = \boldsymbol{\mu} + \mathbf{W}\mathbb{E}[\mathbf{z}] + \mathbb{E}[\boldsymbol{\epsilon}] = \boldsymbol{\mu} \\ \mathrm{Var}(\mathbf{x}) &= \mathbb{E}[(\mathbf{W}\mathbf{z} + \boldsymbol{\epsilon})(\mathbf{W}\mathbf{z} + \boldsymbol{\epsilon})^T] = \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I} \end{aligned}$$

Further, the cross-covariance between $\mathbf{x}$ and $\mathbf{z}$ can be easily calculated as

$$\begin{aligned} \mathrm{Cov}[\mathbf{x},\mathbf{z}] &= \mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{z} - \mathbf{0})^T] \\ &= \mathbb{E}[\mathbf{x}\mathbf{z}^T - \boldsymbol{\mu}\mathbf{z}^T] \\ &= \mathbb{E}[(\mathbf{W}\mathbf{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon})\mathbf{z}^T] - \boldsymbol{\mu}\mathbb{E}[\mathbf{z}^T] \\ &= \mathbf{W}\mathbb{E}[\mathbf{z}\mathbf{z}^T] \\ &= \mathbf{W} \end{aligned}$$

Then the joint probability is:

$$p\left(\begin{bmatrix} \mathbf{z} \\ \mathbf{x} \end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix} \mathbf{z} \\ \mathbf{x} \end{bmatrix} \,\middle|\, \begin{bmatrix} \mathbf{0} \\ \boldsymbol{\mu} \end{bmatrix}, \begin{bmatrix} \mathbf{I} & \mathbf{W}^T \\ \mathbf{W} & \mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I} \end{bmatrix}\right)$$

Applying the Gaussian conditioning formula, we get:

$$p(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z} \mid \mathbf{m}, \mathbf{V}), \quad \mathbf{m} = \mathbf{W}^T(\mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I})^{-1}(\mathbf{x} - \boldsymbol{\mu}), \quad \mathbf{V} = \mathbf{I} - \mathbf{W}^T(\mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I})^{-1}\mathbf{W}$$

We can simplify the problem by standardizing our dataset $\mathbf{X}$ so that $\boldsymbol{\mu} = \mathbf{0}$. This completes the setup of a typical EM algorithm.

In the E-step, we compute

$$\lim_{\sigma \to 0} \mathbb{E}[\mathbf{Z} \mid \mathbf{X}] = \mathbf{W}^T(\mathbf{W}\mathbf{W}^T)^{-1}\mathbf{X}^T$$

If $\sigma \neq 0$, the E-step remains probabilistic, a case we do not discuss here. Notice that $\mathbf{W}\mathbf{W}^T$ is an $M \times M$ matrix, whose inverse takes $O(M^3)$ to compute. We can instead cleverly apply the matrix inverse (push-through) identity to transform $\mathbf{W}^T(\mathbf{W}\mathbf{W}^T)^{-1}$ into $(\mathbf{W}^T\mathbf{W})^{-1}\mathbf{W}^T$, which reduces the inverse computation to $O(K^3)$, so that

$$\hat{\mathbf{Z}} = (\mathbf{W}^T\mathbf{W})^{-1}\mathbf{W}^T\mathbf{X}^T$$
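The identity behind this speed-up, $\mathbf{W}^T(\mathbf{W}\mathbf{W}^T + \sigma^2\mathbf{I}_M)^{-1} = (\mathbf{W}^T\mathbf{W} + \sigma^2\mathbf{I}_K)^{-1}\mathbf{W}^T$, holds exactly for any $\sigma > 0$ (multiply both sides by the matrices being inverted to verify), and taking $\sigma \to 0$ gives the form above. A small numerical check, with arbitrary toy dimensions of my choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
M, K, sigma = 50, 4, 0.1
W = rng.normal(size=(M, K))

# push-through identity: W^T (W W^T + s^2 I_M)^{-1} = (W^T W + s^2 I_K)^{-1} W^T
left = W.T @ np.linalg.inv(W @ W.T + sigma**2 * np.eye(M))   # inverts an M x M matrix
right = np.linalg.inv(W.T @ W + sigma**2 * np.eye(K)) @ W.T  # inverts only a K x K matrix

# E-step in the cheap K x K form: Z_hat = (W^T W)^{-1} W^T X^T
X = rng.normal(size=(30, M))                  # toy data matrix, N = 30
Z_hat = np.linalg.solve(W.T @ W, W.T @ X.T)   # shape (K, N)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is the standard numerically preferable way to apply $(\mathbf{W}^T\mathbf{W})^{-1}$.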

In the M-step, we maximize the Q function

$$\begin{aligned} Q(\theta, \theta^{(t)}) &= \mathbb{E}\left[\log p(\mathbf{X},\mathbf{Z} \mid \mathbf{W}, \sigma^2)\right] \\ &= \sum_{i=1}^N \mathbb{E}_{p(\mathbf{z}_i \mid \mathbf{x}_i)}\left[\log p(\mathbf{x}_i \mid \mathbf{z}_i) + \log p(\mathbf{z}_i)\right] \end{aligned}$$

and by taking the partial derivative with respect to $\mathbf{W}$ and setting it to zero, we get

$$\hat{\mathbf{W}} = \mathbf{X}^T\hat{\mathbf{Z}}^T(\hat{\mathbf{Z}}\hat{\mathbf{Z}}^T)^{-1}$$

We thus complete the construction of the EM-PCA algorithm. Notice that its complexity is dominated by $O(TKMN)$, where $T$ is the number of iterations. The algorithm is linear in both the sample size and the feature dimension, which brings a great advantage when the reduced dimension satisfies $K \ll M$ and $K \ll N$.
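Putting the two steps together, a minimal sketch of the $\sigma \to 0$ EM-PCA loop might look like the following (the function name `em_pca` and the toy data are my own; it assumes a centered data matrix of shape $N \times M$). EM recovers the principal subspace up to a rotation, so we compare it against the SVD subspace via principal angles rather than element-wise:

```python
import numpy as np

def em_pca(X, K, n_iter=100, seed=0):
    """sigma -> 0 EM for PCA. X: centered (N, M). Returns W (M, K), Z (K, N)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], K))
    for _ in range(n_iter):
        # E-step: Z = (W^T W)^{-1} W^T X^T, a K x K solve
        Z = np.linalg.solve(W.T @ W, W.T @ X.T)
        # M-step: W = X^T Z^T (Z Z^T)^{-1}, another K x K solve
        W = np.linalg.solve(Z @ Z.T, Z @ X).T
    return W, Z

rng = np.random.default_rng(3)
N, M, K = 500, 40, 3
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, M)) + 0.05 * rng.normal(size=(N, M))
X = X - X.mean(axis=0)

W_em, Z_em = em_pca(X, K)

# reference subspace from the SVD of X
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W_svd = Vt[:K].T

# singular values of Q_em^T W_svd are cosines of the principal angles
Q_em, _ = np.linalg.qr(W_em)  # orthonormal basis of the EM subspace
angles = np.linalg.svd(Q_em.T @ W_svd, compute_uv=False)
```

All cosines being (numerically) 1 means the EM subspace coincides with the top-$K$ SVD subspace, while each iteration only solves $K \times K$ systems, consistent with the $O(TKMN)$ complexity above.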

Experimental Results

Now let’s apply the algorithm to the 1000 Genomes dataset to see how well it performs. As the figure below shows, EM-PCA has a fast convergence rate and a small runtime cost.

Figure: Ancestry inference using the EM-PCA algorithm.

References:

  • Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
  • Siva, Nayanah. “1000 Genomes project.” Nature Biotechnology 26.3 (2008): 256–257.