<h1 id="em-pca-post">Expectation-Maximization (EM) algorithm part II (EM-PCA)</h1>
<p>In this section, we will explore another practical application of the EM algorithm: speeding up the computation of PCA. This section assumes the reader has already read my <a href="https://fboyang.github.io/posts/2022/04/blog-post-1/">introduction to the EM algorithm</a>.</p>
<h3 id="classical-pca">Classical PCA</h3>
<p>In classical PCA, given an input matrix $\mathbf{X} \in \mathbb{R}^{N \times M}$ (in genetic settings, $N$ can be the number of individuals and $M$ the number of SNPs), the goal is to find a matrix
$\mathbf{W} \in \mathbb{R}^{M\times K}$ containing $K$ orthonormal column vectors, and the corresponding scores (weights) along each
vector $\mathbf{z}_i \in \mathbb{R}^K$, such that each individual $\mathbf{x}_i \in \mathbb{R}^M$ can
be reconstructed as $\hat{\mathbf{x}}_i = \mathbf{Wz}_i$ with minimum error.
Mathematically, we are trying to minimize:
\(\begin{align}
J(\mathbf{W},\mathbf{Z}) = \frac{1}{N} || \mathbf{X} - \mathbf{ZW}^T ||_F^2
\end{align}\)</p>
<p>Under the constraint that $\mathbf{W}$ has orthonormal columns, the PC
scores for each individual are uncorrelated. Moreover, it can
be shown by induction on $K$ that the optimal $\mathbf{W}$ consists of the
eigenvectors of the sample covariance matrix corresponding to the top-$K$ largest eigenvalues, which implies the projection of $\mathbf{X}$ onto this lower-dimensional orthonormal subspace captures the top-$K$ directions of variance.</p>
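To make the objective concrete, here is a minimal numpy sketch of classical PCA via eigendecomposition of the sample covariance; the function name and the toy data are my own, for illustration only:

```python
import numpy as np

def classical_pca(X, K):
    """Top-K PCA: X is (N, M); returns W (M, K) with orthonormal columns
    and the scores Z (N, K), so X is approximated by Z @ W.T."""
    Xc = X - X.mean(axis=0)                # center each feature
    C = Xc.T @ Xc / Xc.shape[0]            # (M, M) sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :K]            # eigenvectors of the top-K eigenvalues
    return W, Xc @ W                       # scores are projections onto W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))
W, Z = classical_pca(X, K=2)
print(np.allclose(W.T @ W, np.eye(2)))     # columns of W are orthonormal
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, hence the column reversal before taking the first $K$ eigenvectors.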
<h3 id="em-pca">EM PCA</h3>
<p>If we treat the dimensionality-reduction method as a probabilistic model, the score matrix $\mathbf{Z}$ becomes a random variable. Suppose we make the following assumptions:</p>
<ul>
<li>Underlying latent variable has a Gaussian distribution</li>
<li>There is a linear relationship between latent and observed variables</li>
<li>Isotropic Gaussian noise (covariance proportional to an identity matrix) in observed dimension</li>
</ul>
<p>Then we can set up the model as</p>
\[\begin{align}
\mathbf{x} = \mathbf{Wz} + \mathbf{\mu} + \mathbf{\epsilon} \\
P(\mathbf{z}) = \mathcal{N}(\mathbf{\mu}_0,\mathbf{\Sigma}) \\
P(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{Wz} + \mathbf{\mu}, \sigma^2\mathbf{I})
\end{align}\]
<p>Notice here we can assume $\mathbf{\mu}_0 = \pmb{0}$ and $\mathbf{\Sigma} = \mathbf{I}$ without loss of generality: if they are not, we can absorb $\mathbf{W}\mathbf{\mu}_0$ into $\mathbf{\mu}$ and replace $\mathbf{W}$ with $\mathbf{W}' = \mathbf{W}\mathbf{\Sigma}^{1/2}$, so that the whitened latent variable $\mathbf{z}' = \mathbf{\Sigma}^{-1/2}(\mathbf{z} - \mathbf{\mu}_0) \sim \mathcal{N}(\pmb{0},\mathbf{I})$.</p>
<p>Then the marginal probability of $\mathbf{x}$ can be expressed as
\(\begin{align}
p(\mathbf{x}) = \mathcal{N}(\mathbf{\mu},\mathbf{WW}^T+\sigma^2\mathbf{I})
\end{align}\)</p>
<p>To see this, notice that
\(\begin{align}
\mathbb{E}[\mathbf{x}] = \mathbb{E}[\mathbf{\mu+Wz+\epsilon}] = \mathbf{\mu} + \mathbf{W}\mathbb{E}[\mathbf{z}] + \mathbb{E}[\mathbf{\epsilon}] = \mathbf{\mu} \\
\mathrm{Var}(\mathbf{x}) = \mathbb{E}[(\mathbf{x}-\mathbf{\mu})(\mathbf{x}-\mathbf{\mu})^T] = \mathbb{E}[(\mathbf{Wz+\epsilon})(\mathbf{Wz+\epsilon})^T] = \mathbf{WW}^T + \sigma^2\mathbf{I}
\end{align}\)</p>
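These two moments are easy to check by simulation; the following sketch draws samples from the generative model (the dimensions and parameter values are arbitrary) and compares the empirical mean and covariance against $\mathbf{\mu}$ and $\mathbf{WW}^T + \sigma^2\mathbf{I}$:

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, sigma = 4, 2, 0.5
W = rng.normal(size=(M, K))
mu = rng.normal(size=M)

# draw x = W z + mu + eps with z ~ N(0, I), eps ~ N(0, sigma^2 I)
n = 200_000
z = rng.normal(size=(n, K))
eps = sigma * rng.normal(size=(n, M))
x = z @ W.T + mu + eps

# the empirical moments should match mu and W W^T + sigma^2 I
print(np.allclose(x.mean(axis=0), mu, atol=0.05))
print(np.allclose(np.cov(x, rowvar=False), W @ W.T + sigma**2 * np.eye(M), atol=0.1))
```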
<p>Further, the cross-covariance between $\mathbf{x}$ and $\mathbf{z}$ can be easily calculated as
\(\begin{align}
\mathrm{Cov}[\mathbf{x},\mathbf{z}] & = \mathbb{E}[(\mathbf{x}-\mathbf{\mu})(\mathbf{z}-\mathbf{0})^T] \\ &= \mathbb{E}[\mathbf{xz}^T - \mathbf{\mu z}^T] \\ &= \mathbb{E}[\mathbf{(Wz + \mu + \epsilon)z}^T] - \mathbf{\mu}\mathbb{E}[\mathbf{z}^T] \\ &= \mathbf{W}\mathbb{E}[\mathbf{zz}^T] \\ & = \mathbf{W}
\end{align}\)</p>
<p>Then the joint probability is:
\(p\left(\begin{bmatrix} \mathbf{z} \\ \mathbf{x} \end{bmatrix}\right) = \mathcal{N}\left(\begin{bmatrix} \mathbf{z} \\ \mathbf{x} \end{bmatrix} \bigg| \begin{bmatrix} \mathbf{0} \\ \boldsymbol{\mu} \end{bmatrix}, \begin{bmatrix} \mathbf{I} & \mathbf{W}^T \\ \mathbf{W} & \mathbf{WW}^T + \sigma^2\mathbf{I} \end{bmatrix}\right)\)</p>
<p>Applying Gaussian conditional probability, we get:
\(p(\mathbf{z|x}) = \mathcal{N}(\mathbf{z | m, V}), \quad \mathbf{m} = \mathbf{W}^T(\mathbf{WW}^T + \sigma^2\mathbf{I})^{-1}(\mathbf{x} - \boldsymbol{\mu}), \quad \mathbf{V} = \mathbf{I} - \mathbf{W}^T(\mathbf{WW}^T + \sigma^2\mathbf{I})^{-1}\mathbf{W}\)</p>
<p>We can simplify the problem by standardizing our dataset $\pmb{X}$ so that $\pmb{\mu} = \pmb{0}$. This completes the setup for a typical EM algorithm.</p>
<p>In the E-step, we compute the posterior mean of the scores. In the limit $\sigma \to 0$ (for $\sigma \neq 0$ the posterior remains probabilistic, a case we do not discuss here), the posterior collapses onto its mean
\(\begin{align}
\lim_{\sigma \to 0}\mathbb{E}[\mathbf{Z}|\mathbf{X}] = \mathbf{W}^T(\mathbf{WW}^T)^{-1}\mathbf{X}^T
\end{align}\)
Notice that $\pmb{WW}^T$ is an $m \times m$ matrix, which takes $O(m^3)$ to invert. We instead cleverly apply a matrix-inverse identity to transform $\pmb{W}^T(\pmb{WW}^T)^{-1}$ into $(\pmb{W}^T\pmb{W})^{-1}\pmb{W}^T$, which reduces the inverse computation to $O(K^3)$, so that
\(\begin{align}
\pmb{\hat{Z}} = (\pmb{W}^T\pmb{W})^{-1}\pmb{W}^T\pmb{X}^T
\end{align}\)
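The matrix-inverse property used here is the push-through identity $\pmb{W}^T(\pmb{WW}^T + \sigma^2\pmb{I})^{-1} = (\pmb{W}^T\pmb{W} + \sigma^2\pmb{I})^{-1}\pmb{W}^T$, which can be checked numerically; with a small ridge $\sigma^2$ both sides agree, and as $\sigma^2 \to 0$ they approach the pseudoinverse $(\pmb{W}^T\pmb{W})^{-1}\pmb{W}^T$ (the dimensions below are made-up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
m, K, s2 = 500, 3, 1e-3
W = rng.normal(size=(m, K))

# push-through identity: W^T (W W^T + s2 I_m)^{-1} = (W^T W + s2 I_K)^{-1} W^T
lhs = W.T @ np.linalg.inv(W @ W.T + s2 * np.eye(m))   # inverts an m x m matrix: O(m^3)
rhs = np.linalg.solve(W.T @ W + s2 * np.eye(K), W.T)  # inverts a K x K matrix: O(K^3)
print(np.allclose(lhs, rhs))

# as s2 -> 0 this approaches the pseudoinverse (W^T W)^{-1} W^T
print(np.allclose(rhs, np.linalg.pinv(W), atol=1e-3))
```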
In the M-step, we compute the Q function
\(\begin{align}
Q(\theta, \theta^{(t)}) &= \mathbb{E}_{\pmb{Z}|\pmb{X}}[\log p(\pmb{X},\pmb{Z} \vert \pmb{W}, \sigma^2)] \\ &= \sum_{i=1}^n \mathbb{E}_{\pmb{z}_i|\pmb{x}_i}\left[\log p(\pmb{x}_i|\pmb{z}_i) + \log p(\pmb{z}_i)\right]
\end{align}\)
and by taking the partial derivative with respect to $\pmb{W}$ and setting it to zero, we get
\(\begin{align}
\pmb{\hat{W}} = \pmb{X}^T\pmb{\hat{Z}}^T(\pmb{\hat{Z}}\pmb{\hat{Z}}^T)^{-1}
\end{align}\)
We thus complete the construction of the EM-PCA algorithm.
Notice the complexity of the EM-PCA algorithm is dominated by $O(TKmn)$, where $T$ is the number of iterations. The algorithm is linear in both the sample size and the feature dimension, which brings a great advantage when the reduced dimension satisfies $K \ll m$ and $K \ll n$.</p>
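The two alternating updates above can be sketched in a few lines of numpy. This is a zero-noise EM-PCA iteration (essentially alternating least squares); the function name and test data are my own, and the final QR step simply orthonormalizes the learned basis:

```python
import numpy as np

def em_pca(X, K, iters=50, seed=0):
    """Zero-noise EM for PCA. X : (n, m) centered data. Returns an
    orthonormal W (m, K) spanning (approximately) the top-K principal
    subspace. One iteration costs O(Kmn)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], K))
    for _ in range(iters):
        # E-step: Z = (W^T W)^{-1} W^T X^T        (K x n)
        Z = np.linalg.solve(W.T @ W, W.T @ X.T)
        # M-step: W = X^T Z^T (Z Z^T)^{-1}        (m x K)
        W = np.linalg.solve(Z @ Z.T, Z @ X).T
    return np.linalg.qr(W)[0]   # orthonormalize the final basis

# low-rank toy data: 2 strong latent directions plus small isotropic noise
rng = np.random.default_rng(3)
X = (rng.normal(size=(300, 2)) @ rng.normal(size=(2, 20))) * 5
X += 0.1 * rng.normal(size=(300, 20))
X -= X.mean(axis=0)

W_em = em_pca(X, K=2)
_, V = np.linalg.eigh(X.T @ X / len(X))        # eigenvectors, ascending order
W_top = V[:, -2:]                              # top-2 principal directions
# the projectors onto the two subspaces should agree
print(np.allclose(W_em @ W_em.T, W_top @ W_top.T, atol=1e-6))
```

Because the subspace, not the individual vectors, is identified, the comparison is between the two projection matrices rather than the bases themselves.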
<h3 id="experimental-results">Experimental Results</h3>
<p>Now let’s apply the algorithm to the 1000 Genomes dataset to see how well it performs. As the figure shows, EM-PCA converges quickly with a small runtime cost.</p>
<p><img src="/images/EM/em-pca.png" alt="Ancestry inference using EM-PCA algorithm" /></p>
<h4 id="references">References:</h4>
<ul>
<li>Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.</li>
<li>Siva, Nayanah. “1000 Genomes project.” Nature biotechnology 26.3 (2008): 256-257.</li>
</ul>

<h1 id="em-intro-post">Expectation-Maximization (EM) algorithm part I (Introduction)</h1>
<h2 id="introduction">Introduction</h2>
<p>Maximum likelihood estimation (MLE) is a way of estimating the parameters of a statistical model given observations: it finds the parameters that maximize the likelihood of the observations under certain model distribution assumptions. However, in many real-life problems, the model involves quantities that cannot be directly inferred from the limited data we have, called <strong>hidden variables Z</strong>. Many problems in genomics involve hidden variables. Typical examples are (i) inferring microbial communities (Z: different communities), (ii) inferring the ancestries of a group of individuals (Z: different ancestries), and (iii) inferring the cell-type content from sequencing data (Z: different cell types). Problems involving hidden variables like these are typically hard to solve by directly performing maximum likelihood estimation.</p>
<h3 id="motivating-example">Motivating example</h3>
<p>Using an application to motivate the abstract math makes it more intuitive. So before going into any mathematical detail, let’s use the popular 1000 Genomes dataset as an example. In this dataset, we have approximately 1000 individuals ($n$) from different ancestries; each individual has ~13k carefully selected SNPs ($m$). Formally, let’s denote it as <strong>X</strong>, an $ n \times m $ matrix with values in $\{0, 1\}$, i.e., the haplotype matrix. The objective is to learn the ancestry of each individual in an unsupervised fashion, without it being given explicitly.</p>
<h3 id="em-algorithm">EM algorithm</h3>
<p>Let’s assume we believe there are $K$ different ancestries. Denote \(\pmb{X}_{i,j}\) as the genotype data of individual $i$ at SNP $j$. Suppose the $j^{th}$ SNP in individual $i$ is passed down by ancestry $k$, and ancestry group $k$ has probability $f_{j,k}$ of passing the value $1$ at SNP $j$ to the offspring. Let’s denote $\pmb{Z}$ as the ancestry matrix of the individuals, so that $\pmb{Z}_{i,k} = 1$ means individual $i$ belongs to ancestry group $k$. The prior distribution of $\pmb{Z}$ is a multinomial distribution characterized by $ \pmb{\pi} $. (If you wonder why we are making all these assumptions, please check my Li-Stephens HMM blog for more details.)
Mathematically:</p>
\[\pmb{Z}_{i}|\pmb{\pi} \overset{iid}{\sim} Mult(\pmb{\pi}), \quad
\pmb{X}_{i,j} | (\pmb{Z}_{i,k} = 1) \sim Ber(f_{j,k})\]
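To make the generative assumptions concrete, here is a minimal numpy sketch that samples a dataset from this model; the sizes and parameter values below are made-up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K = 1000, 50, 3                      # individuals, SNPs, ancestries
pi = np.array([0.5, 0.3, 0.2])             # ancestry proportions (sum to 1)
f = rng.uniform(0.05, 0.95, size=(m, K))   # f[j, k] = P(X_ij = 1 | ancestry k)

# Z_i ~ Mult(pi): draw one ancestry label per individual
labels = rng.choice(K, size=n, p=pi)
# X_ij | (Z_ik = 1) ~ Ber(f[j, k]), with SNPs independent given ancestry
X = (rng.random(size=(n, m)) < f[:, labels].T).astype(int)
print(X.shape)   # (1000, 50)
```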
<p>Given we have the ancestry information matrix \(\pmb{Z}\), solving for the parameters $ f_{j,k} $ and $ \pmb{\pi} $ is easy:</p>

\[P(\pmb{X}, \pmb{Z} \vert \pmb{f},\pmb{\pi}) = \prod_{i=1}^N \prod_{k=1}^K \left(\pi_k P(\pmb{X}_i\vert \pmb{Z}_{i,k} = 1)\right)^{\mathbb{1}\{\pmb{Z}_{i,k}=1\}}\]
<p>where the last equation is based on the (simple, but usually unrealistic) assumption that the SNPs are independent of each other. Then we can solve the above equation using maximum likelihood estimation easily. For simplicity, let’s use $\pmb{\theta}$ to denote the parameters we want to learn. Rewriting the above equation in log form, we have</p>
\[\log(P(\pmb{X}, \pmb{Z} \vert \pmb{\theta})) = \sum_{i=1}^n\sum_{k=1}^K \mathbb{1}\{\pmb{Z}_{i,k}=1\}\log(\pi_k p(\pmb{X}_i \vert \pmb{f}_k))\]
<p>What makes it hard is when we do not observe the ancestry groups \(\mathbf{Z}\). Then we need to infer the probability of ancestry given the observation \(\mathbf{X}: P(\mathbf{Z} \vert \mathbf{X}, \mathbf{\theta})\). But this requires us to know \(\pmb{\theta}\) first! This brings us to the first step of the EM algorithm: <strong>initialize the parameters</strong> as \(\mathbf{\theta}^{(0)}\). After we <em>cheat</em> by computing the posterior \(P(\mathbf{Z} \vert \mathbf{X}, \mathbf{\theta}^{(0)})\), we can form the expected complete-data log-likelihood to learn the updated parameter \(\mathbf{\theta}\):</p>
\[\begin{align*}
Q(\pmb{\theta}, \pmb{\theta}^{(0)}) &= E_{\pmb{Z}|\pmb{X},\pmb{\theta}^{(0)}} [\log(p(\pmb{X}, \pmb{Z} \vert \pmb{\theta}))] \\
&= E_{\pmb{Z}|\pmb{X},\pmb{\theta}^{(0)}} [\log(p(\pmb{Z} \vert \pmb{\pi})p(\pmb{X} \vert \pmb{Z}, \pmb{f}))] \\
&= \sum_{i=1}^n\sum_{k=1}^K P(z_k \vert x_{i}, \pmb{\theta}^{(0)}) \log(\pi_k \prod_{j=1}^m p(x_{i,j} \vert f_{j,k}))
\end{align*}\]
<p>and the <em>cheated</em> posterior can be learned as</p>
\[\begin{align*}
&P(z_k | x_i, \pmb{\theta}^{(0)} ) =\frac{P(x_i| z_k, \pmb{\theta}^{(0)})P(z_k|\pmb{\theta}^{(0)})}{P(x_i| \pmb{\theta}^{(0)})} \\
& = \frac{\pi_k^{(0)} \prod_{j=1}^m (f_{j,k}^{(0)})^{x_{i,j}}(1-f_{j,k}^{(0)})^{1-x_{i,j}}}{\sum_{k'=1}^K\pi_{k'}^{(0)} \prod_{j=1}^m (f_{j,k'}^{(0)})^{x_{i,j}}(1-f_{j,k'}^{(0)})^{1-x_{i,j}}}
\end{align*}\]
<p>This is called the Expectation step (E-step). After we learn the posterior, we plug it back into $Q(\pmb{\theta},\pmb{\theta}^{(0)})$ and find the parameters $\pmb{\theta}$ that maximize the function (the M-step). We repeat these two steps until convergence.</p>
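For completeness (the maximization is left implicit above), setting the derivatives of $Q$ to zero, with a Lagrange multiplier for the constraint $\sum_k \pi_k = 1$, gives the standard mixture-model updates</p>

\[\begin{align*}
\hat{\pi}_k = \frac{1}{n}\sum_{i=1}^n P(z_k \vert x_i, \pmb{\theta}^{(0)}), \qquad
\hat{f}_{j,k} = \frac{\sum_{i=1}^n P(z_k \vert x_i, \pmb{\theta}^{(0)})\, x_{i,j}}{\sum_{i=1}^n P(z_k \vert x_i, \pmb{\theta}^{(0)})}
\end{align*}\]

<p>Putting both steps together, here is a minimal numpy sketch of the whole loop; function and variable names are my own, and the E-step is done in log space to avoid underflow from products of many Bernoulli terms:

```python
import numpy as np

def em_bernoulli_mixture(X, K, iters=50, seed=0):
    """A minimal sketch of EM for the mixture-of-Bernoullis model above."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pi = np.full(K, 1.0 / K)                     # uniform ancestry proportions
    f = rng.uniform(0.25, 0.75, size=(m, K))     # random init, away from {0, 1}
    for _ in range(iters):
        # E-step: responsibilities r[i, k] = P(z_k | x_i, theta), in log space
        log_post = np.log(pi) + X @ np.log(f) + (1 - X) @ np.log(1 - f)
        log_post -= log_post.max(axis=1, keepdims=True)   # stabilize exp
        r = np.exp(log_post)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: pi_k = mean responsibility, f_jk = weighted allele frequency
        pi = r.mean(axis=0)
        f = np.clip(X.T @ r / r.sum(axis=0), 1e-6, 1 - 1e-6)
    return pi, f, r

# synthetic check: two well-separated ancestries should be recovered
rng = np.random.default_rng(1)
f_true = np.vstack([np.tile([0.9, 0.1], (25, 1)),   # SNPs 1-25 favor group 0
                    np.tile([0.1, 0.9], (25, 1))])  # SNPs 26-50 favor group 1
labels = rng.choice(2, size=500)
X = (rng.random((500, 50)) < f_true[:, labels].T).astype(int)
pi, f, r = em_bernoulli_mixture(X, K=2)
acc = (r.argmax(axis=1) == labels).mean()
print(max(acc, 1 - acc) > 0.95)   # clusters match labels up to relabeling
```

The `clip` on $f$ keeps the parameters away from $0$ and $1$, where the log-likelihood would otherwise produce infinities; the `max(acc, 1 - acc)` accounts for the label-switching ambiguity inherent to mixture models.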
<h3 id="experimental-results">Experimental Results</h3>
<p>Let’s now implement the algorithm and see how it performs on the dataset described in the motivating example. After running the EM algorithm, I label each individual with the index $k$ that gives the maximum posterior probability for visualization purposes. Then I use PCA to project the data into 2D space. I found using $K=4$ ancestral components yields the best visual separation.</p>
<p><img src="/images/EM/em-cluster.jpg" alt="Ancestry inference using EM algorithm" /></p>
<h4 id="references-1">References:</h4>
<ul>
<li>Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.</li>
<li>Siva, Nayanah. “1000 Genomes project.” Nature biotechnology 26.3 (2008): 256-257.</li>
</ul>