Expectation-Maximization (EM) algorithm part I (Introduction)
Introduction
Maximum likelihood estimation (MLE) is a way of estimating the parameters of a statistical model given observations: it finds the parameters that maximize the likelihood of the observations under the assumed model distribution. However, in many real-life problems, some variables the model depends on are not directly observable from the limited data we have; these are called hidden (latent) variables Z. Many problems in genomics involve hidden variables. Typical examples are (i) inferring microbial communities (Z: different communities), (ii) inferring the ancestries of a group of individuals (Z: different ancestries), and (iii) inferring the cell type composition from sequencing data (Z: different cell types). Problems involving hidden variables like these are typically hard to solve by directly performing maximum likelihood estimation.
Motivating example
Using an application to motivate the abstract math makes it more intuitive to understand. So before going into any mathematical detail, let's use the popular 1000 Genomes dataset as an example. In the example dataset, we have approximately 1000 individuals (n) from different ancestries; each individual has ~13k carefully selected SNPs (m). Formally, let's denote the data as X \in \{0, 1\}^{n \times m}, the haplotype matrix, where x_{ij} is the allele of individual i at SNP j. The objective is to learn the ancestry of each individual in an unsupervised fashion, without it being given explicitly.
EM algorithm
Let's assume we believe there are K different ancestries. Denote x_{ij} as the genotype data of individual i at SNP j. Let's say SNP j in individual i is passed down by ancestry k, and ancestry group k has probability \mu_{kj} of passing allele 1 at SNP j to the offspring. Let's denote Z \in \{0, 1\}^{n \times K} as the one-hot ancestry matrix of each individual, so that z_{ik} = 1 means individual i belongs to ancestry group k. The prior distribution of each z_i is a categorical (multinomial) distribution characterized by the mixing proportions \pi = (\pi_1, \dots, \pi_K).
Mathematically:

p(z_{ik} = 1) = \pi_k, \qquad \sum_{k=1}^{K} \pi_k = 1

p(x_{ij} \mid z_{ik} = 1) = \mu_{kj}^{x_{ij}} (1 - \mu_{kj})^{1 - x_{ij}}
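To make the generative model concrete, here is a minimal simulation sketch (not part of the original analysis; the sizes n, m, K and the parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, K = 200, 50, 3            # individuals, SNPs, ancestries (toy sizes)
pi = np.array([0.5, 0.3, 0.2])  # mixing proportions over the K ancestries
mu = rng.uniform(0.05, 0.95, size=(K, m))  # allele-1 frequency per ancestry/SNP

# Draw each individual's ancestry z_i from the categorical prior,
# then draw its haplotype row x_i from the Bernoulli frequencies mu[z_i].
z = rng.choice(K, size=n, p=pi)
X = (rng.random((n, m)) < mu[z]).astype(int)
```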
Given that we observe the ancestry information matrix Z, solving for the parameters \theta = (\pi, \mu) is easy. The complete-data likelihood is

p(X, Z \mid \theta) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left[ \pi_k \prod_{j=1}^{m} \mu_{kj}^{x_{ij}} (1 - \mu_{kj})^{1 - x_{ij}} \right]^{z_{ik}}
Rewriting the above equation in log form, we have

\log p(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left[ \log \pi_k + \sum_{j=1}^{m} \left( x_{ij} \log \mu_{kj} + (1 - x_{ij}) \log (1 - \mu_{kj}) \right) \right]
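When Z is observed, maximizing this log-likelihood has a closed-form solution: each \pi_k is the fraction of individuals in group k, and each \mu_{kj} is the within-group allele mean. A small sketch (the function name `complete_data_mle` is my own, not from the original code):

```python
import numpy as np

def complete_data_mle(X, Z):
    """MLE of (pi, mu) when the one-hot ancestry matrix Z is observed.

    X: (n, m) binary haplotype matrix; Z: (n, K) one-hot assignments.
    """
    n_k = Z.sum(axis=0)            # number of individuals per ancestry group
    pi = n_k / Z.shape[0]          # pi_k = n_k / n
    mu = (Z.T @ X) / n_k[:, None]  # mu_kj = mean of x_ij within group k
    return pi, mu
```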
What makes it hard is when we do not observe the ancestry assignments Z. Then we need to infer the posterior probability of each ancestry given the observations X. But this requires us to know \theta first! This brings us to the first step of the EM algorithm: initialize the parameters as \theta^{(0)}. Given the current estimate \theta^{(t)}, we compute the posterior p(Z \mid X, \theta^{(t)}) and use it to form the expected complete-data log-likelihood, which we maximize to learn the updated parameters \theta^{(t+1)}:

Q(\theta, \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} \left[ \log p(X, Z \mid \theta) \right] = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma_{ik} \left[ \log \pi_k + \sum_{j=1}^{m} \left( x_{ij} \log \mu_{kj} + (1 - x_{ij}) \log (1 - \mu_{kj}) \right) \right]
and the posterior responsibilities \gamma_{ik} can be computed as

\gamma_{ik} = p(z_{ik} = 1 \mid x_i, \theta^{(t)}) = \frac{\pi_k \prod_{j=1}^{m} \mu_{kj}^{x_{ij}} (1 - \mu_{kj})^{1 - x_{ij}}}{\sum_{k'=1}^{K} \pi_{k'} \prod_{j=1}^{m} \mu_{k'j}^{x_{ij}} (1 - \mu_{k'j})^{1 - x_{ij}}}
Computing the responsibilities \gamma is called the Expectation step (E-step). After we learn the posterior, we plug it back to get Q(\theta, \theta^{(t)}) and find the parameters that maximize it (M-step). For this model, the M-step has closed-form updates:

\pi_k = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik}, \qquad \mu_{kj} = \frac{\sum_{i=1}^{n} \gamma_{ik} x_{ij}}{\sum_{i=1}^{n} \gamma_{ik}}

We alternate the two steps until convergence.
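The two alternating steps can be sketched as follows. This is a minimal version of EM for the Bernoulli mixture described above, not the original implementation; the E-step is done in log space for numerical stability, and a small `eps` guards against log(0):

```python
import numpy as np

def em_bernoulli_mixture(X, K, n_iter=100, seed=0, eps=1e-9):
    """EM for a Bernoulli mixture. Returns (pi, mu, gamma)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pi = np.full(K, 1.0 / K)                   # initialize theta^(0)
    mu = rng.uniform(0.25, 0.75, size=(K, m))
    for _ in range(n_iter):
        # E-step: log p(x_i | z_ik = 1) + log pi_k, normalized per row.
        log_lik = X @ np.log(mu + eps).T + (1 - X) @ np.log(1 - mu + eps).T
        log_post = log_lik + np.log(pi + eps)
        log_post -= log_post.max(axis=1, keepdims=True)  # stabilize exp
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form updates for pi and mu.
        n_k = gamma.sum(axis=0)
        pi = n_k / n
        mu = (gamma.T @ X) / (n_k[:, None] + eps)
    return pi, mu, gamma
```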
Experimental Results
Let's now implement the algorithm and see how it performs on the dataset described in the motivating example. After running the EM algorithm to convergence, I label each individual with the ancestry index k that maximizes its posterior probability \gamma_{ik}, for visualization purposes. Then I use PCA to project the data into 2D space. I found that a suitable number of ancestral components K yields the best visual separation.
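The projection step can be sketched as follows. This is a minimal PCA via SVD (not the original visualization code); `gamma` is assumed to come from a finished EM run, and the actual plotting call is only indicated in a comment:

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)                   # center each SNP column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                      # (n, 2) coordinates

# Hypothetical usage with responsibilities `gamma` from a finished EM run:
# labels = gamma.argmax(axis=1)      # hard ancestry label per individual
# coords = pca_2d(X)
# plt.scatter(coords[:, 0], coords[:, 1], c=labels)
```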

References:
- Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
- Siva, Nayanah. "1000 Genomes project." Nature Biotechnology 26.3 (2008): 256-257.