In this section, we will explore another practical application of the EM algorithm: speeding up the computation of PCA. It assumes that you have already read my introductory post on the EM algorithm.
Maximum likelihood estimation (MLE) is a way of estimating the parameters of a statistical model from observed data: it finds the parameters that maximize the likelihood of the observations under the assumed model distribution. However, many real-life problems involve quantities that we never observe directly and cannot infer straightforwardly from the limited data we have; these are called hidden (latent) variables Z. Many problems in genomics involve hidden variables. Typical examples are (i) inferring microbial community composition (Z: the different communities), (ii) inferring the ancestries of a group of individuals (Z: the different ancestries), and (iii) inferring cell type composition from sequencing data (Z: the different cell types). For problems like these, direct maximum likelihood estimation is typically hard, because the hidden variables have to be summed (or integrated) out, and the resulting log-likelihood, a log of a sum over Z, has no closed-form maximizer.
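To make this concrete, here is a minimal sketch (a hypothetical toy example with made-up parameters, not from the original post) of a two-component Gaussian mixture, where the hidden variable Z is the component, think cell type, that each observation came from. When Z is observed, the complete-data log-likelihood factorizes and is easy to maximize; when Z is hidden, the observed-data log-likelihood puts a sum over Z inside the log, which is exactly the situation the EM algorithm is designed to handle.

```python
# Toy two-component Gaussian mixture (hypothetical example for illustration).
# Z = hidden component label for each observation; x = observed value.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# True (in practice unknown) parameters: mixing weight and component means.
true_pi, true_mu = 0.3, np.array([-2.0, 3.0])

# Simulate data: Z is generated but then treated as hidden; only x is observed.
z = rng.binomial(1, 1 - true_pi, size=500)   # hidden labels, P(z = 0) = true_pi
x = rng.normal(loc=true_mu[z], scale=1.0)    # observed values

def complete_log_lik(x, z, pi, mu):
    """Log-likelihood if Z were observed: factorizes and is easy to maximize."""
    log_pz = np.where(z == 0, np.log(pi), np.log(1 - pi))
    return np.sum(log_pz + norm.logpdf(x, loc=mu[z], scale=1.0))

def observed_log_lik(x, pi, mu):
    """Marginal log-likelihood of the data we actually have: the sum over Z
    sits inside the log, so there is no closed-form maximizer -- this is
    where EM comes in."""
    dens = pi * norm.pdf(x, mu[0], 1.0) + (1 - pi) * norm.pdf(x, mu[1], 1.0)
    return np.sum(np.log(dens))

print(complete_log_lik(x, z, true_pi, true_mu))
print(observed_log_lik(x, true_pi, true_mu))
```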