Debugging RNAseq - (ii) Core Concept - Expectation Maximization
We recently started a series on debugging RNAseq analysis. At presest, the programs used for analysis are used as black boxes, and the results are taken on faith. This becomes a problem, when the known genes do not show up as expected. Does that mean the prior knowledge is wrong, or the samples are prepared incorrectly, or some parameters/assumptions in the analysis programs are incorrect? Moreover, who will debug the analysis - bioinformaticians reading all relevant biology papers, or biologists learning about the analysis step?
In my opinion, the later group can and should take responsibility. In these posts, I like to first explain the main concepts behind the “black boxes” in simple language and then give a set of simple code-snippets to make sense of the analysis steps. Speaking of concepts, we will focus on four topics -
- What do the expression analysis programs like Kallisto, Salmon, etc. do?
- What do the statistical analysis programs like DESeq2, edgeR, etc. do?
- How to make sense of and debug Trinity assemblies,
- Why you should avoid “GO analysis” and go for biologically meaningful alternatives.
I am assuming that the readers understand the shotgun sequencing process. In case of RNAseq, the expressed genes are split into small pieces, and the sequences of those pieces are determined by the Illumina (or other) machines. The outputs of the sequencing step are a collection of large files with data as follows -
@EAS20_8_6_1_9_1972/1 trim=6
ACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCG
+
HHHHHHGHHHHFHHGGHHFHHHHHFHHFHFHHHHHFHHHHHFHHHHHHHHFHHFHFDHHGG@BGGHCDHE:;3)7.A973A:AA5>AD9G=D<D
@EAS20_8_6_1_163_1521/1
GCAGAAAACGTTCTGCATTTGCCACTGATGTACCGCCGAACTTCAACACTCGCATGGTTGTTACCTCGTTACCTTTGGTCGAAAAAAAAAGCCCGCACTG
+
HGHHIHHHDHHHHHIHHIHHHHHHHHHBHHHHHFHCFHHHHHHHGHHHHHEHHFHHHGHHIHHHGHGHHHIHFHHHHHGH?5<<;BD>6>?BGEHHGHFG
@EAS20_8_6_1_178_1948/1
ATTCGAGGTAATGCCCCACTGCCAGCAGTTTTTCGACCGGATCGATAACAGTAACGTTGTGACCGCGCGCTTCTAATACGCCGGCCATAATGGCGATCGA
+
GHHHHHHHHHGGHHHHHHGHHHHHHHEHHHHHHHHHHHHHFHHHEHHHHHHHFHFHIHHIHHHIHHHGHIIGGHBGGGHHFFGBHIFFGCIGGEFDG@AG
The second of each four lines above is the sequence of a small piece of a gene. The analysis program attempts to find the expression levels of all genes in different samples starting from those sequences. Biologists intuitively understand the broad steps, namely (i) align those short sequencing reads against the known genes to find the “counts” of reads mapping each gene, (ii) take the ratio of counts of a gene from tissue 1 and tissue 2 to determine fold change. We will explain what programs like Kallisto, Salmon, DESeq2, edgeR add to this process.
Concept 1 - Expectation Maximization to Estimate the Expression Levels
Let us first look into what Kallisto and Salmon do beyond alignment. Why cannot we just use raw alignment numbers?
Programs like Kallisto and Salmon (and poorly written RSEM) have two components -
(i) align reads on the available genes (not genome), (ii) proportionally assign the alignments among multiple isoforms.
Let me explain this second part with an example and a series of conceptual steps.
Step 1
To keep things simple, we assume that your organism has exactly two genes A and B of equal length, and they have no sequence overlap. We also assume that the shotgun reads come uniformly from any part of any gene. Let us say among 2000 shotgun reads in a sample, 600 match gene A and 1400 match gene B. In this case, it is easy to say that expression levels of A and B have a ratio of 3:7.
What if the genes have overlaps in 50% of their lengths? This is increasingly common, when we not only consider the genes but also their multiple isoforms.
In case of overlap, we need to consider three kind of reads - ones matching gene A exclusively, ones matching gene B exclusively and ones matching both reads. In our case, out of 2000 reads, 300 match gene A exclusively, 700 match gene B exclusively and 1000 match both gene A and B. So, the real question is how to split those 1000 reads among A and B.
If we know that the genes A and B are expressed at 3:7 ratio, we can split those 1000 reads among A and B in the same ratio and all numbers will be consistent. In real life, the relative expression level is what we are trying to find through this alignment process. So, how do we split the reads?
Step 2
Let us consider an extreme case to understand the dilemma. Suppose in another experiment 1400 reads map A only, 0 reads map B only and 600 map both A and B. How do we split those 600 among A and B? Moreover, we assume that the sequences of A and B cannot be seen. You do not know how much overlap they have and so on.
Here are a few possibilities -
- Split those 600 equally among A and B. So, A gets 1700 reads and B gets 300.
- Give all those 600 to B. A has 1400 and B has 600.
- Give all those 600 to A. A has 2000 and B has 0.
- Remove those 600 from the analysis. A has 1400, B has 0, and 600 reads are discarded.
Among 1-4, which one do you like?
Let us discuss the merits and demerits of the options. If we choose 4, we will be removing many reads for genes and isoforms with substantial overlap. Our analysis will create bias in favor of rapidly diverging genes.
If we choose 1, unexpressed genes or isoforms may get large expression due to overlap with highly expressed members. For example, if Hoxa1 is strongly expressed in brain and Hoxa2 is unexpressed but happens to have sequence overlap with Hoxa1, it will get some reads assigned to it. This will make the biology wrong for many lowly expressed (but important) transcription factors.
Programs like Kallisto, Salmon and poorly written RSEM solve this problem by using expectation maximization, an iterative method that finds self-consistent solution to both the partioning and determination of expression levels problems.
Step 3
Let us go back to the question of step 1. We will follow and iterative procedure and show how we can get the correct answer of 3:7, based on which the numbers were generated.
Here is how the procedure works. We split the shared reads equally among two genes, compute their relative expression levels and resplit the shared reads based on the newly computed expression levels. This process continues until the results converge.
Initial split -
A B
accounted reads - 300 - 700
unaccounted reads - 500 - 500
Total - 800 - 1200
The expression levels of A:B = 2:3. Therefore, we split the unaccounted reads in the same ratio among A and B.
Next round - split 2:3
A B
accounted reads - 300 - 700
unaccounted reads - 400 - 600
Total - 700 - 1300
Next round - split 7:13
A B
accounted reads - 300 - 700
unaccounted reads - 350 - 650
Total - 650 - 1350
Next round split - 13-27
A B
accounted reads - 300 - 700
unaccounted reads - 325 - 675
Total - 625 - 1375
Next round split - 25-55
A B
accounted reads - 300 - 700
unaccounted reads - 312.5 - 687.5
Total - 612.5 - 1387.5
Next round split - 24.5-55.5
A B
accounted reads - 300 - 700
unaccounted reads - 306.25 - 693.5
Total - 606.25 - 1393.5
If you continue this process, soon you will converge on 3:7 split for the unaccounted reads and also the expression levels of A:B = 3:7.
Can you use these ideas to solve the problem posed in step 2? Which of the possibilities (1-4) will you pick?