We recently started a series on debugging RNAseq analysis. In the first post of this series, we explained how the programs like Kallisto and Salmon work based on a simple example. This covered the basics of Expectation Maximization.
For this post, we created a data set exactly matching the example of the last post so that you can see that theory agrees with practice. Please check this github repo, where we created two synthetic genes (geneA and geneB) of length 2000nt. The last 1000 nt of these genes are identical and the first 1000 nt are completely different.
Next, we sampled 600 225nt fragments randomly from the geneA and 1400 from geneB. From those 225nt fragments, we created two FASTQ files with 100nt forward and reverse reads. This is exactly how Illumina sequences paired end reads, except that real sequencers introduce some degree of randomness in insert size and errors. Given that 50% of geneA and geneB are identical, we expact at least half of the reads match both genes.
We ran kallisto using default parameter -
kallisto index -i id genes.fa kallisto quant -i id -o kallisto-output file1.fq.gz file2.fq.gz
and got this output (“abundance.tsv”) -
target_id length eff_length est_counts tpm geneA 2000 1776 597.473 298736 geneB 2000 1776 1402.53 701264
Based on “est_counts” column, the program assigned 597 reads to geneA and 1403 reads to geneB just like we expected.
Feel free to try Salmon with the same example, or create other examples with more genes to better understand the behavior of these programs.