Debugging RNAseq - (iv) Effective Length and TPM

Debugging RNAseq - (iv) Effective Length and TPM


In previous two posts on RNAseq concepts (here and here), we explained the inner workings of programs like Kallisto and Salmon based on a simple example. We also created a small simulated set identical to the example, ran Kallisto on it and got results matching theory.

Apart from confirming theory, the simulation exercise provides insight into the output columns like effective length and TPM. For explanation of various terms, you can check blog posts by Harold Pimental and Rob Patro. If they appear too complicated, here is an easy way to understand them.

Let us look at the Kallisto output (“abundance.tsv”) again.

target_id       length  eff_length      est_counts      tpm
geneA   2000    1776    597.473 298736
geneB   2000    1776    1402.53 701264

We already know how “est_counts” is derived. Among 2000 reads, ~600 matched geneA and ~1400 matched geneB. Those numbers are reflected in the “est_counts” column.

The last column (“tpm”) can be derived easily from “est_counts” in the following way.

tpm = 1e6 * (est_counts/2000) =est_counts * 500

To understand “eff_length”, we need to go back to how the simulated reads were generated. From the previous post, “we sampled 600 225nt fragments randomly from the geneA and 1400 from geneB. From those 225nt fragments, we created two FASTQ files with 100nt forward and reverse reads.”

Effective length (“eff_length”) is gene length minus insert size.

eff_length = gene_length - insert_size = 2000 - 225 = 1775

The best way to learn is to run the simulation with other variations of the parameters and see how the Kallisto (or Salmon) output changes. I will post a few more examples, but if you like to experiment on your own, here is the PERL code to generate the genes. If you like Python or R versions, let me know, and I can generate them easily for you.

$seq="";

for($i=0; $i<1000; $i++)
{
        $x=rand;
        $x=int(4*$x);
        $seq .="A" if($x==0);
        $seq .="C" if($x==1);
        $seq .="T" if($x==2);
        $seq .="G" if($x==3);
}

print "$seq\n";

I created three random 1000nt pieces using the above code and then manually stiched them together to create geneA and geneB.

Here is the code to create the FASTQ reads from random positions of the genes -

use Useful;
%a=Useful::read_FASTA("genes.fa");

$count=0;

open(OUT1,">file1.fq");
open(OUT2,">file2.fq");


## GeneA
while($count<600)
{
        $x=rand; $x=int((2000-225)*$x);
        $sub=substr($a{"geneA"},$x,225);
        $L=length($sub);
        if($L==225)
        {
                $f=substr($sub,0,100);
                $o=$f; $o=~s/[ATGC]/X/g;
                print OUT1 "\@NS500156\n$f\n+\n$o\n";
                $sub=Useful::reverse($sub);
                $r=substr($sub,0,100);
                $o=$r; $o=~s/[ATGC]/X/g;
                print OUT2 "\@NS500156\n$r\n+\n$o\n";
                $count++;
        }
}


## GeneB
$count=0;
while($count<1400)
{
        $x=rand; $x=int((2000-225)*$x);
        $sub=substr($a{"geneB"},$x,225);
        $L=length($sub);
        if($L==225)
        {
                $f=substr($sub,0,100);
                $o=$f; $o=~s/[ATGC]/X/g;
                print OUT1 "\@NS500156\n$f\n+\n$o\n";
                $sub=Useful::reverse($sub);
                $r=substr($sub,0,100);
                $o=$r; $o=~s/[ATGC]/X/g;
                print OUT2 "\@NS500156\n$r\n+\n$o\n";
                $count++;
        }
}

Here is Useful.pm mentioned in the above code -

package Useful;
use strict;

### read FASTA file in %assoc array ###
sub read_FASTA
{
        my($filename,$cond)=@_;
        open(IN,$filename);
        my $name; my %assoc; my $seq;
        while(<IN>)
        {
                if($_=~/>(\S+)(.*)/)
                {
                        $name=$1;
                        $assoc{$name}="";
                        ### $name=~s/.*\|//g;
                }
                else
                {
                        if ($_=~/^(.*)\s*$/)
                        {
                                $seq=$1;
                                $seq=uc($seq) if($cond==0);
                                $assoc{$name}.=$seq;
                        }
                }
        }
        close(IN);
        return(%assoc);
}


Written by M. //