RNAseq Questions - How to Combine Gene Annotations (gff/gtf) with Kallisto Counts?

RNAseq Questions - How to Combine Gene Annotations (gff/gtf) with Kallisto Counts?


This is a continuation of our discussion from yesterday’s post. For general context, over the last year, I have been meeting many biologists and training them on NGS RNAseq data analysis. Sometimes they even bring their own research data to the class and learn to analyze as well as see results.

My primary assumption is that the students do not come with a computer science background. Hence I try to minimize the number of new coding concepts they need to learn to analyze their NGS data. On Friday, I posted “A Minimalist R Cheatsheet for NGS Biology”. Those are the core functions I try to stick to unless it is necessary to expand beyond them.

On the other hand, I do not compromise on explaining the statistical concepts. The bioinformaticians and biologists are generally led to consider the analysis programs as “kits”, “recipes” or “pipelines”. That is ok if one chooses to be trained as a technician, but the scientists may like to go a bit deeper.

During the course of the year, I came across many technical questions from the community and like to answer them here in a series of blog posts. If you have a question, feel free to email me at coding4medicine@gmail.com. Some of these are strictly related to data analysis (for example, check “Tryst Between Marilyn Monroe and Albert Einstein”, whereas the others are conceptual.

I am also compile all answers in the RNAseq tutorial in our expert membership section. You can join here.

Today’s question - How to Combine Gene Annotations in gtf with Kallisto Counts?

Yesterday we discussed how to load three Kallisto abundance.csv files in a data frame. One question I often come across is how to also add gene annotation from a separate gff or gtf file.

I like to give you a solution that uses Bioconductor minimally and allows you to leverage the Minimalist R Cheat Sheet. Here is how to proceed.

Use “rtracklayer” to Load Annotations from gff/gtf

In this example, we will use a gtf file. First we will load it in R using the “rtracklayer” package from Bioconductor. Make sure you install it if you have not done already.

The “import” function from “rtracklayer” will load the “gtf” file in R.

annot=rtracklayer::import("weird_organism.gtf")

The data structure “annot” is not a data frame, but we can immediately convert it into a data frame using “as.data.frame” function.

annot=annot %>% as.data.frame
annot %>% head

Remove Redundant Information from the “annot” Data Frame

You will notice that the “annot” data frame has many redundant rows and columns. In the specific one we ran this test one -

  1. We needed to keep only the lines with “transcript” in the “type” column. The other rows like “gene”, “exon” and “CDS” could go.

  2. Moreover, we cared about only two columns - “transcript_id” and “gene_name”.

  3. Finally, we needed to rename “transcript_id” as “target_id” to match the Kallisto convention for the first column in our “combined” RNAseq data frame. We generated that data frame in yesterday’s post.

Here is the full dplyr command to accomplish all these -

annot_small = annot %>% as.data.frame  %>% filter(type=="transcript") %>% select(transcript_id,gene_name) %>% mutate(target_id=transcript_id) %>% select(-transcript_id)

Apology Section

Every command I used here is included in the Minimalist Cheat-sheet. Therefore, no apologies.



Written by M. //