RNAseq Questions - How to Load and Combine Salmon Data in R?


Over the last year, I have been meeting many biologists and training them in NGS RNAseq data analysis. Sometimes they even bring their own research data to the class, learn to analyze it, and see the results.

My primary assumption is that the students do not come with a computer science background. Hence I try to minimize the number of new coding concepts they need to learn to analyze their NGS data. On Friday, I posted “A Minimalist R Cheatsheet for NGS Biology”. Those are the core functions I try to stick to unless it is necessary to expand beyond them.

On the other hand, I do not compromise on explaining the statistical concepts. Bioinformaticians and biologists are generally led to treat analysis programs as “kits”, “recipes” or “pipelines”. That is fine for someone who chooses to be trained as a technician, but scientists may want to go a bit deeper.

During the course of the year, I came across many technical questions from the community and would like to answer them here in a series of blog posts. If you have a question, feel free to email me at coding4medicine@gmail.com. Some of these are strictly related to data analysis (for example, check “Tryst Between Marilyn Monroe and Albert Einstein”), whereas the others are conceptual.

I will also compile all answers in the RNAseq tutorial in our expert membership section. You can join here.

Today’s question - How to Load Data from Salmon Analysis in R?

In two previous blog posts (here and here), I showed how to load Kallisto results into a data frame in R. Here you will see that your scripts need only minimal changes to load Salmon data.

The Salmon manual describes the file format for Salmon output here. Only two differences from Kallisto concern us -

  1. Files are stored in “quant.sf” instead of “abundance.tsv”.
  2. The columns of interest are called “Name” and “TPM” instead of “target_id” and “tpm” (Yes, case matters).
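
As a quick check of those two differences, the sketch below writes a tiny made-up “quant.sf” into a temporary folder and reads it back. The column layout (Name, Length, EffectiveLength, TPM, NumReads) follows the Salmon documentation; the transcript IDs and all numbers are invented for illustration, and the temporary folder is only there so that the example runs without real data.

```r
library(readr)   # read_tsv, write_tsv
library(dplyr)   # %>%, mutate, select

# Made-up Salmon-style table. IDs and numbers are invented, not real results.
mock = data.frame(
        Name = c("tx1", "tx2", "tx3"),
        Length = c(1200, 800, 2300),
        EffectiveLength = c(1050.5, 650.2, 2150.0),
        TPM = c(10.5, 0, 250.25),
        NumReads = c(42, 0, 1900)
)

# Write it where a Salmon run for a sample called "brain" would put it
sample_dir = file.path(tempdir(), "brain")
dir.create(sample_dir, showWarnings = FALSE)
write_tsv(mock, file.path(sample_dir, "quant.sf"))

# Read it back and look at the column names (note the capital letters)
raw = read_tsv(file.path(sample_dir, "quant.sf"))
salmon_columns = colnames(raw)
print(salmon_columns)

# Keep only the transcript ID and the TPM value, named after the sample
brain = raw %>% mutate(brain = TPM, target_id = Name) %>% select(target_id, brain)
print(brain)
```

Writing `tpm` or `name` in lowercase inside `mutate` would fail with an "object not found" error, which is why the case of the column names matters.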

Therefore, here is the modified code for loading Salmon data -

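library(readr)   # read_tsv
library(dplyr)   # %>%, mutate, select, inner_join

# Each sample folder holds one Salmon output file named quant.sf
brain=read_tsv("brain/quant.sf")
brain=brain %>% mutate(brain=TPM,target_id=Name) %>% select(target_id,brain)

heart=read_tsv("heart/quant.sf")
heart=heart %>% mutate(heart=TPM,target_id=Name) %>% select(target_id,heart)

muscle=read_tsv("muscle/quant.sf")
muscle=muscle %>% mutate(muscle=TPM,target_id=Name) %>% select(target_id,muscle)

# Join the three tables on the shared transcript ID column
combined=brain %>% inner_join(heart, by="target_id") %>% inner_join(muscle, by="target_id")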
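
It may help to see what `inner_join` does to the rows when the tables do not contain exactly the same transcripts. The small sketch below uses three made-up tables (the IDs and TPM values are invented), with one transcript deliberately left out of the muscle table.

```r
library(dplyr)   # %>%, inner_join

# Made-up per-sample tables, already reduced to target_id plus one TPM column
brain = data.frame(target_id = c("tx1", "tx2", "tx3"), brain = c(10.5, 0, 250.25))
heart = data.frame(target_id = c("tx1", "tx2", "tx3"), heart = c(3.2, 18.0, 0))
# tx3 is left out on purpose
muscle = data.frame(target_id = c("tx1", "tx2"), muscle = c(0, 7.7))

combined = brain %>% inner_join(heart, by = "target_id") %>% inner_join(muscle, by = "target_id")
print(combined)
```

Only tx1 and tx2 survive, because `inner_join` keeps a row only when its `target_id` is present in both tables being joined. When every sample is quantified against the same reference, all files list the same transcripts and no rows are lost.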

The code for loading a large number of Salmon files can be written by modifying this answer with the same two simple changes -

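library(readr)   # read_tsv
library(dplyr)   # %>%, mutate, select, inner_join

# One entry per sample folder; each folder contains a quant.sf file
names=c("brain", "heart", "muscle")

for (i in 1:length(names)) {

        # Build the text of the read command, e.g. df=read_tsv("brain/quant.sf"), then run it
        command_parts=c('df=read_tsv("', names[i] , '/quant.sf")')
        command=paste(command_parts, sep='', collapse='')
        eval(parse(text=command))

        # Build and run the command that keeps target_id and a TPM column named after the sample
        command_parts=c('df=df %>% mutate(', names[i], '=TPM, target_id=Name) %>% select(target_id,',  names[i] , ')')
        command=paste(command_parts, sep='', collapse='')
        eval(parse(text=command))

        # The first sample starts the table; every later sample is joined to it
        if(i==1)
        {
                combined=df
        }
        else
        {
                combined=combined %>% inner_join(df, by="target_id")
        }
}

head(combined)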
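
For readers who would rather not build commands as text, here is one possible alternative sketch of the same loop. The helper name `load_salmon` and its `base_dir` argument are hypothetical (they are not part of Salmon or of any package), and the mock files it reads are invented so that the example runs without real data.

```r
library(readr)   # read_tsv, write_tsv
library(dplyr)   # %>%, select, inner_join

# Hypothetical helper: reads base_dir/<sample>/quant.sf for every sample
# and joins the TPM columns into one table keyed by target_id.
load_salmon = function(samples, base_dir = ".") {
        combined = NULL
        for (s in samples) {
                df = read_tsv(file.path(base_dir, s, "quant.sf"))
                df = df %>% select(target_id = Name, TPM)
                colnames(df)[2] = s     # name the TPM column after the sample
                if (is.null(combined)) {
                        combined = df
                } else {
                        combined = combined %>% inner_join(df, by = "target_id")
                }
        }
        combined
}

# Mock data: three sample folders with made-up numbers
samples = c("brain", "heart", "muscle")
base_dir = file.path(tempdir(), "salmon_demo")
for (k in 1:length(samples)) {
        dir.create(file.path(base_dir, samples[k]), recursive = TRUE, showWarnings = FALSE)
        mock = data.frame(
                Name = c("tx1", "tx2", "tx3"),
                Length = c(1200, 800, 2300),
                EffectiveLength = c(1050, 650, 2150),
                TPM = c(1, 2, 3) * k,
                NumReads = c(42, 7, 1900)
        )
        write_tsv(mock, file.path(base_dir, samples[k], "quant.sf"))
}

combined = load_salmon(samples, base_dir)
head(combined)
```

With real data, `load_salmon(c("brain", "heart", "muscle"))` would look for the three folders in the current working directory. The trade-off is one extra concept (assigning to `colnames`) in exchange for dropping `eval(parse(...))`.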


Written by M.