RNAseq Questions - Loading Many Kallisto Count Files in R

RNAseq Questions - Loading Many Kallisto Count Files in R


This is a continuation of our discussion from yesterday’s post. For general context, over the last year, I have been meeting many biologists and training them on NGS RNAseq data analysis. Sometimes they even bring their own research data to the class and learn to analyze as well as see results.

My primary assumption is that the students do not come with a computer science background. Hence I try to minimize the number of new coding concepts they need to learn to analyze their NGS data. On Friday, I posted “A Minimalist R Cheatsheet for NGS Biology”. Those are the core functions I try to stick to unless it is necessary to expand beyond them.

On the other hand, I do not compromise on explaining the statistical concepts. The bioinformaticians and biologists are generally led to consider the analysis programs as “kits”, “recipes” or “pipelines”. That is ok if one chooses to be trained as a technician, but the scientists may like to go a bit deeper.

During the course of the year, I came across many technical questions from the community and like to answer them here in a series of blog posts. If you have a question, feel free to email me at coding4medicine@gmail.com. Some of these are strictly related to data analysis (for example, check “Tryst Between Marilyn Monroe and Albert Einstein”, whereas the others are conceptual.

I am also compile all answers in the RNAseq tutorial in our expert membership section. You can join here.

Today’s question - How to Load Many Files in R with Kallisto Analysis Data?

Yesterday we discussed how to load three Kallisto abundance.csv files. What will you do, if you have 30 or, even worse, 300 samples? You can type all of those 300 names one after another, but there are two easier ways. Also, one of those ways will teach you something new and useful about R.

One option is to use the recent Bioconductor package “tximport”. We do not cover it in this post.

This post discusses an approach that allows you to scale up all kinds of analysis in R for hundreds of files. Maybe you will come across a data analysis situation, where there is no ready-made package, or maybe you want to perform a different kind of analysis from the Kallisto data sets (e.g. compare the “eff_length” columns of the output files).

“eval” and “parse”

We introduce two new R commands - “eval” and “parse”. They are not discussed in my Minimalist R Cheat Sheet, because these commands are not meant for the beginners.

Here is what they do. If you can create a R command as a string (e.g. “str”), then “eval(parse(text=s)” will execute the command. Here is an example.

x=7
str="5+x"
eval(parse(text=str))

Rewrite Yesterday’s Code for Loading Kallisto Counts

Yesterday, we showed an example, where we loaded RNAseq data for “brain”, “heart” and “muscle” and combined them in one data frame. We need to write the same code as a for loop so that one can easily extend it for 30 or 300 samples.

We first rewrite yesterday’s code a bit differently, but performing essentially the same task.

df=read_tsv("brain/abundance.tsv")
df=brain %>% mutate(brain=tpm) %>% select(target_id,brain)

combined=df


df=read_tsv("heart/abundance.tsv")
df=heart %>% mutate(heart=tpm) %>% select(target_id,heart)

combined = combined %>% inner_join(df)


df=read_tsv("muscle/abundance.tsv")
df=muscle %>% mutate(muscle=tpm) %>% select(target_id,muscle)

combined = combined %>% inner_join(df)

We made two changes. First, we split the last line of combining all data into multiple steps. Second, we called the data frames “df” instead of “brain”, “heart” and “muscle”. This is made possible by the first change.

Use “eval” and “parse” to Implement a Block

Let us implement one of those blocks in the above code using “eval” and “parse”. In the first two lines, you notice that commands include the sample name “brain”. Instead we will create a string variable sample=”brain” and then rewrite the commands using the string variable. This is where “eval” and “parse” come in.

Each line of code in previous block will become three lines. In the first line, we assemble the parts of the command as a string vector. In the second line, we use “paste” to join them into a word and in the third line we run it with eval/parse.

sample="brain"

command_parts=c('df=read_tsv("', sample , '/abundance.tsv")')
command=paste(command_parts, sep='', collapse='')
eval(parse(text=command))

command_parts=c('df=df %>% mutate(', sample, '=tpm) %>% select(target_id,',  sample , ')')
command=paste(command_parts, sep='', collapse='')
eval(parse(text=command))

Complete Solution

If you understand the above step, rest of the code is easy.

names=c("brain", "heart", "muscle")

for (i in 1:3) {

        command_parts=c('df=read_tsv("', names[i] , '/abundance.tsv")')
        command=paste(command_parts, sep='', collapse='')
        eval(parse(text=command))

        command_parts=c('df=df %>% mutate(', names[i], '=tpm) %>% select(target_id,',  names[i] , ')')
        command=paste(command_parts, sep='', collapse='')
        eval(parse(text=command))

        if(i==1)
        {
                combined=df
        }
        else
        {
                combined=combined %>% inner_join(df)
        }
}

head(combined)

Apology Section

I generally apologize every time I introduce a new R command not included in the minimalist cheat-sheet, and here I need to apologize three times with once each for “eval”, “parse” and “paste”. I am ambivalent about “paste” and may add it to the cheat sheet. In general, my preference is to use functions from the stringr library instead of core R, and stringr equivalent of “paste” is “str_c”. The other two commands are for advanced classrooms, and therefore not included in the Minimalist Cheat Sheet. If you like to learn more about modifying the abststact syntax tree in R, check Advanced R by Wickham available online.



Written by M. //