Over the last year, I have been meeting many biologists and training them on NGS RNAseq data analysis. Sometimes they even bring their own research data to the class, learn to analyze it and see the results.
My primary assumption is that the students do not come with a computer science background. Hence I try to minimize the number of new coding concepts they need to learn to analyze their NGS data. On Friday, I posted “A Minimalist R Cheatsheet for NGS Biology”. Those are the core functions I try to stick to unless it is necessary to expand beyond them.
On the other hand, I do not compromise on explaining the statistical concepts. The bioinformaticians and biologists are generally led to consider the analysis programs as “kits”, “recipes” or “pipelines”. That is ok if one chooses to be trained as a technician, but the scientists may like to go a bit deeper.
During the course of the year, I came across many technical questions from the community and would like to answer them here in a series of blog posts. If you have a question, feel free to email me at email@example.com. Some of these are strictly related to data analysis (for example, check “Tryst Between Marilyn Monroe and Albert Einstein”), whereas the others are conceptual.
Today’s question - How to Load Data in R after a Kallisto Analysis?
This question may appear too simple, but there is a twist. In fact, yesterday I was working back and forth with an expert member from Tunisia to sort out the latter part. Somehow the solution was working on my computer, but not on hers. We figured out what was different.
Let me solve the simple problem first. Suppose you have three samples (e.g. “brain”, “heart” and “muscle” or “root”, “shoot” and “leaf”) with Kallisto output stored in folders with the same names. How will you load them into a data frame in R?
brain=read_tsv("brain/abundance.tsv")
brain=brain %>% mutate(brain=tpm) %>% select(target_id,brain)
heart=read_tsv("heart/abundance.tsv")
heart=heart %>% mutate(heart=tpm) %>% select(target_id,heart)
muscle=read_tsv("muscle/abundance.tsv")
muscle=muscle %>% mutate(muscle=tpm) %>% select(target_id,muscle)
combined=brain %>% inner_join(heart) %>% inner_join(muscle)
In the first two lines, we are loading the “brain” sample into a data frame named “brain”. The output table in Kallisto has five columns - “target_id”, “length”, “eff_length”, “est_counts” and “tpm”. We are keeping only the “target_id” and “tpm” columns and moreover renaming the “tpm” column to “brain”. The read_tsv function comes from readr, another tidyverse package, so the code assumes the tidyverse is already loaded with library(tidyverse). If you are familiar with the tidyverse package dplyr (which we always teach in our online class), these commands should be straightforward.
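As a minimal sketch of what those two lines do, the snippet below writes a tiny made-up abundance table to a temporary file and applies the same read, mutate and select steps. The transcript IDs and numbers are invented for illustration and are not real Kallisto output; only the five column names follow the Kallisto format.

```r
library(readr)
library(dplyr)

# Hypothetical stand-in for brain/abundance.tsv: the same five columns as
# Kallisto's output, but the transcript IDs and values are made up.
toy_file = tempfile(fileext = ".tsv")
writeLines(c(
  "target_id\tlength\teff_length\test_counts\ttpm",
  "tx1\t1000\t850\t120\t35.5",
  "tx2\t2000\t1850\t40\t5.4"
), toy_file)

# Same two steps as in the post: read the table, copy tpm into a column
# named after the sample, then keep only the ID and that column.
brain = read_tsv(toy_file)
brain = brain %>% mutate(brain = tpm) %>% select(target_id, brain)

print(brain)
```

The result has one row per transcript and just two columns, “target_id” and “brain”, which is the shape the later join relies on.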
We load the “heart” and “muscle” samples in the same way into data frames named “heart” and “muscle”. Finally, we use the inner_join function from dplyr to join all the data into a combined data frame called “combined”. Since “target_id” is the only column the three tables share, inner_join matches the rows on it.
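To see what the join produces, here is a small sketch with made-up values standing in for the three per-sample tables (the IDs and numbers are hypothetical). The explicit by = "target_id" is optional; without it, dplyr joins on the shared column and prints a message saying so.

```r
library(dplyr)
library(tibble)

# Hypothetical per-sample tables, shaped like the ones built in the code above
brain  = tibble(target_id = c("tx1", "tx2", "tx3"), brain  = c(35.5, 5.4, 0))
heart  = tibble(target_id = c("tx1", "tx2", "tx3"), heart  = c(12.1, 8.0, 2.2))
muscle = tibble(target_id = c("tx1", "tx2", "tx3"), muscle = c(0.5, 9.9, 4.1))

# Join on the transcript ID so each sample becomes one column
combined = brain %>%
  inner_join(heart, by = "target_id") %>%
  inner_join(muscle, by = "target_id")

print(combined)
```

Assuming all samples were quantified against the same Kallisto index, every table carries the same set of target_id values, so the joined table keeps one row per transcript and gains one column per sample.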
The answer for the plant-related samples will look like this -
root=read_tsv("root/abundance.tsv")
root=root %>% mutate(root=tpm) %>% select(target_id,root)
shoot=read_tsv("shoot/abundance.tsv")
shoot=shoot %>% mutate(shoot=tpm) %>% select(target_id,shoot)
leaf=read_tsv("leaf/abundance.tsv")
leaf=leaf %>% mutate(leaf=tpm) %>% select(target_id,leaf)
combined=root %>% inner_join(shoot) %>% inner_join(leaf)
Now here is the twist. What will you do if you have 30 or, even worse, 300 samples? You can type all of those 300 names one after another, but there are two easier ways. Also, one of those ways will teach you something new and useful about R. We cover that in tomorrow’s post.