In my experience, there is a large disconnect between bioinformaticians and biologists regarding RNAseq data analysis. In the current workflow, biologists send their samples to core sequencing labs, the core labs send the FASTQ sequence files to ‘expert bioinformaticians’, and the bioinformaticians then pass derived tables back to the biologists. Biologists publish, get money and the process starts once again.
To go into more detail on the bioinformatics step: for organisms with well-annotated genomes and genes, bioinformaticians map the reads to the genome or genes and quantify expression (RSEM, Kallisto, Salmon, etc.), perform statistical analysis and (maybe) run “biological analysis” tools.
The biologists do not start without prior information. Based on work by others and a general understanding of the experiment, they expect certain genes (or classes of genes) to show up, or not show up, in the screen. If those expectations are not confirmed, how does one debug the entire experiment and analysis?
In a series of blog posts, we will present some useful tips on this topic. You are expected to know the basics of R and its data-wrangling libraries (such as dplyr). We will primarily use R here, but you can easily translate the concepts into Python if you are familiar with the pandas, matplotlib and seaborn libraries.
The main theme of this presentation is to keep the R part minimalist, following this cheatsheet, so that biologists can reproduce most of the analysis without getting another degree in computer science.