As mentioned in the previous post, I have been working on an R library for RNAseq data analysis. The goal of this library is to provide clean, easy-to-remember functions for analysis. We also offer live online classes on R and RNAseq data analysis. For both efforts, it is helpful to discuss the related visualization functions in R.
Visualization of data is an important component of RNAseq analysis. Researchers use four types of functions for this purpose:
- Commonly used functions from the base or ggplot library for drawing histograms, boxplots, etc.
- Functions provided by RNAseq-related packages like edgeR, limma or DESeq2.
- Functions from specialized packages to create heatmaps, Venn diagrams and trees.
- Functions from other Bioconductor packages to visualize patterns on the genome or to draw sequence logo plots.
The last category above is part of derivative analysis after one completes the statistical analysis of RNAseq data, and we will not discuss it here. The first three are usually included in all RNAseq analysis workflows. Let us learn a bit more about them.
The R base library includes functions for drawing scatterplots, barplots, histograms, boxplots, pie charts and many other familiar chart types. These functions are fast, and I often use them to generate quick plots of common types. However, I try not to overuse them, and here is why.
Hadley Wickham, the author of ggplot, implemented the ideas of Leland Wilkinson to break the task of plotting into components following a kind of “grammar”. The abbreviation “gg” in ggplot stands for “grammar of graphics”. The main advantage of ggplot comes from its ability to be extended without adding entirely new plotting functions. If someone wants to create a new chart type, they can break its various aspects into ggplot grammar and then extend only the part that is truly new. That way a user can easily combine the new functionality with the entire power of ggplot.
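To make the "grammar" idea concrete, here is a minimal sketch using the ggplot2 package. The data are simulated and the column names (`sample`, `log2expr`) are made up for illustration; the point is only that the data, the aesthetic mapping and the labels are defined once, and the chart type changes by swapping a single component.

```r
library(ggplot2)

# Simulated log2 expression values for two samples (made-up data)
set.seed(1)
df <- data.frame(
  sample   = rep(c("A", "B"), each = 500),
  log2expr = c(rnorm(500, mean = 8, sd = 2), rnorm(500, mean = 9, sd = 2))
)

# Shared components: data, aesthetic mapping and axis labels
base_plot <- ggplot(df, aes(x = sample, y = log2expr)) +
  labs(x = "Sample", y = "log2 expression")

# Only the geometry differs between the two charts
p_box    <- base_plot + geom_boxplot()
p_violin <- base_plot + geom_violin() + theme_minimal()

print(p_box)
print(p_violin)
```

Everything in `base_plot` is reused unchanged by both charts, which is the property that makes extending ggplot cheaper than writing a new plotting function from scratch.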
That brings us to the second category: charting functions implemented by various RNAseq packages like edgeR, limma and DESeq2. I identified thirteen that are mentioned in online tutorials and therefore get used by the community.
- meanSdPlot (vsn library)
These functions have two defects: (i) they combine mathematical manipulation and plotting in a single call, and (ii) they often take package-specific classes or objects as input parameters. Therefore, someone trying to make cosmetic changes to the plot needs to understand what those S4 classes are doing internally. In terms of plotting, almost all of these functions draw scatterplots, nothing fancy. In rnaseq.work, I am looking into simplifying the APIs for using those functions.
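As a rough illustration of what separating the two steps could look like, here is a sketch in base R for a mean versus standard deviation plot. The helper name `mean_sd_table` is hypothetical: it is not part of rnaseq.work or of any package mentioned above, and it does not reproduce what those packages compute internally. The count matrix is simulated.

```r
# Step 1: pure computation. Takes a plain numeric matrix (genes x samples)
# and returns a plain data frame, with no plotting and no S4 classes.
# NOTE: mean_sd_table is a hypothetical helper written for this example.
mean_sd_table <- function(mat) {
  data.frame(
    mean = rowMeans(mat),
    sd   = apply(mat, 1, sd)
  )
}

# Simulated counts: 100 genes x 6 samples (made-up data)
set.seed(1)
counts <- matrix(rpois(600, lambda = 50), nrow = 100)
stats_tbl <- mean_sd_table(log2(counts + 1))

# Step 2: plotting. An ordinary scatterplot, so cosmetic changes only
# require knowing the usual plot() arguments.
plot(rank(stats_tbl$mean), stats_tbl$sd,
     pch = 20, col = "grey40",
     xlab = "rank(mean)", ylab = "sd")
```

Because the intermediate result is an ordinary data frame, the same table can be passed to ggplot or any other plotting function without touching the computation.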
Finally, let me discuss the third category: Venn diagrams, heatmaps and hierarchical clustering trees. Although ggplot can draw all of them, the specialized libraries available for them are well developed. I am ambivalent about using ggplot versus going the specialized route.
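For reference, a clustering tree and a heatmap can also be drawn with functions that ship with R itself (`dist`, `hclust` and `heatmap` from the stats package). The sketch below uses a simulated expression matrix with made-up gene and sample names. It is only a baseline for comparison, not a recommendation over ggplot or the specialized packages.

```r
# Simulated expression matrix: 20 genes x 6 samples (made-up data)
set.seed(1)
expr <- matrix(rnorm(120), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("sample", 1:6)))

# Hierarchical clustering of samples: transpose so that samples are rows,
# compute Euclidean distances, then build the tree with average linkage
sample_tree <- hclust(dist(t(expr)), method = "average")
plot(sample_tree, main = "Sample clustering", xlab = "", sub = "")

# heatmap() clusters both rows and columns and draws the dendrograms
# alongside; scale = "row" centers and scales each gene across samples
heatmap(expr, scale = "row")
```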