Rnaseq.work - Current APIs and Design Decisions


As mentioned in an earlier post, I have been working on a R library for RNAseq data analysis. The goal of this library is to provide clean, easy-to-remember functions for data analysis. In this post, I will describe the functional options chosen for the rna_visualize function for plotting of data. I will also discuss the design and coding challenges encountered during this implementation.

I am implementing the following options for the rna_visualize function. The “method” field describes the type of plot (e.g. hist, BCV, PCA, etc.) and “lib” field describes whether you want the plot created using the “base” or “ggplot” library. Additional fields are added to the function depending on the method. For example, in case of “hist” or histogram, you need to also mention a column. Similarly, for “MA” plots, you need to give two columns. The example for the ‘col’ field given here is based on the pasilla dataset from Bioconductor.

rna_visualize(data_table, method="hist", lib="base", col="treated1")
rna_visualize(data_table, method="hist", lib="ggplot", col="untreated1")

The full list of methods being implemented is shown below. Please note that implemention is not complete for all of them, but I am making rapid progress. You can try “hist” and “MA” to see how this package works.

rna_visualize(data_table, method="hist", col="treated1")
rna_visualize(data_table, method="MA", col=c("treated1", "untreated1"))
rna_visualize(data_table, method="counts", gene="gene1")
rna_visualize(data_table, method="PCA")
rna_visualize(data_table, method="MDS")
rna_visualize(data_table, method="BCV")
rna_visualize(data_table, method="smear")
rna_visualize(data_table, method="dispersions")
rna_visualize(data_table, method="sparsity")
rna_visualize(data_table, method="volcano")

The first design difficulty I experienced is as follows. If you have done RNAse analysis, you know that it starts with a count table and a design table, and then generates an expression table with p-values and fold changes for the genes. Some of the plot types use the original count table, whereas the others use the final expression table. To incorporate both, I named the first parameter “data_table”. It will be the raw count table or final expression table with fold changes depending on the method.

The second design difficulty is with the format of the count table, and whether to use the tidyverse convention or Bioconductor convention. In Hadley Wickham’s tidyverse style of data representation, a count table will have the genes in the first column and the counts in the remaining columns. Row-names are just integers. In Bioconductor style, genes are used as row-names. My preference is to use the tidyverse style, because that allows me to use several excellent packages (e.g. dplyr). However, if I do so, I need to keep switching between genes as row-names and the first column for compatibility reason. This is a major source of ambiguity. So, for the time being, I am using the Bioconductor convention.

The third challenge is to be able to use DESeq/DESeq2 and its S4 class. Both limma and edgeR store their primary data in lists. Lists are easier to handle than S4 classes.

The fourth challenge I am experiencing is in implementing the plotting functions for ggplot2. For the “base” library, I can leverage the functions provided by the libraries to generate plots. In case of the ggplot2 library, I have to write plotting code myself. How much of the existing code I can leverage to accomplish that is the question.

In the following post, I will present codes so that you can run the already implemented functions. That way, you can start using this package and possibly send me feedback on bugs or features.

In other news, our slots for the January class on RNAseq analysis are rapidly filling up. If you are planning to join, do sign up soon. These modules are designed for those, who do not have a lot of experience in R. We also have a free module on R to help the absolute beginners get started.


Written by M. //

Web Statistics