Popular RNAseq packages often use the formula notation in R. For example, the DESeq package uses it in the design parameter, whereas edgeR creates its design matrix by expanding a formula with “model.matrix”. The formula syntax seems to confuse many users of these libraries.
As an example of confusion, check this Biostars thread. It ended in a mini-fight between Kirill Grigorev and Devon Ryan.
Grigorev commented - “I find it pretty astonishing (but also symptomatic of the entire R universe) that the official DESeq2 tutorials neither bother to explain the nomenclature by themselves, or even provide pointers as to where one can read up on it.’Analyzing RNA-seq data with DESeq2’ mentions ANOVA exactly once, and quite a distance away from where they introduce designs. And when they talk about designs, they just sort of head into it without any asides. I understand that formulae like these are pretty standard for R, but with packages like these, you likely get a lot of non-programmer crowd, Python crowd etc, who have never seen this and would not know what they need to google to find out…”,
whereas Ryan argued - “I’d be astonished if a specialized package like DESeq2 bothered to explain standard statistics nomenclature, it’s sort of assumed that if you’re going to be using a package for statistical analysis that you’ll have a basic background in statistics. I find that assumption entirely fair.”.
Irrespective of which side you agree with, there is no doubt you need to understand the formula syntax to use the RNAseq packages. So let us get started.
One of the top hits google gave me for “R formula” was a website from US National Parks (nps.gov). That surprised me, because park rangers were the last people I expected to chit-chat about R formula notations. It was even more puzzling, when the page turned out to be irrelevant. Why did google algorithm place an unrelated page among the top hits?
Among the relevant links -
The easiest reference to get started is this tutorial from datacamp. It is a bit long, but is definitely worth the time.
If you are looking for something quick and short, check this pdf tutorial from the university of Chicago. Especially pay attention to the table and equations on the second page.
Finally, in the context of RNAseq, this biostar discussion is possibly the most useful. Make sure you read the previous two links before, or otherwise you will get confused.
[Edit. 12/12] Mike Love, the first author of DESeq2, pointed out in twitter that the original syntax came from a 1973 paper by Wilkinson and Rogers titled “Symbolic Description of Factorial Models for Analysis of Variance”. He mentioned that the DESeq vignette had diagrams on interpreting interactions and many design examples in “?results”. Finally, their edX course on linear models is another good place to check.
We will take bits and pieces from the above links to write down a number of relevant points.
The simplest type of formula you will come across in the context of RNAseq is of the form (“~condition”), where “condition” is defined by another data frame. That data frame may label some samples as condition type “A” and others as condition type “B” (e.g. treatment and control). Subsequently, the design formula tells the RNAseq analysis program to compare between samples “A” versus “B”. This is useful, when you want to compare your samples in several different ways. Include multiple columns in your model data frame, and then run the statistical analysis by changing the design formula from one to another.
|In more interesting cases, you will see the symbols “+”, “-“, “*”, “:” and “||” in the formula. Those symbols have special meanings.|
We will start with the “+” symbol, which you are expected to often encounter in the context of linear modeling. A formula “~ X+Y” means there are two independent variables X and Y, and fit model by considering them both. In case of linear regression with “~ X+Y”, R is expected to fit a linear model of the form “a_0 + a_1 X + a_2 Y”. The parameters “a_0”, “a_1” and “a_2” are determined from the data.
Sometimes the formula includes a constant term. For example, the tutorial for Sleuth RNAseq package uses “~1”. Using only a constant term in the formula means find only the intercept. If you say “~ 0+X”, you are asking R to fit a linear regression that has no intercept.
Using the “-“ symbol is another way to remove something from the formula. For exmple, you can use “~ X-1” to tell R that there is no intercept, and it will be equivalent to “~ 0+X”. In both cases, R will fit the model “a_1 X”, whereas formula “~X” will make R fit the model “a_0 + a_1 X”.
The symbols “:”, “^” and “” are used to include higher order terms. For example, “~X+Y+X:Y” asks R to fit the linear model - “a_0 + a_1 X + a_2 Y + a_3 X Y”. The same relation can be written in multiple ways using the symbols “:”, “^” and “”. Those are convenient especially when you have multiple terms and multiple orders of interaction.
Finally, if you like to find out what linear model will be fit by a complicated formula, use the “model.matrix” and “model_matrix” functions. You can practice on the mtcars or iris data sets by using the examples given here.
In other news, we developed remotely taught online classes for researchers in biology without a lot of experience in R. Our upcoming January class covers RNAseq data analysis. We also have a free module on R to help the absolute beginners get started.