bioconductor v3.9.0 EDASeq
Numerical and graphical summaries of RNA-Seq read data.
Summary
Functions
Exploratory Data Analysis and Normalization for RNA-Seq data
Methods for Function MDPlot in Package EDASeq
"SeqExpressionSet" class for collections of short reads
Methods for Function barplot in Package EDASeq
Methods for Function betweenLaneNormalization in Package EDASeq
Methods for Function biasBoxplot in Package EDASeq
Methods for Function biasPlot in Package EDASeq
Methods for Function boxplot in Package EDASeq
Get gene length and GC-content
Methods for Function meanVarPlot in Package EDASeq
Function to create a new SeqExpressionSet object.
Methods for Function plotNtFrequency in Package EDASeq
Methods for Function plotPCA in Package EDASeq
Methods for Function plotQuality in Package EDASeq
Methods for Function plotRLE in Package EDASeq
Methods for Function plot in Package EDASeq
Methods for Function withinLaneNormalization in Package EDASeq
GC-content of S. Cerevisiae genes
Length of S. Cerevisiae genes
Functions
EDASeq_package()
Exploratory Data Analysis and Normalization for RNA-Seq data
Description
Numerical summaries and graphical representations of some key features of the data along with implementations of both within-lane normalization methods for GC content bias and between-lane normalization methods to adjust for sequencing depth and possibly other differences in distribution.
Details
The SeqExpressionSet class is used to store gene-level counts along with sample information. It extends the virtual class eSet. See the help page of the class for details.
"Read-level" information is managed via the FastqFileList and BamFileList classes of Rsamtools.
The most commonly used graphical tools for FastqFileList and BamFileList objects are 'barplot', 'plotQuality' and 'plotNtFrequency'. For SeqExpressionSet objects they are 'biasPlot', 'meanVarPlot' and 'MDPlot'.
To perform gene-level normalization, use the functions 'withinLaneNormalization' and 'betweenLaneNormalization'.
An 'as' method exists to coerce SeqExpressionSet objects to CountDataSet objects (DESeq package).
See the package vignette for a typical Exploratory Data Analysis example.
Author
Davide Risso and Sandrine Dudoit. Maintainer: Davide Risso risso.davide@gmail.com
References
J. H. Bullard, E. A. Purdom, K. D. Hansen and S. Dudoit (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics Vol. 11, Article 94.
D. Risso, K. Schwartz, G. Sherlock and S. Dudoit (2011). GC-Content Normalization for RNA-Seq Data. Technical Report No. 291, Division of Biostatistics, University of California, Berkeley, Berkeley, CA.
MDPlot_methods()
Methods for Function MDPlot in Package EDASeq
Description
MDPlot produces a mean-difference smooth scatterplot of two lanes in an experiment.
Usage
MDPlot(x,y,...)
Arguments
Argument | Description |
---|---|
x | Either a numeric matrix or a SeqExpressionSet object containing the gene expression. |
y | A numeric vector specifying the lanes to be compared. |
... | See par |
Details
The mean-difference (MD) plot is useful for visualizing the differences between two lanes of an experiment. From an MD plot one can see whether normalization is needed and whether a linear scaling is sufficient or a nonlinear normalization would be more effective.
MDPlot also plots a lowess fit (in red) that highlights any trend in the bias related to the mean expression.
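As a rough illustration of the quantities behind an MD plot, the following base R sketch computes the per-gene mean and difference of two log-transformed lanes and a lowess trend. It is not taken from the EDASeq source: the simulated counts, the pseudo-count of 1 and the variable names are all assumptions made for the example.

```r
# Toy counts for 200 genes in two lanes (simulated, not from yeastRNASeq).
set.seed(1)
lane1 <- rpois(200, lambda = 50)
lane2 <- rpois(200, lambda = 60)

# Log-transform with a pseudo-count of 1 (an assumption) to avoid log(0).
l1 <- log(lane1 + 1)
l2 <- log(lane2 + 1)

# Per-gene mean and difference of the two log counts.
md_mean <- (l1 + l2) / 2
md_diff <- l2 - l1

# Lowess fit of the difference on the mean. A flat fit away from zero
# suggests a global scaling, a curved fit a nonlinear normalization.
trend <- lowess(md_mean, md_diff)
```

Plotting `md_mean` against `md_diff` and overlaying `lines(trend, col = 2)` gives a picture comparable in spirit to the one described above.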
Examples
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)
sub <- intersect(rownames(geneLevelData), names(yeastGC))
mat <- as.matrix(geneLevelData[sub,])
data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))
MDPlot(data,c(1,3))
SeqExpressionSet_class()
"SeqExpressionSet" class for collections of short reads
Description
This class represents a collection of digital expression data (usually counts from RNA-Seq technology) along with sample information.
Seealso
eSet, newSeqExpressionSet, biasPlot, withinLaneNormalization, betweenLaneNormalization
Author
Davide Risso risso.davide@gmail.com
Examples
showMethods(class="SeqExpressionSet", where=getNamespace("EDASeq"))
counts <- matrix(data=0, nrow=100, ncol=4)
for(i in 1:4) {
    counts[,i] <- rpois(100,lambda=50)
}
cond <- c(rep("A", 2), rep("B", 2))
data <- newSeqExpressionSet(counts, phenoData=AnnotatedDataFrame(data.frame(conditions=cond)))
head(counts(data))
# Convert the condition labels to a factor first so that they map to integer color codes
boxplot(data, col=as.numeric(factor(pData(data)[,1]))+1)
barplot_methods()
Methods for Function barplot in Package EDASeq
Description
High-level functions to produce barplots of some complex objects.
betweenLaneNormalization_methods()
Methods for Function betweenLaneNormalization in Package EDASeq
Description
Between-lane normalization for sequencing depth and possibly other distributional differences between lanes.
Usage
betweenLaneNormalization(x, which=c("median","upper","full"), offset=FALSE, round=TRUE)
Arguments
Argument | Description |
---|---|
x | A numeric matrix representing the counts or a SeqExpressionSet object. |
which | Method used to normalize. See the Details section and the references below. |
offset | Should the normalized value be returned as an offset leaving the original counts unchanged? |
round | If TRUE the normalization returns rounded values (pseudo-counts). Ignored if offset=TRUE. |
Details
This method implements three normalizations described in Bullard et al. (2010). The methods are:
- median: a scaling normalization that forces the median of each lane to be the same.
- upper: the same but with the upper quartile.
- full: a non linear full quantile normalization, in the spirit of the one used in microarrays.
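The ideas behind the scaling and full-quantile methods can be sketched in a few lines of base R. This is an illustrative approximation under stated assumptions (simulated counts, rescaling by the mean of the lane quartiles, ties broken by order of appearance), not the code used by betweenLaneNormalization.

```r
# Toy counts: 500 genes in three lanes with different depths (simulated).
set.seed(1)
counts <- cbind(rpois(500, 40), rpois(500, 80), rpois(500, 60))

# Upper-quartile scaling: divide each lane by its 75th percentile, then
# multiply by the mean of those percentiles to stay on a count-like scale.
uq <- apply(counts, 2, quantile, probs = 0.75)
scaled <- sweep(counts, 2, uq, "/") * mean(uq)

# Full-quantile normalization: replace the k-th smallest value of every
# lane with the mean of the k-th smallest values across lanes.
sorted <- apply(counts, 2, sort)
ref <- rowMeans(sorted)
full <- apply(counts, 2, function(lane) ref[rank(lane, ties.method = "first")])
```

After the scaling step every lane shares the same upper quartile, and after the full-quantile step every lane has exactly the same set of values, only assigned to different genes.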
Author
Davide Risso.
References
J. H. Bullard, E. A. Purdom, K. D. Hansen and S. Dudoit (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics Vol. 11, Article 94.
D. Risso, K. Schwartz, G. Sherlock and S. Dudoit (2011). GC-Content Normalization for RNA-Seq Data. Manuscript in Preparation.
Examples
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)
sub <- intersect(rownames(geneLevelData), names(yeastGC))
mat <- as.matrix(geneLevelData[sub, ])
data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))
norm <- betweenLaneNormalization(data, which="full", offset=FALSE)
biasBoxplot_methods()
Methods for Function biasBoxplot in Package EDASeq
Description
biasBoxplot produces a boxplot representing the distribution of a quantity of interest (e.g. gene counts, log-fold-changes, ...) stratified by a covariate (e.g. gene length, GC-content, ...).
Usage
biasBoxplot(x,y,num.bins,...)
Arguments
Argument | Description |
---|---|
x | A numeric vector with the quantity of interest (e.g. gene counts, log-fold-changes, ...) |
y | A numeric vector with the covariate of interest (e.g. gene length, GC-content, ...) |
num.bins | A numeric value specifying the number of bins in which to stratify y . Defaults to 10. |
... | See par |
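The stratification step can be pictured with the base R sketch below, which splits a quantity of interest into equal-sized bins of a covariate. The use of quantile-based breaks is an assumption about how the bins are formed, and the simulated `gc` and `lfc` vectors are hypothetical stand-ins for real data.

```r
# Hypothetical covariate (GC-content) and quantity of interest
# (log-fold-change with a mild dependence on GC-content).
set.seed(1)
gc <- runif(300, 0.3, 0.6)
lfc <- rnorm(300, mean = 2 * (gc - 0.45))

# Stratify the covariate into num.bins bins with quantile-based breaks,
# so that each bin holds roughly the same number of genes.
num.bins <- 10
breaks <- quantile(gc, probs = seq(0, 1, length.out = num.bins + 1))
bins <- cut(gc, breaks = breaks, include.lowest = TRUE)

# One vector of values per bin, plus a per-bin summary.
by_bin <- split(lfc, bins)
bin_medians <- sapply(by_bin, median)
```

Calling `boxplot(by_bin, las = 2)` on the result draws one box per stratum, which is the kind of display biasBoxplot is described as producing.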
Examples
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)
sub <- intersect(rownames(geneLevelData), names(yeastGC))
mat <- as.matrix(geneLevelData[sub,])
data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))
lfc <- log(geneLevelData[sub, 3] + 1) - log(geneLevelData[sub, 1] + 1)
biasBoxplot(lfc, yeastGC[sub], las=2, cex.axis=.7)
biasPlot_methods()
Methods for Function biasPlot in Package EDASeq
Description
biasPlot produces a plot of the lowess regression of the counts on a covariate of interest, typically the GC-content or the length of the genes.
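A minimal base R sketch of such a lowess regression is shown below. It assumes simulated counts whose mean increases with GC-content and a pseudo-count of 1 before taking logs; neither choice is taken from the package.

```r
# Simulated lane: 400 genes whose expected count grows with GC-content.
set.seed(1)
gc <- runif(400, 0.3, 0.6)
counts <- rpois(400, lambda = exp(3 + 4 * (gc - 0.45)))

# Lowess regression of the log counts on the covariate.
# The pseudo-count of 1 is an assumption to avoid log(0).
fit <- lowess(gc, log(counts + 1))
```

`fit$x` holds the sorted covariate values and `fit$y` the smoothed log counts, so `plot(fit, type = "l")` shows the bias curve for this lane.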
Examples
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)
sub <- intersect(rownames(geneLevelData), names(yeastGC))
mat <- as.matrix(geneLevelData[sub,])
data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))
biasPlot(data,"gc",ylim=c(0,5),log=TRUE)
boxplot_methods()
Methods for Function boxplot in Package EDASeq
Description
High-level functions to produce boxplots of some complex objects.
getGeneLengthAndGCContent()
Get gene length and GC-content
Description
Automatically retrieves gene length and GC-content information from Biomart or org.db packages.
Usage
getGeneLengthAndGCContent(id, org, mode=c("biomart", "org.db"))
Arguments
Argument | Description |
---|---|
id | Character vector of one or more ENSEMBL or ENTREZ gene IDs. |
org | Organism three letter code, e.g. 'hsa' for 'Homo sapiens'. See also: http://www.genome.jp/kegg/catalog/org_list.html; In org.db mode, this can be also a specific genome assembly, e.g. 'hg38' or 'sacCer3'. |
mode | Mode to retrieve the information. Defaults to 'biomart'. See Details. |
Details
The 'biomart' mode is based on functionality from the biomaRt package and retrieves the required information from the BioMart database. This is available for all ENSEMBL organisms and is typically most current, but can be time-consuming when querying several thousand genes at a time.
The 'org.db' mode uses organism-based annotation packages from Bioconductor. This is much faster than the 'biomart' mode, but is only available for selected model organisms currently supported by BioC annotation functionality.
Results for the same gene ID(s) can differ between both modes as they are based on different sources for the underlying genome assembly. While the 'biomart' mode uses the latest ENSEMBL version, the 'org.db' mode uses BioC annotation packages typically built from UCSC.
Value
A numeric matrix with two columns: gene length and GC-content.
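To make the shape of the returned matrix concrete, the sketch below computes length and GC-content for sequences that are already available as character strings. The helper `gc_and_length` and the two toy sequences are hypothetical; the real function obtains the sequences from BioMart or Bioconductor annotation packages, which this example does not attempt to reproduce.

```r
# Hypothetical helper: length and GC fraction of DNA sequences given as strings.
gc_and_length <- function(seqs) {
  seqs <- toupper(seqs)
  len <- nchar(seqs)
  # Count G and C by deleting every other character.
  n_gc <- nchar(gsub("[^GC]", "", seqs))
  cbind(length = len, gc = n_gc / len)
}

# Two made-up sequences, named by made-up gene identifiers.
res <- gc_and_length(c(geneA = "ATGCGC", geneB = "ATAT"))
```

Here `res` is a numeric matrix with one row per gene and the two columns `length` and `gc`.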
Seealso
getSequence to retrieve a genomic sequence from BioMart, genes to extract genomic coordinates from a TxDb object, getSeq to extract genomic sequences from a BSgenome object, alphabetFrequency to calculate nucleotide frequencies.
Author
Ludwig Geistlinger Ludwig.Geistlinger@bio.ifi.lmu.de
Examples
getGeneLengthAndGCContent("ENSG00000012048", "hsa")
meanVarPlot_methods()
Methods for Function meanVarPlot in Package EDASeq
Description
meanVarPlot produces a smoothScatter plot of the mean variance relation.
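The mean-variance relation itself can be computed with base R as in the sketch below. The simulated Poisson counts and the choice to work on the log scale after dropping genes with zero mean or variance are assumptions of the example, not a description of what meanVarPlot does internally.

```r
# Simulated counts: 300 genes, 4 replicate lanes, gene-specific Poisson means.
set.seed(1)
lambda <- rexp(300, rate = 1 / 50) + 1
counts <- matrix(rpois(300 * 4, lambda = rep(lambda, 4)), nrow = 300)

# Per-gene mean and variance across lanes.
gene_mean <- rowMeans(counts)
gene_var <- apply(counts, 1, var)

# Keep genes where both are positive so the logs are finite.
keep <- gene_mean > 0 & gene_var > 0
log_mean <- log(gene_mean[keep])
log_var <- log(gene_var[keep])
```

For Poisson-like data the points `(log_mean, log_var)` scatter around the identity line; points lying well above it indicate over-dispersion.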
newSeqExpressionSet()
Function to create a new SeqExpressionSet object.
Description
User-level function to create new objects of the class SeqExpressionSet .
Usage
newSeqExpressionSet(counts,
    normalizedCounts = matrix(data=NA, nrow=nrow(counts), ncol=ncol(counts), dimnames=dimnames(counts)),
    offset = matrix(data=0, nrow=nrow(counts), ncol=ncol(counts), dimnames=dimnames(counts)),
    phenoData = annotatedDataFrameFrom(counts, FALSE),
    featureData = annotatedDataFrameFrom(counts, TRUE),
    ...)
Arguments
Argument | Description |
---|---|
counts | A matrix containing the counts for an RNA-Seq experiment. One column for each lane and one row for each gene. |
normalizedCounts | A matrix with the same dimensions as counts containing the normalized counts. |
offset | A matrix with the same dimensions as counts defining the offset (usually useful for normalization purposes). See the package vignette for a discussion of the offset. |
phenoData | A data.frame or AnnotatedDataFrame with sample information, such as biological condition, library preparation protocol, flow-cell,... |
featureData | A data.frame or AnnotatedDataFrame with feature information, such as gene length, GC-content, ... |
... | Other arguments will be passed to the constructor inherited from eSet . |
Value
An object of class SeqExpressionSet .
Seealso
SeqExpressionSet
Author
Davide Risso
Examples
counts <- matrix(data=0, nrow=100, ncol=4)
for(i in 1:4) {
    counts[, i] <- rpois(100, lambda=50)
}
cond <- c(rep("A", 2), rep("B", 2))
counts <- newSeqExpressionSet(counts, phenoData=data.frame(conditions=cond))
plotNtFrequency_methods()
Methods for Function plotNtFrequency in Package EDASeq
Description
Plots the nucleotide frequencies per position.
plotPCA_methods()
Methods for Function plotPCA in Package EDASeq
Description
plotPCA produces a Principal Component Analysis (PCA) plot of the counts in object.
Usage
## S4 method for signature 'matrix'
plotPCA(object, k=2, labels=TRUE, isLog=FALSE, ...)
## S4 method for signature 'SeqExpressionSet'
plotPCA(object, k=2, labels=TRUE, ...)
Arguments
Argument | Description |
---|---|
object | Either a numeric matrix or a SeqExpressionSet object containing the gene expression. |
k | The number of principal components to be plotted. |
labels | Logical. If TRUE , and k=2 , it plots the colnames of object as point labels. |
isLog | Logical. Set to TRUE if the data are already on the log scale. |
... | See par |
Details
The Principal Component Analysis (PCA) plot is a useful diagnostic plot to highlight differences in the distribution of replicate samples, by projecting the samples into a lower dimensional space.
If there is strong differential expression between two classes, one expects the samples to cluster by class in the first few Principal Components (PCs) (usually 2 or 3 components are enough). This plot also highlights possible batch effects and/or outlying samples.
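The clustering-by-class behaviour can be reproduced on simulated data with base R's `prcomp`, as sketched below. The simulated counts, the 4-fold effect on 50 genes, the pseudo-count of 1 and the use of `prcomp` are assumptions chosen for illustration; plotPCA may compute the components differently.

```r
# Simulated counts: 200 genes, two "mut" and two "wt" lanes.
# The first 50 genes are 4-fold higher in the "wt" lanes.
set.seed(1)
n_genes <- 200
base <- rexp(n_genes, rate = 1 / 100) + 20
effect <- c(rep(4, 50), rep(1, n_genes - 50))
counts <- cbind(rpois(n_genes, base), rpois(n_genes, base),
                rpois(n_genes, base * effect), rpois(n_genes, base * effect))
colnames(counts) <- c("mut_1", "mut_2", "wt_1", "wt_2")

# PCA of the samples: samples in rows, log counts (pseudo-count 1) as variables.
pca <- prcomp(t(log(counts + 1)), center = TRUE, scale. = FALSE)

# Coordinates of each sample on the first two principal components.
scores <- pca$x[, 1:2]
```

With this amount of differential expression the two classes fall on opposite sides of the first component, which is the pattern described above.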
Examples
library(yeastRNASeq)
data(geneLevelData)
mat <- as.matrix(geneLevelData)
data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))))
plotPCA(data, col=rep(1:2, each=2))
plotQuality_methods()
Methods for Function plotQuality in Package EDASeq
Description
plotQuality produces a plot of the quality of the reads.
plotRLE_methods()
Methods for Function plotRLE in Package EDASeq
Description
plotRLE produces a Relative Log Expression (RLE) plot of the counts in x.
Usage
plotRLE(x, ...)
Arguments
Argument | Description |
---|---|
x | Either a numeric matrix or a SeqExpressionSet object containing the gene expression. |
... | See par |
Details
The Relative Log Expression (RLE) plot is a useful diagnostic plot to visualize the differences between the distributions of read counts across samples.
It shows the boxplots of the log-ratios of the gene-level read counts of each sample to those of a reference sample (defined as the median across the samples). Ideally, the distributions should be centered around the zero line and as tight as possible. Clear deviations indicate the need for normalization and/or the presence of outlying samples.
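The log-ratios behind an RLE plot can be computed directly, as in this base R sketch. The simulated lanes (one sequenced at twice the depth) and the pseudo-count of 1 are assumptions of the example rather than details taken from plotRLE.

```r
# Simulated counts: 300 genes, 4 lanes; lane 3 has twice the depth.
set.seed(1)
counts <- cbind(rpois(300, 50), rpois(300, 50), rpois(300, 100), rpois(300, 50))

# Log counts with a pseudo-count of 1 (an assumption) to avoid log(0).
log_counts <- log(counts + 1)

# Reference sample: per-gene median of the log counts across lanes.
gene_median <- apply(log_counts, 1, median)

# Relative log expression: deviation of each lane from the reference.
# The vector is recycled down the columns, so row i loses gene_median[i].
rle_mat <- log_counts - gene_median

# Per-lane medians of the log-ratios; values far from zero flag a lane.
sample_medians <- apply(rle_mat, 2, median)
```

`boxplot(rle_mat)` then shows one box per lane; in this toy data the third box sits clearly above the zero line, signalling that normalization is needed.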
Examples
library(yeastRNASeq)
data(geneLevelData)
mat <- as.matrix(geneLevelData)
data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))))
plotRLE(data, col=rep(2:3, each=2))
plot_methods()
Methods for Function plot in Package EDASeq
Description
High-level function to produce plots given one BamFileList object and one FastqFileList object.
withinLaneNormalization_methods()
Methods for Function withinLaneNormalization in Package EDASeq
Description
Within-lane normalization for GC-content (or other lane-specific) bias.
Usage
withinLaneNormalization(x, y, which=c("loess","median","upper","full"), offset=FALSE, num.bins=10, round=TRUE)
Arguments
Argument | Description |
---|---|
x | A numeric matrix representing the counts or a SeqExpressionSet object. |
y | A numeric vector representing the covariate to normalize for (if x is a matrix) or a character vector with the name of the covariate (if x is a SeqExpressionSet object). Usually it is the GC-content. |
which | Method used to normalize. See the Details section and the reference below. |
offset | Should the normalized value be returned as an offset leaving the original counts unchanged? |
num.bins | The number of bins used to stratify the covariate for median , upper and full methods. Ignored if loess . See the reference for a discussion on the number of bins. |
round | If TRUE the normalization returns rounded values (pseudo-counts). Ignored if offset=TRUE. |
Details
This method implements four normalizations described in Risso et al. (2011).
The loess normalization transforms the data by regressing the counts on y and subtracting the loess fit from the counts to remove the dependence.
The median, upper and full normalizations are based on the stratification of the genes based on y. Once the genes are stratified in num.bins strata, the methods work as follows.
- median: scales the data to have the same median in each bin.
- upper: the same but with the upper quartile.
- full: forces the distribution of each stratum to be the same using a non linear full quantile normalization, in the spirit of the one used in microarrays.
Author
Davide Risso.
References
D. Risso, K. Schwartz, G. Sherlock and S. Dudoit (2011). GC-Content Normalization for RNA-Seq Data. Manuscript in Preparation.
Examples
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)
sub <- intersect(rownames(geneLevelData), names(yeastGC))
mat <- as.matrix(geneLevelData[sub, ])
data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))
norm <- withinLaneNormalization(data, "gc", which="full", offset=FALSE)
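For intuition about the loess and median methods described in the Details of withinLaneNormalization, the base R sketch below applies both ideas to a single simulated lane. It is a simplified approximation under stated assumptions (simulated data, log scale with a pseudo-count of 1 for the loess step, quantile-based bins, rescaling to the mean of the bin medians) and should not be read as the package's implementation.

```r
# Simulated lane: 1000 genes whose expected count depends on GC-content.
set.seed(1)
gc <- runif(1000, 0.3, 0.6)
lane <- rpois(1000, lambda = exp(4 + 3 * (gc - 0.45)))

# loess idea: regress the log counts on the covariate and keep the
# residuals, re-centered at the overall median of the log counts.
log_lane <- log(lane + 1)
fit <- loess(log_lane ~ gc)
loess_corrected <- residuals(fit) + median(log_lane)

# median idea: stratify genes into num.bins bins of the covariate and
# rescale each bin so that all bins share the same median.
num.bins <- 10
breaks <- quantile(gc, probs = seq(0, 1, length.out = num.bins + 1))
bins <- cut(gc, breaks = breaks, include.lowest = TRUE)
bin_median <- as.numeric(tapply(lane, bins, median))
target <- mean(bin_median)
median_corrected <- lane * target / bin_median[as.integer(bins)]

# Bin medians after the correction (all equal to target).
after <- as.numeric(tapply(median_corrected, bins, median))
```

In both cases the dependence of the counts on `gc` is largely removed; the upper and full variants follow the same binning scheme with a different per-bin summary.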
yeastGC()
GC-content of S. Cerevisiae genes
Description
This data set gives the GC-content (proportion of G and C) of the genes of S. Cerevisiae , from SGD release 64 annotation.
Format
A vector containing 6717 observations.
Usage
yeastGC
yeastLength()
Length of S. Cerevisiae genes
Description
This data set gives the length (in base pairs) of the genes of S. Cerevisiae , from SGD release 64 annotation.
Format
A vector containing 6717 observations.
Usage
yeastLength