bioconductor v3.9.0 EDASeq

Numerical and graphical summaries of RNA-Seq read data.

Summary

Functions

Exploratory Data Analysis and Normalization for RNA-Seq data

Methods for Function MDPlot in Package EDASeq

"SeqExpressionSet" class for collections of short reads

Methods for Function barplot in Package EDASeq

Methods for Function betweenLaneNormalization in Package EDASeq

Methods for Function biasBoxplot in Package EDASeq

Methods for Function biasPlot in Package EDASeq

Methods for Function boxplot in Package EDASeq

Get gene length and GC-content

Methods for Function meanVarPlot in Package EDASeq

Function to create a new SeqExpressionSet object.

Methods for Function plotNtFrequency in Package EDASeq

Methods for Function plotPCA in Package EDASeq

Methods for Function plotQuality in Package EDASeq

Methods for Function plotRLE in Package EDASeq

Methods for Function plot in Package EDASeq

Methods for Function withinLaneNormalization in Package EDASeq

GC-content of S. cerevisiae genes

Length of S. cerevisiae genes

Functions

EDASeq_package()

Exploratory Data Analysis and Normalization for RNA-Seq data

Description

Numerical summaries and graphical representations of some key features of the data along with implementations of both within-lane normalization methods for GC content bias and between-lane normalization methods to adjust for sequencing depth and possibly other differences in distribution.

Details

The SeqExpressionSet class is used to store gene-level counts along with sample information. It extends the virtual class eSet. See the help page of the class for details.

"Read-level" information is managed via the FastqFileList and BamFileList classes of Rsamtools.

The most commonly used graphical tools for FastqFileList and BamFileList objects are 'barplot', 'plotQuality' and 'plotNtFrequency'. For SeqExpressionSet objects they are 'biasPlot', 'meanVarPlot' and 'MDPlot'.

To perform gene-level normalization use the functions 'withinLaneNormalization' and 'betweenLaneNormalization'.

An 'as' method exists to coerce SeqExpressionSet objects to CountDataSet objects (DESeq package).

See the package vignette for a typical Exploratory Data Analysis example.

Author

Davide Risso and Sandrine Dudoit. Maintainer: Davide Risso risso.davide@gmail.com

References

J. H. Bullard, E. A. Purdom, K. D. Hansen and S. Dudoit (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics Vol. 11, Article 94.

D. Risso, K. Schwartz, G. Sherlock and S. Dudoit (2011). GC-Content Normalization for RNA-Seq Data. Technical Report No. 291, Division of Biostatistics, University of California, Berkeley, Berkeley, CA.

MDPlot_methods()

Methods for Function MDPlot in Package EDASeq

Description

MDPlot produces a mean-difference smooth scatterplot of two lanes in an experiment.

Usage

MDPlot(x,y,...)

Arguments

Argument    Description
x           Either a numeric matrix or a SeqExpressionSet object containing the gene expression.
y           A numeric vector specifying the lanes to be compared.
...         See par.

Details

The mean-difference (MD) plot is useful for visualizing the differences between two lanes of an experiment. From an MD plot one can see whether normalization is needed and whether a linear scaling is sufficient or a nonlinear normalization would be more effective.

MDPlot also draws a lowess fit (in red) that highlights a possible trend in the bias related to the mean expression.
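
As a rough illustration of what an MD plot displays, the sketch below computes the two axes by hand in base R on simulated counts. It is not the EDASeq implementation: the simulated data, the log(count + 1) transform and the names lane1, lane2, m and d are assumptions made for this example.

```r
# Simulated counts for two lanes (illustrative data, not from the package).
set.seed(1)
lambda <- rexp(500, rate = 1 / 50)
lane1 <- rpois(500, lambda)
lane2 <- rpois(500, 1.5 * lambda)   # second lane sequenced more deeply

# Mean and difference of the log counts; the +1 avoids log(0).
m <- (log(lane1 + 1) + log(lane2 + 1)) / 2
d <- log(lane2 + 1) - log(lane1 + 1)

# Scatterplot with a lowess fit in red and a reference line at zero.
plot(m, d, pch = 20, col = "grey60", xlab = "mean", ylab = "difference")
abline(h = 0, lty = 2)
lines(lowess(m, d), col = "red", lwd = 2)
```

A lowess curve that sits away from zero but stays flat suggests that a global scaling may be enough; a curve that bends with the mean suggests a nonlinear normalization.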

Examples

library(EDASeq)
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

# Keep only the genes that have GC-content information
sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub,])

data <- newSeqExpressionSet(mat,
                            phenoData=AnnotatedDataFrame(
                              data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                                         row.names=colnames(geneLevelData))),
                            featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

# Compare lanes 1 and 3
MDPlot(data,c(1,3))

SeqExpressionSet_class()

"SeqExpressionSet" class for collections of short reads

Description

This class represents a collection of digital expression data (usually counts from RNA-Seq technology) along with sample information.

Seealso

eSet, newSeqExpressionSet, biasPlot, withinLaneNormalization, betweenLaneNormalization

Author

Davide Risso risso.davide@gmail.com

Examples

library(EDASeq)

showMethods(class="SeqExpressionSet", where=getNamespace("EDASeq"))

counts <- matrix(data=0, nrow=100, ncol=4)
for(i in 1:4) {
  counts[,i] <- rpois(100,lambda=50)
}
cond <- c(rep("A", 2), rep("B", 2))

data <- newSeqExpressionSet(counts, phenoData=AnnotatedDataFrame(data.frame(conditions=cond)))

head(counts(data))
# factor() keeps the colour coding valid when conditions is stored as character
boxplot(data, col=as.numeric(factor(pData(data)[,1]))+1)

barplot_methods()

Methods for Function barplot in Package EDASeq

Description

High-level functions to produce barplots of some complex objects.

betweenLaneNormalization_methods()

Methods for Function betweenLaneNormalization in Package EDASeq

Description

Between-lane normalization for sequencing depth and possibly other distributional differences between lanes.

Usage

betweenLaneNormalization(x, which=c("median","upper","full"), offset=FALSE, round=TRUE)

Arguments

Argument    Description
x           A numeric matrix representing the counts or a SeqExpressionSet object.
which       Method used to normalize. See the details section and the reference below for details.
offset      Should the normalized value be returned as an offset leaving the original counts unchanged?
round       If TRUE the normalization returns rounded values (pseudo-counts). Ignored if offset=TRUE.

Details

This method implements three normalizations described in Bullard et al. (2010). The methods are:

median: a scaling normalization that forces the median of each lane to be the same.
upper: the same but with the upper quartile.
full: a non linear full quantile normalization, in the spirit of the one used in microarrays.
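
The scaling and full quantile ideas can be sketched in base R on a simulated count matrix. This is a minimal sketch and not the package code: the simulated depths, the choice of the mean upper quartile as the common target, and the handling of ties are assumptions made here.

```r
# Simulated gene-by-lane count matrix (illustrative only).
set.seed(2)
depth <- c(1, 2, 0.5, 1.5)                      # hypothetical relative depths
counts <- sapply(depth, function(s) rpois(1000, lambda = 40 * s))

# "upper"-style scaling: rescale each lane so that its upper quartile
# matches the mean upper quartile across lanes.
uq <- apply(counts, 2, quantile, probs = 0.75)
scaled <- sweep(counts, 2, mean(uq) / uq, "*")

# "full"-style quantile normalization: replace each lane's sorted values
# with the mean of the sorted values across lanes.
ranks <- apply(counts, 2, rank, ties.method = "first")
target <- rowMeans(apply(counts, 2, sort))
full <- apply(ranks, 2, function(r) target[r])
```

After the scaling step every lane shares one upper quartile, while the other quantiles may still differ. After the full quantile step the lanes share the whole empirical distribution.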

Author

Davide Risso.

References

J. H. Bullard, E. A. Purdom, K. D. Hansen and S. Dudoit (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics Vol. 11, Article 94.

D. Risso, K. Schwartz, G. Sherlock and S. Dudoit (2011). GC-Content Normalization for RNA-Seq Data. Manuscript in Preparation.

Examples

library(EDASeq)
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

# Keep only the genes that have GC-content information
sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub, ])

data <- newSeqExpressionSet(mat,
                            phenoData=AnnotatedDataFrame(
                              data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                                         row.names=colnames(geneLevelData))),
                            featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

# Full quantile normalization between lanes, returned as normalized counts
norm <- betweenLaneNormalization(data, which="full", offset=FALSE)

biasBoxplot_methods()

Methods for Function biasBoxplot in Package EDASeq

Description

biasBoxplot produces a boxplot representing the distribution of a quantity of interest (e.g. gene counts, log-fold-changes, ...) stratified by a covariate (e.g. gene length, GC-content, ...).

Usage

biasBoxplot(x,y,num.bins,...)

Arguments

Argument    Description
x           A numeric vector with the quantity of interest (e.g. gene counts, log-fold-changes, ...).
y           A numeric vector with the covariate of interest (e.g. gene length, GC-content, ...).
num.bins    A numeric value specifying the number of bins in which to stratify y. Defaults to 10.
...         See par.
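
The stratification behind such a plot can be sketched in base R. The example below is a hedged illustration on simulated data: the variables gc and lfc are invented, and the use of equal-count (quantile) bins is an assumption made for this sketch rather than a statement about how biasBoxplot chooses its bins.

```r
set.seed(3)
gc <- runif(800, min = 0.3, max = 0.6)                # hypothetical GC-content
lfc <- rnorm(800, mean = 2 * (gc - 0.45), sd = 0.5)   # quantity with a GC trend

# Cut the covariate into 10 bins holding roughly equal numbers of genes.
num.bins <- 10
breaks <- quantile(gc, probs = seq(0, 1, length.out = num.bins + 1))
bins <- cut(gc, breaks = breaks, include.lowest = TRUE)

# One box per bin; a drift in the medians suggests a covariate-related bias.
boxplot(lfc ~ bins, las = 2, cex.axis = 0.7, xlab = "", ylab = "lfc")
abline(h = 0, lty = 2)
```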

Examples

library(EDASeq)
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

# Keep only the genes that have GC-content information
sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub,])

data <- newSeqExpressionSet(mat,
                            phenoData=AnnotatedDataFrame(
                              data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                                         row.names=colnames(geneLevelData))),
                            featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

# Log-fold-change between lane 3 and lane 1; the +1 avoids log(0)
lfc <- log(geneLevelData[sub, 3] + 1) - log(geneLevelData[sub, 1] + 1)

biasBoxplot(lfc, yeastGC[sub], las=2, cex.axis=.7)

biasPlot_methods()

Methods for Function biasPlot in Package EDASeq

Description

biasPlot produces a plot of the lowess regression of the counts on a covariate of interest, typically the GC-content or the length of the genes.
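
To make the idea concrete, the following base R sketch fits one lowess curve per lane on simulated data. It is an illustration under stated assumptions, not the package code: the two simulated lanes, their opposite GC trends and the log(count + 1) scale are all invented for the example.

```r
set.seed(4)
gc <- runif(1000, min = 0.3, max = 0.6)
# Two hypothetical lanes whose expected counts depend on GC in opposite ways.
lane1 <- rpois(1000, lambda = exp(3 + 4 * (gc - 0.45)))
lane2 <- rpois(1000, lambda = exp(3 - 2 * (gc - 0.45)))

# Lowess regression of the log counts on the covariate, one fit per lane.
fit1 <- lowess(gc, log(lane1 + 1))
fit2 <- lowess(gc, log(lane2 + 1))

plot(fit1, type = "l", col = 1, ylim = range(fit1$y, fit2$y),
     xlab = "gc", ylab = "log(count + 1)")
lines(fit2, col = 2)
legend("topleft", legend = c("lane 1", "lane 2"), col = 1:2, lty = 1)
```

Curves that differ in shape between lanes point to a lane-specific covariate effect, which is the situation within-lane normalization is meant to address.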

Examples

library(EDASeq)
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

# Keep only the genes that have GC-content information
sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub,])

data <- newSeqExpressionSet(mat,
                            phenoData=AnnotatedDataFrame(
                              data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                                         row.names=colnames(geneLevelData))),
                            featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

# Lowess fit of the log counts on the "gc" feature column
biasPlot(data,"gc",ylim=c(0,5),log=TRUE)

boxplot_methods()

Methods for Function boxplot in Package EDASeq

Description

High-level functions to produce boxplots of some complex objects.

getGeneLengthAndGCContent()

Get gene length and GC-content

Description

Automatically retrieves gene length and GC-content information from Biomart or org.db packages.

Usage

getGeneLengthAndGCContent(id, org, mode=c("biomart", "org.db"))

Arguments

Argument    Description
id          Character vector of one or more ENSEMBL or ENTREZ gene IDs.
org         Organism three letter code, e.g. 'hsa' for 'Homo sapiens'. See also: http://www.genome.jp/kegg/catalog/org_list.html. In org.db mode, this can also be a specific genome assembly, e.g. 'hg38' or 'sacCer3'.
mode        Mode to retrieve the information. Defaults to 'biomart'. See Details.

Details

The 'biomart' mode is based on functionality from the biomaRt package and retrieves the required information from the BioMart database. This is available for all ENSEMBL organisms and is typically most current, but can be time-consuming when querying several thousand genes at a time.

The 'org.db' mode uses organism-based annotation packages from Bioconductor. This is much faster than the 'biomart' mode, but is only available for selected model organisms currently supported by BioC annotation functionality.

Results for the same gene ID(s) can differ between both modes as they are based on different sources for the underlying genome assembly. While the 'biomart' mode uses the latest ENSEMBL version, the 'org.db' mode uses BioC annotation packages typically built from UCSC.

Value

A numeric matrix with two columns: gene length and GC-content.
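
For intuition about the two returned quantities, the base R sketch below computes length and GC-content for a single made-up sequence. The sequence, the row name geneA and the column names length and gc are assumptions chosen for illustration; the function itself obtains sequences from the annotation sources described above.

```r
# Hypothetical gene sequence; real use would fetch it from an annotation source.
gene_seq <- "ATGGCGTACGCTTAGGCCATA"

# Split into single bases, then count them and the share that is G or C.
bases <- strsplit(toupper(gene_seq), "")[[1]]
gene_length <- length(bases)
gc_content <- mean(bases %in% c("G", "C"))

# One row per gene, one column per quantity.
res <- matrix(c(gene_length, gc_content), nrow = 1,
              dimnames = list("geneA", c("length", "gc")))
res
```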

Seealso

getSequence to retrieve a genomic sequence from BioMart, genes to extract genomic coordinates from a TxDb object, getSeq to extract genomic sequences from a BSgenome object, alphabetFrequency to calculate nucleotide frequencies.

Author

Ludwig Geistlinger Ludwig.Geistlinger@bio.ifi.lmu.de

Examples

library(EDASeq)
getGeneLengthAndGCContent("ENSG00000012048", "hsa")

meanVarPlot_methods()

Methods for Function meanVarPlot in Package EDASeq

Description

meanVarPlot produces a smoothScatter plot of the mean-variance relation.
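
The quantities behind a mean-variance plot can be sketched in base R. This is a hedged example on simulated negative binomial counts; it uses an ordinary scatterplot instead of smoothScatter, and the data, the log scale and the lowess overlay are assumptions made for the sketch.

```r
set.seed(5)
mu <- rexp(1000, rate = 1 / 100)
# Negative binomial counts: variance = mu + mu^2 / size (overdispersed).
counts <- sapply(1:4, function(i) rnbinom(1000, mu = mu, size = 5))

# Per-gene mean and variance across lanes; drop genes where a log is undefined.
gene_mean <- rowMeans(counts)
gene_var <- apply(counts, 1, var)
keep <- gene_mean > 0 & gene_var > 0

plot(log(gene_mean[keep]), log(gene_var[keep]), pch = 20, col = "grey60",
     xlab = "log mean", ylab = "log variance")
abline(0, 1, lty = 2)       # Poisson reference: variance equals mean
lines(lowess(log(gene_mean[keep]), log(gene_var[keep])), col = "red", lwd = 2)
```

Points lying above the dashed line indicate overdispersion relative to a Poisson model.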

newSeqExpressionSet()

Function to create a new SeqExpressionSet object.

Description

User-level function to create new objects of the class SeqExpressionSet.

Usage

newSeqExpressionSet(counts,
                    normalizedCounts = matrix(data=NA, nrow=nrow(counts), ncol=ncol(counts), dimnames=dimnames(counts)),
                    offset = matrix(data=0, nrow=nrow(counts), ncol=ncol(counts), dimnames=dimnames(counts)),
                    phenoData = annotatedDataFrameFrom(counts, FALSE),
                    featureData = annotatedDataFrameFrom(counts, TRUE),
                    ...)

Arguments

Argument            Description
counts              A matrix containing the counts for an RNA-Seq experiment. One column for each lane and one row for each gene.
normalizedCounts    A matrix with the same dimensions as counts containing the normalized counts.
offset              A matrix with the same dimensions as counts defining the offset (usually useful for normalization purposes). See the package vignette for a discussion on the offset.
phenoData           A data.frame or AnnotatedDataFrame with sample information, such as biological condition, library preparation protocol, flow-cell, ...
featureData         A data.frame or AnnotatedDataFrame with feature information, such as gene length, GC-content, ...
...                 Other arguments will be passed to the constructor inherited from eSet.

Value

An object of class SeqExpressionSet.

Seealso

SeqExpressionSet

Author

Davide Risso

Examples

library(EDASeq)

# Simulate Poisson counts for 100 genes in 4 lanes
counts <- matrix(data=0, nrow=100, ncol=4)
for(i in 1:4) {
  counts[, i] <- rpois(100, lambda=50)
}
cond <- c(rep("A", 2), rep("B", 2))

counts <- newSeqExpressionSet(counts, phenoData=data.frame(conditions=cond))

plotNtFrequency_methods()

Methods for Function plotNtFrequency in Package EDASeq

Description

Plots the nucleotide frequencies per position.

plotPCA_methods()

Methods for Function plotPCA in Package EDASeq

Description

plotPCA produces a Principal Component Analysis (PCA) plot of the counts in object.

Usage

## S4 method for signature 'matrix'
plotPCA(object, k=2, labels=TRUE, isLog=FALSE, ...)

## S4 method for signature 'SeqExpressionSet'
plotPCA(object, k=2, labels=TRUE, ...)

Arguments

Argument    Description
object      Either a numeric matrix or a SeqExpressionSet object containing the gene expression.
k           The number of principal components to be plotted.
labels      Logical. If TRUE, and k=2, it plots the colnames of object as point labels.
isLog       Logical. Set to TRUE if the data are already on the log scale.
...         See par.

Details

The Principal Component Analysis (PCA) plot is a useful diagnostic plot to highlight differences in the distribution of replicate samples, by projecting the samples into a lower dimensional space.

If there is strong differential expression between two classes, one expects the samples to cluster by class in the first few Principal Components (PCs) (usually 2 or 3 components are enough). This plot also highlights possible batch effects and/or outlying samples.
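
The clustering-by-class behaviour described above can be reproduced with prcomp on simulated data. This is a sketch under stated assumptions and not the plotPCA implementation: the simulated fold changes, the log(count + 1) transform and the sample names are invented for the example.

```r
set.seed(6)
baseline <- rexp(500, rate = 1 / 50)
fold <- exp(rnorm(500, sd = 1))         # hypothetical class effect per gene
counts <- cbind(
  mut1 = rpois(500, baseline * fold), mut2 = rpois(500, baseline * fold),
  wt1 = rpois(500, baseline), wt2 = rpois(500, baseline))

# Log-transform, then treat samples as observations (rows) for prcomp.
logc <- log(counts + 1)
pca <- prcomp(t(logc), center = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Label each sample at its position on the first two components.
plot(pca$x[, 1], pca$x[, 2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1], pca$x[, 2], labels = colnames(counts), col = rep(1:2, each = 2))
```

With a class effect this strong, the two mut samples and the two wt samples fall on opposite sides of the first component.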

Examples

library(EDASeq)
library(yeastRNASeq)
data(geneLevelData)

mat <- as.matrix(geneLevelData)

data <- newSeqExpressionSet(mat,
                            phenoData=AnnotatedDataFrame(
                              data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                                         row.names=colnames(geneLevelData))))

# Colour the samples by condition (two mut, two wt)
plotPCA(data, col=rep(1:2, each=2))

plotQuality_methods()

Methods for Function plotQuality in Package EDASeq

Description

plotQuality produces a plot of the quality of the reads.

plotRLE_methods()

Methods for Function plotRLE in Package EDASeq

Description

plotRLE produces a Relative Log Expression (RLE) plot of the counts in x.

Usage

plotRLE(x, ...)

Arguments

Argument    Description
x           Either a numeric matrix or a SeqExpressionSet object containing the gene expression.
...         See par.

Details

The Relative Log Expression (RLE) plot is a useful diagnostic plot to visualize the differences between the distributions of read counts across samples.

It shows the boxplots of the log-ratios of the gene-level read counts of each sample to those of a reference sample (defined as the median across the samples). Ideally, the distributions should be centered around the zero line and as tight as possible. Clear deviations indicate the need for normalization and/or the presence of outlying samples.
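
The computation described above can be sketched in a few lines of base R. This is a minimal illustration on simulated counts, not the package code; the log(count + 1) transform, the simulated depths and the names rle_mat and ref are assumptions made here.

```r
set.seed(7)
baseline <- rexp(1000, rate = 1 / 50)
depth <- c(1, 1, 2, 1)                 # third sample sequenced twice as deeply
counts <- sapply(depth, function(s) rpois(1000, baseline * s))
colnames(counts) <- paste0("s", 1:4)

logc <- log(counts + 1)
ref <- apply(logc, 1, median)          # per-gene median across samples
rle_mat <- logc - ref                  # log-ratio of each sample to the reference

# One box per sample; boxes centered away from zero flag the need to normalize.
boxplot(rle_mat, outline = FALSE, col = rep(2:3, each = 2), ylab = "RLE")
abline(h = 0, lty = 2)
```

In this simulation the box for s3 sits clearly above zero, which is the kind of deviation that calls for between-lane normalization.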

Examples

library(EDASeq)
library(yeastRNASeq)
data(geneLevelData)

mat <- as.matrix(geneLevelData)

data <- newSeqExpressionSet(mat,
                            phenoData=AnnotatedDataFrame(
                              data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                                         row.names=colnames(geneLevelData))))

# Colour the boxes by condition (two mut, two wt)
plotRLE(data, col=rep(2:3, each=2))

plot_methods()

Methods for Function plot in Package EDASeq

Description

High-level function to produce plots given one BamFileList object and one FastqFileList object.

withinLaneNormalization_methods()

Methods for Function withinLaneNormalization in Package EDASeq

Description

Within-lane normalization for GC-content (or other lane-specific) bias.

Usage

withinLaneNormalization(x, y, which=c("loess","median","upper","full"), offset=FALSE, num.bins=10, round=TRUE)

Arguments

Argument    Description
x           A numeric matrix representing the counts or a SeqExpressionSet object.
y           A numeric vector representing the covariate to normalize for (if x is a matrix) or a character vector with the name of the covariate (if x is a SeqExpressionSet object). Usually it is the GC-content.
which       Method used to normalize. See the details section and the reference below for details.
offset      Should the normalized value be returned as an offset leaving the original counts unchanged?
num.bins    The number of bins used to stratify the covariate for the median, upper and full methods. Ignored if which is loess. See the reference for a discussion on the number of bins.
round       If TRUE the normalization returns rounded values (pseudo-counts). Ignored if offset=TRUE.

Details

This method implements four normalizations described in Risso et al. (2011).

The loess normalization transforms the data by regressing the counts on y and subtracting the loess fit from the counts to remove the dependence.
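
A loess-based adjustment of this kind can be sketched in base R for a single simulated lane. The sketch works on the log(count + 1) scale and adds back the median so the adjusted values stay on a familiar level; both choices, like the simulated data, are assumptions made here and may differ from the package.

```r
set.seed(9)
gc <- runif(1000, min = 0.3, max = 0.6)
lane <- rpois(1000, lambda = exp(3 + 4 * (gc - 0.45)))  # one lane with GC bias
logc <- log(lane + 1)

# Regress the log counts on the covariate and remove the fitted trend.
fit <- loess(logc ~ gc)
adjusted <- logc - fitted(fit) + median(logc)

# Correlation with the covariate before and after the adjustment.
c(before = cor(logc, gc), after = cor(adjusted, gc))
```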

The median, upper and full normalizations are based on the stratification of the genes based on y. Once the genes are stratified in num.bins strata, the methods work as follows.

median: scales the data to have the same median in each bin.
upper: the same but with the upper quartile.
full: forces the distribution of each stratum to be the same using a non linear full quantile normalization, in the spirit of the one used in microarrays.
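
The stratify-then-scale idea can be sketched in base R for the median method. This is a hedged illustration on one simulated lane: the equal-count bins and the choice of the mean bin median as the common target are assumptions made for the sketch, not a description of the package internals.

```r
set.seed(8)
gc <- runif(1000, min = 0.3, max = 0.6)
lane <- rpois(1000, lambda = exp(3 + 4 * (gc - 0.45)))  # one lane with GC bias

# Stratify the genes into num.bins equal-sized bins of the covariate.
num.bins <- 10
breaks <- quantile(gc, probs = seq(0, 1, length.out = num.bins + 1))
bins <- cut(gc, breaks = breaks, include.lowest = TRUE)

# "median"-style scaling: rescale each bin so that all bin medians agree.
bin_median <- tapply(lane, bins, median)
scale_factor <- mean(bin_median) / bin_median
normalized <- lane * scale_factor[as.integer(bins)]

tapply(normalized, bins, median)   # bin medians after scaling
```

The upper method follows the same pattern with the upper quartile in place of the median.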

Author

Davide Risso.

References

D. Risso, K. Schwartz, G. Sherlock and S. Dudoit (2011). GC-Content Normalization for RNA-Seq Data. Manuscript in Preparation.

Examples

library(EDASeq)
library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

# Keep only the genes that have GC-content information
sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub, ])

data <- newSeqExpressionSet(mat,
                            phenoData=AnnotatedDataFrame(
                              data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                                         row.names=colnames(geneLevelData))),
                            featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

# Full quantile normalization within each lane, stratified by the "gc" column
norm <- withinLaneNormalization(data, "gc", which="full", offset=FALSE)

GC-content of S. cerevisiae genes

Description

This data set gives the GC-content (proportion of G and C) of the genes of S. cerevisiae, from SGD release 64 annotation.

Format

A vector containing 6717 observations.

Usage

yeastGC

Length of S. cerevisiae genes

Description

This data set gives the length (in base pairs) of the genes of S. cerevisiae, from SGD release 64 annotation.

Format

A vector containing 6717 observations.

Usage

yeastLength