bioconductor v3.9.0 EDASeq

Numerical summaries and graphical representations of some key features of the data along with implementations of both within-lane normalization methods for GC content bias and between-lane normalization methods to adjust for sequencing depth and possibly other differences in distribution.

Details

The SeqExpressionSet class is used to store gene-level counts along with sample information. It extends the virtual class eSet. See the help page of the class for details.

"Read-level" information is managed via the FastqFileList and BamFileList classes of Rsamtools.

The most used graphical tools for FastqFileList and BamFileList objects are 'barplot', 'plotQuality' and 'plotNtFrequency'. For SeqExpressionSet objects they are 'biasPlot', 'meanVarPlot' and 'MDPlot'.

To perform gene-level normalization use the functions 'withinLaneNormalization' and 'betweenLaneNormalization'.

An 'as' method exists to coerce SeqExpressionSet objects to CountDataSet objects (DESeq package).

See the package vignette for a typical Exploratory Data Analysis example.

Author

Davide Risso and Sandrine Dudoit. Maintainer: Davide Risso risso.davide@gmail.com

References

J. H. Bullard, E. A. Purdom, K. D. Hansen and S. Dudoit (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics Vol. 11, Article 94.

D. Risso, K. Schwartz, G. Sherlock and S. Dudoit (2011). GC-Content Normalization for RNA-Seq Data. Technical Report No. 291, Division of Biostatistics, University of California, Berkeley, Berkeley, CA.

MDPlot_methods()

Methods for Function MDPlot in Package EDASeq

Description

MDPlot produces a mean-difference smooth scatterplot of two lanes in an experiment.

Usage

MDPlot(x,y,...)

Arguments

Argument	Description
`x`	Either a numeric matrix or a SeqExpressionSet object containing the gene expression.
`y`	A numeric vector specifying the lanes to be compared.
`...`	See `par`

Details

The mean-difference (MD) plot is useful for visualizing differences between two lanes of an experiment. From an MD plot one can see whether normalization is needed, and whether a linear scaling is sufficient or a nonlinear normalization would be more effective.

MDPlot also draws a lowess fit (in red), highlighting a possible trend in the bias related to the mean expression.
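
The quantities shown in an MD plot can be sketched in base R as follows. This is an illustration on simulated counts, not the internal code of MDPlot: the pseudo-count of 1, the simulated lanes and all variable names are assumptions made for the example.

```r
# Minimal base-R sketch of a mean-difference (MD) plot for two lanes.
# Assumptions: simulated counts, log scale with a pseudo-count of 1.
set.seed(1)
mu    <- rexp(500, rate = 1 / 50)     # hypothetical per-gene expression levels
lane1 <- rpois(500, lambda = mu)
lane2 <- rpois(500, lambda = 2 * mu)  # second lane has about twice the depth

l1 <- log(lane1 + 1)
l2 <- log(lane2 + 1)
m  <- (l1 + l2) / 2                   # mean log expression (x-axis)
d  <- l2 - l1                         # log difference between lanes (y-axis)

fit <- lowess(m, d)                   # smooth trend of the difference vs. the mean

if (interactive()) {
  plot(m, d, pch = ".", xlab = "mean", ylab = "difference")
  lines(fit, col = "red")             # a flat line at 0 would indicate no bias
  abline(h = 0, lty = 2)
}
```

A fit that is flat but shifted away from zero suggests that a linear scaling may be enough, whereas a curved fit points towards a nonlinear normalization.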

Examples

library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub,])

data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

MDPlot(data,c(1,3))

SeqExpressionSet_class()

"SeqExpressionSet" class for collections of short reads

Description

This class represents a collection of digital expression data (usually counts from RNA-Seq technology) along with sample information.

Author

Davide Risso risso.davide@gmail.com

Examples

showMethods(class="SeqExpressionSet", where=getNamespace("EDASeq"))

counts <- matrix(data=0, nrow=100, ncol=4)
for(i in 1:4) {
    counts[, i] <- rpois(100, lambda=50)
}
cond <- c(rep("A", 2), rep("B", 2))

data <- newSeqExpressionSet(counts, phenoData=AnnotatedDataFrame(data.frame(conditions=cond)))

head(counts(data))
# factor() keeps the colors valid even when the conditions column is stored as character
boxplot(data, col=as.numeric(factor(pData(data)[,1]))+1)

barplot_methods()

Methods for Function barplot in Package EDASeq

Description

High-level functions to produce barplots of some complex objects.

betweenLaneNormalization_methods()

Methods for Function betweenLaneNormalization in Package EDASeq

Description

Between-lane normalization for sequencing depth and possibly other distributional differences between lanes.

Usage

betweenLaneNormalization(x, which=c("median","upper","full"), offset=FALSE, round=TRUE)

Arguments

Argument	Description
`x`	A numeric matrix representing the counts or a SeqExpressionSet object.
`which`	Method used to normalize. See the details section and the reference below for details.
`offset`	Should the normalized value be returned as an offset leaving the original counts unchanged?
`round`	If TRUE the normalization returns rounded values (pseudo-counts). Ignored if offset=TRUE.

Details

This method implements three normalizations described in Bullard et al. (2010). The methods are:

median: a scaling normalization that forces the median of each lane to be the same.

upper: the same but with the upper quartile.

full: a non-linear full quantile normalization, in the spirit of the one used in microarrays.
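
The three approaches can be sketched in base R on a genes-by-lanes matrix. The code below is a simplified illustration on simulated counts, not the implementation used by betweenLaneNormalization; the helper names scale_lanes and full_quantile and the choice of rescaling to the mean lane statistic are assumptions.

```r
# Sketch of three between-lane normalizations on a genes x lanes count matrix.
# Assumptions: simulated data; helper names are hypothetical.
set.seed(1)
mu <- rexp(200, rate = 1 / 50)
counts <- cbind(rpois(200, mu), rpois(200, 2 * mu), rpois(200, 0.5 * mu))

# Scaling normalization: divide each lane by its statistic, relative to the
# mean of that statistic across lanes, so all lanes share the same value.
scale_lanes <- function(x, stat) {
  s <- apply(x, 2, stat)
  sweep(x, 2, s / mean(s), "/")
}
med_norm   <- scale_lanes(counts, median)
upper_norm <- scale_lanes(counts, function(v) quantile(v, 0.75))

# Full quantile normalization: the k-th smallest value in every lane is
# replaced by the mean of the k-th smallest values across lanes.
full_quantile <- function(x) {
  ref <- rowMeans(apply(x, 2, sort))
  apply(x, 2, function(v) ref[rank(v, ties.method = "first")])
}
full_norm <- full_quantile(counts)

apply(med_norm, 2, median)   # identical medians across lanes
```

The two scaling methods only shift each lane by a constant factor, while the full quantile method forces the whole distribution of every lane to be the same.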

Author

Davide Risso.

References

D. Risso, K. Schwartz, G. Sherlock and S. Dudoit (2011). GC-Content Normalization for RNA-Seq Data. Manuscript in Preparation.

Examples

library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub, ])

data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

norm <- betweenLaneNormalization(data, which="full", offset=FALSE)

biasBoxplot_methods()

Methods for Function biasBoxplot in Package EDASeq

Description

biasBoxplot produces a boxplot representing the distribution of a quantity of interest (e.g. gene counts, log-fold-changes, ...) stratified by a covariate (e.g. gene length, GC-content, ...).

Usage

biasBoxplot(x,y,num.bins,...)

Arguments

Argument	Description
`x`	A numeric vector with the quantity of interest (e.g. gene counts, log-fold-changes, ...)
`y`	A numeric vector with the covariate of interest (e.g. gene length, GC-content, ...)
`num.bins`	A numeric value specifying the number of bins in which to stratify `y`. Defaults to 10.
`...`	See `par`
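
The stratification behind such a plot can be sketched in base R. The example assumes equal-frequency bins obtained by cutting the covariate at its quantiles, which may differ from what biasBoxplot does internally; the simulated data and variable names are hypothetical.

```r
# Sketch: stratify a quantity of interest by a covariate and summarize per bin.
# Assumptions: simulated GC-content and log-fold-changes, equal-frequency bins.
set.seed(1)
gc  <- runif(1000, 0.3, 0.6)                   # hypothetical GC-content per gene
lfc <- rnorm(1000, mean = 10 * (gc - 0.45))    # log-fold-change with a GC trend

num.bins <- 10
breaks <- quantile(gc, probs = seq(0, 1, length.out = num.bins + 1))
bins   <- cut(gc, breaks = breaks, include.lowest = TRUE)

by_bin      <- split(lfc, bins)                # one vector of values per stratum
bin_medians <- sapply(by_bin, median)

if (interactive()) boxplot(by_bin, las = 2, cex.axis = 0.7)
```

Boxes that drift up or down across the bins indicate that the quantity of interest depends on the covariate.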

Examples

library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub,])

data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

lfc <- log(geneLevelData[sub, 3] + 1) - log(geneLevelData[sub, 1] + 1)

biasBoxplot(lfc, yeastGC[sub], las=2, cex.axis=.7)

biasPlot_methods()

Methods for Function biasPlot in Package EDASeq

Description

biasPlot produces a plot of the lowess regression of the counts on a covariate of interest, typically the GC-content or the length of the genes.
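
The underlying idea, a lowess regression of log counts on a covariate, can be sketched in base R for a single lane. The simulated GC effect and the pseudo-count of 1 are assumptions made for illustration; biasPlot itself may differ in its details.

```r
# Sketch: lowess regression of log counts on GC-content for one lane.
# Assumptions: simulated counts with a built-in GC effect, pseudo-count of 1.
set.seed(1)
gc     <- runif(1000, 0.3, 0.6)
counts <- rpois(1000, lambda = exp(3 + 4 * (gc - 0.45)))

fit <- lowess(gc, log(counts + 1))   # fit$x is sorted GC, fit$y the smooth trend

if (interactive()) {
  plot(fit, type = "l", xlab = "GC-content", ylab = "log(count + 1)")
}
```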

Examples

library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub,])

data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

biasPlot(data,"gc",ylim=c(0,5),log=TRUE)

boxplot_methods()

Methods for Function boxplot in Package EDASeq

Description

High-level functions to produce boxplots of some complex objects.

getGeneLengthAndGCContent()

Get gene length and GC-content

Description

Automatically retrieves gene length and GC-content information from Biomart or org.db packages.

Usage

getGeneLengthAndGCContent(id, org, mode=c("biomart", "org.db"))

Arguments

Argument	Description
`id`	Character vector of one or more ENSEMBL or ENTREZ gene IDs.
`org`	Organism three letter code, e.g. 'hsa' for 'Homo sapiens'. See also: http://www.genome.jp/kegg/catalog/org_list.html; In org.db mode, this can be also a specific genome assembly, e.g. 'hg38' or 'sacCer3'.
`mode`	Mode to retrieve the information. Defaults to 'biomart'. See Details.

Details

The 'biomart' mode is based on functionality from the biomaRt package and retrieves the required information from the BioMart database. This is available for all ENSEMBL organisms and is typically most current, but can be time-consuming when querying several thousand genes at a time.

The 'org.db' mode uses organism-based annotation packages from Bioconductor. This is much faster than the 'biomart' mode, but is only available for selected model organisms currently supported by BioC annotation functionality.

Results for the same gene ID(s) can differ between the two modes as they are based on different sources for the underlying genome assembly. While the 'biomart' mode uses the latest ENSEMBL version, the 'org.db' mode uses BioC annotation packages typically built from UCSC.

Value

A numeric matrix with two columns: gene length and GC-content.

Author

Ludwig Geistlinger Ludwig.Geistlinger@bio.ifi.lmu.de

Examples

getGeneLengthAndGCContent("ENSG00000012048", "hsa")

meanVarPlot_methods()

Methods for Function meanVarPlot in Package EDASeq

Description

meanVarPlot produces a smoothScatter plot of the mean variance relation.
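
The mean-variance relation can be computed per gene in base R, as in the sketch below. It uses simulated Poisson counts, for which the variance is expected to track the mean; the data and variable names are assumptions and the code does not reproduce meanVarPlot itself.

```r
# Sketch: per-gene mean and variance across lanes.
# Assumptions: 500 simulated genes, 4 replicate lanes, Poisson counts.
set.seed(1)
mu     <- runif(500, 1, 200)
counts <- matrix(rpois(500 * 4, lambda = mu), nrow = 500)  # row i has mean mu[i]

gene_mean <- rowMeans(counts)
gene_var  <- apply(counts, 1, var)

if (interactive()) {
  keep <- gene_mean > 0 & gene_var > 0       # log axes cannot show zeros
  plot(gene_mean[keep], gene_var[keep], log = "xy",
       xlab = "mean", ylab = "variance")
  abline(0, 1, col = "red")                  # variance = mean (Poisson)
}
```

Points lying systematically above the red line would indicate over-dispersion relative to the Poisson model.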

newSeqExpressionSet()

Function to create a new SeqExpressionSet object.

Description

User-level function to create new objects of the class SeqExpressionSet.

Usage

newSeqExpressionSet(counts,
                    normalizedCounts = matrix(data=NA, nrow=nrow(counts), ncol=ncol(counts), dimnames=dimnames(counts)),
                    offset = matrix(data=0, nrow=nrow(counts), ncol=ncol(counts), dimnames=dimnames(counts)),
                    phenoData = annotatedDataFrameFrom(counts, FALSE),
                    featureData = annotatedDataFrameFrom(counts, TRUE),
                    ...)

Arguments

Argument	Description
`counts`	A matrix containing the counts for an RNA-Seq experiment. One column for each lane and one row for each gene.
`normalizedCounts`	A matrix with the same dimensions as `counts` containing the normalized counts.
`offset`	A matrix with the same dimensions as `counts` defining the offset (usually useful for normalization purposes). See the package vignette for a discussion on the offset.
`phenoData`	A data.frame or `AnnotatedDataFrame` with sample information, such as biological condition, library preparation protocol, flow-cell, ...
`featureData`	A data.frame or `AnnotatedDataFrame` with feature information, such as gene length, GC-content, ...
`...`	Other arguments will be passed to the constructor inherited from eSet.

Value

An object of class SeqExpressionSet.

Author

Davide Risso

Examples

counts <- matrix(data=0, nrow=100, ncol=4)
for(i in 1:4) {
    counts[, i] <- rpois(100, lambda=50)
}
cond <- c(rep("A", 2), rep("B", 2))

counts <- newSeqExpressionSet(counts, phenoData=data.frame(conditions=cond))

plotNtFrequency_methods()

Methods for Function plotNtFrequency in Package EDASeq

Description

Plots the nucleotide frequencies per position.

plotPCA_methods()

Methods for Function plotPCA in Package EDASeq

Description

plotPCA produces a Principal Component Analysis (PCA) plot of the counts in `object`.

Usage

## S4 method for signature 'matrix'
plotPCA(object, k=2, labels=TRUE, isLog=FALSE, ...)

## S4 method for signature 'SeqExpressionSet'
plotPCA(object, k=2, labels=TRUE, ...)

Arguments

Argument	Description
`object`	Either a numeric matrix or a SeqExpressionSet object containing the gene expression.
`k`	The number of principal components to be plotted.
`labels`	Logical. If `TRUE`, and `k=2`, it plots the `colnames` of `object` as point labels.
`isLog`	Logical. Set to `TRUE` if the data are already on the log scale.
`...`	See `par`

Details

The Principal Component Analysis (PCA) plot is a useful diagnostic plot to highlight differences in the distribution of replicate samples, by projecting the samples into a lower dimensional space.

If there is strong differential expression between two classes, one expects the samples to cluster by class in the first few Principal Components (PCs) (usually 2 or 3 components are enough). This plot also highlights possible batch effects and/or outlying samples.
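
The clustering by class described above can be reproduced on simulated data with base R's prcomp. This is a sketch under stated assumptions (log scale with a pseudo-count of 1, simulated counts with 30 differentially expressed genes), not the code used by plotPCA.

```r
# Sketch: PCA of samples on log counts with base R.
# Assumptions: 300 simulated genes, 2 samples per class, 30 genes with a
# 4-fold change in class B; log(count + 1) transformation.
set.seed(1)
n_genes <- 300
base <- runif(n_genes, 20, 200)
fold <- rep(1, n_genes)
fold[1:30] <- 4
counts <- cbind(A1 = rpois(n_genes, base),        A2 = rpois(n_genes, base),
                B1 = rpois(n_genes, base * fold), B2 = rpois(n_genes, base * fold))

logc   <- log(counts + 1)
pca    <- prcomp(t(logc))      # samples as rows; each gene is centered
scores <- pca$x[, 1:2]         # coordinates on the first two PCs

if (interactive()) {
  plot(scores, type = "n")
  text(scores, labels = rownames(scores), col = rep(1:2, each = 2))
}
```

In this simulation the first principal component separates the A samples from the B samples, which is the pattern one expects under strong differential expression.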

Examples

library(yeastRNASeq)
data(geneLevelData)

mat <- as.matrix(geneLevelData)

data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))))

plotPCA(data, col=rep(1:2, each=2))

plotQuality_methods()

Methods for Function plotQuality in Package EDASeq

Description

plotQuality produces a plot of the quality of the reads.

plotRLE_methods()

Methods for Function plotRLE in Package EDASeq

Description

plotRLE produces a Relative Log Expression (RLE) plot of the counts in `x`.

Usage

plotRLE(x, ...)

Arguments

Argument	Description
`x`	Either a numeric matrix or a SeqExpressionSet object containing the gene expression.
`...`	See `par`

Details

The Relative Log Expression (RLE) plot is a useful diagnostic plot to visualize the differences between the distributions of read counts across samples.

It shows the boxplots of the log-ratios of the gene-level read counts of each sample to those of a reference sample (defined as the median across the samples). Ideally, the distributions should be centered around the zero line and as tight as possible. Clear deviations indicate the need for normalization and/or the presence of outlying samples.
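
The log-ratios described above can be computed directly in base R. The sketch below uses simulated counts and a pseudo-count of 1, both of which are assumptions made for illustration rather than details of plotRLE.

```r
# Sketch: Relative Log Expression (RLE) values for a genes x samples matrix.
# Assumptions: simulated counts, log(count + 1); sample 4 has twice the depth.
set.seed(1)
n  <- 300
mu <- runif(n, 20, 200)
counts <- cbind(rpois(n, mu), rpois(n, mu), rpois(n, mu), rpois(n, 2 * mu))

logc <- log(counts + 1)
ref  <- apply(logc, 1, median)   # reference sample: per-gene median across samples
rle  <- logc - ref               # log-ratio of each sample to the reference

apply(rle, 2, median)            # the deeper sample sits clearly above zero

if (interactive()) {
  boxplot(rle, outline = FALSE, col = 2:5)
  abline(h = 0, lty = 2)
}
```

After a successful normalization, all boxes should be centered on the dashed zero line.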

Examples

library(yeastRNASeq)
data(geneLevelData)

mat <- as.matrix(geneLevelData)

data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))))

plotRLE(data, col=rep(2:3, each=2))

plot_methods()

Methods for Function plot in Package EDASeq

Description

High-level function to produce plots given one BamFileList object and one FastqFileList object.

withinLaneNormalization_methods()

Methods for Function withinLaneNormalization in Package EDASeq

Description

Within-lane normalization for GC-content (or other lane-specific) bias.

Usage

withinLaneNormalization(x, y, which=c("loess","median","upper","full"), offset=FALSE, num.bins=10, round=TRUE)

Arguments

Argument	Description
`x`	A numeric matrix representing the counts or a SeqExpressionSet object.
`y`	A numeric vector representing the covariate to normalize for (if `x` is a matrix) or a character vector with the name of the covariate (if `x` is a SeqExpressionSet object). Usually it is the GC-content.
`which`	Method used to normalize. See the details section and the reference below for details.
`offset`	Should the normalized value be returned as an offset leaving the original counts unchanged?
`num.bins`	The number of bins used to stratify the covariate for the `median`, `upper` and `full` methods. Ignored if `which` is `loess`. See the reference for a discussion on the number of bins.
`round`	If TRUE the normalization returns rounded values (pseudo-counts). Ignored if offset=TRUE.

Details

This method implements four normalizations described in Risso et al. (2011).

The loess normalization transforms the data by regressing the counts on y and subtracting the loess fit from the counts to remove the dependence.
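
A loess-based adjustment of this kind can be sketched in base R for a single lane. The sketch works on the log scale with a pseudo-count of 1 and adds the overall mean back after subtracting the fit; both choices are assumptions made for the example and may not match the package's implementation.

```r
# Sketch: remove the dependence of log counts on GC-content with a loess fit.
# Assumptions: simulated lane, log(count + 1), mean re-added after subtraction.
set.seed(1)
gc     <- runif(1000, 0.3, 0.6)
counts <- rpois(1000, lambda = exp(4 + 4 * (gc - 0.45)))

logc     <- log(counts + 1)
fit      <- loess(logc ~ gc)                  # regress log counts on the covariate
adjusted <- logc - fitted(fit) + mean(logc)   # subtract the fit, keep the overall level

c(before = cor(logc, gc), after = cor(adjusted, gc))
```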

The median, upper and full normalizations are based on the stratification of the genes according to y. Once the genes are stratified into num.bins strata, the methods work as follows:

median: scales the data to have the same median in each bin.

upper: the same but with the upper quartile.

full: forces the distribution of each stratum to be the same using a non-linear full quantile normalization, in the spirit of the one used in microarrays.
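
The binned median scaling can be sketched in base R for one lane. Equal-frequency bins and rescaling every bin to the mean of the bin medians are assumptions made for this illustration; the package may choose the bins and the common target value differently.

```r
# Sketch: within-lane median normalization over strata of a covariate.
# Assumptions: simulated lane, equal-frequency bins, common target median
# set to the mean of the per-bin medians.
set.seed(1)
gc     <- runif(1000, 0.3, 0.6)
counts <- rpois(1000, lambda = exp(4 + 4 * (gc - 0.45)))

num.bins <- 10
breaks <- quantile(gc, probs = seq(0, 1, length.out = num.bins + 1))
bins   <- cut(gc, breaks = breaks, include.lowest = TRUE)

bin_med    <- tapply(counts, bins, median)               # median count per stratum
target     <- mean(bin_med)
normalized <- counts * target / bin_med[as.integer(bins)]

after <- tapply(normalized, bins, median)                # now equal in every bin
```

Replacing median by the upper quartile gives the upper variant, while the full variant would instead apply a quantile normalization across the strata.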

Author

Davide Risso.

References

D. Risso, K. Schwartz, G. Sherlock and S. Dudoit (2011). GC-Content Normalization for RNA-Seq Data. Manuscript in Preparation.

Examples

library(yeastRNASeq)
data(geneLevelData)
data(yeastGC)

sub <- intersect(rownames(geneLevelData), names(yeastGC))

mat <- as.matrix(geneLevelData[sub, ])

data <- newSeqExpressionSet(mat,
    phenoData=AnnotatedDataFrame(
        data.frame(conditions=factor(c("mut", "mut", "wt", "wt")),
                   row.names=colnames(geneLevelData))),
    featureData=AnnotatedDataFrame(data.frame(gc=yeastGC[sub])))

norm <- withinLaneNormalization(data, "gc", which="full", offset=FALSE)

yeastGC()

GC-content of S. cerevisiae genes

Description

This data set gives the GC-content (proportion of G and C) of the genes of S. cerevisiae, from SGD release 64 annotation.
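
For reference, the proportion of G and C in a sequence can be computed in base R as sketched below. The helper gc_content and the example sequence are hypothetical and are not part of EDASeq.

```r
# Sketch: GC-content as the proportion of G and C bases in a DNA string.
# gc_content is a hypothetical helper written for this example.
gc_content <- function(seq) {
  bases <- strsplit(toupper(seq), "")[[1]]   # split into single characters
  mean(bases %in% c("G", "C"))
}

gc_content("ATGCGC")   # 4 of the 6 bases are G or C
```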

Format

A vector containing 6717 observations.

Usage

yeastGC

yeastLength()

Length of S. cerevisiae genes

Description

This data set gives the length (in base pairs) of the genes of S. cerevisiae, from SGD release 64 annotation.

Format

A vector containing 6717 observations.

Usage

yeastLength
