bioconductor v3.9.0 Sva

The sva package contains functions for removing batch

Link to this section Summary

Functions

Adjust for batch effects using an empirical Bayes framework

A function for estimating the probability that each gene is an empirical control

A function for quickly calculating f statistic p-values for use in sva

A function for quickly calculating f statistics for use in sva

A function for performing frozen surrogate variable analysis as proposed in Parker, Corrada Bravo and Leek 2013

A function for estimating surrogate variables by estimating empirical control probes

A function for calculating the number of surrogate variables to estimate in a model

A function for estimating surrogate variables with the two step approach of Leek and Storey 2007

A function for computing quality surrogate variables (qSVs)

A function for reading in coverage data from degradation-susceptible regions

A function for estimating surrogate variables using a supervised approach

sva: a package for removing artifacts from microarray and sequencing data

A function to adjust gene expression data before network inference

A function for post-hoc checking of an sva object to check for degenerate cases.

A function for estimating surrogate variables for count based RNA-seq data.

A function for estimating surrogate variables with the two step approach of Leek and Storey 2007

Link to this section Functions

Adjust for batch effects using an empirical Bayes framework

Description

ComBat allows users to adjust for batch effects in datasets where the batch covariate is known, using methodology described in Johnson et al. 2007. It uses either parametric or non-parametric empirical Bayes frameworks for adjusting data for batch effects. Users are returned an expression matrix that has been corrected for batch effects. The input data are assumed to be cleaned and normalized before batch effect removal.

Usage

ComBat(dat, batch, mod = NULL, par.prior = TRUE, prior.plots = FALSE,
  mean.only = FALSE, ref.batch = NULL, BPPARAM = bpparam("SerialParam"))

Arguments

ArgumentDescription
datGenomic measure matrix (dimensions probe x sample) - for example, expression matrix
batchBatch covariate (only one batch allowed)
modModel matrix for outcome of interest and other covariates besides batch
par.prior(Optional) TRUE indicates parametric adjustments will be used, FALSE indicates non-parametric adjustments will be used
prior.plots(Optional) TRUE give prior plots with black as a kernel estimate of the empirical batch effect density and red as the parametric
mean.only(Optional) FALSE If TRUE ComBat only corrects the mean of the batch effect (no scale adjustment)
ref.batch(Optional) NULL If given, will use the selected batch as a reference for batch adjustment.
BPPARAM(Optional) BiocParallelParam for parallel operation

Value

data A probe x sample genomic measure matrix, adjusted for batch effects.

Examples

library(bladderbatch)
data(bladderdata)
dat <- bladderEset[1:50,]

pheno = pData(dat)
edata = exprs(dat)
batch = pheno$batch
mod = model.matrix(~as.factor(cancer), data=pheno)

# parametric adjustment
combat_edata1 = ComBat(dat=edata, batch=batch, mod=NULL, par.prior=TRUE, prior.plots=FALSE)

# non-parametric adjustment, mean-only version
combat_edata2 = ComBat(dat=edata, batch=batch, mod=NULL, par.prior=FALSE, mean.only=TRUE)

# reference-batch version, with covariates
combat_edata3 = ComBat(dat=edata, batch=batch, mod=mod, par.prior=TRUE, ref.batch=3)
Link to this function

empiricalcontrols()

A function for estimating the probability that each gene is an empirical control

Description

This function uses the iteratively reweighted surrogate variable analysis approach to estimate the probability that each gene is an empirical control.

Usage

empirical.controls(dat, mod, mod0 = NULL, n.sv, B = 5, type = c("norm",
  "counts"))

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
modThe model matrix being used to fit the data
mod0The null model being compared when fitting the data
n.svThe number of surogate variables to estimate
BThe number of iterations of the irwsva algorithm to perform
typeIf type is norm then standard irwsva is applied, if type is counts, then the moderated log transform is applied first

Value

pcontrol A vector of probabilites that each gene is a control.

Examples

library(bladderbatch)
data(bladderdata)
dat <- bladderEset[1:5000,]

pheno = pData(dat)
edata = exprs(dat)
mod = model.matrix(~as.factor(cancer), data=pheno)

n.sv = num.sv(edata,mod,method="leek")
pcontrol <- empirical.controls(edata,mod,mod0=NULL,n.sv=n.sv,type="norm")

A function for quickly calculating f statistic p-values for use in sva

Description

This function does simple linear algebra to calculate f-statistics for each row of a data matrix comparing the nested models defined by the design matrices for the alternative (mod) and and null (mod0) cases. The columns of mod0 must be a subset of the columns of mod.

Usage

f.pvalue(dat, mod, mod0)

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
modThe model matrix being used to fit the data
mod0The null model being compared when fitting the data

Value

p A vector of F-statistic p-values one for each row of dat.

Examples

library(bladderbatch)
data(bladderdata)
dat <- bladderEset[1:50,]

pheno = pData(dat)
edata = exprs(dat)
mod = model.matrix(~as.factor(cancer), data=pheno)
mod0 = model.matrix(~1,data=pheno)

pValues = f.pvalue(edata,mod,mod0)
qValues = p.adjust(pValues,method="BH")

A function for quickly calculating f statistics for use in sva

Description

This function does simple linear algebra to calculate f-statistics for each row of a data matrix comparing the nested models defined by the design matrices for the alternative (mod) and and null (mod0) cases. The columns of mod0 must be a subset of the columns of mod.

Usage

fstats(dat, mod, mod0)

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
modThe model matrix being used to fit the data
mod0The null model being compared when fitting the data

Value

fstats A vector of F-statistics one for each row of dat.

Examples

library(bladderbatch)
data(bladderdata)
dat <- bladderEset[1:50,]

pheno = pData(dat)
edata = exprs(dat)
mod = model.matrix(~as.factor(cancer), data=pheno)
mod0 = model.matrix(~1,data=pheno)

fs <- fstats(edata, mod, mod0)

A function for performing frozen surrogate variable analysis as proposed in Parker, Corrada Bravo and Leek 2013

Description

This function performs frozen surrogate variable analysis as described in Parker, Corrada Bravo and Leek 2013. The approach uses a training database to create surrogate variables which are then used to remove batch effects both from the training database and a new data set for prediction purposes. For inferential analysis see sva , svaseq , with low level functionality available through irwsva.build and ssva .

Usage

fsva(dbdat, mod, sv, newdat = NULL, method = c("fast", "exact"))

Arguments

ArgumentDescription
dbdatA m genes by n arrays matrix of expression data from the database/training data
modThe model matrix for the terms included in the analysis for the training data
svThe surrogate variable object created by running sva on dbdat using mod.
newdat(optional) A set of test samples to be adjusted using the training database
methodIf method ="fast" then the SVD is calculated using an online approach, this may introduce slight bias. If method="exact" the exact SVD is calculated, but will be slower

Value

db An adjusted version of the training database where the effect of batch/expression heterogeneity has been removed

new An adjusted version of the new samples, adjusted one at a time using the fsva methodology.

newsv Surrogate variables for the new samples

Examples

library(bladderbatch)
library(pamr)
data(bladderdata)
dat <- bladderEset[1:50,]

pheno = pData(dat)
edata = exprs(dat)

set.seed(1234)
trainIndicator = sample(1:57,size=30,replace=FALSE)
testIndicator = (1:57)[-trainIndicator]
trainData = edata[,trainIndicator]
testData = edata[,testIndicator]
trainPheno = pheno[trainIndicator,]
testPheno = pheno[testIndicator,]

mydata = list(x=trainData,y=trainPheno$cancer)
mytrain = pamr.train(mydata)
table(pamr.predict(mytrain,testData,threshold=2),testPheno$cancer)

trainMod = model.matrix(~cancer,data=trainPheno)
trainMod0 = model.matrix(~1,data=trainPheno)
trainSv = sva(trainData,trainMod,trainMod0)

fsvaobj = fsva(trainData,trainMod,trainSv,testData)
mydataSv = list(x=fsvaobj$db,y=trainPheno$cancer)
mytrainSv = pamr.train(mydataSv)
table(pamr.predict(mytrainSv,fsvaobj$new,threshold=1),testPheno$cancer)

A function for estimating surrogate variables by estimating empirical control probes

Description

This function is the implementation of the iteratively re-weighted least squares approach for estimating surrogate variables. As a buy product, this function produces estimates of the probability of being an empirical control. See the function empirical.controls for a direct estimate of the empirical controls.

Usage

irwsva.build(dat, mod, mod0 = NULL, n.sv, B = 5)

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
modThe model matrix being used to fit the data
mod0The null model being compared when fitting the data
n.svThe number of surogate variables to estimate
BThe number of iterations of the irwsva algorithm to perform

Value

sv The estimated surrogate variables, one in each column

pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity

pprob.b A vector of the posterior probabilities each gene is affected by mod

n.sv The number of significant surrogate variables

Examples

library(bladderbatch)
data(bladderdata)
dat <- bladderEset[1:5000,]

pheno = pData(dat)
edata = exprs(dat)
mod = model.matrix(~as.factor(cancer), data=pheno)

n.sv = num.sv(edata,mod,method="leek")
res <- irwsva.build(edata, mod, mod0 = NULL,n.sv,B=5)

A function for calculating the number of surrogate variables to estimate in a model

Description

This function estimates the number of surrogate variables that should be included in a differential expression model. The default approach is based on a permutation procedure originally prooposed by Buja and Eyuboglu 1992. The function also provides an interface to the asymptotic approach proposed by Leek 2011 Biometrics.

Usage

num.sv(dat, mod, method = c("be", "leek"), vfilter = NULL, B = 20,
  seed = NULL)

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
modThe model matrix being used to fit the data
methodOne of "be" or "leek" as described in the details section
vfilterYou may choose to filter to the vfilter most variable rows before performing the analysis
BThe number of permutaitons to use if method = "be"
seedSet a seed when using the permutation approach

Value

n.sv The number of surrogate variables to use in the sva software

Examples

library(bladderbatch)
data(bladderdata)
dat <- bladderEset[1:5000,]

pheno = pData(dat)
edata = exprs(dat)
mod = model.matrix(~as.factor(cancer), data=pheno)

n.sv = num.sv(edata,mod,method="leek")

A function for estimating surrogate variables with the two step approach of Leek and Storey 2007

Description

This function is the implementation of the two step approach for estimating surrogate variables proposed by Leek and Storey 2007 PLoS Genetics. This function is primarily included for backwards compatibility. Newer versions of the sva algorithm are available through sva , svaseq , with low level functionality available through irwsva.build and ssva .

Usage

psva(dat, batch, ...)

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
batchA factor variable giving the known batch levels
...Other arguments to the sva function.

Value

psva.D Data with batch effect removed but biological heterogeneity preserved

Author

Elana J. Fertig

Examples

library(bladderbatch)
library(limma)
data(bladderdata)
dat <- bladderEset[1:50,]

pheno = pData(dat)
edata = exprs(dat)
batch = pheno$batch
batch.fac = as.factor(batch)

psva_data <- psva(edata,batch.fac)

A function for computing quality surrogate variables (qSVs)

Description

This function computes quality surrogate variables (qSVs) from the library-size- and read-length-normalized degradation matrix for subsequent RNA quality correction

Usage

qsva(degradationMatrix, mod = matrix(1, ncol = 1, nrow =
  ncol(degradationMatrix)))

Arguments

ArgumentDescription
degradationMatrixthe normalized degradation matrix, region by sample
mod(Optional) statistical model used in DE analysis

Value

the qSV adjustment variables

Examples

## Find files
bwPath <- system.file('extdata', 'bwtool', package = 'sva')

## Read the data
degCovAdj = read.degradation.matrix(
covFiles = list.files(bwPath,full.names=TRUE),
sampleNames = list.files(bwPath), readLength = 76,
totalMapped = rep(100e6,5),type="bwtool")

## Input data
head(degCovAdj)

## Results
qsva(degCovAdj)
Link to this function

readdegradationmatrix()

A function for reading in coverage data from degradation-susceptible regions

Description

This function reads in degradation regions to form a library-size- and read-length-normalized degradation matrix for subsequent RNA quality correction

Usage

read.degradation.matrix(covFiles, sampleNames, totalMapped, readLength = 100,
  normFactor = 8e+07, type = c("bwtool", "region_matrix_single",
  "region_matrix_all"), BPPARAM = bpparam())

Arguments

ArgumentDescription
covFilescoverage file(s) for degradation regions
sampleNamessample names; creates column names of degradation matrix
totalMappedhow many reads per sample (library size normalization)
readLengthread length in base pairs (read length normalization)
normFactorcommon library size to normalize to; 80M reads as default
typewhether input are individual bwtool output, region_matrix run on individual samples, or region_matrix run on all samples together
BPPARAM(Optional) BiocParallelParam for parallel operation

Value

the normalized degradation matrix, region by sample

Examples

# bwtool
bwPath = system.file('extdata', 'bwtool', package = 'sva')
degCovAdj = read.degradation.matrix(
covFiles = list.files(bwPath,full.names=TRUE),
sampleNames = list.files(bwPath), readLength = 76,
totalMapped = rep(100e6,5),type="bwtool")
head(degCovAdj)

# region_matrix: each sample
r1Path = system.file('extdata', 'region_matrix_one', package = 'sva')
degCovAdj1 = read.degradation.matrix(
covFiles = list.files(r1Path,full.names=TRUE),
sampleNames = list.files(r1Path), readLength = 76,
totalMapped = rep(100e6,5),type="region_matrix_single")
head(degCovAdj1)

r2Path = system.file('extdata', 'region_matrix_all', package = 'sva')
degCovAdj2 = read.degradation.matrix(
covFiles = list.files(r2Path,full.names=TRUE),
sampleNames = list.files(r1Path), readLength = 76,
totalMapped = rep(100e6,5),type="region_matrix_all")
head(degCovAdj2)

A function for estimating surrogate variables using a supervised approach

Description

This function implements a supervised surrogate variable analysis approach where genes/probes known to be affected by artifacts but not by the biological variables of interest are assumed to be known in advance. This supervised sva approach can be called through the sva and svaseq functions by specifying controls.

Usage

ssva(dat, controls, n.sv)

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
controlsA vector of probabilities (between 0 and 1, inclusive) that each gene is a control. A value of 1 means the gene is certainly a control and a value of 0 means the gene is certainly not a control.
n.svThe number of surogate variables to estimate

Value

sv The estimated surrogate variables, one in each column

pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity (exactly equal to controls for ssva)

pprob.b A vector of the posterior probabilities each gene is affected by mod (always null for ssva)

n.sv The number of significant surrogate variables

Examples

library(bladderbatch)
data(bladderdata)
dat <- bladderEset[1:5000,]

pheno = pData(dat)
edata = exprs(dat)
mod = model.matrix(~as.factor(cancer), data=pheno)

n.sv = num.sv(edata,mod,method="leek")
set.seed(1234)
controls <- runif(nrow(edata))
ssva_res <- ssva(edata,controls,n.sv)

sva: a package for removing artifacts from microarray and sequencing data

Description

sva has functionality to estimate and remove artifacts from high dimensional data the sva function can be used to estimate artifacts from microarray data the svaseq function can be used to estimate artifacts from count-based RNA-sequencing (and other sequencing) data. The ComBat function can be used to remove known batch effecs from microarray data. The fsva function can be used to remove batch effects for prediction problems.

This function is the implementation of the iteratively re-weighted least squares approach for estimating surrogate variables. As a by product, this function produces estimates of the probability of being an empirical control. See the function empirical.controls for a direct estimate of the empirical controls.

Usage

sva(dat, mod, mod0 = NULL, n.sv = NULL, controls = NULL,
  method = c("irw", "two-step", "supervised"), vfilter = NULL, B = 5,
  numSVmethod = "be")

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
modThe model matrix being used to fit the data
mod0The null model being compared when fitting the data
n.svThe number of surogate variables to estimate
controlsA vector of probabilities (between 0 and 1, inclusive) that each gene is a control. A value of 1 means the gene is certainly a control and a value of 0 means the gene is certainly not a control.
methodFor empirical estimation of control probes use "irw". If control probes are known use "supervised"
vfilterYou may choose to filter to the vfilter most variable rows before performing the analysis. vfilter must be NULL if method is "supervised"
BThe number of iterations of the irwsva algorithm to perform
numSVmethodIf n.sv is NULL, sva will attempt to estimate the number of needed surrogate variables. This should not be adapted by the user unless they are an expert.

Details

A vignette is available by typing browseVignettes("sva") in the R prompt.

Value

sv The estimated surrogate variables, one in each column

pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity

pprob.b A vector of the posterior probabilities each gene is affected by mod

n.sv The number of significant surrogate variables

Author

Jeffrey T. Leek, W. Evan Johnson, Hilary S. Parker, Andrew E. Jaffe, John D. Storey, Yuqing Zhang

References

For the package: Leek JT, Johnson WE, Parker HS, Jaffe AE, and Storey JD. (2012) The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics DOI:10.1093/bioinformatics/bts034

For sva: Leek JT and Storey JD. (2008) A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences , 105: 18718-18723.

For sva: Leek JT and Storey JD. (2007) Capturing heterogeneity in gene expression studies by `Surrogate Variable Analysis'. PLoS Genetics, 3: e161. For Combat: Johnson WE, Li C, Rabinovic A (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8 (1), 118-127 For svaseq: Leek JT (2014) svaseq: removing batch and other artifacts from count-based sequencing data. bioRxiv doi: TBD For fsva: Parker HS, Bravo HC, Leek JT (2013) Removing batch effects for prediction problems with frozen surrogate variable analysis arXiv:1301.3947 For psva: Parker HS, Leek JT, Favorov AV, Considine M, Xia X, Chavan S, Chung CH, Fertig EJ (2014) Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction Bioinformatics doi: 10.1093/bioinformatics/btu375 ## Examples r library(bladderbatch) data(bladderdata) dat <- bladderEset[1:5000,] pheno = pData(dat) edata = exprs(dat) mod = model.matrix(~as.factor(cancer), data=pheno) mod0 = model.matrix(~1,data=pheno) n.sv = num.sv(edata,mod,method="leek") svobj = sva(edata,mod,mod0,n.sv=n.sv)

A function to adjust gene expression data before network inference

Description

This function corrects a gene expression matrix prior to network inference by returning the residuals after regressing out the top principal components. The number of principal components to remove can be determined using a permutation-based approach using the "num.sv" function with method = "be"

Usage

sva_network(dat, n.pc)

Arguments

ArgumentDescription
datThe raw gene expression data matrix (with variables in rows and samples in columns)
n.pcThe number of principal components to remove

Value

dat.adjusted Cleaned gene expression data matrix with the top prinicpal components removed

Examples

library(bladderbatch)
data(bladderdata)
dat <- bladderEset[1:5000,]

pheno = pData(dat)
edata = exprs(dat)
mod = model.matrix(~as.factor(cancer), data=pheno)

n.pc = num.sv(edata, mod, method="be")
dat.adjusted = sva_network(edata, n.pc)

A function for post-hoc checking of an sva object to check for degenerate cases.

Description

This function is designed to check for degenerate cases in the sva fit and fix the sva object where possible.

Usage

sva.check(svaobj, dat, mod, mod0)

Arguments

ArgumentDescription
svaobjThe transformed data matrix with the variables in rows and samples in columns
datThe data set that was used to build the surrogate variables
modThe model matrix being used to fit the data
mod0The null model matrix being used to fit the data

Details

empirical.controls for a direct estimate of the empirical controls.

Value

sv The estimated surrogate variables, one in each column

pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity

pprob.b A vector of the posterior probabilities each gene is affected by mod

n.sv The number of significant surrogate variables

Examples

library(bladderbatch)
data(bladderdata)
#dat <- bladderEset
dat <- bladderEset[1:5000,]

pheno = pData(dat)
edata = exprs(dat)
mod = model.matrix(~as.factor(cancer), data=pheno)
mod0 = model.matrix(~1,data=pheno)

n.sv = num.sv(edata,mod,method="leek")
svobj = sva(edata,mod,mod0,n.sv=n.sv)
svacheckobj = sva.check(svobj,edata,mod,mod0)

A function for estimating surrogate variables for count based RNA-seq data.

Description

This function is the implementation of the iteratively re-weighted least squares approach for estimating surrogate variables. As a by product, this function produces estimates of the probability of being an empirical control. This function first applies a moderated log transform as described in Leek 2014 before calculating the surrogate variables. See the function empirical.controls for a direct estimate of the empirical controls.

Usage

svaseq(dat, mod, mod0 = NULL, n.sv = NULL, controls = NULL,
  method = c("irw", "two-step", "supervised"), vfilter = NULL, B = 5,
  numSVmethod = "be", constant = 1)

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
modThe model matrix being used to fit the data
mod0The null model being compared when fitting the data
n.svThe number of surogate variables to estimate
controlsA vector of probabilities (between 0 and 1, inclusive) that each gene is a control. A value of 1 means the gene is certainly a control and a value of 0 means the gene is certainly not a control.
methodFor empirical estimation of control probes use "irw". If control probes are known use "supervised"
vfilterYou may choose to filter to the vfilter most variable rows before performing the analysis. vfilter must be NULL if method is "supervised"
BThe number of iterations of the irwsva algorithm to perform
numSVmethodIf n.sv is NULL, sva will attempt to estimate the number of needed surrogate variables. This should not be adapted by the user unless they are an expert.
constantThe function takes log(dat + constant) before performing sva. By default constant = 1, all values of dat + constant should be positive.

Value

sv The estimated surrogate variables, one in each column

pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity

pprob.b A vector of the posterior probabilities each gene is affected by mod

n.sv The number of significant surrogate variables

Examples

library(zebrafishRNASeq)
data(zfGenes)
filter = apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered = zfGenes[filter,]
genes = rownames(filtered)[grep("^ENS", rownames(filtered))]
controls = grepl("^ERCC", rownames(filtered))
group = as.factor(rep(c("Ctl", "Trt"), each=3))
dat0 = as.matrix(filtered)

mod1 = model.matrix(~group)
mod0 = cbind(mod1[,1])
svseq = svaseq(dat0,mod1,mod0,n.sv=1)$sv
plot(svseq,pch=19,col="blue")
Link to this function

twostepsvabuild()

A function for estimating surrogate variables with the two step approach of Leek and Storey 2007

Description

This function is the implementation of the two step approach for estimating surrogate variables proposed by Leek and Storey 2007 PLoS Genetics. This function is primarily included for backwards compatibility. Newer versions of the sva algorithm are available through sva , svaseq , with low level functionality available through irwsva.build and ssva .

Usage

twostepsva.build(dat, mod, n.sv)

Arguments

ArgumentDescription
datThe transformed data matrix with the variables in rows and samples in columns
modThe model matrix being used to fit the data
n.svThe number of surogate variables to estimate

Value

sv The estimated surrogate variables, one in each column

pprob.gam: A vector of the posterior probabilities each gene is affected by heterogeneity

pprob.b A vector of the posterior probabilities each gene is affected by mod (this is always null for the two-step approach)

n.sv The number of significant surrogate variables

Examples

library(bladderbatch)
library(limma)
data(bladderdata)
dat <- bladderEset

pheno = pData(dat)
edata = exprs(dat)
mod = model.matrix(~as.factor(cancer), data=pheno)

n.sv = num.sv(edata,mod,method="leek")
svatwostep <- twostepsva.build(edata,mod,n.sv)