bioconductor v3.9.0 Scater

A collection of tools for doing various analyses of

Summary

Functions

The "Single Cell Expression Set" (SCESet) class

Additional accessors for the typical elements of a SingleCellExperiment object.

Accessor and replacement for bootstrap results in a SingleCellExperiment object

Calculate average counts, adjusting for size factors or library size

Calculate counts per million (CPM)

Calculate fragments per kilobase of exon per million reads mapped (FPKM)

Calculate QC metrics

Calculate transcripts-per-million (TPM)

Centre size factors at unity

Get feature annotation information from Biomart

Estimate the percentage of variance explained for each PC.

Estimate the percentage of variance explained for each gene.

Identify outlier values

Compute library size factors

Multiple plot function for ggplot2 plots

Count the number of non-zero counts per cell or feature

Normalize a SingleCellExperiment object using pre-computed size factors

Divide columns of a count matrix by the size factors

Plot column metadata

Plot the explanatory PCs for each variable

Plot explanatory variables ordered by percentage of variance explained

Plot expression values for all cells

Plot frequency against mean for each feature

Plot expression against transcript length

Plot heatmap of gene expression values

Plot the highest expressing features

Plot cells in plate positions

Plot a relative log expression (RLE) plot

Plot reduced dimensions

Plot row metadata

Plot an overview of expression for each cell

Plot specific reduced dimensions

Read sparse count matrix from file

Create a diffusion map from cell-level data

Perform MDS on cell-level data

Perform PCA on cell-level data

Perform t-SNE on cell-level data

Perform UMAP on cell-level data

Cell information for the small example single-cell counts dataset to demonstrate capabilities of scater

A small example of single-cell counts dataset to demonstrate capabilities of scater

Single-cell analysis toolkit for expression in R

General visualization parameters

Variable selection for visualization

Sum counts across a set of cells

Sum counts across a feature set

Convert an SCESet object to a SingleCellExperiment object

Make feature names unique

Functions

The "Single Cell Expression Set" (SCESet) class

Description

S4 class and the main class used by scater to hold single cell expression data. SCESet extends the basic Bioconductor ExpressionSet class.

Details

This class is initialized from a matrix of expression values.

Methods that operate on SCESet objects constitute the basic scater workflow.

References

Thanks to the Monocle package (github.com/cole-trapnell-lab/monocle-release/) for their CellDataSet class, which provided the inspiration and template for SCESet.

Additional accessors for the typical elements of a SingleCellExperiment object.

Description

Convenience functions to access commonly-used assays of the SingleCellExperiment object.

Usage

norm_exprs(object)
norm_exprs(object) <- value
stand_exprs(object)
stand_exprs(object) <- value
fpkm(object)
fpkm(object) <- value

Arguments

Argument | Description
object | SingleCellExperiment class object from which to access or to which to assign assay values. Namely: "exprs", "norm_exprs", "stand_exprs", "fpkm". The following are imported from SingleCellExperiment: "counts", "normcounts", "logcounts", "cpm", "tpm".
value | A numeric matrix (e.g. for exprs).

Value

a matrix of normalised expression data

a matrix of standardised expression data

a matrix of FPKM values

A matrix of numeric, integer or logical values.

Author

Davis McCarthy

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts), colData = sc_example_cell_info)

example_sce <- normalize(example_sce)
head(logcounts(example_sce)[,1:10])
head(exprs(example_sce)[,1:10]) # identical to logcounts()

example_sce <- SingleCellExperiment(
assays = list(norm_counts = sc_example_counts), colData = sc_example_cell_info)

counts(example_sce) <- sc_example_counts
norm_exprs(example_sce) <- log2(calculateCPM(example_sce, use_size_factors = FALSE) + 1)

stand_exprs(example_sce) <- log2(calculateCPM(example_sce, use_size_factors = FALSE) + 1)

tpm(example_sce) <- calculateTPM(example_sce, effective_length = 5e4)

cpm(example_sce) <- calculateCPM(example_sce, use_size_factors = FALSE)

fpkm(example_sce)

Accessor and replacement for bootstrap results in a SingleCellExperiment object

Description

SingleCellExperiment objects can contain bootstrap expression values (for example, as generated by the kallisto software for quantifying feature abundance). These functions conveniently access and replace the 'bootstrap' elements in the assays slot with the value supplied, which must be a matrix of the correct size, namely with the same number of rows and columns as the SingleCellExperiment object as a whole.

Usage

bootstraps(object)
bootstraps(object) <- value
## S4 method for signature 'SingleCellExperiment'
bootstraps(object)
## S4 replacement method for signature 'SingleCellExperiment,array'
bootstraps(object) <- value

Arguments

Argument | Description
object | A SingleCellExperiment object.
value | An array of class "numeric" containing bootstrap expression values.

Value

If accessing the bootstraps slot of a SingleCellExperiment, an array with the bootstrap values; otherwise, a SingleCellExperiment object containing the new bootstrap values.

Author

Davis McCarthy

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts), colData = sc_example_cell_info)
bootstraps(example_sce)

calculateAverage()

Calculate average counts, adjusting for size factors or library size

Description

Calculate average counts per feature, adjusting them to account for normalization due to size factors or library sizes.

Usage

calculateAverage(object, exprs_values = "counts",
  use_size_factors = TRUE, subset_row = NULL,
  BPPARAM = SerialParam())
calcAverage(object, exprs_values = "counts", use_size_factors = TRUE,
  subset_row = NULL, BPPARAM = SerialParam())

Arguments

Argument | Description
object | A SingleCellExperiment object or count matrix.
exprs_values | A string specifying the assay of object containing the count matrix, if object is a SingleCellExperiment.
use_size_factors | A logical scalar specifying whether the size factors in object should be used to construct effective library sizes.
subset_row | A vector specifying the subset of rows of object for which to return a result.
BPPARAM | A BiocParallelParam object specifying whether the calculations should be parallelized.

Details

The size-adjusted average count is defined by dividing each count by the size factor and taking the average across cells. All size factors are scaled so that the mean is 1 across all cells, to ensure that the averages are interpretable on the scale of the raw counts.

Assuming that object is a SingleCellExperiment:

  • If use_size_factors=TRUE, size factors are automatically extracted from the object. Note that different size factors may be used for features marked as spike-in controls. This is due to the presence of control-specific size factors in object; see normalizeSCE for more details.

  • If use_size_factors=FALSE, all size factors in object are ignored. Size factors are instead computed from the library sizes, using librarySizeFactors.

  • If use_size_factors is a numeric vector, it will override any size factors for non-spike-in features in object. The spike-in size factors will still be used for the spike-in transcripts. If no size factors are available, they will be computed from the library sizes using librarySizeFactors.

If object is a matrix or matrix-like object, size factors can be supplied by setting use_size_factors to a numeric vector. Otherwise, the sum of counts for each cell is used as the size factor through librarySizeFactors.
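The definition above can be sketched in base R. This is only an illustration on a made-up count matrix, assuming library-size-based size factors; it does not reproduce the package's handling of spike-in controls, and all object names are hypothetical.

```r
# Toy count matrix: 3 features (rows) x 4 cells (columns); values are made up.
counts <- matrix(c(10, 0, 5,
                   20, 2, 8,
                   5, 1, 3,
                   40, 4, 20),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

# Library-size-based size factors, scaled so that their mean is 1.
lib_sizes <- colSums(counts)
size_factors <- lib_sizes / mean(lib_sizes)

# Divide each column (cell) by its size factor, then average across cells.
adjusted <- sweep(counts, 2, size_factors, "/")
ave_counts <- rowMeans(adjusted)
ave_counts
```

Because the size factors have a mean of 1, the resulting averages stay on roughly the same scale as the raw counts.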

Value

Vector of average count values with same length as number of features, or the number of features in subset_row if supplied.

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
list(counts = sc_example_counts),
colData = sc_example_cell_info)

## calculate average counts
ave_counts <- calculateAverage(example_sce)

Calculate counts per million (CPM)

Description

Calculate count-per-million (CPM) values from the count data.

Usage

calculateCPM(object, exprs_values = "counts", use_size_factors = TRUE,
  subset_row = NULL)

Arguments

Argument | Description
object | A SingleCellExperiment object or count matrix.
exprs_values | A string specifying the assay of object containing the count matrix, if object is a SingleCellExperiment.
use_size_factors | A logical scalar indicating whether size factors in object should be used to compute effective library sizes. If not, all size factors are deleted and library size-based factors are used instead (see librarySizeFactors). Alternatively, a numeric vector containing a size factor for each cell, which is used in place of sizeFactors(object).
subset_row | A vector specifying the subset of rows of object for which to return a result.

Details

If requested, size factors are used to define the effective library sizes. This is done by scaling all size factors such that the mean scaled size factor is equal to the mean sum of counts across all features. The effective library sizes are then used in the denominator of the CPM calculation.

Assuming that object is a SingleCellExperiment:

  • If use_size_factors=TRUE, size factors are automatically extracted from the object. Note that effective library sizes may be computed differently for features marked as spike-in controls. This is due to the presence of control-specific size factors in object; see normalizeSCE for more details.

  • If use_size_factors=FALSE, all size factors in object are ignored. The total count for each cell will be used as the library size for all features (endogenous genes and spike-in controls).

  • If use_size_factors is a numeric vector, it will override any size factors for non-spike-in features in object. The spike-in size factors will still be used for the spike-in transcripts. If no size factors are available, the library sizes will be used.

If object is a matrix or matrix-like object, size factors will only be used if use_size_factors is a numeric vector. Otherwise, the sum of counts for each cell is directly used as the library size.
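As a rough base R sketch of the calculation described above (not the package's implementation), the following computes CPM values from raw library sizes and from effective library sizes. The count matrix and the size factors are invented for illustration.

```r
# Toy count matrix: 3 features (rows) x 4 cells (columns); values are made up.
counts <- matrix(c(10, 0, 5,
                   20, 2, 8,
                   5, 1, 3,
                   40, 4, 20),
                 nrow = 3)
lib_sizes <- colSums(counts)

# CPM using the total count of each cell as the library size.
cpm_lib <- sweep(counts, 2, lib_sizes / 1e6, "/")

# Hypothetical size factors, rescaled so that their mean equals the mean
# library size; these act as effective library sizes.
size_factors <- c(0.8, 1.5, 0.4, 1.3)
eff_lib_sizes <- size_factors / mean(size_factors) * mean(lib_sizes)
cpm_eff <- sweep(counts, 2, eff_lib_sizes / 1e6, "/")

colSums(cpm_lib)  # every cell sums to one million
```

With effective library sizes the columns no longer sum to exactly one million, since the denominator is no longer the cell's own total count.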

Value

Numeric matrix of CPM values.

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
list(counts = sc_example_counts),
colData = sc_example_cell_info)

cpm(example_sce) <- calculateCPM(example_sce, use_size_factors = FALSE)

calculateFPKM()

Calculate fragments per kilobase of exon per million reads mapped (FPKM)

Description

Calculate fragments per kilobase of exon per million reads mapped (FPKM) values for expression from counts for a set of features.

Usage

calculateFPKM(object, effective_length, ..., subset_row = NULL)

Arguments

Argument | Description
object | A SingleCellExperiment object or a numeric matrix of counts.
effective_length | Numeric vector providing the effective length for each feature in object.
... | Further arguments to pass to calculateCPM.
subset_row | A vector specifying the subset of rows of object for which to return a result.

Value

A numeric matrix of FPKM values.
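For orientation, the conventional definition of FPKM (counts per million divided by the feature length in kilobases) can be sketched in base R as follows. This assumes library sizes are the plain column sums; the counts and effective lengths are made up.

```r
# Toy count matrix: 3 features (rows) x 4 cells (columns); values are made up.
counts <- matrix(c(10, 0, 5,
                   20, 2, 8,
                   5, 1, 3,
                   40, 4, 20),
                 nrow = 3)

# Hypothetical effective lengths in bases, one per feature (row).
eff_len <- c(1000, 2000, 500)

# Counts per million, then divide each row by its length in kilobases.
cpm_manual <- sweep(counts, 2, colSums(counts) / 1e6, "/")
fpkm_manual <- cpm_manual / (eff_len / 1000)
fpkm_manual
```

A feature that is exactly 1 kb long has identical CPM and FPKM values, which is a quick sanity check on the units.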

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
list(counts = sc_example_counts),
colData = sc_example_cell_info)

eff_len <- runif(nrow(example_sce), 500, 2000)
fout <- calculateFPKM(example_sce, eff_len, use_size_factors = FALSE)

calculateQCMetrics()

Calculate QC metrics

Description

Compute quality control (QC) metrics for each feature and cell in a SingleCellExperiment object, accounting for specified control sets.

Usage

calculateQCMetrics(object, exprs_values = "counts",
  feature_controls = NULL, cell_controls = NULL, percent_top = c(50,
  100, 200, 500), detection_limit = 0, use_spikes = TRUE,
  compact = FALSE, BPPARAM = SerialParam())

Arguments

Argument | Description
object | A SingleCellExperiment object containing expression values, usually counts.
exprs_values | A string indicating which assays in the object should be used to define expression.
feature_controls | A named list containing one or more vectors (a character vector of feature names, a logical vector, or a numeric vector of indices), used to identify feature controls such as ERCC spike-in sets or mitochondrial genes.
cell_controls | A named list containing one or more vectors (a character vector of cell (sample) names, a logical vector, or a numeric vector of indices), used to identify cell controls, e.g., blank wells or bulk controls.
percent_top | An integer vector. Each element is treated as a number of top genes to compute the percentage of library size occupied by the most highly expressed genes in each cell. See pct_X_top_Y_features below for more details.
detection_limit | A numeric scalar to be passed to nexprs, specifying the lower detection limit for expression.
use_spikes | A logical scalar indicating whether existing spike-in sets in object should be automatically added to feature_controls, see ? .
compact | A logical scalar indicating whether the metrics should be returned in a compact format as a nested DataFrame.
BPPARAM | A BiocParallelParam object specifying whether the QC calculations should be parallelized.

Details

This function calculates useful quality control metrics to help with pre-processing of data and identification of potentially problematic features and cells.

Underscores in assayNames(object) and in feature_controls or cell_controls can theoretically cause ambiguities in the names of the output metrics. While problems are highly unlikely, users are advised to avoid underscores when naming their controls/assays.

If the expression values are double-precision, the per-row means may not be exactly identical for different choices of BPPARAM. This is due to differences in rounding error when summation is performed across different numbers of cores. If it is important to obtain numerically identical results (e.g., when using the per-row means for sensitive procedures like t-SNE) across various parallelization schemes, we suggest manually calculating those statistics using rowMeans.

Value

A SingleCellExperiment object containing QC metrics in the row and column metadata.

Author

Davis McCarthy, with (many!) modifications by Aaron Lun

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)
example_sce <- calculateQCMetrics(example_sce)

## with a set of feature controls defined
example_sce <- calculateQCMetrics(example_sce,
feature_controls = list(set1 = 1:40))

## with a named set of feature controls defined
example_sce <- calculateQCMetrics(example_sce,
feature_controls = list(ERCC = 1:40))

Calculate transcripts-per-million (TPM)

Description

Calculate transcripts-per-million (TPM) values for expression from counts for a set of features.

Usage

calculateTPM(object, effective_length = NULL, exprs_values = "counts",
  subset_row = NULL)

Arguments

Argument | Description
object | A SingleCellExperiment object or a count matrix.
effective_length | Numeric vector containing the effective length for each feature in object. If NULL, it is assumed that exprs_values has already been adjusted for transcript length.
exprs_values | String or integer specifying the assay containing the counts in object, if it is a SingleCellExperiment.
subset_row | A vector specifying the subset of rows of object for which to return a result.

Details

For read count data, this function assumes uniform coverage along the (effective) length of the transcript. Thus, the number of transcripts for a gene is proportional to the read count divided by the transcript length.

For UMI count data, this function should be run with effective_length=NULL, i.e., no division by the effective length. This is because the number of UMIs is a direct (albeit probably biased) estimate of the number of transcripts.
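The two cases above can be illustrated with a small base R sketch, assuming the usual definition of TPM (length-adjusted counts rescaled so that each cell sums to one million). The counts and lengths are invented, and this is not the package's code.

```r
# Toy count matrix: 3 features (rows) x 4 cells (columns); values are made up.
counts <- matrix(c(10, 0, 5,
                   20, 2, 8,
                   5, 1, 3,
                   40, 4, 20),
                 nrow = 3)

# Read counts: divide each feature by its (hypothetical) effective length
# to get a quantity proportional to the number of transcripts.
eff_len <- c(1000, 2000, 500)
rate <- counts / eff_len  # eff_len is recycled down the rows of each column
tpm_reads <- sweep(rate, 2, colSums(rate) / 1e6, "/")

# UMI counts: no length adjustment, only rescaling to one million per cell.
tpm_umi <- sweep(counts, 2, colSums(counts) / 1e6, "/")

colSums(tpm_reads)
colSums(tpm_umi)
```

In both cases every cell sums to one million; only the relative weighting of long and short features differs.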

Value

A numeric matrix of TPM values.

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info)

eff_len <- runif(nrow(example_sce), 500, 2000)
tout <- calculateTPM(example_sce, effective_length = eff_len)

centreSizeFactors()

Centre size factors at unity

Description

Scales all size factors so that the average size factor across cells is equal to 1.

Usage

centreSizeFactors(object, centre = 1)

Arguments

Argument | Description
object | A SingleCellExperiment object containing any number (including zero) of sets of size factors.
centre | A numeric scalar, the value around which all sets of size factors should be centred.

Details

Centering of size factors at unity ensures that division by size factors yields values on the same scale as the raw counts. This is important for the interpretation of the normalized values, as well as comparisons between features normalized with different size factors (e.g., spike-ins).
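The centring step itself amounts to a single rescaling. A minimal base R sketch, with made-up size factors and hypothetical variable names:

```r
# Hypothetical size factors for five cells.
size_factors <- c(0.5, 1.2, 2.0, 0.3, 1.5)
centre <- 1

# Rescale so that the mean of the size factors equals `centre`.
centred <- size_factors / mean(size_factors) * centre

mean(centred)
# Ratios between cells are unchanged by the rescaling.
centred / centred[1]
```

Only the overall scale changes; the relative differences between cells, which carry the normalization information, are preserved.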

Value

A SingleCellExperiment with modified size factors that are centred at unity.

See also

normalizeSCE

Author

Aaron Lun

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)

sizeFactors(example_sce) <- runif(ncol(example_sce))
sizeFactors(example_sce, "ERCC") <- runif(ncol(example_sce))
example_sce <- centreSizeFactors(example_sce)

mean(sizeFactors(example_sce))
mean(sizeFactors(example_sce, "ERCC"))

getBMFeatureAnnos()

Get feature annotation information from Biomart

Description

Use the biomaRt package to add feature annotation information to a SingleCellExperiment.

Usage

getBMFeatureAnnos(object, ids = rownames(object),
  filters = "ensembl_gene_id", attributes = c(filters, "mgi_symbol",
  "chromosome_name", "gene_biotype", "start_position", "end_position"),
  biomart = "ENSEMBL_MART_ENSEMBL", dataset = "mmusculus_gene_ensembl",
  host = "www.ensembl.org")

Arguments

Argument | Description
object | A SingleCellExperiment object.
ids | A character vector containing the identifiers for all rows of object, of the same type specified by filters.
filters | Character vector defining the filters to pass to the getBM function.
attributes | Character vector defining the attributes to pass to getBM.
biomart | String defining the biomaRt to be used, to be passed to useMart. Default is "ENSEMBL_MART_ENSEMBL".
dataset | String defining the dataset to use, to be passed to useMart. Default is "mmusculus_gene_ensembl", which should be changed if the organism is not mouse.
host | Character string argument which can be used to select a particular "host" to pass to useMart. Useful for accessing archived versions of biomaRt data. Default is "www.ensembl.org", in which case the current version of the biomaRt (now hosted by Ensembl) is used.

Value

A SingleCellExperiment object containing feature annotation. The input feature_symbol appears as the feature_symbol field in the rowData of the output object.

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)

mock_id <- paste0("ENSMUSG", sprintf("%011d", seq_len(nrow(example_sce))))
example_sce <- getBMFeatureAnnos(example_sce, ids=mock_id)

getExplanatoryPCs()

Estimate the percentage of variance explained for each PC.

Description

Estimate the percentage of variance explained for each PC.

Usage

getExplanatoryPCs(object, use_dimred = "PCA", ncomponents = 10,
  rerun = FALSE, run_args = list(), ...)

Arguments

Argument | Description
object | A SingleCellExperiment object containing expression values and per-cell experimental information.
use_dimred | String specifying the field in reducedDims(object) that contains the PCA results.
ncomponents | Integer scalar specifying the number of the top principal components to use.
rerun | Logical scalar indicating whether the PCA should be repeated, even if pre-computed results are already present.
run_args | A named list of arguments to pass to runPCA.
... | Additional arguments passed to getVarianceExplained.

Details

This function computes the percentage of variance in PC scores that is explained by variables in the sample-level metadata. It allows identification of important PCs that are driven by known experimental conditions, e.g., treatment, disease. PCs correlated with technical factors (e.g., batch effects, library size) can also be detected and removed prior to further analysis.

By default, the function will attempt to use pre-computed PCA results in object. This is done by taking the top ncomponents PCs from the matrix identified by use_dimred. If these are not available or if rerun=TRUE, the function will rerun the PCA using runPCA.
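The idea of relating PC scores to sample-level variables can be sketched in base R. The example below is an assumption-laden illustration: it uses prcomp and the R-squared of an ordinary linear model on simulated data, which may differ from what the package computes internally.

```r
set.seed(1)

# Simulated expression matrix: 20 features (rows) x 12 cells (columns).
n_cells <- 12
treatment <- factor(rep(c("ctrl", "treated"), each = 6))
expr <- matrix(rnorm(20 * n_cells), nrow = 20)

# Shift the first five features in treated cells to create a known effect.
expr[1:5, treatment == "treated"] <- expr[1:5, treatment == "treated"] + 4

# PCA on cells (prcomp expects observations in rows), keeping the top 3 PCs.
pc_scores <- prcomp(t(expr))$x[, 1:3]

# Percentage of variance in each PC's scores explained by the treatment.
pct_var <- apply(pc_scores, 2, function(score) {
  100 * summary(lm(score ~ treatment))$r.squared
})
pct_var
```

A PC with a high percentage for a given variable is one whose cell ordering is largely driven by that variable.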

Value

A matrix containing the percentage of variance explained by each factor (column) and for each PC (row).

See also

plotExplanatoryPCs, getVarianceExplained

Author

Aaron Lun

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info)
example_sce <- normalize(example_sce)

r2mat <- getExplanatoryPCs(example_sce)

getVarianceExplained()

Estimate the percentage of variance explained for each gene.

Description

Estimate the percentage of variance explained for each gene.

Usage

getVarianceExplained(object, exprs_values = "logcounts",
  variables = NULL, chunk = 1000)

Arguments

Argument | Description
object | A SingleCellExperiment object containing expression values and per-cell experimental information.
exprs_values | String specifying the expression values for which to compute the variance.
variables | Character vector specifying the explanatory factors in colData(object) to use. Default is NULL, in which case all variables in colData(object) are considered.
chunk | Integer scalar specifying the chunk size for chunk-wise processing. Only affects the speed/memory usage trade-off.

Details

This function computes the percentage of variance in gene expression that is explained by variables in the sample-level metadata. It allows problematic factors to be quickly identified, as well as the genes that are most affected.
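As a hedged illustration of the quantity described above, the following base R sketch computes, for each gene, the percentage of variance explained by a single factor as the R-squared of an ordinary linear model. The data are simulated, the names are hypothetical, and the package's internal computation may differ.

```r
set.seed(1)

# Hypothetical per-cell factor with two levels, five cells each.
batch <- factor(rep(c("a", "b"), each = 5))

# Two simulated genes: gene1 is shifted in batch "b", gene2 is pure noise.
expr <- rbind(
  gene1 = rnorm(10) + 3 * (batch == "b"),
  gene2 = rnorm(10)
)

# R-squared (as a percentage) from regressing each gene on the factor.
pct_var <- apply(expr, 1, function(y) {
  100 * summary(lm(y ~ batch))$r.squared
})
pct_var
```

Repeating this over several metadata columns would give a gene-by-variable matrix of the same shape as the value described below.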

Value

A matrix containing the percentage of variance explained by each factor (column) and for each gene (row).

See also

plotExplanatoryVariables

Author

Aaron Lun

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info)
example_sce <- normalize(example_sce)

r2mat <- getVarianceExplained(example_sce)

Identify outlier values

Description

Convenience function to determine which values in a numeric vector are outliers based on the median absolute deviation (MAD).

Usage

isOutlier(metric, nmads = 5, type = c("both", "lower", "higher"),
  log = FALSE, subset = NULL, batch = NULL, min_diff = NA)

Arguments

Argument | Description
metric | Numeric vector of values.
nmads | A numeric scalar, specifying the minimum number of MADs away from median required for a value to be called an outlier.
type | String indicating whether outliers should be looked for at both tails ("both"), only at the lower tail ("lower") or the upper tail ("higher").
log | Logical scalar, should the values of the metric be transformed to the log10 scale before computing MADs?
subset | Logical or integer vector, which subset of values should be used to calculate the median/MAD? If NULL, all values are used. Missing values will trigger a warning and will be automatically ignored.
batch | Factor of length equal to metric, specifying the batch to which each observation belongs. A median/MAD is calculated for each batch, and outliers are then identified within each batch.
min_diff | A numeric scalar indicating the minimum difference from the median to consider as an outlier. The outlier threshold is defined from the larger of nmads MADs and min_diff, to avoid calling many outliers when the MAD is very small. If NA, it is ignored.

Details

Lower and upper thresholds are stored in the "threshold" attribute of the returned vector. This is a numeric vector of length 2 when batch=NULL for the threshold on each side. Otherwise, it is a matrix with one named column per level of batch and two rows (one per threshold).
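The MAD-based rule can be sketched in base R as follows. This assumes the scaled MAD returned by stats::mad (default constant 1.4826) and a two-sided test; the metric values are made up and the variable names are hypothetical.

```r
# Hypothetical QC metric for eight cells; the last value is extreme.
metric <- c(10, 11, 9, 10, 12, 10, 11, 50)
nmads <- 3

# Median and (scaled) median absolute deviation.
med <- median(metric)
mad_val <- mad(metric)

# Thresholds on each side of the median.
lower <- med - nmads * mad_val
upper <- med + nmads * mad_val

# Values beyond either threshold are flagged as outliers.
is_out <- metric < lower | metric > upper
which(is_out)
```

Restricting the comparison to `metric < lower` or `metric > upper` alone would correspond to the one-sided "lower" and "higher" settings of type.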

Value

A logical vector of the same length as the metric argument, specifying the observations that are considered as outliers.

Author

Aaron Lun

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)
example_sce <- calculateQCMetrics(example_sce)

## with a set of feature controls defined
example_sce <- calculateQCMetrics(example_sce,
feature_controls = list(set1 = 1:40))
isOutlier(example_sce$total_counts, nmads = 3)

librarySizeFactors()

Compute library size factors

Description

Define size factors from the library sizes after centering. This ensures that the library size adjustment yields values comparable to those generated after normalization with other sets of size factors.

Usage

librarySizeFactors(object, exprs_values = "counts", subset_row = NULL)

Arguments

Argument | Description
object | A count matrix or SingleCellExperiment object containing counts.
exprs_values | A string indicating the assay of object containing the counts, if object is a SingleCellExperiment.
subset_row | A vector specifying whether the rows of object should be (effectively) subsetted before calculating library sizes.

Value

A numeric vector of size factors.
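As described at the top of this entry, the size factors are library sizes after centering. A minimal base R sketch on a made-up count matrix, assuming the centring is at a mean of 1:

```r
# Toy count matrix: 3 features (rows) x 4 cells (columns); values are made up.
counts <- matrix(c(10, 0, 5,
                   20, 2, 8,
                   5, 1, 3,
                   40, 4, 20),
                 nrow = 3)

# Library size of each cell, divided by the mean library size.
lib_sizes <- colSums(counts)
lib_size_factors <- lib_sizes / mean(lib_sizes)
lib_size_factors
```

Cells sequenced more deeply than average receive a factor above 1, and shallower cells a factor below 1.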

Examples

data("sc_example_counts")
summary(librarySizeFactors(sc_example_counts))

Multiple plot function for ggplot2 plots

Description

Place multiple ggplot plots on one page.

Usage

multiplot(..., plotlist = NULL, cols = 1, layout = NULL)

Arguments

Argument | Description
... | One or more ggplot objects.
plotlist | A list of ggplot objects, as an alternative to ... .
cols | A numeric scalar giving the number of columns in the layout.
layout | A matrix specifying the layout. If present, cols is ignored.

Details

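If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE), then plot 1 will go in the upper left, plot 2 will go in the upper right, and plot 3 will go all the way across the bottom.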
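The layout matrix from the sentence above can be inspected directly in base R to see which grid cells each plot occupies. This only examines the matrix and does not call multiplot.

```r
# Each entry gives the index of the plot that occupies that grid cell.
layout <- matrix(c(1, 2, 3, 3), nrow = 2, byrow = TRUE)
layout

# Grid positions (row, col) taken up by plot number 3.
cells_for_plot3 <- which(layout == 3, arr.ind = TRUE)
cells_for_plot3
```

Plot 3 appears in both columns of the second row, which is why it spans the full width of the page.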

Value

A ggplot object.

Examples

library(ggplot2)

## This example uses the ChickWeight dataset, which comes with R (datasets package)
## First plot
p1 <- ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet, group = Chick)) +
geom_line() +
ggtitle("Growth curve for individual chicks")
## Second plot
p2 <- ggplot(ChickWeight, aes(x = Time, y = weight, colour = Diet)) +
geom_point(alpha = .3) +
geom_smooth(alpha = .2, size = 1) +
ggtitle("Fitted growth curve per diet")

## Third plot
p3 <- ggplot(subset(ChickWeight, Time == 21), aes(x = weight, colour = Diet)) +
geom_density() +
ggtitle("Final weight, by diet")
## Fourth plot
p4 <- ggplot(subset(ChickWeight, Time == 21), aes(x = weight, fill = Diet)) +
geom_histogram(colour = "black", binwidth = 50) +
facet_grid(Diet ~ .) +
ggtitle("Final weight, by diet") +
theme(legend.position = "none")        # No legend (redundant in this graph)

## Combine plots and display
multiplot(p1, p2, p3, p4, cols = 2)

Count the number of non-zero counts per cell or feature

Description

An efficient internal function that counts the number of non-zero counts in each row (per feature) or column (per cell). This avoids the need to construct an intermediate logical matrix.

Usage

nexprs(object, detection_limit = 0, exprs_values = "counts",
  byrow = FALSE, subset_row = NULL, subset_col = NULL,
  BPPARAM = SerialParam())

Arguments

Argument | Description
object | A SingleCellExperiment object or a numeric matrix of expression values.
detection_limit | Numeric scalar providing the value above which observations are deemed to be expressed.
exprs_values | String or integer specifying the assay of object to obtain the count matrix from, if object is a SingleCellExperiment.
byrow | Logical scalar indicating whether to count the number of detected cells per feature. If FALSE, the function will count the number of detected features per cell.
subset_row | Logical, integer or character vector indicating which rows (i.e., features) to use.
subset_col | Logical, integer or character vector indicating which columns (i.e., cells) to use.
BPPARAM | A BiocParallelParam object specifying whether the calculations should be parallelized.

Details

Setting subset_row or subset_col is equivalent to subsetting object before calling nexprs, but more efficient as a new copy of the matrix is not constructed.
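For comparison, the same counts can be obtained in plain base R, at the cost of building the intermediate logical matrix that nexprs is described as avoiding. The count matrix is made up for illustration.

```r
# Toy count matrix: 3 features (rows) x 4 cells (columns); values are made up.
counts <- matrix(c(10, 0, 5,
                   20, 2, 8,
                   5, 1, 3,
                   40, 4, 20),
                 nrow = 3)
detection_limit <- 0

# Logical matrix marking values above the detection limit.
detected <- counts > detection_limit

# Number of detected features per cell (columns), like byrow = FALSE.
features_per_cell <- colSums(detected)

# Number of cells in which each feature is detected (rows), like byrow = TRUE.
cells_per_feature <- rowSums(detected)

# Subsetting rows first corresponds to using subset_row.
features_per_cell_subset <- colSums(detected[1:2, , drop = FALSE])
```

The `detected` matrix is the extra copy; for large datasets this is the memory cost that a dedicated counting routine sidesteps.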

Value

An integer vector containing counts per gene or cell, depending on the provided arguments.

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info)

nexprs(example_sce)[1:10]
nexprs(example_sce, byrow = TRUE)[1:10]

Normalize a SingleCellExperiment object using pre-computed size factors

Description

Compute normalized expression values from count data in a SingleCellExperiment object, using the size factors stored in the object.

Usage

normalizeSCE(object, exprs_values = "counts", return_log = TRUE,
  log_exprs_offset = NULL, centre_size_factors = TRUE,
  preserve_zeroes = FALSE)
## S4 method for signature 'SingleCellExperiment'
normalize(object, exprs_values = "counts",
  return_log = TRUE, log_exprs_offset = NULL,
  centre_size_factors = TRUE, preserve_zeroes = FALSE)

Arguments

Argument | Description
object | A SingleCellExperiment object.
exprs_values | String indicating which assay contains the count data that should be used to compute log-transformed expression values.
return_log | Logical scalar, should normalized values be returned on the log2 scale? If TRUE, output is stored as "logcounts" in the returned object; if FALSE, output is stored as "normcounts".
log_exprs_offset | Numeric scalar specifying the pseudo-count to add when log-transforming expression values. If NULL, the value is taken from metadata(object)$log.exprs.offset if defined, otherwise it is set to 1.
centre_size_factors | Logical scalar indicating whether size factors should be centred.
preserve_zeroes | Logical scalar indicating whether zeroes should be preserved when dealing with non-unity offsets.

Details

Normalized expression values are computed by dividing the counts for each cell by the size factor for that cell. This aims to remove cell-specific scaling biases, e.g., due to differences in sequencing coverage or capture efficiency. If return_log=TRUE, log-normalized values are calculated by adding log_exprs_offset to the normalized count and performing a log2 transformation.

Features marked as spike-in controls will be normalized with control-specific size factors, if these are available. This reflects the fact that spike-in controls are subject to different biases than those that are removed by gene-specific size factors (namely, total RNA content). If size factors for a particular spike-in set are not available, a warning will be raised.

If centre_size_factors=TRUE, all sets of size factors will be centred to have the same mean prior to calculation of normalized expression values. This ensures that abundances are roughly comparable between features normalized with different sets of size factors. By default, the centre mean is unity, which means that the computed exprs can be interpreted as being on the same scale as log-counts. It also means that the added log_exprs_offset can be interpreted as a pseudo-count (i.e., on the same scale as the counts).

If preserve_zeroes=TRUE and the pseudo-count is not unity, size factors are instead centred at the specified value of log_exprs_offset. The log-transformation is then performed on the normalized expression values with a pseudo-count of 1, which ensures that zeroes remain so in the output matrix. This yields the same results as preserve_zeroes=FALSE minus a matrix-wide constant of log2(log_exprs_offset).
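
The relationship between the two settings can be checked with a few lines of base R. This is only a sketch of the arithmetic described above, not scater's implementation; the toy matrix counts, the size factors sf and the value of offset are made up for illustration.

```r
# Toy data (hypothetical): 3 genes x 4 cells, filled column by column.
counts <- matrix(c(0, 5, 10, 2, 0, 8, 4, 4, 0, 1, 7, 3), nrow = 3)
sf <- c(0.5, 1.5, 1.2, 0.8)   # size factors with a mean of 1
offset <- 0.5                 # a non-unity pseudo-count

# preserve_zeroes=FALSE: unit-centred size factors, then add the offset.
standard <- log2(t(t(counts) / sf) + offset)

# preserve_zeroes=TRUE: size factors centred at the offset, pseudo-count of 1.
preserved <- log2(t(t(counts) / (sf * offset)) + 1)

# Zero counts stay at zero, and the two results differ by a constant.
all(preserved[counts == 0] == 0)
all.equal(preserved, standard - log2(offset))
```

Both checks should return TRUE, since log2(x / offset + 1) equals log2(x + offset) - log2(offset) for any normalized value x.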

In some cases, the function will return a DelayedMatrix with delayed division and log-transformation operations. This requires that the assay specified by exprs_values contains a DelayedMatrix, and only one set of size factors is used for all features. This avoids the need to explicitly calculate normalized expression values across a very large (possibly file-backed) matrix.

Value

A SingleCellExperiment object containing normalized expression values in "normcounts" if return_log=FALSE, and log-normalized expression values in "logcounts" if return_log=TRUE. All size factors will also be centred in the output object if centre_size_factors=TRUE.

Author

Davis McCarthy and Aaron Lun

Examples

library(scater)

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)

example_sce <- normalize(example_sce)

normalizeCounts()

Divide columns of a count matrix by the size factors

Description

Compute (log-)normalized expression values by dividing counts for each cell by the corresponding size factor.

Usage

normalizeCounts(x, size_factors, return_log = TRUE,
  log_exprs_offset = 1, centre_size_factors = FALSE,
  subset_row = NULL)

Arguments

Argument: Description
x: A count matrix, with cells in the columns and genes in the rows.
size_factors: A numeric vector of size factors for all cells.
return_log: Logical scalar, should normalized values be returned on the log2 scale?
log_exprs_offset: Numeric scalar specifying the offset to add when log-transforming expression values.
centre_size_factors: Logical scalar indicating whether size factors should be centred.
subset_row: A vector specifying the subset of rows of x for which to return a result.

Details

This function will compute log-normalized expression values from x. It will endeavour to return an object of the same class as x, with particular focus on DelayedMatrix inputs/outputs.

Note that the default centre_size_factors differs from that in normalizeSCE. Users of this function are assumed to know what they're doing with respect to normalization.
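
The computation can be sketched in base R as follows. The helper normalize_counts_sketch is hypothetical: it mirrors the documented arguments for ordinary matrices only and is not the package's code. The size factors in the usage lines are simply column sums scaled to a mean of one, which is an assumption made for this illustration.

```r
# Hypothetical helper mirroring the documented arguments; not scater's code.
normalize_counts_sketch <- function(x, size_factors, return_log = TRUE,
                                    log_exprs_offset = 1,
                                    centre_size_factors = FALSE,
                                    subset_row = NULL) {
    if (centre_size_factors) {
        size_factors <- size_factors / mean(size_factors)
    }
    if (!is.null(subset_row)) {
        x <- x[subset_row, , drop = FALSE]
    }
    # Divide each column (cell) by its size factor.
    out <- sweep(x, 2, size_factors, "/")
    if (return_log) {
        out <- log2(out + log_exprs_offset)
    }
    out
}

# Toy count matrix: 2 genes x 3 cells.
x <- matrix(c(4, 0, 2, 8, 6, 0), nrow = 2,
            dimnames = list(c("g1", "g2"), c("c1", "c2", "c3")))
lib_sf <- colSums(x) / mean(colSums(x))   # library-size-style factors
normed <- normalize_counts_sketch(x, lib_sf)
```

With the default offset of 1, zero counts map to zero in normed, and every other entry is log2 of the scaled count plus one.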

Value

A matrix-like object of (log-)normalized expression values.

Author

Aaron Lun

Examples

library(scater)

data("sc_example_counts")
normed <- normalizeCounts(sc_example_counts,
    librarySizeFactors(sc_example_counts))

plotColData()

Plot column metadata

Description

Plot column-level (i.e., cell) metadata in a SingleCellExperiment object.

Usage

plotColData(object, y, x = NULL, colour_by = NULL, shape_by = NULL,
  size_by = NULL, by_exprs_values = "logcounts",
  by_show_single = FALSE, ...)

Arguments

Argument: Description
object: A SingleCellExperiment object containing expression values and experimental information.
y: Specification of the column-level metadata to show on the y-axis, see ?"scater-vis-var" for possible values. Note that only metadata fields will be searched, assays will not be used.
x: Specification of the column-level metadata to show on the x-axis, see ?"scater-vis-var" for possible values. Again, only metadata fields will be searched, assays will not be used.
colour_by: Specification of a column metadata field or a feature to colour by, see ?"scater-vis-var" for possible values.
shape_by: Specification of a column metadata field or a feature to shape by, see ?"scater-vis-var" for possible values.
size_by: Specification of a column metadata field or a feature to size by, see ?"scater-vis-var" for possible values.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for use in point aesthetics; see ?"scater-vis-var" for details.
by_show_single: Logical scalar specifying whether single-level factors should be used for point aesthetics, see ?"scater-vis-var" for details.
...: Additional arguments for visualization, see ?"scater-plot-args" for details.

Details

If y is continuous and x=NULL, a violin plot is generated. If x is categorical, a grouped violin plot will be generated, with one violin for each level of x. If x is continuous, a scatter plot will be generated.

If y is categorical and x is continuous, horizontal violin plots will be generated. If x is missing or categorical, rectangle plots will be generated where the area of a rectangle is proportional to the number of points for a combination of factors.
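
The rules above amount to a small decision table. The sketch below restates them as a base R function; choose_plot_type is a hypothetical name used only for illustration and is not part of scater, and it treats numeric vectors as continuous and everything else as categorical.

```r
# Hypothetical helper restating the documented rules; not part of scater.
choose_plot_type <- function(y, x = NULL) {
    y_cont <- is.numeric(y)
    x_cont <- !is.null(x) && is.numeric(x)
    if (y_cont) {
        if (is.null(x)) {
            "violin"
        } else if (x_cont) {
            "scatter"
        } else {
            "grouped violin"
        }
    } else if (x_cont) {
        "horizontal violin"
    } else {
        "rectangle"
    }
}

choose_plot_type(c(1.2, 3.4), factor(c("a", "b")))   # continuous y, categorical x
choose_plot_type(factor(c("a", "b")))                # categorical y, x missing
```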

Note that plotPhenoData and plotCellData are synonyms for plotColData. These are artifacts of the transition from the old SCESet class, and will be deprecated in future releases.

Value

A ggplot object.

Author

Davis McCarthy, with modifications by Aaron Lun

Examples

library(scater)

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- calculateQCMetrics(example_sce)
example_sce <- normalize(example_sce)

plotColData(example_sce, y = "total_features_by_counts",
    x = "log10_total_counts", colour_by = "Mutation_Status")

plotColData(example_sce, y = "total_features_by_counts",
    x = "log10_total_counts", colour_by = "Mutation_Status",
    size_by = "Gene_0001", shape_by = "Treatment")

plotColData(example_sce, y = "Treatment",
    x = "log10_total_counts", colour_by = "Mutation_Status")

plotColData(example_sce, y = "total_features_by_counts",
    x = "Cell_Cycle", colour_by = "Mutation_Status")

plotExplanatoryPCs()

Plot the explanatory PCs for each variable

Description

Plot the explanatory PCs for each variable

Usage

plotExplanatoryPCs(object, nvars_to_plot = 10, npcs_to_plot = 50,
  theme_size = 10, ...)

Arguments

Argument: Description
object: A SingleCellExperiment object containing expression values and experimental information. Alternatively, a matrix containing the output of getExplanatoryPCs.
nvars_to_plot: Integer scalar specifying the number of variables with the greatest explanatory power to plot. This can be set to Inf to show all variables.
npcs_to_plot: Integer scalar specifying the number of PCs to plot.
theme_size: Numeric scalar providing base font size for ggplot theme.
...: Parameters to be passed to getExplanatoryPCs.

Details

A density plot is created for each variable, showing the R-squared for each successive PC (up to npcs_to_plot PCs). Only the nvars_to_plot variables with the largest maximum R-squared across PCs are shown.

If object is a SingleCellExperiment object, getExplanatoryPCs will be called to compute the variance in each PC that is explained by each variable. Users may prefer to run getExplanatoryPCs manually and pass the resulting matrix as object, in which case the R-squared values are used directly.
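
To make the quantity being plotted concrete, the sketch below computes an R-squared per PC for one variable using only base R (prcomp and lm). It is an illustration on simulated data, assuming a simple linear regression of each PC on the variable; getExplanatoryPCs may differ in its details. The names batch, logexprs and r2_per_pc are made up for this example.

```r
set.seed(100)
# Hypothetical log-expression matrix: 50 genes x 20 cells, with a batch shift
# applied to the first 25 genes.
batch <- factor(rep(c("A", "B"), each = 10))
logexprs <- matrix(rnorm(50 * 20), nrow = 50)
logexprs[1:25, batch == "B"] <- logexprs[1:25, batch == "B"] + 3

# PCA on cells (rows of the transposed matrix); keep the first five PCs.
pcs <- prcomp(t(logexprs))$x[, 1:5]

# R-squared from regressing each PC on the variable.
r2_per_pc <- apply(pcs, 2, function(pc) summary(lm(pc ~ batch))$r.squared)
round(r2_per_pc, 2)
```

In this simulation the batch shift is large, so most of its explanatory power is expected to fall on the first PC.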

Value

A ggplot object.

Examples

library(scater)

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info)
example_sce <- normalize(example_sce)

plotExplanatoryPCs(example_sce)

plotExplanatoryVariables()

Plot explanatory variables ordered by percentage of variance explained

Description

Plot explanatory variables ordered by percentage of variance explained

Usage

plotExplanatoryVariables(object, nvars_to_plot = 10,
  min_marginal_r2 = 0, theme_size = 10, ...)

Arguments

Argument: Description
object: A SingleCellExperiment object containing expression values and experimental information. Alternatively, a matrix containing the output of getVarianceExplained.
nvars_to_plot: Integer scalar specifying the number of variables with the greatest explanatory power to plot. This can be set to Inf to show all variables.
min_marginal_r2: Numeric scalar specifying the minimal value required for median marginal R-squared for a variable to be plotted. Only variables with a median marginal R-squared strictly larger than this value will be plotted.
theme_size: Numeric scalar specifying the font size to use for the plotting theme.
...: Parameters to be passed to getVarianceExplained.

Details

A density plot is created for each variable, showing the distribution of R-squared across all genes. Only the nvars_to_plot variables with the largest median R-squared across genes are shown. Variables are also only shown if they have median R-squared values above min_marginal_r2.

If object is a SingleCellExperiment object, getVarianceExplained will be called to compute the variance in expression explained by each variable in each gene. Users may prefer to run getVarianceExplained manually and pass the resulting matrix as object, in which case the R-squared values are used directly.
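
The ranking by median R-squared can be illustrated with base R alone. The sketch below fits each variable on its own to every gene of a simulated matrix and orders the variables by their median R-squared. It assumes a plain lm fit per gene, which may not match getVarianceExplained exactly; treatment, depth, logexprs and gene_r2 are hypothetical names.

```r
set.seed(42)
treatment <- factor(rep(c("ctrl", "treated"), each = 10))
depth <- rnorm(20)                        # hypothetical continuous covariate
logexprs <- matrix(rnorm(30 * 20), nrow = 30)
# Let the covariate drive the first 20 of the 30 genes.
logexprs[1:20, ] <- sweep(logexprs[1:20, ], 2, 2 * depth, "+")

# Marginal R-squared per gene for a single variable.
gene_r2 <- function(v) {
    apply(logexprs, 1, function(y) summary(lm(y ~ v))$r.squared)
}
r2 <- cbind(treatment = gene_r2(treatment), depth = gene_r2(depth))

# Rank variables by median R-squared across genes, expressed as percentages.
sort(apply(r2, 2, median) * 100, decreasing = TRUE)
```

Each column of r2 corresponds to the values summarized by one density curve, and min_marginal_r2 would act as a cut-off on the medians computed in the last line.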

Value

A ggplot object.

Examples

library(scater)

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info)
example_sce <- normalize(example_sce)

plotExplanatoryVariables(example_sce)

plotExpression()

Plot expression values for all cells

Description

Plot expression values for a set of features (e.g. genes or transcripts) in a SingleCellExperiment object, against a continuous or categorical covariate for all cells.

Usage

plotExpression(object, features, x = NULL, exprs_values = "logcounts",
  log2_values = FALSE, colour_by = NULL, shape_by = NULL,
  size_by = NULL, by_exprs_values = exprs_values,
  by_show_single = FALSE, xlab = NULL, feature_colours = TRUE,
  one_facet = TRUE, ncol = 2, scales = "fixed", ...)

Arguments

Argument: Description
object: A SingleCellExperiment object containing expression values and other metadata.
features: A character vector (of feature names), a logical vector or numeric vector (of indices) specifying the features to plot.
x: Specification of a column metadata field or a feature to show on the x-axis, see ?"scater-vis-var" for possible values.
exprs_values: A string or integer scalar specifying which assay in assays(object) to obtain expression values from.
log2_values: Logical scalar, specifying whether the expression values should be transformed to the log2-scale for plotting (with an offset of 1 to avoid logging zeroes).
colour_by: Specification of a column metadata field or a feature to colour by, see ?"scater-vis-var" for possible values.
shape_by: Specification of a column metadata field or a feature to shape by, see ?"scater-vis-var" for possible values.
size_by: Specification of a column metadata field or a feature to size by, see ?"scater-vis-var" for possible values.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for use in point aesthetics; see ?"scater-vis-var" for details.
by_show_single: Logical scalar specifying whether single-level factors should be used for point aesthetics, see ?"scater-vis-var" for details.
xlab: String specifying the label for x-axis. If NULL (default), x will be used as the x-axis label.
feature_colours: Logical scalar indicating whether violins should be coloured by feature when x and colour_by are not specified and one_facet=TRUE.
one_facet: Logical scalar indicating whether grouped violin plots for multiple features should be put onto one facet. Only relevant when x=NULL.
ncol: Integer scalar, specifying the number of columns to be used for the panels of a multi-facet plot.
scales: String indicating whether multi-facet scales should be fixed ("fixed"), free ("free"), or free in one dimension ("free_x", "free_y"). Passed to the scales argument in facet_wrap when multiple facets are generated.
...: Additional arguments for visualization, see ?"scater-plot-args" for details.

Details

This function plots expression values for one or more features. If x is not specified, a violin plot will be generated of expression values. If x is categorical, a grouped violin plot will be generated, with one violin for each level of x. If x is continuous, a scatter plot will be generated.

If multiple features are requested and x is not specified and one_facet=TRUE, a grouped violin plot will be generated with one violin per feature. This will be coloured by feature if colour_by=NULL and feature_colours=TRUE, to yield a more aesthetically pleasing plot. Otherwise, if x is specified or one_facet=FALSE, a multi-panel plot will be generated where each panel corresponds to a feature. Each panel will be a scatter plot or (grouped) violin plot, depending on the nature of x.

Note that this assumes that the expression values are numeric. If not, and x is continuous, horizontal violin plots will be generated. If x is missing or categorical, rectangle plots will be generated where the area of a rectangle is proportional to the number of points for a combination of factors.

Value

A ggplot object.

Author

Davis McCarthy, with modifications by Aaron Lun

Examples

library(scater)

## prepare data
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- calculateQCMetrics(example_sce)
sizeFactors(example_sce) <- colSums(counts(example_sce))
example_sce <- normalize(example_sce)

## default plot
plotExpression(example_sce, 1:15)

## plot expression against an x-axis value
plotExpression(example_sce, c("Gene_0001", "Gene_0004"), x = "Mutation_Status")
plotExpression(example_sce, c("Gene_0001", "Gene_0004"), x = "Gene_0002")

## add visual options
plotExpression(example_sce, 1:6, colour_by = "Mutation_Status")
plotExpression(example_sce, 1:6, colour_by = "Mutation_Status",
    shape_by = "Treatment", size_by = "Gene_0010")

## plot expression against expression values for Gene_0004
plotExpression(example_sce, 1:4, "Gene_0004", show_smooth = TRUE)

plotExprsFreqVsMean()

Plot frequency against mean for each feature

Description

Plot the frequency of expression (i.e., percentage of expressing cells) against the mean expression level for each feature in a SingleCellExperiment object.

Usage

plotExprsFreqVsMean(object, freq_exprs, mean_exprs, controls,
  exprs_values = "counts", by_show_single = FALSE,
  show_smooth = TRUE, show_se = TRUE, ...)

Arguments

Argument: Description
object: A SingleCellExperiment object.
freq_exprs: Specification of the row-level metadata field containing the number of expressing cells per feature, see ?"scater-vis-var" for possible values. Note that only metadata fields will be searched, assays will not be used. If not supplied or NULL, this defaults to "n_cells_by_counts" or equivalent for compacted data.
mean_exprs: Specification of the row-level metadata field containing the mean expression of each feature, see ?"scater-vis-var" for possible values. Again, only metadata fields will be searched, assays will not be used. If not supplied or NULL, this defaults to "mean_counts" or equivalent for compacted data.
controls: Specification of the row-level metadata column indicating whether a feature is a control, see ?"scater-vis-var" for possible values. Only metadata fields will be searched, assays will not be used. If not supplied, this defaults to "is_feature_control" or equivalent for compacted data.
exprs_values: String specifying the assay used for the default freq_exprs and mean_exprs. This can be set to, e.g., "logcounts" so that freq_exprs defaults to "n_cells_by_logcounts".
by_show_single: Logical scalar specifying whether a single-level factor for controls should be used for colouring, see ?"scater-vis-var" for details.
show_smooth: Logical scalar, should a smoothed fit (through feature controls if available; all features otherwise) be shown on the plot? See geom_smooth for details.
show_se: Logical scalar, should the standard error be shown for a smoothed fit?
...: Further arguments passed to plotRowData.

Details

This function plots gene expression frequency versus mean expression level, which can be useful to assess the effects of technical dropout in the dataset. We fit a non-linear least squares curve for the relationship between expression frequency and mean expression. We use this curve to define the number of genes above high technical dropout and the numbers of genes that are expressed in at least 50% and at least 25% of cells.

The plot will attempt to colour the points based on whether the corresponding features are labelled as feature controls in object. This can be turned off by setting controls=NULL.
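
The two quantities on the axes can be computed by hand for a small matrix. The base R sketch below assumes that a feature counts as expressed in a cell when its count is above zero; that threshold and the toy counts matrix are assumptions for illustration, and the function itself reads precomputed row metadata rather than recomputing these values.

```r
# Toy count matrix: 3 features x 4 cells.
counts <- matrix(c(0, 0, 3, 1,
                   5, 2, 0, 4,
                   0, 0, 0, 9), nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))

mean_exprs <- rowMeans(counts)             # mean count per feature
freq_exprs <- rowMeans(counts > 0) * 100   # percentage of cells with a non-zero count

data.frame(mean_exprs, freq_exprs)

# Features detected in at least 50% of cells.
names(freq_exprs)[freq_exprs >= 50]
```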

Value

A ggplot object.

See also

plotRowData

Examples

library(scater)

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)

example_sce <- calculateQCMetrics(example_sce,
    feature_controls = list(set1 = 1:500))
plotExprsFreqVsMean(example_sce)

plotExprsFreqVsMean(example_sce, size_by = "is_feature_control")

plotExprsVsTxLength()

Plot expression against transcript length

Description

Plot mean expression values for all features in a SingleCellExperiment object against transcript length values.

Usage

plotExprsVsTxLength(object, tx_length = "median_feat_eff_len",
  length_is_assay = FALSE, exprs_values = "logcounts",
  log2_values = FALSE, colour_by = NULL, shape_by = NULL,
  size_by = NULL, by_exprs_values = exprs_values,
  by_show_single = FALSE, xlab = "Median transcript length",
  show_exprs_sd = FALSE, ...)

Arguments

Argument: Description
object: A SingleCellExperiment object.
tx_length: Transcript lengths for all features, to plot on the x-axis. If length_is_assay=FALSE, this can take any of the values described in ?"scater-vis-var" for feature-level metadata; data in assays(object) will not be searched. Otherwise, if length_is_assay=TRUE, tx_length should be the name or index of an assay in object.
length_is_assay: Logical scalar indicating whether tx_length refers to an assay of object containing transcript lengths for all features in all cells.
exprs_values: A string or integer scalar specifying which assay in assays(object) to obtain expression values from.
log2_values: Logical scalar, specifying whether the expression values should be transformed to the log2-scale for plotting (with an offset of 1 to avoid logging zeroes).
colour_by: Specification of a column metadata field or a feature to colour by, see ?"scater-vis-var" for possible values.
shape_by: Specification of a column metadata field or a feature to shape by, see ?"scater-vis-var" for possible values.
size_by: Specification of a column metadata field or a feature to size by, see ?"scater-vis-var" for possible values.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for use in point aesthetics; see ?"scater-vis-var" for details.
by_show_single: Logical scalar specifying whether single-level factors should be used for point aesthetics, see ?"scater-vis-var" for details.
xlab: String specifying the label for x-axis.
show_exprs_sd: Logical scalar indicating whether the standard deviation of expression values for each feature should be plotted.
...: Additional arguments for visualization, see ?"scater-plot-args" for details.

Details

If length_is_assay=TRUE, the median transcript length of each feature across all cells is used. This may be necessary if the effective transcript length differs across cells, e.g., as observed in the results from pseudo-aligners.
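
As a small illustration of that summary, the base R lines below reduce a features-by-cells matrix of lengths to one median per feature. The matrix tx_len is simulated and the variable names are hypothetical.

```r
set.seed(1)
# Hypothetical per-cell effective lengths: 4 features x 5 cells.
tx_len <- matrix(rnorm(20, mean = 5000, sd = 500), nrow = 4,
                 dimnames = list(paste0("tx", 1:4), paste0("cell", 1:5)))

# One value per feature: the median length across all cells.
median_len <- apply(tx_len, 1, median)
median_len
```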

Value

A ggplot object.

Author

Davis McCarthy, with modifications by Aaron Lun

Examples

library(scater)

data("sc_example_counts")
data("sc_example_cell_info")
rd <- DataFrame(gene_id = rownames(sc_example_counts),
    feature_id = paste("feature", rep(1:500, each = 4), sep = "_"),
    median_tx_length = rnorm(2000, mean = 5000, sd = 500),
    other = sample(LETTERS, 2000, replace = TRUE)
)
rownames(rd) <- rownames(sc_example_counts)
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info, rowData = rd
)
example_sce <- normalize(example_sce)

plotExprsVsTxLength(example_sce, "median_tx_length")
plotExprsVsTxLength(example_sce, "median_tx_length", show_smooth = TRUE)
plotExprsVsTxLength(example_sce, "median_tx_length", show_smooth = TRUE,
    colour_by = "other", show_exprs_sd = TRUE)

## using matrix of tx length values in assays(object)
mat <- matrix(rnorm(ncol(example_sce) * nrow(example_sce), mean = 5000,
    sd = 500), nrow = nrow(example_sce))
dimnames(mat) <- dimnames(example_sce)
assay(example_sce, "tx_len") <- mat

plotExprsVsTxLength(example_sce, "tx_len", show_smooth = TRUE,
    length_is_assay = TRUE, show_exprs_sd = TRUE)

## using a data frame of tx length values
plotExprsVsTxLength(example_sce,
    data.frame(rnorm(2000, mean = 5000, sd = 500)))

plotHeatmap()

Plot heatmap of gene expression values

Description

Create a heatmap of expression values for each cell and specified features in a SingleCellExperiment object.

Usage

plotHeatmap(object, features, columns = NULL,
  exprs_values = "logcounts", center = FALSE, zlim = NULL,
  symmetric = FALSE, color = NULL, colour_columns_by = NULL,
  by_exprs_values = exprs_values, by_show_single = FALSE,
  show_colnames = TRUE, ...)

Arguments

Argument: Description
object: A SingleCellExperiment object.
features: A character vector of row names, a logical vector or integer vector of indices specifying rows of object to show in the heatmap.
columns: A vector specifying the subset of columns in object to show as columns in the heatmap. By default, all columns are used in their original order.
exprs_values: A string or integer scalar indicating which assay of object should be used as expression values for colouring in the heatmap.
center: A logical scalar indicating whether each row should have its mean expression centered at zero prior to plotting.
zlim: A numeric vector of length 2, specifying the lower and upper bounds for the expression values. This winsorizes the expression matrix prior to plotting (but after centering, if center=TRUE). If NULL, it defaults to the range of the expression matrix.
symmetric: A logical scalar specifying whether the default zlim should be symmetric around zero. If TRUE, the maximum absolute value of zlim will be computed and multiplied by c(-1, 1) to redefine zlim.
color: A vector of colours specifying the palette to use for mapping expression values to colours. This defaults to the default setting in pheatmap.
colour_columns_by: A list of values specifying how the columns should be annotated with colours. Each entry of the list can be of the form described by ?"scater-vis-var". A character vector can also be supplied and will be treated as a list of strings.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for colouring of column-level data; see ?"scater-vis-var" for details.
by_show_single: Logical scalar specifying whether single-level factors should be used for column-level colouring, see ?"scater-vis-var" for details.
show_colnames: Logical scalar specifying whether column names should be shown, if available in object.
...: Additional arguments to pass to pheatmap.

Details

Setting center=TRUE is useful for examining log-fold changes of each cell's expression profile from the average across all cells. This avoids issues with the entire row appearing a certain colour because the gene is highly/lowly expressed across all cells.

Setting zlim preserves the dynamic range of colours in the presence of outliers. Otherwise, the plot may be dominated by a few genes, which will flatten the observed colours for the rest of the heatmap.
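
The effect of center, symmetric and zlim on the values passed to the heatmap can be sketched in base R. This follows the descriptions of those arguments above and is not taken from the function's source; vals, user_zlim and the other names are made up for the example.

```r
# Hypothetical log-expression values: 3 genes x 4 cells.
vals <- matrix(c(1, 2, 3, 10,
                 5, 5, 6, 4,
                 0, 1, 0, 3), nrow = 3, byrow = TRUE)

# center=TRUE: subtract each row's mean.
centred <- vals - rowMeans(vals)

# symmetric=TRUE: the default zlim spans +/- the largest absolute value.
zlim <- c(-1, 1) * max(abs(range(centred)))

# A user-supplied zlim clamps (winsorizes) values outside the bounds.
user_zlim <- c(-2, 2)
clamped <- pmin(pmax(centred, user_zlim[1]), user_zlim[2])
```

Here the single large value in the first row would otherwise stretch the colour scale to 6, while clamping at 2 keeps the remaining differences visible.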

Value

A heatmap is produced on the current graphics device. The output of pheatmap is invisibly returned.

See also

pheatmap

Author

Aaron Lun

Examples

library(scater)

example(normalizeSCE) # borrowing the example objects in here.
plotHeatmap(example_sce, features = rownames(example_sce)[1:10])
plotHeatmap(example_sce, features = rownames(example_sce)[1:10],
    center = TRUE, symmetric = TRUE)

plotHeatmap(example_sce, features = rownames(example_sce)[1:10],
    colour_columns_by = c("Mutation_Status", "Cell_Cycle"))

plotHighestExprs()

Plot the highest expressing features

Description

Plot the features with the highest average expression across all cells, along with their expression in each individual cell.

Usage

plotHighestExprs(object, n = 50, controls, colour_cells_by,
  drop_features = NULL, exprs_values = "counts",
  by_exprs_values = exprs_values, by_show_single = TRUE,
  feature_names_to_plot = NULL, as_percentage = TRUE)

Arguments

Argument: Description
object: A SingleCellExperiment object.
n: A numeric scalar specifying the number of the most expressed features to show.
controls: Specification of the row-level metadata column indicating whether a feature is a control, see ?"scater-vis-var" for possible values. Only metadata fields will be searched, assays will not be used. If not supplied, this defaults to "is_feature_control" or equivalent for compacted data.
colour_cells_by: Specification of a column metadata field or a feature to colour by, see ?"scater-vis-var" for possible values. If not supplied, this defaults to "total_features_by_counts" or equivalent for compacted data.
drop_features: A character, logical or numeric vector indicating which features (e.g. genes, transcripts) to drop when producing the plot. For example, spike-in transcripts might be dropped to examine the contribution from endogenous genes.
exprs_values: An integer scalar or string specifying the assay to obtain expression values from.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for use in colouring; see ?"scater-vis-var" for details.
by_show_single: Logical scalar specifying whether single-level factors should be used for colouring, see ?"scater-vis-var" for details.
feature_names_to_plot: Specification of which row-level metadata column contains the feature names, see ?"scater-vis-var" for possible values. Default is NULL, in which case rownames(object) are used.
as_percentage: Logical scalar indicating whether percentages should be plotted. If FALSE, the raw exprs_values are shown instead.

Details

This function will plot the percentage of counts accounted for by the top n most highly expressed features across the dataset. Each feature corresponds to a row on the plot, sorted by average expression (denoted by the point).

The plot will attempt to colour the points based on whether the corresponding feature is labelled as a control in object. This can be turned off by setting controls=NULL.

The distribution of expression across all cells is shown as tick marks for each feature. These ticks can be coloured according to cell-level metadata, as specified by colour_cells_by. Setting colour_cells_by=NULL will disable all tick colouring.
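
The percentages behind the plot can be reproduced for a toy matrix in base R. The sketch assumes that features are ranked by their mean count and that each tick is a feature's share of one cell's total counts; both are readings of the description above rather than the function's code, and all names are hypothetical.

```r
# Toy count matrix: 3 features x 3 cells (each cell has 100 counts in total).
counts <- matrix(c(90, 10, 0,
                   60, 30, 10,
                   30, 50, 20), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2", "cell3")))

n <- 2
ave <- rowMeans(counts)   # average expression per feature
top <- names(sort(ave, decreasing = TRUE))[seq_len(n)]

# Percentage of each cell's total counts taken up by each top feature.
pct <- sweep(counts[top, , drop = FALSE], 2, colSums(counts), "/") * 100

# Share of all counts in the dataset accounted for by the top n features.
total_pct <- sum(counts[top, ]) / sum(counts) * 100
```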

Value

A ggplot object.

Examples

library(scater)

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- calculateQCMetrics(example_sce,
    feature_controls = list(set1 = 1:500)
)

plotHighestExprs(example_sce, colour_cells_by = "total_features_by_counts")
plotHighestExprs(example_sce, controls = NULL)
plotHighestExprs(example_sce, colour_cells_by = "Mutation_Status")

plotPlatePosition()

Plot cells in plate positions

Description

Plots cells in their position on a plate, coloured by metadata variables or feature expression values from a SingleCellExperiment object.

Usage

plotPlatePosition(object, plate_position = NULL, colour_by = NULL,
  size_by = NULL, shape_by = NULL, by_exprs_values = "logcounts",
  by_show_single = FALSE, add_legend = TRUE, theme_size = 24,
  point_alpha = 0.6, point_size = 24)

Arguments

Argument: Description
object: A SingleCellExperiment object.
plate_position: A character vector specifying the plate position for each cell (e.g., A01, B12, and so on, where letter indicates row and number indicates column). If NULL, the function will attempt to extract this from object$plate_position. Alternatively, a list of two factors ("row" and "column") can be supplied, specifying the row (capital letters) and column (integer) for each cell in object.
colour_by: Specification of a column metadata field or a feature to colour by, see ?"scater-vis-var" for possible values.
size_by: Specification of a column metadata field or a feature to size by, see ?"scater-vis-var" for possible values.
shape_by: Specification of a column metadata field or a feature to shape by, see ?"scater-vis-var" for possible values.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for use in point aesthetics; see ?"scater-vis-var" for details.
by_show_single: Logical scalar specifying whether single-level factors should be used for point aesthetics, see ?"scater-vis-var" for details.
add_legend: Logical scalar specifying whether a legend should be shown.
theme_size: Numeric scalar, see ?"scater-plot-args" for details.
point_alpha: Numeric scalar specifying the transparency of the points, see ?"scater-plot-args" for details.
point_size: Numeric scalar specifying the size of the points, see ?"scater-plot-args" for details.

Details

This function expects plate positions to be given in a character format where a letter indicates the row on the plate and a numeric value indicates the column. Each cell has a plate position such as "A01", "B12", "K24" and so on. From these plate positions, the row is extracted as the letter, and the column as the numeric part. Alternatively, the row and column identities can be directly supplied by setting plate_position as a list of two factors.
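
Splitting a position string into its two parts can be done with a regular expression, as in the base R sketch below. The patterns are illustrative and may not match how the function parses positions internally; plate_row, plate_col and pos are hypothetical names.

```r
plate_position <- c("A01", "B12", "K24", "C07")

# Row is the leading letter; column is the trailing number.
plate_row <- sub("^([A-Za-z]+)[0-9]+$", "\\1", plate_position)
plate_col <- as.integer(sub("^[A-Za-z]+([0-9]+)$", "\\1", plate_position))

# The explicit alternative: a list of two factors named "row" and "column".
pos <- list(row = factor(plate_row), column = factor(plate_col))
```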

Value

A ggplot object.

Author

Davis McCarthy, with modifications by Aaron Lun

Examples

library(scater)

## prepare data
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)
example_sce <- calculateQCMetrics(example_sce)

## define plate positions
example_sce$plate_position <- paste0(
    rep(LETTERS[1:5], each = 8),
    rep(formatC(1:8, width = 2, flag = "0"), 5)
)

## plot plate positions
plotPlatePosition(example_sce, colour_by = "Mutation_Status")

plotPlatePosition(example_sce, shape_by = "Treatment", colour_by = "Gene_0004")

plotPlatePosition(example_sce, shape_by = "Treatment", size_by = "Gene_0001",
    colour_by = "Cell_Cycle")

plotRLE()

Plot a relative log expression (RLE) plot

Description

Produce a relative log expression (RLE) plot of one or more transformations of cell expression values.

Usage

plotRLE(object, exprs_values = "logcounts", exprs_logged = TRUE,
  style = "minimal", legend = TRUE, ordering = NULL,
  colour_by = NULL, by_exprs_values = exprs_values, ...)

Arguments

Argument: Description
object: A SingleCellExperiment object.
exprs_values: A string or integer scalar specifying the expression matrix in object to use.
exprs_logged: A logical scalar indicating whether the expression matrix is already log-transformed. If not, a log2-transformation (+1) will be performed prior to plotting.
style: String defining the boxplot style to use, either "minimal" (default) or "full"; see Details.
legend: Logical scalar specifying whether a legend should be shown.
ordering: A vector specifying the ordering of cells in the RLE plot. This can be useful for arranging cells by experimental conditions or batches.
colour_by: Specification of a column metadata field or a feature to colour by, see ?"scater-vis-var" for possible values.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for use in point aesthetics; see ?"scater-vis-var" for details.
...: Further arguments passed to geom_boxplot when style="full".

Details

Relative log expression (RLE) plots are a powerful tool for visualising unwanted variation in high dimensional data. These plots were originally devised for gene expression data from microarrays but can also be used on single-cell expression data. RLE plots are particularly useful for assessing whether a procedure aimed at removing unwanted variation (e.g., scaling normalisation) has been successful.

If style is full , the usual ggplot2 boxplot is created for each cell. Here, the box shows the inter-quartile range and whiskers extend no more than 1.5 times the IQR from the hinge (the 25th or 75th percentile). Data beyond the whiskers are called outliers and are plotted individually. The median (50th percentile) is shown with a white bar. This approach is detailed and flexible, but can take a long time to plot for large datasets.

If style is minimal , a Tufte-style boxplot is created for each cell. Here, the median is shown with a circle, the IQR in a grey line, and whiskers (as defined above) for the plots are shown with coloured lines. No outliers are shown for this plot style. This approach is more succinct and faster for large numbers of cells.
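As a rough illustration of what an RLE plot summarises, the base R sketch below subtracts each feature's median log-expression (taken across cells) from that feature's values, then summarises each cell with boxplot statistics. The toy matrix and the names logexprs, rle_values and cell_stats are assumptions made for illustration; this is not the internal implementation of plotRLE.

```r
set.seed(100)
# Toy log-expression matrix: 50 features (rows) by 6 cells (columns).
logexprs <- matrix(rnorm(300, mean = 5), nrow = 50,
    dimnames = list(paste0("Gene_", 1:50), paste0("Cell_", 1:6)))

# Median log-expression of each feature across all cells.
feature_medians <- apply(logexprs, 1, median)

# RLE values: deviation of each value from its feature's median.
# The vector is recycled down the columns, so row i uses feature_medians[i].
rle_values <- logexprs - feature_medians

# Per-cell summary: whisker ends, hinges (roughly the quartiles) and median.
cell_stats <- apply(rle_values, 2, function(x) boxplot.stats(x)$stats)
rownames(cell_stats) <- c("lower_whisker", "lower_hinge", "median",
    "upper_hinge", "upper_whisker")
cell_stats
```

Cells whose median sits far from zero, or whose box is much wider than the others, are the ones an RLE plot is meant to flag.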

Value

A ggplot object

Author

Davis McCarthy, with modifications by Aaron Lun

References

Gandolfo LC, Speed TP. RLE Plots: Visualising Unwanted Variation in High Dimensional Data. arXiv [stat.ME]. 2017. Available: http://arxiv.org/abs/1704.03590

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)

plotRLE(example_sce, colour_by = "Mutation_Status", style = "minimal")

plotRLE(example_sce, colour_by = "Mutation_Status", style = "full",
    outlier.alpha = 0.1, outlier.shape = 3, outlier.size = 0)

plotReducedDim()

Plot reduced dimensions

Description

Plot cell-level reduced dimension results stored in a SingleCellExperiment object.

Usage

plotReducedDim(object, use_dimred, ncomponents = 2, percentVar = NULL,
  colour_by = NULL, shape_by = NULL, size_by = NULL,
  by_exprs_values = "logcounts", by_show_single = FALSE,
  text_by = NULL, text_size = 5, text_colour = "black", ...)

Arguments

Argument: Description
object: A SingleCellExperiment object.
use_dimred: A string or integer scalar indicating the reduced dimension result in reducedDims(object) to plot.
ncomponents: A numeric scalar indicating the number of dimensions to plot, starting from the first dimension. Alternatively, a numeric vector specifying the dimensions to be plotted.
percentVar: A numeric vector giving the proportion of variance in expression explained by each reduced dimension. Only expected to be used in PCA settings, e.g., in the plotPCA function.
colour_by: Specification of a column metadata field or a feature to colour by, see ?" for possible values.
shape_by: Specification of a column metadata field or a feature to shape by, see ?" for possible values.
size_by: Specification of a column metadata field or a feature to size by, see ?" for possible values.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for use in point aesthetics - see ?" for details.
by_show_single: Logical scalar specifying whether single-level factors should be used for point aesthetics, see ?" for details.
text_by: Specification of a column metadata field for which to add text - see ?" for possible values. This must refer to a categorical field, i.e., coercible into a factor.
text_size: Numeric scalar specifying the size of added text.
text_colour: String specifying the colour of the added text.
...: Additional arguments for visualization, see ?" for details.

Details

If ncomponents is a scalar equal to 2, a scatterplot of the first two dimensions is produced. If ncomponents is greater than 2, a pairs plot for the top dimensions is produced.

Alternatively, if ncomponents is a vector of length 2, a scatterplot of the two specified dimensions is produced. If it is of length greater than 2, a pairs plot is produced containing all pairwise plots between the specified dimensions.

The text_by option will add factor levels as labels onto the plot, placed at the median coordinate across all points in that level. This is useful for annotating position-related metadata (e.g., clusters) when there are too many levels to distinguish by colour. It is only available for scatterplots.
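The label placement described for text_by can be sketched in base R as follows. The coordinates, the cluster column and the name label_pos are hypothetical; the sketch only shows how a per-level median coordinate might be obtained, not how plotReducedDim draws the labels.

```r
# Toy 2-D coordinates for 9 cells with a hypothetical cluster assignment.
coords <- data.frame(
    X = c(1, 2, 3, 10, 11, 12, 20, 21, 25),
    Y = c(5, 6, 7, 1, 2, 3, 9, 8, 7),
    cluster = factor(rep(c("A", "B", "C"), each = 3))
)

# Median coordinate across all points in each level of the factor.
# Each row of label_pos is the position at which that level's label would sit.
label_pos <- aggregate(cbind(X, Y) ~ cluster, data = coords, FUN = median)
label_pos
```

Using the median rather than the mean keeps a label near the bulk of its cluster even when a few points lie far away (compare cluster "C", where X = 25 does not drag the label).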

Value

A ggplot object

Author

Davis McCarthy, with modifications by Aaron Lun

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)

example_sce <- runPCA(example_sce, ncomponents=5)
plotReducedDim(example_sce, "PCA")
plotReducedDim(example_sce, "PCA", colour_by="Cell_Cycle")
plotReducedDim(example_sce, "PCA", colour_by="Gene_0001")

plotReducedDim(example_sce, "PCA", ncomponents=5)
plotReducedDim(example_sce, "PCA", ncomponents=5, colour_by="Cell_Cycle",
    shape_by="Treatment")

Plot row metadata

Description

Plot row-level (i.e., gene) metadata from a SingleCellExperiment object.

Usage

plotRowData(object, y, x = NULL, colour_by = NULL, shape_by = NULL,
  size_by = NULL, by_exprs_values = "logcounts",
  by_show_single = FALSE, ...)

Arguments

Argument: Description
object: A SingleCellExperiment object containing expression values and experimental information.
y: Specification of the row-level metadata to show on the y-axis, see ?" for possible values. Note that only metadata fields will be searched, assays will not be used.
x: Specification of the row-level metadata to show on the x-axis, see ?" for possible values. Again, only metadata fields will be searched, assays will not be used.
colour_by: Specification of a row metadata field or a cell to colour by, see ?" for possible values.
shape_by: Specification of a row metadata field or a cell to shape by, see ?" for possible values.
size_by: Specification of a row metadata field or a cell to size by, see ?" for possible values.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for use in point aesthetics - see ?" for details.
by_show_single: Logical scalar specifying whether single-level factors should be used for point aesthetics, see ?" for details.
...: Additional arguments for visualization, see ?" for details.

Details

If y is continuous and x=NULL , a violin plot is generated. If x is categorical, a grouped violin plot will be generated, with one violin for each level of x . If x is continuous, a scatter plot will be generated.

If y is categorical and x is continuous, horizontal violin plots will be generated. If x is missing or categorical, rectangle plots will be generated where the area of a rectangle is proportional to the number of points for a combination of factors.
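The rules above can be summarised as a small decision function. choose_plot is a hypothetical helper written for this illustration and is not part of scater; it treats numeric vectors as continuous and everything else as categorical, which is a simplifying assumption.

```r
# Hypothetical helper (not part of scater): name the plot type implied by
# the types of y and x, following the rules described above.
choose_plot <- function(y, x = NULL) {
    y_continuous <- is.numeric(y)
    if (is.null(x)) {
        return(if (y_continuous) "violin" else "rectangle")
    }
    x_continuous <- is.numeric(x)
    if (y_continuous && x_continuous) {
        "scatter"
    } else if (y_continuous) {
        "grouped violin"
    } else if (x_continuous) {
        "horizontal violin"
    } else {
        "rectangle"
    }
}

choose_plot(y = c(10, 25, 3))                                 # "violin"
choose_plot(y = c(10, 25, 3), x = c(0.1, 0.5, 0.2))           # "scatter"
choose_plot(y = c(10, 25, 3), x = factor(c("a", "b", "a")))   # "grouped violin"
choose_plot(y = factor(c("a", "b", "a")), x = c(1, 2, 3))     # "horizontal violin"
choose_plot(y = factor(c("a", "b", "a")))                     # "rectangle"
```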

Note that plotFeatureData is a synonym for plotRowData . This is an artifact of the transition from the old SCESet class, and will be deprecated in future releases.

Value

A ggplot object.

Examples

data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- calculateQCMetrics(example_sce,
    feature_controls = list(ERCC=1:40))
example_sce <- normalize(example_sce)

plotRowData(example_sce, y="n_cells_by_counts", x="log10_total_counts")
plotRowData(example_sce, y="n_cells_by_counts",
    size_by = "log10_total_counts",
    colour_by = "is_feature_control")

Plot an overview of expression for each cell

Description

Plot the relative proportion of the library size that is accounted for by the most highly expressed features for each cell in a SingleCellExperiment object.

Usage

plotScater(x, nfeatures = 500, exprs_values = "counts",
  colour_by = NULL, by_exprs_values = exprs_values,
  by_show_single = FALSE, block1 = NULL, block2 = NULL, ncol = 3,
  line_width = 1.5, theme_size = 10)

Arguments

Argument: Description
x: A SingleCellExperiment object.
nfeatures: Numeric scalar indicating the number of top-expressed features to show in the plot.
exprs_values: String or integer scalar indicating which assay of x should be used to obtain the expression values for this plot.
colour_by: Specification of a column metadata field or a feature to colour by, see ?" for possible values. The curve for each cell will be coloured according to this specification.
by_exprs_values: A string or integer scalar specifying which assay to obtain expression values from, for use in line colouring - see ?" for details.
by_show_single: Logical scalar specifying whether single-level factors should be used for line colouring, see ?" for details.
block1: Specification of a factor by which to separate the cells into blocks (separate panels) in the plot. This can be any type of value described in ?" for column-level metadata. Default is NULL, in which case there is no blocking.
block2: Same as block1, providing another level of blocking.
ncol: Number of columns to use for facet_wrap if only one block is defined.
line_width: Numeric scalar specifying the line width.
theme_size: Numeric scalar specifying the font size to use for the plotting theme.

Details

For each cell, the features are ordered from most-expressed to least-expressed. The cumulative proportion of the total expression for the cell is computed across the top nfeatures features. These plots can flag cells with a very high proportion of the library coming from a small number of features; such cells are likely to be problematic for downstream analyses.

Using the colour and blocking arguments can flag overall differences between cells under different experimental conditions, or cells affected by different batches and other variables. If only one of block1 and block2 is specified, each panel corresponds to a separate level of the specified blocking factor. If both are specified, each panel corresponds to a combination of levels.
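The per-cell curve described above (cumulative proportion of the library across the top nfeatures features) can be sketched in base R. The toy counts and the name cum_prop are assumptions for illustration only, not the code used by plotScater.

```r
# Toy count matrix: 5 features (rows) by 2 cells (columns).
counts <- matrix(c(50, 30, 10, 5, 5,
                    0, 90,  4, 4, 2),
    nrow = 5, dimnames = list(paste0("Gene_", 1:5), c("Cell_A", "Cell_B")))
nfeatures <- 3

# For each cell: order features from most to least expressed, accumulate
# the counts over the top 'nfeatures', and divide by the library size.
cum_prop <- apply(counts, 2, function(x) {
    cumsum(sort(unname(x), decreasing = TRUE))[seq_len(nfeatures)] / sum(x)
})
cum_prop
#      Cell_A Cell_B
# [1,]    0.5   0.90
# [2,]    0.8   0.94
# [3,]    0.9   0.98
```

Here Cell_B has 90% of its library in a single feature, which is the kind of cell this plot is intended to flag.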

Value

A ggplot object.

Author

Davis McCarthy, with modifications by Aaron Lun

Examples

## Set up an example SingleCellExperiment
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)

plotScater(example_sce)
plotScater(example_sce, exprs_values = "counts", colour_by = "Cell_Cycle")
plotScater(example_sce, block1 = "Treatment", colour_by = "Cell_Cycle")

cpm(example_sce) <- calculateCPM(example_sce, use_size_factors = FALSE)
plotScater(example_sce, exprs_values = "cpm", block1 = "Treatment",
    block2 = "Mutation_Status", colour_by = "Cell_Cycle")

Plot specific reduced dimensions

Description

Wrapper functions to create plots for specific types of reduced dimension results in a SingleCellExperiment object, or, if they are not already present, to calculate those results and then plot them.

Usage

plotPCASCE(object, ..., rerun = FALSE, ncomponents = 2,
  run_args = list())
plotTSNE(object, ..., rerun = FALSE, ncomponents = 2,
  run_args = list())
plotUMAP(object, ..., rerun = FALSE, ncomponents = 2,
  run_args = list())
plotDiffusionMap(object, ..., rerun = FALSE, ncomponents = 2,
  run_args = list())
plotMDS(object, ..., rerun = FALSE, ncomponents = 2,
  run_args = list())
## S4 method for signature 'SingleCellExperiment'
plotPCA(object, ..., rerun = FALSE,
  ncomponents = 2, run_args = list())

Arguments

Argument: Description
object: A SingleCellExperiment object.
...: Additional arguments to pass to plotReducedDim.
rerun: Logical, should the reduced dimensions be recomputed even if object contains an appropriately named set of results in the reducedDims slot?
ncomponents: Numeric scalar indicating the number of dimensions to (calculate and) plot. This can also be a numeric vector, see ? for details.
run_args: Arguments to pass to runPCA, runTSNE, etc.

Details

Each function will search the reducedDims slot for an appropriately named set of results and pass those coordinates onto plotReducedDim . If the results are not present or rerun=TRUE , they will be computed using the relevant run* function. The result name and run* function for each plot* function are:

  • "PCA" and runPCA for plotPCA

  • "TSNE" and runTSNE for plotTSNE

  • "DiffusionMap" and runDiffusionMap for plotDiffusionMap

  • "MDS" and runMDS for plotMDS

  • "UMAP" and runUMAP for plotUMAP

Users can specify arguments to the run* functions via run_args .

If ncomponents is a numeric vector, the maximum value will be used to determine the required number of dimensions to compute in the run* functions. However, only the specified dimensions in ncomponents will be plotted.
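The handling of a vector-valued ncomponents can be sketched as follows. The sketch uses prcomp from base R on a toy matrix as a stand-in for a run* function, and the names n_to_compute, pcs and to_plot are hypothetical.

```r
set.seed(1)
# Toy data: 20 cells (rows) by 10 features (columns).
x <- matrix(rnorm(200), nrow = 20)

# The user asks for dimensions 1, 3 and 4 only.
ncomponents <- c(1, 3, 4)

# The largest requested dimension determines how many must be computed.
n_to_compute <- max(ncomponents)
pcs <- prcomp(x)$x[, seq_len(n_to_compute), drop = FALSE]

# Only the requested dimensions are retained for plotting.
to_plot <- pcs[, ncomponents, drop = FALSE]
colnames(to_plot)
# "PC1" "PC3" "PC4"
```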

Value

A ggplot object.

Seealso

runPCA , runDiffusionMap , runTSNE , runMDS , plotReducedDim

Author

Davis McCarthy, with modifications by Aaron Lun

Examples

## Set up an example SingleCellExperiment
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)

## Examples plotting PC1 and PC2
plotPCA(example_sce)
plotPCA(example_sce, colour_by = "Cell_Cycle")
plotPCA(example_sce, colour_by = "Cell_Cycle", shape_by = "Treatment")
plotPCA(example_sce, colour_by = "Cell_Cycle", shape_by = "Treatment",
    size_by = "Mutation_Status")

## Force legend to appear for shape:
example_subset <- example_sce[, example_sce$Treatment == "treat1"]
plotPCA(example_subset, colour_by = "Cell_Cycle", shape_by = "Treatment",
    by_show_single = TRUE)

## Examples plotting more than 2 PCs
plotPCA(example_sce, ncomponents = 4, colour_by = "Treatment",
    shape_by = "Mutation_Status")

## Same for TSNE:
plotTSNE(example_sce, run_args=list(perplexity = 10))

## Same for DiffusionMaps:
plotDiffusionMap(example_sce)

## Same for MDS plots:
plotMDS(example_sce)

readSparseCounts()

Read sparse count matrix from file

Description

Reads a count matrix from a file in dense tabular format and returns it as a sparse matrix.

Usage

readSparseCounts(file, sep = "\t", quote = NULL, comment.char = "",
  row.names = TRUE, col.names = TRUE, ignore.row = 0L,
  skip.row = 0L, ignore.col = 0L, skip.col = 0L, chunk = 1000L)

Arguments

Argument: Description
file: A string containing a file path to a count table, or a connection object opened in read-only text mode.
sep: A string specifying the delimiter between fields in file.
quote: A string specifying the quote character, e.g., in column or row names.
comment.char: A string specifying the comment character after which values are ignored.
row.names: A logical scalar specifying whether row names are present.
col.names: A logical scalar specifying whether column names are present.
ignore.row: An integer scalar specifying the number of rows to ignore at the start of the file, before the column names.
skip.row: An integer scalar specifying the number of rows to ignore at the start of the file, after the column names.
ignore.col: An integer scalar specifying the number of columns to ignore at the start of the file, before the row names.
skip.col: An integer scalar specifying the number of columns to ignore at the start of the file, after the row names.
chunk: An integer scalar indicating the chunk size to use, i.e., number of rows to read at any one time.

Details

This function provides a convenient method for reading dense arrays from flat files into a sparse matrix in memory. Memory usage can be further improved by setting chunk to a smaller positive value.

The ignore.* and skip.* parameters allow irrelevant rows or columns to be skipped. Note that the distinction between the two parameters is only relevant when row.names=FALSE (for skipping/ignoring columns) or col.names=FALSE (for rows).
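The chunked approach can be illustrated with a minimal sketch that reads a few rows at a time and converts each dense chunk to sparse form before reading the next. It assumes the Matrix package is installed and a tab-separated file with row and column names; the variable names are hypothetical and this is not the implementation used by readSparseCounts.

```r
library(Matrix)

# Write a small dense, tab-separated count table to a temporary file.
outfile <- tempfile()
dense <- data.frame(A = 1:5, B = 0, C = 0:4, row.names = letters[1:5])
write.table(dense, file = outfile, col.names = NA, sep = "\t", quote = FALSE)

con <- file(outfile, open = "r")
# First line holds the column names; drop the empty corner field.
header <- strsplit(readLines(con, n = 1), "\t")[[1]][-1]

chunk <- 2L   # number of rows to hold in dense form at any one time
pieces <- list()
repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    fields <- strsplit(lines, "\t")
    # One row per line; the first field of each line is the row name.
    vals <- t(vapply(fields, function(f) as.numeric(f[-1]),
        numeric(length(header))))
    rownames(vals) <- vapply(fields, function(f) f[1], character(1))
    # Convert the dense chunk to a sparse matrix before reading more rows.
    pieces[[length(pieces) + 1L]] <- Matrix(vals, sparse = TRUE)
}
close(con)

sparse_counts <- do.call(rbind, pieces)
colnames(sparse_counts) <- header
sparse_counts
```

Only chunk rows exist in dense form at any moment, which is why a smaller chunk lowers peak memory usage at the cost of more read operations.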

Value

A dgCMatrix containing double-precision values (usually counts) for each row (gene) and column (cell).

Seealso

read.table , readMM

Author

Aaron Lun

Examples

outfile <- tempfile()
write.table(data.frame(A=1:5, B=0, C=0:4, row.names=letters[1:5]),
    file=outfile, col.names=NA, sep="\t", quote=FALSE)

readSparseCounts(outfile)

runDiffusionMap()

Create a diffusion map from cell-level data

Description

Produce a diffusion map for the cells, based on the data in a SingleCellExperiment object.

Usage

runDiffusionMap(object, ncomponents = 2, ntop = 500,
  feature_set = NULL, exprs_values = "logcounts",
  scale_features = TRUE, use_dimred = NULL, n_dimred = NULL, ...)

Arguments

Argument: Description
object: A SingleCellExperiment object.
ncomponents: Numeric scalar indicating the number of diffusion components to obtain.
ntop: Numeric scalar specifying the number of most variable features to use for constructing the diffusion map.
feature_set: Character vector of row names, a logical vector or a numeric vector of indices indicating a set of features to use to construct the diffusion map. This will override any ntop argument if specified.
exprs_values: Integer scalar or string indicating which assay of object should be used to obtain the expression values for the calculations.
scale_features: Logical scalar, should the expression values be standardised so that each feature has unit variance?
use_dimred: String or integer scalar specifying the entry of reducedDims(object) to use as input to DiffusionMap. Default is to not use existing reduced dimension results.
n_dimred: Integer scalar, number of dimensions of the reduced dimension slot to use when use_dimred is supplied. Defaults to all available dimensions.
...: Additional arguments to pass to DiffusionMap.

Details

The function DiffusionMap is used internally to compute the diffusion map.

Setting use_dimred allows users to easily construct a diffusion map from low-rank approximations of the original expression matrix (e.g., after PCA). In such cases, arguments such as ntop , feature_set , exprs_values and scale_features will be ignored.

The behaviour of DiffusionMap seems to be non-deterministic, in a manner that is not responsive to any set.seed call. The reason for this is unknown.

Value

A SingleCellExperiment object containing the coordinates of the first ncomponents diffusion map components for each cell. This is stored in the "DiffusionMap" entry of the reducedDims slot.

Seealso

destiny , plotDiffusionMap

Author

Aaron Lun, based on code by Davis McCarthy

References

Haghverdi L, Buettner F, Theis FJ. Diffusion maps for high-dimensional single-cell analysis of differentiation data. Bioinformatics. 2015; doi:10.1093/bioinformatics/btv325

Examples

## Set up an example SingleCellExperiment
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)

example_sce <- runDiffusionMap(example_sce)
reducedDimNames(example_sce)
head(reducedDim(example_sce))

Perform MDS on cell-level data

Description

Perform multi-dimensional scaling (MDS) on cells, based on the data in a SingleCellExperiment object.

Usage

runMDS(object, ncomponents = 2, ntop = 500, feature_set = NULL,
  exprs_values = "logcounts", scale_features = TRUE,
  use_dimred = NULL, n_dimred = NULL, method = "euclidean")

Arguments

Argument: Description
object: A SingleCellExperiment object.
ncomponents: Numeric scalar indicating the number of MDS dimensions to obtain.
ntop: Numeric scalar specifying the number of most variable features to use for MDS.
feature_set: Character vector of row names, a logical vector or a numeric vector of indices indicating a set of features to use for MDS. This will override any ntop argument if specified.
exprs_values: Integer scalar or string indicating which assay of object should be used to obtain the expression values for the calculations.
scale_features: Logical scalar, should the expression values be standardised so that each feature has unit variance?
use_dimred: String or integer scalar specifying the entry of reducedDims(object) to use as input to cmdscale. Default is to not use existing reduced dimension results.
n_dimred: Integer scalar, number of dimensions of the reduced dimension slot to use when use_dimred is supplied. Defaults to all available dimensions.
method: String specifying the type of distance to be computed between cells.

Details

The function cmdscale is used internally to compute the multidimensional scaling components to plot.

Setting use_dimred allows users to easily perform MDS on low-rank approximations of the original expression matrix (e.g., after PCA). In such cases, arguments such as ntop , feature_set , exprs_values and scale_features will be ignored.
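The overall procedure (pick the most variable features, scale them, compute distances between cells, run cmdscale) can be sketched in base R. The toy matrix and variable names are assumptions for illustration, and the details of feature selection in runMDS may differ.

```r
set.seed(10)
# Toy log-expression matrix: 100 features (rows) by 12 cells (columns).
logexprs <- matrix(rnorm(1200), nrow = 100,
    dimnames = list(paste0("Gene_", 1:100), paste0("Cell_", 1:12)))
ntop <- 20
ncomponents <- 2

# Keep the 'ntop' features with the highest variance across cells.
feature_vars <- apply(logexprs, 1, var)
keep <- order(feature_vars, decreasing = TRUE)[seq_len(ntop)]

# Transpose so that cells are rows, then scale each feature to unit variance.
cell_by_feature <- scale(t(logexprs[keep, ]), center = TRUE, scale = TRUE)

# Euclidean distances between cells, followed by classical MDS.
cell_dist <- dist(cell_by_feature, method = "euclidean")
mds <- cmdscale(cell_dist, k = ncomponents)
head(mds)
```

The result has one row per cell and one column per MDS dimension, which matches the shape of a matrix stored in reducedDims.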

Value

A SingleCellExperiment object containing the coordinates of the first ncomponents MDS dimensions for each cell. This is stored in the "MDS" entry of the reducedDims slot.

Seealso

cmdscale , plotMDS

Author

Aaron Lun, based on code by Davis McCarthy

Examples

## Set up an example SingleCellExperiment
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)

example_sce <- runMDS(example_sce)
reducedDimNames(example_sce)
head(reducedDim(example_sce))

Perform PCA on cell-level data

Description

Perform a principal components analysis (PCA) on cells, based on the data in a SingleCellExperiment object.

Usage

## S4 method for signature 'SingleCellExperiment'
runPCA(x, ncomponents = 2,
  method = NULL, ntop = 500, exprs_values = "logcounts",
  feature_set = NULL, scale_features = TRUE, use_coldata = FALSE,
  selected_variables = NULL, detect_outliers = FALSE,
  BSPARAM = ExactParam(), BPPARAM = SerialParam())

Arguments

Argument: Description
x: A SingleCellExperiment object.
ncomponents: Numeric scalar indicating the number of principal components to obtain.
method: Deprecated, string specifying how the PCA should be performed.
ntop: Numeric scalar specifying the number of most variable features to use for PCA.
exprs_values: Integer scalar or string indicating which assay of x should be used to obtain the expression values for the calculations.
feature_set: Character vector of row names, a logical vector or a numeric vector of indices indicating a set of features to use for PCA. This will override any ntop argument if specified.
scale_features: Logical scalar, should the expression values be standardised so that each feature has unit variance? This will also remove features with standard deviations below 1e-8.
use_coldata: Logical scalar specifying whether the column data should be used instead of expression values to perform PCA.
selected_variables: List of strings or a character vector indicating which variables in colData(x) to use for PCA when use_coldata=TRUE. If a list, each entry can take the form described in ?" .
detect_outliers: Logical scalar, should outliers be detected based on PCA coordinates generated from column-level metadata?
BSPARAM: A BiocSingularParam object specifying which algorithm should be used to perform the PCA.
BPPARAM: A BiocParallelParam object specifying whether the PCA should be parallelized.

Details

The function prcomp is used internally to do the PCA when method="prcomp" . Alternatively, the irlba package can be used, which performs a fast approximation of PCA through the prcomp_irlba function. This is especially useful for large, sparse matrices.

Note that prcomp_irlba involves a random initialization, after which it converges towards the exact PCs. This means that the result will change slightly across different runs. For full reproducibility, users should call set.seed prior to running runPCA with method="irlba" .

If use_coldata=TRUE , PCA will be performed on column-level metadata instead of the gene expression matrix. The selected_variables defaults to a vector containing:

  • "pct_counts_top_100_features"

  • "total_features_by_counts"

  • "pct_counts_feature_control"

  • "total_features_feature_control"

  • "log10_total_counts_endogenous"

  • "log10_total_counts_feature_control"

This can be useful for identifying outlier cells based on QC metrics, especially when combined with detect_outliers=TRUE . If outlier identification is enabled, the outlier field of the output colData will contain the identified outliers.

Value

A SingleCellExperiment object containing the first ncomponents principal coordinates for each cell. If use_coldata=FALSE , this is stored in the "PCA" entry of the reducedDims slot. Otherwise, it is stored in the "PCA_coldata" entry.

The proportion of variance explained by each PC is stored as a numeric vector in the "percentVar" attribute of the reduced dimension matrix. Note that this will only be of length equal to ncomponents when method is not "prcomp" . This is because approximate PCA methods do not compute singular values for all components.
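How the proportion of variance might be derived and attached to a coordinate matrix can be sketched with prcomp from base R. The toy data and the names percent_var and coords are assumptions; the sketch uses an exact PCA, so proportions are available for every component rather than only the first ncomponents.

```r
set.seed(20)
# Toy matrix with 15 cells as rows and 8 features as columns.
x <- matrix(rnorm(120), nrow = 15)
ncomponents <- 2

# Exact PCA with each feature scaled to unit variance.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each PC.
percent_var <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the first 'ncomponents' coordinates and attach the proportions
# as an attribute of the coordinate matrix.
coords <- pca$x[, seq_len(ncomponents), drop = FALSE]
attr(coords, "percentVar") <- percent_var

attr(coords, "percentVar")[seq_len(ncomponents)]
```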

Seealso

prcomp , plotPCA

Author

Aaron Lun, based on code by Davis McCarthy

Examples

## Set up an example SingleCellExperiment
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)

example_sce <- runPCA(example_sce)
reducedDimNames(example_sce)
head(reducedDim(example_sce))

Perform t-SNE on cell-level data

Description

Perform t-distributed stochastic neighbour embedding (t-SNE) for the cells, based on the data in a SingleCellExperiment object.

Usage

runTSNE(object, ncomponents = 2, ntop = 500, feature_set = NULL,
  exprs_values = "logcounts", scale_features = TRUE,
  use_dimred = NULL, n_dimred = NULL, perplexity = min(50,
  floor(ncol(object)/5)), pca = TRUE, initial_dims = 50,
  normalize = TRUE, theta = 0.5, external_neighbors = FALSE,
  BNPARAM = KmknnParam(), BPPARAM = SerialParam(), ...)

Arguments

Argument: Description
object: A SingleCellExperiment object.
ncomponents: Numeric scalar indicating the number of t-SNE dimensions to obtain.
ntop: Numeric scalar specifying the number of most variable features to use for t-SNE.
feature_set: Character vector of row names, a logical vector or a numeric vector of indices indicating a set of features to use for t-SNE. This will override any ntop argument if specified.
exprs_values: Integer scalar or string indicating which assay of object should be used to obtain the expression values for the calculations.
scale_features: Logical scalar, should the expression values be standardised so that each feature has unit variance?
use_dimred: String or integer scalar specifying the entry of reducedDims(object) to use as input to Rtsne. Default is to not use existing reduced dimension results.
n_dimred: Integer scalar, number of dimensions of the reduced dimension slot to use when use_dimred is supplied. Defaults to all available dimensions.
perplexity: Numeric scalar defining the perplexity parameter, see ? for more details.
pca: Logical scalar passed to Rtsne, indicating whether an initial PCA step should be performed. This is ignored if use_dimred is specified.
initial_dims: Integer scalar passed to Rtsne, specifying the number of principal components to be retained if pca=TRUE.
normalize: Logical scalar indicating if input values should be scaled for numerical precision, see normalize_input.
theta: Numeric scalar specifying the approximation accuracy of the Barnes-Hut algorithm, see Rtsne for details.
external_neighbors: Logical scalar indicating whether a nearest neighbors search should be computed externally with findKNN.
BNPARAM: A BiocNeighborParam object specifying the neighbor search algorithm to use when external_neighbors=TRUE.
BPPARAM: A BiocParallelParam object specifying how the neighbor search should be parallelized when external_neighbors=TRUE.
...: Additional arguments to pass to Rtsne.

Details

The function Rtsne is used internally to compute the t-SNE. Note that the algorithm is not deterministic, so different runs of the function will produce differing results. Users are advised to test multiple random seeds, and then use set.seed to set a random seed for replicable results.

The value of the perplexity parameter can have a large effect on the results. By default, the function will try to provide a reasonable setting, by scaling the perplexity with the number of cells until it reaches a maximum of 50. However, it is often worthwhile to manually try multiple values to ensure that the conclusions are robust.
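The default shown in the Usage section, min(50, floor(ncol(object)/5)), can be evaluated for a few dataset sizes to see how it scales. default_perplexity is a hypothetical wrapper written for this illustration.

```r
# Hypothetical wrapper around the default expression from the Usage section.
default_perplexity <- function(ncells) min(50, floor(ncells / 5))

ncells <- c(40, 100, 250, 1000)
vapply(ncells, default_perplexity, numeric(1))
# 8 20 50 50
```

For the 40-cell example dataset the default is therefore 8, and the cap of 50 is reached once there are 250 or more cells.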

Setting use_dimred allows users to easily perform t-SNE on low-rank approximations of the original expression matrix (e.g., after PCA). In such cases, arguments such as ntop , feature_set , exprs_values and scale_features will be ignored.

If external_neighbors=TRUE , the nearest neighbor search step is conducted using a different algorithm to that in the Rtsne function. This can be parallelized or approximate to achieve greater speed for large data sets. The neighbor search results are then used for t-SNE via the Rtsne_neighbors function.

Value

A SingleCellExperiment object containing the coordinates of the first ncomponents t-SNE dimensions for each cell. This is stored in the "TSNE" entry of the reducedDims slot.

Seealso

Rtsne , plotTSNE

Author

Aaron Lun, based on code by Davis McCarthy

References

L.J.P. van der Maaten. Barnes-Hut-SNE. In Proceedings of the International Conference on Learning Representations, 2013.

Examples

## Set up an example SingleCellExperiment
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)

example_sce <- runTSNE(example_sce)
reducedDimNames(example_sce)
head(reducedDim(example_sce))

Perform UMAP on cell-level data

Description

Perform uniform manifold approximation and projection (UMAP) for the cells, based on the data in a SingleCellExperiment object.

Usage

runUMAP(object, ncomponents = 2, ntop = 500, feature_set = NULL,
  exprs_values = "logcounts", scale_features = TRUE,
  use_dimred = NULL, n_dimred = NULL, pca = 50, n_neighbors = 15,
  external_neighbors = FALSE, BNPARAM = KmknnParam(),
  BPPARAM = SerialParam(), ...)

Arguments

Argument: Description
object: A SingleCellExperiment object.
ncomponents: Numeric scalar indicating the number of UMAP dimensions to obtain.
ntop: Numeric scalar specifying the number of most variable features to use for UMAP.
feature_set: Character vector of row names, a logical vector or a numeric vector of indices indicating a set of features to use for UMAP. This will override any ntop argument if specified.
exprs_values: Integer scalar or string indicating which assay of object should be used to obtain the expression values for the calculations.
scale_features: Logical scalar, should the expression values be standardised so that each feature has unit variance?
use_dimred: String or integer scalar specifying the entry of reducedDims(object) to use as input to umap. Default is to not use existing reduced dimension results.
n_dimred: Integer scalar, number of dimensions of the reduced dimension slot to use when use_dimred is supplied. Defaults to all available dimensions.
pca: Integer scalar specifying how many PCs should be used as input into UMAP, if the PCA is to be recomputed on the subsetted expression matrix. Only used when use_dimred=NULL, and if pca=NULL, no PCA is performed at all.
n_neighbors: Integer scalar, number of nearest neighbors to identify when constructing the initial graph.
external_neighbors: Logical scalar indicating whether a nearest neighbors search should be computed externally with findKNN.
BNPARAM: A BiocNeighborParam object specifying the neighbor search algorithm to use when external_neighbors=TRUE.
BPPARAM: A BiocParallelParam object specifying how the neighbor search should be parallelized when external_neighbors=TRUE.
...: Additional arguments to pass to umap.

Details

The function umap is used internally to compute the UMAP. Note that the algorithm is not deterministic, so different runs of the function will produce differing results. Users are advised to test multiple random seeds, and then use set.seed to set a random seed for replicable results.

Setting use_dimred allows users to easily perform UMAP on low-rank approximations of the original expression matrix (e.g., after PCA). In such cases, arguments such as ntop , feature_set , exprs_values and scale_features will be ignored.

If external_neighbors=TRUE , the nearest neighbor search step is conducted using a different algorithm to that in the umap function. This can be parallelized or approximate to achieve greater speed for large data sets. The neighbor search results are then used directly to create the UMAP embedding.

Value

A SingleCellExperiment object containing the coordinates of the first ncomponents UMAP dimensions for each cell. This is stored in the "UMAP" entry of the reducedDims slot.

Seealso

umap, plotUMAP

Author

Aaron Lun

References

McInnes L, Healy J (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv.

Examples

## Set up an example SingleCellExperiment
library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info
)
example_sce <- normalize(example_sce)

## UMAP is not deterministic, so set a seed for replicable coordinates
set.seed(100)
example_sce <- runUMAP(example_sce)
reducedDimNames(example_sce)
head(reducedDim(example_sce, "UMAP"))

sc_example_cell_info()

Cell information for the small example single-cell counts dataset to demonstrate capabilities of scater

Description

This data.frame contains metadata for the 40 cells in the example counts dataset included in the package.

Format

a data.frame instance, 1 row per cell.

Usage

sc_example_cell_info

Value

NULL, but makes available a data frame with cell metadata

Author

Davis McCarthy, 2015-03-05


sc_example_counts()

A small example of single-cell counts dataset to demonstrate capabilities of scater

Description

This data set contains counts for 2000 genes for 40 cells. They are from a real experiment, but details have been anonymised.

Format

a matrix instance, 1 row per gene.

Usage

sc_example_counts

Value

NULL, but makes available a matrix of count data

Author

Davis McCarthy, 2015-03-05


scater_package()

Single-cell analysis toolkit for expression in R

Description

scater provides a class and numerous functions for the quality control, normalisation and visualisation of single-cell RNA-seq expression data.

Details

In particular, scater provides easy generation of quality control metrics and simple functions to visualise quality control metrics and their relationships.


scater_plot_args()

General visualization parameters

Description

scater functions that plot points share a number of visualization parameters, which are described on this page.

Seealso

plotColData, plotRowData, plotReducedDim, plotExpression, plotPlatePosition, and most other plotting functions.


scater_vis_var()

Variable selection for visualization

Description

A number of scater functions accept a SingleCellExperiment object and extract (meta)data from it for use in a plot. These values are then used on the x- or y-axes (e.g., plotColData) or for tuning visual parameters, e.g., colour_by, shape_by, size_by. This page describes how the selection of these values can be controlled by the user, by passing appropriate values to the arguments of the desired plotting function.

Seealso

plotColData, plotRowData, plotReducedDim, plotExpression, plotPlatePosition, and most other plotting functions.


sumCountsAcrossCells()

Sum counts across a set of cells

Description

Create a count matrix where counts for all cells in a set are summed together.

Usage

sumCountsAcrossCells(object, ids, exprs_values = "counts",
  BPPARAM = SerialParam())

Arguments

Argument       Description
object         A SingleCellExperiment object or a count matrix.
ids            A factor specifying the set to which each cell in object belongs.
exprs_values   A string or integer scalar specifying the assay of object containing counts, if object is a SingleCellExperiment.
BPPARAM        A BiocParallelParam object specifying how summation should be parallelized.

Details

This function provides a convenient method for aggregating counts across multiple columns for each feature. A typical application would be to sum counts across all cells in each cluster to obtain pseudo-bulk samples for further analysis.

Any NA values in ids are implicitly ignored and will not be considered or reported. This may be useful, e.g., to remove undesirable cells by setting their entries in ids to NA.
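
The aggregation described above can be illustrated with a minimal base R sketch. This is not the package's implementation: it uses rowsum() as a stand-in, and the toy matrix, the cell labels and the name pseudo_bulk are hypothetical values invented for illustration.

```r
# Hypothetical toy data (not from the package): 3 features x 6 cells
counts <- matrix(1:18, nrow = 3,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:6)))

# One set label per cell; NA marks a cell that should be ignored
ids <- c("A", "B", "A", NA, "B", "A")

# Drop cells with NA labels, then sum the remaining columns per set.
# rowsum() aggregates rows by group, so transpose before and after.
keep <- !is.na(ids)
pseudo_bulk <- t(rowsum(t(counts[, keep, drop = FALSE]), group = ids[keep]))
pseudo_bulk
```

The result has one row per feature and one column per set, which is the shape expected of pseudo-bulk samples; the cell labelled NA contributes to no column.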

Value

A count matrix where counts for all cells in the same set are summed together for each feature.

Author

Aaron Lun

Examples

library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info)

## Randomly assign each cell to one of five sets
set.seed(100)
ids <- sample(LETTERS[1:5], ncol(example_sce), replace=TRUE)
out <- sumCountsAcrossCells(example_sce, ids)
dimnames(out)

sumCountsAcrossFeatures()

Sum counts across a feature set

Description

Create a count matrix where counts for all features in a set are summed together.

Usage

sumCountsAcrossFeatures(object, ids, exprs_values = "counts",
  BPPARAM = SerialParam())

Arguments

Argument       Description
object         A SingleCellExperiment object or a count matrix.
ids            A factor specifying the set to which each feature in object belongs.
exprs_values   A string or integer scalar specifying the assay of object containing counts, if object is a SingleCellExperiment.
BPPARAM        A BiocParallelParam object specifying how summation should be parallelized.

Details

This function provides a convenient method for aggregating counts across multiple rows for each cell. For example, genes with multiple mapping locations in the reference will often manifest as multiple rows with distinct Ensembl/Entrez IDs. These counts can be aggregated into a single feature by setting the shared identifier (usually the gene symbol) as ids.

It is theoretically possible to aggregate transcript-level counts to gene-level counts with this function. However, it is often better to do so with dedicated functions (e.g., from the tximport or tximeta packages) that account for differences in length across isoforms.

Any NA values in ids are implicitly ignored and will not be considered or reported. This may be useful, e.g., to remove undesirable feature sets by setting their entries in ids to NA.
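
As a rough illustration of aggregating rows that share a gene symbol, the following base R sketch uses rowsum() in place of the package function. The identifiers, symbols, counts and the name by_symbol are all hypothetical, and the sketch makes no claim about how sumCountsAcrossFeatures is implemented internally.

```r
# Hypothetical toy data: 5 features with distinct IDs x 2 cells
counts <- matrix(c(1, 2, 3, 4, 5,
                   10, 20, 30, 40, 50), ncol = 2,
    dimnames = list(paste0("ENSG", 1:5), c("cell1", "cell2")))

# Shared identifier for each row; NA marks a feature to discard
symbols <- c("GeneX", "GeneY", "GeneX", NA, "GeneY")

# Remove rows with NA identifiers, then sum rows within each symbol
keep <- !is.na(symbols)
by_symbol <- rowsum(counts[keep, , drop = FALSE], group = symbols[keep])
by_symbol
```

Each output row corresponds to one symbol, and each cell keeps its own column, so the per-cell totals of the retained features are preserved.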

Value

A count matrix where counts for all features in the same set are summed together within each cell.

Author

Aaron Lun

Examples

library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
    assays = list(counts = sc_example_counts),
    colData = sc_example_cell_info)

## Randomly assign each feature to one of 26 sets
set.seed(100)
ids <- sample(LETTERS, nrow(example_sce), replace=TRUE)
out <- sumCountsAcrossFeatures(example_sce, ids)
dimnames(out)

toSingleCellExperiment()

Convert an SCESet object to a SingleCellExperiment object

Description

Convert an SCESet object produced with an older version of the package to a SingleCellExperiment object compatible with the current version.

Usage

updateSCESet(object)
toSingleCellExperiment(object)

Arguments

Argument   Description
object     An SCESet object to be updated.

Value

a SingleCellExperiment object

Examples

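## example_sceset is assumed to be an existing SCESet object
## created with an older version of the package
updateSCESet(example_sceset)
toSingleCellExperiment(example_sceset)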

uniquifyFeatureNames()

Make feature names unique

Description

Combine a user-interpretable feature name (e.g., gene symbol) with a standard identifier that is guaranteed to be unique and valid (e.g., Ensembl) for use as row names.

Usage

uniquifyFeatureNames(ID, names)

Arguments

Argument   Description
ID         A character vector of unique identifiers.
names      A character vector of feature names.

Details

This function will attempt to use names if it is unique. If not, it will append an underscore and the corresponding ID to any non-unique value of names. Missing names will be replaced entirely by ID.

The output is guaranteed to be unique, assuming that ID is also unique. This can be directly used as the row names of a SingleCellExperiment object.
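
The rules above can be written out as a short base R sketch. The function name uniquify_sketch is hypothetical and the body is an assumption about how the described behaviour could be implemented, not a copy of the package source.

```r
# Hypothetical re-implementation of the documented rules
uniquify_sketch <- function(ID, names) {
    stopifnot(length(ID) == length(names))

    # Missing names are replaced entirely by the identifier
    missing_name <- is.na(names)
    names[missing_name] <- ID[missing_name]

    # Every name that occurs more than once gets "_<ID>" appended
    dup_name <- names %in% names[duplicated(names)]
    names[dup_name] <- paste0(names[dup_name], "_", ID[dup_name])

    names
}

unique_names <- uniquify_sketch(
    ID = paste0("ENSG0000000", 1:5),
    names = c("A", NA, "B", "C", "A")
)
unique_names
```

In this sketch both occurrences of "A" receive a suffix, the missing name falls back to its identifier, and "B" and "C" are kept unchanged because they are already unique.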

Value

A character vector of unique-ified feature names.

Author

Aaron Lun

Examples

library(scater)
uniquifyFeatureNames(
    ID=paste0("ENSG0000000", 1:5),
    names=c("A", NA, "B", "C", "A")
)