bioconductor v3.9.0 GenomicAlignments
Provides efficient containers for storing and manipulating
Link to this section Summary
Functions
GAlignmentPairs objects
GAlignmentsList objects
GAlignments objects
(Legacy) GappedReads objects
OverlapEncodings objects
CIGAR utility functions
Map range coordinates between reads and genome space using CIGAR alignments
Coverage of a GAlignments, GAlignmentPairs, or GAlignmentsList object
Encode the overlaps between RNA-seq reads and the transcripts of a gene model
Finding hits between reads and transcripts that are compatible with the splicing of the transcript
Pairing the elements of a GAlignments object
Finding overlapping genomic alignments
Classify ranges (reads) as compatible with existing genomic annotations or as having novel splice events
Intra range transformations of a GAlignments or GAlignmentsList object
Extract junctions from genomic alignments
Pile the letters of a set of aligned reads on top of a set of genomic positions
Reading genomic alignments from a file
Lay read sequences alongside the reference space, using their CIGARs
Set operations on GAlignments objects
Stack the read sequences stored in a BAM file on a region of interest
Perform overlap queries between reads and genomic features
Link to this section Functions
GAlignmentPairs_class()
GAlignmentPairs objects
Description
The GAlignmentPairs class is a container for storing list("pairs of genomic ", " alignments") . These pairs are typically obtained by aligning paired-end reads to a reference genome or transcriptome.
Details
A GAlignmentPairs object is a list-like object where each list element represents a pair of genomic alignment.
An alignment pair is made of a "first" and a "last"/"second" alignment, and is formally represented by a GAlignments object of length 2. In most applications, an alignment pair will represent an aligned paired-end read. In that case, the "first" member of the pair represents the alignment of the first end of the read (aka "first segment in the template", using SAM Spec terminology), and the "last" member of the pair represents the alignment of the second end of the read (aka "last segment in the template", using SAM Spec terminology).
In general, a GAlignmentPairs object will be created by loading
records from a BAM (or SAM) file containing aligned paired-end reads,
using the readGAlignmentPairs
function (see below).
Each element in the returned object will be obtained by pairing 2
records.
Seealso
readGAlignmentPairs
for reading aligned paired-end reads from a file (typically a BAM file) into a GAlignmentPairs object.GAlignments objects for handling aligned single-end reads.
makeGAlignmentPairs
for pairing the elements of a GAlignments object into a GAlignmentPairs object.junctions-methods for extracting and summarizing junctions from a GAlignmentPairs object.
coverage-methods for computing the coverage of a GAlignmentPairs object.
findOverlaps-methods for finding range overlaps between a GAlignmentPairs object and another range-based object.
seqinfo
in the GenomeInfoDb package for getting/setting/modifying the sequence information stored in an object.The GRanges and GRangesList classes defined and documented in the GenomicRanges package.
Author
Hervé Pagès
Examples
library(Rsamtools) # for the ex1.bam file
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
galp <- readGAlignmentPairs(ex1_file, use.names=TRUE, strandMode=1)
galp
length(galp)
head(galp)
head(names(galp))
first(galp)
last(galp)
# or
second(galp)
strandMode(galp)
first(galp, real.strand=TRUE)
last(galp, real.strand=TRUE)
strand(galp)
strandMode(galp) <- 2
first(galp, real.strand=TRUE)
last(galp, real.strand=TRUE)
strand(galp)
seqnames(galp)
head(njunc(galp))
table(isProperPair(galp))
seqlevels(galp)
## Rename the reference sequences:
seqlevels(galp) <- sub("seq", "chr", seqlevels(galp))
seqlevels(galp)
galp[[1]]
unlist(galp)
grglist(galp) # a GRangesList object
strandMode(galp) <- 1
grglist(galp)
## Alternatively the strand mode can be changed with invertStrand():
invertStrand(galp)
stopifnot(identical(unname(elementNROWS(grglist(galp))), njunc(galp) + 2L))
granges(galp) # a GRanges object
GAlignmentsList_class()
GAlignmentsList objects
Description
The GAlignmentsList class is a container for storing a collection of GAlignments objects.
Details
A GAlignmentsList object contains a list of GAlignments objects.
The majority of operations on this page are described in more detail
on the GAlignments man page, see ? GAlignments
.
Seealso
readGAlignmentsList
for reading genomic alignments from a file (typically a BAM file) into a GAlignmentsList object.GAlignments and GAlignmentPairs objects for handling aligned single- and paired-end reads, respectively.
junctions-methods for extracting and summarizing junctions from a GAlignmentsList object.
findOverlaps-methods for finding range overlaps between a GAlignmentsList object and another range-based object.
seqinfo
in the GenomeInfoDb package for getting/setting/modifying the sequence information stored in an object.The GRanges and GRangesList classes defined and documented in the GenomicRanges package.
Author
Valerie Obenchain Valerie.Obenchain@RoswellPark.org
References
http://samtools.sourceforge.net/
Examples
gal1 <- GAlignments(
seqnames=Rle(factor(c("chr1", "chr2", "chr1", "chr3")),
c(1, 3, 2, 4)),
pos=1:10, cigar=paste0(10:1, "M"),
strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
names=head(letters, 10), score=1:10)
gal2 <- GAlignments(
seqnames=Rle(factor(c("chr2", "chr4")), c(3, 4)), pos=1:7,
cigar=c("5M", "3M2N3M2N3M", "5M", "10M", "5M1N4M", "8M2N1M", "5M"),
strand=Rle(strand(c("-", "+")), c(4, 3)),
names=tail(letters, 7), score=1:7)
galist <- GAlignmentsList(noGaps=gal1, Gaps=gal2)
## ---------------------------------------------------------------------
## A. BASIC MANIPULATION
## ---------------------------------------------------------------------
length(galist)
names(galist)
seqnames(galist)
strand(galist)
head(cigar(galist))
head(qwidth(galist))
head(start(galist))
head(end(galist))
head(width(galist))
head(njunc(galist))
seqlevels(galist)
## Rename the reference sequences:
seqlevels(galist) <- sub("chr", "seq", seqlevels(galist))
seqlevels(galist)
grglist(galist) # a GRangesList object
rglist(galist) # an IRangesList object
## ---------------------------------------------------------------------
## B. SUBSETTING
## ---------------------------------------------------------------------
galist[strand(galist) == "-"]
has_junctions <- sapply(galist,
function(x) any(grepl("N", cigar(x), fixed=TRUE)))
galist[has_junctions]
## Different ways to subset:
galist[2] # a GAlignments object of length 1
galist[[2]] # a GAlignments object of length 1
grglist(galist[2]) # a GRangesList object of length 1
rglist(galist[2]) # a NormalIRangesList object of length 1
## ---------------------------------------------------------------------
## C. mcols()/elementMetadata()
## ---------------------------------------------------------------------
## Metadata can be defined on the individual GAlignment elements
## and the overall GAlignmentsList object. By default, 'level=between'
## extracts the GALignmentsList metadata. Using 'level=within'
## will extract the metadata on the individual GAlignments objects.
mcols(galist) ## no metadata on the GAlignmentsList object
mcols(galist, level="within")
## ---------------------------------------------------------------------
## D. readGAlignmentsList()
## ---------------------------------------------------------------------
library(pasillaBamSubset)
## 'file' as character.
fl <- untreated3_chr4()
galist1 <- readGAlignmentsList(fl)
galist1[1:3]
length(galist1)
table(elementNROWS(galist1))
## When 'file' is a BamFile, 'asMates' must be TRUE. If FALSE,
## the data are treated as single-end and each list element of the
## GAlignmentsList will be of length 1. For single-end data
## use readGAlignments() instead of readGAlignmentsList().
bf <- BamFile(fl, yieldSize=3, asMates=TRUE)
readGAlignmentsList(bf)
## Use a 'param' to fine tune the results.
param <- ScanBamParam(flag=scanBamFlag(isProperPair=TRUE))
galist2 <- readGAlignmentsList(fl, param=param)
length(galist2)
## ---------------------------------------------------------------------
## E. COERCION
## ---------------------------------------------------------------------
## The granges() and grlist() coercions support 'ignore.strand' to
## allow ranges from different strands to be combined. In this example
## paired-end reads aligned to opposite strands were read into a
## GAlignmentsList. If the desired operation is to combine these ranges,
## regardless of junctions or the space between pairs, 'ignore.strand'
## must be TRUE.
granges(galist[1])
granges(galist[1], ignore.strand=TRUE)
## grglist()
galist <- GAlignmentsList(noGaps=gal1, Gaps=gal2)
grglist(galist)
grglist(galist, ignore.strand=TRUE)
GAlignments_class()
GAlignments objects
Description
The GAlignments class is a simple container which purpose is to store a set of genomic alignments that will hold just enough information for supporting the operations described below.
Details
A GAlignments object is a vector-like object where each element describes a genomic alignment i.e. how a given sequence (called "query" or "read", typically short) aligns to a reference sequence (typically long).
Typically, a GAlignments object will be created by loading records from a BAM (or SAM) file and each element in the resulting object will correspond to a record. BAM/SAM records generally contain a lot of information but only part of that information is loaded in the GAlignments object. In particular, we discard the query sequences (SEQ field), the query qualities (QUAL), the mapping qualities (MAPQ) and any other information that is not needed in order to support the operations or methods described below.
This means that multi-reads (i.e. reads with multiple hits in the
reference) won't receive any special treatment i.e. the various SAM/BAM
records corresponding to a multi-read will show up in the GAlignments
object as if they were coming from different/unrelated queries.
Also paired-end reads will be treated as single-end reads and the
pairing information will be lost (see ?
for how to handle aligned paired-end reads).
Each element of a GAlignments object consists of:
The name of the reference sequence. (This is the RNAME field in a SAM/BAM record.)
The strand in the reference sequence to which the query is aligned. (This information is stored in the FLAG field in a SAM/BAM record.)
The CIGAR string in the "Extended CIGAR format" (see the SAM Format Specifications for the details).
The 1-based leftmost position/coordinate of the clipped query relative to the reference sequence. We will refer to it as the "start" of the query. (This is the POS field in a SAM/BAM record.)
The 1-based rightmost position/coordinate of the clipped query relative to the reference sequence. We will refer to it as the "end" of the query. (This is NOT explicitly stored in a SAM/BAM record but can be inferred from the POS and CIGAR fields.) Note that all positions/coordinates are always relative to the first base at the 5' end of the plus strand of the reference sequence, even when the query is aligned to the minus strand.
The genomic intervals between the "start" and "end" of the query that are "covered" by the alignment. Saying that the full [start,end] interval is covered is the same as saying that the alignment contains no junction (no N in the CIGAR). It is then considered to be a simple alignment. Note that a simple alignment can have mismatches or deletions (in the reference). In other words, a deletion (encoded with a D in the CIGAR) is NOT considered to introduce a gap in the coverage, but a junction is.
Note that the last 2 items are not expicitly stored in the GAlignments object: they are inferred on-the-fly from the CIGAR and the "start".
Optionally, a GAlignments object can have names (accessed thru the
names
generic function) which will be coming from
the QNAME field of the SAM/BAM records.
The rest of this man page will focus on describing how to:
Access the information stored in a GAlignments object in a way that is independent from how the data are actually stored internally.
How to create and manipulate a GAlignments object.
Seealso
readGAlignments
for reading genomic alignments from a file (typically a BAM file) into a GAlignments object.GAlignmentPairs objects for handling aligned paired-end reads.
junctions-methods for extracting and summarizing junctions from a GAlignments object.
coverage-methods for computing the coverage of a GAlignments object.
findOverlaps-methods for finding overlapping genomic alignments.
seqinfo
in the GenomeInfoDb package for getting/setting/modifying the sequence information stored in an object.The GRanges and GRangesList classes defined and documented in the GenomicRanges package.
The CompressedIRangesList class defined and documented in the IRanges package.
Author
Hervé Pagès and P. Aboyoun
References
http://samtools.sourceforge.net/
Examples
library(Rsamtools) # for the ex1.bam file
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
gal <- readGAlignments(ex1_file, param=ScanBamParam(what="flag"))
gal
## ---------------------------------------------------------------------
## A. BASIC MANIPULATION
## ---------------------------------------------------------------------
length(gal)
head(gal)
names(gal) # no names by default
seqnames(gal)
strand(gal)
head(cigar(gal))
head(qwidth(gal))
table(qwidth(gal))
head(start(gal))
head(end(gal))
head(width(gal))
head(njunc(gal))
seqlevels(gal)
## Invert the strand:
invertStrand(gal)
## Rename the reference sequences:
seqlevels(gal) <- sub("seq", "chr", seqlevels(gal))
seqlevels(gal)
grglist(gal) # a GRangesList object
stopifnot(identical(unname(elementNROWS(grglist(gal))), njunc(gal) + 1L))
granges(gal) # a GRanges object
rglist(gal) # a CompressedIRangesList object
stopifnot(identical(unname(elementNROWS(rglist(gal))), njunc(gal) + 1L))
ranges(gal) # an IRanges object
## Modify the number of lines in 'show'
options(showHeadLines=3)
options(showTailLines=2)
gal
## Revert to default
options(showHeadLines=NULL)
options(showTailLines=NULL)
## ---------------------------------------------------------------------
## B. SUBSETTING
## ---------------------------------------------------------------------
gal[strand(gal) == "-"]
gal[grep("I", cigar(gal), fixed=TRUE)]
gal[grep("N", cigar(gal), fixed=TRUE)] # no junctions
## A confirmation that none of the alignments contains junctions (in
## other words, each alignment can be represented by a single genomic
## range on the reference):
stopifnot(all(njunc(gal) == 0))
## Different ways to subset:
gal[6] # a GAlignments object of length 1
grglist(gal)[[6]] # a GRanges object of length 1
rglist(gal)[[6]] # a NormalIRanges object of length 1
## Unlike N operations, D operations don't introduce gaps:
ii <- grep("D", cigar(gal), fixed=TRUE)
gal[ii]
njunc(gal[ii])
grglist(gal[ii])
## qwidth() vs width():
gal[qwidth(gal) != width(gal)]
## This MUST return an empty object:
gal[cigar(gal) == "35M" & qwidth(gal) != 35]
## but this doesn't have too:
gal[cigar(gal) != "35M" & qwidth(gal) == 35]
GappedReads_class()
(Legacy) GappedReads objects
Description
The GappedReads class extends the GAlignments class.
A GappedReads object contains all the information contained in a
GAlignments object plus the sequences of the queries.
Those sequences can be accessed via the qseq
accessor.
Seealso
GAlignments objects.
Author
Hervé Pagès
References
http://samtools.sourceforge.net/
Examples
greads_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
greads <- readGappedReads(greads_file)
greads
qseq(greads)
OverlapEncodings_class()
OverlapEncodings objects
Description
The OverlapEncodings class is a container for storing the
"overlap encodings" returned by the encodeOverlaps
function.
Usage
## -=-= OverlapEncodings getters =-=-
list(list("Loffset"), list("OverlapEncodings"))(x)
list(list("Roffset"), list("OverlapEncodings"))(x)
list(list("encoding"), list("OverlapEncodings"))(x)
list(list("levels"), list("OverlapEncodings"))(x)
list(list("flippedQuery"), list("OverlapEncodings"))(x)
## -=-= Coercing an OverlapEncodings object =-=-
list(list("as.data.frame"), list("OverlapEncodings"))(x, row.names=NULL, optional=FALSE, ...)
## -=-= Low-level encoding utilities =-=-
encodingHalves(x, single.end.on.left=FALSE, single.end.on.right=FALSE,
as.factors=FALSE)
Lencoding(x, ...)
Rencoding(x, ...)
list(list("njunc"), list("ANY"))(x)
Lnjunc(x, single.end.on.left=FALSE)
Rnjunc(x, single.end.on.right=FALSE)
isCompatibleWithSplicing(x)
Arguments
Argument | Description |
---|---|
x | An OverlapEncodings object. For the low-level encoding utilities, x can also be a character vector or factor containing encodings. |
row.names | NULL or a character vector. |
optional | Ignored. |
... | Extra arguments passed to the as.data.frame method for OverlapEncodings objects are ignored. Extra arguments passed to Lencoding or Rencoding are passed down to encodingHalves . |
single.end.on.left, single.end.on.right | By default the 2 halves of a single-end encoding are considered to be NAs. If single.end.on.left (resp. single.end.on.right ) is TRUE , then the left (resp. right) half of a single-end encoding is considered to be the unmodified encoding. |
as.factors | By default encodingHalves returns the 2 encoding halves as a list of 2 character vectors parallel to the input. If as.factors is TRUE , then it returns them as a list of 2 factors parallel to the input. |
Details
Given a query
and a subject
of the same length, both
list-like objects with top-level elements typically containing multiple
ranges (e.g. IntegerRangesList objects), the "overlap
encoding" of the i-th element in query
and i-th element in
subject
is a character string describing how the ranges in
query[[i]]
are qualitatively positioned relatively to
the ranges in subject[[i]]
.
The encodeOverlaps
function computes those overlap
encodings and returns them in an OverlapEncodings object of the same
length as query
and subject
.
The topic of working with overlap encodings is covered in details
in the "OverlapEncodings" vignette located this package
( GenomicAlignments ) and accessible with
vignette("OverlapEncodings")
.
Seealso
The "OverlapEncodings" vignette in this package.
The
encodeOverlaps
function for computing "overlap encodings".The
pcompare
function in the IRanges package for the interpretation of the strings returned byencoding
.The GRangesList class defined and documented in the GenomicRanges package.
Author
Hervé Pagès
Examples
## ---------------------------------------------------------------------
## A. BASIC MANIPULATION OF AN OverlapEncodings OBJECT
## ---------------------------------------------------------------------
example(encodeOverlaps) # to generate the 'ovenc' object
length(ovenc)
Loffset(ovenc)
Roffset(ovenc)
encoding(ovenc)
levels(ovenc)
nlevels(ovenc)
flippedQuery(ovenc)
njunc(ovenc)
as.data.frame(ovenc)
njunc(levels(ovenc))
## ---------------------------------------------------------------------
## B. WORKING WITH PAIRED-END ENCODINGS (POSSIBLY MIXED WITH SINGLE-END
## ENCODINGS)
## ---------------------------------------------------------------------
encodings <- c("4:jmmm:agmm:aagm:aaaf:", "3--1:jmm--b:agm--i:")
encodingHalves(encodings)
encodingHalves(encodings, single.end.on.left=TRUE)
encodingHalves(encodings, single.end.on.right=TRUE)
encodingHalves(encodings, single.end.on.left=TRUE,
single.end.on.right=TRUE)
Lencoding(encodings)
Lencoding(encodings, single.end.on.left=TRUE)
Rencoding(encodings)
Rencoding(encodings, single.end.on.right=TRUE)
njunc(encodings)
Lnjunc(encodings)
Lnjunc(encodings, single.end.on.left=TRUE)
Rnjunc(encodings)
Rnjunc(encodings, single.end.on.right=TRUE)
## ---------------------------------------------------------------------
## C. DETECTION OF "SPLICE COMPATIBLE" OVERLAPS
## ---------------------------------------------------------------------
## Reads that are compatible with the splicing of the transcript can
## be detected with a regular expression (the regular expression below
## assumes that reads have at most 2 junctions):
|regex0 <- "(:[fgij]:|:[jg].:.[gf]:|:[jg]..:.g.:..[gf]:)"|
grepl(regex0, encoding(ovenc)) # read4 is NOT "compatible"
## This was for illustration purpose only. In practise you don't need
## (and should not) use this regular expression, but use instead the
## isCompatibleWithSplicing() utility function:
isCompatibleWithSplicing(ovenc)
cigar_utils()
CIGAR utility functions
Description
Utility functions for low-level CIGAR manipulation.
Usage
## -=-= Supported CIGAR operations =-=-
CIGAR_OPS
## -=-= Transform CIGARs into other useful representations =-=-
explodeCigarOps(cigar, ops=CIGAR_OPS)
explodeCigarOpLengths(cigar, ops=CIGAR_OPS)
cigarToRleList(cigar)
## -=-= Summarize CIGARs =-=-
cigarOpTable(cigar)
## -=-= From CIGARs to ranges =-=-
cigarRangesAlongReferenceSpace(cigar, flag=NULL,
N.regions.removed=FALSE, pos=1L, f=NULL,
ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE,
with.ops=FALSE)
cigarRangesAlongQuerySpace(cigar, flag=NULL,
before.hard.clipping=FALSE, after.soft.clipping=FALSE,
ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE,
with.ops=FALSE)
cigarRangesAlongPairwiseSpace(cigar, flag=NULL,
N.regions.removed=FALSE, dense=FALSE,
ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE,
with.ops=FALSE)
extractAlignmentRangesOnReference(cigar, pos=1L,
drop.D.ranges=FALSE, f=NULL)
## -=-= From CIGARs to sequence lengths =-=-
cigarWidthAlongReferenceSpace(cigar, flag=NULL,
N.regions.removed=FALSE)
cigarWidthAlongQuerySpace(cigar, flag=NULL,
before.hard.clipping=FALSE, after.soft.clipping=FALSE)
cigarWidthAlongPairwiseSpace(cigar, flag=NULL,
N.regions.removed=FALSE, dense=FALSE)
## -=-= Narrow CIGARs =-=-
cigarNarrow(cigar, start=NA, end=NA, width=NA)
cigarQNarrow(cigar, start=NA, end=NA, width=NA)
## -=-= Translate coordinates between query and reference spaces =-=-
queryLoc2refLoc(qloc, cigar, pos=1L)
queryLocs2refLocs(qlocs, cigar, pos=1L, flag=NULL)
Arguments
Argument | Description |
---|---|
cigar | A character vector or factor containing the extended CIGAR strings. It can be of arbitrary length except for queryLoc2refLoc which only accepts a single CIGAR (as a character vector or factor of length 1). |
ops | Character vector containing the extended CIGAR operations to actually consider. Zero-length operations or operations not listed ops are ignored. |
flag | NULL or an integer vector containing the SAM flag for each read. According to the SAM Spec v1.4, flag bit 0x4 is the only reliable place to tell whether a segment (or read) is mapped (bit is 0) or not (bit is 1). If flag is supplied, then cigarRangesAlongReferenceSpace , cigarRangesAlongQuerySpace , cigarRangesAlongPairwiseSpace , and extractAlignmentRangesOnReference don't produce any range for unmapped reads i.e. they treat them as if their CIGAR was empty (independently of what their CIGAR is). If flag is supplied, then cigarWidthAlongReferenceSpace , cigarWidthAlongQuerySpace , and cigarWidthAlongPairwiseSpace return NA s for unmapped reads. |
N.regions.removed | TRUE or FALSE . If TRUE , then cigarRangesAlongReferenceSpace and cigarWidthAlongReferenceSpace report ranges/widths with respect to the "reference" space from which the N regions have been removed, and cigarRangesAlongPairwiseSpace and cigarWidthAlongPairwiseSpace report them with respect to the "pairwise" space from which the N regions have been removed. |
pos | An integer vector containing the 1-based leftmost position/coordinate for each (eventually clipped) read sequence. Must have length 1 (in which case it's recycled to the length of cigar ), or the same length as cigar . |
f | NULL or a factor of length cigar . If NULL , then the ranges are grouped by alignment i.e. the returned IRangesList object has 1 list element per element in cigar . Otherwise they are grouped by factor level i.e. the returned IRangesList object has 1 list element per level in f and is named with those levels. For example, if f is a factor containing the chromosome for each read, then the returned IRangesList object will have 1 list element per chromosome and each list element will contain all the ranges on that chromosome. |
drop.empty.ranges | Should empty ranges be dropped? |
reduce.ranges | Should adjacent ranges coming from the same cigar be merged or not? Using TRUE can significantly reduce the size of the returned object. |
with.ops | TRUE or FALSE indicating whether the returned ranges should be named with their corresponding CIGAR operation. |
before.hard.clipping | TRUE or FALSE . If TRUE , then cigarRangesAlongQuerySpace and cigarWidthAlongQuerySpace report ranges/widths with respect to the "query" space to which the H regions have been added. before.hard.clipping and after.soft.clipping cannot both be TRUE . |
after.soft.clipping | TRUE or FALSE . If TRUE , then cigarRangesAlongQuerySpace and cigarWidthAlongQuerySpace report ranges/widths with respect to the "query" space from which the S regions have been removed. before.hard.clipping and after.soft.clipping cannot both be TRUE . |
dense | TRUE or FALSE . If TRUE , then cigarRangesAlongPairwiseSpace and cigarWidthAlongPairwiseSpace report ranges/widths with respect to the "pairwise" space from which the I, D, and N regions have been removed. N.regions.removed and dense cannot both be TRUE . |
drop.D.ranges | Should the ranges corresponding to a deletion from the reference (encoded with a D in the CIGAR) be dropped? By default we keep them to be consistent with the pileup tool from SAMtools. Note that, when drop.D.ranges is TRUE , then Ds and Ns in the CIGAR are equivalent. |
start,end,width | Vectors of integers. NAs and negative values are accepted and "solved" according to the rules of the SEW (Start/End/Width) interface (see ? for the details). |
qloc | An integer vector containing "query-based locations" i.e. 1-based locations relative to the query sequence stored in the SAM/BAM file. |
qlocs | A list of the same length as cigar where each element is an integer vector containing "query-based locations" i.e. 1-based locations relative to the corresponding query sequence stored in the SAM/BAM file. |
Value
CIGAR_OPS
is a predefined character vector containing the supported
extended CIGAR operations: M, I, D, N, S, H, P, =, X. See p. 4 of the
SAM Spec v1.4 at http://samtools.sourceforge.net/ for the list of
extended CIGAR operations and their meanings.
For explodeCigarOps
and explodeCigarOpLengths
:
Both functions return a list of the same length as cigar
where each
list element is a character vector (for explodeCigarOps
) or an integer
vector (for explodeCigarOpLengths
). The 2 lists have the same shape,
that is, same length()
and same elementNROWS()
. The i-th
character vector in the list returned by explodeCigarOps
contains one
single-letter string per CIGAR operation in cigar[i]
. The i-th integer
vector in the list returned by explodeCigarOpLengths
contains the
corresponding CIGAR operation lengths. Zero-length operations or operations
not listed in ops
are ignored.
For cigarToRleList
: A CompressedRleList object.
For cigarOpTable
: An integer matrix with number of rows equal
to the length of cigar
and nine columns, one for each extended
CIGAR operation.
For cigarRangesAlongReferenceSpace
, cigarRangesAlongQuerySpace
,
cigarRangesAlongPairwiseSpace
, and
extractAlignmentRangesOnReference
: An IRangesList
object (more precisely a CompressedIRangesList object)
with 1 list element per element in cigar
.
However, if f
is a factor, then the returned
IRangesList object can be a SimpleIRangesList
object (instead of CompressedIRangesList ), and in that case,
has 1 list element per level in f
and is named with those levels.
For cigarWidthAlongReferenceSpace
and
cigarWidthAlongPairwiseSpace
: An integer vector of the same
length as cigar
where each element is the width of the alignment
with respect to the "reference" and "pairwise" space, respectively.
More precisely, for cigarWidthAlongReferenceSpace
, the returned
widths are the lengths of the alignments on the reference,
N gaps included (except if N.regions.removed
is TRUE
).
NAs or "*"
in cigar
will produce NAs in the returned vector.
For cigarWidthAlongQuerySpace
: An integer vector of the same
length as cigar
where each element is the length of the corresponding
query sequence as inferred from the CIGAR string. Note that, by default
(i.e. if before.hard.clipping
and after.soft.clipping
are
FALSE
), this is the length of the query sequence stored in the
SAM/BAM file. If before.hard.clipping
or after.soft.clipping
is TRUE
, the returned widths are the lengths of the query sequences
before hard clipping or after soft clipping.
NAs or "*"
in cigar
will produce NAs in the returned vector.
For cigarNarrow
and cigarQNarrow
: A character vector
of the same length as cigar
containing the narrowed cigars.
In addition the vector has an "rshift" attribute which is an integer
vector of the same length as cigar
. It contains the values that
would need to be added to the POS field of a SAM/BAM file as a
consequence of this cigar narrowing.
For queryLoc2refLoc
: An integer vector of the same length as
qloc
containing the "reference-based locations" (i.e. the
1-based locations relative to the reference sequence) corresponding
to the "query-based locations" passed in qloc
.
For queryLocs2refLocs
: A list of the same length as
qlocs
where each element is an integer vector containing
the "reference-based locations" corresponding to the "query-based
locations" passed in the corresponding element in qlocs
.
Seealso
The sequenceLayer function in the GenomicAlignments package for laying the query sequences alongside the "reference" or "pairwise" spaces.
The GAlignments container for storing a set of genomic alignments.
The IRanges , IRangesList , and RleList classes in the IRanges package.
The
coverage
generic and methods for computing the coverage across a set of ranges or genomic ranges.
Author
Hervé Pagès & P. Aboyoun
References
http://samtools.sourceforge.net/
Examples
## ---------------------------------------------------------------------
## A. CIGAR_OPS, explodeCigarOps(), explodeCigarOpLengths(),
## cigarToRleList(), and cigarOpTable()
## ---------------------------------------------------------------------
## Supported CIGAR operations:
CIGAR_OPS
## Transform CIGARs into other useful representations:
cigar1 <- "3H15M55N4M2I6M2D5M6S"
cigar2 <- c("40M2I9M", cigar1, "2S10M2000N15M", "3H33M5H")
explodeCigarOps(cigar2)
explodeCigarOpLengths(cigar2)
explodeCigarOpLengths(cigar2, ops=c("I", "S"))
cigarToRleList(cigar2)
## Summarize CIGARs:
cigarOpTable(cigar2)
## ---------------------------------------------------------------------
## B. From CIGARs to ranges and to sequence lengths
## ---------------------------------------------------------------------
## CIGAR ranges along the "reference" space:
cigarRangesAlongReferenceSpace(cigar1, with.ops=TRUE)[[1]]
cigarRangesAlongReferenceSpace(cigar1,
reduce.ranges=TRUE, with.ops=TRUE)[[1]]
ops <- setdiff(CIGAR_OPS, "N")
cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]
cigarRangesAlongReferenceSpace(cigar1, ops=ops,
reduce.ranges=TRUE, with.ops=TRUE)[[1]]
ops <- setdiff(CIGAR_OPS, c("D", "N"))
cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]
cigarWidthAlongReferenceSpace(cigar1)
pos2 <- c(1, 1001, 1, 351)
cigarRangesAlongReferenceSpace(cigar2, pos=pos2, with.ops=TRUE)
res1a <- extractAlignmentRangesOnReference(cigar2, pos=pos2)
res1b <- cigarRangesAlongReferenceSpace(cigar2,
pos=pos2,
ops=setdiff(CIGAR_OPS, "N"),
reduce.ranges=TRUE)
stopifnot(identical(res1a, res1b))
res2a <- extractAlignmentRangesOnReference(cigar2, pos=pos2,
drop.D.ranges=TRUE)
res2b <- cigarRangesAlongReferenceSpace(cigar2,
pos=pos2,
ops=setdiff(CIGAR_OPS, c("D", "N")),
reduce.ranges=TRUE)
stopifnot(identical(res2a, res2b))
seqnames <- factor(c("chr6", "chr6", "chr2", "chr6"),
levels=c("chr2", "chr6"))
extractAlignmentRangesOnReference(cigar2, pos=pos2, f=seqnames)
## CIGAR ranges along the "query" space:
cigarRangesAlongQuerySpace(cigar2, with.ops=TRUE)
cigarWidthAlongQuerySpace(cigar1)
cigarWidthAlongQuerySpace(cigar1, before.hard.clipping=TRUE)
## CIGAR ranges along the "pairwise" space:
cigarRangesAlongPairwiseSpace(cigar2, with.ops=TRUE)
cigarRangesAlongPairwiseSpace(cigar2, dense=TRUE, with.ops=TRUE)
## ---------------------------------------------------------------------
## C. COMPUTE THE COVERAGE OF THE READS STORED IN A BAM FILE
## ---------------------------------------------------------------------
## The information stored in a BAM file can be used to compute the
## "coverage" of the mapped reads i.e. the number of reads that hit any
## given position in the reference genome.
## The following function takes the path to a BAM file and returns an
## object representing the coverage of the mapped reads that are stored
## in the file. The returned object is an RleList object named with the
## names of the reference sequences that actually receive some coverage.
flag0 <- scanBamFlag(isUnmappedQuery=FALSE, isDuplicate=FALSE)
extractCoverageFromBAM <- function(bamfile)
{
stopifnot(is(bamfile, "BamFile"))
## This ScanBamParam object allows us to load only the necessary
## information from the file.
param <- ScanBamParam(flag=flag0, what=c("rname", "pos", "cigar"))
bam <- scanBam(bamfile, param=param)[[1]]
## Note that unmapped reads and reads that are PCR/optical duplicates
## have already been filtered out by using the ScanBamParam object
## above.
f <- factor(bam$rname, levels=seqlevels(bamfile))
irl <- extractAlignmentRangesOnReference(bam$cigar, pos=bam$pos, f=f)
coverage(irl, width=seqlengths(bamfile))
}
library(Rsamtools)
f1 <- system.file("extdata", "ex1.bam", package="Rsamtools")
cvg <- extractCoverageFromBAM(BamFile(f1))
## extractCoverageFromBAM() is equivalent but slightly more efficient
## than loading a GAlignments object and computing its coverage:
cvg2 <- coverage(readGAlignments(f1, param=ScanBamParam(flag=flag0)))
stopifnot(identical(cvg, cvg2))
## ---------------------------------------------------------------------
## D. cigarNarrow() and cigarQNarrow()
## ---------------------------------------------------------------------
## cigarNarrow():
cigarNarrow(cigar1) # only drops the soft/hard clipping
cigarNarrow(cigar1, start=10)
cigarNarrow(cigar1, start=15)
cigarNarrow(cigar1, start=15, width=57)
cigarNarrow(cigar1, start=16)
#cigarNarrow(cigar1, start=16, width=55) # ERROR! (empty cigar)
cigarNarrow(cigar1, start=71)
cigarNarrow(cigar1, start=72)
cigarNarrow(cigar1, start=75)
## cigarQNarrow():
cigarQNarrow(cigar1, start=4, end=-3)
cigarQNarrow(cigar1, start=10)
cigarQNarrow(cigar1, start=19)
cigarQNarrow(cigar1, start=24)
## ---------------------------------------------------------------------
## E. PERFORMANCE
## ---------------------------------------------------------------------
if (interactive()) {
## We simulate 20 millions aligned reads, all 40-mers. 95% of them
## align with no indels. 5% align with a big deletion in the
## reference. In the context of an RNAseq experiment, those 5% would
## be suspected to be "junction reads".
set.seed(123)
nreads <- 20000000L
njunctionreads <- nreads * 5L / 100L
cigar3 <- character(nreads)
cigar3[] <- "40M"
junctioncigars <- paste(
paste(10:30, "M", sep=""),
paste(sample(80:8000, njunctionreads, replace=TRUE), "N", sep=""),
paste(30:10, "M", sep=""), sep="")
cigar3[sample(nreads, njunctionreads)] <- junctioncigars
some_fake_rnames <- paste("chr", c(1:6, "X"), sep="")
rname <- factor(sample(some_fake_rnames, nreads, replace=TRUE),
levels=some_fake_rnames)
pos <- sample(80000000L, nreads, replace=TRUE)
## The following takes < 3 sec. to complete:
system.time(irl1 <- extractAlignmentRangesOnReference(cigar3, pos=pos))
## The following takes < 4 sec. to complete:
system.time(irl2 <- extractAlignmentRangesOnReference(cigar3, pos=pos,
f=rname))
## The sizes of the resulting objects are about 240M and 160M,
## respectively:
object.size(irl1)
object.size(irl2)
}
coordinate_mapping_methods()
Map range coordinates between reads and genome space using CIGAR alignments
Description
Map range coordinates between reads (local) and genome (reference) space
using the CIGAR in a GAlignments
object.
See ?
in the
GenomicRanges package for mapping coordinates between features
in the transcriptome and genome space.
Usage
list(list("mapToAlignments"), list("GenomicRanges,GAlignments"))(x, alignments, ...)
list(list("pmapToAlignments"), list("GenomicRanges,GAlignments"))(x, alignments, ...)
list(list("mapFromAlignments"), list("GenomicRanges,GAlignments"))(x, alignments, ...)
list(list("pmapFromAlignments"), list("GenomicRanges,GAlignments"))(x, alignments, ...)
Arguments
Argument | Description |
---|---|
x | GenomicRanges object of positions to be mapped. x must have names when mapping to the genome. |
alignments | A GAlignments object that represents the alignment of x to the genome. The aligments object must have names. When mapping to the genome names are used to determine mapping pairs and in the reverse direction they are used as the seqlevels of the output object. |
list() | Arguments passed to other methods. |
Details
These methods use a GAlignments
object to represent the alignment
between the ranges in x
and the output. The following CIGAR
operations in the "Extended CIGAR format" are used in the mapping
algorithm:
M, X, = Sequence match or mismatch
I Insertion to the reference
D Deletion from the reference
N Skipped region from the reference
S Soft clip on the read
H Hard clip on the read
P Silent deletion from the padded reference
list(list("mapToAlignments"), ", ", list("pmapToAlignments")) list(" ", " The CIGAR is used to map the genomic (reference) position ", list("x"), " to ", " local coordinates. The mapped position starts at ", " ", list(" ", " start(x) - start(alignments) + 1 ", " "), " ", " and is incremented or decremented as the algorithm walks the length of ", " the CIGAR. A successful mapping in this direction requires that ", " ", list("x"), " fall within ", list("alignments"), ". ", " ", " The seqlevels of the return object are taken from the ", " ", list("alignments"), " object and will be a name descriptive of the read ", " or aligned region. In this direction, mapping is attempted between all ", " elements of ", list("x"), " and all elements of ", list("alignments"), ". ", " ")
list(list("mapFromAlignments"), ", ", list("pmapFromAlignments")) list(" ", " The CIGAR is used to map the local position ", list("x"), " to genomic ", " (reference) coordinates. The mapped position starts at ", " ", list(" ", " start(x) + start(alignments) - 1 ", " "), " ", " and is incremented or decremented as the algorithm walks the length of ", " the CIGAR. A successful mapping in this direction requires that the ", " width of ", list("alignments"), " is <= the width of ", list("x"), ". ", " ", " When mapping to the genome, name matching is used to determine the ", " mapping pairs (vs attempting to match all possible pairs). Ranges in ", " ", list("x"), " are only mapped to ranges in ", list("alignments"), " with the ", " same name. Name matching is motivated by use cases such as ", " differentially expressed regions where the expressed regions in ", " ", list("x"), " would only be related to a subset of regions in ", " ", list("alignments"), ", which may contains gene or transcript ranges. ", " ")
list("element-wise versions") list(" ", " ", list("pmapToAlignments"), " and ", list("pmapFromAlignments"), " are element-wise ", " (aka
parallel
) versions of ", list("mapToAlignments"), " and ", " ", list("mapFromAlignments"), ". The i-th range in ", list("x"), " is mapped to the ", " i-th range in ", list("alignments"), "; ", list("x"), " and ", list("alignments"), " must ", " have the same length. ", " ", " Ranges in ", list("x"), " that do not map (out of bounds) are returned as ", " zero-width ranges starting at 0. These ranges are given the special ", " seqname of "UNMAPPED". Note the non-parallel methods do not return ", " unmapped ranges so the "UNMAPPED" seqname is unique to ", " ", list("pmapToAlignments"), " and ", list("pmapFromAlignments"), ". ", " ")list("strand") list(" ", " By SAM convention, the CIGAR string is reported for mapped reads on the ", " forward genomic strand. There is no need to consider strand in these ", " methods. The output of these methods will always be unstranded ", " (i.e., "*"). ", " ")
Value
An object the same class as x
.
Parallel methods return an object the same shape as x
. Ranges that
cannot be mapped (out of bounds) are returned as zero-width ranges starting
at 0 with a seqname of "UNMAPPED".
Non-parallel methods return an object that varies in length similar to a
Hits object. The result only contains mapped records, out of bound ranges
are not returned. xHits
and alignmentsHits
metadata columns
indicate the elements of x
and alignments
used in the mapping.
When present, names from x
are propagated to the output. When
mapping locally, the seqlevels of the output are the names on the
alignment
object. When mapping globally, the output seqlevels are
the seqlevels of alignment
which are usually chromosome names.
Seealso
?
in the in the GenomicFeatures package for methods mapping between transcriptome and genome space.http://samtools.sourceforge.net/ for a description of the Extended CIGAR format.
Author
V. Obenchain, M. Lawrence and H. Pagès
Examples
## ---------------------------------------------------------------------
## A. Basic use
## ---------------------------------------------------------------------
## 1. Map to local space with mapToAlignments()
## ---------------------------------------------------------------------
## Mapping to local coordinates requires 'x' to be within 'alignments'.
## In this 'x', the second range is too long and can't be mapped.
alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="read_A")
x <- GRanges("chr1", IRanges(c(12, 12), width=c(6, 20)))
mapToAlignments(x, alignments)
## The element-wise version of the function returns unmapped ranges
## as zero-width ranges with a seqlevel of "UNMAPPED":
pmapToAlignments(x, c(alignments, alignments))
## Mapping the same range through different alignments demonstrates
## how the CIGAR operations affect the outcome.
ops <- c("no-op", "junction", "insertion", "deletion")
x <- GRanges(rep("chr1", 4), IRanges(rep(12, 4), width=rep(6, 4), names=ops))
alignments <- GAlignments(rep("chr1", 4), rep(10L, 4),
cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
strand = strand(rep("*", 4)),
names = paste0("region_", 1:4))
pmapToAlignments(x, alignments)
## 2. Map to genome space with mapFromAlignments()
## ---------------------------------------------------------------------
## One of the criteria when mapping to genomic coordinates is that the
## shifted 'x' range falls within 'alignments'. Here the first 'x'
## range has a shifted start value of 14 (5 + 10 - 1 = 14) with a width of
## 2 and so is successfully mapped. The second has a shifted start of 29
## (20 + 10 - 1 = 29) which is outside the range of 'alignments'.
x <- GRanges("chr1", IRanges(c(5, 20), width=2, names=rep("region_A", 2)))
alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="region_A")
mapFromAlignments(x, alignments)
## Another characteristic of mapping this direction is the name matching
## used to determine pairs. Mapping is only attempted between ranges in 'x'
## and 'alignments' with the same name. If we change the name of the first 'x'
## range, only the second will be mapped to 'alignment'. We know the second
## range fails to map so we get an empty result.
names(x) <- c("region_B", "region_A")
mapFromAlignments(x, alignments)
## CIGAR operations: insertions reduce the width of the output while
## junctions and deletions increase it.
ops <- c("no-op", "junction", "insertion", "deletion")
x <- GRanges(rep("chr1", 4), IRanges(rep(3, 4), width=rep(5, 4), names=ops))
alignments <- GAlignments(rep("chr1", 4), rep(10L, 4),
cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
strand = strand(rep("*", 4)))
pmapFromAlignments(x, alignments)
## ---------------------------------------------------------------------
## B. TATA box motif: mapping from read -> genome -> transcript
## ---------------------------------------------------------------------
## The TATA box motif is a conserved DNA sequence in the core promoter
## region. Many eukaryotic genes have a TATA box located approximately
## 25-35 base pairs upstream of the transcription start site. The motif is
## the binding site of general transcription factors or histones and
## plays a key role in transcription.
## In this example, the position of the TATA box motif (if present) is
## located in the DNA sequence corresponding to read ranges. The local
## motif positions are mapped to genome coordinates and then mapped
## to gene features such as promoters regions.
## Load reads from chromosome 4 of D. melanogaster (dm3):
library(pasillaBamSubset)
fl <- untreated1_chr4()
gal <- readGAlignments(fl)
## Extract DNA sequences corresponding to the read ranges:
library(GenomicFeatures)
library(BSgenome.Dmelanogaster.UCSC.dm3)
dna <- extractTranscriptSeqs(BSgenome.Dmelanogaster.UCSC.dm3, grglist(gal))
## Search for the consensus motif TATAAA in the sequences:
box <- vmatchPattern("TATAAA", dna)
## Some sequences had more than one match:
table(elementNROWS(box))
## The element-wise function we'll use for mapping to genome coordinates
## requires the two input argument to have the same length. We need to
## replicate the read ranges to match the number of motifs found.
## Expand the read ranges to match motifs found:
motif <- elementNROWS(box) != 0
alignments <- rep(gal[motif], elementNROWS(box)[motif])
## We make the IRanges into a GRanges object so the seqlevels can
## propagate to the output. Seqlevels are needed in the last mapping step.
readCoords <- GRanges(seqnames(alignments), unlist(box, use.names=FALSE))
## Map the local position of the motif to genome coordinates:
genomeCoords <- pmapFromAlignments(readCoords, alignments)
genomeCoords
## We are interested in the location of the TATA box motifs in the
## promoter regions. To perform the mapping we need the promoter ranges
## as a GRanges or GRangesList.
## Extract promoter regions 50 bp upstream from the transcription start site:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
promoters <- promoters(txdb, upstream=50, downstream=0)
## Map the genome coordinates to the promoters:
names(promoters) <- mcols(promoters)$tx_name ## must be named
mapToTranscripts(genomeCoords, promoters)
coverage_methods()
Coverage of a GAlignments, GAlignmentPairs, or GAlignmentsList object
Description
coverage
methods for GAlignments ,
GAlignmentPairs , GAlignmentsList , and
BamFile objects.
NOTE: The coverage
generic function and methods
for IntegerRanges and IntegerRangesList
objects are defined and documented in the IRanges package.
Methods for GRanges and
GRangesList objects are defined and
documented in the GenomicRanges package.
Usage
list(list("coverage"), list("GAlignments"))(x, shift=0L, width=NULL, weight=1L,
method=c("auto", "sort", "hash"), drop.D.ranges=FALSE)
list(list("coverage"), list("GAlignmentPairs"))(x, shift=0L, width=NULL, weight=1L,
method=c("auto", "sort", "hash"), drop.D.ranges=FALSE)
list(list("coverage"), list("GAlignmentsList"))(x, shift=0L, width=NULL, weight=1L, ...)
list(list("coverage"), list("BamFile"))(x, shift=0L, width=NULL, weight=1L, ...,
param=ScanBamParam())
list(list("coverage"), list("character"))(x, shift=0L, width=NULL, weight=1L, ...,
yieldSize=2500000L)
Arguments
Argument | Description |
---|---|
x | A GAlignments , GAlignmentPairs , GAlignmentsList , or BamFile object, or the path to a BAM file. |
shift, width, weight | See coverage method for GRanges objects in the GenomicRanges package. |
method | See ? in the IRanges package for a description of this argument. |
drop.D.ranges | Whether the coverage calculation should ignore ranges corresponding to D (deletion) in the CIGAR string. |
... | Additional arguments passed to the coverage method for GAlignments objects. |
param | An optional ScanBamParam object passed to readGAlignments . |
yieldSize | An optional argument controlling how many records are input when iterating through a BamFile . |
Details
The methods for GAlignments and GAlignmentPairs objects do: list(" coverage(grglist(x, drop.D.ranges=drop.D.ranges), ...) ")
The method for GAlignmentsList objects does: list(" coverage(unlist(x), ...) ")
The method for BamFile objects iterates through a
BAM file, reading yieldSize(x)
records (or all records, if
is.na(yieldSize(x))
) and calculating:
list(" gal <- readGAlignments(x, param=param)
", " coverage(gal, shift=shift, width=width, weight=weight, ...)
")
The method for character
vectors of length 1 creates a
BamFile object from x
and performs the
calculation for coverage,BamFile-method
.
Value
A named RleList object with one coverage vector per
seqlevel in x
.
Seealso
coverage
in the IRanges package.coverage-methods in the GenomicRanges package.
RleList objects in the IRanges package.
GAlignments and GAlignmentPairs objects.
BamFile objects in the Rsamtools package.
Examples
## ---------------------------------------------------------------------
## A. EXAMPLE WITH TOY DATA
## ---------------------------------------------------------------------
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
## Coverage of a GAlignments object:
gal <- readGAlignments(ex1_file)
cvg1 <- coverage(gal)
cvg1
## Coverage of a GAlignmentPairs object:
galp <- readGAlignmentPairs(ex1_file)
cvg2 <- coverage(galp)
cvg2
## Coverage of a GAlignmentsList object:
galist <- readGAlignmentsList(ex1_file)
cvg3 <- coverage(galist)
cvg3
table(mcols(galist)$mate_status)
mated_idx <- which(mcols(galist)$mate_status == "mated")
mated_galist <- galist[mated_idx]
mated_cvg3 <- coverage(mated_galist)
mated_cvg3
## Sanity checks:
stopifnot(identical(cvg1, cvg3))
stopifnot(identical( cvg2, mated_cvg3))
## ---------------------------------------------------------------------
## B. EXAMPLE WITH REAL DATA
## ---------------------------------------------------------------------
library(pasillaBamSubset)
## See '?pasillaBamSubset' for more information about the 2 BAM files
## included in this package.
reads <- readGAlignments(untreated3_chr4())
table(njunc(reads)) # data contains junction reads
## Junctions do NOT contribute to the coverage:
read1 <- reads[which(njunc(reads) != 0L)[1]] # 1st read with a junction
read1 # cigar shows a "skipped region" of length 15306
grglist(read1)[[1]] # the junction is between pos 4500 and 19807
coverage(read1)$chr4 # junction is not covered
## Sanity checks:
cvg <- coverage(reads)
read_chunks <- unlist(grglist(reads), use.names=FALSE)
read_chunks_per_chrom <- split(read_chunks, seqnames(read_chunks))
stopifnot(identical(sum(cvg), sum(width(read_chunks_per_chrom))))
galist <- readGAlignmentsList(untreated3_chr4())
stopifnot(identical(cvg, coverage(galist)))
encodeOverlaps_methods()
Encode the overlaps between RNA-seq reads and the transcripts of a gene model
Description
In the context of an RNA-seq experiment, encoding the overlaps between the aligned reads and the transcripts of a given gene model can be used for detecting those overlaps that are compatible with the splicing of the transcript.
The central tool for this is the encodeOverlaps
method for
GRangesList objects, which computes the "overlap
encodings" between a query
and a subject
, both list-like
objects with list elements containing multiple ranges.
Other related utilities are also documented in this man page.
Usage
encodeOverlaps(query, subject, hits=NULL, ...)
list(list("encodeOverlaps"), list("GRangesList,GRangesList"))(query, subject, hits=NULL,
flip.query.if.wrong.strand=FALSE)
## Related utilities:
flipQuery(x, i)
selectEncodingWithCompatibleStrand(ovencA, ovencB,
query.strand, subject.strand, hits=NULL)
isCompatibleWithSkippedExons(x, max.skipped.exons=NA)
extractSteppedExonRanks(x, for.query.right.end=FALSE)
extractSpannedExonRanks(x, for.query.right.end=FALSE)
extractSkippedExonRanks(x, for.query.right.end=FALSE)
extractQueryStartInTranscript(query, subject, hits=NULL, ovenc=NULL,
flip.query.if.wrong.strand=FALSE,
for.query.right.end=FALSE)
Arguments
Argument | Description |
---|---|
query, subject | Typically GRangesList objects representing the the aligned reads and the transcripts of a given gene model, respectively. If the 2 objects don't have the same length, and if the hits argument is not supplied, then the shortest is recycled to the length of the longest (the standard recycling rules apply). More generally speaking, query and subject must be list-like objects with list elements containing multiple ranges e.g. IntegerRangesList or GRangesList objects. |
hits | An optional Hits object typically obtained from a previous call to findOverlaps . Strictly speaking, hits only needs to be compatible with query and subject , that is, queryLength and subjectLength must be equal to length(query) and length(subject) , respectively. Supplying hits is a convenient way to do encodeOverlaps(query[queryHits(hits)], subject[subjectHits(hits)]) , that is, calling encodeOverlaps(query, subject, hits) is equivalent to the above, but is much more efficient, especially when query and/or subject are big. Of course, when hits is supplied, query and subject are not expected to have the same length anymore. |
... | Additional arguments for methods. |
flip.query.if.wrong.strand | See the "OverlapEncodings" vignette located in this package ( GenomicAlignments ). |
x | For flipQuery : a GRangesList object. For isCompatibleWithSkippedExons , extractSteppedExonRanks , extractSpannedExonRanks , and extractSkippedExonRanks : an OverlapEncodings object, a factor, or a character vector. |
i | Subscript specifying the elements in x to flip. If missing, all the elements are flipped. |
ovencA, ovencB, ovenc | OverlapEncodings objects. |
query.strand, subject.strand | Vector-like objects containing the strand of the query and subject, respectively. |
max.skipped.exons | Not supported yet. If NA (the default), the number of skipped exons must be 1 or more (there is no max). |
for.query.right.end | If TRUE , then the information reported in the output is for the right ends of the paired-end reads. Using for.query.right.end=TRUE with single-end reads is an error. |
Details
See ?OverlapEncodings
for a short introduction to "overlap encodings".
The topic of working with overlap encodings is covered in details
in the "OverlapEncodings" vignette located this package
( GenomicAlignments ) and accessible with
vignette("OverlapEncodings")
.
Value
For encodeOverlaps
: An OverlapEncodings object.
If hits
is not supplied, this object is parallel to the
longest of query
and subject
, that is, it has the length
of the longest and the i-th encoding in it corresponds to the i-th element
in the longest.
If hits
is supplied, then the returned object is parallel
to it, that is, it has one encoding per hit.
For flipQuery
: TODO
For selectEncodingWithCompatibleStrand
: TODO
For isCompatibleWithSkippedExons
: A logical vector parallel
to x
.
For extractSteppedExonRanks
, extractSpannedExonRanks
, and
extractSkippedExonRanks
: TODO
For extractQueryStartInTranscript
: TODO
Seealso
The OverlapEncodings class for a brief introduction to "overlap encodings".
The Hits class defined and documented in the S4Vectors package.
The "OverlapEncodings" vignette in this package.
findCompatibleOverlaps
for a specialized version offindOverlaps
that usesencodeOverlaps
internally to keep only the hits where the junctions in the aligned read are compatible with the splicing of the annotated transcript.The GRangesList class defined and documented in the GenomicRanges package.
The
findOverlaps
generic function defined in the IRanges package.
Author
Hervé Pagès
Examples
## ---------------------------------------------------------------------
## A. BETWEEN 2 IntegerRangesList OBJECTS
## ---------------------------------------------------------------------
## In the context of an RNA-seq experiment, encoding the overlaps
## between 2 GRangesList objects, one containing the reads (the query),
## and one containing the transcripts (the subject), can be used for
## detecting hits between reads and transcripts that are "compatible"
## with the splicing of the transcript. Here we illustrate this with 2
## IntegerRangesList objects, in order to keep things simple:
## 4 aligned reads in the query:
read1 <- IRanges(c(7, 15, 22), c(9, 19, 23)) # 2 junctions
read2 <- IRanges(c(5, 15), c(9, 17)) # 1 junction
read3 <- IRanges(c(16, 22), c(19, 24)) # 1 junction
read4 <- IRanges(c(16, 23), c(19, 24)) # 1 junction
query <- IRangesList(read1, read2, read3, read4)
## 1 transcript in the subject:
tx <- IRanges(c(1, 4, 15, 22, 38), c(2, 9, 19, 25, 47)) # 5 exons
subject <- IRangesList(tx)
## Encode the overlaps:
ovenc <- encodeOverlaps(query, subject)
ovenc
encoding(ovenc)
## ---------------------------------------------------------------------
## B. BETWEEN 2 GRangesList OBJECTS
## ---------------------------------------------------------------------
## With real RNA-seq data, the reads and transcripts will typically be
## stored in GRangesList objects. Please refer to the "OverlapEncodings"
## vignette in this package for realistic examples.
findCompatibleOverlaps_methods()
Finding hits between reads and transcripts that are compatible with the splicing of the transcript
Description
In the context of an RNA-seq experiment, findCompatibleOverlaps
(or countCompatibleOverlaps
) can be used for finding (or counting)
hits between reads and transcripts that are compatible
with the splicing of the transcript.
Usage
findCompatibleOverlaps(query, subject)
countCompatibleOverlaps(query, subject)
Arguments
Argument | Description |
---|---|
query | A GAlignments or GAlignmentPairs object representing the aligned reads. |
subject | A GRangesList object representing the transcripts. |
Details
findCompatibleOverlaps
is a specialized version of
findOverlaps
that uses
encodeOverlaps
internally to keep only
the hits where the junctions in the aligned read are compatible
with the splicing of the annotated transcript.
The topic of working with overlap encodings is covered in details
in the "OverlapEncodings" vignette located this package
( GenomicAlignments ) and accessible with
vignette("OverlapEncodings")
.
Value
A Hits object for findCompatibleOverlaps
.
An integer vector parallel to (i.e. same length as) query
for countCompatibleOverlaps
.
Seealso
The
findOverlaps
generic function defined in the IRanges package.The
encodeOverlaps
generic function and OverlapEncodings class.The "OverlapEncodings" vignette in this package.
GAlignments and GAlignmentPairs objects.
GRangesList objects in the GenomicRanges package.
Author
Hervé Pagès
Examples
## Here we only show a simple example illustrating the use of
## countCompatibleOverlaps() on a very small data set. Please
## refer to the "OverlapEncodings" vignette in the GenomicAlignments
## package for a comprehensive presentation of "overlap
## encodings" and related tools/concepts (e.g. "compatible"
## overlaps, "almost compatible" overlaps etc...), and for more
## examples.
## sm_treated1.bam contains a small subset of treated1.bam, a BAM
## file containing single-end reads from the "Pasilla" experiment
## (RNA-seq, Fly, see the pasilla data package for the details)
## and aligned to reference genome BDGP Release 5 (aka dm3 genome on
## the UCSC Genome Browser):
sm_treated1 <- system.file("extdata", "sm_treated1.bam",
package="GenomicAlignments", mustWork=TRUE)
## Load the alignments:
flag0 <- scanBamFlag(isDuplicate=FALSE, isNotPassingQualityControls=FALSE)
param0 <- ScanBamParam(flag=flag0)
gal <- readGAlignments(sm_treated1, use.names=TRUE, param=param0)
## Load the transcripts (IMPORTANT: Like always, the reference genome
## of the transcripts must be *exactly* the same as the reference
## genome used to align the reads):
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
exbytx <- exonsBy(txdb, by="tx", use.names=TRUE)
## Number of "compatible" transcripts per alignment in 'gal':
gal_ncomptx <- countCompatibleOverlaps(gal, exbytx)
mcols(gal)$ncomptx <- gal_ncomptx
table(gal_ncomptx)
mean(gal_ncomptx >= 1)
## --> 33% of the alignments in 'gal' are "compatible" with at least
## 1 transcript in 'exbytx'.
## Keep only alignments compatible with at least 1 transcript in
## 'exbytx':
compgal <- gal[gal_ncomptx >= 1]
head(compgal)
findMateAlignment()
Pairing the elements of a GAlignments object
Description
Utilities for pairing the elements of a GAlignments object.
NOTE: Until BioC 2.13, findMateAlignment
was the power horse used
by readGAlignmentPairs
for pairing the records loaded
from a BAM file containing aligned paired-end reads.
Starting with BioC 2.14, readGAlignmentPairs
relies
on scanBam
for the
pairing.
Usage
findMateAlignment(x)
makeGAlignmentPairs(x, use.names=FALSE, use.mcols=FALSE, strandMode=1)
## Related low-level utilities:
getDumpedAlignments()
countDumpedAlignments()
flushDumpedAlignments()
Arguments
Argument | Description |
---|
|x
| A named GAlignments object with metadata columns flag
, mrnm
, and mpos
. Typically obtained by loading aligned paired-end reads from a BAM file with: list("
", " param <- ScanBamParam(what=c("flag", "mrnm", "mpos"))
", " x <- readGAlignments(..., use.names=TRUE, param=param)
", " ") |
|use.names
| Whether the names on the input object should be propagated to the returned object or not. |
|use.mcols
| Names of the metadata columns to propagate to the returned GAlignmentPairs object. |
|strandMode
| Strand mode to set on the returned GAlignmentPairs object. See ?
for more information. |
Details
list(list("Pairing algorithm used by findMateAlignment"), list(" ", list("findMateAlignment"), " is the power horse used by ", list("makeGAlignmentPairs"), " ", "for pairing the records loaded from a BAM file containing aligned paired-end ", "reads. ", " ", "It implements the following pairing algorithm: ", list(" ", " ", list(), " First, only records with flag bit 0x1 (multiple segments) set to 1, ", " flag bit 0x4 (segment unmapped) set to 0, and flag bit 0x8 (next ", " segment in the template unmapped) set to 0, are candidates for ",
" pairing (see the SAM Spec for a description of flag bits and fields).
", " ", list("findMateAlignment"), " will ignore any other record. That is, ", " records that correspond to single-end reads, or records that ", " correspond to paired-end reads where one or both ends are unmapped, ", " are discarded. ", " ", " ", list(), " Then the algorithm looks at the following fields and flag bits: ", " ", list(" ", " ", list(), " (A) QNAME ",
" ", list(), " (B) RNAME, RNEXT
", " ", list(), " (C) POS, PNEXT ", " ", list(), " (D) Flag bits Ox10 (segment aligned to minus strand) ", " and 0x20 (next segment aligned to minus strand) ", " ", list(), " (E) Flag bits 0x40 (first segment in template) and 0x80 (last ", " segment in template) ", " ", list(), " (F) Flag bit 0x2 (proper pair) ", " ", list(), " (G) Flag bit 0x100 (secondary alignment) ",
" "), "
", " 2 records rec1 and rec2 are considered mates iff all the following ", " conditions are satisfied: ", " ", list(" ", " ", list(), " (A) QNAME(rec1) == QNAME(rec2) ", " ", list(), " (B) RNEXT(rec1) == RNAME(rec2) and RNEXT(rec2) == RNAME(rec1) ", " ", list(), " (C) PNEXT(rec1) == POS(rec2) and PNEXT(rec2) == POS(rec1) ", " ", list(), " (D) Flag bit 0x20 of rec1 == Flag bit 0x10 of rec2 and ", " Flag bit 0x20 of rec2 == Flag bit 0x10 of rec1 ",
" ", list(), " (E) rec1 corresponds to the first segment in the template and
", " rec2 corresponds to the last segment in the template, OR, ", " rec2 corresponds to the first segment in the template and ", " rec1 corresponds to the last segment in the template ", " ", list(), " (F) rec1 and rec2 have same flag bit 0x2 ", " ", list(), " (G) rec1 and rec2 have same flag bit 0x100 ", " "), " "),
"
"))
| list(list("Timing and memory requirement of the pairing algorithm"), list(" ", "The estimated timings and memory requirements on a modern Linux system are ", "(those numbers may vary depending on your hardware and OS): ", list(" ", " nb of alignments | time | required memory ", " -----------------+--------------+---------------- ", " 8 millions | 28 sec | 1.4 GB ", " 16 millions | 58 sec | 2.8 GB ", " 32 millions | 2 min | 5.6 GB ", | | " 64 millions | 4 min 30 sec | 11.2 GB "), " ", "This is for a ", list("GAlignments"), " object coming from a file with an ", ""average nb of records per unique QNAME" of 2.04. A value of 2 (which means ", "the file contains only primary reads) is optimal for the pairing algorithm. ", "A greater value, say > 3, will significantly degrade its performance. ", "An easy way to avoid this degradation is to load only primary alignments ", "by setting the ", list("isSecondaryAlignment"), |
" flag to ", list("FALSE"), " in
", "ScanBamParam(). See examples in ", list("?", list("readGAlignmentPairs")), " for how ", "to do this. "))
list(list("Ambiguous pairing"), list(" ", "The above algorithm will find almost all pairs unambiguously, even when ", "the same pair of reads maps to several places in the genome. Note ", "that, when a given pair maps to a single place in the genome, looking ", "at (A) is enough to pair the 2 corresponding records. The additional ", "conditions (B), (C), (D), (E), (F), and (G), are only here to help in ", "the situation where more than 2 records share the same QNAME. And that ", "works most of the times. Unfortunately there are still situations where ",
"this is not enough to solve the pairing problem unambiguously.
", "
", "For example, here are 4 records (loaded in a GAlignments object)
", "that cannot be paired with the above algorithm:
", "
", "Showing the 4 records as a GAlignments object of length 4:
", list("
", "GAlignments with 4 alignments and 2 metadata columns:
", " seqnames strand cigar qwidth start end
", "
" PNEXT: the mpos col
"), " ", "As you can see, the aligner has aligned the same pair to the same ", "location twice! The only difference between the 2 aligned pairs is in ", "the CIGAR i.e. one end of the pair is aligned twice to the same location ", "with exactly the same CIGAR while the other end of the pair is aligned ", "twice to the same location but with slightly different CIGARs. ", " ", "Now showing the corresponding flag bits: ", list(" ", " isPaired isProperPair isUnmappedQuery hasUnmappedMate isMinusStrand ",
"[1,] 1 1 0 0 0
", "[2,] 1 1 0 0 0 ", "[3,] 1 1 0 0 1 ", "[4,] 1 1 0 0 1 ", " isMateMinusStrand isFirstMateRead isSecondMateRead isSecondaryAlignment ", "[1,] 1 0 1 0 ", "[2,] 1 0 1 0 ",
"[3,] 0 1 0 0
", "[4,] 0 1 0 0 ", " isNotPassingQualityControls isDuplicate ", "[1,] 0 0 ", "[2,] 0 0 ", "[3,] 0 0 ", "[4,] 0 0 "), " ", " ", "As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4) ",
"are both first mates. But looking at (A), (B), (C), (D), (E), (F), and (G),
", "the pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could ", "be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to ", "disambiguate! ", " ", "So ", list("findMateAlignment"), " is just ignoring (with a warning) those alignments ", "with ambiguous pairing, and dumping them in a place from which they can be ", "retrieved later (i.e. after ", list("findMateAlignment"), " has returned) for ",
"further examination (see "Dumped alignments" subsection below for the details).
", "In other words, alignments that cannot be paired unambiguously are not paired ", "at all. Concretely, this means that ", list(list("readGAlignmentPairs")), " is ", "guaranteed to return a ", list("GAlignmentPairs"), " object ", "where every pair was formed in an non-ambiguous way. Note that, in practice, ", "this approach doesn't seem to leave aside a lot of records because ambiguous ", "pairing events seem pretty rare. "))
list(list("Dumped alignments"), list(" ", "Alignments with ambiguous pairing are dumped in a place ("the dump ", "environment") from which they can be retrieved with ", list("getDumpedAlignments()"), " after ", list("findMateAlignment"), " has returned. ", " ", "Two additional utilities are provided for manipulation of the dumped ", "alignments: ", list("countDumpedAlignments"), " for counting them (a fast equivalent ", "to ", list("length(getDumpedAlignments())"), "), and ", list("flushDumpedAlignments"),
" to
", "flush "the dump environment". Note that "the dump environment" is ", "automatically flushed at the beginning of a call to ", list("findMateAlignment"), ". "))
Value
For findMateAlignment
: An integer vector of the same length as
x
, containing only positive or NA values, where the i-th element
is interpreted as follow:
An NA value means that no mate or more than 1 mate was found for
x[i]
.A non-NA value j gives the index in
x
ofx[i]
's mate.
For makeGAlignmentPairs
: A GAlignmentPairs object where the
pairs are formed internally by calling findMateAlignment
on x
.
For getDumpedAlignments
: NULL
or a GAlignments object
containing the dumped alignments. See "Dumped alignments" subsection in
the "Details" section above for the details.
For countDumpedAlignments
: The number of dumped alignments.
Nothing for flushDumpedAlignments
.
Seealso
GAlignments and GAlignmentPairs objects.
Author
Hervé Pagès
Examples
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
param <- ScanBamParam(what=c("flag", "mrnm", "mpos"))
x <- readGAlignments(bamfile, use.names=TRUE, param=param)
mate <- findMateAlignment(x)
head(mate)
table(is.na(mate))
galp0 <- makeGAlignmentPairs(x)
galp <- makeGAlignmentPairs(x, use.name=TRUE, use.mcols="flag")
galp
colnames(mcols(galp))
colnames(mcols(first(galp)))
colnames(mcols(last(galp)))
findOverlaps_methods()
Finding overlapping genomic alignments
Description
Finds range overlaps between a GAlignments , GAlignmentPairs , or GAlignmentsList object, and another range-based object.
NOTE: The findOverlaps
generic function and methods
for IntegerRanges and IntegerRangesList
objects are defined and documented in the IRanges package.
The methods for GRanges and
GRangesList objects are defined and
documented in the GenomicRanges package.
GAlignments , GAlignmentPairs , and GAlignmentsList
objects also support countOverlaps
, overlapsAny
, and
subsetByOverlaps
thanks to the default methods defined in the
IRanges package and to the findOverlaps
method defined in
this package and documented below.
Usage
list(list("findOverlaps"), list("GAlignments,GAlignments"))(query, subject,
maxgap=-1L, minoverlap=0L,
type=c("any", "start", "end", "within"),
select=c("all", "first", "last", "arbitrary"),
ignore.strand=FALSE)
Arguments
Argument | Description |
---|---|
query, subject | A GAlignments , GAlignmentPairs , or GAlignmentsList object for either query or subject . A vector-like object containing ranges for the other one. |
maxgap, minoverlap, type, select | See ? in the IRanges package for a description of these arguments. |
ignore.strand | When set to TRUE , the strand information is ignored in the overlap calculations. |
Details
When the query or the subject (or both) is a GAlignments
object, it is first turned into a GRangesList object (with
as( , "GRangesList")
) and then the rules described previously
apply. GAlignmentsList objects are coerced to GAlignments
then to a GRangesList . Feature indices are mapped back to the
original GAlignmentsList list elements.
When the query is a GAlignmentPairs object, it is first
turned into a GRangesList object (with as( , "GRangesList")
)
and then the rules described previously apply.
Value
A Hits object when select="all"
or an integer
vector otherwise.
Seealso
Examples
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
galn <- readGAlignments(ex1_file)
subject <- granges(galn)[1]
## Note the absence of query no. 9 (i.e. 'galn[9]') in this result:
as.matrix(findOverlaps(galn, subject))
## This is because, by default, findOverlaps()/countOverlaps() are
## strand specific:
galn[8:10]
countOverlaps(galn[8:10], subject)
countOverlaps(galn[8:10], subject, ignore.strand=TRUE)
## Count alignments in 'galn' that DO overlap with 'subject' vs those
## that do NOT:
table(overlapsAny(galn, subject))
## Extract those that DO:
subsetByOverlaps(galn, subject)
## GAlignmentsList
galist <- GAlignmentsList(galn[8:10], galn[3000:3002])
gr <- GRanges(c("seq1", "seq1", "seq2"),
IRanges(c(15, 18, 1233), width=1),
strand=c("-", "+", "+"))
countOverlaps(galist, gr)
countOverlaps(galist, gr, ignore.strand=TRUE)
findOverlaps(galist, gr)
findOverlaps(galist, gr, ignore.strand=TRUE)
findSpliceOverlaps_methods()
Classify ranges (reads) as compatible with existing genomic annotations or as having novel splice events
Description
The findSpliceOverlaps
function identifies ranges (reads) that are
compatible with a specific transcript isoform. The non-compatible ranges are
analyzed for the presence of novel splice events.
Usage
findSpliceOverlaps(query, subject, ignore.strand=FALSE, ...)
list(list("findSpliceOverlaps"), list("GRangesList,GRangesList"))(query, subject, ignore.strand=FALSE, ..., cds=NULL)
list(list("findSpliceOverlaps"), list("GAlignments,GRangesList"))(query, subject, ignore.strand=FALSE, ..., cds=NULL)
list(list("findSpliceOverlaps"), list("GAlignmentPairs,GRangesList"))(query, subject, ignore.strand=FALSE, ..., cds=NULL)
list(list("findSpliceOverlaps"), list("BamFile,ANY"))(query, subject, ignore.strand=FALSE, ...,
param=ScanBamParam(), singleEnd=TRUE)
Arguments
Argument | Description |
---|---|
query | A GRangesList , GAlignments , GAlignmentPairs , or BamFile object containing the reads. Can also be a single string containing the path to a BAM file. Single or paired-end reads are specified with the singleEnd argument (default FALSE). Paired-end reads can be supplied in a BAM file or GAlignmentPairs object. Single-end are expected to be in a BAM file, GAlignments or GRanges object. |
subject | A GRangesList containing the annotations. This list is expected to contain exons grouped by transcripts. |
ignore.strand | When set to TRUE , strand information is ignored in the overlap calculations. |
... | Additional arguments such as param and singleEnd used in the method for BamFile objects. See below. |
cds | Optional GRangesList of coding regions for each transcript in the subject . If provided, the "coding" output column will be a logical vector indicating if the read falls in a coding region. When not provided, the "coding" output is NA . |
param | An optional ScanBamParam instance to further influence scanning, counting, or filtering. |
singleEnd | A logical value indicating if reads are single or paired-end. See summarizeOverlaps for more information. |
Details
When a read maps compatibly and uniquely to a transcript isoform we can quantify the expression and look for shifts in the balance of isoform expression. If a read does not map in compatible way, novel splice events such as splice junctions, novel exons or retentions can be quantified and compared across samples.
findSpliceOverlaps
detects which reads (query) match to
transcripts (subject) in a compatible fashion. Compatibility is based
on both the transcript bounds and splicing pattern. Assessing the
splicing pattern involves comparision of the read splices (i.e., the
N operations in the CIGAR) with the transcript introns. For paired-end
reads, the inter-read gap is not considered a splice junction. The analysis
of non-compatible reads for novel splice events is under construction.
Value
The output is a Hits object with
the metadata columns defined below. Each column is a logical
indicating if the read (query) met the criteria.
list("compatible: ") list("Every splice (N) in a read alignment matches ", " an intron in an annotated transcript. The read does not ", " extend into an intron or outside the transcript bounds. ", " ")
list("unique: ") list("The read is compatible with only one annotated ", " transcript. ", " ")
list("strandSpecific: ") list("The query (read) was stranded. ", " ")
Seealso
GRangesList objects in the GenomicRanges package.
GAlignments and GAlignmentPairs objects.
BamFile objects in the Rsamtools package.
Note
WARNING: The current implementation of findSpliceOverlaps
doesn't
work properly on paired-end reads where the 2 ends overlap!
Author
Michael Lawrence and Valerie Obenchain Valerie.Obenchain@RoswellPark.org
Examples
## -----------------------------------------------------------------------
## Isoform expression :
## -----------------------------------------------------------------------
## findSpliceOverlaps() can assist in quantifying isoform expression
## by identifying reads that map compatibly and uniquely to a
## transcript isoform.
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
library(pasillaBamSubset)
se <- untreated1_chr4() ## single-end reads
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
exbytx <- exonsBy(txdb, "tx")
cdsbytx <- cdsBy(txdb, "tx")
param <- ScanBamParam(which=GRanges("chr4", IRanges(1e5,3e5)))
sehits <- findSpliceOverlaps(se, exbytx, cds=cdsbytx, param=param)
## Tally the reads by category to get an idea of read distribution.
lst <- lapply(mcols(sehits), table)
nms <- names(lst)%in% c("compatible", "unique")
tbl <- do.call(rbind, lst[nms])
tbl
## Reads compatible with one or more transcript isoforms.
rnms <- rownames(tbl)
tbl[rnms == "compatible","TRUE"]/sum(tbl[rnms == "compatible",])
## Reads compatible with a single isoform.
tbl[rnms == "unique","TRUE"]/sum(tbl[rnms == "unique",])
## All reads fall in a coding region as defined by
## the txdb annotation.
lst[["coding"]]
## Check : Total number of reads should be the same across categories.
lapply(lst, sum)
## -----------------------------------------------------------------------
## Paired-end reads :
## -----------------------------------------------------------------------
## 'singleEnd' is set to FALSE for a BAM file with paired-end reads.
pe <- untreated3_chr4()
hits2 <- findSpliceOverlaps(pe, exbytx, singleEnd=FALSE, param=param)
## In addition to BAM files, paired-end reads can be supplied in a
## GAlignmentPairs object.
genes <- GRangesList(
GRanges("chr1", IRanges(c(5, 20), c(10, 25)), "+"),
GRanges("chr1", IRanges(c(5, 22), c(15, 25)), "+"))
galp <- GAlignmentPairs(
GAlignments("chr1", 5L, "11M4N6M", strand("+")),
GAlignments("chr1", 50L, "6M", strand("-")))
findSpliceOverlaps(galp, genes)
intra_range_methods()
Intra range transformations of a GAlignments or GAlignmentsList object
Description
This man page documents intra range transformations of a GAlignments or GAlignmentsList object.
See ?`` and
?in the IRanges package for a quick introduction to intra range and inter range transformations. Intra range methods for [GRanges](#granges) and [GRangesList](#grangeslist) objects are defined and documented in the GenomicRanges package. ## Usage ```r list(list("qnarrow"), list("GAlignments"))(x, start=NA, end=NA, width=NA) list(list("qnarrow"), list("GAlignmentsList"))(x, start=NA, end=NA, width=NA) ``` ## Arguments |Argument |Description| |------------- |----------------| |`x` | A [GAlignments](#galignments) or [GAlignmentsList](#galignmentslist) object. | |`start, end, width` | Vectors of integers. NAs and negative values are accepted and "solved" according to the rules of the SEW (Start/End/Width) interface (see `?` for more information about the SEW interface). See `?
for more information about the start
, end
, and width
arguments. |
Details
- () list(" ", " ", list("qnarrow"), " on a ", list("GAlignments"), " object behaves like ", list("narrow"), " ", " except that the ", list("start"), "/", list("end"), "/", list("width"), " arguments here ", " specify the narrowing with respect to the query sequences instead of ", " the reference sequences. ", " ", " ", list("qnarrow"), " on a ", list("GAlignmentsList"), " object ", " returns a ", list("GAlignmentsList"), " object. ", " ")
Value
An object of the same class as -- and parallel to (i.e. same length
and names as) -- the original object x
.
Seealso
GAlignments and GAlignmentsList objects.
The intra-range-methods man page in the IRanges package.
The intra-range-methods man page in the GenomicRanges package.
Note
There is no difference between narrow
and qnarrow
when
all the alignments have a simple CIGAR (i.e. no indels or junctions).
Author
Hervé Pagès and V. Obenchain Valerie.Obenchain@RoswellPark.org
Examples
## ---------------------------------------------------------------------
## A. ON A GAlignments OBJECT
## ---------------------------------------------------------------------
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what=c("seq", "qual"))
gal <- readGAlignments(ex1_file, param=param)
gal
## This trims 3 nucleotides on the left and 5 nucleotides on the right
## of each alignment:
gal2 <- qnarrow(gal, start=4, end=-6)
gal2
## Note that the 'start' and 'end' values are relative to the query
## sequences and specify the query substring that must be kept for each
## alignment. Negative values are relative to the right end of the query
## sequence.
## Also note that the metadata columns on 'gal' are propagated as-is so
## the "seq" and "qual" matadata columns must be adjusted "by hand" with
## narrow();
mcols(gal2)$seq <- narrow(mcols(gal)$seq, start=4, end=-6)
mcols(gal2)$qual <- narrow(mcols(gal)$qual, start=4, end=-6)
gal2
## Sanity checks:
stopifnot(identical(qwidth(gal2), width(mcols(gal2)$seq)))
stopifnot(identical(qwidth(gal2), width(mcols(gal2)$qual)))
## ---------------------------------------------------------------------
## B. ON A GAlignmentsList OBJECT
## ---------------------------------------------------------------------
gal1 <- GAlignments(
seqnames=Rle(factor(c("chr1", "chr2", "chr1", "chr3")),
c(1, 3, 2, 4)),
pos=1:10, cigar=paste0(10:1, "M"),
strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
names=head(letters, 10), score=1:10)
gal2 <- GAlignments(
seqnames=Rle(factor(c("chr2", "chr4")), c(3, 4)), pos=1:7,
cigar=c("5M", "3M2N3M2N3M", "5M", "10M", "5M1N4M", "8M2N1M", "5M"),
strand=Rle(strand(c("-", "+")), c(4, 3)),
names=tail(letters, 7), score=1:7)
galist <- GAlignmentsList(noGaps=gal1, Gaps=gal2)
galist
qnarrow(galist)
junctions_methods()
Extract junctions from genomic alignments
Description
Given an object x
containing genomic alignments (e.g. a
GAlignments , GAlignmentPairs , or GAlignmentsList
object), junctions(x)
extracts the junctions from it and
summarizeJunctions(x)
extracts and summarizes them.
readTopHatJunctions
and readSTARJunctions
are utilities
for importing the junction file generated by the TopHat and STAR
aligners, respectively.
Usage
## junctions() generic and methods
## -------------------------------
junctions(x, use.mcols=FALSE, ...)
list(list("junctions"), list("GAlignments"))(x, use.mcols=FALSE)
list(list("junctions"), list("GAlignmentPairs"))(x, use.mcols=FALSE)
list(list("junctions"), list("GAlignmentsList"))(x, use.mcols=FALSE, ignore.strand=FALSE)
## summarizeJunctions() and NATURAL_INTRON_MOTIFS
## ----------------------------------------------
summarizeJunctions(x, with.revmap=FALSE, genome=NULL)
NATURAL_INTRON_MOTIFS
## Utilities for importing the junction file generated by some aligners
## --------------------------------------------------------------------
readTopHatJunctions(file, file.is.raw.juncs=FALSE)
readSTARJunctions(file)
Arguments
Argument | Description |
---|---|
x | A GAlignments , GAlignmentPairs , or GAlignmentsList object. |
use.mcols | TRUE or FALSE (the default). Whether the metadata columns on x (accessible with mcols(x) ) should be propagated to the returned object or not. |
... | Additional arguments, for use in specific methods. |
ignore.strand | TRUE or FALSE (the default). If set to TRUE , then the strand of x is set to "*" prior to any computation. |
with.revmap | TRUE or FALSE (the default). If set to TRUE , then a revmap metadata column is added to the output of summarizeJunctions . This metadata column is an IntegerList object representing the mapping from each element in the ouput (i.e. each junction) to the corresponding elements in the input x . |
genome | NULL (the default), or a BSgenome object containing the sequences of the reference genome that was used to align the reads, or the name of this reference genome specified in a way that is accepted by the getBSgenome function defined in the BSgenome software package. In that case the corresponding BSgenome data package needs to be already installed (see ? in the BSgenome package for the details). If genome is supplied, then the intron_motif and intron_strand metadata columns are computed (based on the dinucleotides found at the intron boundaries) and added to the output of summarizeJunctions . See the Value section below for a description of these metadata columns. |
file | The path (or a connection) to the junction file generated by the aligner. This file should be the junctions.bed or new_list.juncs file for readTopHatJunctions , and the SJ.out.tab file for readSTARJunctions . |
file.is.raw.juncs | TRUE or FALSE (the default). If set to TRUE , then the input file is assumed to be a TopHat .juncs file instead of the junctions.bed file generated by TopHat. A TopHat .juncs file can be obtained by passing the junctions.bed file thru TopHat's bed_to_juncs script. See the TopHat manual at http://tophat.cbcb.umd.edu/manual.shtml for more information. |
Details
An N operation in the CIGAR of a genomic alignment is interpreted as a
junction. junctions(x)
will return the genomic ranges of all
junctions found in x
.
More precisely, on a GAlignments object x
,
junctions(x)
is equivalent to:
list(" psetdiff(granges(x), grglist(x, order.as.in.query=TRUE))
")
On a GAlignmentPairs object x
, it's equivalent to (but
faster than):
list(" mendoapply(c, junctions(first(x, real.strand=TRUE)),
", " junctions(last(x, real.strand=TRUE)))
")
Note that starting with BioC 3.2, the behavior of junctions
on a
GAlignmentPairs object has been slightly modified so that the
returned ranges now have the list("real strand") set on them. See the
documentation of the real.strand
argument in the man page of
GAlignmentPairs objects for more information.
NATURAL_INTRON_MOTIFS
is a predefined character vector containing
the 5 natural intron motifs described at
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC84117/ .
Value
junctions(x)
returns the genomic ranges of the junctions in a
GRangesList object parallel to x
(i.e. with 1 list element per element in x
).
If x
has names on it, they're propagated to the returned object.
If use.mcols
is TRUE and x
has metadata columns on it
(accessible with mcols(x)
), they're propagated to the returned object.
summarizeJunctions
returns the genomic ranges of the unique
junctions in x
in an unstranded GRanges
object with the following metadata columns:
score
: The total number of alignments crossing each junction, i.e. that have the junction encoded in their CIGAR.plus_score
andminus_score
: The strand-specific number of alignments crossing each junction.revmap
: [Only ifwith.revmap
was set toTRUE
.] An IntegerList object representing the mapping from each element in the ouput (i.e. each junction) to the corresponding elements in inputx
.intron_motif
andintron_strand
: [Only ifgenome
was supplied.] The intron motif and strand for each junction, based on the dinucleotides found in the genome sequences at the intron boundaries. Theintron_motif
metadata column is a factor whose levels are the 5 natural intron motifs stored in predefined character vectorNATURAL_INTRON_MOTIFS
. If the dinucleotides found at the intron boundaries don't match any of these natural intron motifs, thenintron_motif
andintron_strand
are set toNA
and*
, respectively.readTopHatJunctions
andreadSTARJunctions
return the junctions reported in the input file in a stranded GRanges object. With the following metadata columns forreadTopHatJunctions
(when reading in the junctions.bed file):name
: An id assigned by TopHat to each junction. This id is of the form JUNC00000017 and is unique within the junctions.bed file.score
: The total number of alignments crossing each junction.
With the following metadata columns forreadSTARJunctions
:intron_motif
andintron_strand
: The intron motif and strand for each junction, based on the code found in the input file (0: non-canonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT). Note that of the 5 natural intron motifs stored in predefined character vectorNATURAL_INTRON_MOTIFS
, only the first 3 are assigned codes by the STAR software (2 codes per motif, one if the intron is on the plus strand and one if it's on the minus strand). Thus theintron_motif
metadata column is a factor with only 3 levels. If code is 0, thenintron_motif
andintron_strand
are set toNA
and*
, respectively.um_reads
: The number of uniquely mapping reads crossing the junction (a pair where the 2 alignments cross the same junction is counted only once).mm_reads
: The number of multi-mapping reads crossing the junction (a pair where the 2 alignments cross the same junction is counted only once).
See STAR manual at https://code.google.com/p/rna-star/ for more information.
Seealso
The
readGAlignments
andreadGAlignmentPairs
functions for reading genomic alignments from a BAM file.GAlignments , GAlignmentPairs , and GAlignmentsList objects.
GRanges and GRangesList objects implemented and documented in the GenomicRanges package.
IntegerList objects implemented and documented in the IRanges package.
The
getBSgenome
function in the BSgenome package, for searching the installed BSgenome data packages for the specified genome and returning it as a BSgenome object.The
extractList
function in the IRanges package, for extracting groups of elements from a vector-like object and returning them into a List object.
Author
Hervé Pagès
References
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC84117/ for the 5 natural
intron motifs stored in predefined character vector
NATURAL_INTRON_MOTIFS
.
TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
TopHat2 paper: http://genomebiology.com/2013/14/4/r36
TopHat2 software and manual: http://tophat.cbcb.umd.edu/
STAR: ultrafast universal RNA-seq aligner
STAR paper: http://bioinformatics.oxfordjournals.org/content/early/2012/10/25/bioinformatics.bts635
STAR software and manual: https://code.google.com/p/rna-star/
Examples
library(RNAseqData.HNRNPC.bam.chr14)
bamfile <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
## ---------------------------------------------------------------------
## A. junctions()
## ---------------------------------------------------------------------
gal <- readGAlignments(bamfile)
table(njunc(gal)) # some alignments have 3 junctions!
juncs <- junctions(gal)
juncs
stopifnot(identical(unname(elementNROWS(juncs)), njunc(gal)))
galp <- readGAlignmentPairs(bamfile)
juncs <- junctions(galp)
juncs
stopifnot(identical(unname(elementNROWS(juncs)), njunc(galp)))
## ---------------------------------------------------------------------
## B. summarizeJunctions()
## ---------------------------------------------------------------------
## By default, only the "score", "plus_score", and "minus_score"
## metadata columns are returned:
junc_summary <- summarizeJunctions(gal)
junc_summary
## The "score" metadata column reports the total number of alignments
## crossing each junction, i.e., that have the junction encoded in their
## CIGAR:
median(mcols(junc_summary)$score)
## The "plus_score" and "minus_score" metadata columns report the
## strand-specific number of alignments crossing each junction:
stopifnot(identical(mcols(junc_summary)$score,
mcols(junc_summary)$plus_score +
mcols(junc_summary)$minus_score))
## If 'with.revmap' is TRUE, the "revmap" metadata column is added to
## the output. This metadata column is an IntegerList object represen-
## ting the mapping from each element in the ouput (i.e. a junction) to
## the corresponding elements in the input 'x'. Here we're going to use
## this to compute a 'score2' for each junction. We obtain this score
## by summing the mapping qualities of the alignments crossing the
## junction:
gal <- readGAlignments(bamfile, param=ScanBamParam(what="mapq"))
junc_summary <- summarizeJunctions(gal, with.revmap=TRUE)
junc_score2 <- sum(extractList(mcols(gal)$mapq,
mcols(junc_summary)$revmap))
mcols(junc_summary)$score2 <- junc_score2
## If the name of the reference genome is specified thru the 'genome'
## argument (in which case the corresponding BSgenome data package needs
## to be installed), then summarizeJunctions() returns the intron motif
## and strand for each junction.
## Since the reads in RNAseqData.HNRNPC.bam.chr14 were aligned to
## the hg19 genome, the following requires that you have
## BSgenome.Hsapiens.UCSC.hg19 installed:
junc_summary <- summarizeJunctions(gal, with.revmap=TRUE, genome="hg19")
mcols(junc_summary)$score2 <- junc_score2 # putting 'score2' back
## The "intron_motif" metadata column is a factor whose levels are the
## 5 natural intron motifs stored in predefined character vector
## 'NATURAL_INTRON_MOTIFS':
table(mcols(junc_summary)$intron_motif)
## ---------------------------------------------------------------------
## C. STRANDED RNA-seq PROTOCOL
## ---------------------------------------------------------------------
## Here is a simple test for checking whether the RNA-seq protocol was
## stranded or not:
strandedTest <- function(plus_score, minus_score)
(sum(plus_score ^ 2) + sum(minus_score ^ 2)) /
sum((plus_score + minus_score) ^ 2)
## The result of this test is guaranteed to be >= 0.5 and <= 1.
## If, for each junction, the strand of the crossing alignments looks
## random (i.e. "plus_score" and "minus_score" are close), then
## strandedTest() will return a value close to 0.5. If it doesn't look
## random (i.e. for each junction, one of "plus_score" and "minus_score"
## is much bigger than the other), then strandedTest() will return a
## value close to 1.
## If the reads are single-end, the test is meaningful when applied
## directly on 'junc_summary'. However, for the test to be meaningful
## on paired-end reads, it needs to be applied on the first and last
## alignments separately:
junc_summary1 <- summarizeJunctions(first(galp))
junc_summary2 <- summarizeJunctions(last(galp))
strandedTest(mcols(junc_summary1)$plus_score,
mcols(junc_summary1)$minus_score)
strandedTest(mcols(junc_summary2)$plus_score,
mcols(junc_summary2)$minus_score)
## Both values are close to 0.5 which suggests that the RNA-seq protocol
## used for this experiment was not stranded.
## ---------------------------------------------------------------------
## D. UTILITIES FOR IMPORTING THE JUNCTION FILE GENERATED BY SOME
## ALIGNERS
## ---------------------------------------------------------------------
## The TopHat aligner generates a junctions.bed file where it reports
## all the junctions satisfying some "quality" criteria (see the TopHat
## manual at http://tophat.cbcb.umd.edu/manual.shtml for more
## information). This file can be loaded with readTopHatJunctions():
runname <- names(RNAseqData.HNRNPC.bam.chr14_BAMFILES)[1]
junctions_file <- system.file("extdata", "tophat2_out", runname,
"junctions.bed",
package="RNAseqData.HNRNPC.bam.chr14")
th_junctions <- readTopHatJunctions(junctions_file)
## Comparing the "TopHat junctions" with the result of
## summarizeJunctions():
th_junctions14 <- th_junctions
seqlevels(th_junctions14, pruning.mode="coarse") <- "chr14"
mcols(th_junctions14)$intron_strand <- strand(th_junctions14)
strand(th_junctions14) <- "*"
## All the "TopHat junctions" are in 'junc_summary':
stopifnot(all(th_junctions14 %in% junc_summary))
## But not all the junctions in 'junc_summary' are reported by TopHat
## (that's because TopHat reports only junctions that satisfy some
## "quality" criteria):
is_in_th_junctions14 <- junc_summary %in% th_junctions14
table(is_in_th_junctions14) # 32 junctions are not in TopHat's
# junctions.bed file
junc_summary2 <- junc_summary[is_in_th_junctions14]
## 'junc_summary2' and 'th_junctions14' contain the same junctions in
## the same order:
stopifnot(all(junc_summary2 == th_junctions14))
## Let's merge their metadata columns. We use our own version of
## merge() for this, which is stricter (it checks that the common
## columns are the same in the 2 data frames to merge) and also
## simpler:
merge2 <- function(df1, df2)
{
common_colnames <- intersect(colnames(df1), colnames(df2))
lapply(common_colnames,
function(colname)
stopifnot(all(df1[ , colname] == df2[ , colname])))
extra_mcolnames <- setdiff(colnames(df2), colnames(df1))
cbind(df1, df2[ , extra_mcolnames, drop=FALSE])
}
mcols(th_junctions14) <- merge2(mcols(th_junctions14),
mcols(junc_summary2))
## Here is a peculiar junction reported by TopHat:
idx0 <- which(mcols(th_junctions14)$score2 == 0L)
th_junctions14[idx0]
gal[mcols(th_junctions14)$revmap[[idx0]]]
## The junction is crossed by 5 alignments (score is 5), all of which
## have a mapping quality of 0!
pileLettersAt()
Pile the letters of a set of aligned reads on top of a set of genomic positions
Description
pileLettersAt
extracts the letters/nucleotides of a set of
reads that align to a set of genomic positions of interest.
The extracted letters are returned as "piles of letters" (one per
genomic position of interest) stored in an XStringSet
(typically DNAStringSet ) object.
Usage
pileLettersAt(x, seqnames, pos, cigar, at)
Arguments
Argument | Description |
---|---|
x | An XStringSet (typically DNAStringSet ) object containing N unaligned read sequences (a.k.a. the query sequences) reported with respect to the + strand. |
seqnames | A factor- Rle parallel to x . For each i , seqnames[i] must be the name of the reference sequence of the i-th alignment. |
pos | An integer vector parallel to x . For each i , pos[i] must be the 1-based position on the reference sequence of the first aligned letter in x[[i]] . |
cigar | A character vector parallel to x . Contains the extended CIGAR strings of the alignments. |
at | A GPos object containing the genomic positions of interest. seqlevels(at) must be identical to levels(seqnames) . If at is not a GPos object, pileLettersAt will first try to turn it into one by calling the GPos constructor function on it. So for example at can be a GRanges object (or any other GenomicRanges derivative), and, in that case, each range in it will be interpreted as a run of adjacent genomic positions. See ? in the GenomicRanges package for more information. |
Details
x
, seqnames
, pos
, cigar
must be 4 parallel
vectors describing N aligned reads.
Value
An XStringSet (typically DNAStringSet )
object parallel to at
(i.e. with 1 string per genomic
position).
Seealso
The
pileup
andapplyPileups
functions defined in the Rsamtools package, as well as the SAMtools mpileup command (available at http://samtools.sourceforge.net/ as part of the SAMtools project), for more powerful flexible alternatives.The
stackStringsFromBam
function for stacking the read sequences (or their quality strings) stored in a BAM file on a region of interest.DNAStringSet objects in the Biostrings package.
GPos objects in the GenomicRanges package.
GAlignments objects.
cigar-utils for the CIGAR utility functions used internally by
pileLettersAt
.
Author
Hervé Pagès
Examples
## Input
## - A BAM file:
bamfile <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools"))
seqinfo(bamfile) # to see the seqlevels and seqlengths
stackStringsFromBam(bamfile, param="seq1:1-21") # a quick look at
# the reads
## - A GPos object containing Genomic Positions Of Interest:
my_GPOI <- GPos(c("seq1:1-5", "seq1:21-21", "seq1:1575-1575",
"seq2:1513-1514"))
## Some preliminary massage on 'my_GPOI'
seqinfo(my_GPOI) <- merge(seqinfo(my_GPOI), seqinfo(bamfile))
seqlevels(my_GPOI) <- seqlevelsInUse(my_GPOI)
## Load the BAM file in a GAlignments object. Note that we load only
## the reads aligned to the sequences in 'seqlevels(my_GPOI)'. Also,
## in order to be consistent with applyPileups() and SAMtools (m)pileup,
## we filter out the following BAM records:
## - secondary alignments (flag bit 0x100);
## - reads not passing quality controls (flag bit 0x200);
## - PCR or optical duplicates (flag bit 0x400).
## See ?ScanBamParam and the SAM Spec for more information.
which <- as(seqinfo(my_GPOI), "GRanges")
flag <- scanBamFlag(isSecondaryAlignment=FALSE,
isNotPassingQualityControls=FALSE,
isDuplicate=FALSE)
what <- c("seq", "qual")
param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which)
gal <- readGAlignments(bamfile, param=param)
seqlevels(gal) <- seqlevels(my_GPOI)
## Extract the read sequences (a.k.a. query sequences) and quality
## strings. Both are reported with respect to the + strand.
qseq <- mcols(gal)$seq
qual <- mcols(gal)$qual
nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
my_GPOI)
qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal),
my_GPOI)
mcols(my_GPOI)$nucl_piles <- nucl_piles
mcols(my_GPOI)$qual_piles <- qual_piles
my_GPOI
## Finally, to summarize A/C/G/T frequencies at each position:
alphabetFrequency(nucl_piles, baseOnly=TRUE)
## Note that the pileup() function defined in the Rsamtools package
## can be used to obtain a similar result:
scanbam_param <- ScanBamParam(flag=flag, which=my_GPOI)
pileup_param <- PileupParam(max_depth=5000,
min_base_quality=0,
distinguish_strands=FALSE)
pileup(bamfile, scanBamParam=scanbam_param, pileupParam=pileup_param)
readGAlignments()
Reading genomic alignments from a file
Description
Read genomic alignments from a file (typically a BAM file) into a GAlignments , GAlignmentPairs , GAlignmentsList , or GappedReads object.
Usage
readGAlignments(file, index=file, use.names=FALSE, param=NULL,
with.which_label=FALSE)
readGAlignmentPairs(file, index=file, use.names=FALSE, param=NULL,
with.which_label=FALSE, strandMode=1)
readGAlignmentsList(file, index=file, use.names=FALSE,
param=ScanBamParam(), with.which_label=FALSE)
readGappedReads(file, index=file, use.names=FALSE, param=NULL,
with.which_label=FALSE)
Arguments
Argument | Description |
---|---|
file | The path to the file to read or a BamFile object. Can also be a BamViews object for readGAlignments . |
index | The path to the index file of the BAM file to read. Must be given without the '.bai' extension. See scanBam in the Rsamtools packages for more information. |
use.names | TRUE or FALSE . By default (i.e. use.names=FALSE ), the resulting object has no names. If use.names is TRUE , then the names are constructed from the query template names (QNAME field in a SAM/BAM file). Note that the 2 records in a pair (when using readGAlignmentPairs or the records in a group (when using readGAlignmentsList ) have the same QNAME. |
param | NULL or a ScanBamParam object. Like for scanBam , this influences what fields and which records are imported. However, note that the fields specified thru this ScanBamParam object will be loaded in addition to any field required for generating the returned object ( GAlignments , GAlignmentPairs , or GappedReads object), but only the fields requested by the user will actually be kept as metadata columns of the object. By default (i.e. param=NULL or param=ScanBamParam() ), no additional field is loaded. The flag used is scanBamFlag(isUnmappedQuery=FALSE) for readGAlignments , readGAlignmentsList , and readGappedReads . (i.e. only records corresponding to mapped reads are loaded), and scanBamFlag(isUnmappedQuery=FALSE, isPaired=TRUE, for readGAlignmentPairs (i.e. only records corresponding to paired-end reads with both ends mapped are loaded). |
with.which_label | TRUE or FALSE (the default). If TRUE and if param has a which component, a "which_label" metadata column is added to the returned GAlignments or GappedReads object, or to the first and last components of the returned GAlignmentPairs object. In the case of readGAlignmentsList , it's added as an inner metadata column, that is, the metadata column is placed on the GAlignments object obtained by unlisting the returned GAlignmentsList object. The purpose of this metadata column is to unambiguously identify the range in which where each element in the returned object originates from. The labels used to identify the ranges are normally of the form "seq1:12250-246500" , that is, they're the same as the names found on the outer list that scanBam would return if called with the same param argument. If some ranges are duplicated, then the labels are made unique by appending a unique suffix to all of them. The "which_label" metadata column is represented as a factor- Rle . |
strandMode | Strand mode to set on the returned GAlignmentPairs object. See ? for more information. |
Details
readGAlignments
reads a file containing aligned reads as a GAlignments object. See?
for a description of GAlignments objects. Whenfile
is a BamViews object,readGAlignments
visits each path inbamPaths(file)
, returning the result ofreadGAlignments
applied to the specified path. Whenindex
is missing, it is set equal tobamIndicies(file)
. Only reads inbamRanges(file)
are returned (ifparam
is supplied,bamRanges(file)
takes precedence overbamWhich(param)
). The return value is a SimpleList object, with elements of the list corresponding to each path.bamSamples(file)
is available as metadata columns (accessed withmcols
) of the returned SimpleList object.readGAlignmentPairs
reads a file containing aligned paired-end reads as a GAlignmentPairs object. See?
for a description of GAlignmentPairs objects.readGAlignmentsList
reads a file containing aligned reads as a GAlignmentsList object. See?
for a description of GAlignmentsList objects.readGAlignmentsList
pairs records into mates according to the pairing criteria described below. The 1st mate will always be 1st in the GAlignmentsList list elements that have mate_status set to"mated"
, and the 2nd mate will always be 2nd. AGAlignmentsList
is returned with a mate_status metadata column on the outer list elements.mate_status
is a factor with 3 levels indicating mate status, mated , ambiguous or unmated :mated: primary or non-primary pairs
ambiguous: multiple segments matching to the same location (indistinguishable)
unmated: mate does not exist or is unmapped When the file argument is a BamFile, asMates=TRUE must be set, otherwise the data are treated as single-end reads. See the asMates section of
?
in the Rsamtools package for details.readGappedReads
reads a file containing aligned reads as a GappedReads object. See?
for a description of GappedReads objects.
For all these functions, flags, tags and ranges may be specified in the supplied ScanBamParam object for fine tuning of results.
Value
A GAlignments object for readGAlignments
.
A GAlignmentPairs object for readGAlignmentPairs
.
Note that a BAM (or SAM) file can in theory contain a mix of single-end
and paired-end reads, but in practise it seems that single-end and
paired-end are not mixed. In other words, the value of flag bit 0x1
( isPaired
) is the same for all the records in a file.
So if readGAlignmentPairs
returns a GAlignmentPairs object
of length zero, this almost always means that the BAM (or SAM) file
contains alignments for single-end reads (although it could also mean that
the user-supplied ScanBamParam is filtering out
everything, or that the file is empty, or that all the records in the file
correspond to unmapped reads).
A GAlignmentsList object for readGAlignmentsList
.
When the list contains paired-end reads a metadata data column of
mate_status
is added to the object. See details in the
Bam specific back-ends' section on this man page. A [GappedReads](#gappedreads) object for
readGappedReads. ## Seealso * [
scanBam](#scanbam) and [
ScanBamParam](#scanbamparam) in the Rsamtools package. * [GAlignments](#galignments) , [GAlignmentPairs](#galignmentpairs) , [GAlignmentsList](#galignmentslist) , and [GappedReads](#gappedreads) objects. * [IRangesList](#irangeslist) objects (used in the examples below to specify the
whichregions) in the IRanges package. ## Note BAM records corresponding to unmapped reads are always ignored. Starting with Rsamtools 1.7.1 (BioC 2.10), PCR or optical duplicates are loaded by default (use
scanBamFlag(isDuplicate=FALSE)` to
drop them).
## Author
Hervé Pagès hpages@fredhutch.org and
Valerie Obenchain Valerie.Obenchain@RoswellPark.org
## Examples
r ## --------------------------------------------------------------------- ## A. readGAlignments() ## --------------------------------------------------------------------- ## Simple use: bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) gal1 <- readGAlignments(bamfile) gal1 names(gal1) ## Using the 'use.names' arg: gal2 <- readGAlignments(bamfile, use.names=TRUE) gal2 head(names(gal2)) ## Using the 'param' arg to drop PCR or optical duplicates as well as ## secondary alignments, and to load additional BAM fields: param <- ScanBamParam(flag=scanBamFlag(isDuplicate=FALSE, isSecondaryAlignment=FALSE), what=c("qual", "flag")) gal3 <- readGAlignments(bamfile, param=param) gal3 mcols(gal3) ## Using the 'param' arg to load alignments from particular regions. which <- IRangesList(seq1=IRanges(1000, 1100), seq2=IRanges(c(1546, 1555, 1567), width=10)) param <- ScanBamParam(which=which) gal4 <- readGAlignments(bamfile, use.names=TRUE, param=param) gal4 ## IMPORTANT NOTE: A given record is loaded one time for each region ## it overlaps with. We call this "duplicated record selection" (this ## is a scanBam() feature, readGAlignments() is based on scanBam()): which <- IRangesList(seq2=IRanges(c(1555, 1567), width=10)) param <- ScanBamParam(which=which) gal5 <- readGAlignments(bamfile, use.names=TRUE, param=param) gal5 # record EAS114_26:7:37:79:581 was loaded twice ## This becomes clearer if we use 'with.which_label=TRUE' to identify ## the region in 'which' where each element in 'gal5' originates from. gal5 <- readGAlignments(bamfile, use.names=TRUE, param=param, with.which_label=TRUE) gal5 ## Not surprisingly, we also get "duplicated record selection" when ## 'which' contains repeated or overlapping regions. Using the same ## regions as we did for 'gal4' above, except that now we're ## repeating the region on seq1: which <- IRangesList(seq1=rep(IRanges(1000, 1100), 2), seq2=IRanges(c(1546, 1555, 1567), width=10)) param <- ScanBamParam(which=which) gal4b <- readGAlignments(bamfile, use.names=TRUE, param=param) length(gal4b) # > length(gal4), because all the records overlapping # with bases 1000 to 1100 on seq1 are now duplicated ## The "duplicated record selection" will artificially increase the ## coverage or affect other downstream results. It can be mitigated ## (but not completely eliminated) by first "reducing" the set of ## regions: which <- reduce(which) which param <- ScanBamParam(which=which) gal4c <- readGAlignments(bamfile, use.names=TRUE, param=param) length(gal4c) # < length(gal4), because the 2 first original regions # on seq2 were merged into a single one ## Note that reducing the set of regions didn't completely eliminate ## "duplicated record selection". Records that overlap the 2 reduced ## regions on seq2 (which$seq2) are loaded twice (like for 'gal5' ## above). See example D. below for how to completely eliminate ## "duplicated record selection". ## Using the 'param' arg to load tags. Except for MF and Aq, the tags ## specified below are predefined tags (see the SAM Spec for the list ## of predefined tags and their meaning). param <- ScanBamParam(tag=c("MF", "Aq", "NM", "UQ", "H0", "H1"), what="isize") gal6 <- readGAlignments(bamfile, param=param) mcols(gal6) # "tag" cols always after "what" cols ## With a BamViews object: fls <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) bv <- BamViews(fls, bamSamples=DataFrame(info="test", row.names="ex1"), auto.range=TRUE) ## Note that the "readGAlignments" method for BamViews objects ## requires the ShortRead package to be installed. aln <- readGAlignments(bv) aln aln[[1]] aln[colnames(bv)] mcols(aln) ## --------------------------------------------------------------------- ## B. readGAlignmentPairs() ## --------------------------------------------------------------------- galp1 <- readGAlignmentPairs(bamfile) head(galp1) names(galp1) ## Here we use the 'param' arg to filter by proper pair, drop PCR / ## optical duplicates, and drop secondary alignments. Filtering by ## proper pair and dropping secondary alignments can help make the ## pairing algorithm run significantly faster: param <- ScanBamParam(flag=scanBamFlag(isProperPair=TRUE, isDuplicate=FALSE, isSecondaryAlignment=FALSE)) galp2 <- readGAlignmentPairs(bamfile, use.names=TRUE, param=param) galp2 head(galp2) head(names(galp2)) ## --------------------------------------------------------------------- ## C. readGAlignmentsList() ## --------------------------------------------------------------------- library(pasillaBamSubset) ## 'file' as character. bam <- untreated3_chr4() galist1 <- readGAlignmentsList(bam) galist1[1:3] length(galist1) table(elementNROWS(galist1)) ## When 'file' is a BamFile, 'asMates' must be TRUE. If FALSE, ## the data are treated as single-end and each list element of the ## GAlignmentsList will be of length 1. For single-end data ## use readGAlignments(). bamfile <- BamFile(bam, yieldSize=3, asMates=TRUE) readGAlignmentsList(bamfile) ## Use a 'param' to fine tune the results. param <- ScanBamParam(flag=scanBamFlag(isProperPair=TRUE)) galist2 <- readGAlignmentsList(bam, param=param) length(galist2) ## --------------------------------------------------------------------- ## D. COMPARING 4 STRATEGIES FOR LOADING THE ALIGNMENTS THAT OVERLAP ## WITH THE EXONIC REGIONS ON FLY CHROMMOSOME 4 ## --------------------------------------------------------------------- library(pasillaBamSubset) bam <- untreated1_chr4() library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene ex <- exons(txdb) seqlevels(ex, pruning.mode="coarse") <- "chr4" length(ex) ## Some of the exons overlap with each other: isDisjoint(ex) # FALSE exonic_regions <- reduce(ex) isDisjoint(exonic_regions) # no more overlaps length(exonic_regions) ## Strategy #1: slow and loads a lot of records more than once (see ## "duplicated record selection" in example A. above). param1 <- ScanBamParam(which=ex) gal1 <- readGAlignments(bam, param=param1) length(gal1) # many "duplicated records" ## Strategy #2: faster and generates less duplicated records but ## doesn't eliminate them. param2 <- ScanBamParam(which=exonic_regions) gal2 <- readGAlignments(bam, param=param2) length(gal2) # less "duplicated records" ## Strategy #3: fast and completely eliminates duplicated records. gal0 <- readGAlignments(bam) gal3 <- subsetByOverlaps(gal0, exonic_regions, ignore.strand=TRUE) length(gal3) # no "duplicated records" ## Note that, in this case using 'exonic_regions' or 'ex' makes no ## difference: gal3b <- subsetByOverlaps(gal0, ex, ignore.strand=TRUE) stopifnot(identical(gal3, gal3b)) ## Strategy #4: strategy #3 however can require a lot of memory if the ## file is big because we load all the alignments into memory before we ## select those that overlap with the exonic regions. Strategy #4 ## addresses this by loading the file by chunks. bamfile <- BamFile(bam, yieldSize=50000) open(bamfile) while (length(chunk0 <- readGAlignments(bamfile))) { chunk <- subsetByOverlaps(chunk0, ex, ignore.strand=TRUE) cat("chunk0:", length(chunk0), "- chunk:", length(chunk), " ") ## ... do something with 'chunk' ... } close(bamfile) ## --------------------------------------------------------------------- ## E. readGappedReads() ## --------------------------------------------------------------------- greads1 <- readGappedReads(bamfile) greads1 names(greads1) qseq(greads1) greads2 <- readGappedReads(bamfile, use.names=TRUE) head(greads2) head(names(greads2))
sequenceLayer()
Lay read sequences alongside the reference space, using their CIGARs
Description
sequenceLayer
can lay strings that belong to a given space (e.g.
the "query"
space) alongside another space (e.g. the
"reference"
space) by removing/injecting substrings from/into them,
using the supplied CIGARs.
Its primary use case is to lay the read sequences stored in a BAM file
(which are considered to belong to the "query"
space) alongside
the "reference"
space. It can also be used to remove the parts
of the read sequences that correspond to soft-clipping. More generally
it can lay strings that belong to any supported space alongside any other
supported space. See the Details section below for the list of supported
spaces.
Usage
sequenceLayer(x, cigar, from="query", to="reference",
D.letter="-", N.letter=".",
I.letter="-", S.letter="+", H.letter="+")
Arguments
Argument | Description |
---|---|
x | An XStringSet object containing strings that belong to a given space. |
cigar | A character vector or factor of the same length as x containing the extended CIGAR strings (one per element in x ). |
from, to | A single string specifying one of the 8 supported spaces listed in the Details section below. from must be the current space (i.e. the space the strings in x belong to) and to is the space alonside which to lay the strings in x . |
D.letter, N.letter, I.letter, S.letter, H.letter | A single letter used as a filler for injections. More on this in the Details section below. |
Details
The 8 supported spaces are: "reference"
,
"reference-N-regions-removed"
, "query"
,
"query-before-hard-clipping"
, "query-after-soft-clipping"
,
"pairwise"
, "pairwise-N-regions-removed"
,
and "pairwise-dense"
.
Each space can be characterized by the extended CIGAR
operations that are visible in it. A CIGAR operation is
said to be visible in a given space if it "runs along it",
that is, if it's associated with a block of contiguous positions in
that space (the size of the block being the length of the operation).
For example, the M/=/X operations are visible in all spaces,
the D/N operations are visible in the "reference"
space
but not in the "query"
space, the S operation is visible
in the "query"
space but not in the "reference"
or in the "query-after-soft-clipping"
space, etc...
Here are the extended CIGAR operations that are visible in each space:
reference: M, D, N, =, X
reference-N-regions-removed: M, D, =, X
query: M, I, S, =, X
query-before-hard-clipping: M, I, S, H, =, X
query-after-soft-clipping: M, I, =, X
pairwise: M, I, D, N, =, X
pairwise-N-regions-removed: M, I, D, =, X
pairwise-dense: M, =, X
sequenceLayer
lays a string that belongs to one space alongside another by (1) removing the substrings associated with operations that are not visible anymore in the new space, and (2) injecting substrings associated with operations that become visible in the new space. Each injected substring has the length of the operation associated with it, and its content is controlled via the corresponding*.letter
argument.
For example, when going from the "query"
space to the
"reference"
space (the default), the I- and S-substrings (i.e.
the substrings associated with I/S operations) are removed, and
substrings associated with D/N operations are injected. More precisely,
the D-substrings are filled with the letter specified in D.letter
,
and the N-substrings with the letter specified in N.letter
.
The other *.letter
arguments are ignored in that case.
Value
An XStringSet object of the same class and length as x
.
Seealso
The
stackStringsFromBam
function for stacking the read sequences (or their quality strings) stored in a BAM file on a region of interest.The
readGAlignments
function for loading read sequences from a BAM file (via a GAlignments object).The
extractAt
andreplaceAt
functions in the Biostrings package for extracting/replacing arbitrary substrings from/in a string or set of strings.cigar-utils for the CIGAR utility functions used internally by
sequenceLayer
.
Author
Hervé Pagès
Examples
## ---------------------------------------------------------------------
## A. FROM "query" TO "reference" SPACE
## ---------------------------------------------------------------------
## Load read sequences from a BAM file (they will be returned in a
## GAlignments object):
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what="seq")
gal <- readGAlignments(bamfile, param=param)
qseq <- mcols(gal)$seq # the read sequences (aka query sequences)
## Lay the query sequences alongside the reference space. This will
## remove the substrings associated with insertions to the reference
## (I operations) and soft clipping (S operations), and will inject new
## substrings (filled with "-") where deletions from the reference (D
## operations) and skipped regions from the reference (N operations)
## occurred during the alignment process:
qseq_on_ref <- sequenceLayer(qseq, cigar(gal))
## A typical use case for doing the above is to compute 1 consensus
## sequence per chromosome. The code below shows how this can be done
## in 2 extra steps.
## Step 1: Compute one consensus matrix per chromosome.
qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
pos_by_chrom <- splitAsList(start(gal), seqnames(gal))
cm_by_chrom <- lapply(names(pos_by_chrom),
function(seqname)
consensusMatrix(qseq_on_ref_by_chrom[[seqname]],
as.prob=TRUE,
shift=pos_by_chrom[[seqname]]-1,
width=seqlengths(gal)[[seqname]]))
names(cm_by_chrom) <- names(pos_by_chrom)
## 'cm_by_chrom' is a list of consensus matrices. Each matrix has 17
## rows (1 per letter in the DNA alphabet) and 1 column per chromosome
## position.
## Step 2: Compute the consensus string from each consensus matrix.
## We'll put "+" in the strings wherever there is no coverage for that
## position, and "N" where there is coverage but no consensus.
cs_by_chrom <- lapply(cm_by_chrom,
function(cm) {
## Because consensusString() doesn't like consensus matrices
## with columns that contain only zeroes (and you will have
## columns like that for chromosome positions that don't
## receive any coverage), we need to "fix" 'cm' first.
idx <- colSums(cm) == 0
cm["+", idx] <- 1
DNAString(consensusString(cm, ambiguityMap="N"))
})
## consensusString() provides some flexibility to let you extract
## the consensus in different ways. See '?consensusString' in the
## Biostrings package for the details.
## Finally, note that the read quality strings can also be used as
## input for sequenceLayer():
param <- ScanBamParam(what="qual")
gal <- readGAlignments(bamfile, param=param)
qual <- mcols(gal)$qual # the read quality strings
qual_on_ref <- sequenceLayer(qual, cigar(gal))
## Note that since the "-" letter is a valid quality code, there is
## no way to distinguish it from the "-" letters inserted by
## sequenceLayer().
## ---------------------------------------------------------------------
## B. FROM "query" TO "query-after-soft-clipping" SPACE
## ---------------------------------------------------------------------
## Going from "query" to "query-after-soft-clipping" simply removes
## the substrings associated with soft clipping (S operations):
qseq <- DNAStringSet(c("AAAGTTCGAA", "TTACGATTAN", "GGATAATTTT"))
cigar <- c("3H10M", "2S7M1S2H", "2M1I1M3D2M4S")
clipped_qseq <- sequenceLayer(qseq, cigar,
from="query", to="query-after-soft-clipping")
sequenceLayer(clipped_qseq, cigar,
from="query-after-soft-clipping", to="query")
sequenceLayer(clipped_qseq, cigar,
from="query-after-soft-clipping", to="query",
S.letter="-")
## ---------------------------------------------------------------------
## C. BRING QUERY AND REFERENCE SEQUENCES TO THE "pairwise" or
## "pairwise-dense" SPACE
## ---------------------------------------------------------------------
## Load read sequences from a BAM file:
library(RNAseqData.HNRNPC.bam.chr14)
bamfile <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
param <- ScanBamParam(what="seq",
which=GRanges("chr14", IRanges(1, 25000000)))
gal <- readGAlignments(bamfile, param=param)
qseq <- mcols(gal)$seq # the read sequences (aka query sequences)
## Load the corresponding reference sequences from the appropriate
## BSgenome package (the reads in RNAseqData.HNRNPC.bam.chr14 were
## aligned to hg19):
library(BSgenome.Hsapiens.UCSC.hg19)
rseq <- getSeq(Hsapiens, as(gal, "GRanges")) # the reference sequences
## Bring 'qseq' and 'rseq' to the "pairwise" space.
## For 'qseq', this will remove the substrings associated with soft
## clipping (S operations) and inject substrings (filled with "-")
## associated with deletions from the reference (D operations) and
## skipped regions from the reference (N operations). For 'rseq', this
## will inject substrings (filled with "-") associated with insertions
## to the reference (I operations).
qseq2 <- sequenceLayer(qseq, cigar(gal),
from="query", to="pairwise")
rseq2 <- sequenceLayer(rseq, cigar(gal),
from="reference", to="pairwise")
## Sanity check: 'qseq2' and 'rseq2' should have the same shape.
stopifnot(identical(elementNROWS(qseq2), elementNROWS(rseq2)))
## A closer look at reads with insertions and deletions:
cigar_op_table <- cigarOpTable(cigar(gal))
head(cigar_op_table)
I_idx <- which(cigar_op_table[ , "I"] >= 2) # at least 2 insertions
qseq2[I_idx]
rseq2[I_idx]
D_idx <- which(cigar_op_table[ , "D"] >= 2) # at least 2 deletions
qseq2[D_idx]
rseq2[D_idx]
## A closer look at reads with skipped regions:
N_idx <- which(cigar_op_table[ , "N"] != 0)
qseq2[N_idx]
rseq2[N_idx]
## A variant of the "pairwise" space is the "pairwise-dense" space.
## In that space, all indels and skipped regions are removed from 'qseq'
## and 'rseq'.
qseq3 <- sequenceLayer(qseq, cigar(gal),
from="query", to="pairwise-dense")
rseq3 <- sequenceLayer(rseq, cigar(gal),
from="reference", to="pairwise-dense")
## Sanity check: 'qseq3' and 'rseq3' should have the same shape.
stopifnot(identical(elementNROWS(qseq3), elementNROWS(rseq3)))
## Insertions were removed:
qseq3[I_idx]
rseq3[I_idx]
## Deletions were removed:
qseq3[D_idx]
rseq3[D_idx]
## Skipped regions were removed:
qseq3[N_idx]
rseq3[N_idx]
## ---------------------------------------------------------------------
## D. SANITY CHECKS
## ---------------------------------------------------------------------
SPACES <- c("reference",
"reference-N-regions-removed",
"query",
"query-before-hard-clipping",
"query-after-soft-clipping",
"pairwise",
"pairwise-N-regions-removed",
"pairwise-dense")
cigarWidth <- list(
function(cigar) cigarWidthAlongReferenceSpace(cigar),
function(cigar) cigarWidthAlongReferenceSpace(cigar,
N.regions.removed=TRUE),
function(cigar) cigarWidthAlongQuerySpace(cigar),
function(cigar) cigarWidthAlongQuerySpace(cigar,
before.hard.clipping=TRUE),
function(cigar) cigarWidthAlongQuerySpace(cigar,
after.soft.clipping=TRUE),
function(cigar) cigarWidthAlongPairwiseSpace(cigar),
function(cigar) cigarWidthAlongPairwiseSpace(cigar,
N.regions.removed=TRUE),
function(cigar) cigarWidthAlongPairwiseSpace(cigar, dense=TRUE)
)
cigar <- c("3H2S4M1D2M2I1M5N3M6H", "5M1I3M2D4M2S")
seq <- list(
BStringSet(c(A="AAAA-BBC.....DDD", B="AAAAABBB--CCCC")),
BStringSet(c(A="AAAA-BBCDDD", B="AAAAABBB--CCCC")),
BStringSet(c(A="++AAAABBiiCDDD", B="AAAAAiBBBCCCC++")),
BStringSet(c(A="+++++AAAABBiiCDDD++++++", B="AAAAAiBBBCCCC++")),
BStringSet(c(A="AAAABBiiCDDD", B="AAAAAiBBBCCCC")),
BStringSet(c(A="AAAA-BBiiC.....DDD", B="AAAAAiBBB--CCCC")),
BStringSet(c(A="AAAA-BBiiCDDD", B="AAAAAiBBB--CCCC")),
BStringSet(c(A="AAAABBCDDD", B="AAAAABBBCCCC"))
)
stopifnot(all(sapply(1:8,
function(i) identical(width(seq[[i]]), cigarWidth[[i]](cigar))
)))
sequenceLayer2 <- function(x, cigar, from, to)
sequenceLayer(x, cigar, from=from, to=to, I.letter="i")
identical_XStringSet <- function(target, current)
{
ok1 <- identical(class(target), class(current))
ok2 <- identical(names(target), names(current))
ok3 <- all(target == current)
ok1 && ok2 && ok3
}
res <- sapply(1:8, function(i) {
sapply(1:8, function(j) {
target <- seq[[j]]
current <- sequenceLayer2(seq[[i]], cigar,
from=SPACES[i], to=SPACES[j])
identical_XStringSet(target, current)
})
})
stopifnot(all(res))
setops_methods()
Set operations on GAlignments objects
Description
Performs set operations on GAlignments objects.
NOTE: The pintersect
generic function and method
for IntegerRanges objects is defined and documented in the
IRanges package.
Methods for GRanges and
GRangesList objects are defined and
documented in the GenomicRanges package.
Usage
list(list("pintersect"), list("GAlignments,GRanges"))(x, y, ...)
list(list("pintersect"), list("GRanges,GAlignments"))(x, y, ...)
Arguments
Argument | Description |
---|---|
x, y | A GAlignments object and a GRanges object. They must have the same length. |
... | Further arguments to be passed to or from other methods. |
Value
A GAlignments object parallel to (i.e. same length as)
x
and y
.
Seealso
The GAlignments class.
The setops-methods man page in the GenomicRanges package.
Examples
## Parallel intersection of a GAlignments and a GRanges object:
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
gal <- readGAlignments(bamfile)
pintersect(gal, shift(as(gal, "GRanges"), 6L))
stackStringsFromBam()
Stack the read sequences stored in a BAM file on a region of interest
Description
stackStringsFromBam
stacks the read sequences (or their quality
strings) stored in a BAM file over a user-specified region.
alphabetFrequencyFromBam
computes the alphabet frequency of the
reads over a user-specified region.
Both functions take into account the CIGAR of each read to "lay" the read sequence (or its quality string) alongside the reference space. This step ensures that each nucleotide in a read is associated with the correct position on the reference sequence.
Usage
stackStringsFromBam(file, index=file, param,
what="seq", use.names=FALSE,
D.letter="-", N.letter=".",
Lpadding.letter="+", Rpadding.letter="+")
alphabetFrequencyFromBam(file, index=file, param, what="seq", ...)
Arguments
Argument | Description |
---|---|
file, index | The path to the BAM file to read, and to the index file of the BAM file to read, respectively. The latter is given without the '.bai' extension. See scanBam for more information. |
param | A ScanBamParam object containing exactly 1 genomic region (i.e. unlist(bamWhich(param)) must have length 1). Alternatively, param can be a GRanges or IntegerRangesList object containing exactly 1 genomic region (the strand will be ignored in case of a GRanges object), or a character string specifying a single genomic region (in the "chr14:5201-5300" format). |
what | A single string. Either "seq" or "qual" . If "seq" (the default), the read sequences will be stacked. If "qual" , the read quality strings will be stacked. |
use.names | Use the query template names (QNAME field) as the names of the returned object? If not (the default), then the returned object has no names. |
D.letter, N.letter | A single letter used as a filler for injections. The 2 arguments are passed down to the sequenceLayer function. See ? for more details. |
Lpadding.letter, Rpadding.letter | A single letter to use for padding the sequences on the left, and another one to use for padding on the right. The 2 arguments are passed down to the stackStrings function defined in the Biostrings package. See ? in the Biostrings package for more details. |
... | Further arguments to be passed to alphabetFrequency . |
Details
stackStringsFromBam
performs the 3 following steps:
Load the read sequences (or their quality strings) from the BAM file. Only the read sequences that overlap with the specified region are loaded. This is done with the
readGAlignments
function. Note that if the file contains paired-end reads, the pairing is ignored.Lay the sequences alongside the reference space, using their CIGARs. This is done with the
sequenceLayer
function.Stack them on the specified region. This is done with the
stackStrings
function defined in the Biostrings package.alphabetFrequencyFromBam
also performs steps 1. and 2. but, instead of stacking the sequences at step 3., it computes the nucleotide frequencies for each genomic position in the specified region.
Value
For stackStringsFromBam
: A rectangular (i.e. constant-width)
DNAStringSet object (if what
is "seq"
)
or BStringSet object (if what
is "qual"
).
For alphabetFrequencyFromBam
: By default a matrix like one returned
by alphabetFrequency
. The matrix has 1 row per
nucleotide position in the specified region.
Seealso
The
pileLettersAt
function for piling the letters of a set of aligned reads on top of a set of genomic positions.The
readGAlignments
function for loading read sequences (or their quality strings) from a BAM file (via a GAlignments object).The
sequenceLayer
function for laying read sequences alongside the reference space, using their CIGARs.The
stackStrings
function in the Biostrings package for stacking an arbitrary XStringSet object.The
alphabetFrequency
function in the Biostrings package.The SAMtools mpileup command available at http://samtools.sourceforge.net/ as part of the SAMtools project.
Note
TWO IMPORTANT CAVEATS ABOUT stackStringsFromBam
:
Specifying a big genomic region, say >= 100000 bp, can require a lot of
memory (especially with high coverage reads) and is not recommended.
See the pileLettersAt
function for piling the read letters
on top of a set of genomic positions, which is more flexible and more
memory efficient.
Paired-end reads are treated as single-end reads (i.e. they're not paired).
Author
Hervé Pagès
Examples
## ---------------------------------------------------------------------
## A. EXAMPLE WITH TOY DATA
## ---------------------------------------------------------------------
bamfile1 <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools"))
region1 <- GRanges("seq1", IRanges(1, 60)) # region of interest
## Stack the read sequences:
stackStringsFromBam(bamfile1, param=region1)
## Compute the "consensus matrix" (1 column per nucleotide position
## in the region of interest):
af <- alphabetFrequencyFromBam(bamfile1, param=region1, baseOnly=TRUE)
cm1a <- t(af[ , DNA_BASES])
cm1a
## Stack their quality strings:
stackStringsFromBam(bamfile1, param=region1, what="qual")
## Control the number of reads to display:
options(showHeadLines=18)
options(showTailLines=6)
stackStringsFromBam(bamfile1, param=GRanges("seq1", IRanges(61, 120)))
stacked_qseq <- stackStringsFromBam(bamfile1, param="seq2:1509-1519")
stacked_qseq # deletion in read 13
af <- alphabetFrequencyFromBam(bamfile1, param="seq2:1509-1519",
baseOnly=TRUE)
cm1b <- t(af[ , DNA_BASES]) # consensus matrix
cm1b
## Sanity check:
stopifnot(identical(consensusMatrix(stacked_qseq)[DNA_BASES, ], cm1b))
stackStringsFromBam(bamfile1, param="seq2:1509-1519", what="qual")
## ---------------------------------------------------------------------
## B. EXAMPLE WITH REAL DATA
## ---------------------------------------------------------------------
library(RNAseqData.HNRNPC.bam.chr14)
bamfile2 <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])
## Region of interest:
region2 <- GRanges("chr14", IRanges(19650095, 19650159))
readGAlignments(bamfile2, param=ScanBamParam(which=region2))
stackStringsFromBam(bamfile2, param=region2)
af <- alphabetFrequencyFromBam(bamfile2, param=region2, baseOnly=TRUE)
cm2 <- t(af[ , DNA_BASES]) # consensus matrix
cm2
## ---------------------------------------------------------------------
## C. COMPUTE READ CONSENSUS SEQUENCE FOR REGION OF INTEREST
## ---------------------------------------------------------------------
## Let's write our own little naive function to go from consensus matrix
## to consensus sequence. For each nucleotide position in the region of
## interest (i.e. each column in the matrix), we select the letter with
## highest frequency. We also use special letter "*" at positions where
## there is a tie, and special letter "." at positions where all the
## frequencies are 0 (a particular type of tie):
cm_to_cs <- function(cm)
{
stopifnot(is.matrix(cm))
nr <- nrow(cm)
rnames <- rownames(cm)
stopifnot(!is.null(rnames) && all(nchar(rnames) == 1L))
selection <- apply(cm, 2,
function(x) {
i <- which.max(x)
if (x[i] == 0L)
return(nr + 1L)
if (sum(x == x[i]) != 1L)
return(nr + 2L)
i
})
paste0(c(rnames, ".", "*")[selection], collapse="")
}
cm_to_cs(cm1a)
cm_to_cs(cm1b)
cm_to_cs(cm2)
## Note that the consensus sequences we obtain are relative to the
## plus strand of the reference sequence.
summarizeOverlaps_methods()
Perform overlap queries between reads and genomic features
Description
summarizeOverlaps
extends findOverlaps
by providing
options to resolve reads that overlap multiple features.
Usage
list(list("summarizeOverlaps"), list("GRanges,GAlignments"))(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRangesList,GAlignments"))(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRanges,GRanges"))(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRangesList,GRanges"))(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRanges,GAlignmentPairs"))(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRangesList,GAlignmentPairs"))(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
## mode funtions
Union(features, reads, ignore.strand=FALSE,
inter.feature=TRUE)
IntersectionStrict(features, reads, ignore.strand=FALSE,
inter.feature=TRUE)
IntersectionNotEmpty(features, reads, ignore.strand=FALSE,
inter.feature=TRUE)
list(list("summarizeOverlaps"), list("GRanges,BamFile"))(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, singleEnd=TRUE,
fragments=FALSE, param=ScanBamParam(), preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("BamViews,missing"))(
features, reads, mode=Union,
ignore.strand=FALSE, inter.feature=TRUE, singleEnd=TRUE,
fragments=FALSE, param=ScanBamParam(), preprocess.reads=NULL, ...)
Arguments
Argument | Description |
---|---|
features | A GRanges or a GRangesList object of genomic regions of interest. When a GRanges is supplied, each row is considered a feature. When a GRangesList is supplied, each higher list-level is considered a feature. This distinction is important when defining overlaps. When features is a BamViews the reads argument is missing. Features are extracted from the bamRanges and the reads from bamPaths . Metadata from bamPaths and bamSamples are stored in the colData of the resulting RangedSummarizedExperiment object. bamExperiment metadata are stored in the metadata slot. |
reads | A GRanges , GRangesList GAlignments , GAlignmentsList , GAlignmentPairs , BamViews or BamFileList object that represents the data to be counted by summarizeOverlaps . reads is missing when a BamViews object is the only argument supplied to summarizeOverlaps . reads are the files specified in bamPaths of the BamViews object. |
mode | mode can be one of the pre-defined count methods such as "Union", "IntersectionStrict", or "IntersectionNotEmpty" or it a user supplied count function. For a custom count function, the input arguments must match those of the pre-defined options and the function must return a vector of counts the same length as the annotation ('features' argument). See examples for details. The pre-defined options are designed after the counting modes available in the HTSeq package by Simon Anders (see references). |
"Union" : (Default) Reads that overlap any portion of exactly one feature are counted. Reads that overlap multiple features are discarded. This is the most conservative of the 3 modes.
"IntersectionStrict" : A read must fall completely "within" the feature to be counted. If a read overlaps multiple features but falls "within" only one, the read is counted for that feature. If the read is "within" multiple features, the read is discarded.
"IntersectionNotEmpty" : A read must fall in a unique disjoint region of a feature to be counted. When a read overlaps multiple features, the features are partitioned into disjoint intervals. Regions that are shared between the features are discarded leaving only the unique disjoint regions. If the read overlaps one of these remaining regions, it is assigned to the feature the unique disjoint region came from.
user supplied function : A function can be supplied as the
mode
argument. It must (1) have arguments that correspond tofeatures
,reads
,ignore.strand
andinter.feature
arguments (as in the defined mode functions) and (2) return a vector of counts the same length asfeatures
.
|ignore.strand
| A logical indicating if strand should be considered when matching. | |inter.feature
| (Default TRUE) A logical indicating if the countingmode
should be aware of overlapping features. When TRUE (default), reads mapping to multiple features are dropped (i.e., not counted). When FALSE, these reads are retained and a count is assigned to each feature they map to. There are 6 possible combinations of themode
andinter.feature
arguments. Wheninter.feature=FALSE
the behavior of modes Union and IntersectionStrict are essentially countOverlaps with type=any andtype=within
, respectively. IntersectionNotEmpty does not reduce to a simple countOverlaps because common (shared) regions of the annotation are removed before counting. | |preprocess.reads
| A function applied to the reads before counting. The first argument should bereads
and the return value should be an object compatible with thereads
argument to the counting modes, Union, IntersectionStrict and IntersectionNotEmpty. The distinction between a user-defined 'mode' and user-defined 'preprocess.reads' function is that in the first case the user defines how to count; in the second case the reads are preprocessed before counting with a pre-defined mode. See examples. | |...
| Additional arguments passed to functions or methods called from withinsummarizeOverlaps
. For BAM file methods arguments may includesingleEnd
,fragments
orparam
which apply to reading records from a file (see below). Providingcount.mapped.reads=TRUE
include additional passes through the BAM file to collect statistics similar to those fromcountBam
. ABPPARAM
argument can be passed down to thebplapply
called bysummarizeOverlaps
. The argument can be MulticoreParam(), SnowParam(), BatchJobsParam() or DoparParam(). See the BiocParallel package for details in specifying the params. | |singleEnd
| (Default TRUE) A logical indicating if reads are single or paired-end. In Bioconductor > 2.12 it is not necessary to sort paired-end BAM files byqname
. When counting withsummarizeOverlaps
, settingsingleEnd=FALSE
will trigger paired-end reading and counting. It is fine to also setasMates=TRUE
in theBamFile
but is not necessary whensingleEnd=FALSE
. | |fragments
| (Default FALSE) A logical; applied to paired-end data only.fragments
controls which function is used to read the data which subsequently affects which records are included in counting. Whenfragments=FALSE
, data are read withreadGAlignmentPairs
and returned in aGAlignmentPairs
class. In this case, singletons, reads with unmapped pairs, and other fragments, are dropped. Whenfragments=TRUE
, data are read withreadGAlignmentsList
and returned in aGAlignmentsList
class. This class holds mated pairs as well as same-strand pairs, singletons, reads with unmapped pairs and other fragments. Because more records are kept, generally counts will be higher whenfragments=TRUE
. The term mated pairs refers to records paired with the algorithm described on the?
man page. | |param
| An optionalScanBamParam
instance to further influence scanning, counting, or filtering. See?
for details of how records are returned when bothyieldSize
is specified in aBamFile
andwhich
is defined in aScanBamParam
. |
Details
list(" ", " ", list(list(), list(list("summarizeOverlaps"), " offers counting modes to resolve reads that ", " overlap multiple features. The ", list("mode"), " argument defines a set of rules ", " to resolve the read to a single feature such that each read is counted a ", " maximum of once. New to GenomicRanges >= 1.13.9 is the ", " ", list("inter.feature"), " argument which allows reads to be counted for each ", " feature they overlap. When ", list("inter.feature=TRUE"),
" the counting modes
", " are aware of feature overlap; reads that overlap multiple features are ", " dropped and not counted. When ", list("inter.feature=FALSE"), " multiple feature ", " overlap is ignored and reads are counted once for each feature they map ", " to. This essentially reduces modes ", list("Union"), " and ", " ", list("IntersectionStrict"), " to ", list("countOverlaps"), " with ", " ", list("type="any""), ", and ", list("type="within""),
", respectively.
", " ", list("IntersectionNotEmpty"), " is not reduced to a derivative of ", " ", list("countOverlaps"), " because the shared regions are removed before ", " counting. ", " ", " The ", list("BamViews"), ", ", list("BamFile"), " and ", list("BamFileList"), " methods ", " summarize overlaps across one or several files. The latter uses ", " ", list("bplapply"), "; control parallel evaluation using the ", " ", list(list("register")), " interface in the ",
list("BiocParallel"), " package.
", " ")), " ", " ", list(list("features :"), list(" ", " A ", list("feature"), " can be any portion of a genomic region such as a gene, ", " transcript, exon etc. When the ", list("features"), " argument is a ", " ", list("GRanges"), " the rows define the features. The result ", " will be the same length as the ", list("GRanges"), ". When ", " ", list("features"), " is a ", list("GRangesList"), " the highest ", " list-level defines the features and the result will be the same length ",
" as the ", list("GRangesList"), ".
", " ", " When ", list("inter.feature=TRUE"), ", each count ", list("mode"), " attempts to ", " assign a read that overlaps multiple features to a single feature. If ", " there are ranges that should be considered together (e.g., exons by ", " transcript or cds regions by gene) the ", list("GRangesList"), " ", " would be appropriate. If there is no grouping in the data then a ", " ", list("GRanges"), " would be appropriate. ",
" ")), "
", " ", list(list("paired-end reads :"), list(" ", " Paired-end reads are counted as a single hit if one or both parts ", " of the pair are overlapped. Paired-end records can be counted in ", " a ", list("GAlignmentPairs"), " container or BAM file. ", " ", " Counting pairs in BAM files: ", " ", list(" ", " ", list(), list("The ", list("singleEnd"), " argument should be FALSE."), " ", " ", list(), list("When ", list("reads"), " are supplied as a BamFile or BamFileList, ",
" the ", list("asMates"), " argument to the BamFile should be TRUE."), "
", " ", list(), list("When ", list("fragments"), " is FALSE, a ", list("GAlignmentPairs"), " ", " object is used in counting (pairs only)."), " ", " ", list(), list("When ", list("fragments"), " is TRUE, a ", list("GAlignmentsList"), " ", " object is used in counting (pairs, singletons, unmapped ", " mates, etc.)"), " ", " "), " ", " ")), " ",
" ")
Value
A RangedSummarizedExperiment object. The
assays
slot holds the counts, rowRanges
holds the annotation
from features
.
When reads
is a BamFile
or BamFileList
colData
is an empty DataFrame with a single row named counts . If
count.mapped.reads=TRUE
, colData
holds the output of
countBam
in 3 columns named records (total records),
nucleotides and mapped (mapped records).
When features
is a BamViews
colData
includes
2 columns named bamSamples
and bamIndices
.
In all other cases, colData
has columns of object
(class of reads) and records (length of reads
).
Seealso
The DESeq2 , DEXSeq and edgeR packages.
The RangedSummarizedExperiment class defined in the SummarizedExperiment package.
The GAlignments and GAlignmentPairs classes.
The BamFileList and BamViews classes in the Rsamtools package.
The readGAlignments and readGAlignmentPairs functions.
Author
Valerie Obenchain Valerie.Obenchain@RoswellPark.org
References
HTSeq : http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html
htseq-count : http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
Examples
reads <- GAlignments(
names = c("a","b","c","d","e","f","g"),
seqnames = Rle(c(rep(c("chr1", "chr2"), 3), "chr1")),
pos = as.integer(c(1400, 2700, 3400, 7100, 4000, 3100, 5200)),
cigar = c("500M", "100M", "300M", "500M", "300M",
"50M200N50M", "50M150N50M"),
strand = strand(rep("+", 7)))
gr <- GRanges(
seqnames = c(rep("chr1", 7), rep("chr2", 4)), strand = "+",
ranges = IRanges(c(1000, 3000, 3600, 4000, 4000, 5000, 5400,
2000, 3000, 7000, 7500),
width = c(500, 500, 300, 500, 900, 500, 500,
900, 500, 600, 300),
names=c("A", "B", "C1", "C2", "D1", "D2", "E", "F",
"G", "H1", "H2")))
groups <- factor(c(1,2,3,3,4,4,5,6,7,8,8))
grl <- splitAsList(gr, groups)
names(grl) <- LETTERS[seq_along(grl)]
## ---------------------------------------------------------------------
## Counting modes.
## ---------------------------------------------------------------------
## First count with a GRanges as the 'features'. 'Union' is the
## most conservative counting mode followed by 'IntersectionStrict'
## then 'IntersectionNotEmpty'.
counts1 <-
data.frame(union=assays(summarizeOverlaps(gr, reads))$counts,
intStrict=assays(summarizeOverlaps(gr, reads,
mode="IntersectionStrict"))$counts,
intNotEmpty=assays(summarizeOverlaps(gr, reads,
mode="IntersectionNotEmpty"))$counts)
colSums(counts1)
## Split the 'features' into a GRangesList and count again.
counts2 <-
data.frame(union=assays(summarizeOverlaps(grl, reads))$counts,
intStrict=assays(summarizeOverlaps(grl, reads,
mode="IntersectionStrict"))$counts,
intNotEmpty=assays(summarizeOverlaps(grl, reads,
mode="IntersectionNotEmpty"))$counts)
colSums(counts2)
## The GRangesList ('grl' object) has 8 features whereas the GRanges
## ('gr' object) has 11. The affect on counting can be seen by looking
## at feature 'H' with mode 'Union'. In the GRanges this feature is
## represented by ranges 'H1' and 'H2',
gr[c("H1", "H2")]
## and by list element 'H' in the GRangesList,
grl["H"]
## Read "d" hits both 'H1' and 'H2'. This is considered a multi-hit when
## using a GRanges (each range is a separate feature) so the read was
## dropped and not counted.
counts1[c("H1", "H2"), ]
## When using a GRangesList, each list element is considered a feature.
## The read hits multiple ranges within list element 'H' but only one
## list element. This is not considered a multi-hit so the read is counted.
counts2["H", ]
## ---------------------------------------------------------------------
## Counting multi-hit reads.
## ---------------------------------------------------------------------
## The goal of the counting modes is to provide a set of rules that
## resolve reads hitting multiple features so each read is counted
## a maximum of once. However, sometimes it may be desirable to count
## a read for each feature it overlaps. This can be accomplished by
## setting 'inter.feature' to FALSE.
## When 'inter.feature=FALSE', modes 'Union' and 'IntersectionStrict'
## essentially reduce to countOverlaps() with type="any" and
## type="within", respectively.
## When 'inter.feature=TRUE' only features "A", "F" and "G" have counts.
se1 <- summarizeOverlaps(gr, reads, mode="Union", inter.feature=TRUE)
assays(se1)$counts
## When 'inter.feature=FALSE' all 11 features have a count. There are
## 7 total reads so one or more reads were counted more than once.
se2 <- summarizeOverlaps(gr, reads, mode="Union", inter.feature=FALSE)
assays(se2)$counts
## ---------------------------------------------------------------------
## Counting BAM files.
## ---------------------------------------------------------------------
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")
## (i) Single-end :
## Large files can be iterated over in chunks by setting a
## 'yieldSize' on the BamFile.
bf_s <- BamFile(untreated1_chr4(), yieldSize=50000)
se_s <- summarizeOverlaps(exbygene, bf_s, singleEnd=TRUE)
table(assays(se_s)$counts > 0)
## When a character (file name) is provided as 'reads' instead
## of a BamFile object summarizeOverlaps() will create a BamFile
## and set a reasonable default 'yieldSize'.
## (ii) Paired-end :
## A paired-end file may contain singletons, reads with unmapped
## pairs or reads with more than two fragments. When 'fragments=FALSE'
## only reads paired by the algorithm are included in the counting.
nofrag <- summarizeOverlaps(exbygene, untreated3_chr4(),
singleEnd=FALSE, fragments=FALSE)
table(assays(nofrag)$counts > 0)
## When 'fragments=TRUE' all singletons, reads with unmapped pairs
## and other fragments will be included in the counting.
bf <- BamFile(untreated3_chr4(), asMates=TRUE)
frag <- summarizeOverlaps(exbygene, bf, singleEnd=FALSE, fragments=TRUE)
table(assays(frag)$counts > 0)
## As expected, using 'fragments=TRUE' results in a larger number
## of total counts because singletons, unmapped pairs etc. are
## included in the counting.
## Total reads in the file:
countBam(untreated3_chr4())
## Reads counted with 'fragments=FALSE':
sum(assays(nofrag)$counts)
## Reads counted with 'fragments=TRUE':
sum(assays(frag)$counts)
## ---------------------------------------------------------------------
## Use ouput of summarizeOverlaps() for differential expression analysis
## with DESeq2 or edgeR.
## ---------------------------------------------------------------------
fls <- list.files(system.file("extdata", package="GenomicAlignments"),
recursive=TRUE, pattern="*bam$", full=TRUE)
names(fls) <- basename(fls)
bf <- BamFileList(fls, index=character(), yieldSize=1000)
genes <- GRanges(
seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600,
4000, 7500, 5000, 5400),
width=c(rep(500, 3), 600, 900, 500, 300, 900,
300, 500, 500)))
se <- summarizeOverlaps(genes, bf)
## When the reads are BAM files, the 'colData' contains summary
## information from a call to countBam().
colData(se)
## Start differential expression analysis with the DESeq2 or edgeR
## package:
library(DESeq2)
deseq <- DESeqDataSet(se, design= ~ 1)
library(edgeR)
edger <- DGEList(assays(se)$counts, group=rownames(colData(se)))
## ---------------------------------------------------------------------
## Filter records by map quality before counting.
## (user-supplied 'mode' function)
## ---------------------------------------------------------------------
## The 'mode' argument can take a custom count function whose
## arguments are the same as those in the current counting modes
## (i.e., Union, IntersectionNotEmpty, IntersectionStrict).
## In this example records are filtered by map quality before counting.
mapq_filter <- function(features, reads, ignore.strand, inter.feature)
{
require(GenomicAlignments) # needed for parallel evaluation
Union(features, reads[mcols(reads)$mapq >= 20],
ignore.strand, inter.feature)
}
genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100))
param <- ScanBamParam(what="mapq")
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param)
assays(se)$counts
## The count function can be completely custom (i.e., not use the
## pre-defined count functions at all). Requirements are that
## the input arguments match the pre-defined modes and the output
## is a vector of counts the same length as 'features'.
my_count <- function(features, reads, ignore.strand, inter.feature) {
## perform filtering, or subsetting etc.
require(GenomicAlignments) # needed for parallel evaluation
countOverlaps(features, reads)
}
## ---------------------------------------------------------------------
## Preprocessing reads before counting with a standard count mode.
## (user-supplied 'preprocess.reads' function)
## ---------------------------------------------------------------------
## The 'preprocess.reads' argument takes a function that is
## applied to the reads before counting with a pre-defined mode.
ResizeReads <- function(reads, width=1, fix="start", ...) {
reads <- as(reads, "GRanges")
stopifnot(all(strand(reads) != "*"))
resize(reads, width=width, fix=fix, ...)
}
## By default ResizeReads() counts reads that overlap on the 5' end:
summarizeOverlaps(grl, reads, mode=Union, preprocess.reads=ResizeReads)
## Count reads that overlap on the 3' end by passing new values
## for 'width' and 'fix':
summarizeOverlaps(grl, reads, mode=Union, preprocess.reads=ResizeReads,
width=1, fix="end")
## ---------------------------------------------------------------------
## summarizeOverlaps() with BamViews.
## ---------------------------------------------------------------------
## bamSamples and bamPaths metadata are included in the colData.
## bamExperiment metadata is put into the metadata slot.
fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
rngs <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584)))
samp <- DataFrame(info="test", row.names="ex1")
view <- BamViews(fl, bamSamples=samp, bamRanges=rngs)
se <- summarizeOverlaps(view, mode=Union, ignore.strand=TRUE)
colData(se)
metadata(se)