bioconductor v3.9.0 GenomicAlignments

The GAlignmentPairs class is a container for storing list("pairs of genomic ", " alignments") . These pairs are typically obtained by aligning paired-end reads to a reference genome or transcriptome.

Details

A GAlignmentPairs object is a list-like object where each list element represents a pair of genomic alignment.

An alignment pair is made of a "first" and a "last"/"second" alignment, and is formally represented by a GAlignments object of length 2. In most applications, an alignment pair will represent an aligned paired-end read. In that case, the "first" member of the pair represents the alignment of the first end of the read (aka "first segment in the template", using SAM Spec terminology), and the "last" member of the pair represents the alignment of the second end of the read (aka "last segment in the template", using SAM Spec terminology).

In general, a GAlignmentPairs object will be created by loading records from a BAM (or SAM) file containing aligned paired-end reads, using the readGAlignmentPairs function (see below). Each element in the returned object will be obtained by pairing 2 records.

readGAlignmentPairs for reading aligned paired-end reads from a file (typically a BAM file) into a GAlignmentPairs object.
GAlignments objects for handling aligned single-end reads.
makeGAlignmentPairs for pairing the elements of a GAlignments object into a GAlignmentPairs object.
junctions-methods for extracting and summarizing junctions from a GAlignmentPairs object.
coverage-methods for computing the coverage of a GAlignmentPairs object.
findOverlaps-methods for finding range overlaps between a GAlignmentPairs object and another range-based object.
seqinfo in the GenomeInfoDb package for getting/setting/modifying the sequence information stored in an object.
The GRanges and GRangesList classes defined and documented in the GenomicRanges package.

Author

Hervé Pagès

Examples

library(Rsamtools)  # for the ex1.bam file
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
galp <- readGAlignmentPairs(ex1_file, use.names=TRUE, strandMode=1)
galp

length(galp)
head(galp)
head(names(galp))

first(galp)
last(galp)
# or
second(galp)

strandMode(galp)
first(galp, real.strand=TRUE)
last(galp, real.strand=TRUE)
strand(galp)

strandMode(galp) <- 2
first(galp, real.strand=TRUE)
last(galp, real.strand=TRUE)
strand(galp)

seqnames(galp)

head(njunc(galp))
table(isProperPair(galp))
seqlevels(galp)

## Rename the reference sequences:
seqlevels(galp) <- sub("seq", "chr", seqlevels(galp))
seqlevels(galp)

galp[[1]]
unlist(galp)

grglist(galp)  # a GRangesList object

strandMode(galp) <- 1
grglist(galp)

## Alternatively the strand mode can be changed with invertStrand():
invertStrand(galp)

stopifnot(identical(unname(elementNROWS(grglist(galp))), njunc(galp) + 2L))

granges(galp)  # a GRanges object

GAlignmentsList_class()

GAlignmentsList objects

Description

The GAlignmentsList class is a container for storing a collection of GAlignments objects.

Details

A GAlignmentsList object contains a list of GAlignments objects. The majority of operations on this page are described in more detail on the GAlignments man page, see ? GAlignments .

readGAlignmentsList for reading genomic alignments from a file (typically a BAM file) into a GAlignmentsList object.
GAlignments and GAlignmentPairs objects for handling aligned single- and paired-end reads, respectively.
junctions-methods for extracting and summarizing junctions from a GAlignmentsList object.
findOverlaps-methods for finding range overlaps between a GAlignmentsList object and another range-based object.
seqinfo in the GenomeInfoDb package for getting/setting/modifying the sequence information stored in an object.
The GRanges and GRangesList classes defined and documented in the GenomicRanges package.

Author

Valerie Obenchain Valerie.Obenchain@RoswellPark.org

References

http://samtools.sourceforge.net/

Examples

gal1 <- GAlignments(
seqnames=Rle(factor(c("chr1", "chr2", "chr1", "chr3")),
c(1, 3, 2, 4)),
pos=1:10, cigar=paste0(10:1, "M"),
strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
names=head(letters, 10), score=1:10)

gal2 <- GAlignments(
seqnames=Rle(factor(c("chr2", "chr4")), c(3, 4)), pos=1:7,
cigar=c("5M", "3M2N3M2N3M", "5M", "10M", "5M1N4M", "8M2N1M", "5M"),
strand=Rle(strand(c("-", "+")), c(4, 3)),
names=tail(letters, 7), score=1:7)

galist <- GAlignmentsList(noGaps=gal1, Gaps=gal2)


## ---------------------------------------------------------------------
## A. BASIC MANIPULATION
## ---------------------------------------------------------------------

length(galist)
names(galist)
seqnames(galist)
strand(galist)
head(cigar(galist))
head(qwidth(galist))
head(start(galist))
head(end(galist))
head(width(galist))
head(njunc(galist))
seqlevels(galist)

## Rename the reference sequences:
seqlevels(galist) <- sub("chr", "seq", seqlevels(galist))
seqlevels(galist)

grglist(galist)  # a GRangesList object
rglist(galist)   # an IRangesList object

## ---------------------------------------------------------------------
## B. SUBSETTING
## ---------------------------------------------------------------------

galist[strand(galist) == "-"]
has_junctions <- sapply(galist,
function(x) any(grepl("N", cigar(x), fixed=TRUE)))
galist[has_junctions]

## Different ways to subset:
galist[2]             # a GAlignments object of length 1
galist[[2]]           # a GAlignments object of length 1
grglist(galist[2])  # a GRangesList object of length 1
rglist(galist[2])   # a NormalIRangesList object of length 1

## ---------------------------------------------------------------------
## C. mcols()/elementMetadata()
## ---------------------------------------------------------------------

## Metadata can be defined on the individual GAlignment elements
## and the overall GAlignmentsList object. By default, 'level=between'
## extracts the GALignmentsList metadata. Using 'level=within'
## will extract the metadata on the individual GAlignments objects.

mcols(galist) ## no metadata on the GAlignmentsList object
mcols(galist, level="within")


## ---------------------------------------------------------------------
## D. readGAlignmentsList()
## ---------------------------------------------------------------------

library(pasillaBamSubset)

## 'file' as character.
fl <- untreated3_chr4()
galist1 <- readGAlignmentsList(fl)

galist1[1:3]
length(galist1)
table(elementNROWS(galist1))

## When 'file' is a BamFile, 'asMates' must be TRUE. If FALSE,
## the data are treated as single-end and each list element of the
## GAlignmentsList will be of length 1. For single-end data
## use readGAlignments() instead of readGAlignmentsList().
bf <- BamFile(fl, yieldSize=3, asMates=TRUE)
readGAlignmentsList(bf)

## Use a 'param' to fine tune the results.
param <- ScanBamParam(flag=scanBamFlag(isProperPair=TRUE))
galist2 <- readGAlignmentsList(fl, param=param)
length(galist2)


## ---------------------------------------------------------------------
## E. COERCION
## ---------------------------------------------------------------------

## The granges() and grlist() coercions support 'ignore.strand' to
## allow ranges from different strands to be combined. In this example
## paired-end reads aligned to opposite strands were read into a
## GAlignmentsList. If the desired operation is to combine these ranges,
## regardless of junctions or the space between pairs, 'ignore.strand'
## must be TRUE.
granges(galist[1])
granges(galist[1], ignore.strand=TRUE)

## grglist()
galist <- GAlignmentsList(noGaps=gal1, Gaps=gal2)
grglist(galist)
grglist(galist, ignore.strand=TRUE)

GAlignments_class()

GAlignments objects

Description

The GAlignments class is a simple container which purpose is to store a set of genomic alignments that will hold just enough information for supporting the operations described below.

Details

A GAlignments object is a vector-like object where each element describes a genomic alignment i.e. how a given sequence (called "query" or "read", typically short) aligns to a reference sequence (typically long).

Typically, a GAlignments object will be created by loading records from a BAM (or SAM) file and each element in the resulting object will correspond to a record. BAM/SAM records generally contain a lot of information but only part of that information is loaded in the GAlignments object. In particular, we discard the query sequences (SEQ field), the query qualities (QUAL), the mapping qualities (MAPQ) and any other information that is not needed in order to support the operations or methods described below.

This means that multi-reads (i.e. reads with multiple hits in the reference) won't receive any special treatment i.e. the various SAM/BAM records corresponding to a multi-read will show up in the GAlignments object as if they were coming from different/unrelated queries. Also paired-end reads will be treated as single-end reads and the pairing information will be lost (see ? for how to handle aligned paired-end reads).

Each element of a GAlignments object consists of:

The name of the reference sequence. (This is the RNAME field in a SAM/BAM record.)
The strand in the reference sequence to which the query is aligned. (This information is stored in the FLAG field in a SAM/BAM record.)
The CIGAR string in the "Extended CIGAR format" (see the SAM Format Specifications for the details).
The 1-based leftmost position/coordinate of the clipped query relative to the reference sequence. We will refer to it as the "start" of the query. (This is the POS field in a SAM/BAM record.)
The 1-based rightmost position/coordinate of the clipped query relative to the reference sequence. We will refer to it as the "end" of the query. (This is NOT explicitly stored in a SAM/BAM record but can be inferred from the POS and CIGAR fields.) Note that all positions/coordinates are always relative to the first base at the 5' end of the plus strand of the reference sequence, even when the query is aligned to the minus strand.
The genomic intervals between the "start" and "end" of the query that are "covered" by the alignment. Saying that the full [start,end] interval is covered is the same as saying that the alignment contains no junction (no N in the CIGAR). It is then considered to be a simple alignment. Note that a simple alignment can have mismatches or deletions (in the reference). In other words, a deletion (encoded with a D in the CIGAR) is NOT considered to introduce a gap in the coverage, but a junction is.

Note that the last 2 items are not expicitly stored in the GAlignments object: they are inferred on-the-fly from the CIGAR and the "start".

Optionally, a GAlignments object can have names (accessed thru the names generic function) which will be coming from the QNAME field of the SAM/BAM records.

The rest of this man page will focus on describing how to:

Access the information stored in a GAlignments object in a way that is independent from how the data are actually stored internally.
How to create and manipulate a GAlignments object.

readGAlignments for reading genomic alignments from a file (typically a BAM file) into a GAlignments object.
GAlignmentPairs objects for handling aligned paired-end reads.
junctions-methods for extracting and summarizing junctions from a GAlignments object.
coverage-methods for computing the coverage of a GAlignments object.
findOverlaps-methods for finding overlapping genomic alignments.
seqinfo in the GenomeInfoDb package for getting/setting/modifying the sequence information stored in an object.
The GRanges and GRangesList classes defined and documented in the GenomicRanges package.
The CompressedIRangesList class defined and documented in the IRanges package.

Author

Hervé Pagès and P. Aboyoun

References

http://samtools.sourceforge.net/

Examples

library(Rsamtools)  # for the ex1.bam file
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
gal <- readGAlignments(ex1_file, param=ScanBamParam(what="flag"))
gal

## ---------------------------------------------------------------------
## A. BASIC MANIPULATION
## ---------------------------------------------------------------------
length(gal)
head(gal)
names(gal)  # no names by default
seqnames(gal)
strand(gal)
head(cigar(gal))
head(qwidth(gal))
table(qwidth(gal))
head(start(gal))
head(end(gal))
head(width(gal))
head(njunc(gal))
seqlevels(gal)

## Invert the strand:
invertStrand(gal)

## Rename the reference sequences:
seqlevels(gal) <- sub("seq", "chr", seqlevels(gal))
seqlevels(gal)

grglist(gal)  # a GRangesList object
stopifnot(identical(unname(elementNROWS(grglist(gal))), njunc(gal) + 1L))
granges(gal)  # a GRanges object
rglist(gal)   # a CompressedIRangesList object
stopifnot(identical(unname(elementNROWS(rglist(gal))), njunc(gal) + 1L))
ranges(gal)   # an IRanges object

## Modify the number of lines in 'show'
options(showHeadLines=3)
options(showTailLines=2)
gal

## Revert to default
options(showHeadLines=NULL)
options(showTailLines=NULL)

## ---------------------------------------------------------------------
## B. SUBSETTING
## ---------------------------------------------------------------------
gal[strand(gal) == "-"]
gal[grep("I", cigar(gal), fixed=TRUE)]
gal[grep("N", cigar(gal), fixed=TRUE)]  # no junctions

## A confirmation that none of the alignments contains junctions (in
## other words, each alignment can be represented by a single genomic
## range on the reference):
stopifnot(all(njunc(gal) == 0))

## Different ways to subset:
gal[6]             # a GAlignments object of length 1
grglist(gal)[[6]]  # a GRanges object of length 1
rglist(gal)[[6]]   # a NormalIRanges object of length 1

## Unlike N operations, D operations don't introduce gaps:
ii <- grep("D", cigar(gal), fixed=TRUE)
gal[ii]
njunc(gal[ii])
grglist(gal[ii])

## qwidth() vs width():
gal[qwidth(gal) != width(gal)]

## This MUST return an empty object:
gal[cigar(gal) == "35M" & qwidth(gal) != 35]
## but this doesn't have too:
gal[cigar(gal) != "35M" & qwidth(gal) == 35]

GappedReads_class()

(Legacy) GappedReads objects

Description

The GappedReads class extends the GAlignments class.

A GappedReads object contains all the information contained in a GAlignments object plus the sequences of the queries. Those sequences can be accessed via the qseq accessor.

Author

Hervé Pagès

References

http://samtools.sourceforge.net/

Examples

greads_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
greads <- readGappedReads(greads_file)
greads
qseq(greads)

OverlapEncodings_class()

OverlapEncodings objects

Description

The OverlapEncodings class is a container for storing the "overlap encodings" returned by the encodeOverlaps function.

Usage

## -=-= OverlapEncodings getters =-=-
list(list("Loffset"), list("OverlapEncodings"))(x)
list(list("Roffset"), list("OverlapEncodings"))(x)
list(list("encoding"), list("OverlapEncodings"))(x)
list(list("levels"), list("OverlapEncodings"))(x)
list(list("flippedQuery"), list("OverlapEncodings"))(x)
## -=-= Coercing an OverlapEncodings object =-=-
list(list("as.data.frame"), list("OverlapEncodings"))(x, row.names=NULL, optional=FALSE, ...)
## -=-= Low-level encoding utilities =-=-
encodingHalves(x, single.end.on.left=FALSE, single.end.on.right=FALSE,
                  as.factors=FALSE)
Lencoding(x, ...)
Rencoding(x, ...)
list(list("njunc"), list("ANY"))(x)
Lnjunc(x, single.end.on.left=FALSE)
Rnjunc(x, single.end.on.right=FALSE)
isCompatibleWithSplicing(x)

Arguments

Argument	Description
`x`	An OverlapEncodings object. For the low-level encoding utilities, `x` can also be a character vector or factor containing encodings.
`row.names`	`NULL` or a character vector.
`optional`	Ignored.
`...`	Extra arguments passed to the `as.data.frame` method for OverlapEncodings objects are ignored. Extra arguments passed to `Lencoding` or `Rencoding` are passed down to `encodingHalves` .
`single.end.on.left, single.end.on.right`	By default the 2 halves of a single-end encoding are considered to be NAs. If `single.end.on.left` (resp. `single.end.on.right` ) is `TRUE` , then the left (resp. right) half of a single-end encoding is considered to be the unmodified encoding.
`as.factors`	By default `encodingHalves` returns the 2 encoding halves as a list of 2 character vectors parallel to the input. If `as.factors` is `TRUE` , then it returns them as a list of 2 factors parallel to the input.

Details

Given a query and a subject of the same length, both list-like objects with top-level elements typically containing multiple ranges (e.g. IntegerRangesList objects), the "overlap encoding" of the i-th element in query and i-th element in subject is a character string describing how the ranges in query[[i]] are qualitatively positioned relatively to the ranges in subject[[i]] .

The encodeOverlaps function computes those overlap encodings and returns them in an OverlapEncodings object of the same length as query and subject .

The topic of working with overlap encodings is covered in details in the "OverlapEncodings" vignette located this package ( GenomicAlignments ) and accessible with vignette("OverlapEncodings") .

Author

Hervé Pagès

Examples

## ---------------------------------------------------------------------
## A. BASIC MANIPULATION OF AN OverlapEncodings OBJECT
## ---------------------------------------------------------------------

example(encodeOverlaps)  # to generate the 'ovenc' object

length(ovenc)
Loffset(ovenc)
Roffset(ovenc)
encoding(ovenc)
levels(ovenc)
nlevels(ovenc)
flippedQuery(ovenc)
njunc(ovenc)

as.data.frame(ovenc)
njunc(levels(ovenc))

## ---------------------------------------------------------------------
## B. WORKING WITH PAIRED-END ENCODINGS (POSSIBLY MIXED WITH SINGLE-END
##    ENCODINGS)
## ---------------------------------------------------------------------

encodings <- c("4:jmmm:agmm:aagm:aaaf:", "3--1:jmm--b:agm--i:")

encodingHalves(encodings)
encodingHalves(encodings, single.end.on.left=TRUE)
encodingHalves(encodings, single.end.on.right=TRUE)
encodingHalves(encodings, single.end.on.left=TRUE,
single.end.on.right=TRUE)

Lencoding(encodings)
Lencoding(encodings, single.end.on.left=TRUE)
Rencoding(encodings)
Rencoding(encodings, single.end.on.right=TRUE)

njunc(encodings)
Lnjunc(encodings)
Lnjunc(encodings, single.end.on.left=TRUE)
Rnjunc(encodings)
Rnjunc(encodings, single.end.on.right=TRUE)

## ---------------------------------------------------------------------
## C. DETECTION OF "SPLICE COMPATIBLE" OVERLAPS
## ---------------------------------------------------------------------

## Reads that are compatible with the splicing of the transcript can
## be detected with a regular expression (the regular expression below
## assumes that reads have at most 2 junctions):
|regex0 <- "(:[fgij]:|:[jg].:.[gf]:|:[jg]..:.g.:..[gf]:)"|
grepl(regex0, encoding(ovenc))  # read4 is NOT "compatible"

## This was for illustration purpose only. In practise you don't need
## (and should not) use this regular expression, but use instead the
## isCompatibleWithSplicing() utility function:
isCompatibleWithSplicing(ovenc)

cigar_utils()

CIGAR utility functions

Description

Utility functions for low-level CIGAR manipulation.

Usage

## -=-= Supported CIGAR operations =-=-
CIGAR_OPS
## -=-= Transform CIGARs into other useful representations =-=-
explodeCigarOps(cigar, ops=CIGAR_OPS)
explodeCigarOpLengths(cigar, ops=CIGAR_OPS)
cigarToRleList(cigar)
## -=-= Summarize CIGARs =-=-
cigarOpTable(cigar)
## -=-= From CIGARs to ranges =-=-
cigarRangesAlongReferenceSpace(cigar, flag=NULL,
        N.regions.removed=FALSE, pos=1L, f=NULL,
        ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE,
        with.ops=FALSE)
cigarRangesAlongQuerySpace(cigar, flag=NULL,
        before.hard.clipping=FALSE, after.soft.clipping=FALSE,
        ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE,
        with.ops=FALSE)
cigarRangesAlongPairwiseSpace(cigar, flag=NULL,
        N.regions.removed=FALSE, dense=FALSE,
        ops=CIGAR_OPS, drop.empty.ranges=FALSE, reduce.ranges=FALSE,
        with.ops=FALSE)
extractAlignmentRangesOnReference(cigar, pos=1L,
        drop.D.ranges=FALSE, f=NULL)
## -=-= From CIGARs to sequence lengths =-=-
cigarWidthAlongReferenceSpace(cigar, flag=NULL,
        N.regions.removed=FALSE)
cigarWidthAlongQuerySpace(cigar, flag=NULL,
        before.hard.clipping=FALSE, after.soft.clipping=FALSE)
cigarWidthAlongPairwiseSpace(cigar, flag=NULL,
        N.regions.removed=FALSE, dense=FALSE)
## -=-= Narrow CIGARs =-=-
cigarNarrow(cigar, start=NA, end=NA, width=NA)
cigarQNarrow(cigar, start=NA, end=NA, width=NA)
## -=-= Translate coordinates between query and reference spaces =-=-
queryLoc2refLoc(qloc, cigar, pos=1L)
queryLocs2refLocs(qlocs, cigar, pos=1L, flag=NULL)

Arguments

Argument	Description
`cigar`	A character vector or factor containing the extended CIGAR strings. It can be of arbitrary length except for `queryLoc2refLoc` which only accepts a single CIGAR (as a character vector or factor of length 1).
`ops`	Character vector containing the extended CIGAR operations to actually consider. Zero-length operations or operations not listed `ops` are ignored.
`flag`	`NULL` or an integer vector containing the SAM flag for each read. According to the SAM Spec v1.4, flag bit 0x4 is the only reliable place to tell whether a segment (or read) is mapped (bit is 0) or not (bit is 1). If `flag` is supplied, then `cigarRangesAlongReferenceSpace` , `cigarRangesAlongQuerySpace` , `cigarRangesAlongPairwiseSpace` , and `extractAlignmentRangesOnReference` don't produce any range for unmapped reads i.e. they treat them as if their CIGAR was empty (independently of what their CIGAR is). If `flag` is supplied, then `cigarWidthAlongReferenceSpace` , `cigarWidthAlongQuerySpace` , and `cigarWidthAlongPairwiseSpace` return `NA` s for unmapped reads.
`N.regions.removed`	`TRUE` or `FALSE` . If `TRUE` , then `cigarRangesAlongReferenceSpace` and `cigarWidthAlongReferenceSpace` report ranges/widths with respect to the "reference" space from which the N regions have been removed, and `cigarRangesAlongPairwiseSpace` and `cigarWidthAlongPairwiseSpace` report them with respect to the "pairwise" space from which the N regions have been removed.
`pos`	An integer vector containing the 1-based leftmost position/coordinate for each (eventually clipped) read sequence. Must have length 1 (in which case it's recycled to the length of `cigar` ), or the same length as `cigar` .
`f`	`NULL` or a factor of length `cigar` . If `NULL` , then the ranges are grouped by alignment i.e. the returned IRangesList object has 1 list element per element in `cigar` . Otherwise they are grouped by factor level i.e. the returned IRangesList object has 1 list element per level in `f` and is named with those levels. For example, if `f` is a factor containing the chromosome for each read, then the returned IRangesList object will have 1 list element per chromosome and each list element will contain all the ranges on that chromosome.
`drop.empty.ranges`	Should empty ranges be dropped?
`reduce.ranges`	Should adjacent ranges coming from the same cigar be merged or not? Using `TRUE` can significantly reduce the size of the returned object.
`with.ops`	`TRUE` or `FALSE` indicating whether the returned ranges should be named with their corresponding CIGAR operation.
`before.hard.clipping`	`TRUE` or `FALSE` . If `TRUE` , then `cigarRangesAlongQuerySpace` and `cigarWidthAlongQuerySpace` report ranges/widths with respect to the "query" space to which the H regions have been added. `before.hard.clipping` and `after.soft.clipping` cannot both be `TRUE` .
`after.soft.clipping`	`TRUE` or `FALSE` . If `TRUE` , then `cigarRangesAlongQuerySpace` and `cigarWidthAlongQuerySpace` report ranges/widths with respect to the "query" space from which the S regions have been removed. `before.hard.clipping` and `after.soft.clipping` cannot both be `TRUE` .
`dense`	`TRUE` or `FALSE` . If `TRUE` , then `cigarRangesAlongPairwiseSpace` and `cigarWidthAlongPairwiseSpace` report ranges/widths with respect to the "pairwise" space from which the I, D, and N regions have been removed. `N.regions.removed` and `dense` cannot both be `TRUE` .
`drop.D.ranges`	Should the ranges corresponding to a deletion from the reference (encoded with a D in the CIGAR) be dropped? By default we keep them to be consistent with the pileup tool from SAMtools. Note that, when `drop.D.ranges` is `TRUE` , then Ds and Ns in the CIGAR are equivalent.
`start,end,width`	Vectors of integers. NAs and negative values are accepted and "solved" according to the rules of the SEW (Start/End/Width) interface (see `?` for the details).
`qloc`	An integer vector containing "query-based locations" i.e. 1-based locations relative to the query sequence stored in the SAM/BAM file.
`qlocs`	A list of the same length as `cigar` where each element is an integer vector containing "query-based locations" i.e. 1-based locations relative to the corresponding query sequence stored in the SAM/BAM file.

Value

CIGAR_OPS is a predefined character vector containing the supported extended CIGAR operations: M, I, D, N, S, H, P, =, X. See p. 4 of the SAM Spec v1.4 at http://samtools.sourceforge.net/ for the list of extended CIGAR operations and their meanings.

For explodeCigarOps and explodeCigarOpLengths : Both functions return a list of the same length as cigar where each list element is a character vector (for explodeCigarOps ) or an integer vector (for explodeCigarOpLengths ). The 2 lists have the same shape, that is, same length() and same elementNROWS() . The i-th character vector in the list returned by explodeCigarOps contains one single-letter string per CIGAR operation in cigar[i] . The i-th integer vector in the list returned by explodeCigarOpLengths contains the corresponding CIGAR operation lengths. Zero-length operations or operations not listed in ops are ignored.

For cigarToRleList : A CompressedRleList object.

For cigarOpTable : An integer matrix with number of rows equal to the length of cigar and nine columns, one for each extended CIGAR operation.

For cigarRangesAlongReferenceSpace , cigarRangesAlongQuerySpace , cigarRangesAlongPairwiseSpace , and extractAlignmentRangesOnReference : An IRangesList object (more precisely a CompressedIRangesList object) with 1 list element per element in cigar . However, if f is a factor, then the returned IRangesList object can be a SimpleIRangesList object (instead of CompressedIRangesList ), and in that case, has 1 list element per level in f and is named with those levels.

For cigarWidthAlongReferenceSpace and cigarWidthAlongPairwiseSpace : An integer vector of the same length as cigar where each element is the width of the alignment with respect to the "reference" and "pairwise" space, respectively. More precisely, for cigarWidthAlongReferenceSpace , the returned widths are the lengths of the alignments on the reference, N gaps included (except if N.regions.removed is TRUE ). NAs or "*" in cigar will produce NAs in the returned vector.

For cigarWidthAlongQuerySpace : An integer vector of the same length as cigar where each element is the length of the corresponding query sequence as inferred from the CIGAR string. Note that, by default (i.e. if before.hard.clipping and after.soft.clipping are FALSE ), this is the length of the query sequence stored in the SAM/BAM file. If before.hard.clipping or after.soft.clipping is TRUE , the returned widths are the lengths of the query sequences before hard clipping or after soft clipping. NAs or "*" in cigar will produce NAs in the returned vector.

For cigarNarrow and cigarQNarrow : A character vector of the same length as cigar containing the narrowed cigars. In addition the vector has an "rshift" attribute which is an integer vector of the same length as cigar . It contains the values that would need to be added to the POS field of a SAM/BAM file as a consequence of this cigar narrowing.

For queryLoc2refLoc : An integer vector of the same length as qloc containing the "reference-based locations" (i.e. the 1-based locations relative to the reference sequence) corresponding to the "query-based locations" passed in qloc .

For queryLocs2refLocs : A list of the same length as qlocs where each element is an integer vector containing the "reference-based locations" corresponding to the "query-based locations" passed in the corresponding element in qlocs .

The sequenceLayer function in the GenomicAlignments package for laying the query sequences alongside the "reference" or "pairwise" spaces.
The GAlignments container for storing a set of genomic alignments.
The IRanges , IRangesList , and RleList classes in the IRanges package.
The coverage generic and methods for computing the coverage across a set of ranges or genomic ranges.

Author

Hervé Pagès & P. Aboyoun

References

http://samtools.sourceforge.net/

Examples

## ---------------------------------------------------------------------
## A. CIGAR_OPS, explodeCigarOps(), explodeCigarOpLengths(),
##    cigarToRleList(), and cigarOpTable()
## ---------------------------------------------------------------------

## Supported CIGAR operations:
CIGAR_OPS

## Transform CIGARs into other useful representations:
cigar1 <- "3H15M55N4M2I6M2D5M6S"
cigar2 <- c("40M2I9M", cigar1, "2S10M2000N15M", "3H33M5H")

explodeCigarOps(cigar2)
explodeCigarOpLengths(cigar2)
explodeCigarOpLengths(cigar2, ops=c("I", "S"))
cigarToRleList(cigar2)

## Summarize CIGARs:
cigarOpTable(cigar2)

## ---------------------------------------------------------------------
## B. From CIGARs to ranges and to sequence lengths
## ---------------------------------------------------------------------

## CIGAR ranges along the "reference" space:
cigarRangesAlongReferenceSpace(cigar1, with.ops=TRUE)[[1]]

cigarRangesAlongReferenceSpace(cigar1,
reduce.ranges=TRUE, with.ops=TRUE)[[1]]

ops <- setdiff(CIGAR_OPS, "N")

cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]

cigarRangesAlongReferenceSpace(cigar1, ops=ops,
reduce.ranges=TRUE, with.ops=TRUE)[[1]]

ops <- setdiff(CIGAR_OPS, c("D", "N"))

cigarRangesAlongReferenceSpace(cigar1, ops=ops, with.ops=TRUE)[[1]]

cigarWidthAlongReferenceSpace(cigar1)

pos2 <- c(1, 1001, 1,  351)

cigarRangesAlongReferenceSpace(cigar2, pos=pos2, with.ops=TRUE)

res1a <- extractAlignmentRangesOnReference(cigar2, pos=pos2)
res1b <- cigarRangesAlongReferenceSpace(cigar2,
pos=pos2,
ops=setdiff(CIGAR_OPS, "N"),
reduce.ranges=TRUE)
stopifnot(identical(res1a, res1b))

res2a <- extractAlignmentRangesOnReference(cigar2, pos=pos2,
drop.D.ranges=TRUE)
res2b <- cigarRangesAlongReferenceSpace(cigar2,
pos=pos2,
ops=setdiff(CIGAR_OPS, c("D", "N")),
reduce.ranges=TRUE)
stopifnot(identical(res2a, res2b))

seqnames <- factor(c("chr6", "chr6", "chr2", "chr6"),
levels=c("chr2", "chr6"))
extractAlignmentRangesOnReference(cigar2, pos=pos2, f=seqnames)

## CIGAR ranges along the "query" space:
cigarRangesAlongQuerySpace(cigar2, with.ops=TRUE)
cigarWidthAlongQuerySpace(cigar1)
cigarWidthAlongQuerySpace(cigar1, before.hard.clipping=TRUE)

## CIGAR ranges along the "pairwise" space:
cigarRangesAlongPairwiseSpace(cigar2, with.ops=TRUE)
cigarRangesAlongPairwiseSpace(cigar2, dense=TRUE, with.ops=TRUE)

## ---------------------------------------------------------------------
## C. COMPUTE THE COVERAGE OF THE READS STORED IN A BAM FILE
## ---------------------------------------------------------------------
## The information stored in a BAM file can be used to compute the
## "coverage" of the mapped reads i.e. the number of reads that hit any
## given position in the reference genome.
## The following function takes the path to a BAM file and returns an
## object representing the coverage of the mapped reads that are stored
## in the file. The returned object is an RleList object named with the
## names of the reference sequences that actually receive some coverage.

flag0 <- scanBamFlag(isUnmappedQuery=FALSE, isDuplicate=FALSE)

extractCoverageFromBAM <- function(bamfile)
{
stopifnot(is(bamfile, "BamFile"))
## This ScanBamParam object allows us to load only the necessary
## information from the file.
param <- ScanBamParam(flag=flag0, what=c("rname", "pos", "cigar"))
bam <- scanBam(bamfile, param=param)[[1]]
## Note that unmapped reads and reads that are PCR/optical duplicates
## have already been filtered out by using the ScanBamParam object
## above.
f <- factor(bam$rname, levels=seqlevels(bamfile))
irl <- extractAlignmentRangesOnReference(bam$cigar, pos=bam$pos, f=f)
coverage(irl, width=seqlengths(bamfile))
}

library(Rsamtools)
f1 <- system.file("extdata", "ex1.bam", package="Rsamtools")
cvg <- extractCoverageFromBAM(BamFile(f1))

## extractCoverageFromBAM() is equivalent but slightly more efficient
## than loading a GAlignments object and computing its coverage:
cvg2 <- coverage(readGAlignments(f1, param=ScanBamParam(flag=flag0)))
stopifnot(identical(cvg, cvg2))

## ---------------------------------------------------------------------
## D. cigarNarrow() and cigarQNarrow()
## ---------------------------------------------------------------------

## cigarNarrow():
cigarNarrow(cigar1)  # only drops the soft/hard clipping
cigarNarrow(cigar1, start=10)
cigarNarrow(cigar1, start=15)
cigarNarrow(cigar1, start=15, width=57)
cigarNarrow(cigar1, start=16)
#cigarNarrow(cigar1, start=16, width=55)  # ERROR! (empty cigar)
cigarNarrow(cigar1, start=71)
cigarNarrow(cigar1, start=72)
cigarNarrow(cigar1, start=75)

## cigarQNarrow():
cigarQNarrow(cigar1, start=4, end=-3)
cigarQNarrow(cigar1, start=10)
cigarQNarrow(cigar1, start=19)
cigarQNarrow(cigar1, start=24)

## ---------------------------------------------------------------------
## E. PERFORMANCE
## ---------------------------------------------------------------------

if (interactive()) {
## We simulate 20 millions aligned reads, all 40-mers. 95% of them
## align with no indels. 5% align with a big deletion in the
## reference. In the context of an RNAseq experiment, those 5% would
## be suspected to be "junction reads".
set.seed(123)
nreads <- 20000000L
njunctionreads <- nreads * 5L / 100L
cigar3 <- character(nreads)
cigar3[] <- "40M"
junctioncigars <- paste(
paste(10:30, "M", sep=""),
paste(sample(80:8000, njunctionreads, replace=TRUE), "N", sep=""),
paste(30:10, "M", sep=""), sep="")
cigar3[sample(nreads, njunctionreads)] <- junctioncigars
some_fake_rnames <- paste("chr", c(1:6, "X"), sep="")
rname <- factor(sample(some_fake_rnames, nreads, replace=TRUE),
levels=some_fake_rnames)
pos <- sample(80000000L, nreads, replace=TRUE)

## The following takes < 3 sec. to complete:
system.time(irl1 <- extractAlignmentRangesOnReference(cigar3, pos=pos))

## The following takes < 4 sec. to complete:
system.time(irl2 <- extractAlignmentRangesOnReference(cigar3, pos=pos,
f=rname))

## The sizes of the resulting objects are about 240M and 160M,
## respectively:
object.size(irl1)
object.size(irl2)
}

coordinate_mapping_methods()

Map range coordinates between reads and genome space using CIGAR alignments

Description

Map range coordinates between reads (local) and genome (reference) space using the CIGAR in a GAlignments object.

See ? in the GenomicRanges package for mapping coordinates between features in the transcriptome and genome space.

Usage

list(list("mapToAlignments"), list("GenomicRanges,GAlignments"))(x, alignments, ...) 
list(list("pmapToAlignments"), list("GenomicRanges,GAlignments"))(x, alignments, ...) 
list(list("mapFromAlignments"), list("GenomicRanges,GAlignments"))(x, alignments, ...) 
list(list("pmapFromAlignments"), list("GenomicRanges,GAlignments"))(x, alignments, ...)

Arguments

Argument	Description
`x`	GenomicRanges object of positions to be mapped. `x` must have names when mapping to the genome.
`alignments`	A GAlignments object that represents the alignment of `x` to the genome. The `aligments` object must have names. When mapping to the genome names are used to determine mapping pairs and in the reverse direction they are used as the seqlevels of the output object.
`list()`	Arguments passed to other methods.

Details

These methods use a GAlignments object to represent the alignment between the ranges in x and the output. The following CIGAR operations in the "Extended CIGAR format" are used in the mapping algorithm:

M, X, = Sequence match or mismatch
I Insertion to the reference
D Deletion from the reference
N Skipped region from the reference
S Soft clip on the read
H Hard clip on the read
P Silent deletion from the padded reference
list(list("mapToAlignments"), ", ", list("pmapToAlignments")) list(" ", " The CIGAR is used to map the genomic (reference) position ", list("x"), " to ", " local coordinates. The mapped position starts at ", " ", list(" ", " start(x) - start(alignments) + 1 ", " "), " ", " and is incremented or decremented as the algorithm walks the length of ", " the CIGAR. A successful mapping in this direction requires that ", " ", list("x"), " fall within ", list("alignments"), ". ", " ", " The seqlevels of the return object are taken from the ", " ", list("alignments"), " object and will be a name descriptive of the read ", " or aligned region. In this direction, mapping is attempted between all ", " elements of ", list("x"), " and all elements of ", list("alignments"), ". ", " ")
list(list("mapFromAlignments"), ", ", list("pmapFromAlignments")) list(" ", " The CIGAR is used to map the local position ", list("x"), " to genomic ", " (reference) coordinates. The mapped position starts at ", " ", list(" ", " start(x) + start(alignments) - 1 ", " "), " ", " and is incremented or decremented as the algorithm walks the length of ", " the CIGAR. A successful mapping in this direction requires that the ", " width of ", list("alignments"), " is <= the width of ", list("x"), ". ", " ", " When mapping to the genome, name matching is used to determine the ", " mapping pairs (vs attempting to match all possible pairs). Ranges in ", " ", list("x"), " are only mapped to ranges in ", list("alignments"), " with the ", " same name. Name matching is motivated by use cases such as ", " differentially expressed regions where the expressed regions in ", " ", list("x"), " would only be related to a subset of regions in ", " ", list("alignments"), ", which may contains gene or transcript ranges. ", " ")
list("element-wise versions") list(" ", " ", list("pmapToAlignments"), " and ", list("pmapFromAlignments"), " are element-wise ", " (aka parallel) versions of ", list("mapToAlignments"), " and ", " ", list("mapFromAlignments"), ". The i-th range in ", list("x"), " is mapped to the ", " i-th range in ", list("alignments"), "; ", list("x"), " and ", list("alignments"), " must ", " have the same length. ", " ", " Ranges in ", list("x"), " that do not map (out of bounds) are returned as ", " zero-width ranges starting at 0. These ranges are given the special ", " seqname of "UNMAPPED". Note the non-parallel methods do not return ", " unmapped ranges so the "UNMAPPED" seqname is unique to ", " ", list("pmapToAlignments"), " and ", list("pmapFromAlignments"), ". ", " ")
list("strand") list(" ", " By SAM convention, the CIGAR string is reported for mapped reads on the ", " forward genomic strand. There is no need to consider strand in these ", " methods. The output of these methods will always be unstranded ", " (i.e., "*"). ", " ")

Value

An object the same class as x .

Parallel methods return an object the same shape as x . Ranges that cannot be mapped (out of bounds) are returned as zero-width ranges starting at 0 with a seqname of "UNMAPPED".

Non-parallel methods return an object that varies in length similar to a Hits object. The result only contains mapped records, out of bound ranges are not returned. xHits and alignmentsHits metadata columns indicate the elements of x and alignments used in the mapping.

When present, names from x are propagated to the output. When mapping locally, the seqlevels of the output are the names on the alignment object. When mapping globally, the output seqlevels are the seqlevels of alignment which are usually chromosome names.

Author

V. Obenchain, M. Lawrence and H. Pagès

Examples

## ---------------------------------------------------------------------
## A. Basic use
## ---------------------------------------------------------------------

## 1. Map to local space with mapToAlignments()
## ---------------------------------------------------------------------

## Mapping to local coordinates requires 'x' to be within 'alignments'.
## In this 'x', the second range is too long and can't be mapped.
alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="read_A")
x <- GRanges("chr1", IRanges(c(12, 12), width=c(6, 20)))
mapToAlignments(x, alignments)

## The element-wise version of the function returns unmapped ranges
## as zero-width ranges with a seqlevel of "UNMAPPED":
pmapToAlignments(x, c(alignments, alignments))

## Mapping the same range through different alignments demonstrates
## how the CIGAR operations affect the outcome.
ops <- c("no-op", "junction", "insertion", "deletion")
x <- GRanges(rep("chr1", 4), IRanges(rep(12, 4), width=rep(6, 4), names=ops))
alignments <- GAlignments(rep("chr1", 4), rep(10L, 4),
cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
strand = strand(rep("*", 4)),
names = paste0("region_", 1:4))
pmapToAlignments(x, alignments)

## 2. Map to genome space with mapFromAlignments()
## ---------------------------------------------------------------------

## One of the criteria when mapping to genomic coordinates is that the
## shifted 'x' range falls within 'alignments'. Here the first 'x'
## range has a shifted start value of 14 (5 + 10 - 1 = 14) with a width of
## 2 and so is successfully mapped. The second has a shifted start of 29
## (20 + 10 - 1 = 29) which is outside the range of 'alignments'.
x <- GRanges("chr1", IRanges(c(5, 20), width=2, names=rep("region_A", 2)))
alignments <- GAlignments("chr1", 10L, "11M", strand("*"), names="region_A")
mapFromAlignments(x, alignments)

## Another characteristic of mapping this direction is the name matching
## used to determine pairs. Mapping is only attempted between ranges in 'x'
## and 'alignments' with the same name. If we change the name of the first 'x'
## range, only the second will be mapped to 'alignment'. We know the second
## range fails to map so we get an empty result.
names(x) <- c("region_B", "region_A")
mapFromAlignments(x, alignments)

## CIGAR operations: insertions reduce the width of the output while
## junctions and deletions increase it.
ops <- c("no-op", "junction", "insertion", "deletion")
x <- GRanges(rep("chr1", 4), IRanges(rep(3, 4), width=rep(5, 4), names=ops))
alignments <- GAlignments(rep("chr1", 4), rep(10L, 4),
cigar = c("11M", "5M2N4M", "5M2I4M", "5M2D4M"),
strand = strand(rep("*", 4)))
pmapFromAlignments(x, alignments)

## ---------------------------------------------------------------------
## B. TATA box motif: mapping from read -> genome -> transcript
## ---------------------------------------------------------------------

## The TATA box motif is a conserved DNA sequence in the core promoter
## region. Many eukaryotic genes have a TATA box located approximately
## 25-35 base pairs upstream of the transcription start site. The motif is
## the binding site of general transcription factors or histones and
## plays a key role in transcription.

## In this example, the position of the TATA box motif (if present) is
## located in the DNA sequence corresponding to read ranges. The local
## motif positions are mapped to genome coordinates and then mapped
## to gene features such as promoters regions.

## Load reads from chromosome 4 of D. melanogaster (dm3):
library(pasillaBamSubset)
fl <- untreated1_chr4()
gal <- readGAlignments(fl)

## Extract DNA sequences corresponding to the read ranges:
library(GenomicFeatures)
library(BSgenome.Dmelanogaster.UCSC.dm3)
dna <- extractTranscriptSeqs(BSgenome.Dmelanogaster.UCSC.dm3, grglist(gal))

## Search for the consensus motif TATAAA in the sequences:
box <- vmatchPattern("TATAAA", dna)

## Some sequences had more than one match:
table(elementNROWS(box))

## The element-wise function we'll use for mapping to genome coordinates
## requires the two input argument to have the same length. We need to
## replicate the read ranges to match the number of motifs found.

## Expand the read ranges to match motifs found:
motif <- elementNROWS(box) != 0
alignments <- rep(gal[motif], elementNROWS(box)[motif])

## We make the IRanges into a GRanges object so the seqlevels can
## propagate to the output. Seqlevels are needed in the last mapping step.
readCoords <- GRanges(seqnames(alignments), unlist(box, use.names=FALSE))

## Map the local position of the motif to genome coordinates:
genomeCoords <- pmapFromAlignments(readCoords, alignments)
genomeCoords

## We are interested in the location of the TATA box motifs in the
## promoter regions. To perform the mapping we need the promoter ranges
## as a GRanges or GRangesList.

## Extract promoter regions 50 bp upstream from the transcription start site:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
promoters <- promoters(txdb, upstream=50, downstream=0)

## Map the genome coordinates to the promoters:
names(promoters) <- mcols(promoters)$tx_name  ## must be named
mapToTranscripts(genomeCoords, promoters)

coverage_methods()

Coverage of a GAlignments, GAlignmentPairs, or GAlignmentsList object

Description

coverage methods for GAlignments , GAlignmentPairs , GAlignmentsList , and BamFile objects.

NOTE: The coverage generic function and methods for IntegerRanges and IntegerRangesList objects are defined and documented in the IRanges package. Methods for GRanges and GRangesList objects are defined and documented in the GenomicRanges package.

Usage

list(list("coverage"), list("GAlignments"))(x, shift=0L, width=NULL, weight=1L,
         method=c("auto", "sort", "hash"), drop.D.ranges=FALSE)
list(list("coverage"), list("GAlignmentPairs"))(x, shift=0L, width=NULL, weight=1L,
         method=c("auto", "sort", "hash"), drop.D.ranges=FALSE)
list(list("coverage"), list("GAlignmentsList"))(x, shift=0L, width=NULL, weight=1L, ...)
list(list("coverage"), list("BamFile"))(x, shift=0L, width=NULL, weight=1L, ...,
         param=ScanBamParam())
list(list("coverage"), list("character"))(x, shift=0L, width=NULL, weight=1L, ...,
         yieldSize=2500000L)

Arguments

Argument	Description
`x`	A GAlignments , GAlignmentPairs , GAlignmentsList , or BamFile object, or the path to a BAM file.
`shift, width, weight`	See `coverage` method for GRanges objects in the GenomicRanges package.
`method`	See `?` in the IRanges package for a description of this argument.
`drop.D.ranges`	Whether the coverage calculation should ignore ranges corresponding to D (deletion) in the CIGAR string.
`...`	Additional arguments passed to the `coverage` method for GAlignments objects.
`param`	An optional ScanBamParam object passed to `readGAlignments` .
`yieldSize`	An optional argument controlling how many records are input when iterating through a BamFile .

Details

The methods for GAlignments and GAlignmentPairs objects do: list(" coverage(grglist(x, drop.D.ranges=drop.D.ranges), ...) ")

The method for GAlignmentsList objects does: list(" coverage(unlist(x), ...) ")

The method for BamFile objects iterates through a BAM file, reading yieldSize(x) records (or all records, if is.na(yieldSize(x)) ) and calculating: list(" gal <- readGAlignments(x, param=param) ", " coverage(gal, shift=shift, width=width, weight=weight, ...) ")

The method for character vectors of length 1 creates a BamFile object from x and performs the calculation for coverage,BamFile-method .

Value

A named RleList object with one coverage vector per seqlevel in x .

Examples

## ---------------------------------------------------------------------
## A. EXAMPLE WITH TOY DATA
## ---------------------------------------------------------------------

ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")

## Coverage of a GAlignments object:
gal <- readGAlignments(ex1_file)
cvg1 <- coverage(gal)
cvg1

## Coverage of a GAlignmentPairs object:
galp <- readGAlignmentPairs(ex1_file)
cvg2 <- coverage(galp)
cvg2

## Coverage of a GAlignmentsList object:
galist <- readGAlignmentsList(ex1_file)
cvg3 <- coverage(galist)
cvg3

table(mcols(galist)$mate_status)
mated_idx <- which(mcols(galist)$mate_status == "mated")
mated_galist <- galist[mated_idx]
mated_cvg3 <- coverage(mated_galist)
mated_cvg3

## Sanity checks:
stopifnot(identical(cvg1, cvg3))
stopifnot(identical( cvg2, mated_cvg3))

## ---------------------------------------------------------------------
## B. EXAMPLE WITH REAL DATA
## ---------------------------------------------------------------------

library(pasillaBamSubset)
## See '?pasillaBamSubset' for more information about the 2 BAM files
## included in this package.
reads <- readGAlignments(untreated3_chr4())
table(njunc(reads))  # data contains junction reads

## Junctions do NOT contribute to the coverage:
read1 <- reads[which(njunc(reads) != 0L)[1]]  # 1st read with a junction
read1  # cigar shows a "skipped region" of length 15306
grglist(read1)[[1]]  # the junction is between pos 4500 and 19807
coverage(read1)$chr4  # junction is not covered

## Sanity checks:
cvg <- coverage(reads)
read_chunks <- unlist(grglist(reads), use.names=FALSE)
read_chunks_per_chrom <- split(read_chunks, seqnames(read_chunks))
stopifnot(identical(sum(cvg), sum(width(read_chunks_per_chrom))))

galist <- readGAlignmentsList(untreated3_chr4())
stopifnot(identical(cvg, coverage(galist)))

encodeOverlaps_methods()

Encode the overlaps between RNA-seq reads and the transcripts of a gene model

Description

In the context of an RNA-seq experiment, encoding the overlaps between the aligned reads and the transcripts of a given gene model can be used for detecting those overlaps that are compatible with the splicing of the transcript.

The central tool for this is the encodeOverlaps method for GRangesList objects, which computes the "overlap encodings" between a query and a subject , both list-like objects with list elements containing multiple ranges.

Other related utilities are also documented in this man page.

Usage

encodeOverlaps(query, subject, hits=NULL, ...)
list(list("encodeOverlaps"), list("GRangesList,GRangesList"))(query, subject, hits=NULL,
               flip.query.if.wrong.strand=FALSE)
## Related utilities:
flipQuery(x, i)
selectEncodingWithCompatibleStrand(ovencA, ovencB,
                                   query.strand, subject.strand, hits=NULL)
isCompatibleWithSkippedExons(x, max.skipped.exons=NA)
extractSteppedExonRanks(x, for.query.right.end=FALSE)
extractSpannedExonRanks(x, for.query.right.end=FALSE)
extractSkippedExonRanks(x, for.query.right.end=FALSE)
extractQueryStartInTranscript(query, subject, hits=NULL, ovenc=NULL,
                              flip.query.if.wrong.strand=FALSE,
                              for.query.right.end=FALSE)

Arguments

Argument	Description
`query, subject`	Typically GRangesList objects representing the the aligned reads and the transcripts of a given gene model, respectively. If the 2 objects don't have the same length, and if the `hits` argument is not supplied, then the shortest is recycled to the length of the longest (the standard recycling rules apply). More generally speaking, `query` and `subject` must be list-like objects with list elements containing multiple ranges e.g. IntegerRangesList or GRangesList objects.
`hits`	An optional Hits object typically obtained from a previous call to `findOverlaps` . Strictly speaking, `hits` only needs to be compatible with `query` and `subject` , that is, `queryLength` and `subjectLength` must be equal to `length(query)` and `length(subject)` , respectively. Supplying `hits` is a convenient way to do `encodeOverlaps(query[queryHits(hits)], subject[subjectHits(hits)])` , that is, calling `encodeOverlaps(query, subject, hits)` is equivalent to the above, but is much more efficient, especially when `query` and/or `subject` are big. Of course, when `hits` is supplied, `query` and `subject` are not expected to have the same length anymore.
`...`	Additional arguments for methods.
`flip.query.if.wrong.strand`	See the "OverlapEncodings" vignette located in this package ( GenomicAlignments ).
`x`	For `flipQuery` : a GRangesList object. For `isCompatibleWithSkippedExons` , `extractSteppedExonRanks` , `extractSpannedExonRanks` , and `extractSkippedExonRanks` : an OverlapEncodings object, a factor, or a character vector.
`i`	Subscript specifying the elements in `x` to flip. If missing, all the elements are flipped.
`ovencA, ovencB, ovenc`	OverlapEncodings objects.
`query.strand, subject.strand`	Vector-like objects containing the strand of the query and subject, respectively.
`max.skipped.exons`	Not supported yet. If `NA` (the default), the number of skipped exons must be 1 or more (there is no max).
`for.query.right.end`	If `TRUE` , then the information reported in the output is for the right ends of the paired-end reads. Using `for.query.right.end=TRUE` with single-end reads is an error.

Details

See ?OverlapEncodings for a short introduction to "overlap encodings".

The topic of working with overlap encodings is covered in details in the "OverlapEncodings" vignette located this package ( GenomicAlignments ) and accessible with vignette("OverlapEncodings") .

Value

For encodeOverlaps : An OverlapEncodings object. If hits is not supplied, this object is parallel to the longest of query and subject , that is, it has the length of the longest and the i-th encoding in it corresponds to the i-th element in the longest. If hits is supplied, then the returned object is parallel to it, that is, it has one encoding per hit.

For flipQuery : TODO

For selectEncodingWithCompatibleStrand : TODO

For isCompatibleWithSkippedExons : A logical vector parallel to x .

For extractSteppedExonRanks , extractSpannedExonRanks , and extractSkippedExonRanks : TODO

For extractQueryStartInTranscript : TODO

The OverlapEncodings class for a brief introduction to "overlap encodings".
The Hits class defined and documented in the S4Vectors package.
The "OverlapEncodings" vignette in this package.
findCompatibleOverlaps for a specialized version of findOverlaps that uses encodeOverlaps internally to keep only the hits where the junctions in the aligned read are compatible with the splicing of the annotated transcript.
The GRangesList class defined and documented in the GenomicRanges package.
The findOverlaps generic function defined in the IRanges package.

Author

Hervé Pagès

Examples

## ---------------------------------------------------------------------
## A. BETWEEN 2 IntegerRangesList OBJECTS
## ---------------------------------------------------------------------
## In the context of an RNA-seq experiment, encoding the overlaps
## between 2 GRangesList objects, one containing the reads (the query),
## and one containing the transcripts (the subject), can be used for
## detecting hits between reads and transcripts that are "compatible"
## with the splicing of the transcript. Here we illustrate this with 2
## IntegerRangesList objects, in order to keep things simple:

## 4 aligned reads in the query:
read1 <- IRanges(c(7, 15, 22), c(9, 19, 23))  # 2 junctions
read2 <- IRanges(c(5, 15), c(9, 17))  # 1 junction
read3 <- IRanges(c(16, 22), c(19, 24))  # 1 junction
read4 <- IRanges(c(16, 23), c(19, 24))  # 1 junction
query <- IRangesList(read1, read2, read3, read4)

## 1 transcript in the subject:
tx <- IRanges(c(1, 4, 15, 22, 38), c(2, 9, 19, 25, 47))  # 5 exons
subject <- IRangesList(tx)

## Encode the overlaps:
ovenc <- encodeOverlaps(query, subject)
ovenc
encoding(ovenc)

## ---------------------------------------------------------------------
## B. BETWEEN 2 GRangesList OBJECTS
## ---------------------------------------------------------------------
## With real RNA-seq data, the reads and transcripts will typically be
## stored in GRangesList objects. Please refer to the "OverlapEncodings"
## vignette in this package for realistic examples.

findCompatibleOverlaps_methods()

Finding hits between reads and transcripts that are compatible with the splicing of the transcript

Description

In the context of an RNA-seq experiment, findCompatibleOverlaps (or countCompatibleOverlaps ) can be used for finding (or counting) hits between reads and transcripts that are compatible with the splicing of the transcript.

Usage

findCompatibleOverlaps(query, subject)
countCompatibleOverlaps(query, subject)

Arguments

Argument	Description
`query`	A GAlignments or GAlignmentPairs object representing the aligned reads.
`subject`	A GRangesList object representing the transcripts.

Details

findCompatibleOverlaps is a specialized version of findOverlaps that uses encodeOverlaps internally to keep only the hits where the junctions in the aligned read are compatible with the splicing of the annotated transcript.

The topic of working with overlap encodings is covered in details in the "OverlapEncodings" vignette located this package ( GenomicAlignments ) and accessible with vignette("OverlapEncodings") .

Value

A Hits object for findCompatibleOverlaps .

An integer vector parallel to (i.e. same length as) query for countCompatibleOverlaps .

Author

Hervé Pagès

Examples

## Here we only show a simple example illustrating the use of
## countCompatibleOverlaps() on a very small data set. Please
## refer to the "OverlapEncodings" vignette in the GenomicAlignments
## package for a comprehensive presentation of "overlap
## encodings" and related tools/concepts (e.g. "compatible"
## overlaps, "almost compatible" overlaps etc...), and for more
## examples.

## sm_treated1.bam contains a small subset of treated1.bam, a BAM
## file containing single-end reads from the "Pasilla" experiment
## (RNA-seq, Fly, see the pasilla data package for the details)
## and aligned to reference genome BDGP Release 5 (aka dm3 genome on
## the UCSC Genome Browser):
sm_treated1 <- system.file("extdata", "sm_treated1.bam",
package="GenomicAlignments", mustWork=TRUE)

## Load the alignments:
flag0 <- scanBamFlag(isDuplicate=FALSE, isNotPassingQualityControls=FALSE)
param0 <- ScanBamParam(flag=flag0)
gal <- readGAlignments(sm_treated1, use.names=TRUE, param=param0)

## Load the transcripts (IMPORTANT: Like always, the reference genome
## of the transcripts must be *exactly* the same as the reference
## genome used to align the reads):
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
exbytx <- exonsBy(txdb, by="tx", use.names=TRUE)

## Number of "compatible" transcripts per alignment in 'gal':
gal_ncomptx <- countCompatibleOverlaps(gal, exbytx)
mcols(gal)$ncomptx <- gal_ncomptx
table(gal_ncomptx)
mean(gal_ncomptx >= 1)
## --> 33% of the alignments in 'gal' are "compatible" with at least
## 1 transcript in 'exbytx'.

## Keep only alignments compatible with at least 1 transcript in
## 'exbytx':
compgal <- gal[gal_ncomptx >= 1]
head(compgal)

findMateAlignment()

Pairing the elements of a GAlignments object

Description

Utilities for pairing the elements of a GAlignments object.

NOTE: Until BioC 2.13, findMateAlignment was the power horse used by readGAlignmentPairs for pairing the records loaded from a BAM file containing aligned paired-end reads. Starting with BioC 2.14, readGAlignmentPairs relies on scanBam for the pairing.

Usage

findMateAlignment(x)
makeGAlignmentPairs(x, use.names=FALSE, use.mcols=FALSE, strandMode=1)
## Related low-level utilities:
getDumpedAlignments()
countDumpedAlignments()
flushDumpedAlignments()

Arguments

Argument	Description

|x | A named GAlignments object with metadata columns flag , mrnm , and mpos . Typically obtained by loading aligned paired-end reads from a BAM file with: list(" ", " param <- ScanBamParam(what=c("flag", "mrnm", "mpos")) ", " x <- readGAlignments(..., use.names=TRUE, param=param) ", " ") | |use.names | Whether the names on the input object should be propagated to the returned object or not. | |use.mcols | Names of the metadata columns to propagate to the returned GAlignmentPairs object. | |strandMode | Strand mode to set on the returned GAlignmentPairs object. See ? for more information. |

Details

list(list("Pairing algorithm used by findMateAlignment"), list(" ", list("findMateAlignment"), " is the power horse used by ", list("makeGAlignmentPairs"), " ", "for pairing the records loaded from a BAM file containing aligned paired-end ", "reads. ", " ", "It implements the following pairing algorithm: ", list(" ", " ", list(), " First, only records with flag bit 0x1 (multiple segments) set to 1, ", " flag bit 0x4 (segment unmapped) set to 0, and flag bit 0x8 (next ", " segment in the template unmapped) set to 0, are candidates for ",

"        pairing (see the SAM Spec for a description of flag bits and fields).

", " ", list("findMateAlignment"), " will ignore any other record. That is, ", " records that correspond to single-end reads, or records that ", " correspond to paired-end reads where one or both ends are unmapped, ", " are discarded. ", " ", " ", list(), " Then the algorithm looks at the following fields and flag bits: ", " ", list(" ", " ", list(), " (A) QNAME ",

    "          ", list(), " (B) RNAME, RNEXT

", " ", list(), " (C) POS, PNEXT ", " ", list(), " (D) Flag bits Ox10 (segment aligned to minus strand) ", " and 0x20 (next segment aligned to minus strand) ", " ", list(), " (E) Flag bits 0x40 (first segment in template) and 0x80 (last ", " segment in template) ", " ", list(), " (F) Flag bit 0x2 (proper pair) ", " ", list(), " (G) Flag bit 0x100 (secondary alignment) ",

    "        "), "

", " 2 records rec1 and rec2 are considered mates iff all the following ", " conditions are satisfied: ", " ", list(" ", " ", list(), " (A) QNAME(rec1) == QNAME(rec2) ", " ", list(), " (B) RNEXT(rec1) == RNAME(rec2) and RNEXT(rec2) == RNAME(rec1) ", " ", list(), " (C) PNEXT(rec1) == POS(rec2) and PNEXT(rec2) == POS(rec1) ", " ", list(), " (D) Flag bit 0x20 of rec1 == Flag bit 0x10 of rec2 and ", " Flag bit 0x20 of rec2 == Flag bit 0x10 of rec1 ",

    "          ", list(), " (E) rec1 corresponds to the first segment in the template and

", " rec2 corresponds to the last segment in the template, OR, ", " rec2 corresponds to the first segment in the template and ", " rec1 corresponds to the last segment in the template ", " ", list(), " (F) rec1 and rec2 have same flag bit 0x2 ", " ", list(), " (G) rec1 and rec2 have same flag bit 0x100 ", " "), " "),

"))

| list(list("Timing and memory requirement of the pairing algorithm"), list(" ", "The estimated timings and memory requirements on a modern Linux system are ", "(those numbers may vary depending on your hardware and OS): ", list(" ", " nb of alignments | time | required memory ", " -----------------+--------------+---------------- ", " 8 millions | 28 sec | 1.4 GB ", " 16 millions | 58 sec | 2.8 GB ", " 32 millions | 2 min | 5.6 GB ", | | " 64 millions | 4 min 30 sec | 11.2 GB "), " ", "This is for a ", list("GAlignments"), " object coming from a file with an ", ""average nb of records per unique QNAME" of 2.04. A value of 2 (which means ", "the file contains only primary reads) is optimal for the pairing algorithm. ", "A greater value, say > 3, will significantly degrade its performance. ", "An easy way to avoid this degradation is to load only primary alignments ", "by setting the ", list("isSecondaryAlignment"), |

" flag to ", list("FALSE"), " in

", "ScanBamParam(). See examples in ", list("?", list("readGAlignmentPairs")), " for how ", "to do this. "))

list(list("Ambiguous pairing"), list(" ", "The above algorithm will find almost all pairs unambiguously, even when ", "the same pair of reads maps to several places in the genome. Note ", "that, when a given pair maps to a single place in the genome, looking ", "at (A) is enough to pair the 2 corresponding records. The additional ", "conditions (B), (C), (D), (E), (F), and (G), are only here to help in ", "the situation where more than 2 records share the same QNAME. And that ", "works most of the times. Unfortunately there are still situations where ",

"this is not enough to solve the pairing problem unambiguously.

", " ", "For example, here are 4 records (loaded in a GAlignments object) ", "that cannot be paired with the above algorithm: ", " ", "Showing the 4 records as a GAlignments object of length 4: ", list(" ", "GAlignments with 4 alignments and 2 metadata columns: ", " seqnames strand cigar qwidth start end ", " ", | " SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ", " SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ", " SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 ", " SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 ", " width njunc | mrnm mpos ", " | ", " SRR031714.2658602 421 1 | chr2R 6983858 ", | | " SRR031714.2658602 421 1 | chr2R 6983858 ", " SRR031714.2658602 409 1 | chr2R 6983850 ", " SRR031714.2658602 415 1 | chr2R 6983850 "), " ", "Note that the BAM fields show up in the following columns: ", list(" ", " ", list(), " QNAME: the names of the GAlignments object (unnamed col) ", " ", list(), " RNAME: the seqnames col ", " ", list(), " POS: the start col ", " ", list(), " RNEXT: the mrnm col ", " ", list(), |

    " PNEXT: the mpos col

"), " ", "As you can see, the aligner has aligned the same pair to the same ", "location twice! The only difference between the 2 aligned pairs is in ", "the CIGAR i.e. one end of the pair is aligned twice to the same location ", "with exactly the same CIGAR while the other end of the pair is aligned ", "twice to the same location but with slightly different CIGARs. ", " ", "Now showing the corresponding flag bits: ", list(" ", " isPaired isProperPair isUnmappedQuery hasUnmappedMate isMinusStrand ",

    "[1,]        1            1               0               0             0

", "[2,] 1 1 0 0 0 ", "[3,] 1 1 0 0 1 ", "[4,] 1 1 0 0 1 ", " isMateMinusStrand isFirstMateRead isSecondMateRead isSecondaryAlignment ", "[1,] 1 0 1 0 ", "[2,] 1 0 1 0 ",

    "[3,]                 0               1                0                    0

", "[4,] 0 1 0 0 ", " isNotPassingQualityControls isDuplicate ", "[1,] 0 0 ", "[2,] 0 0 ", "[3,] 0 0 ", "[4,] 0 0 "), " ", " ", "As you can see, rec(1) and rec(2) are second mates, rec(3) and rec(4) ",

"are both first mates. But looking at (A), (B), (C), (D), (E), (F), and (G),

", "the pairs could be rec(1) <-> rec(3) and rec(2) <-> rec(4), or they could ", "be rec(1) <-> rec(4) and rec(2) <-> rec(3). There is no way to ", "disambiguate! ", " ", "So ", list("findMateAlignment"), " is just ignoring (with a warning) those alignments ", "with ambiguous pairing, and dumping them in a place from which they can be ", "retrieved later (i.e. after ", list("findMateAlignment"), " has returned) for ",

"further examination (see "Dumped alignments" subsection below for the details).

", "In other words, alignments that cannot be paired unambiguously are not paired ", "at all. Concretely, this means that ", list(list("readGAlignmentPairs")), " is ", "guaranteed to return a ", list("GAlignmentPairs"), " object ", "where every pair was formed in an non-ambiguous way. Note that, in practice, ", "this approach doesn't seem to leave aside a lot of records because ambiguous ", "pairing events seem pretty rare. "))

list(list("Dumped alignments"), list(" ", "Alignments with ambiguous pairing are dumped in a place ("the dump ", "environment") from which they can be retrieved with ", list("getDumpedAlignments()"), " after ", list("findMateAlignment"), " has returned. ", " ", "Two additional utilities are provided for manipulation of the dumped ", "alignments: ", list("countDumpedAlignments"), " for counting them (a fast equivalent ", "to ", list("length(getDumpedAlignments())"), "), and ", list("flushDumpedAlignments"),

" to

", "flush "the dump environment". Note that "the dump environment" is ", "automatically flushed at the beginning of a call to ", list("findMateAlignment"), ". "))

Value

For findMateAlignment : An integer vector of the same length as x , containing only positive or NA values, where the i-th element is interpreted as follow:

An NA value means that no mate or more than 1 mate was found for x[i] .
A non-NA value j gives the index in x of x[i] 's mate.

For makeGAlignmentPairs : A GAlignmentPairs object where the pairs are formed internally by calling findMateAlignment on x .

For getDumpedAlignments : NULL or a GAlignments object containing the dumped alignments. See "Dumped alignments" subsection in the "Details" section above for the details.

For countDumpedAlignments : The number of dumped alignments.

Nothing for flushDumpedAlignments .

Author

Hervé Pagès

Examples

bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
param <- ScanBamParam(what=c("flag", "mrnm", "mpos"))
x <- readGAlignments(bamfile, use.names=TRUE, param=param)

mate <- findMateAlignment(x)
head(mate)
table(is.na(mate))
galp0 <- makeGAlignmentPairs(x)
galp <- makeGAlignmentPairs(x, use.name=TRUE, use.mcols="flag")
galp
colnames(mcols(galp))
colnames(mcols(first(galp)))
colnames(mcols(last(galp)))

findOverlaps_methods()

Finding overlapping genomic alignments

Description

Finds range overlaps between a GAlignments , GAlignmentPairs , or GAlignmentsList object, and another range-based object.

NOTE: The findOverlaps generic function and methods for IntegerRanges and IntegerRangesList objects are defined and documented in the IRanges package. The methods for GRanges and GRangesList objects are defined and documented in the GenomicRanges package.

GAlignments , GAlignmentPairs , and GAlignmentsList objects also support countOverlaps , overlapsAny , and subsetByOverlaps thanks to the default methods defined in the IRanges package and to the findOverlaps method defined in this package and documented below.

Usage

list(list("findOverlaps"), list("GAlignments,GAlignments"))(query, subject,
    maxgap=-1L, minoverlap=0L,
    type=c("any", "start", "end", "within"),
    select=c("all", "first", "last", "arbitrary"),
    ignore.strand=FALSE)

Arguments

Argument	Description
`query, subject`	A GAlignments , GAlignmentPairs , or GAlignmentsList object for either `query` or `subject` . A vector-like object containing ranges for the other one.
`maxgap, minoverlap, type, select`	See `?` in the IRanges package for a description of these arguments.
`ignore.strand`	When set to `TRUE` , the strand information is ignored in the overlap calculations.

Details

When the query or the subject (or both) is a GAlignments object, it is first turned into a GRangesList object (with as( , "GRangesList") ) and then the rules described previously apply. GAlignmentsList objects are coerced to GAlignments then to a GRangesList . Feature indices are mapped back to the original GAlignmentsList list elements.

When the query is a GAlignmentPairs object, it is first turned into a GRangesList object (with as( , "GRangesList") ) and then the rules described previously apply.

Value

A Hits object when select="all" or an integer vector otherwise.

Examples

ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
galn <- readGAlignments(ex1_file)

subject <- granges(galn)[1]

## Note the absence of query no. 9 (i.e. 'galn[9]') in this result:
as.matrix(findOverlaps(galn, subject))

## This is because, by default, findOverlaps()/countOverlaps() are
## strand specific:
galn[8:10]
countOverlaps(galn[8:10], subject)
countOverlaps(galn[8:10], subject, ignore.strand=TRUE)

## Count alignments in 'galn' that DO overlap with 'subject' vs those
## that do NOT:
table(overlapsAny(galn, subject))
## Extract those that DO:
subsetByOverlaps(galn, subject)

## GAlignmentsList
galist <- GAlignmentsList(galn[8:10], galn[3000:3002])
gr <- GRanges(c("seq1", "seq1", "seq2"),
IRanges(c(15, 18, 1233), width=1),
strand=c("-", "+", "+"))

countOverlaps(galist, gr)
countOverlaps(galist, gr, ignore.strand=TRUE)
findOverlaps(galist, gr)
findOverlaps(galist, gr, ignore.strand=TRUE)

findSpliceOverlaps_methods()

Classify ranges (reads) as compatible with existing genomic annotations or as having novel splice events

Description

The findSpliceOverlaps function identifies ranges (reads) that are compatible with a specific transcript isoform. The non-compatible ranges are analyzed for the presence of novel splice events.

Usage

findSpliceOverlaps(query, subject, ignore.strand=FALSE, ...)
list(list("findSpliceOverlaps"), list("GRangesList,GRangesList"))(query, subject, ignore.strand=FALSE, ..., cds=NULL)
list(list("findSpliceOverlaps"), list("GAlignments,GRangesList"))(query, subject, ignore.strand=FALSE, ..., cds=NULL)
list(list("findSpliceOverlaps"), list("GAlignmentPairs,GRangesList"))(query, subject, ignore.strand=FALSE, ..., cds=NULL)
list(list("findSpliceOverlaps"), list("BamFile,ANY"))(query, subject, ignore.strand=FALSE, ...,
                   param=ScanBamParam(), singleEnd=TRUE)

Arguments

Argument	Description
`query`	A GRangesList , GAlignments , GAlignmentPairs , or BamFile object containing the reads. Can also be a single string containing the path to a BAM file. Single or paired-end reads are specified with the `singleEnd` argument (default FALSE). Paired-end reads can be supplied in a BAM file or GAlignmentPairs object. Single-end are expected to be in a BAM file, GAlignments or GRanges object.
`subject`	A GRangesList containing the annotations. This list is expected to contain exons grouped by transcripts.
`ignore.strand`	When set to `TRUE` , strand information is ignored in the overlap calculations.
`...`	Additional arguments such as `param` and `singleEnd` used in the method for BamFile objects. See below.
`cds`	Optional GRangesList of coding regions for each transcript in the `subject` . If provided, the "coding" output column will be a `logical` vector indicating if the read falls in a coding region. When not provided, the "coding" output is `NA` .
`param`	An optional `ScanBamParam` instance to further influence scanning, counting, or filtering.
`singleEnd`	A logical value indicating if reads are single or paired-end. See `summarizeOverlaps` for more information.

Details

When a read maps compatibly and uniquely to a transcript isoform we can quantify the expression and look for shifts in the balance of isoform expression. If a read does not map in compatible way, novel splice events such as splice junctions, novel exons or retentions can be quantified and compared across samples.

findSpliceOverlaps detects which reads (query) match to transcripts (subject) in a compatible fashion. Compatibility is based on both the transcript bounds and splicing pattern. Assessing the splicing pattern involves comparision of the read splices (i.e., the N operations in the CIGAR) with the transcript introns. For paired-end reads, the inter-read gap is not considered a splice junction. The analysis of non-compatible reads for novel splice events is under construction.

Value

The output is a Hits object with the metadata columns defined below. Each column is a logical indicating if the read (query) met the criteria.

list("compatible: ") list("Every splice (N) in a read alignment matches ", " an intron in an annotated transcript. The read does not ", " extend into an intron or outside the transcript bounds. ", " ")
list("unique: ") list("The read is compatible with only one annotated ", " transcript. ", " ")
list("strandSpecific: ") list("The query (read) was stranded. ", " ")

Note

WARNING: The current implementation of findSpliceOverlaps doesn't work properly on paired-end reads where the 2 ends overlap!

Author

Michael Lawrence and Valerie Obenchain Valerie.Obenchain@RoswellPark.org

Examples

## -----------------------------------------------------------------------
## Isoform expression :
## -----------------------------------------------------------------------
## findSpliceOverlaps() can assist in quantifying isoform expression
## by identifying reads that map compatibly and uniquely to a
## transcript isoform.
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
library(pasillaBamSubset)
se <- untreated1_chr4() ## single-end reads
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
exbytx <- exonsBy(txdb, "tx")
cdsbytx <- cdsBy(txdb, "tx")
param <- ScanBamParam(which=GRanges("chr4", IRanges(1e5,3e5)))
sehits <- findSpliceOverlaps(se, exbytx, cds=cdsbytx, param=param)

## Tally the reads by category to get an idea of read distribution.
lst <- lapply(mcols(sehits), table)
nms <- names(lst)%in% c("compatible", "unique")
tbl <- do.call(rbind, lst[nms])
tbl

## Reads compatible with one or more transcript isoforms.
rnms <- rownames(tbl)
tbl[rnms == "compatible","TRUE"]/sum(tbl[rnms == "compatible",])

## Reads compatible with a single isoform.
tbl[rnms == "unique","TRUE"]/sum(tbl[rnms == "unique",])

## All reads fall in a coding region as defined by
## the txdb annotation.
lst[["coding"]]

## Check : Total number of reads should be the same across categories.
lapply(lst, sum)

## -----------------------------------------------------------------------
## Paired-end reads :
## -----------------------------------------------------------------------
## 'singleEnd' is set to FALSE for a BAM file with paired-end reads.
pe <- untreated3_chr4()
hits2 <- findSpliceOverlaps(pe, exbytx, singleEnd=FALSE, param=param)

## In addition to BAM files, paired-end reads can be supplied in a
## GAlignmentPairs object.
genes <- GRangesList(
GRanges("chr1", IRanges(c(5, 20), c(10, 25)), "+"),
GRanges("chr1", IRanges(c(5, 22), c(15, 25)), "+"))
galp <- GAlignmentPairs(
GAlignments("chr1", 5L, "11M4N6M", strand("+")),
GAlignments("chr1", 50L, "6M", strand("-")))
findSpliceOverlaps(galp, genes)

intra_range_methods()

Intra range transformations of a GAlignments or GAlignmentsList object

Description

This man page documents intra range transformations of a GAlignments or GAlignmentsList object.

See ?`` and?in the IRanges package for a quick introduction to intra range and inter range transformations. Intra range methods for [GRanges](#granges) and [GRangesList](#grangeslist) objects are defined and documented in the GenomicRanges package. ## Usage ```r list(list("qnarrow"), list("GAlignments"))(x, start=NA, end=NA, width=NA) list(list("qnarrow"), list("GAlignmentsList"))(x, start=NA, end=NA, width=NA) ``` ## Arguments |Argument |Description| |------------- |----------------| |`x` | A [GAlignments](#galignments) or [GAlignmentsList](#galignmentslist) object. | |`start, end, width` | Vectors of integers. NAs and negative values are accepted and "solved" according to the rules of the SEW (Start/End/Width) interface (see `?` for more information about the SEW interface). See `? for more information about the start , end , and width arguments. |

Details

() list(" ", " ", list("qnarrow"), " on a ", list("GAlignments"), " object behaves like ", list("narrow"), " ", " except that the ", list("start"), "/", list("end"), "/", list("width"), " arguments here ", " specify the narrowing with respect to the query sequences instead of ", " the reference sequences. ", " ", " ", list("qnarrow"), " on a ", list("GAlignmentsList"), " object ", " returns a ", list("GAlignmentsList"), " object. ", " ")

Value

An object of the same class as -- and parallel to (i.e. same length and names as) -- the original object x .

Note

There is no difference between narrow and qnarrow when all the alignments have a simple CIGAR (i.e. no indels or junctions).

Author

Hervé Pagès and V. Obenchain Valerie.Obenchain@RoswellPark.org

Examples

## ---------------------------------------------------------------------
## A. ON A GAlignments OBJECT
## ---------------------------------------------------------------------
ex1_file <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what=c("seq", "qual"))
gal <- readGAlignments(ex1_file, param=param)
gal

## This trims 3 nucleotides on the left and 5 nucleotides on the right
## of each alignment:
gal2 <- qnarrow(gal, start=4, end=-6)
gal2

## Note that the 'start' and 'end' values are relative to the query
## sequences and specify the query substring that must be kept for each
## alignment. Negative values are relative to the right end of the query
## sequence.

## Also note that the metadata columns on 'gal' are propagated as-is so
## the "seq" and "qual" matadata columns must be adjusted "by hand" with
## narrow();
mcols(gal2)$seq <- narrow(mcols(gal)$seq, start=4, end=-6)
mcols(gal2)$qual <- narrow(mcols(gal)$qual, start=4, end=-6)
gal2

## Sanity checks:
stopifnot(identical(qwidth(gal2), width(mcols(gal2)$seq)))
stopifnot(identical(qwidth(gal2), width(mcols(gal2)$qual)))

## ---------------------------------------------------------------------
## B. ON A GAlignmentsList OBJECT
## ---------------------------------------------------------------------
gal1 <- GAlignments(
seqnames=Rle(factor(c("chr1", "chr2", "chr1", "chr3")),
c(1, 3, 2, 4)),
pos=1:10, cigar=paste0(10:1, "M"),
strand=Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
names=head(letters, 10), score=1:10)

gal2 <- GAlignments(
seqnames=Rle(factor(c("chr2", "chr4")), c(3, 4)), pos=1:7,
cigar=c("5M", "3M2N3M2N3M", "5M", "10M", "5M1N4M", "8M2N1M", "5M"),
strand=Rle(strand(c("-", "+")), c(4, 3)),
names=tail(letters, 7), score=1:7)

galist <- GAlignmentsList(noGaps=gal1, Gaps=gal2)
galist

qnarrow(galist)

junctions_methods()

Extract junctions from genomic alignments

Description

Given an object x containing genomic alignments (e.g. a GAlignments , GAlignmentPairs , or GAlignmentsList object), junctions(x) extracts the junctions from it and summarizeJunctions(x) extracts and summarizes them.

readTopHatJunctions and readSTARJunctions are utilities for importing the junction file generated by the TopHat and STAR aligners, respectively.

Usage

## junctions() generic and methods
## -------------------------------
junctions(x, use.mcols=FALSE, ...)
list(list("junctions"), list("GAlignments"))(x, use.mcols=FALSE)
list(list("junctions"), list("GAlignmentPairs"))(x, use.mcols=FALSE)
list(list("junctions"), list("GAlignmentsList"))(x, use.mcols=FALSE, ignore.strand=FALSE)
## summarizeJunctions() and NATURAL_INTRON_MOTIFS
## ----------------------------------------------
summarizeJunctions(x, with.revmap=FALSE, genome=NULL)
NATURAL_INTRON_MOTIFS
## Utilities for importing the junction file generated by some aligners
## --------------------------------------------------------------------
readTopHatJunctions(file, file.is.raw.juncs=FALSE)
readSTARJunctions(file)

Arguments

Argument	Description
`x`	A GAlignments , GAlignmentPairs , or GAlignmentsList object.
`use.mcols`	`TRUE` or `FALSE` (the default). Whether the metadata columns on `x` (accessible with `mcols(x)` ) should be propagated to the returned object or not.
`...`	Additional arguments, for use in specific methods.
`ignore.strand`	`TRUE` or `FALSE` (the default). If set to `TRUE` , then the strand of `x` is set to `"*"` prior to any computation.
`with.revmap`	`TRUE` or `FALSE` (the default). If set to `TRUE` , then a `revmap` metadata column is added to the output of `summarizeJunctions` . This metadata column is an IntegerList object representing the mapping from each element in the ouput (i.e. each junction) to the corresponding elements in the input `x` .
`genome`	`NULL` (the default), or a BSgenome object containing the sequences of the reference genome that was used to align the reads, or the name of this reference genome specified in a way that is accepted by the `getBSgenome` function defined in the BSgenome software package. In that case the corresponding BSgenome data package needs to be already installed (see `?` in the BSgenome package for the details). If `genome` is supplied, then the `intron_motif` and `intron_strand` metadata columns are computed (based on the dinucleotides found at the intron boundaries) and added to the output of `summarizeJunctions` . See the Value section below for a description of these metadata columns.
`file`	The path (or a connection) to the junction file generated by the aligner. This file should be the junctions.bed or new_list.juncs file for `readTopHatJunctions` , and the SJ.out.tab file for `readSTARJunctions` .
`file.is.raw.juncs`	`TRUE` or `FALSE` (the default). If set to `TRUE` , then the input file is assumed to be a TopHat .juncs file instead of the junctions.bed file generated by TopHat. A TopHat .juncs file can be obtained by passing the junctions.bed file thru TopHat's bed_to_juncs script. See the TopHat manual at http://tophat.cbcb.umd.edu/manual.shtml for more information.

Details

An N operation in the CIGAR of a genomic alignment is interpreted as a junction. junctions(x) will return the genomic ranges of all junctions found in x .

More precisely, on a GAlignments object x , junctions(x) is equivalent to: list(" psetdiff(granges(x), grglist(x, order.as.in.query=TRUE)) ") On a GAlignmentPairs object x , it's equivalent to (but faster than): list(" mendoapply(c, junctions(first(x, real.strand=TRUE)), ", " junctions(last(x, real.strand=TRUE))) ") Note that starting with BioC 3.2, the behavior of junctions on a GAlignmentPairs object has been slightly modified so that the returned ranges now have the list("real strand") set on them. See the documentation of the real.strand argument in the man page of GAlignmentPairs objects for more information.

NATURAL_INTRON_MOTIFS is a predefined character vector containing the 5 natural intron motifs described at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC84117/ .

Value

junctions(x) returns the genomic ranges of the junctions in a GRangesList object parallel to x (i.e. with 1 list element per element in x ). If x has names on it, they're propagated to the returned object. If use.mcols is TRUE and x has metadata columns on it (accessible with mcols(x) ), they're propagated to the returned object.

summarizeJunctions returns the genomic ranges of the unique junctions in x in an unstranded GRanges object with the following metadata columns:

score : The total number of alignments crossing each junction, i.e. that have the junction encoded in their CIGAR.
plus_score and minus_score : The strand-specific number of alignments crossing each junction.
revmap : [Only if with.revmap was set to TRUE .] An IntegerList object representing the mapping from each element in the ouput (i.e. each junction) to the corresponding elements in input x .
intron_motif and intron_strand : [Only if genome was supplied.] The intron motif and strand for each junction, based on the dinucleotides found in the genome sequences at the intron boundaries. The intron_motif metadata column is a factor whose levels are the 5 natural intron motifs stored in predefined character vector NATURAL_INTRON_MOTIFS . If the dinucleotides found at the intron boundaries don't match any of these natural intron motifs, then intron_motif and intron_strand are set to NA and * , respectively.

readTopHatJunctions and readSTARJunctions return the junctions reported in the input file in a stranded GRanges object. With the following metadata columns for readTopHatJunctions (when reading in the junctions.bed file):
name : An id assigned by TopHat to each junction. This id is of the form JUNC00000017 and is unique within the junctions.bed file.
score : The total number of alignments crossing each junction.
With the following metadata columns for readSTARJunctions :
intron_motif and intron_strand : The intron motif and strand for each junction, based on the code found in the input file (0: non-canonical, 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT). Note that of the 5 natural intron motifs stored in predefined character vector NATURAL_INTRON_MOTIFS , only the first 3 are assigned codes by the STAR software (2 codes per motif, one if the intron is on the plus strand and one if it's on the minus strand). Thus the intron_motif metadata column is a factor with only 3 levels. If code is 0, then intron_motif and intron_strand are set to NA and * , respectively.
um_reads : The number of uniquely mapping reads crossing the junction (a pair where the 2 alignments cross the same junction is counted only once).
mm_reads : The number of multi-mapping reads crossing the junction (a pair where the 2 alignments cross the same junction is counted only once).
See STAR manual at https://code.google.com/p/rna-star/ for more information.

The readGAlignments and readGAlignmentPairs functions for reading genomic alignments from a BAM file.
GAlignments , GAlignmentPairs , and GAlignmentsList objects.
GRanges and GRangesList objects implemented and documented in the GenomicRanges package.
IntegerList objects implemented and documented in the IRanges package.
The getBSgenome function in the BSgenome package, for searching the installed BSgenome data packages for the specified genome and returning it as a BSgenome object.
The extractList function in the IRanges package, for extracting groups of elements from a vector-like object and returning them into a List object.

Author

Hervé Pagès

References

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC84117/ for the 5 natural intron motifs stored in predefined character vector NATURAL_INTRON_MOTIFS .

TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions

TopHat2 paper: http://genomebiology.com/2013/14/4/r36
TopHat2 software and manual: http://tophat.cbcb.umd.edu/

STAR: ultrafast universal RNA-seq aligner

STAR paper: http://bioinformatics.oxfordjournals.org/content/early/2012/10/25/bioinformatics.bts635
STAR software and manual: https://code.google.com/p/rna-star/

Examples

library(RNAseqData.HNRNPC.bam.chr14)
bamfile <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]

## ---------------------------------------------------------------------
## A. junctions()
## ---------------------------------------------------------------------

gal <- readGAlignments(bamfile)
table(njunc(gal))  # some alignments have 3 junctions!
juncs <- junctions(gal)
juncs

stopifnot(identical(unname(elementNROWS(juncs)), njunc(gal)))

galp <- readGAlignmentPairs(bamfile)
juncs <- junctions(galp)
juncs

stopifnot(identical(unname(elementNROWS(juncs)), njunc(galp)))

## ---------------------------------------------------------------------
## B. summarizeJunctions()
## ---------------------------------------------------------------------

## By default, only the "score", "plus_score", and "minus_score"
## metadata columns are returned:
junc_summary <- summarizeJunctions(gal)
junc_summary

## The "score" metadata column reports the total number of alignments
## crossing each junction, i.e., that have the junction encoded in their
## CIGAR:
median(mcols(junc_summary)$score)

## The "plus_score" and "minus_score" metadata columns report the
## strand-specific number of alignments crossing each junction:
stopifnot(identical(mcols(junc_summary)$score,
mcols(junc_summary)$plus_score +
mcols(junc_summary)$minus_score))

## If 'with.revmap' is TRUE, the "revmap" metadata column is added to
## the output. This metadata column is an IntegerList object represen-
## ting the mapping from each element in the ouput (i.e. a junction) to
## the corresponding elements in the input 'x'. Here we're going to use
## this to compute a 'score2' for each junction. We obtain this score
## by summing the mapping qualities of the alignments crossing the
## junction:
gal <- readGAlignments(bamfile, param=ScanBamParam(what="mapq"))
junc_summary <- summarizeJunctions(gal, with.revmap=TRUE)
junc_score2 <- sum(extractList(mcols(gal)$mapq,
mcols(junc_summary)$revmap))
mcols(junc_summary)$score2 <- junc_score2

## If the name of the reference genome is specified thru the 'genome'
## argument (in which case the corresponding BSgenome data package needs
## to be installed), then summarizeJunctions() returns the intron motif
## and strand for each junction.
## Since the reads in RNAseqData.HNRNPC.bam.chr14 were aligned to
## the hg19 genome, the following requires that you have
## BSgenome.Hsapiens.UCSC.hg19 installed:
junc_summary <- summarizeJunctions(gal, with.revmap=TRUE, genome="hg19")
mcols(junc_summary)$score2 <- junc_score2  # putting 'score2' back

## The "intron_motif" metadata column is a factor whose levels are the
## 5 natural intron motifs stored in predefined character vector
## 'NATURAL_INTRON_MOTIFS':
table(mcols(junc_summary)$intron_motif)

## ---------------------------------------------------------------------
## C. STRANDED RNA-seq PROTOCOL
## ---------------------------------------------------------------------

## Here is a simple test for checking whether the RNA-seq protocol was
## stranded or not:
strandedTest <- function(plus_score, minus_score)
(sum(plus_score ^ 2) + sum(minus_score ^ 2)) /
sum((plus_score + minus_score) ^ 2)

## The result of this test is guaranteed to be >= 0.5 and <= 1.
## If, for each junction, the strand of the crossing alignments looks
## random (i.e. "plus_score" and "minus_score" are close), then
## strandedTest() will return a value close to 0.5. If it doesn't look
## random (i.e. for each junction, one of "plus_score" and "minus_score"
## is much bigger than the other), then strandedTest() will return a
## value close to 1.

## If the reads are single-end, the test is meaningful when applied
## directly on 'junc_summary'. However, for the test to be meaningful
## on paired-end reads, it needs to be applied on the first and last
## alignments separately:
junc_summary1 <- summarizeJunctions(first(galp))
junc_summary2 <- summarizeJunctions(last(galp))
strandedTest(mcols(junc_summary1)$plus_score,
mcols(junc_summary1)$minus_score)
strandedTest(mcols(junc_summary2)$plus_score,
mcols(junc_summary2)$minus_score)
## Both values are close to 0.5 which suggests that the RNA-seq protocol
## used for this experiment was not stranded.

## ---------------------------------------------------------------------
## D. UTILITIES FOR IMPORTING THE JUNCTION FILE GENERATED BY SOME
##    ALIGNERS
## ---------------------------------------------------------------------

## The TopHat aligner generates a junctions.bed file where it reports
## all the junctions satisfying some "quality" criteria (see the TopHat
## manual at http://tophat.cbcb.umd.edu/manual.shtml for more
## information). This file can be loaded with readTopHatJunctions():
runname <- names(RNAseqData.HNRNPC.bam.chr14_BAMFILES)[1]
junctions_file <- system.file("extdata", "tophat2_out", runname,
"junctions.bed",
package="RNAseqData.HNRNPC.bam.chr14")
th_junctions <- readTopHatJunctions(junctions_file)

## Comparing the "TopHat junctions" with the result of
## summarizeJunctions():
th_junctions14 <- th_junctions
seqlevels(th_junctions14, pruning.mode="coarse") <- "chr14"
mcols(th_junctions14)$intron_strand <- strand(th_junctions14)
strand(th_junctions14) <- "*"

## All the "TopHat junctions" are in 'junc_summary':
stopifnot(all(th_junctions14 %in% junc_summary))

## But not all the junctions in 'junc_summary' are reported by TopHat
## (that's because TopHat reports only junctions that satisfy some
## "quality" criteria):
is_in_th_junctions14 <- junc_summary %in% th_junctions14
table(is_in_th_junctions14)  # 32 junctions are not in TopHat's
# junctions.bed file
junc_summary2 <- junc_summary[is_in_th_junctions14]

## 'junc_summary2' and 'th_junctions14' contain the same junctions in
## the same order:
stopifnot(all(junc_summary2 == th_junctions14))

## Let's merge their metadata columns. We use our own version of
## merge() for this, which is stricter (it checks that the common
## columns are the same in the 2 data frames to merge) and also
## simpler:
merge2 <- function(df1, df2)
{
common_colnames <- intersect(colnames(df1), colnames(df2))
lapply(common_colnames,
function(colname)
stopifnot(all(df1[ , colname] == df2[ , colname])))
extra_mcolnames <- setdiff(colnames(df2), colnames(df1))
cbind(df1, df2[ , extra_mcolnames, drop=FALSE])
}

mcols(th_junctions14) <- merge2(mcols(th_junctions14),
mcols(junc_summary2))

## Here is a peculiar junction reported by TopHat:
idx0 <- which(mcols(th_junctions14)$score2 == 0L)
th_junctions14[idx0]
gal[mcols(th_junctions14)$revmap[[idx0]]]
## The junction is crossed by 5 alignments (score is 5), all of which
## have a mapping quality of 0!

pileLettersAt()

Pile the letters of a set of aligned reads on top of a set of genomic positions

Description

pileLettersAt extracts the letters/nucleotides of a set of reads that align to a set of genomic positions of interest. The extracted letters are returned as "piles of letters" (one per genomic position of interest) stored in an XStringSet (typically DNAStringSet ) object.

Usage

pileLettersAt(x, seqnames, pos, cigar, at)

Arguments

Argument	Description
`x`	An XStringSet (typically DNAStringSet ) object containing N unaligned read sequences (a.k.a. the query sequences) reported with respect to the + strand.
`seqnames`	A factor- Rle parallel to `x` . For each `i` , `seqnames[i]` must be the name of the reference sequence of the i-th alignment.
`pos`	An integer vector parallel to `x` . For each `i` , `pos[i]` must be the 1-based position on the reference sequence of the first aligned letter in `x[[i]]` .
`cigar`	A character vector parallel to `x` . Contains the extended CIGAR strings of the alignments.
`at`	A GPos object containing the genomic positions of interest. `seqlevels(at)` must be identical to `levels(seqnames)` . If `at` is not a GPos object, `pileLettersAt` will first try to turn it into one by calling the `GPos` constructor function on it. So for example `at` can be a GRanges object (or any other GenomicRanges derivative), and, in that case, each range in it will be interpreted as a run of adjacent genomic positions. See `?` in the GenomicRanges package for more information.

Details

x , seqnames , pos , cigar must be 4 parallel vectors describing N aligned reads.

Value

An XStringSet (typically DNAStringSet ) object parallel to at (i.e. with 1 string per genomic position).

The pileup and applyPileups functions defined in the Rsamtools package, as well as the SAMtools mpileup command (available at http://samtools.sourceforge.net/ as part of the SAMtools project), for more powerful flexible alternatives.
The stackStringsFromBam function for stacking the read sequences (or their quality strings) stored in a BAM file on a region of interest.
DNAStringSet objects in the Biostrings package.
GPos objects in the GenomicRanges package.
GAlignments objects.
cigar-utils for the CIGAR utility functions used internally by pileLettersAt .

Author

Hervé Pagès

Examples

## Input

##   - A BAM file:
bamfile <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools"))
seqinfo(bamfile)  # to see the seqlevels and seqlengths
stackStringsFromBam(bamfile, param="seq1:1-21")  # a quick look at
# the reads

##   - A GPos object containing Genomic Positions Of Interest:
my_GPOI <- GPos(c("seq1:1-5", "seq1:21-21", "seq1:1575-1575",
"seq2:1513-1514"))

## Some preliminary massage on 'my_GPOI'

seqinfo(my_GPOI) <- merge(seqinfo(my_GPOI), seqinfo(bamfile))
seqlevels(my_GPOI) <- seqlevelsInUse(my_GPOI)

## Load the BAM file in a GAlignments object. Note that we load only
## the reads aligned to the sequences in 'seqlevels(my_GPOI)'. Also,
## in order to be consistent with applyPileups() and SAMtools (m)pileup,
## we filter out the following BAM records:
##   - secondary alignments (flag bit 0x100);
##   - reads not passing quality controls (flag bit 0x200);
##   - PCR or optical duplicates (flag bit 0x400).
## See ?ScanBamParam and the SAM Spec for more information.

which <- as(seqinfo(my_GPOI), "GRanges")
flag <- scanBamFlag(isSecondaryAlignment=FALSE,
isNotPassingQualityControls=FALSE,
isDuplicate=FALSE)
what <- c("seq", "qual")
param <- ScanBamParam(flag=flag, what=c("seq", "qual"), which=which)
gal <- readGAlignments(bamfile, param=param)
seqlevels(gal) <- seqlevels(my_GPOI)

## Extract the read sequences (a.k.a. query sequences) and quality
## strings. Both are reported with respect to the + strand.

qseq <- mcols(gal)$seq
qual <- mcols(gal)$qual

nucl_piles <- pileLettersAt(qseq, seqnames(gal), start(gal), cigar(gal),
my_GPOI)
qual_piles <- pileLettersAt(qual, seqnames(gal), start(gal), cigar(gal),
my_GPOI)
mcols(my_GPOI)$nucl_piles <- nucl_piles
mcols(my_GPOI)$qual_piles <- qual_piles
my_GPOI

## Finally, to summarize A/C/G/T frequencies at each position:
alphabetFrequency(nucl_piles, baseOnly=TRUE)

## Note that the pileup() function defined in the Rsamtools package
## can be used to obtain a similar result:
scanbam_param <- ScanBamParam(flag=flag, which=my_GPOI)
pileup_param <- PileupParam(max_depth=5000,
min_base_quality=0,
distinguish_strands=FALSE)
pileup(bamfile, scanBamParam=scanbam_param, pileupParam=pileup_param)

readGAlignments()

Reading genomic alignments from a file

Description

Read genomic alignments from a file (typically a BAM file) into a GAlignments , GAlignmentPairs , GAlignmentsList , or GappedReads object.

Usage

readGAlignments(file, index=file, use.names=FALSE, param=NULL,
                      with.which_label=FALSE)
readGAlignmentPairs(file, index=file, use.names=FALSE, param=NULL,
                          with.which_label=FALSE, strandMode=1)
readGAlignmentsList(file, index=file, use.names=FALSE,
                          param=ScanBamParam(), with.which_label=FALSE)
readGappedReads(file, index=file, use.names=FALSE, param=NULL,
                      with.which_label=FALSE)

Arguments

Argument	Description
`file`	The path to the file to read or a BamFile object. Can also be a BamViews object for `readGAlignments` .
`index`	The path to the index file of the BAM file to read. Must be given without the '.bai' extension. See `scanBam` in the Rsamtools packages for more information.
`use.names`	`TRUE` or `FALSE` . By default (i.e. `use.names=FALSE` ), the resulting object has no names. If `use.names` is `TRUE` , then the names are constructed from the query template names (QNAME field in a SAM/BAM file). Note that the 2 records in a pair (when using `readGAlignmentPairs` or the records in a group (when using `readGAlignmentsList` ) have the same QNAME.
`param`	`NULL` or a ScanBamParam object. Like for `scanBam` , this influences what fields and which records are imported. However, note that the fields specified thru this ScanBamParam object will be loaded in addition to any field required for generating the returned object ( GAlignments , GAlignmentPairs , or GappedReads object), but only the fields requested by the user will actually be kept as metadata columns of the object. By default (i.e. `param=NULL` or `param=ScanBamParam()` ), no additional field is loaded. The flag used is `scanBamFlag(isUnmappedQuery=FALSE)` for `readGAlignments` , `readGAlignmentsList` , and `readGappedReads` . (i.e. only records corresponding to mapped reads are loaded), and `scanBamFlag(isUnmappedQuery=FALSE, isPaired=TRUE,` for `readGAlignmentPairs` (i.e. only records corresponding to paired-end reads with both ends mapped are loaded).
`with.which_label`	`TRUE` or `FALSE` (the default). If `TRUE` and if `param` has a `which` component, a `"which_label"` metadata column is added to the returned GAlignments or GappedReads object, or to the `first` and `last` components of the returned GAlignmentPairs object. In the case of `readGAlignmentsList` , it's added as an inner metadata column, that is, the metadata column is placed on the GAlignments object obtained by unlisting the returned GAlignmentsList object. The purpose of this metadata column is to unambiguously identify the range in `which` where each element in the returned object originates from. The labels used to identify the ranges are normally of the form `"seq1:12250-246500"` , that is, they're the same as the names found on the outer list that `scanBam` would return if called with the same `param` argument. If some ranges are duplicated, then the labels are made unique by appending a unique suffix to all of them. The `"which_label"` metadata column is represented as a factor- Rle .
`strandMode`	Strand mode to set on the returned GAlignmentPairs object. See `?` for more information.

Details

readGAlignments reads a file containing aligned reads as a GAlignments object. See ? for a description of GAlignments objects. When file is a BamViews object, readGAlignments visits each path in bamPaths(file) , returning the result of readGAlignments applied to the specified path. When index is missing, it is set equal to bamIndicies(file) . Only reads in bamRanges(file) are returned (if param is supplied, bamRanges(file) takes precedence over bamWhich(param) ). The return value is a SimpleList object, with elements of the list corresponding to each path. bamSamples(file) is available as metadata columns (accessed with mcols ) of the returned SimpleList object.
readGAlignmentPairs reads a file containing aligned paired-end reads as a GAlignmentPairs object. See ? for a description of GAlignmentPairs objects.
readGAlignmentsList reads a file containing aligned reads as a GAlignmentsList object. See ? for a description of GAlignmentsList objects. readGAlignmentsList pairs records into mates according to the pairing criteria described below. The 1st mate will always be 1st in the GAlignmentsList list elements that have mate_status set to "mated" , and the 2nd mate will always be 2nd. A GAlignmentsList is returned with a mate_status metadata column on the outer list elements. mate_status is a factor with 3 levels indicating mate status, mated , ambiguous or unmated :
mated: primary or non-primary pairs
ambiguous: multiple segments matching to the same location (indistinguishable)
unmated: mate does not exist or is unmapped When the file argument is a BamFile, asMates=TRUE must be set, otherwise the data are treated as single-end reads. See the asMates section of ? in the Rsamtools package for details.
readGappedReads reads a file containing aligned reads as a GappedReads object. See ? for a description of GappedReads objects.

For all these functions, flags, tags and ranges may be specified in the supplied ScanBamParam object for fine tuning of results.

Value

A GAlignments object for readGAlignments .

A GAlignmentPairs object for readGAlignmentPairs . Note that a BAM (or SAM) file can in theory contain a mix of single-end and paired-end reads, but in practise it seems that single-end and paired-end are not mixed. In other words, the value of flag bit 0x1 ( isPaired ) is the same for all the records in a file. So if readGAlignmentPairs returns a GAlignmentPairs object of length zero, this almost always means that the BAM (or SAM) file contains alignments for single-end reads (although it could also mean that the user-supplied ScanBamParam is filtering out everything, or that the file is empty, or that all the records in the file correspond to unmapped reads).

A GAlignmentsList object for readGAlignmentsList . When the list contains paired-end reads a metadata data column of mate_status is added to the object. See details in the Bam specific back-ends' section on this man page. A [GappedReads](#gappedreads) object forreadGappedReads. ## Seealso * [scanBam](#scanbam) and [ScanBamParam](#scanbamparam) in the Rsamtools package. * [GAlignments](#galignments) , [GAlignmentPairs](#galignmentpairs) , [GAlignmentsList](#galignmentslist) , and [GappedReads](#gappedreads) objects. * [IRangesList](#irangeslist) objects (used in the examples below to specify thewhichregions) in the IRanges package. ## Note BAM records corresponding to unmapped reads are always ignored. Starting with Rsamtools 1.7.1 (BioC 2.10), PCR or optical duplicates are loaded by default (usescanBamFlag(isDuplicate=FALSE)` to drop them). ## Author Hervé Pagès hpages@fredhutch.org and Valerie Obenchain Valerie.Obenchain@RoswellPark.org ## Examples r ## --------------------------------------------------------------------- ## A. readGAlignments() ## --------------------------------------------------------------------- ## Simple use: bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) gal1 <- readGAlignments(bamfile) gal1 names(gal1) ## Using the 'use.names' arg: gal2 <- readGAlignments(bamfile, use.names=TRUE) gal2 head(names(gal2)) ## Using the 'param' arg to drop PCR or optical duplicates as well as ## secondary alignments, and to load additional BAM fields: param <- ScanBamParam(flag=scanBamFlag(isDuplicate=FALSE, isSecondaryAlignment=FALSE), what=c("qual", "flag")) gal3 <- readGAlignments(bamfile, param=param) gal3 mcols(gal3) ## Using the 'param' arg to load alignments from particular regions. which <- IRangesList(seq1=IRanges(1000, 1100), seq2=IRanges(c(1546, 1555, 1567), width=10)) param <- ScanBamParam(which=which) gal4 <- readGAlignments(bamfile, use.names=TRUE, param=param) gal4 ## IMPORTANT NOTE: A given record is loaded one time for each region ## it overlaps with. We call this "duplicated record selection" (this ## is a scanBam() feature, readGAlignments() is based on scanBam()): which <- IRangesList(seq2=IRanges(c(1555, 1567), width=10)) param <- ScanBamParam(which=which) gal5 <- readGAlignments(bamfile, use.names=TRUE, param=param) gal5 # record EAS114_26:7:37:79:581 was loaded twice ## This becomes clearer if we use 'with.which_label=TRUE' to identify ## the region in 'which' where each element in 'gal5' originates from. gal5 <- readGAlignments(bamfile, use.names=TRUE, param=param, with.which_label=TRUE) gal5 ## Not surprisingly, we also get "duplicated record selection" when ## 'which' contains repeated or overlapping regions. Using the same ## regions as we did for 'gal4' above, except that now we're ## repeating the region on seq1: which <- IRangesList(seq1=rep(IRanges(1000, 1100), 2), seq2=IRanges(c(1546, 1555, 1567), width=10)) param <- ScanBamParam(which=which) gal4b <- readGAlignments(bamfile, use.names=TRUE, param=param) length(gal4b) # > length(gal4), because all the records overlapping # with bases 1000 to 1100 on seq1 are now duplicated ## The "duplicated record selection" will artificially increase the ## coverage or affect other downstream results. It can be mitigated ## (but not completely eliminated) by first "reducing" the set of ## regions: which <- reduce(which) which param <- ScanBamParam(which=which) gal4c <- readGAlignments(bamfile, use.names=TRUE, param=param) length(gal4c) # < length(gal4), because the 2 first original regions # on seq2 were merged into a single one ## Note that reducing the set of regions didn't completely eliminate ## "duplicated record selection". Records that overlap the 2 reduced ## regions on seq2 (which$seq2) are loaded twice (like for 'gal5' ## above). See example D. below for how to completely eliminate ## "duplicated record selection". ## Using the 'param' arg to load tags. Except for MF and Aq, the tags ## specified below are predefined tags (see the SAM Spec for the list ## of predefined tags and their meaning). param <- ScanBamParam(tag=c("MF", "Aq", "NM", "UQ", "H0", "H1"), what="isize") gal6 <- readGAlignments(bamfile, param=param) mcols(gal6) # "tag" cols always after "what" cols ## With a BamViews object: fls <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) bv <- BamViews(fls, bamSamples=DataFrame(info="test", row.names="ex1"), auto.range=TRUE) ## Note that the "readGAlignments" method for BamViews objects ## requires the ShortRead package to be installed. aln <- readGAlignments(bv) aln aln[[1]] aln[colnames(bv)] mcols(aln) ## --------------------------------------------------------------------- ## B. readGAlignmentPairs() ## --------------------------------------------------------------------- galp1 <- readGAlignmentPairs(bamfile) head(galp1) names(galp1) ## Here we use the 'param' arg to filter by proper pair, drop PCR / ## optical duplicates, and drop secondary alignments. Filtering by ## proper pair and dropping secondary alignments can help make the ## pairing algorithm run significantly faster: param <- ScanBamParam(flag=scanBamFlag(isProperPair=TRUE, isDuplicate=FALSE, isSecondaryAlignment=FALSE)) galp2 <- readGAlignmentPairs(bamfile, use.names=TRUE, param=param) galp2 head(galp2) head(names(galp2)) ## --------------------------------------------------------------------- ## C. readGAlignmentsList() ## --------------------------------------------------------------------- library(pasillaBamSubset) ## 'file' as character. bam <- untreated3_chr4() galist1 <- readGAlignmentsList(bam) galist1[1:3] length(galist1) table(elementNROWS(galist1)) ## When 'file' is a BamFile, 'asMates' must be TRUE. If FALSE, ## the data are treated as single-end and each list element of the ## GAlignmentsList will be of length 1. For single-end data ## use readGAlignments(). bamfile <- BamFile(bam, yieldSize=3, asMates=TRUE) readGAlignmentsList(bamfile) ## Use a 'param' to fine tune the results. param <- ScanBamParam(flag=scanBamFlag(isProperPair=TRUE)) galist2 <- readGAlignmentsList(bam, param=param) length(galist2) ## --------------------------------------------------------------------- ## D. COMPARING 4 STRATEGIES FOR LOADING THE ALIGNMENTS THAT OVERLAP ## WITH THE EXONIC REGIONS ON FLY CHROMMOSOME 4 ## --------------------------------------------------------------------- library(pasillaBamSubset) bam <- untreated1_chr4() library(TxDb.Dmelanogaster.UCSC.dm3.ensGene) txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene ex <- exons(txdb) seqlevels(ex, pruning.mode="coarse") <- "chr4" length(ex) ## Some of the exons overlap with each other: isDisjoint(ex) # FALSE exonic_regions <- reduce(ex) isDisjoint(exonic_regions) # no more overlaps length(exonic_regions) ## Strategy #1: slow and loads a lot of records more than once (see ## "duplicated record selection" in example A. above). param1 <- ScanBamParam(which=ex) gal1 <- readGAlignments(bam, param=param1) length(gal1) # many "duplicated records" ## Strategy #2: faster and generates less duplicated records but ## doesn't eliminate them. param2 <- ScanBamParam(which=exonic_regions) gal2 <- readGAlignments(bam, param=param2) length(gal2) # less "duplicated records" ## Strategy #3: fast and completely eliminates duplicated records. gal0 <- readGAlignments(bam) gal3 <- subsetByOverlaps(gal0, exonic_regions, ignore.strand=TRUE) length(gal3) # no "duplicated records" ## Note that, in this case using 'exonic_regions' or 'ex' makes no ## difference: gal3b <- subsetByOverlaps(gal0, ex, ignore.strand=TRUE) stopifnot(identical(gal3, gal3b)) ## Strategy #4: strategy #3 however can require a lot of memory if the ## file is big because we load all the alignments into memory before we ## select those that overlap with the exonic regions. Strategy #4 ## addresses this by loading the file by chunks. bamfile <- BamFile(bam, yieldSize=50000) open(bamfile) while (length(chunk0 <- readGAlignments(bamfile))) { chunk <- subsetByOverlaps(chunk0, ex, ignore.strand=TRUE) cat("chunk0:", length(chunk0), "- chunk:", length(chunk), " ") ## ... do something with 'chunk' ... } close(bamfile) ## --------------------------------------------------------------------- ## E. readGappedReads() ## --------------------------------------------------------------------- greads1 <- readGappedReads(bamfile) greads1 names(greads1) qseq(greads1) greads2 <- readGappedReads(bamfile, use.names=TRUE) head(greads2) head(names(greads2))

sequenceLayer()

Lay read sequences alongside the reference space, using their CIGARs

Description

sequenceLayer can lay strings that belong to a given space (e.g. the "query" space) alongside another space (e.g. the "reference" space) by removing/injecting substrings from/into them, using the supplied CIGARs.

Its primary use case is to lay the read sequences stored in a BAM file (which are considered to belong to the "query" space) alongside the "reference" space. It can also be used to remove the parts of the read sequences that correspond to soft-clipping. More generally it can lay strings that belong to any supported space alongside any other supported space. See the Details section below for the list of supported spaces.

Usage

sequenceLayer(x, cigar, from="query", to="reference",
              D.letter="-", N.letter=".",
              I.letter="-", S.letter="+", H.letter="+")

Arguments

Argument	Description
`x`	An XStringSet object containing strings that belong to a given space.
`cigar`	A character vector or factor of the same length as `x` containing the extended CIGAR strings (one per element in `x` ).
`from, to`	A single string specifying one of the 8 supported spaces listed in the Details section below. `from` must be the current space (i.e. the space the strings in `x` belong to) and `to` is the space alonside which to lay the strings in `x` .
`D.letter, N.letter, I.letter, S.letter, H.letter`	A single letter used as a filler for injections. More on this in the Details section below.

Details

The 8 supported spaces are: "reference" , "reference-N-regions-removed" , "query" , "query-before-hard-clipping" , "query-after-soft-clipping" , "pairwise" , "pairwise-N-regions-removed" , and "pairwise-dense" .

Each space can be characterized by the extended CIGAR operations that are visible in it. A CIGAR operation is said to be visible in a given space if it "runs along it", that is, if it's associated with a block of contiguous positions in that space (the size of the block being the length of the operation). For example, the M/=/X operations are visible in all spaces, the D/N operations are visible in the "reference" space but not in the "query" space, the S operation is visible in the "query" space but not in the "reference" or in the "query-after-soft-clipping" space, etc...

Here are the extended CIGAR operations that are visible in each space:

reference: M, D, N, =, X
reference-N-regions-removed: M, D, =, X
query: M, I, S, =, X
query-before-hard-clipping: M, I, S, H, =, X
query-after-soft-clipping: M, I, =, X
pairwise: M, I, D, N, =, X
pairwise-N-regions-removed: M, I, D, =, X
pairwise-dense: M, =, X

sequenceLayer lays a string that belongs to one space alongside another by (1) removing the substrings associated with operations that are not visible anymore in the new space, and (2) injecting substrings associated with operations that become visible in the new space. Each injected substring has the length of the operation associated with it, and its content is controlled via the corresponding *.letter argument.

For example, when going from the "query" space to the "reference" space (the default), the I- and S-substrings (i.e. the substrings associated with I/S operations) are removed, and substrings associated with D/N operations are injected. More precisely, the D-substrings are filled with the letter specified in D.letter , and the N-substrings with the letter specified in N.letter . The other *.letter arguments are ignored in that case.

Value

An XStringSet object of the same class and length as x .

The stackStringsFromBam function for stacking the read sequences (or their quality strings) stored in a BAM file on a region of interest.
The readGAlignments function for loading read sequences from a BAM file (via a GAlignments object).
The extractAt and replaceAt functions in the Biostrings package for extracting/replacing arbitrary substrings from/in a string or set of strings.
cigar-utils for the CIGAR utility functions used internally by sequenceLayer .

Author

Hervé Pagès

Examples

## ---------------------------------------------------------------------
## A. FROM "query" TO "reference" SPACE
## ---------------------------------------------------------------------

## Load read sequences from a BAM file (they will be returned in a
## GAlignments object):
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
param <- ScanBamParam(what="seq")
gal <- readGAlignments(bamfile, param=param)
qseq <- mcols(gal)$seq  # the read sequences (aka query sequences)

## Lay the query sequences alongside the reference space. This will
## remove the substrings associated with insertions to the reference
## (I operations) and soft clipping (S operations), and will inject new
## substrings (filled with "-") where deletions from the reference (D
## operations) and skipped regions from the reference (N operations)
## occurred during the alignment process:
qseq_on_ref <- sequenceLayer(qseq, cigar(gal))

## A typical use case for doing the above is to compute 1 consensus
## sequence per chromosome. The code below shows how this can be done
## in 2 extra steps.

## Step 1: Compute one consensus matrix per chromosome.
qseq_on_ref_by_chrom <- splitAsList(qseq_on_ref, seqnames(gal))
pos_by_chrom <- splitAsList(start(gal), seqnames(gal))

cm_by_chrom <- lapply(names(pos_by_chrom),
function(seqname)
consensusMatrix(qseq_on_ref_by_chrom[[seqname]],
as.prob=TRUE,
shift=pos_by_chrom[[seqname]]-1,
width=seqlengths(gal)[[seqname]]))
names(cm_by_chrom) <- names(pos_by_chrom)

## 'cm_by_chrom' is a list of consensus matrices. Each matrix has 17
## rows (1 per letter in the DNA alphabet) and 1 column per chromosome
## position.

## Step 2: Compute the consensus string from each consensus matrix.
## We'll put "+" in the strings wherever there is no coverage for that
## position, and "N" where there is coverage but no consensus.
cs_by_chrom <- lapply(cm_by_chrom,
function(cm) {
## Because consensusString() doesn't like consensus matrices
## with columns that contain only zeroes (and you will have
## columns like that for chromosome positions that don't
## receive any coverage), we need to "fix" 'cm' first.
idx <- colSums(cm) == 0
cm["+", idx] <- 1
DNAString(consensusString(cm, ambiguityMap="N"))
})

## consensusString() provides some flexibility to let you extract
## the consensus in different ways. See '?consensusString' in the
## Biostrings package for the details.

## Finally, note that the read quality strings can also be used as
## input for sequenceLayer():
param <- ScanBamParam(what="qual")
gal <- readGAlignments(bamfile, param=param)
qual <- mcols(gal)$qual  # the read quality strings
qual_on_ref <- sequenceLayer(qual, cigar(gal))
## Note that since the "-" letter is a valid quality code, there is
## no way to distinguish it from the "-" letters inserted by
## sequenceLayer().

## ---------------------------------------------------------------------
## B. FROM "query" TO "query-after-soft-clipping" SPACE
## ---------------------------------------------------------------------

## Going from "query" to "query-after-soft-clipping" simply removes
## the substrings associated with soft clipping (S operations):
qseq <- DNAStringSet(c("AAAGTTCGAA", "TTACGATTAN", "GGATAATTTT"))
cigar <- c("3H10M", "2S7M1S2H", "2M1I1M3D2M4S")
clipped_qseq <- sequenceLayer(qseq, cigar,
from="query", to="query-after-soft-clipping")

sequenceLayer(clipped_qseq, cigar,
from="query-after-soft-clipping", to="query")

sequenceLayer(clipped_qseq, cigar,
from="query-after-soft-clipping", to="query",
S.letter="-")

## ---------------------------------------------------------------------
## C. BRING QUERY AND REFERENCE SEQUENCES TO THE "pairwise" or
##    "pairwise-dense" SPACE
## ---------------------------------------------------------------------

## Load read sequences from a BAM file:
library(RNAseqData.HNRNPC.bam.chr14)
bamfile <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
param <- ScanBamParam(what="seq",
which=GRanges("chr14", IRanges(1, 25000000)))
gal <- readGAlignments(bamfile, param=param)
qseq <- mcols(gal)$seq  # the read sequences (aka query sequences)

## Load the corresponding reference sequences from the appropriate
## BSgenome package (the reads in RNAseqData.HNRNPC.bam.chr14 were
## aligned to hg19):
library(BSgenome.Hsapiens.UCSC.hg19)
rseq <- getSeq(Hsapiens, as(gal, "GRanges"))  # the reference sequences

## Bring 'qseq' and 'rseq' to the "pairwise" space.
## For 'qseq', this will remove the substrings associated with soft
## clipping (S operations) and inject substrings (filled with "-")
## associated with deletions from the reference (D operations) and
## skipped regions from the reference (N operations). For 'rseq', this
## will inject substrings (filled with "-") associated with insertions
## to the reference (I operations).
qseq2 <- sequenceLayer(qseq, cigar(gal),
from="query", to="pairwise")
rseq2 <- sequenceLayer(rseq, cigar(gal),
from="reference", to="pairwise")

## Sanity check: 'qseq2' and 'rseq2' should have the same shape.
stopifnot(identical(elementNROWS(qseq2), elementNROWS(rseq2)))

## A closer look at reads with insertions and deletions:
cigar_op_table <- cigarOpTable(cigar(gal))
head(cigar_op_table)

I_idx <- which(cigar_op_table[ , "I"] >= 2)  # at least 2 insertions
qseq2[I_idx]
rseq2[I_idx]

D_idx <- which(cigar_op_table[ , "D"] >= 2)  # at least 2 deletions
qseq2[D_idx]
rseq2[D_idx]

## A closer look at reads with skipped regions:
N_idx <- which(cigar_op_table[ , "N"] != 0)
qseq2[N_idx]
rseq2[N_idx]

## A variant of the "pairwise" space is the "pairwise-dense" space.
## In that space, all indels and skipped regions are removed from 'qseq'
## and 'rseq'.
qseq3 <- sequenceLayer(qseq, cigar(gal),
from="query", to="pairwise-dense")
rseq3 <- sequenceLayer(rseq, cigar(gal),
from="reference", to="pairwise-dense")

## Sanity check: 'qseq3' and 'rseq3' should have the same shape.
stopifnot(identical(elementNROWS(qseq3), elementNROWS(rseq3)))

## Insertions were removed:
qseq3[I_idx]
rseq3[I_idx]

## Deletions were removed:
qseq3[D_idx]
rseq3[D_idx]

## Skipped regions were removed:
qseq3[N_idx]
rseq3[N_idx]

## ---------------------------------------------------------------------
## D. SANITY CHECKS
## ---------------------------------------------------------------------
SPACES <- c("reference",
"reference-N-regions-removed",
"query",
"query-before-hard-clipping",
"query-after-soft-clipping",
"pairwise",
"pairwise-N-regions-removed",
"pairwise-dense")

cigarWidth <- list(
function(cigar) cigarWidthAlongReferenceSpace(cigar),
function(cigar) cigarWidthAlongReferenceSpace(cigar,
N.regions.removed=TRUE),
function(cigar) cigarWidthAlongQuerySpace(cigar),
function(cigar) cigarWidthAlongQuerySpace(cigar,
before.hard.clipping=TRUE),
function(cigar) cigarWidthAlongQuerySpace(cigar,
after.soft.clipping=TRUE),
function(cigar) cigarWidthAlongPairwiseSpace(cigar),
function(cigar) cigarWidthAlongPairwiseSpace(cigar,
N.regions.removed=TRUE),
function(cigar) cigarWidthAlongPairwiseSpace(cigar, dense=TRUE)
)

cigar <- c("3H2S4M1D2M2I1M5N3M6H", "5M1I3M2D4M2S")

seq <- list(
BStringSet(c(A="AAAA-BBC.....DDD", B="AAAAABBB--CCCC")),
BStringSet(c(A="AAAA-BBCDDD", B="AAAAABBB--CCCC")),
BStringSet(c(A="++AAAABBiiCDDD", B="AAAAAiBBBCCCC++")),
BStringSet(c(A="+++++AAAABBiiCDDD++++++", B="AAAAAiBBBCCCC++")),
BStringSet(c(A="AAAABBiiCDDD", B="AAAAAiBBBCCCC")),
BStringSet(c(A="AAAA-BBiiC.....DDD", B="AAAAAiBBB--CCCC")),
BStringSet(c(A="AAAA-BBiiCDDD", B="AAAAAiBBB--CCCC")),
BStringSet(c(A="AAAABBCDDD", B="AAAAABBBCCCC"))
)

stopifnot(all(sapply(1:8,
function(i) identical(width(seq[[i]]), cigarWidth[[i]](cigar))
)))

sequenceLayer2 <- function(x, cigar, from, to)
sequenceLayer(x, cigar, from=from, to=to, I.letter="i")

identical_XStringSet <- function(target, current)
{
ok1 <- identical(class(target), class(current))
ok2 <- identical(names(target), names(current))
ok3 <- all(target == current)
ok1 && ok2 && ok3
}

res <- sapply(1:8, function(i) {
sapply(1:8, function(j) {
target <- seq[[j]]
current <- sequenceLayer2(seq[[i]], cigar,
from=SPACES[i], to=SPACES[j])
identical_XStringSet(target, current)
})
})
stopifnot(all(res))

setops_methods()

Set operations on GAlignments objects

Description

Performs set operations on GAlignments objects.

NOTE: The pintersect generic function and method for IntegerRanges objects is defined and documented in the IRanges package. Methods for GRanges and GRangesList objects are defined and documented in the GenomicRanges package.

Usage

list(list("pintersect"), list("GAlignments,GRanges"))(x, y, ...)
list(list("pintersect"), list("GRanges,GAlignments"))(x, y, ...)

Arguments

Argument	Description
`x, y`	A GAlignments object and a GRanges object. They must have the same length.
`...`	Further arguments to be passed to or from other methods.

Value

A GAlignments object parallel to (i.e. same length as) x and y .

Examples

## Parallel intersection of a GAlignments and a GRanges object:
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
gal <- readGAlignments(bamfile)
pintersect(gal, shift(as(gal, "GRanges"), 6L))

stackStringsFromBam()

Stack the read sequences stored in a BAM file on a region of interest

Description

stackStringsFromBam stacks the read sequences (or their quality strings) stored in a BAM file over a user-specified region.

alphabetFrequencyFromBam computes the alphabet frequency of the reads over a user-specified region.

Both functions take into account the CIGAR of each read to "lay" the read sequence (or its quality string) alongside the reference space. This step ensures that each nucleotide in a read is associated with the correct position on the reference sequence.

Usage

stackStringsFromBam(file, index=file, param,
                    what="seq", use.names=FALSE,
                    D.letter="-", N.letter=".",
                    Lpadding.letter="+", Rpadding.letter="+")
alphabetFrequencyFromBam(file, index=file, param, what="seq", ...)

Arguments

Argument	Description
`file, index`	The path to the BAM file to read, and to the index file of the BAM file to read, respectively. The latter is given without the '.bai' extension. See `scanBam` for more information.
`param`	A ScanBamParam object containing exactly 1 genomic region (i.e. `unlist(bamWhich(param))` must have length 1). Alternatively, `param` can be a GRanges or IntegerRangesList object containing exactly 1 genomic region (the strand will be ignored in case of a GRanges object), or a character string specifying a single genomic region (in the `"chr14:5201-5300"` format).
`what`	A single string. Either `"seq"` or `"qual"` . If `"seq"` (the default), the read sequences will be stacked. If `"qual"` , the read quality strings will be stacked.
`use.names`	Use the query template names (QNAME field) as the names of the returned object? If not (the default), then the returned object has no names.
`D.letter, N.letter`	A single letter used as a filler for injections. The 2 arguments are passed down to the `sequenceLayer` function. See `?` for more details.
`Lpadding.letter, Rpadding.letter`	A single letter to use for padding the sequences on the left, and another one to use for padding on the right. The 2 arguments are passed down to the `stackStrings` function defined in the Biostrings package. See `?` in the Biostrings package for more details.
`...`	Further arguments to be passed to alphabetFrequency .

Details

stackStringsFromBam performs the 3 following steps:

Load the read sequences (or their quality strings) from the BAM file. Only the read sequences that overlap with the specified region are loaded. This is done with the readGAlignments function. Note that if the file contains paired-end reads, the pairing is ignored.
Lay the sequences alongside the reference space, using their CIGARs. This is done with the sequenceLayer function.
Stack them on the specified region. This is done with the stackStrings function defined in the Biostrings package.

alphabetFrequencyFromBam also performs steps 1. and 2. but, instead of stacking the sequences at step 3., it computes the nucleotide frequencies for each genomic position in the specified region.

Value

For stackStringsFromBam : A rectangular (i.e. constant-width) DNAStringSet object (if what is "seq" ) or BStringSet object (if what is "qual" ).

For alphabetFrequencyFromBam : By default a matrix like one returned by alphabetFrequency . The matrix has 1 row per nucleotide position in the specified region.

The pileLettersAt function for piling the letters of a set of aligned reads on top of a set of genomic positions.
The readGAlignments function for loading read sequences (or their quality strings) from a BAM file (via a GAlignments object).
The sequenceLayer function for laying read sequences alongside the reference space, using their CIGARs.
The stackStrings function in the Biostrings package for stacking an arbitrary XStringSet object.
The alphabetFrequency function in the Biostrings package.
The SAMtools mpileup command available at http://samtools.sourceforge.net/ as part of the SAMtools project.

Note

TWO IMPORTANT CAVEATS ABOUT stackStringsFromBam :

Specifying a big genomic region, say >= 100000 bp, can require a lot of memory (especially with high coverage reads) and is not recommended. See the pileLettersAt function for piling the read letters on top of a set of genomic positions, which is more flexible and more memory efficient.

Paired-end reads are treated as single-end reads (i.e. they're not paired).

Author

Hervé Pagès

Examples

## ---------------------------------------------------------------------
## A. EXAMPLE WITH TOY DATA
## ---------------------------------------------------------------------

bamfile1 <- BamFile(system.file("extdata", "ex1.bam", package="Rsamtools"))

region1 <- GRanges("seq1", IRanges(1, 60))  # region of interest

## Stack the read sequences:
stackStringsFromBam(bamfile1, param=region1)

## Compute the "consensus matrix" (1 column per nucleotide position
## in the region of interest):
af <- alphabetFrequencyFromBam(bamfile1, param=region1, baseOnly=TRUE)
cm1a <- t(af[ , DNA_BASES])
cm1a

## Stack their quality strings:
stackStringsFromBam(bamfile1, param=region1, what="qual")

## Control the number of reads to display:
options(showHeadLines=18)
options(showTailLines=6)
stackStringsFromBam(bamfile1, param=GRanges("seq1", IRanges(61, 120)))

stacked_qseq <- stackStringsFromBam(bamfile1, param="seq2:1509-1519")
stacked_qseq  # deletion in read 13
af <- alphabetFrequencyFromBam(bamfile1, param="seq2:1509-1519",
baseOnly=TRUE)
cm1b <- t(af[ , DNA_BASES])  # consensus matrix
cm1b

## Sanity check:
stopifnot(identical(consensusMatrix(stacked_qseq)[DNA_BASES, ], cm1b))

stackStringsFromBam(bamfile1, param="seq2:1509-1519", what="qual")

## ---------------------------------------------------------------------
## B. EXAMPLE WITH REAL DATA
## ---------------------------------------------------------------------

library(RNAseqData.HNRNPC.bam.chr14)
bamfile2 <- BamFile(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])

## Region of interest:
region2 <- GRanges("chr14", IRanges(19650095, 19650159))

readGAlignments(bamfile2, param=ScanBamParam(which=region2))

stackStringsFromBam(bamfile2, param=region2)

af <- alphabetFrequencyFromBam(bamfile2, param=region2, baseOnly=TRUE)
cm2 <- t(af[ , DNA_BASES])  # consensus matrix
cm2

## ---------------------------------------------------------------------
## C. COMPUTE READ CONSENSUS SEQUENCE FOR REGION OF INTEREST
## ---------------------------------------------------------------------

## Let's write our own little naive function to go from consensus matrix
## to consensus sequence. For each nucleotide position in the region of
## interest (i.e. each column in the matrix), we select the letter with
## highest frequency. We also use special letter "*" at positions where
## there is a tie, and special letter "." at positions where all the
## frequencies are 0 (a particular type of tie):
cm_to_cs <- function(cm)
{
stopifnot(is.matrix(cm))
nr <- nrow(cm)
rnames <- rownames(cm)
stopifnot(!is.null(rnames) && all(nchar(rnames) == 1L))
selection <- apply(cm, 2,
function(x) {
i <- which.max(x)
if (x[i] == 0L)
return(nr + 1L)
if (sum(x == x[i]) != 1L)
return(nr + 2L)
i
})
paste0(c(rnames, ".", "*")[selection], collapse="")
}

cm_to_cs(cm1a)
cm_to_cs(cm1b)
cm_to_cs(cm2)

## Note that the consensus sequences we obtain are relative to the
## plus strand of the reference sequence.

summarizeOverlaps_methods()

Perform overlap queries between reads and genomic features

Description

summarizeOverlaps extends findOverlaps by providing options to resolve reads that overlap multiple features.

Usage

list(list("summarizeOverlaps"), list("GRanges,GAlignments"))(
  features, reads, mode=Union,
  ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRangesList,GAlignments"))(
  features, reads, mode=Union,
  ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRanges,GRanges"))(
  features, reads, mode=Union,
  ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRangesList,GRanges"))(
  features, reads, mode=Union,
  ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRanges,GAlignmentPairs"))(
  features, reads, mode=Union,
  ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("GRangesList,GAlignmentPairs"))(
  features, reads, mode=Union,
  ignore.strand=FALSE, inter.feature=TRUE, preprocess.reads=NULL, ...)
## mode funtions
Union(features, reads, ignore.strand=FALSE,
                       inter.feature=TRUE)
IntersectionStrict(features, reads, ignore.strand=FALSE,
                                    inter.feature=TRUE)
IntersectionNotEmpty(features, reads, ignore.strand=FALSE,
                                      inter.feature=TRUE)
list(list("summarizeOverlaps"), list("GRanges,BamFile"))(
  features, reads, mode=Union,
  ignore.strand=FALSE, inter.feature=TRUE, singleEnd=TRUE,
  fragments=FALSE, param=ScanBamParam(), preprocess.reads=NULL, ...)
list(list("summarizeOverlaps"), list("BamViews,missing"))(
  features, reads, mode=Union,
  ignore.strand=FALSE, inter.feature=TRUE, singleEnd=TRUE,
  fragments=FALSE, param=ScanBamParam(), preprocess.reads=NULL, ...)

Arguments

Argument	Description
`features`	A GRanges or a GRangesList object of genomic regions of interest. When a GRanges is supplied, each row is considered a feature. When a GRangesList is supplied, each higher list-level is considered a feature. This distinction is important when defining overlaps. When `features` is a BamViews the `reads` argument is missing. Features are extracted from the `bamRanges` and the `reads` from `bamPaths` . Metadata from `bamPaths` and `bamSamples` are stored in the `colData` of the resulting RangedSummarizedExperiment object. `bamExperiment` metadata are stored in the `metadata` slot.
`reads`	A GRanges , GRangesList GAlignments , GAlignmentsList , GAlignmentPairs , BamViews or BamFileList object that represents the data to be counted by `summarizeOverlaps` . `reads` is missing when a BamViews object is the only argument supplied to `summarizeOverlaps` . `reads` are the files specified in `bamPaths` of the BamViews object.
`mode`	`mode` can be one of the pre-defined count methods such as "Union", "IntersectionStrict", or "IntersectionNotEmpty" or it a user supplied count function. For a custom count function, the input arguments must match those of the pre-defined options and the function must return a vector of counts the same length as the annotation ('features' argument). See examples for details. The pre-defined options are designed after the counting modes available in the HTSeq package by Simon Anders (see references).

"Union" : (Default) Reads that overlap any portion of exactly one feature are counted. Reads that overlap multiple features are discarded. This is the most conservative of the 3 modes.
"IntersectionStrict" : A read must fall completely "within" the feature to be counted. If a read overlaps multiple features but falls "within" only one, the read is counted for that feature. If the read is "within" multiple features, the read is discarded.
"IntersectionNotEmpty" : A read must fall in a unique disjoint region of a feature to be counted. When a read overlaps multiple features, the features are partitioned into disjoint intervals. Regions that are shared between the features are discarded leaving only the unique disjoint regions. If the read overlaps one of these remaining regions, it is assigned to the feature the unique disjoint region came from.
user supplied function : A function can be supplied as the mode argument. It must (1) have arguments that correspond to features , reads , ignore.strand and inter.feature arguments (as in the defined mode functions) and (2) return a vector of counts the same length as features .
|ignore.strand | A logical indicating if strand should be considered when matching. | |inter.feature | (Default TRUE) A logical indicating if the counting mode should be aware of overlapping features. When TRUE (default), reads mapping to multiple features are dropped (i.e., not counted). When FALSE, these reads are retained and a count is assigned to each feature they map to. There are 6 possible combinations of the mode and inter.feature arguments. When inter.feature=FALSE the behavior of modes Union and IntersectionStrict are essentially countOverlaps with type=any and type=within , respectively. IntersectionNotEmpty does not reduce to a simple countOverlaps because common (shared) regions of the annotation are removed before counting. | |preprocess.reads | A function applied to the reads before counting. The first argument should be reads and the return value should be an object compatible with the reads argument to the counting modes, Union, IntersectionStrict and IntersectionNotEmpty. The distinction between a user-defined 'mode' and user-defined 'preprocess.reads' function is that in the first case the user defines how to count; in the second case the reads are preprocessed before counting with a pre-defined mode. See examples. | |... | Additional arguments passed to functions or methods called from within summarizeOverlaps . For BAM file methods arguments may include singleEnd , fragments or param which apply to reading records from a file (see below). Providing count.mapped.reads=TRUE include additional passes through the BAM file to collect statistics similar to those from countBam . A BPPARAM argument can be passed down to the bplapply called by summarizeOverlaps . The argument can be MulticoreParam(), SnowParam(), BatchJobsParam() or DoparParam(). See the BiocParallel package for details in specifying the params. | |singleEnd | (Default TRUE) A logical indicating if reads are single or paired-end. In Bioconductor > 2.12 it is not necessary to sort paired-end BAM files by qname . When counting with summarizeOverlaps , setting singleEnd=FALSE will trigger paired-end reading and counting. It is fine to also set asMates=TRUE in the BamFile but is not necessary when singleEnd=FALSE . | |fragments | (Default FALSE) A logical; applied to paired-end data only. fragments controls which function is used to read the data which subsequently affects which records are included in counting. When fragments=FALSE , data are read with readGAlignmentPairs and returned in a GAlignmentPairs class. In this case, singletons, reads with unmapped pairs, and other fragments, are dropped. When fragments=TRUE , data are read with readGAlignmentsList and returned in a GAlignmentsList class. This class holds mated pairs as well as same-strand pairs, singletons, reads with unmapped pairs and other fragments. Because more records are kept, generally counts will be higher when fragments=TRUE . The term mated pairs refers to records paired with the algorithm described on the ? man page. | |param | An optional ScanBamParam instance to further influence scanning, counting, or filtering. See ? for details of how records are returned when both yieldSize is specified in a BamFile and which is defined in a ScanBamParam . |

Details

list(" ", " ", list(list(), list(list("summarizeOverlaps"), " offers counting modes to resolve reads that ", " overlap multiple features. The ", list("mode"), " argument defines a set of rules ", " to resolve the read to a single feature such that each read is counted a ", " maximum of once. New to GenomicRanges >= 1.13.9 is the ", " ", list("inter.feature"), " argument which allows reads to be counted for each ", " feature they overlap. When ", list("inter.feature=TRUE"),

" the counting modes

", " are aware of feature overlap; reads that overlap multiple features are ", " dropped and not counted. When ", list("inter.feature=FALSE"), " multiple feature ", " overlap is ignored and reads are counted once for each feature they map ", " to. This essentially reduces modes ", list("Union"), " and ", " ", list("IntersectionStrict"), " to ", list("countOverlaps"), " with ", " ", list("type="any""), ", and ", list("type="within""),

", respectively.

", " ", list("IntersectionNotEmpty"), " is not reduced to a derivative of ", " ", list("countOverlaps"), " because the shared regions are removed before ", " counting. ", " ", " The ", list("BamViews"), ", ", list("BamFile"), " and ", list("BamFileList"), " methods ", " summarize overlaps across one or several files. The latter uses ", " ", list("bplapply"), "; control parallel evaluation using the ", " ", list(list("register")), " interface in the ",

list("BiocParallel"), " package.

", " ")), " ", " ", list(list("features :"), list(" ", " A ", list("feature"), " can be any portion of a genomic region such as a gene, ", " transcript, exon etc. When the ", list("features"), " argument is a ", " ", list("GRanges"), " the rows define the features. The result ", " will be the same length as the ", list("GRanges"), ". When ", " ", list("features"), " is a ", list("GRangesList"), " the highest ", " list-level defines the features and the result will be the same length ",

"      as the ", list("GRangesList"), ".

", " ", " When ", list("inter.feature=TRUE"), ", each count ", list("mode"), " attempts to ", " assign a read that overlaps multiple features to a single feature. If ", " there are ranges that should be considered together (e.g., exons by ", " transcript or cds regions by gene) the ", list("GRangesList"), " ", " would be appropriate. If there is no grouping in the data then a ", " ", list("GRanges"), " would be appropriate. ",

"    ")), "

", " ", list(list("paired-end reads :"), list(" ", " Paired-end reads are counted as a single hit if one or both parts ", " of the pair are overlapped. Paired-end records can be counted in ", " a ", list("GAlignmentPairs"), " container or BAM file. ", " ", " Counting pairs in BAM files: ", " ", list(" ", " ", list(), list("The ", list("singleEnd"), " argument should be FALSE."), " ", " ", list(), list("When ", list("reads"), " are supplied as a BamFile or BamFileList, ",

"              the ", list("asMates"), " argument to the BamFile should be TRUE."), "

", " ", list(), list("When ", list("fragments"), " is FALSE, a ", list("GAlignmentPairs"), " ", " object is used in counting (pairs only)."), " ", " ", list(), list("When ", list("fragments"), " is TRUE, a ", list("GAlignmentsList"), " ", " object is used in counting (pairs, singletons, unmapped ", " mates, etc.)"), " ", " "), " ", " ")), " ",

"  ")

Value

A RangedSummarizedExperiment object. The assays slot holds the counts, rowRanges holds the annotation from features .

When reads is a BamFile or BamFileList colData is an empty DataFrame with a single row named counts . If count.mapped.reads=TRUE , colData holds the output of countBam in 3 columns named records (total records), nucleotides and mapped (mapped records).

When features is a BamViews colData includes 2 columns named bamSamples and bamIndices .

In all other cases, colData has columns of object (class of reads) and records (length of reads ).

Author

Valerie Obenchain Valerie.Obenchain@RoswellPark.org

References

HTSeq : http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html

htseq-count : http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

Examples

reads <- GAlignments(
names = c("a","b","c","d","e","f","g"),
seqnames = Rle(c(rep(c("chr1", "chr2"), 3), "chr1")),
pos = as.integer(c(1400, 2700, 3400, 7100, 4000, 3100, 5200)),
cigar = c("500M", "100M", "300M", "500M", "300M",
"50M200N50M", "50M150N50M"),
strand = strand(rep("+", 7)))

gr <- GRanges(
seqnames = c(rep("chr1", 7), rep("chr2", 4)), strand = "+",
ranges = IRanges(c(1000, 3000, 3600, 4000, 4000, 5000, 5400,
2000, 3000, 7000, 7500),
width = c(500, 500, 300, 500, 900, 500, 500,
900, 500, 600, 300),
names=c("A", "B", "C1", "C2", "D1", "D2", "E", "F",
"G", "H1", "H2")))
groups <- factor(c(1,2,3,3,4,4,5,6,7,8,8))
grl <- splitAsList(gr, groups)
names(grl) <- LETTERS[seq_along(grl)]

## ---------------------------------------------------------------------
## Counting modes.
## ---------------------------------------------------------------------

## First count with a GRanges as the 'features'. 'Union' is the
## most conservative counting mode followed by 'IntersectionStrict'
## then 'IntersectionNotEmpty'.
counts1 <-
data.frame(union=assays(summarizeOverlaps(gr, reads))$counts,
intStrict=assays(summarizeOverlaps(gr, reads,
mode="IntersectionStrict"))$counts,
intNotEmpty=assays(summarizeOverlaps(gr, reads,
mode="IntersectionNotEmpty"))$counts)

colSums(counts1)

## Split the 'features' into a GRangesList and count again.
counts2 <-
data.frame(union=assays(summarizeOverlaps(grl, reads))$counts,
intStrict=assays(summarizeOverlaps(grl, reads,
mode="IntersectionStrict"))$counts,
intNotEmpty=assays(summarizeOverlaps(grl, reads,
mode="IntersectionNotEmpty"))$counts)
colSums(counts2)

## The GRangesList ('grl' object) has 8 features whereas the GRanges
## ('gr' object) has 11. The affect on counting can be seen by looking
## at feature 'H' with mode 'Union'. In the GRanges this feature is
## represented by ranges 'H1' and 'H2',
gr[c("H1", "H2")]

## and by list element 'H' in the GRangesList,
grl["H"]

## Read "d" hits both 'H1' and 'H2'. This is considered a multi-hit when
## using a GRanges (each range is a separate feature) so the read was
## dropped and not counted.
counts1[c("H1", "H2"), ]

## When using a GRangesList, each list element is considered a feature.
## The read hits multiple ranges within list element 'H' but only one
## list element. This is not considered a multi-hit so the read is counted.
counts2["H", ]

## ---------------------------------------------------------------------
## Counting multi-hit reads.
## ---------------------------------------------------------------------

## The goal of the counting modes is to provide a set of rules that
## resolve reads hitting multiple features so each read is counted
## a maximum of once. However, sometimes it may be desirable to count
## a read for each feature it overlaps. This can be accomplished by
## setting 'inter.feature' to FALSE.

## When 'inter.feature=FALSE', modes 'Union' and 'IntersectionStrict'
## essentially reduce to countOverlaps() with type="any" and
## type="within", respectively.

## When 'inter.feature=TRUE' only features "A", "F" and "G" have counts.
se1 <- summarizeOverlaps(gr, reads, mode="Union", inter.feature=TRUE)
assays(se1)$counts

## When 'inter.feature=FALSE' all 11 features have a count. There are
## 7 total reads so one or more reads were counted more than once.
se2 <- summarizeOverlaps(gr, reads, mode="Union", inter.feature=FALSE)
assays(se2)$counts

## ---------------------------------------------------------------------
## Counting BAM files.
## ---------------------------------------------------------------------

library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
exbygene <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, "gene")

## (i) Single-end :

## Large files can be iterated over in chunks by setting a
## 'yieldSize' on the BamFile.
bf_s <- BamFile(untreated1_chr4(), yieldSize=50000)
se_s <- summarizeOverlaps(exbygene, bf_s, singleEnd=TRUE)
table(assays(se_s)$counts > 0)

## When a character (file name) is provided as 'reads' instead
## of a BamFile object summarizeOverlaps() will create a BamFile
## and set a reasonable default 'yieldSize'.

## (ii) Paired-end :

## A paired-end file may contain singletons, reads with unmapped
## pairs or reads with more than two fragments. When 'fragments=FALSE'
## only reads paired by the algorithm are included in the counting.
nofrag <- summarizeOverlaps(exbygene, untreated3_chr4(),
singleEnd=FALSE, fragments=FALSE)
table(assays(nofrag)$counts > 0)

## When 'fragments=TRUE' all singletons, reads with unmapped pairs
## and other fragments will be included in the counting.
bf <- BamFile(untreated3_chr4(), asMates=TRUE)
frag <- summarizeOverlaps(exbygene, bf, singleEnd=FALSE, fragments=TRUE)
table(assays(frag)$counts > 0)

## As expected, using 'fragments=TRUE' results in a larger number
## of total counts because singletons, unmapped pairs etc. are
## included in the counting.

## Total reads in the file:
countBam(untreated3_chr4())

## Reads counted with 'fragments=FALSE':
sum(assays(nofrag)$counts)

## Reads counted with 'fragments=TRUE':
sum(assays(frag)$counts)

## ---------------------------------------------------------------------
## Use ouput of summarizeOverlaps() for differential expression analysis
## with DESeq2 or edgeR.
## ---------------------------------------------------------------------

fls <- list.files(system.file("extdata", package="GenomicAlignments"),
recursive=TRUE, pattern="*bam$", full=TRUE)
names(fls) <- basename(fls)
bf <- BamFileList(fls, index=character(), yieldSize=1000)
genes <- GRanges(
seqnames = c(rep("chr2L", 4), rep("chr2R", 5), rep("chr3L", 2)),
ranges = IRanges(c(1000, 3000, 4000, 7000, 2000, 3000, 3600,
4000, 7500, 5000, 5400),
width=c(rep(500, 3), 600, 900, 500, 300, 900,
300, 500, 500)))
se <- summarizeOverlaps(genes, bf)

## When the reads are BAM files, the 'colData' contains summary
## information from a call to countBam().
colData(se)

## Start differential expression analysis with the DESeq2 or edgeR
## package:
library(DESeq2)
deseq <- DESeqDataSet(se, design= ~ 1)
library(edgeR)
edger <- DGEList(assays(se)$counts, group=rownames(colData(se)))

## ---------------------------------------------------------------------
## Filter records by map quality before counting.
## (user-supplied 'mode' function)
## ---------------------------------------------------------------------

## The 'mode' argument can take a custom count function whose
## arguments are the same as those in the current counting modes
## (i.e., Union, IntersectionNotEmpty, IntersectionStrict).
## In this example records are filtered by map quality before counting.

mapq_filter <- function(features, reads, ignore.strand, inter.feature)
{
require(GenomicAlignments) # needed for parallel evaluation
Union(features, reads[mcols(reads)$mapq >= 20],
ignore.strand, inter.feature)
}

genes <- GRanges("seq1", IRanges(seq(1, 1500, by=200), width=100))
param <- ScanBamParam(what="mapq")
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
se <- summarizeOverlaps(genes, fl, mode=mapq_filter, param=param)
assays(se)$counts

## The count function can be completely custom (i.e., not use the
## pre-defined count functions at all). Requirements are that
## the input arguments match the pre-defined modes and the output
## is a vector of counts the same length as 'features'.

my_count <- function(features, reads,  ignore.strand, inter.feature) {
## perform filtering, or subsetting etc.
require(GenomicAlignments) # needed for parallel evaluation
countOverlaps(features, reads)
}

## ---------------------------------------------------------------------
## Preprocessing reads before counting with a standard count mode.
## (user-supplied 'preprocess.reads' function)
## ---------------------------------------------------------------------

## The 'preprocess.reads' argument takes a function that is
## applied to the reads before counting with a pre-defined mode.

ResizeReads <- function(reads, width=1, fix="start", ...) {
reads <- as(reads, "GRanges")
stopifnot(all(strand(reads) != "*"))
resize(reads, width=width, fix=fix, ...)
}

## By default ResizeReads() counts reads that overlap on the 5' end:
summarizeOverlaps(grl, reads, mode=Union, preprocess.reads=ResizeReads)

## Count reads that overlap on the 3' end by passing new values
## for 'width' and 'fix':
summarizeOverlaps(grl, reads, mode=Union, preprocess.reads=ResizeReads,
width=1, fix="end")

## ---------------------------------------------------------------------
## summarizeOverlaps() with BamViews.
## ---------------------------------------------------------------------

## bamSamples and bamPaths metadata are included in the colData.
## bamExperiment metadata is put into the metadata slot.
fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
rngs <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584)))
samp <- DataFrame(info="test", row.names="ex1")
view <- BamViews(fl, bamSamples=samp, bamRanges=rngs)
se <- summarizeOverlaps(view, mode=Union, ignore.strand=TRUE)
colData(se)
metadata(se)