bioconductor v3.9.0 Rsamtools
This package provides an interface to the 'samtools',
Link to this section Summary
Functions
Parameters for creating pileups from BAM files
Maintain and use BAM files
Views into a set of BAM files
Manipulate BCF files.
Manipulate indexed fasta files.
Represent BAM files for pileup summaries.
A base class for managing lists of Rsamtools file references
A base class for managing file references in Rsamtools
'samtools' aligned sequence utilities interface
Parameters for scanning BAM files
Parameters for scanning BCF files
Manipulate tabix indexed tab-delimited files.
Apply a user-provided function to calculate pile-up statistics across multiple BAM files.
Rsamtools Deprecated and Defunct
Deprecated functions
Retrieve sequence names defined in a tabix file.
Compress and index tabix-compatible files.
Use filters and output formats to calculate pile-up statistics for a BAM file.
Group the records of a BAM file based on their flag bits and count the number of records in each group
Import samtools 'pileup' files.
Import, count, index, filter, sort, and merge BAM' (binary alignment) files. ## Description Import binary
BAM' files into a list structure, with facilities for
selecting what fields and which records are imported, and other
operations to manipulate BAM files.
Operations on BCF' files. ## Description Import, coerce, or index variant call files in text or binary format. ## Usage ```r scanBcfHeader(file, ...) list(list("scanBcfHeader"), list("character"))(file, ...) scanBcf(file, ...) list(list("scanBcf"), list("character"))(file, index = file, ..., param=ScanBcfParam()) asBcf(file, dictionary, destination, ..., overwrite=FALSE, indexDestination=TRUE) list(list("asBcf"), list("character"))(file, dictionary, destination, ..., overwrite=FALSE, indexDestination=TRUE) indexBcf(file, ...) list(list("indexBcf"), list("character"))(file, ...) ``` ## Arguments |Argument |Description| |------------- |----------------| |
file| For
scanBcfand
scanBcfHeader, the character() file name of the BCF file to be processed, or an instance of class [
BcfFile](#bcffile) .| |
index| The character() file name(s) of the
BCF' index to be processed.|
|dictionary
| a character vector of the unique CHROM names in the VCF file.|
|destination
| The character(1) file name of the location where the BCF output file will be created. For asBcf
this is without the .bcf file suffix.|
|param
| A instance of ScanBcfParam influencing which records are parsed and the INFO and GENO information returned.|
|...
| Additional arguments, e.g., for scanBcfHeader,character-method
, mode
of BcfFile
.|
|overwrite
| A logical(1) indicating whether the destination can be over-written if it already exists.|
|indexDestination
| A logical(1) indicating whether the created destination file should also be indexed.|
Operations on indexed 'fasta' files.
Operations on tabix' (indexed, tab-delimited) files. ## Description Scan compressed, sorted, tabix-indexed, tab-delimited files. ## Usage ```r scanTabix(file, ..., param) list(list("scanTabix"), list("character,IntegerRangesList"))(file, ..., param) list(list("scanTabix"), list("character,GRanges"))(file, ..., param) ``` ## Arguments |Argument |Description| |------------- |----------------| |
file| The character() file name(s) of the tabix file be processed, or more flexibly an instance of class [
TabixFile](#tabixfile) .| |
param| A instance of
GRangesor
IntegerRangesListproviding the sequence names and regions to be parsed.| |
...| Additional arguments, currently ignored.| ## Value
scanTabix` returns a list, with one element per region. Each element
of the list is a character vector representing records in the region.
## Author
Martin Morgan mtmorgan@fhcrc.org.
## References
http://samtools.sourceforge.net/tabix.shtml
## Examples
r example(TabixFile)
Retrieve sequence names defined in a tabix file.
Quickly test if a BAM file has paired end reads
File compression for tabix (bgzip) and fasta (razip) files.
Link to this section Functions
ApplyPileupsParam_class()
Parameters for creating pileups from BAM files
Description
Use ApplyPileupsParam()
to create a parameter object influencing
what fields and which records are used to calculate pile-ups, and to
influence the values returned.
Usage
# Constructor
ApplyPileupsParam(flag = scanBamFlag(),
minBaseQuality = 13L, minMapQuality = 0L,
minDepth = 0L, maxDepth = 250L,
yieldSize = 1L, yieldBy = c("range", "position"), yieldAll = FALSE,
which = GRanges(), what = c("seq", "qual"))
# Accessors
plpFlag(object)
plpFlag(object) <- value
plpMaxDepth(object)
plpMaxDepth(object) <- value
plpMinBaseQuality(object)
plpMinBaseQuality(object) <- value
plpMinDepth(object)
plpMinDepth(object) <- value
plpMinMapQuality(object)
plpMinMapQuality(object) <- value
plpWhat(object)
plpWhat(object) <- value
plpWhich(object)
plpWhich(object) <- value
plpYieldAll(object)
plpYieldAll(object) <- value
plpYieldBy(object)
plpYieldBy(object) <- value
plpYieldSize(object)
plpYieldSize(object) <- value
list(list("show"), list("ApplyPileupsParam"))(object)
Arguments
Argument | Description |
---|---|
flag | An instance of the object returned by scanBamFlag , restricting various aspects of reads to be included or excluded. |
minBaseQuality | The minimum read base quality below which the base is ignored when summarizing pileup information. |
minMapQuality | The minimum mapping quality below which the entire read is ignored. |
minDepth | The minimum depth of the pile-up below which the position is ignored. |
maxDepth | The maximum depth of reads considered at any position; this can be used to limit memory consumption. |
yieldSize | The number of records to include in each call to FUN . |
yieldBy | How records are to be counted. By range (in which case yieldSize must equal 1) means that FUN is invoked once for each range in which . By position means that FUN is invoked whenever pile-ups have been accumulated for yieldSize positions, regardless of ranges in which . |
yieldAll | Whether to report all positions ( yieldAll=TRUE ), or just those passing the filtering criteria of flag , minBaseQuality , etc. When yieldAll=TRUE , positions not passing filter criteria have 0 entries in seq or qual . |
which | A GRanges or IntegerRangesList instance restricting pileup calculations to the corresponding genomic locations. |
what | A character() instance indicating what values are to be returned. One or more of c("seq", "qual") . |
object | An instace of class ApplyPileupsParam . |
value | An instance to be assigned to the corresponding slot of the ApplyPileupsParam instance. |
Seealso
Author
Martin Morgan
Examples
example(applyPileups)
BamFile_class()
Maintain and use BAM files
Description
Use BamFile()
to create a reference to a BAM file (and
optionally its index). The reference remains open across calls to
methods, avoiding costly index re-loading.
BamFileList()
provides a convenient way of managing a list of
BamFile
instances.
Usage
## Constructors
BamFile(file, index=file, ..., yieldSize=NA_integer_, obeyQname=FALSE,
asMates=FALSE, qnamePrefixEnd=NA, qnameSuffixStart=NA)
BamFileList(..., yieldSize=NA_integer_, obeyQname=FALSE, asMates=FALSE,
qnamePrefixEnd=NA, qnameSuffixStart=NA)
## Opening / closing
list(list("open"), list("BamFile"))(con, ...)
list(list("close"), list("BamFile"))(con, ...)
## accessors; also path(), index(), yieldSize()
list(list("isOpen"), list("BamFile"))(con, rw="")
list(list("isIncomplete"), list("BamFile"))(con)
list(list("obeyQname"), list("BamFile"))(object, ...)
obeyQname(object, ...) <- value
list(list("asMates"), list("BamFile"))(object, ...)
asMates(object, ...) <- value
list(list("qnamePrefixEnd"), list("BamFile"))(object, ...)
qnamePrefixEnd(object, ...) <- value
list(list("qnameSuffixStart"), list("BamFile"))(object, ...)
qnameSuffixStart(object, ...) <- value
## actions
list(list("scanBamHeader"), list("BamFile"))(files, ..., what=c("targets", "text"))
list(list("seqinfo"), list("BamFile"))(x)
list(list("seqinfo"), list("BamFileList"))(x)
list(list("filterBam"), list("BamFile"))(file, destination, index=file, ...,
filter=FilterRules(), indexDestination=TRUE,
param=ScanBamParam(what=scanBamWhat()))
list(list("indexBam"), list("BamFile"))(files, ...)
list(list("sortBam"), list("BamFile"))(file, destination, ..., byQname=FALSE, maxMemory=512)
list(list("mergeBam"), list("BamFileList"))(files, destination, ...)
## reading
list(list("scanBam"), list("BamFile"))(file, index=file, ..., param=ScanBamParam(what=scanBamWhat()))
## counting
list(list("idxstatsBam"), list("BamFile"))(file, index=file, ...)
list(list("countBam"), list("BamFile"))(file, index=file, ..., param=ScanBamParam())
list(list("countBam"), list("BamFileList"))(file, index=file, ..., param=ScanBamParam())
list(list("quickBamFlagSummary"), list("BamFile"))(file, ..., param=ScanBamParam(), main.groups.only=FALSE)
Arguments
Argument | Description |
---|---|
... | Additional arguments. For BamFileList , this can either be a single character vector of paths to BAM files, or several instances of BamFile objects. When a character vector of paths, a second named argument index can be a character() vector of length equal to the first argument specifying the paths to the index files, or character() to indicate that no index file is available. See BamFile . |
con | An instance of BamFile . |
x, object, file, files | A character vector of BAM file paths (for BamFile ) or a BamFile instance (for other methods). |
index | character(1); the BAM index file path (for BamFile ); ignored for all other methods on this page. |
yieldSize | Number of records to yield each time the file is read from with scanBam . See Fields section for details. |
asMates | Logical indicating if records should be paired as mates. See Fields section for details. |
qnamePrefixEnd | Single character (or NA) marking the end of the qname prefix. When specified, all characters prior to and including the qnamePrefixEnd are removed from the qname. If the prefix is not found in the qname the qname is not trimmed. Currently only implemented for mate-pairing (i.e., when asMates=TRUE in a BamFile. |
qnameSuffixStart | Single character (or NA) marking the start of the qname suffix. When specified, all characters following and including the qnameSuffixStart are removed from the qname. If the suffix is not found in the qname the qname is not trimmmed. Currently only implemented for mate-pairing (i.e., when asMates=TRUE in a BamFile. |
obeyQname | Logical indicating if the BAM file is sorted by qname . In Bioconductor > 2.12 paired-end files do not need to be sorted by qname . Instead use asMates=TRUE for reading paired-end data. See Fields section for details. |
value | Logical value for setting asMates and obeyQname in a BamFile instance. |
what | For scanBamHeader , a character vector specifying that either or both of c("targets", "text") are to be extracted from the header; see scanBam for additional detail. |
filter | A FilterRules instance. Functions in the FilterRules instance should expect a single DataFrame argument representing all information specified by param . Each function must return a logical vector, usually of length equal to the number of rows of the DataFrame . Return values are used to include (when TRUE ) corresponding records in the filtered BAM file. |
destination | character(1) file path to write filtered reads to. |
indexDestination | logical(1) indicating whether the destination file should also be indexed. |
byQname, maxMemory | See sortBam . |
param | An optional ScanBamParam instance to further influence scanning, counting, or filtering. |
rw | Mode of file; ignored. |
main.groups.only | See quickBamFlagSummary . |
Seealso
The
readGAlignments
,readGAlignmentPairs
, andreadGAlignmentsList
functions defined in the GenomicAlignments package.summarizeOverlaps
and findSpliceOverlaps-methods in the GenomicAlignments package for methods that work on a BamFile and BamFileList objects.
Author
Martin Morgan and Marc Carlson
Examples
##
## BamFile options.
##
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
bf <- BamFile(fl)
bf
## When 'asMates=TRUE' scanBam() reads the data in as
## pairs. See 'asMates' above for details of the pairing
## algorithm.
asMates(bf) <- TRUE
## When 'yieldSize' is set, scanBam() will iterate
## through the file in chunks.
yieldSize(bf) <- 500
## Some applications append a filename (e.g., NCBI Sequence Read
## Archive (SRA) toolkit) or allele identifier to the sequence qname.
## This may result in a unique qname for each record which presents a
## problem when mating paired-end reads (identical qnames is one
## criteria for paired-end mating). 'qnamePrefixEnd' and
## 'qnameSuffixStart' can be used to trim an unwanted prefix or suffix.
qnamePrefixEnd(bf) <- "/"
qnameSuffixStart(bf) <- "."
##
## Reading Bam files.
##
fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
(bf <- BamFile(fl))
head(seqlengths(bf)) # sequences and lengths in BAM file
if (require(RNAseqData.HNRNPC.bam.chr14)) {
bfl <- BamFileList(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
bfl
bfl[1:2] # subset
bfl[[1]] # select first element -- BamFile
## merged across BAM files
seqinfo(bfl)
head(seqlengths(bfl))
}
length(scanBam(fl)[[1]][[1]]) # all records
bf <- open(BamFile(fl)) # implicit index
bf
identical(scanBam(bf), scanBam(fl))
close(bf)
## Use 'yieldSize' to iterate through a file in chunks.
bf <- open(BamFile(fl, yieldSize=1000))
while (nrec <- length(scanBam(bf)[[1]][[1]]))
cat("records:", nrec, "
")
close(bf)
## Repeatedly visit multiple ranges in the BamFile.
rng <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584)))
bf <- open(BamFile(fl))
sapply(seq_len(length(rng)), function(i, bamFile, rng) {
param <- ScanBamParam(which=rng[i], what="seq")
bam <- scanBam(bamFile, param=param)[[1]]
alphabetFrequency(bam[["seq"]], baseOnly=TRUE, collapse=TRUE)
}, bf, rng)
close(bf)
BamViews_class()
Views into a set of BAM files
Description
Use BamViews()
to reference a set of disk-based BAM files to be
processed (e.g., queried using scanBam
) as a single
experiment .
Usage
## Constructor
BamViews(bamPaths=character(0),
bamIndicies=bamPaths,
bamSamples=DataFrame(row.names=make.unique(basename(bamPaths))),
bamRanges, bamExperiment = list(), ...)
list(list("BamViews"), list("missing"))(bamPaths=character(0),
bamIndicies=bamPaths,
bamSamples=DataFrame(row.names=make.unique(basename(bamPaths))),
bamRanges, bamExperiment = list(), ..., auto.range=FALSE)
## Accessors
bamPaths(x)
bamSamples(x)
bamSamples(x) <- value
bamRanges(x)
bamRanges(x) <- value
bamExperiment(x)
list(list("names"), list("BamViews"))(x)
list(list("names"), list("BamViews"))(x) <- value
list(list("dimnames"), list("BamViews"))(x)
list(list("dimnames"), list("BamViews,ANY"))(x) <- value
bamDirname(x, ...) <- value
## Subset
list(list("["), list("BamViews,ANY,ANY"))(x, i, j, list(), drop=TRUE)
list(list("["), list("BamViews,ANY,missing"))(x, i, j, list(), drop=TRUE)
list(list("["), list("BamViews,missing,ANY"))(x, i, j, list(), drop=TRUE)
## Input
list(list("scanBam"), list("BamViews"))(file, index = file, ..., param = ScanBamParam(what=scanBamWhat()))
list(list("countBam"), list("BamViews"))(file, index = file, ..., param = ScanBamParam())
## Show
list(list("show"), list("BamViews"))(object)
Arguments
Argument | Description |
---|---|
bamPaths | A character() vector of BAM path names. |
bamIndicies | A character() vector of BAM index file path names, without the .bai extension. |
bamSamples | A DataFrame instance with as many rows as length(bamPaths) , containing sample information associated with each path. |
bamRanges | Missing or a GRanges instance with ranges defined on the reference chromosomes of the BAM files. Ranges are not validated against the BAM files. |
bamExperiment | A list() containing additional information about the experiment. |
auto.range | If TRUE and all bamPaths exist, populate the ranges with the union of ranges returned in the target element of scanBamHeader . |
... | Additional arguments. |
x | An instance of BamViews . |
object | An instance of BamViews . |
value | An object of appropriate type to replace content. |
i | During subsetting, a logical or numeric index into bamRanges . |
j | During subsetting, a logical or numeric index into bamSamples and bamPaths . |
drop | A logical(1), ignored by all BamViews subsetting methods. |
file | An instance of BamViews . |
index | A character vector of indices, corresponding to the bamPaths(file) . |
param | An optional ScanBamParam instance to further influence scanning or counting. |
Author
Martin Morgan
Examples
fls <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
rngs <- GRanges(seqnames = Rle(c("chr1", "chr2"), c(9, 9)),
ranges = c(IRanges(seq(10000, 90000, 10000), width=500),
IRanges(seq(100000, 900000, 100000), width=5000)),
Count = seq_len(18L))
v <- BamViews(fls, bamRanges=rngs)
v
v[1:5,]
bamRanges(v[c(1:5, 11:15),])
bamDirname(v) <- getwd()
v
BcfFile_class()
Manipulate BCF files.
Description
Use BcfFile()
to create a reference to a BCF (and optionally
its index). The reference remains open across calls to methods,
avoiding costly index re-loading.
BcfFileList()
provides a convenient way of managing a list of
BcfFile
instances.
Usage
## Constructors
BcfFile(file, index = file,
mode=ifelse(grepl("\.bcf$", file), "rb", "r"))
BcfFileList(...)
## Opening / closing
list(list("open"), list("BcfFile"))(con, ...)
list(list("close"), list("BcfFile"))(con, ...)
## accessors; also path(), index()
list(list("isOpen"), list("BcfFile"))(con, rw="")
bcfMode(object)
## actions
list(list("scanBcfHeader"), list("BcfFile"))(file, ...)
list(list("scanBcf"), list("BcfFile"))(file, ..., param=ScanBcfParam())
list(list("indexBcf"), list("BcfFile"))(file, ...)
Arguments
Argument | Description |
---|---|
con, object | An instance of BcfFile . |
file | A character(1) vector of the BCF file path or, (for indexBcf) an instance of BcfFile point to a BCF file. |
index | A character(1) vector of the BCF index. |
mode | A character(1) vector; mode="rb" indicates a binary (BCF) file, mode="r" a text (VCF) file. |
param | An optional ScanBcfParam instance to further influence scanning. |
... | Additional arguments. For BcfFileList , this can either be a single character vector of paths to BCF files, or several instances of BcfFile objects. |
rw | Mode of file; ignored. |
Author
Martin Morgan
Examples
fl <- system.file("extdata", "ex1.bcf.gz", package="Rsamtools",
mustWork=TRUE)
bf <- BcfFile(fl) # implicit index
bf
identical(scanBcf(bf), scanBcf(fl))
rng <- GRanges(c("seq1", "seq2"), IRanges(1, c(1575, 1584)))
param <- ScanBcfParam(which=rng)
bcf <- scanBcf(bf, param=param) ## all ranges
## ranges one at a time 'bf'
open(bf)
sapply(seq_len(length(rng)), function(i, bcfFile, rng) {
param <- ScanBcfParam(which=rng)
bcf <- scanBcf(bcfFile, param=param)[[1]]
## do extensive work with bcf
isOpen(bf) ## file remains open
}, bf, rng)
FaFile_class()
Manipulate indexed fasta files.
Description
Use FaFile()
to create a reference to an indexed fasta
file. The reference remains open across calls to methods, avoiding
costly index re-loading.
FaFileList()
provides a convenient way of managing a list of
FaFile
instances.
Usage
## Constructors
FaFile(file, index=sprintf("%s.fai", file),
gzindex=sprintf("%s.gzi", file))
FaFileList(...)
## Opening / closing
list(list("open"), list("FaFile"))(con, ...)
list(list("close"), list("FaFile"))(con, ...)
## accessors; also path(), index()
list(list("gzindex"), list("FaFile"))(object, asNA=TRUE)
list(list("gzindex"), list("FaFileList"))(object, asNA=TRUE)
list(list("isOpen"), list("FaFile"))(con, rw="")
## actions
list(list("indexFa"), list("FaFile"))(file, ...)
list(list("scanFaIndex"), list("FaFile"))(file, ...)
list(list("scanFaIndex"), list("FaFileList"))(file, ..., as=c("GRangesList", "GRanges"))
list(list("seqinfo"), list("FaFile"))(x)
list(list("countFa"), list("FaFile"))(file, ...)
list(list("scanFa"), list("FaFile,GRanges"))(file, param, ...,
as=c("DNAStringSet", "RNAStringSet", "AAStringSet"))
list(list("scanFa"), list("FaFile,IntegerRangesList"))(file, param, ...,
as=c("DNAStringSet", "RNAStringSet", "AAStringSet"))
list(list("scanFa"), list("FaFile,missing"))(file, param, ...,
as=c("DNAStringSet", "RNAStringSet", "AAStringSet"))
list(list("getSeq"), list("FaFile"))(x, param, ...)
list(list("getSeq"), list("FaFileList"))(x, param, ...)
Arguments
Argument | Description |
---|---|
con, object, x | An instance of FaFile or (for gzindex and getSeq ) FaFileList . |
file, index, gzindex | A character(1) vector of the fasta or fasta index or bgzip index file path (for FaFile ), or an instance of class FaFile or FaFileList (for scanFaIndex , getSeq ). |
asNA | logical indicating if missing output should be NA or character() |
param | An optional GRanges or IntegerRangesList instance to select reads (and sub-sequences) for input. See Methods, below. |
... | Additional arguments. |
For
FaFileList
, this can either be a single character vector of paths to BAM files, or several instances ofFaFile
objects.For
scanFa,FaFile,missing-method
this can include arguemnts toreadDNAStringSet
/readRNAStringSet
/readAAStringSet
whenparam
is missing .
|rw
| Mode of file; ignored.| |as
| A character(1) vector indicating the type of object to return. |For
scanFaIndex
, defaultGRangesList
, with index information from each file is returned as an element of the list.For
scanFa
, defaultDNAStringSet
.GRangesList
, index information is collapsed across files into the unique index elements.
Author
Martin Morgan
Examples
fl <- system.file("extdata", "ce2dict1.fa", package="Rsamtools",
mustWork=TRUE)
fa <- open(FaFile(fl)) # open
countFa(fa)
(idx <- scanFaIndex(fa))
(dna <- scanFa(fa, param=idx[1:2]))
ranges(idx) <- narrow(ranges(idx), -10) # last 10 nucleotides
(dna <- scanFa(fa, param=idx[1:2]))
PileupFiles_class()
Represent BAM files for pileup summaries.
Description
Use PileupFiles()
to create a reference to BAM files (and
their indicies), to be used for calculating pile-up summaries.
Usage
## Constructors
PileupFiles(files, ..., param=ApplyPileupsParam())
list(list("PileupFiles"), list("character"))(files, ..., param=ApplyPileupsParam())
list(list("PileupFiles"), list("list"))(files, ..., param=ApplyPileupsParam())
## opening / closing
## open(con, ...)
## close(con, ...)
## accessors; also path()
list(list("isOpen"), list("PileupFiles"))(con, rw="")
plpFiles(object)
plpParam(object)
## actions
list(list("applyPileups"), list("PileupFiles,missing"))(files, FUN, ..., param)
list(list("applyPileups"), list("PileupFiles,ApplyPileupsParam"))(files, FUN, ..., param)
## display
list(list("show"), list("PileupFiles"))(object)
Arguments
Argument | Description |
---|---|
files | For PileupFiles , a character() or list of BamFile instances representing files to be included in the pileup. Using a list of BamFile allows indicies to be specified when these are in non-standard format. All elements of list() must be the same type. For applyPileups,PileupFiles-method , a PileupFiles instance. |
list() | Additional arguments, currently ignored. |
con, object | An instance of PileupFiles . |
FUN | A function of one argument; see applyPileups . |
param | An instance of ApplyPileupsParam , to select which records to include in the pileup, and which summary information to return. |
rw | character() indicating mode of file; not used for TabixFile . |
Author
Martin Morgan
Examples
example(applyPileups)
RsamtoolsFileList_class()
A base class for managing lists of Rsamtools file references
Description
RsamtoolsFileList
is a base class for managing lists of file
references in Rsamtools ; it is not intended for direct use --
see, e.g., BamFileList
.
Usage
list(list("path"), list("RsamtoolsFileList"))(object, ...)
list(list("index"), list("RsamtoolsFileList"))(object, ..., asNA = TRUE)
list(list("isOpen"), list("RsamtoolsFileList"))(con, rw="")
list(list("open"), list("RsamtoolsFileList"))(con, ...)
list(list("close"), list("RsamtoolsFileList"))(con, ...)
list(list("names"), list("RsamtoolsFileList"))(x)
list(list("yieldSize"), list("RsamtoolsFileList"))(object, ...)
Arguments
Argument | Description |
---|---|
con, object, x | An instance of a class derived from RsamtoolsFileList . |
asNA | logical indicating if missing output should be NA or character() |
rw | Mode of file; ignored. |
list() | Additional arguments. |
Author
Martin Morgan
RsamtoolsFile_class()
A base class for managing file references in Rsamtools
Description
RsamtoolsFile
is a base class for managing file references in
Rsamtools ; it is not intended for direct use by users -- see, e.g.,
BamFile
.
Usage
## accessors
list(list("path"), list("RsamtoolsFile"))(object, ...)
list(list("index"), list("RsamtoolsFile"))(object, ..., asNA = TRUE)
list(list("isOpen"), list("RsamtoolsFile"))(con, rw="")
list(list("yieldSize"), list("RsamtoolsFile"))(object, ...)
yieldSize(object, ...) <- value
list(list("show"), list("RsamtoolsFile"))(object)
Arguments
Argument | Description |
---|---|
con, object | An instance of a class derived from RsamtoolsFile . |
asNA | logical indicating if missing output should be NA or character() |
rw | Mode of file; ignored. |
list() | Additional arguments, unused. |
value | Replacement value. |
Author
Martin Morgan
Rsamtools_package()
'samtools' aligned sequence utilities interface
Description
This package provides facilities for parsing samtools BAM (binary) files representing aligned sequences.
Details
See packageDescription('Rsamtools')
for package details. A
useful starting point is the scanBam
manual page.
Note
This package documents the following classes for purely internal
reasons, see help pages in other packages: bzfile
, fifo
,
gzfile
, pipe
, unz
, url
.
Author
Author: Martin Morgan
Maintainer: Biocore Team c/o BioC user list bioconductor@stat.math.ethz.ch
References
The current source code for samtools and bcftools is from https://github.com/samtools/samtools . Additional material is at http://samtools.sourceforge.net/ .
Examples
packageDescription('Rsamtools')
ScanBamParam_class()
Parameters for scanning BAM files
Description
Use ScanBamParam()
to create a parameter object influencing
what fields and which records are imported from a (binary) BAM file.
Use of which
requires that a BAM index file
( <filename>.bai
) exists.
Usage
# Constructor
ScanBamParam(flag = scanBamFlag(), simpleCigar = FALSE,
reverseComplement = FALSE, tag = character(0), tagFilter = list(),
what = character(0), which, mapqFilter=NA_integer_)
# Constructor helpers
scanBamFlag(isPaired = NA, isProperPair = NA, isUnmappedQuery = NA,
hasUnmappedMate = NA, isMinusStrand = NA, isMateMinusStrand = NA,
isFirstMateRead = NA, isSecondMateRead = NA, isNotPrimaryRead = NA,
isSecondaryAlignment = NA, isNotPassingQualityControls = NA,
isDuplicate = NA, isSupplementaryAlignment = NA)
scanBamWhat()
# Accessors
bamFlag(object, asInteger=FALSE)
bamFlag(object) <- value
bamReverseComplement(object)
bamReverseComplement(object) <- value
bamSimpleCigar(object)
bamSimpleCigar(object) <- value
bamTag(object)
bamTag(object) <- value
bamTagFilter(object)
bamTagFilter(object) <- value
bamWhat(object)
bamWhat(object) <- value
bamWhich(object)
bamWhich(object) <- value
bamMapqFilter(object)
bamMapqFilter(object) <- value
list(list("show"), list("ScanBamParam"))(object)
# Flag utils
bamFlagAsBitMatrix(flag, bitnames=FLAG_BITNAMES)
bamFlagAND(flag1, flag2)
bamFlagTest(flag, value)
Arguments
Argument | Description |
---|---|
flag | For ScanBamParam , an integer(2) vector used to filter reads based on their 'flag' entry. This is most easily created with the scanBamFlag() helper function. For bamFlagAsBitMatrix , bamFlagTest an integer vector where each element represents a 'flag' entry. |
simpleCigar | A logical(1) vector which, when TRUE, returns only those reads for which the cigar (run-length encoded representation of the alignment) is missing or contains only matches / mismatches ( 'M' ). |
reverseComplement | A logical(1) vectors. BAM files store reads mapping to the minus strand as though they are on the plus strand. Rsamtools obeys this convention by default ( reverseComplement=FALSE ), but when this value is set to TRUE returns the sequence and quality scores of reads mapped to the minus strand in the reverse complement (sequence) and reverse (quality) of the read as stored in the BAM file. This might be useful if wishing to recover read and quality scores as represented in fastq files, but is NOT appropriate for variant calling or other alignment-based operations. |
tag | A character vector naming tags to be extracted. A tag is an optional field, with arbitrary information, stored with each record. Tags are identified by two-letter codes, so all elements of tag must have exactly 2 characters. |
tagFilter | A named list of atomic vectors. The name of each list element is the tag name (two-letter code), and the corresponding atomic vector is the set of acceptable values for the tag. Only reads with specified tags are included. NULL s, NA s, and empty strings are not allowed in the atomic vectors. |
what | A character vector naming the fields to return scanBamWhat() returns a vector of available fields. Fields are described on the scanBam help page. |
mapqFilter | A non-negative integer(1) specifying the minimum mapping quality to include. BAM records with mapping qualities less than mapqFilter are discarded. |
which | A GRanges , IntegerRangesList , or any object that can be coerced to a IntegerRangesList , or missing object, from which a IRangesList instance will be constructed. Names of the IRangesList correspond to reference sequences, and ranges to the regions on that reference sequence for which matches are desired. Because data types are coerced to IRangesList , which does not include strand information (use the flag argument instead). Only records with a read overlapping the specified ranges are returned. All ranges must have ends less than or equal to 536870912. When one record overlaps two ranges in which , the record is returned twice . |
isPaired | A logical(1) indicating whether unpaired (FALSE), paired (TRUE), or any (NA) read should be returned. |
isProperPair | A logical(1) indicating whether improperly paired (FALSE), properly paired (TRUE), or any (NA) read should be returned. A properly paired read is defined by the alignment algorithm and might, e.g., represent reads aligning to identical reference sequences and with a specified distance. |
isUnmappedQuery | A logical(1) indicating whether unmapped (TRUE), mapped (FALSE), or any (NA) read should be returned. |
hasUnmappedMate | A logical(1) indicating whether reads with mapped (FALSE), unmapped (TRUE), or any (NA) mate should be returned. |
isMinusStrand | A logical(1) indicating whether reads aligned to the plus (FALSE), minus (TRUE), or any (NA) strand should be returned. |
isMateMinusStrand | A logical(1) indicating whether mate reads aligned to the plus (FALSE), minus (TRUE), or any (NA) strand should be returned. |
isFirstMateRead | A logical(1) indicating whether the first mate read should be returned (TRUE) or not (FALSE), or whether mate read number should be ignored (NA). |
isSecondMateRead | A logical(1) indicating whether the second mate read should be returned (TRUE) or not (FALSE), or whether mate read number should be ignored (NA). |
isNotPrimaryRead | Deprecated; use isSecondaryAlignment . |
isSecondaryAlignment | A logical(1) indicating whether alignments that are secondary (TRUE), are not (FALSE) or whose secondary status does not matter (NA) should be returned. A non-primary alignment ( secondary alignment in the SAM specification) might result when a read aligns to multiple locations. One alignment is designated as primary and has this flag set to FALSE; the remainder, for which this flag is TRUE, are designated by the aligner as secondary. |
isNotPassingQualityControls | A logical(1) indicating whether reads passing quality controls (FALSE), reads not passing quality controls (TRUE), or any (NA) read should be returned. |
isDuplicate | A logical(1) indicating that un-duplicated (FALSE), duplicated (TRUE), or any (NA) reads should be returned. 'Duplicated' reads may represent PCR or optical duplicates. |
isSupplementaryAlignment | A logical(1) indicating that non-supplementary (FALSE), supplementary (TRUE), or any (NA) reads should be returned. The SAM specification indicates that 'supplementary' reads are part of a chimeric alignment. |
object | An instance of class ScanBamParam . |
value | An instance of the corresponding slot, to be assigned to object or, for bamFlagTest , a character(1) name of the flag to test, e.g., isUnmappedQuery , from the arguments to scanBamFlag . |
asInteger | logical(1) indicating whether flag should be returned as an encoded integer vector ( TRUE ) or human-readable form ( FALSE ). |
bitnames | Names of the flag bits to extract. Will be the colnames of the returned matrix. |
flag1, flag2 | Integer vectors containing flag entries. |
Seealso
Author
Martin Morgan
Examples
## defaults
p0 <- ScanBamParam()
## subset of reads based on genomic coordinates
which <- IRangesList(seq1=IRanges(1000, 2000),
seq2=IRanges(c(100, 1000), c(1000, 2000)))
p1 <- ScanBamParam(what=scanBamWhat(), which=which)
## subset of reads based on 'flag' value
p2 <- ScanBamParam(what=scanBamWhat(),
flag=scanBamFlag(isMinusStrand=FALSE))
## subset of fields
p3 <- ScanBamParam(what=c("rname", "strand", "pos", "qwidth"))
## use
fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
res <- scanBam(fl, param=p2)[[1]]
lapply(res, head)
## tags; NM: edit distance; H1: 1-difference hits
p4 <- ScanBamParam(tag=c("NM", "H1"), what="flag")
bam4 <- scanBam(fl, param=p4)
str(bam4[[1]][["tag"]])
## tagFilter
p5 <- ScanBamParam(tag=c("NM", "H1"), tagFilter=list(NM=c(2, 3, 4)))
bam5 <- scanBam(fl, param=p5)
table(bam5[[1]][["tag"]][["NM"]])
## flag utils
flag <- scanBamFlag(isUnmappedQuery=FALSE, isMinusStrand=TRUE)
p6 <- ScanBamParam(what="flag")
bam6 <- scanBam(fl, param=p6)
flag6 <- bam6[[1]][["flag"]]
head(bamFlagAsBitMatrix(flag6[1:9]))
colSums(bamFlagAsBitMatrix(flag6))
flag
bamFlagAsBitMatrix(flag)
ScanBcfParam_class()
Parameters for scanning BCF files
Description
Use ScanBcfParam()
to create a parameter object influencing the
INFO and GENO fields parsed, and which sample records are
imported from a BCF file. Use of which
requires that a BCF
index file ( <filename>.bci
) exists.
Usage
ScanBcfParam(fixed=character(), info=character(), geno=character(),
samples=character(), trimEmpty=TRUE, which, ...)
list(list("ScanBcfParam"), list("missing"))(fixed=character(), info=character(), geno=character(),
samples=character(), trimEmpty=TRUE, which, ...)
list(list("ScanBcfParam"), list("IntegerRangesList"))(fixed=character(), info=character(), geno=character(),
samples=character(), trimEmpty=TRUE, which, ...)
list(list("ScanBcfParam"), list("GRanges"))(fixed=character(), info=character(), geno=character(),
samples=character(), trimEmpty=TRUE, which, ...)
list(list("ScanBcfParam"), list("GRangesList"))(fixed=character(), info=character(), geno=character(),
samples=character(), trimEmpty=TRUE, which, ...)
## Accessors
bcfFixed(object)
bcfInfo(object)
bcfGeno(object)
bcfSamples(object)
bcfTrimEmpty(object)
bcfWhich(object)
Arguments
Argument | Description |
---|---|
fixed | A logical(1) for use with ScanVcfParam only. |
info | A character() vector of INFO fields (see scanVcfHeader ) to be returned. |
geno | A character() vector of GENO fields (see scanVcfHeader ) to be returned. character(0) returns all fields, NA_character_ returns none. |
samples | A character() vector of sample names (see scanVcfHeader ) to be returned. character(0) returns all fields, NA_character_ returns none. |
trimEmpty | A logical(1) indicating whether GENO fields with no values should be returned. |
which | An object, for which a method is defined (see usage, above), describing the sequences and ranges to be queried. Variants whose POS lies in the interval(s) [start, end) are returned. |
object | An instance of class ScanBcfParam . |
list() | Arguments used internally. |
Seealso
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
## see ?ScanVcfParam examples
TabixFile_class()
Manipulate tabix indexed tab-delimited files.
Description
Use TabixFile()
to create a reference to a Tabix file (and its
index). Once opened, the reference remains open across calls to
methods, avoiding costly index re-loading.
TabixFileList()
provides a convenient way of managing a list of
TabixFile
instances.
Usage
## Constructors
TabixFile(file, index = paste(file, "tbi", sep="."), ...,
yieldSize=NA_integer_)
TabixFileList(...)
## Opening / closing
list(list("open"), list("TabixFile"))(con, ...)
list(list("close"), list("TabixFile"))(con, ...)
## accessors; also path(), index(), yieldSize()
list(list("isOpen"), list("TabixFile"))(con, rw="")
## actions
list(list("seqnamesTabix"), list("TabixFile"))(file, ...)
list(list("headerTabix"), list("TabixFile"))(file, ...)
list(list("scanTabix"), list("TabixFile,GRanges"))(file, ..., param)
list(list("scanTabix"), list("TabixFile,IntegerRangesList"))(file, ..., param)
list(list("scanTabix"), list("TabixFile,missing"))(file, ..., param)
list(list("scanTabix"), list("character,ANY"))(file, ..., param)
list(list("scanTabix"), list("character,missing"))(file, ..., param)
countTabix(file, ...)
Arguments
Argument | Description |
---|---|
con | An instance of TabixFile . |
file | For TabixFile(), A character(1) vector to the tabix file path; can be remote (http://, ftp://). For countTabix , a character(1) or TabixFile instance. For others, a TabixFile instance. |
index | A character(1) vector of the tabix file index. |
yieldSize | Number of records to yield each time the file is read from using scanTabix . Only valid when param is unspecified. yieldSize does not alter existing yield sizes, include NA , when creating a TabixFileList from TabixFile instances. |
param | An instance of GRanges or IntegerRangesList, used to select which records to scan. |
... | Additional arguments. For TabixFileList , this can include file , index , and yieldSize arguments. The file can be a single character vector of paths to tabix files (and optionally a similarly lengthed vector of index ), or several instances of TabixFile objects. The arguments can also include yieldSize , applied to all elements of the list. |
rw | character() indicating mode of file; not used for TabixFile . |
Author
Martin Morgan
Examples
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
mustWork=TRUE)
tbx <- TabixFile(fl)
param <- GRanges(c("chr1", "chr2"), IRanges(c(1, 1), width=100000))
countTabix(tbx)
countTabix(tbx, param=param)
res <- scanTabix(tbx, param=param)
sapply(res, length)
res[["chr1:1-100000"]][1:2]
## parse to list of data.frame's
dff <- Map(function(elt) {
read.csv(textConnection(elt), sep=" ", header=FALSE)
}, res)
dff[["chr1:1-100000"]][1:5,1:8]
## parse 100 records at a time
length(scanTabix(tbx)[[1]]) # total number of records
tbx <- open(TabixFile(fl, yieldSize=100))
while(length(res <- scanTabix(tbx)[[1]]))
cat("records read:", length(res), "
")
close(tbx)
applyPileups()
Apply a user-provided function to calculate pile-up statistics across multiple BAM files.
Description
applyPileups
scans one or more BAM files, returning
position-specific sequence and quality summaries.
Usage
applyPileups(files, FUN, ..., param)
Arguments
Argument | Description |
---|---|
files | A PileupFiles instances. |
|FUN
| A function of 1 argument, x
, to be evaluated for each yield (see yieldSize
, yieldBy
, yieldAll
). The argument x
is a list
, with elements describing the current pile-up. The elements of the list are determined by the argument what
, and include: list("
", "
", " ", list(list("seqnames:"), list("(Always returned) A named ", list("integer()"), "
", " representing the seqnames corresponding to each position
", " reported in the pile-up. This is a run-length encoding, where
", " the names of the elements represent the seqnames, and the values
", " the number of successive positions corresponding to that
", " seqname.")), "
", "
", " ", list(list("pos:"), list("Always returned) A ", list("integer()"), " representing the
", |
" genomic coordinate of each pile-up position.")), "
", " ", " ", list(list("seq:"), list("An ", list("array"), " of dimensions nucleotide x file x ", " position. ", " ", " The ", list("nucleotide"), " dimension is length 5, corresponding to ", " ", list("A"), ", ", list("C"), ", ", list("G"), ", ", list("T"), ", and ", list("N"), " ", " respectively. ", " ", " Entries in the array represent the number of times the ", " nucleotide occurred in reads in the file overlapping the ",
" position.
", "
", " ")), "
", "
", " ", list(list("qual:"), list("Like ", list("seq"), ", but summarizing quality; the first
", " dimension is the Phred-encoded quality score, ranging from
", " ", list("!"), " (0) to ", list("~"), " (93).")), "
", "
", " ")
|list()
| Additional arguments, passed to methods.|
|param
| An instance of the object returned by ApplyPileupsParam
.|
Details
Regardless of param
values, the algorithm follows samtools by
excluding reads flagged as unmapped, secondary, duplicate, or failing
quality control.
Value
applyPileups
returns a list
equal in length to the
number of times FUN
has been called, with each element
containing the result of FUN
.
ApplyPileupsParam
returns an object describing the parameters.
Seealso
Author
Martin Morgan
References
http://samtools.sourceforge.net/
Examples
## The examples below are currently broken and have been disabled for now
fl <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
fls <- PileupFiles(c(fl, fl))
calcInfo <-
function(x)
{
## information at each pile-up position
info <- apply(x[["seq"]], 2, function(y) {
y <- y[c("A", "C", "G", "T"),,drop=FALSE]
y <- y + 1L # continuity
cvg <- colSums(y)
p <- y / cvg[col(y)]
h <- -colSums(p * log(p))
ifelse(cvg == 4L, NA, h)
})
list(seqnames=x[["seqnames"]], pos=x[["pos"]], info=info)
}
which <- GRanges(c("seq1", "seq2"), IRanges(c(1000, 1000), 2000))
param <- ApplyPileupsParam(which=which, what="seq")
res <- applyPileups(fls, calcInfo, param=param)
str(res)
head(res[[1]][["pos"]])# positions matching param
head(res[[1]][["info"]])# inforamtion in each file
## 'param' as part of 'files'
fls1 <- PileupFiles(c(fl, fl), param=param)
res1 <- applyPileups(fls1, calcInfo)
identical(res, res1)
## yield by position, across ranges
param <- ApplyPileupsParam(which=which, yieldSize=500L,
yieldBy="position", what="seq")
res <- applyPileups(fls, calcInfo, param=param)
sapply(res, "[[", "seqnames")
defunct()
Rsamtools Deprecated and Defunct
Description
The function, class, or data object you have asked for has been deprecated or made defunct.
deprecated()
Deprecated functions
Description
Functions listed on this page are no longer supported.
Details
For yieldTabix
, use the yieldSize
argument of
TabixFiles
.
For BamSampler
, use REDUCEsampler
with
reduceByYield
in the GenomicFiles
package.
Author
Martin Morgan mtmorgan@fhcrc.org.
headerTabix()
Retrieve sequence names defined in a tabix file.
Description
This function queries a tabix file, returning the names of the sequences used as a key when creating the file.
Usage
headerTabix(file, ...)
list(list("headerTabix"), list("character"))(file, ...)
Arguments
Argument | Description |
---|---|
file | A character(1) file path or TabixFile instance pointing to a tabix file. |
... | Additional arguments, currently ignored. |
Value
A list(4)
of the sequence names, column indicies used to sort
the file, the number of lines skipped while indexing, and the comment
character used while indexing.
Author
Martin Morgan mtmorgan@fhcrc.org.
Examples
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
mustWork=TRUE)
headerTabix(fl)
indexTabix()
Compress and index tabix-compatible files.
Description
Index (with indexTabix
) files that have been sorted into
ascending sequence, start and end position ordering.
Usage
indexTabix(file,
format=c("gff", "bed", "sam", "vcf", "vcf4", "psltbl"),
seq=integer(), start=integer(), end=integer(),
skip=0L, comment="#", zeroBased=FALSE, ...)
Arguments
Argument | Description |
---|---|
file | A characater(1) path to a sorted, bgzip-compressed file. |
format | The format of the data in the compressed file. A characater(1) matching one of the types named in the function signature. |
seq | If format is missing, then seq indicates the column in which the sequence identifier (e.g., chrq) is to be found. |
start | If format is missing, start indicates the column containing the start coordinate of the feature to be indexed. |
end | If format is missing, end indicates the column containing the ending coordinate of the feature to be indexed. |
skip | The number of lines to be skipped at the beginning of the file. |
comment | A single character which, when present as the first character in a line, indicates that the line is to be omitted. from indexing. |
zeroBased | A logical(1) indicating whether coordinats in the file are zero-based. |
... | Additional arguments. |
Value
The return value of indexTabix
is an updated instance of
file
reflecting the newly-created index file.
Author
Martin Morgan mtmorgan@fhcrc.org.
References
http://samtools.sourceforge.net/tabix.shtml
Examples
from <- system.file("extdata", "ex1.sam", package="Rsamtools",
mustWork=TRUE)
to <- tempfile()
zipped <- bgzip(from, to)
idx <- indexTabix(zipped, "sam")
tab <- TabixFile(zipped, idx)
pileup()
Use filters and output formats to calculate pile-up statistics for a BAM file.
Description
pileup
uses PileupParam
and ScanBamParam
objects
to calculate pileup statistics for a BAM file. The result is a
data.frame
with columns summarizing counts of reads overlapping
each genomic position, optionally differentiated on nucleotide,
strand, and position within read.
Usage
pileup(file, index=file, ..., scanBamParam=ScanBamParam(),
pileupParam=PileupParam())
## PileupParam constructor
PileupParam(max_depth=250, min_base_quality=13, min_mapq=0,
min_nucleotide_depth=1, min_minor_allele_depth=0,
distinguish_strands=TRUE, distinguish_nucleotides=TRUE,
ignore_query_Ns=TRUE, include_deletions=TRUE, include_insertions=FALSE,
left_bins=NULL, query_bins=NULL, cycle_bins=NULL)
phred2ASCIIOffset(phred=integer(),
scheme= c("Illumina 1.8+", "Sanger", "Solexa", "Illumina 1.3+",
"Illumina 1.5+"))
Arguments
Argument | Description |
---|---|
file | character(1) or BamFile ; BAM file path. |
index | When file is character(1), an optional character(1) of BAM index file path; see scanBam . |
list() | Additional arguments, perhaps used by methods. |
scanBamParam | An instance of ScanBamParam . |
pileupParam | An instance of PileupParam . |
max_depth | integer(1); maximum number of overlapping alignments considered for each position in the pileup. |
min_base_quality | integer(1); minimum QUAL value for each nucleotide in an alignment. Use phred2ASCIIOffset to help translate numeric or character values to these offsets. |
min_mapq | integer(1); minimum MAPQ value for an alignment to be included in pileup. |
min_nucleotide_depth | integer(1); minimum count of each nucleotide ( independent of other nucleotides) at a given position required for said nucleotide to appear in the result. |
min_minor_allele_depth | integer(1); minimum count of all nucleotides other than the major allele at a given position required for a particular nucleotide to appear in the result. |
distinguish_strands | logical(1); TRUE if result should differentiate between strands. |
distinguish_nucleotides | logical(1); TRUE if result should differentiate between nucleotides. |
ignore_query_Ns | logical(1); TRUE if non-determinate nucleotides in alignments should be excluded from the pileup. |
include_deletions | logical(1); TRUE to include deletions in pileup. |
include_insertions | logical(1); TRUE to include insertions in pileup. |
left_bins | numeric; all same sign; unique positions within a read to delimit bins. Position within read is based on counting from the 5' end regardless of strand . Minimum of two values are required so at least one range can be formed. NULL (default) indicates no binning. Use negative values to count from the opposite end. Sorted order not required. Floating-point values are coerced to integer . If you want to delimit bins based on sequencing cycles to, e.g., discard later cycles, query_bins probably gives the desired behavior. |
query_bins | numeric; all same sign; unique positions within a read to delimit bins. Position within a read is based on counting from the 5' end when the read aligns to plus strand , counting from the 3' end when read aligns to minus strand. Minimum of two values are required so at least one range can be formed. NULL (default) indicates no binning. Use negative values to count from the opposite end. Sorted order not required. Floating-point values are coerced to integer . |
phred | Either a numeric() (coerced to integer()) PHRED score (e.g., in the range (0, 41) for the Illumina 1.8+ scheme) or character() of printable ASCII characters. When phred is character(), it can be of length 1 with 1 or more characters, or of any length where all elements have exactly 1 character. |
scheme | Encoding scheme, used to translate numeric() PHRED scores to their ASCII code. Ignored when phred is already character(). |
cycle_bins | DEPRECATED. See left_bins for identical behavior. |
Details
NB : Use of pileup
assumes familiarity with
ScanBamParam
, and use of left_bins
and
query_bins
assumes familiarity with cut
.
pileup
visits each position in the BAM file, subject to
constraints implied by which
and flag
of
scanBamParam
. For a given position, all reads overlapping the
position that are consistent with constraints dictated by flag
of scanBamParam
and pileupParam
are used for
counting. pileup
also automatically excludes reads with flags
that indicate any of:
list("unmapped alignment (", list(list("isUnmappedQuery")), ")")
list("secondary alignment (", list(list("isSecondaryAlignment")), ")")
list("not passing quality controls (", list(list("isNotPassingQualityControls")), ")")
list("PCR or optical duplicate (", list(list("isDuplicate")), ")")
If no which
argument is supplied to the ScanBamParam
,
behavior reflects that of scanBam
: the entire file is visited
and counted.
Use a yieldSize
value when creating a BamFile
instance to manage memory consumption when using pileup with large BAM
files. There are some differences in how pileup
behavior is
affected when the yieldSize
value is set on the BAM file. See
more details below.
Many of the parameters of the pileupParam
interact. A simple
illustration is ignore_query_Ns
and
distinguish_nucleotides
, as mentioned in the
ignore_query_Ns
section.
Parameters for pileupParam
belong to two categories: parameters
that affect only the filtering criteria (so-called
behavior-only policies), and parameters that affect
filtering behavior and the schema of the results
( behavior+structure policies).
%% behavior-only
%% (other behavior-only arguments are self-evident or sam %% spcification-evident)
%% behavior+structure
distinguish_nucleotides
and distinguish_strands
when set
to TRUE
each add a column ( nucleotide
and strand
,
respectively) to the resulting data.frame
. If both are TRUE,
each combination of nucleotide
and strand
at a given
position is counted separately. Setting only one to TRUE
behaves as expected; for example, if only nucleotide
is set to
TRUE
, each nucleotide at a given position is counted
separately, but the distinction of on which strand the nucleotide
appears is ignored.
ignore_query_Ns
determines how ambiguous nucloetides are
treated. By default, ambiguous nucleotides (typically N in
BAM files) in alignments are ignored. If ignore_query_Ns
is
FALSE
, ambiguous nucleotides are included in counts; further,
if ignore_query_Ns
is FALSE
and
distinguish_nucleotides
is TRUE
the N
nucleotide value appears in the nucleotide column when a base at a
given position is ambiguous.
By default, deletions with respect to the reference genome to which
the reads were aligned are included in the counts in a pileup. If
include_deletions
is TRUE
and
distinguish_nucleotides
is TRUE
, the nucleotide column
uses a - character to indicate a deletion at a position.
The = nucleotide code in the SEQ
field (to mean
identical to reference genome ) is supported by pileup, such
that a match with the reference genome is counted separately in the
results if distinguish_nucleotides
is TRUE
.
CIGAR support: pileup
handles the extended CIGAR format by only
heeding operations that contribute to counts ( M , D ,
I ). If you are confused about how the different CIGAR
operations interact, see the SAM versions of the BAM files used for
pileup
unit testing for a fairly comprehensive human-readable
treatment.
list("Deletions and clipping: ", " ", " The extended CIGAR allows a number of operations conceptually ", " similar to a ", list("deletion"), " with respect to the reference ", " genome, but offer more specialized meanings than a simple ", " deletion. CIGAR ", list("N"), " operations (not to be confused with ", " ", list("N"), " used for non-determinate bases in the ", list("SEQ"), " field) ", " indicate a large skipped region, ", list("S"), " a soft clip, and ", " ", list("H"), " a hard clip. ", list("N"), ", ", list("S"), ", and ", list("H"), " ", " CIGAR operations are never counted: only counts of true deletions ", " (", list("D"), " in the CIGAR) can be included by setting ", " ", list("include_deletions"), " to ", list("TRUE"), ". ", " ", " Soft clip operations contribute to the relative position of ", " nucleotides within a read, whereas hard clip and refskip ", " operations do not. For example, consider a sequence in a bam file ", " that starts at 11, with a CIGAR ", list("2S1M"), " and ", list("SEQ"), " field ", " ", list("ttA"), ". The cycle position of the ", list("A"), " nucleotide will be ", " ", list("3"), ", but the reported position will be ", list("11"), ". If using ", " ", list("left_bins"), " or ", list("query_bins"), " it might make sense to first ", " preprocess your files to remove soft clipping. ", " ", " ")
list("Insertions and padding: ", " ", " ", list("pileup"), " can include counts of insertion operations by ", " setting ", list("include_insertions"), " to ", list("TRUE"), ". Details about ", " insertions are effectively truncated such that each insertion is ", " reduced to a single ", list("+"), " nucleotide. Information about ", " e.g. the nucleotide code and base quality within the insertion is ", " not included. ", " ", " Because ", list("+"), " is used as a nucleotide code to represent ", " insertions in ", list("pileup"), ", counts of the ", list("+"), " nucleotide ", " participate in voting for ", list("min_nucleotide_depth"), " and ", " ", list("min_minor_allele_depth"), " functionality. ", "
", " The position of an insertion is the position that precedes it on ", " the reference sequence. ", list("Note:"), " this means if ", " ", list("include_insertions"), " is ", list("TRUE"), " a single read will ", " contribute two nucleotide codes (", list("+"), ", plus that of the ", " non-insertion base) at a given position if the non-insertion base ", " passes filter criteria. ", " ", " ", list("P"), " CIGAR (padding) operations never affect counts. ", " ", " ")
The bin arguments query_bins
and left_bins
allow users to differentiate pileup counts based on arbitrary
non-overlapping regions within a read. pileup
relies on
cut
to derive bins, but limits input to numeric values
of the same sign (coerced to integer
s), including
+/-Inf
. If a bin argument is set pileup
automatically excludes bases outside the implicit outer range. Here
are some important points regarding bin arguments:
list(list("query_bins"), " vs. ", list("left_bins"), ":") list(" ", " ", " BAM files store sequence data from the 5' end to the 3' end ", " (regardless of to which strand the read aligns). That means for ", " reads aligned to the minus strand, cycle position within a read is ", " effectively reversed. Take care to use the appropriate bin ", " argument for your use case. ", " ", " The most common use of a bin argument is to bin later cycles ", " separately from earlier cycles; this is because accuracy typically ", " degrades in later cycles. For this application, ", list("query_bins"), " ", " yields the correct behavior because bin positions are relative to ", " cycle order (i.e., sensitive to strand). ", " ", " ", list("left_bins"), " (in contrast with ", list("query_bins"), ") determines ", " bin positions from the 5' end, regardless of strand. ", " ", " ")
list("Positive or negative bin values can be used to delmit bins ", " based on the effective ", list("start"), " or ", list("end"), " of a ", " read. For ", list("left_bin"), " the effective start is always the 5' end ", " (left-to-right as appear in the BAM file). ", " ", " For ", list("query_bin"), " the effective start is the first cycle of the ", " read as it came off the sequencer; that means the 5' end for reads ", " aligned to the plus strand and 3' for reads aligned to the minus ", " strand. ", " ", " ", list(" ", " ", " ", list(), list(list("From effective start of reads"), ": use positive values, ", " ", list("0"), ", and ", list("(+)Inf"), ". Because ", list(list("cut")), " ", " produces ranges in the form (first,last], ", list("0"), " should be ", " used to create a bin that includes the first position. To ", " account for variable-length reads in BAM files, use ", " ", list("(+)Inf"), " as the upper bound on a bin that extends to the ",
" end of all reads."), "
", " ", " ", list(), list(list("From effective end of reads"), ": use negative values ", " and ", list("-Inf"), ". ", list("-1"), " denotes the last position in a ", " read. Because ", list(list("cut")), " produces ranges in the form ", " (first,last], specify the lower bound of a bin by using one ", " less than the desired value, e.g., a bin that captures only ", " the second nucleotide from the last position: ",
" ", list("query_bins=c(-3, -2)"), ". To account for variable-length
", " reads in BAM files, use ", list("-Inf"), " as the lower bound on a ", " bin that extends to the beginning of all reads."), " ", " "), " ", " ") %% end subsection bin argument gotchas
pileup
obeysyieldSize
onBamFile
objects with some differences from howscanBam
responds toyieldSize
. Here are some points to keep in mind when usingpileup
in conjunction withyieldSize
:list(list("BamFile"), "s with a ", list("yieldSize"), " value set, once ", " opened and used with ", list("pileup"), ", ", list("should not be used"), " with ", " other functions that accept a ", list("BamFile"), " instance as ", " input. Create a new ", list("BamFile"), " instance instead of trying to ", " reuse.")
list(list("pileup"), " only returns genomic positions for which all ", " input has been processed. ", list("pileup"), " will hold in reserve ", " positions for which only partial data has been ", " processed. Positions held in reserve will be returned upon ", " subsequent calls to ", list("pileup"), " when all the input for a given ", " position has been processed.")
list("The correspondence between yieldSize and the number of rows ", " in the ", list("data.frame"), " returned from ", list("pileup"), " is not ", " 1-to-1. ", list("yieldSize"), " only limits the number of ", " ", list("alignments processed"), " from the BAM file each time the file ", " is read. Only a handful of alignments can lead to many distinct ", " records in the result.")
list("Like ", list("scanBam"), ", ", list("pileup"), " uses an empty return ", " object (a zero-row ", list("data.frame"), ") to indicate end-of-input.")
list("Sometimes reading ", list("yieldSize"), " records from the BAM file ", " does not result in any completed positions. In order to avoid ", " returning a zero-row ", list("data.frame"), " ", list("pileup"), " will ", " repeatedly process ", list("yieldSize"), " additional records until at ", " least one position can be returned to the user.")
Value
For pileup
a data.frame
with 1 row per unique
combination of differentiating column values that satisfied filter
criteria, with frequency ( count
) of unique combination. Columns
seqnames
, pos
, and count
always appear; optional
strand
, nulceotide
, and left_bin
/
query_bin
columns appear as dictated by arguments to
PileupParam
.
Column details:
list(list("seqnames"), ": factor. SAM ", list("RNAME"), " field of ", " alignment.")
list(list("pos"), ": integer(1). Genomic position of base. Derived by ", " offset from SAM ", list("POS"), " field of alignment.")
list(list("strand"), ": factor. ", list("strand"), " to which read aligns.")
list(list("nucleotide"), ": factor. ", list("nucleotide"), " of base, ", " extracted from SAM ", list("SEQ"), " field of alignment.")
list(list("left_bin"), " / ", list("query_bin"), ": factor. Bin in which base ", " appears.")
list(list("count"), ": integer(1). Frequency of combination of ", " differentiating fields, as indicated by values of other columns.")
See scanBam
for more detailed explanation of SAM fields.
If a which
argument is specified for the scanBamParam
, a
which_label
column ( factor
in the form
rname:start-end ) is automatically included. The
which_label
column is used to maintain grouping of results,
such that two queries of the same genomic region can be
differentiated.
Order of rows in data.frame
is first by order of
seqnames
column based on the BAM index file, then
non-decreasing order on columns pos
, then nucleotide
,
then strand
, then left_bin
/ query_bin
.
PileupParam
returns an instance of PileupParam class.
phred2ASCIIOffset
returns a named integer vector of the same
length or number of characters in phred
. The names are the
ASCII encoding, and the values are the offsets to be used with
min_base_quality
.
Seealso
list("Rsamtools")
list("BamFile")
list("ScanBamParam")
list("scanBam")
list("cut")
For the relationship between PHRED scores and ASCII encoding, see https://en.wikipedia.org/wiki/FASTQ_format#Encoding .
Author
Nathaniel Hayden nhayden@fredhutch.org
Examples
## The examples below apply equally to pileup queries for specific
## genomic ranges (specified by the 'which' parameter of 'ScanBamParam')
## and queries across entire files; the entire-file examples are
## included first to make them easy to find. The more detailed examples
## of pileup use queries with a 'which' argument.
library("RNAseqData.HNRNPC.bam.chr14")
fl <- RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
## Minimum base quality to be tallied
p_param <- PileupParam(min_base_quality = 10L)
list("
", "## no 'which' argument to ScanBamParam: process entire file at once
", "res <- pileup(fl, pileupParam = p_param)
", "head(res)
", "table(res$strand, res$nucleotide)
")
list("
", "## no 'which' argument to ScanBamParam with 'yieldSize' set on BamFile
", "bf <- open(BamFile(fl, yieldSize=5e4))
", "repeat {
", " res <- pileup(bf, pileupParam = p_param)
", " message(nrow(res), " rows in result data.frame")
", " if(nrow(res) == 0L)
", " break
", "}
", "close(bf)
")
## pileup for the first half of chr14 with default Pileup parameters
## 'which' argument: process only specified genomic range(s)
sbp <- ScanBamParam(which=GRanges("chr14", IRanges(1, 53674770)))
res <- pileup(fl, scanBamParam=sbp, pileupParam = p_param)
head(res)
table(res$strand, res$nucleotide)
## specify pileup parameters: include ambiguious nucleotides
## (the 'N' level in the nucleotide column of the data.frame)
p_param <- PileupParam(ignore_query_Ns=FALSE, min_base_quality = 10L)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
head(res)
table(res$strand, res$nucleotide)
## Don't distinguish strand, require a minimum frequency of 7 for a
## nucleotide at a genomic position to be included in results.
p_param <- PileupParam(distinguish_strands=TRUE,
min_base_quality = 10L,
min_nucleotide_depth=7)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
head(res)
table(res$nucleotide, res$strand)
## Any combination of the filtering criteria is possible: let's say we
## want a "coverage pileup" that only counts reads with mapping
## quality >= 13, and bases with quality >= 10 that appear at least 4
## times at each genomic position
p_param <- PileupParam(distinguish_nucleotides=FALSE,
distinguish_strands=FALSE,
min_mapq=13,
min_base_quality=10,
min_nucleotide_depth=4)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
head(res)
res <- res[, c("pos", "count")] ## drop seqnames and which_label cols
plot(count ~ pos, res, pch=".")
## ASCII offsets to help specify min_base_quality, e.g., quality of at
## least 10 on the Illumina 1.3+ scale
phred2ASCIIOffset(10, "Illumina 1.3+")
## Well-supported polymorphic positions (minor allele present in at
## least 5 alignments) with high map quality
p_param <- PileupParam(min_minor_allele_depth=5,
min_mapq=40,
min_base_quality = 10,
distinguish_strand=FALSE)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
dim(res) ## reduced to our biologically interesting positions
head(xtabs(count ~ pos + nucleotide, res))
## query_bins
list("
", "
", "## basic use of positive bins: single pivot; count bases that appear in
", "## first 10 cycles of a read separately from the rest
", "p_param <- PileupParam(query_bins=c(0, 10, Inf), min_base_quality = 10)
", "res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
", "
", "## basic use of positive bins: simple range; only include bases in
", "## cycle positions 6-10 within read
", "p_param <- PileupParam(query_bins=c(5, 10), min_base_quality = 10)
", "res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
",
"
", "## basic use of negative bins: single pivot; count bases that appear in
", "## last 3 cycle positions of a read separately from the rest.
", "p_param <- PileupParam(query_bins=c(-Inf, -4, -1), min_base_quality = 10)
", "res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
", "
", "## basic use of negative bins: drop nucleotides in last two cycle
", "## positions of each read
", "p_param <- PileupParam(query_bins=c(-Inf, -3), min_base_quality = 10)
", "res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
")
## typical use: beginning, middle, and end segments; because of the
## nature of sequencing technology, it is common for bases in the
## beginning and end segments of each read to be less reliable. pileup
## makes it easy to count each segment separately.
## Assume determined ahead of time that for the data 1-7 makes sense for
## beginning, 8-12 middle, >=13 end (actual cycle positions should be
## tailored to data in actual BAM files).
p_param <- PileupParam(query_bins=c(0, 7, 12, Inf), ## note non-symmetric
min_base_quality = 10)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
xt <- xtabs(count ~ nucleotide + query_bin, res)
print(xt)
t(t(xt) / colSums(xt)) ## cheap normalization for illustrative purposes
## 'implicit outer range': in contrast to c(0, 7, 12, Inf), suppose we
## still want to have beginning, middle, and end segements, but know
## that the first three cycles and any bases beyond the 16th cycle are
## irrelevant. Hence, the implicit outer range is (3,16]; all bases
## outside of that are dropped.
p_param <- PileupParam(query_bins=c(3, 7, 12, 16), min_base_quality = 10)
res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
xt <- xtabs(count ~ nucleotide + query_bin, res)
print(xt)
t(t(xt) / colSums(xt))
list("
", "## single-width bins: count each cycle within a read separately.
", "p_param <- PileupParam(query_bins=seq(1,20), min_base_quality = 10)
", "res <- pileup(fl, scanBamParam=sbp, pileupParam=p_param)
", "xt <- xtabs(count ~ nucleotide + query_bin, res)
", "print(xt[ , c(1:3, 18:19)]) ## fit on one screen
", "print(t(t(xt) / colSums(xt))[ , c(1:3, 18:19)])
")
quickBamFlagSummary()
Group the records of a BAM file based on their flag bits and count the number of records in each group
Description
quickBamFlagSummary
groups the records of a BAM file based on their flag
bits and counts the number of records in each group.
Usage
quickBamFlagSummary(file, ..., param=ScanBamParam(), main.groups.only=FALSE)
list(list("quickBamFlagSummary"), list("character"))(file, index=file, ..., param=ScanBamParam(),
main.groups.only=FALSE)
list(list("quickBamFlagSummary"), list("list"))(file, ..., param=ScanBamParam(), main.groups.only=FALSE)
Arguments
Argument | Description |
---|---|
file, index | For the character method, the path to the BAM file to read, and to the index file of the BAM file to read, respectively. For the list() method, file is a named list with elements qname and flag with content as from scanBam . |
... | Additional arguments, perhaps used by methods. |
param | An instance of ScanBamParam . This determines which records are considered in the counting. |
main.groups.only | If TRUE , then the counting is performed for the main groups only. |
Value
Nothing is returned. A summary of the counts is printed to the console
unless redirected by sink
.
Seealso
scanBam
,
ScanBamParam
.
BamFile
for a method for that class.
Author
H. Pages
References
http://samtools.sourceforge.net/
Examples
bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools",
mustWork=TRUE)
quickBamFlagSummary(bamfile)
readPileup()
Import samtools 'pileup' files.
Description
Import files created by evaluation of samtools' pileup -cv
command.
Usage
readPileup(file, ...)
list(list("readPileup"), list("connection"))(file, ..., variant=c("SNP", "indel", "all"))
Arguments
Argument | Description |
---|---|
file | The file name, or connection , of the pileup output file to be parsed. |
list() | Additional arguments, passed to methods. For instance, specify variant for the readPileup,character-method. |
variant | Type of variant to parse; select one. |
Value
readPileup
returns a GRanges
object.
The value returned by variant="SNP"
or variant="all"
contains:
list("
", "
", " ", list(list("space:"), list("The chromosome names (fastq ids) of the reference
", " sequence")), "
", "
", " ", list(list("position:"), list("The nucleotide position (base 1) of the variant.")), "
", "
", " ", list(list("referenceBase:"), list("The nucleotide in the reference sequence.")), "
", "
", " ", list(list("consensusBase;"), list("The consensus nucleotide, as determined by
", " samtools pileup.")), "
", "
", " ", list(list("consensusQuality:"), list(
"The phred-scaled consensus quality.")), "
", " ", " ", list(list("snpQuality:"), list("The phred-scaled SNP quality (probability of the ", " consensus being identical to the reference).")), " ", " ", " ", list(list("maxMappingQuality:"), list("The root mean square mapping quality of reads ", " overlapping the site.")), " ", " ", " ", list(list("coverage:"), list("The number of reads covering the site.")), " ", " ", " ")
The value returned by variant="indel"
contains space, position,
reference, consensus, consensusQuality, snpQuality, maxMappingQuality,
and coverage fields, and:
list(" ", " ", " ", list(list("alleleOne, alleleTwo"), list("The first (typically, in the reference ", " sequence) and second allelic variants.")), " ", " ", " ", list(list("alleleOneSupport, alleleTwoSupport"), list("The number of reads ", " supporting each allele.")), " ", " ", " ", list(list("additionalIndels"), list("The number of additional indels present.")), " ", " ", " ")
Author
Sean Davis
References
http://samtools.sourceforge.net/
Examples
fl <- system.file("extdata", "pileup.txt", package="Rsamtools",
mustWork=TRUE)
(res <- readPileup(fl))
xtabs(~referenceBase + consensusBase, mcols(res))[DNA_BASES,]
## uses a pipe, and arguments passed to read.table
## three successive piles of 100 records each
cmd <- "samtools pileup -cvf human_b36_female.fa.gz na19240_3M.bam"
p <- pipe(cmd, "r")
snp <- readPileup(p, nrow=100) # variant="SNP"
indel <- readPileup(p, nrow=100, variant="indel")
all <- readPileup(p, nrow=100, variant="all")
scanBam()
Import, count, index, filter, sort, and merge BAM' (binary alignment) files. ## Description Import binary
BAM' files into a list structure, with facilities for
selecting what fields and which records are imported, and other
operations to manipulate BAM files.
Usage
scanBam(file, index=file, ..., param=ScanBamParam(what=scanBamWhat()))
countBam(file, index=file, ..., param=ScanBamParam())
idxstatsBam(file, index=file, ...)
scanBamHeader(files, ...)
list(list("scanBamHeader"), list("character"))(files, ...)
asBam(file, destination=sub("\.sam(\.gz)?", "", file), ...)
list(list("asBam"), list("character"))(file, destination=sub("\.sam(\.gz)?", "", file),
..., overwrite=FALSE, indexDestination=TRUE)
asSam(file, destination=sub("\.bam", "", file), ...)
list(list("asSam"), list("character"))(file, destination=sub("\.bam", "", file),
..., overwrite=FALSE)
filterBam(file, destination, index=file, ...)
list(list("filterBam"), list("character"))(file, destination, index=file, ...,
filter=FilterRules(), indexDestination=TRUE,
param=ScanBamParam(what=scanBamWhat()))
sortBam(file, destination, ...)
list(list("sortBam"), list("character"))(file, destination, ..., byQname=FALSE, maxMemory=512)
indexBam(files, ...)
list(list("indexBam"), list("character"))(files, ...)
mergeBam(files, destination, ...)
list(list("mergeBam"), list("character"))(files, destination, ..., region = GRanges(),
overwrite = FALSE, header = character(), byQname = FALSE,
addRG = FALSE, compressLevel1 = FALSE, indexDestination = FALSE)
Arguments
Argument | Description |
---|---|
file | The character(1) file name of the BAM' ('SAM' for asBam` ) file to be processed. |
files | The character() file names of the BAM' file to be processed. For mergeBam, must satisfy length(files) >=` . |
index | The character(1) name of the index file of the 'BAM' file being processed; this is given without the '.bai' extension. |
destination | The character(1) file name of the location where the sorted, filtered, or merged output file will be created. For asBam asSam , and sortBam this is without the .bam file suffix. |
region | A GRanges() instance with <= 1 elements, specifying the region of the BAM files to merged. |
list() | Additional arguments, passed to methods. |
overwrite | A logical(1) indicating whether the destination can be over-written if it already exists. |
filter | A FilterRules instance allowing users to filter BAM files based on arbitrary criteria, as described below. |
indexDestination | A logical(1) indicating whether the created destination file should also be indexed. |
byQname | A logical(1) indicating whether the sorted destination file should be sorted by Query-name (TRUE) or by mapping position (FALSE). |
header | A character(1) file path for the header information to be used in the merged BAM file. |
addRG | A logical(1) indicating whether the file name should be used as RG (read group) tag in the merged BAM file. |
compressLevel1 | A logical(1) indicating whether the merged BAM file should be compressed to zip level 1. |
maxMemory | A numerical(1) indicating the maximal amount of memory (in MB) that the function is allowed to use. |
param | An instance of ScanBamParam . This influences what fields and which records are imported. |
Details
The scanBam
function parses binary BAM files; text SAM files
can be parsed using R's scan
function, especially with
arguments what
to control the fields that are parsed.
countBam
returns a count of records consistent with
param
.
idxstatsBam
visit the index in index(file)
, and quickly
returns the number of mapped and unmapped reads on each seqname.
scanBamHeader
visits the header information in a BAM file,
returning for each file a list containing elements targets
and
text
, as described below. The SAM / BAM specification does not
require that the content of the header be consistent with the content
of the file, e.g., more targets may be present that are represented by
reads in the file. An optional character vector argument containing
one or two elements of what=c("targets", "text")
can be used to
specify which elements of the header are returned.
asBam
converts 'SAM' files to 'BAM' files, equivalent to
samtools view -Sb file > destination
. The 'BAM' file is sorted
and an index created on the destination (with extension '.bai') when
indexDestination=TRUE
.
asSam
converts 'BAM' files to 'SAM' files, equivalent to
samtools view file > destination
.
filterBam
parses records in file
. Records satisfying the
bamWhich
bamFlag
and bamSimpleCigar
criteria of
param
are accumulated to a default of yieldSize =
records (change this by specifying yieldSize
when
creating a BamFile
instance; see
BamFile ). These records are then parsed to
a DataFrame
and made available for further filtering by
user-supplied FilterRules
. Functions in the FilterRules
instance should expect a single DataFrame
argument representing
all information specified by param
. Each function must return a
logical
vector equal to the number of rows of the
DataFrame
. Return values are used to include (when TRUE
)
corresponding records in the filtered BAM file. The BAM file is
created at destination
. An index file is created on the
destination when indexDestination=TRUE
. It is more space- and
time-efficient to filter using bamWhich
, bamFlag
, and
bamSimpleCigar
, if appropriate, than to supply
FilterRules
. filter
may be a list of FilterRules
instances, in which case destination
must be a character vector
of equal length. The original file
is then separately filtered
into destination[[i]]
, using filter[[i]]
as the filter
criterion.
sortBam
sorts the BAM file given as its first argument,
analogous to the samtools sort function.
indexBam
creates an index for each BAM file specified,
analogous to the samtools index function.
mergeBam
merges 2 or more sorted BAM files. As with samtools,
the RG (read group) dictionary in the header of the BAM files is not
reconstructed.
Details of the ScanBamParam
class are provide on its help page;
several salient points are reiterated here. ScanBamParam
can
contain a field what
, specifying the components of the BAM
records to be returned. Valid values of what
are available with
scanBamWhat
. ScanBamParam
can contain an argument
which
that specifies a subset of reads to return. This requires
that the BAM file be indexed, and that the file be named following
samtools convention as <bam_filename>.bai
. ScanBamParam
can contain an argument tag
to specify which tags will be
extracted.
Value
The scanBam,character-method
returns a list of lists. The outer
list groups results from each IntegerRanges
list of
bamWhich(param)
; the outer list is of length one when
bamWhich(param)
has length 0. Each inner list contains elements
named after scanBamWhat()
; elements omitted from
bamWhat(param)
are removed. The content of non-null elements
are as follows, taken from the description in the samtools API
documentation:
qname: This is the QNAME field in SAM Spec v1.4. The query name, i.e., identifier, associated with the read.
flag: This is the FLAG field in SAM Spec v1.4. A numeric value summarizing details of the read. See ScanBamParam and the
flag
argument, andscanBamFlag()
.rname: This is the RNAME field in SAM Spec v1.4. The name of the reference to which the read is aligned.
strand: The strand to which the read is aligned.
pos: This is the POS field in SAM Spec v1.4. The genomic coordinate at the start of the alignment. Coordinates are
left-most', i.e., at the 3' end of a read on the '-' strand, and 1-based. The position excludes clipped nucleotides, even though soft-clipped nucleotides are included in
seq. * qwidth: The width of the query, as calculated from the
cigarencoding; normally equal to the width of the query returned in
seq. * mapq: This is the MAPQ field in SAM Spec v1.4. The MAPping Quality. * cigar: This is the CIGAR field in SAM Spec v1.4. The CIGAR string. * mrnm: This is the RNEXT field in SAM Spec v1.4. The reference to which the mate (of a paired end or mate pair read) aligns. * mpos: This is the PNEXT field in SAM Spec v1.4. The position to which the mate aligns. * isize: This is the TLEN field in SAM Spec v1.4. Inferred insert size for paired end alignments. * seq: This is the SEQ field in SAM Spec v1.4. The query sequence, in the 5' to 3' orientation. If aligned to the minus strand, it is the reverse complement of the original sequence. * qual: This is the QUAL field in SAM Spec v1.4. Phred-encoded, phred-scaled base quality score, oriented as
seq. * groupid: This is an integer vector of unique group ids returned when
asMates=TRUEin a BamFile object.
groupidvalues are used to create the partitioning for a
GAlignmentsListobject. * mate_status: Returned (always) when
asMates=TRUEin a BamFile object. This is a factor indicating status (
mated,
ambiguous,
unmated) of each record.
idxstatsBamreturns a
data.framewith columns
seqnames,
seqlength,
mapped(number of mapped reads on
seqnames) and
unmapped(number of unmapped reads).
scanBamHeaderreturns a list, with one element for each file named in
files. The list contains two element. The
targetselement contains target (reference) sequence lengths. The
textelement is itself a list with each element a list corresponding to tags (e.g., @SQ ) found in the header, and the associated tag values.
asBam,
asSamreturn the file name of the destination file.
sortBamreturns the file name of the sorted file.
indexBamreturns the file name of the index file created.
filterBamreturns the file name of the destination file created. ## Seealso ScanBamParam , [
scanBamWhat](#scanbamwhat) , [
scanBamFlag](#scanbamflag) ## Author Martin Morgan <mtmorgan@fhcrc.org>. Thomas Unterhiner <thomas.unterthiner@students.jku.at> (
sortBam` ). ## References http://samtools.sourceforge.net/ ## Examplesr fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE) ## ## scanBam ## res0 <- scanBam(fl)[[1]] # always list-of-lists names(res0) length(res0[["qname"]]) lapply(res0, head, 3) table(width(res0[["seq"]])) # query widths table(res0[["qwidth"]], useNA="always") # query widths derived from cigar table(res0[["cigar"]], useNA="always") table(res0[["strand"]], useNA="always") table(res0[["flag"]], useNA="always") which <- IRangesList(seq1=IRanges(1000, 2000), seq2=IRanges(c(100, 1000), c(1000, 2000))) p1 <- ScanBamParam(which=which, what=scanBamWhat()) res1 <- scanBam(fl, param=p1) names(res1) names(res1[[2]]) p2 <- ScanBamParam(what=c("rname", "strand", "pos", "qwidth")) res2 <- scanBam(fl, param=p2) p3 <- ScanBamParam( what="flag", # information to query from BAM file flag=scanBamFlag(isMinusStrand=FALSE)) length(scanBam(fl, param=p3)[[1]]$flag) ## ## idxstatsBam ## idxstatsBam(fl) ## ## filterBam ## param <- ScanBamParam( flag=scanBamFlag(isUnmappedQuery=FALSE), what="seq") dest <- filterBam(fl, tempfile(), param=param) countBam(dest) ## 3271 records ## filter to a single file filter <- FilterRules(list(MinWidth = function(x) width(x$seq) > 35)) dest <- filterBam(fl, tempfile(), param=param, filter=filter) countBam(dest) ## 398 records res3 <- scanBam(dest, param=ScanBamParam(what="seq"))[[1]] table(width(res3$seq)) ## filter 1 file to 2 destinations filters <- list( FilterRules(list(long=function(x) width(x$seq) > 35)), FilterRules(list(short=function(x) width(x$seq) <= 35)) ) destinations <- replicate(2, tempfile()) dest <- filterBam(fl, destinations, param=param, filter=filters) lapply(dest, countBam) ## ## sortBam ## sorted <- sortBam(fl, tempfile()) ## ## scanBamParam re-orders 'which'; recover original order ## gwhich <- as(which, "GRanges")[c(2, 1, 3)] # example data cnt <- countBam(fl, param=ScanBamParam(which=gwhich)) reorderIdx <- unlist(split(seq_along(gwhich), seqnames(gwhich))) cnt cnt[reorderIdx,]
scanBcf()
Operations on BCF' files. ## Description Import, coerce, or index variant call files in text or binary format. ## Usage ```r scanBcfHeader(file, ...) list(list("scanBcfHeader"), list("character"))(file, ...) scanBcf(file, ...) list(list("scanBcf"), list("character"))(file, index = file, ..., param=ScanBcfParam()) asBcf(file, dictionary, destination, ..., overwrite=FALSE, indexDestination=TRUE) list(list("asBcf"), list("character"))(file, dictionary, destination, ..., overwrite=FALSE, indexDestination=TRUE) indexBcf(file, ...) list(list("indexBcf"), list("character"))(file, ...) ``` ## Arguments |Argument |Description| |------------- |----------------| |
file| For
scanBcfand
scanBcfHeader, the character() file name of the BCF file to be processed, or an instance of class [
BcfFile](#bcffile) .| |
index| The character() file name(s) of the
BCF' index to be processed.|
|dictionary
| a character vector of the unique CHROM names in the VCF file.|
|destination
| The character(1) file name of the location where the BCF output file will be created. For asBcf
this is without the .bcf file suffix.|
|param
| A instance of ScanBcfParam influencing which records are parsed and the INFO and GENO information returned.|
|...
| Additional arguments, e.g., for scanBcfHeader,character-method
, mode
of BcfFile
.|
|overwrite
| A logical(1) indicating whether the destination can be over-written if it already exists.|
|indexDestination
| A logical(1) indicating whether the created destination file should also be indexed.|
Details
bcf*
functions are restricted to the GENO fields supported by
bcftools (see documentation at the url below). The argument
param
allows portions of the file to be input, but requires
that the file be BCF or bgzip'd and indexed as a
TabixFile . For similar functions operating on VCF
files see ? scanVcf
in the VariantAnnotation
package.
Value
scanBcfHeader
returns a list, with one element for each file
named in file
. Each element of the list is itself a list containing
three elements. The Reference
element is a character() vector with
names of reference sequences. The Sample
element is a character()
vector of names of samples. The Header
element is a DataFrameList
with one DataFrame per unique key value in the header
(preceded by ## ).
NOTE: In Rsamtools >=1.33.6, the Header
DataFrameList no longer
contains a DataFrame named "META". The META DataFrame contained all "simple"
key-value headers lines from a bcf / vcf file. These "simple" header
lines are now parsed into individual DataFrames named for the unique
key.
scanBcf
returns a list, with one element per file. Each list has 9
elements, corresponding to the columns of the VCF specification: CHROM
,
POS
, ID
, REF
, ALT
QUAL
, FILTER
,
INFO
, FORMAT
, GENO
.
The GENO
element is itself a list, with elements corresponding
to fields supported by bcftools (see documentation at the url below).
asBcf
creates a binary BCF file from a text VCF file.
indexBcf
creates an index into the BCF file.
Seealso
Author
Martin Morgan mtmorgan@fhcrc.org.
References
http://vcftools.sourceforge.net/specs.html outlines the VCF specification.
http://samtools.sourceforge.net/mpileup.shtml contains
information on the portion of the specification implemented by
bcftools
.
http://samtools.sourceforge.net/ provides information on
samtools
.
Examples
fl <- system.file("extdata", "ex1.bcf.gz", package="Rsamtools",
mustWork=TRUE)
scanBcfHeader(fl)
bcf <- scanBcf(fl)
## value: list-of-lists
str(bcf[1:8])
names(bcf[["GENO"]])
str(head(bcf[["GENO"]][["PL"]]))
example(BcfFile)
scanFa()
Operations on indexed 'fasta' files.
Description
Scan indexed fasta (or compressed fasta) files and their indicies.
Usage
indexFa(file, ...)
list(list("indexFa"), list("character"))(file, ...)
scanFaIndex(file, ...)
list(list("scanFaIndex"), list("character"))(file, ...)
countFa(file, ...)
list(list("countFa"), list("character"))(file, ...)
scanFa(file, param, ...,
as=c("DNAStringSet", "RNAStringSet", "AAStringSet"))
list(list("scanFa"), list("character,GRanges"))(file, param, ...,
as=c("DNAStringSet", "RNAStringSet", "AAStringSet"))
list(list("scanFa"), list("character,IntegerRangesList"))(file, param, ...,
as=c("DNAStringSet", "RNAStringSet", "AAStringSet"))
list(list("scanFa"), list("character,missing"))(file, param, ...,
as=c("DNAStringSet", "RNAStringSet", "AAStringSet"))
Arguments
Argument | Description |
---|---|
file | A character(1) vector containing the fasta file path. |
param | An optional GRanges or IntegerRangesList instance to select reads (and sub-sequences) for input. |
as | A character(1) vector indicating the type of object to return; default DNAStringSet . |
... | Additional arguments, passed to readDNAStringSet / readRNAStringSet / readAAStringSet when param is missing . |
Value
indexFa
visits the path in file
and create an index file
at the same location but with extension .fai ).
scanFaIndex
reads the sequence names and and widths of recorded
in an indexed fasta file, returning the information as a
GRanges object.
countFa
returns the number of records in the fasta file.
scanFa
return the sequences indicated by param
as a
DNAStringSet , RNAStringSet ,
AAStringSet instance. seqnames(param)
selects the sequences to return; start(param)
and
end{param}
define the (1-based) region of the sequence to
return. Values of end(param)
greater than the width of the
sequence are set to the width of the sequence. When param
is
missing, all records are selected. When param
is
GRanges()
, no records are selected.
Author
Martin Morgan mtmorgan@fhcrc.org.
References
http://samtools.sourceforge.net/ provides information on
samtools
.
Examples
fa <- system.file("extdata", "ce2dict1.fa", package="Rsamtools",
mustWork=TRUE)
countFa(fa)
(idx <- scanFaIndex(fa))
(dna <- scanFa(fa, idx[1:2]))
ranges(idx) <- narrow(ranges(idx), -10) # last 10 nucleotides
(dna <- scanFa(fa, idx[1:2]))
scanTabix()
Operations on tabix' (indexed, tab-delimited) files. ## Description Scan compressed, sorted, tabix-indexed, tab-delimited files. ## Usage ```r scanTabix(file, ..., param) list(list("scanTabix"), list("character,IntegerRangesList"))(file, ..., param) list(list("scanTabix"), list("character,GRanges"))(file, ..., param) ``` ## Arguments |Argument |Description| |------------- |----------------| |
file| The character() file name(s) of the tabix file be processed, or more flexibly an instance of class [
TabixFile](#tabixfile) .| |
param| A instance of
GRangesor
IntegerRangesListproviding the sequence names and regions to be parsed.| |
...| Additional arguments, currently ignored.| ## Value
scanTabix` returns a list, with one element per region. Each element
of the list is a character vector representing records in the region.
## Author
Martin Morgan mtmorgan@fhcrc.org.
## References
http://samtools.sourceforge.net/tabix.shtml
## Examples
r example(TabixFile)
seqnamesTabix()
Retrieve sequence names defined in a tabix file.
Description
This function queries a tabix file, returning the names of the sequences used as a key when creating the file.
Usage
seqnamesTabix(file, ...)
list(list("seqnamesTabix"), list("character"))(file, ...)
Arguments
Argument | Description |
---|---|
file | A character(1) file path or TabixFile instance pointing to a tabix file. |
... | Additional arguments, currently ignored. |
Value
A character()
vector of sequence names present in the file.
Author
Martin Morgan mtmorgan@fhcrc.org.
Examples
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
mustWork=TRUE)
seqnamesTabix(fl)
testPairedEndBam()
Quickly test if a BAM file has paired end reads
Description
Iterate through a BAM file until a paired-end read is encountered or the end of file is reached; report the occurrence of paired-end reads to the user.
Usage
testPairedEndBam(file, index=file, ...)
Arguments
Argument | Description |
---|---|
file | character(1) BAM file name, or a BamFile instance. Open BamFile s are closed; their yield size is respected when iterating through the file. |
index | (optional) character(1) name of the index file of the 'BAM' file being processed; this is given without the '.bai' extension. |
list() | Additional arguments, currently unused. |
Value
A logical vector of length 1 containing TRUE is returned if BAM file contained paired end reads, FALSE otherwise.
Author
Martin Morgan mailto:mtmorgan@fhcrc.org , Sonali Arora mailto:sarora@fhcrc.org
Examples
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
testPairedEndBam(fl)
zip()
File compression for tabix (bgzip) and fasta (razip) files.
IMPORTANT NOTE: Starting with Rsamtools 1.99.0 (Bioconductor 3.9),
razip()
is defunct. Please use bgzip()
instead.
Description
bgzip
compresses tabix (e.g. SAM or VCF) or FASTA files
to a format that allows indexing for later fast random-access.
Usage
bgzip(file, dest=sprintf("%s.bgz", sub("\.gz$", "", file)),
overwrite = FALSE)
## Defunct!
razip(file, dest=sprintf("%s.rz", sub("\.gz$", "", file)),
overwrite = FALSE)
Arguments
Argument | Description |
---|---|
file | A character(1) path to an existing uncompressed or gz-compressed file. This file will be compressed. |
dest | A character(1) path to a file. This will be the compressed file. If dest exists, then it is only over-written when overwrite=TRUE . |
overwrite | A logical(1) indicating whether dest should be over-written, if it already exists. |
Value
The full path to dest
.
Seealso
Author
Martin Morgan mtmorgan@fhcrc.org
References
http://samtools.sourceforge.net/
Examples
from <- system.file("extdata", "ex1.sam", package="Rsamtools",
mustWork=TRUE)
to <- tempfile()
zipped <- bgzip(from, to)