bioconductor v3.9.0 ShortRead
This package implements sampling, iteration, and input of
Link to this section Summary
Functions
(Legacy) AlignedDataFrame constructor
(Legacy) "AlignedDataFrame" representing alignment annotations as a data frame
(Legacy) Construct objects of class "AlignedRead"
(Legacy) "AlignedRead" class for aligned short reads
(Legacy) Quality assessment summaries from Bowtie files
(Legacy) "ExperimentPath" class representing a file hierarchy of data files
Quality assessment of fastq files and ShortReadQ objects
(Legacy) "Intensity", "IntensityInfo", and "IntensityMeasure" base classes for short read image intensities
(Legacy) Quality assessment summaries from MAQ map files
(Updated) classes for representing quality assessment results
Construct objects indicating read or alignment quality
Quality scores for short reads and their alignments
(Legacy) "RochePath" class representing a Roche (454) experiment location
(Legacy) Roche (454) experiment-wide data container
(Legacy) Construct objects of class "RtaIntensity"
(Legacy) Class "RtaIntensity"
"SRFilterResult" for SRFilter output and statistics
"SRFilter" for representing functions operating on ShortRead objects
(Legacy) A base class for Roche experiment-wide data
".SRUtil" and related classes
Sampling and streaming records from fastq files
"ShortReadQ" class for short reads and their quality scores
"ShortRead" class for short reads
Deprecated functions from the ShortRead package
FASTQ input and manipulation.
Class "SnapshotFunction"
Class "Snapshot"
(Legacy) Quality assessment summaries from Solexa export and realign files
(Legacy) Construct objects of class "SolexaIntensity" and "SolexaIntensityInfo"
Classes "SolexaIntensity" and "SolexaIntensityInfo"
(Legacy) "SolexaPath" class representing a standard output file hierarchy
(Legacy) "SolexaSet" coordinating Solexa output locations with sample annotations
Class "SpTrellis"
(Legacy) Accessors for ShortRead classes
Summarize nucleotide, amino acid, or quality scores by cycle
Efficiently calculate the sum of quality scores across bases
Remove sequences with ambiguous nucleotides from short read classes
Count lines in all (text) files in a directory whose file name matches a pattern
Deprecated and defunct functions
Virtual class for representing quality assessment results
Summarize low-complexity sequences
Filter fastq from one file to another
Utilities for common, simple operations
Perform quality assessment on short reads
(Updated) quality assessment reports on short reads
(Legacy) Read aligned reads and their quality scores into R representations
(Legacy) Read short reads and their quality scores into R representations
(Legacy) Get a list of the sequences in a Maq .bfa file
Read and write FASTA files to or from ShortRead objects
Read and write FASTQ-formatted files
(Legacy) Read Illumina image intensity files
(Legacy) Read Solexa prb files as fastq-style quality scores
(Legacy) Read Solexa qseq files as fastq-style quality scores
Read one or more columns into XStringSet (e.g., DNAStringSet) objects
Renew (update) a ShortRead object with new values
Summarize quality assessment results into a report
Tools to visualize genomic data
Functions for user-created and built-in ShortRead filters
Edit distances between reads and a small number of short references
Order, sort, and find duplicates in XStringSet objects
Summarize XStringSet read frequencies
Trim ends of reads based on nucleotides or qualities
Link to this section Functions
AlignedDataFrame()
(Legacy) AlignedDataFrame constructor
Description
Construct an AlignedDataFrame
from a data frame and its
metadata
Usage
AlignedDataFrame(data, metadata, nrow = nrow(data))
Arguments
Argument | Description |
---|---|
data | A data frame containing alignment information. |
metadata | A data frame describing the columns of data , and with number of rows of metadata corresponding to number of columns of data . . The data frame must contain a column labelDescription providing a verbose description of each column of data . |
nrow | An optional argument, to be used when data is not provided, to construct an AlignedDataFrame with the specified number of rows. |
Value
An object of AlignedDataFrame .
Author
Martin Morgan mtmorgan@fhcrc.org
AlignedDataFrame_class()
(Legacy) "AlignedDataFrame" representing alignment annotations as a data frame
Description
This class extends
AnnotatedDataFrame
. It
is a data frame and associated metadata (describing the columns of the
data frame). The main purpose of this class is to contain alignment
data in addition to the central information of
AlignedRead .
Seealso
Author
Martin Morgan mtmorgan@fhcrc.org
AlignedRead()
(Legacy) Construct objects of class "AlignedRead"
Description
This function constructs objects of
AlignedRead . It will often be more convenient to
create AlignedRead
objects using parsers such as
readAligned
.
Usage
AlignedRead(sread, id, quality, chromosome, position, strand,
alignQuality,
alignData = AlignedDataFrame(nrow = length(sread)))
Arguments
Argument | Description |
---|---|
sread | An object of class DNAStringSet , containing the DNA sequences of the short reads. |
id | An object of class BStringSet , containing the identifiers of the short reads. This object is the same length as sread . |
quality | An object of class BStringSet , containing the ASCII-encoded quality scores of the short reads. This object is the same length as sread . |
chromosome | A factor describing the particular sequence within a set of target sequences (e.g. chromosomes in a genome assembly) to which each short read aligns. |
position | A integer vector describing the (base pair) position at which each short read begins its alignment. |
strand | A factor describing the strand to which the short read aligns. |
alignQuality | A numeric vector describing the alignment quality. |
alignData | An AlignedDataFrame with number of rows equal to the length of sread , containing additional information about alignments. |
Value
An object of class AlignedRead .
Seealso
AlignedRead .
Author
Martin Morgan mtmorgan@fhcrc.org
AlignedRead_class()
(Legacy) "AlignedRead" class for aligned short reads
Description
This class represents and manipulates reads and their genomic alignments. Alignment information includes genomic position, strand, quality, and other data.
Seealso
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showMethods(class="AlignedRead", where=getNamespace("ShortRead"))
dirPath <- system.file('extdata', 'maq', package='ShortRead')
(aln <- readAligned(dirPath, 'out.aln.1.txt', type="MAQMapview"))
coverage(aln)[[1]]
cvg <- coverage(aln, shift=c(ChrA=10L))
## remove 0 coverage on left ends
ltrim0 <- function(x) {
i <- !cumprod(runValue(x) == 0)
Rle(runValue(x)[i], runLength(x)[i])
}
endoapply(cvg, ltrim0)
## demonstration of show() and detail() methods
show(aln)
detail(aln)
BowtieQA_class()
(Legacy) Quality assessment summaries from Bowtie files
Description
This class contains a list-like structure with summary descriptions derived from visiting one or more Bowtie files.
Seealso
qa
.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showClass("BowtieQA")
ExperimentPath_class()
(Legacy) "ExperimentPath" class representing a file hierarchy of data files
Description
Short read technologies often produce a hierarchy of output files. The content of the hierarchy varies. This class represents the root of the file hierarchy. Specific classes (e.g., SolexaPath ) represent different technologies.
Author
Michael Lawrence
Examples
showClass("ExperimentPath")
FastqQA_class()
Quality assessment of fastq files and ShortReadQ objects
Description
These classes contains a list-like structure with summary descriptions derived from visiting one or more fastq files, or from a ShortReadQ object.
Seealso
qa
.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showClass("FastqQA")
Intensity_class()
(Legacy) "Intensity", "IntensityInfo", and "IntensityMeasure" base classes for short read image intensities
Description
The Intensity
, IntensityMeasure
, and IntensityInfo
classes represent and manipulate image intensity measures. Instances
from the class may also contain information about measurement errors,
and additional information about the reads from which the intensities
are derived.
Intensity
, and IntensityMeasure
, are virtual classes,
and cannot be created directly. Classes derived from
IntensityMeasure
(e.g., ArrayIntensity
) and
Intensity
(e.g., SolexaIntensity ) are used
to represent specific technologies.
Seealso
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showMethods(class="Intensity", where=getNamespace("ShortRead"))
example(readIntensities)
MAQMapQA_class()
(Legacy) Quality assessment summaries from MAQ map files
Description
This class contains a list-like structure with summary descriptions derived from visiting one or more MAQMap files.
Seealso
qa
.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showClass("MAQMapQA")
QA_class()
(Updated) classes for representing quality assessment results
Description
Classes derived from .QA-class
represent results of quality
assurance analyses.
Seealso
Specific classes derived from .QA2
Author
Martin Morgan mtmmorgan@fhcrc.org
Examples
getClass(".QA2", where=getNamespace("ShortRead"))
QualityScore()
Construct objects indicating read or alignment quality
Description
Use these functions to construct quality indicators for reads or alignments. See QualityScore for details of object content and methods available for manipulating them.
Usage
NumericQuality(quality = numeric(0))
IntegerQuality(quality = integer(0))
MatrixQuality(quality = new("matrix"))
FastqQuality(quality, ...)
SFastqQuality(quality, ...)
Arguments
Argument | Description |
---|---|
quality | An object used to initialize the data structure. Appropriate objects are indicated in the constructors above for Numeric, Integer, and Matrix qualities. For FastqQuality and SFastqQuality , methods are defined for BStringSet , character , and missing . |
... | Additional arguments, currently unused. |
Value
Constructors return objects of the corresponding class derived from QualityScore .
Seealso
QualityScore , readFastq
,
readAligned
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
nq <- NumericQuality(rnorm(20))
nq
quality(nq)
quality(nq[10:1])
QualityScore_class()
Quality scores for short reads and their alignments
Description
This class hierarchy represents quality scores for short
reads. QualityScore
is a virtual base class, with derived
classes offering different ways of representing qualities. Methods
defined on QualityScore
are implemented in all derived
classes.
Seealso
NumericQuality
and other constructors.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
names(slot(getClass("QualityScore"), "subclasses"))
encoding(FastqQuality())
encoding(SFastqQuality())
RochePath_class()
(Legacy) "RochePath" class representing a Roche (454) experiment location
Description
This class represents the directory location where Roche (454) result files (fasta sequences and qualities) can be found.
Seealso
ExperimentPath .
Author
Michael Lawrence mflawrence@fhcrc.org
Examples
showClass("RochePath")
RocheSet_class()
(Legacy) Roche (454) experiment-wide data container
Description
This class is meant to coordinate all data in a Roche (454) experiment. See SRSet for additional details.
Seealso
SRSet
Author
Michael Lawrence mflawrence@fhcrc.org
Examples
showClass("RocheSet")
RtaIntensity()
(Legacy) Construct objects of class "RtaIntensity"
Description
RtaIntensity objects contain Illumina image
intensity measures created by the RTA pipeline. It will often be more
convenient to create this object using readIntensities
.
Usage
RtaIntensity(intensity=array(0, c(0, 0, 0)),
measurementError=array(0, c(0, 0, 0)),
readInfo=SolexaIntensityInfo(
lane=integer()[seq_len(nrow(intensity))]),
...)
Arguments
Argument | Description |
---|---|
intensity | A matrix of image intensity values. Successive columns correspond to nucleotides A, C, G, T; four successive columns correspond to each cycle. Typically, derived from "_int.txt" files. |
measurementError | As intensity , but measuring standard error. Usually derived from "_nse.txt" files. |
readInfo | An object of class AnnotatedDataFrame , containing information described by RtaIntensityInfo . |
... | Additional arguments, not currently used. |
Value
An object of class RtaIntensity .
Seealso
RtaIntensity ,
readIntensities
.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
rta <- RtaIntensity(array(runif(60), c(5,4,3)))
intensity(rta)
## subsetting, access, and coercion
as(intensity(rta)[1:2,,], "array")
RtaIntensity_class()
(Legacy) Class "RtaIntensity"
Description
Subclass of Intensity for representing image intensity data from the Illumina RTA pipeline.
Seealso
SolexaIntensity , readIntensities
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showClass("RtaIntensity")
showMethods(class="RtaIntensity", where=getNamespace("ShortRead"))
SRFilterResult_class()
"SRFilterResult" for SRFilter output and statistics
Description
Objects of this class are logical vectors indicating records passing the applied filter, with an associated data frame summarizing the name, input number of records, records passing filter, and logical operation used for all filters in which the result participated.
Usage
SRFilterResult(x = logical(), name = NA_character_,
input = length(x), passing = sum(x), op = NA_character_)
list(list("Logic"), list("SRFilterResult,SRFilterResult"))(e1, e2)
list(list("name"), list("SRFilterResult"))(x, ...)
stats(x, ...)
list(list("show"), list("SRFilterResult"))(object)
Arguments
Argument | Description |
---|---|
x, object, e1, e2 | For SRFilterResult , logical() indicating records that passed filter or, for others, an instance of SRFilterResult class. |
name | character() indicating the name by which the filter is to be referred. Internally, name , input , passing , and op may all be vectors representing columns of a data.frame summarizing the application of successive filters. |
input | integer() indicating the length of the original input. |
passing | integer() indicating the number of records passing the filter. |
op | character() indicating the logical operation, if any, associated with this filter. |
... | Additional arguments, unused in methods documented on this page. |
Seealso
Author
Martin Morgan mailto:mtmorgan@fhcrc.org
Examples
fa <- srFilter(function(x) x %% 2 == 0, "Even")
fb <- srFilter(function(x) x %% 2 == 1, "Odd")
x <- 1:10
|fa(x) | fb(x)|
fa(x) & fb(x)
!(fa(x) & fb(x))
SRFilter_class()
"SRFilter" for representing functions operating on ShortRead objects
Description
Objects of this class are functions that, when provided an appropriate object from the ShortRead package, return logical vectors indicating which parts of the object satisfy the filter criterion.
A number of filters are built-in (described below); users are free to
create their own filters, using the srFilter
function.
Seealso
srFilter
for predefined and user-defined filters.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
## see ?srFilter
SRSet_class()
(Legacy) A base class for Roche experiment-wide data
Description
This class coordinates phenotype (sample) and sequence data, primarily as used on the Roche platform.
Conceptually, this class has reads from a single experiment
represented as a long vector, ordered by sample. The readCount
slot indicates the number of reads in each sample, so that the sum of
readCount
is the total number of reads in the experiment. The
readIndex
field is a light-weight indicator of which reads from
all those available that are currently referenced by the SRSet
.
Author
Michael Lawrence mflawrence@fhcrc.org
Examples
showClass("SRSet")
SRUtil_class()
".SRUtil" and related classes
Description
These classes provide important utility functions in the ShortRead package, but may occasionally be seen by the user and are documented here for that reason.
Author
Martin Morgan
Examples
getClass(".SRUtil", where=getNamespace("ShortRead"))
ShortRead:::.SRError_types
ShortRead:::.SRWarn_types
detail(SRList(1:5, letters[1:5]))
tryCatch(SRVector(1:5, letters[1:5]),
SRVectorClassDisagreement=function(err) {
cat("caught:", conditionMessage(err), "
")
})
Sampler_class()
Sampling and streaming records from fastq files
Description
FastqFile
represents a path and connection to a fastq
file. FastqFileList
is a list of such connections.
FastqSampler
draws a subsample from a fastq file. yield
is the method used to extract the sample from the FastqSampler
instance; a short illustration is in the example
below. FastqSamplerList
is a list of FastqSampler
elements.
FastqStreamer
draws successive subsets from a fastq file, a
short illustration is in the example below. FastqStreamerList
is a list of FastqStreamer
elements.
Usage
## FastqFile and FastqFileList
FastqFile(con, ...)
FastqFileList(..., class="FastqFile")
list(list("open"), list("ShortReadFile"))(con, ...)
list(list("close"), list("ShortReadFile"))(con, ...)
list(list("readFastq"), list("FastqFile"))(dirPath, pattern=character(), ...)
## FastqSampler and FastqStreamer
FastqSampler(con, n=1e6, readerBlockSize=1e8, verbose=FALSE,
ordered = FALSE)
FastqSamplerList(..., n=1e6, readerBlockSize=1e8, verbose=FALSE,
ordered = FALSE)
FastqStreamer(con, n, readerBlockSize=1e8, verbose=FALSE)
FastqStreamerList(..., n, readerBlockSize=1e8, verbose=FALSE)
yield(x, ...)
Arguments
Argument | Description |
---|---|
con, dirPath | A character string naming a connection, or (for con ) an R connection (e.g., file , gzfile ). |
n | For FastqSampler , the size of the sample (number of records) to be drawn. For FastqStreamer a numeric(1) (set to 1e6 when n is missing) providing the number of successive records to be returned on each yield, or an IRanges -class delimiting the (1-based) indicies of records returned by each yield; entries in n must have non-zero width and must not overlap. |
readerBlockSize | The number of bytes or characters to be read at one time; smaller readerBlockSize reduces memory requirements but is less efficient. |
verbose | Display progress. |
ordered | logical(1) indicating whether sampled reads should be returned in the same order as they were encountered in the file. |
x | An instance from the FastqSampler or FastqStreamer class. |
... | Additional arguments. For FastqFileList , FastqSamplerList , or FastqStreamerList , this can either be a single character vector of paths to fastq files, or several instances of the corresponding FastqFile , FastqSampler , or FastqStreamer objects. |
pattern | Ignored. |
class | For developer use, to specify the underlying class contained in the FastqFileList . |
Seealso
readFastq
, writeFastq
, yield
.
Note
FastqSampler
and FastqStreamer
use OpenMP threads (when
available) during creation of the return value. This may sometimes
create problems when a process is already running on multiple threads,
e.g., with an error message like list("
", " libgomp: Thread creation failed: Resource temporarily unavailable
", " ") A solution is to precede problematic code with the following code
snippet, to disable threading list("
", " nthreads <- .Call(ShortRead:::.set_omp_threads, 1L)
", " on.exit(.Call(ShortRead:::.set_omp_threads, nthreads))
", " ")
Examples
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
fl <- file.path(analysisPath(sp), "s_1_sequence.txt")
f <- FastqFile(fl)
rfq <- readFastq(f)
close(f)
f <- FastqSampler(fl, 50)
yield(f) # sample of size n=50
yield(f) # independent sample of size 50
close(f)
## Return sample as ordered in original file
f <- FastqSampler(fl, 50, ordered=TRUE)
yield(f)
close(f)
f <- FastqStreamer(fl, 50)
yield(f) # records 1 to 50
yield(f) # records 51 to 100
close(f)
## iterating over an entire file
f <- FastqStreamer(fl, 50)
while (length(fq <- yield(f))) {
## do work here
print(length(fq))
}
close(f)
## iterating over IRanges
rng <- IRanges(c(50, 100, 200), width=10:8)
f <- FastqStreamer(fl, rng)
while (length(fq <- yield(f))) {
print(length(fq))
}
close(f)
## Internal fields, methods, and help; for developers
ShortRead:::.FastqSampler_g$methods()
ShortRead:::.FastqSampler_g$fields()
ShortRead:::.FastqSampler_g$help("yield")
ShortReadQ_class()
"ShortReadQ" class for short reads and their quality scores
Description
This class provides a way to store and manipulate, in a coordinated fashion, the reads, identifiers, and quality scores of uniform-length short reads.
Seealso
readFastq
for creation of objects of this class from
fastq-format files.
Author
Martin Morgan
Examples
showClass("ShortReadQ")
showMethods(class="ShortReadQ", where=getNamespace("ShortRead"),
inherit=FALSE)
showMethods(class="ShortRead", where=getNamespace("ShortRead"),
inherit=FALSE)
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")
quality(rfq)
sread(reverseComplement(rfq))
quality(reverseComplement(rfq))
quality(trimTails(rfq, 2, "H", successive=TRUE))
ShortRead_class()
"ShortRead" class for short reads
Description
This class provides a way to store and manipulate, in a coordinated fashion, uniform-length short reads and their identifiers.
Seealso
ShortReadQ
Author
Martin Morgan
Examples
showClass("ShortRead")
showMethods(class="ShortRead", where=getNamespace("ShortRead"))
ShortRead_deprecated()
Deprecated functions from the ShortRead package
Description
These functions are deprecated, and will become defunct.
Usage
uniqueFilter(withSread=TRUE, .name="UniqueFilter")
Arguments
Argument | Description |
---|---|
withSread | A logical(1) indicating whether uniqueness includes the read sequence ( withSread=TRUE ) or is based only on chromosome, position, and strand ( withSread=FALSE ) |
.name | An optional character(1) object used to over-ride the name applied to default filters. |
Details
See srFilter
for details of ShortRead filters.
uniqueFilter
selects elements satisfying
!srduplicated(x)
when withSread=TRUE
, and
!(duplicated(chromosome(x)) & duplicated(position(x)) &
when withSread=FALSE
.
The behavior when withSread=TRUE
can be obtained with
occurrenceFilter(withSread=TRUE)
. The behavior when
withSread=FALSE
can be obtained using a custom filter
ShortRead_package()
FASTQ input and manipulation.
Description
This package implements sampling, iteration, and input of FASTQ files. The package includes functions for filtering and trimming reads, and for generating a quality assessment report. Data are represented as DNAStringSet-derived objects, and easily manipulated for a diversity of purposes. The package also contains legacy support for early single-end, ungapped alignment formats.
Details
See packageDescription('ShortRead')
Author
Maintainer: Martin Morgan mtmorgan@fhcrc.org
SnapshotFunction_class()
Class "SnapshotFunction"
Description
A class to store custom reader and viewer functions for the
Snapshot
class.
Usage
SnapshotFunction(reader, viewer, limits, ...)
reader(x, ...)
viewer(x, ...)
limits(x, ...)
Arguments
Argument | Description |
---|---|
reader | A function for reading data. The function must take a single argument (a Snapshot instance) and return a data.frame summarizing the file. |
viewer | A function for visualizing the data. The function must accept the data.frame created by reader , and return an SpTrellis object representing the view. |
limits | An integer(2) indicating the minimum and maximum number of nucleotides the SnapshotFunction is intended to visualize. For instance, a fine-scale viewer displaying a pileup might be appropriate at between 1000 and 50000 nucleotides. |
x | An instance of SnapshotFunction |
... | Additional arguments, currently unused. |
Seealso
Author
Martin Morgan and Chao-Jen Wong
Examples
## internally defined function
reader(ShortRead:::.fine_coverage)
viewer(ShortRead:::.fine_coverage)
limits(ShortRead:::.fine_coverage)
Snapshot_class()
Class "Snapshot"
Description
A Snapshot -class to visualize genomic data from BAM files with zoom and pan functionality.
Usage
Snapshot(files, range, ...)
Arguments
Argument | Description |
---|---|
files | A character() or BamFileList specifying the file(s) to be visualized. |
range | A GRanges object specifying the range to be visualized. |
|...
| Additional, optional, arguments to be passed to the Snapshot initialize
function. Arguments include: list("
", "
", " ", list(list("functions:"), list("A ", list(list("SnapshotFunctionList")), " of
", " functions, in addition to built-in ", list("fine_coverage"), ",
", " ", list("coarse_coverage"), ", ", list("multifine_coverage"), ", to be
", " used for visualization.")), "
", "
", " ", list(list("currentFunction:"), list("character(1) naming the function, from
", " ", list("functions"), " to be used for data input and
", " visualization. The default chooses a function based on the
", |
" scale at which the data is being visualized.")), "
", " ", " ", list(list("annTrack:"), list("Annotation track. If built-in visualization ", " functions are to be used, ", list("annTrack"), " should be a ", " ", list("GRanges"), " instance and the first column of its ", " elementMeatdata would be used to annotate the range.")), " ", " ", " ", list(list("fac:"), list("Character(1) indicating which factor used for ", " grouping the sample files. The factor should be included in ",
" the elementMetadata of ", list("files"), ", otherwise ignored. Used
", " only to visualize multiple files. ", " ")), " ", " ", " ", list(list(".auto_display:"), list("logical(1) indicating whether the ", " visualization is to be updated when ", list("show"), " is invoked.")), " ", " ", " ", list(list(".debug"), list("logical(1) indicating whether debug messages are to ", " be printed.")), " ", " ")
Seealso
Author
Martin Morgan and Chao-Jen Wong cwon2@fhcrc.org
Examples
## example 1: Importing specific ranges of records
file <- system.file("extdata", "SRR002051.chrI-V.bam",
package="yeastNagalakshmi")
which <- GRanges("chrI", IRanges(1, 2e5))
s <- Snapshot(file, range=which)
## methods
zoom(s) # zoom in
## zoom in to a specific region
zoom(s, range=GRanges("chrI", IRanges(7e4, 7e4+8000)))
pan(s) # pan right
togglez(s) # change effect of zooming
zoom(s) # zoom out
togglep(s) # change effect of panning
pan(s)
## accessors
functions(s)
vrange(s)
show(s)
ignore.strand(s)
view(s) ## extract the spTrellis object
getTrellis(s) ## extract the trellis object
## example 2: ignore strand
s <- Snapshot(file, range=which, ignore.strand=TRUE)
##
## example 3: visualizing annotation track
##
library(GenomicFeatures)
getAnnGR <- function(txdb, which) {
ex <- exonsBy(txdb, by="gene")
seqlevels(ex, pruning.mode="coarse") <- seqlevels(which)
r <- range(ex)
gr <- unlist(r)
values(gr)[["gene_id"]] <- rep.int(names(r), times=lengths(r))
gr
}
txdbFile <- system.file("extdata", "sacCer2_sgdGene.sqlite",
package="yeastNagalakshmi")
# txdb <- makeTxDbFromUCSC(genome="sacCer2", tablename="sgdGene")
txdb <- loadDb(txdbFile)
which <- GRanges("chrI", IRanges(1, 2e5))
gr <- getAnnGR(txdb, which)
## note that the first column of the elementMetadata annotates of the
## range of the elements.
gr
s <- Snapshot(file, range=which, annTrack=gr)
annTrack(s)
## zoom in to an interesting region
zoom(s, range=GRanges("chrI", IRanges(7e4, 7e4+8000)))
togglez(s) ## zoom out
zoom(s)
pan(s)
## example 4, 5, 6: multiple BAM files with 'multicoarse_covarage'
## and 'multifine_coverage' view.
## Resolution does not automatically switch for views of multiple
## files. It is important to note if width(which) < 10,000, use
## multifine_coverage. Otherwise use multicoarse_coverage
file <- system.file("extdata", "SRR002051.chrI-V.bam",
package="yeastNagalakshmi")
which <- GRanges("chrI", IRanges(1, 2e5))
s <- Snapshot(c(file, file), range=which,
currentFunction="multicoarse_coverage")
## grouping files and view by 'multicoarse_coverage'
bfiles <- BamFileList(c(a=file, b=file))
values(bfiles) <- DataFrame(sampleGroup=factor(c("normal", "tumor")))
values(bfiles)
s <- Snapshot(bfiles, range=which,
currentFunction="multicoarse_coverage", fac="sampleGroup")
## grouping files and view by 'multifine_coverage'
which <- GRanges("chrI", IRanges(7e4, 7e4+8000))
s <- Snapshot(bfiles, range=which,
currentFunction="multifine_coverage", fac="sampleGroup")
SolexaExportQA_class()
(Legacy) Quality assessment summaries from Solexa export and realign files
Description
This class contains a list-like structure with summary descriptions derived from visiting one or more Solexa export or realign files.
Seealso
qa
.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showClass("SolexaExportQA")
SolexaIntensity()
(Legacy) Construct objects of class "SolexaIntensity" and "SolexaIntensityInfo"
Description
These function constructs objects of
SolexaIntensity and
SolexaIntensityInfo . It will often be more
convenient to create these objects using parsers such as
readIntensities
.
Usage
SolexaIntensity(intensity=array(0, c(0, 0, 0)),
measurementError=array(0, c(0, 0, 0)),
readInfo=SolexaIntensityInfo(
lane=integer(nrow(intensity))),
...)
SolexaIntensityInfo(lane=integer(0),
tile=integer(0)[seq_along(lane)],
x=integer(0)[seq_along(lane)],
y=integer(0)[seq_along(lane)])
Arguments
Argument | Description |
---|---|
intensity | A matrix of image intensity values. Successive columns correspond to nucleotides A, C, G, T; four successive columns correspond to each cycle. Typically, derived from "_int.txt" files. |
measurementError | As intensity , but measuring standard error. Usually derived from "_nse.txt" files. |
readInfo | An object of class AnnotatedDataFrame , containing information described by SolexaIntensityInfo . |
lane | An integer vector giving the lane from which each read is derived. |
tile | An integer vector giving the tile from which each read is derived. |
x | An integer vector giving the tile-local x coordinate of the read from which each read is derived. |
y | An integer vector giving the tile-local y coordinate of the read from which each read is derived. |
... | Additional arguments, not currently used. |
Value
An object of class SolexaIntensity , or
SolexaIntensityInfo
.
Seealso
SolexaIntensity .
Author
Martin Morgan mtmorgan@fhcrc.org
SolexaIntensity_class()
Classes "SolexaIntensity" and "SolexaIntensityInfo"
Description
Instances of Intensity and IntensityInfo for representing image intensity data from Solexa experiments.
Seealso
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showClass("SolexaIntensity")
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
int <- readIntensities(sp)
int # SolexaIntensity
readIntensityInfo(int)# SolexaIntensityInfo
int[1:5,,]# read 1:5
SolexaPath_class()
(Legacy) "SolexaPath" class representing a standard output file hierarchy
Description
Solexa produces a hierarchy of output files. The content of the hierarchy varies depending on analysis options. This class represents a standard class hierarchy, constructed by searching a file hierarchy for appropriately named directories.
Author
Martin Morgan
Examples
showClass("SolexaPath")
showMethods(class="SolexaPath", where=getNamespace("ShortRead"))
sf <- system.file("extdata", package="ShortRead")
sp <- SolexaPath(sf)
sp
readFastq(sp, pattern="s_1_sequence.txt")
nfiles <- length(list.files(analysisPath(sp), "s_[1-8]_export.txt"))
library(Rmpi)
mpi.spawn.Rslaves(nslaves=nfiles)
report(qa(sp))
nfiles <- length(list.files(analysisPath(sp), "s_[1-8]_export.txt"))
report(qa(sp))
SolexaSet_class()
(Legacy) "SolexaSet" coordinating Solexa output locations with sample annotations
Description
This class coordinates the file hierarchy produced by the Solexa
pipeline' with annotation data contained in an [
AnnotatedDataFrame`](#annotateddataframe) (defined in the Biobase
package).
## Author
Martin Morgan
## Examples
r showClass("SolexaSet") showMethods(class="SolexaSet", where=getNamespace("ShortRead")) ## construct a SolexaSet sf <- system.file("extdata", package="ShortRead") df <- data.frame(Sample=c("Sample 1", "Sample 2", "Sample 3", "Sample 4", "Center-wide control", "Sample 6", "Sample 7", "Sample 8"), Genome=c(rep("hg18", 4), "phi_plus_SNPs.txt", rep("hg18", 3))) dfMeta <- data.frame(labelDescription=c("Type of sample", "Alignment genome")) adf <- new("AnnotatedDataFrame", data=df, varMetadata=dfMeta) SolexaSet(sf, adf)
SpTrellis_class()
Class "SpTrellis"
Description
A reference class to manage the trellis graphics related component of the
Snapshot
functionality for visualization of genomic data.
Usage
SpTrellis(trellis, debug_enabled=FALSE)
Arguments
Argument | Description |
---|---|
trellis | A trellis object for storing the plot of the genome area being visualized. |
debug_enabled | logical(1) indicating whether class methods should report debugging information to the user. |
Seealso
Author
Chao-Jen cwon2@fhcrc.org
Examples
col <- c("#66C2A5", "#FC8D62")
x = numeric(1000)
x[sample(1000, 100)] <- abs(rnorm(100))
df <- data.frame(x = c(x, -x), pos = seq(1, 1e5, length.out=1000),
group = rep(c("positive", "negative"), each=1000))
cv <- lattice::xyplot(x ~ pos, df, group=group, type="s",
col=col, main="yeast chrI:1 - 2e5",
ylab="Coverage", xlab="Coordinate",
scales=list(y=list(tck=c(1,0)),
x=list(rot=45, tck=c(1,0), tick.number=20)),
panel=function(...) {
lattice::panel.xyplot(...)
lattice::panel.grid(h=-1, v=20)
lattice::panel.abline(a=0, b=0, col="grey")
})
s <- SpTrellis(cv)
s
zi(s)
zi(s)
left(s)
right(s)
zo(s)
restore(s)
accessors()
(Legacy) Accessors for ShortRead classes
Description
These functions and generics define accessors' (to get and set values) for objects in the ShortRead package; methods defined in other packages may have additional meaning. ## Usage ```r ## SRVector vclass(object, ...) ## AlignedRead chromosome(object, ...) position(object, ...) alignQuality(object, ...) alignData(object, ...) ## Solexa experimentPath(object, ...) dataPath(object, ...) scanPath(object, ...) imageAnalysisPath(object, ...) baseCallPath(object, ...) analysisPath(object, ...) ## SolexaSet solexaPath(object, ...) laneDescription(object, ...) laneNames(object, ...) ``` ## Arguments |Argument |Description| |------------- |----------------| |
object| An object derived from class [
ShortRead](#content) . See help pages for individual objects, e.g., ShortReadQ . The default is to extract the contents of a slot of the corresponding name (e.g., slot
sread) from
object.| |
...| Additional arguments passed to the accessor. The default definitions do not make use of additional arguments.| ## Value Usually, the value of the corresponding slot, or other simple content described on the help page of
object` .
## Author
Martin Morgan
## Examples
r sp <- SolexaPath(system.file('extdata', package='ShortRead')) experimentPath(sp) basename(analysisPath(sp))
alphabetByCycle()
Summarize nucleotide, amino acid, or quality scores by cycle
Description
alphabetByCycle
summarizes nucleotides, amino acid, or qualities
by cycle, e.g., returning the number of occurrences of each nucleotide
A, T, G, C
across all reads from 36 cycles of a Solexa lane.
Usage
alphabetByCycle(stringSet, alphabet, ...)
Arguments
Argument | Description |
---|---|
stringSet | A R object representing the collection of reads, amino acid sequences, or quality scores, to be summarized. |
alphabet | The alphabet (character vector of length 1 strings) from which the sequences in stringSet are composed. Methods often define an appropriate alphabet, so that the user does not have to provide one. |
... | Additional arguments, perhaps used by methods defined on this generic. |
Details
The default method requires that stringSet
extends the
XStringSet
class of
list("Biostrings") .
The following method is defined, in addition to methods described in class-specific documentation: list(" ", " ", " ", list(list("alphabetByCycle"), list(list("signature(stringSet = "BStringSet")"), ": ", " this method uses an alphabet spanning all ASCII characters, codes ", " ", list("1:255"), ". ")), " ", " ", " ")
Value
A matrix with number of rows equal to the length of alphabet
and columns equal to the maximum width of reads or quality scores in
the string set. Entries in the matrix are the number of times, over
all reads of the set, that the corresponding letter of the alphabet
(row) appeared at the specified cycle (column).
Seealso
The IUPAC alphabet in Biostrings.
http://www.bioperl.org/wiki/FASTQ_sequence_format for the BioPerl definition of fastq.
Solexa documentation `Data analysis - documentation : Pipeline output
and visualisation'.
## Author
Martin Morgan
## Examples
r showMethods("alphabetByCycle") sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") alphabetByCycle(sread(rfq)) abcq <- alphabetByCycle(quality(rfq)) dim(abcq) ## 'high' scores, first and last cycles abcq[64:94,c(1:5, 32:36)]
alphabetScore()
Efficiently calculate the sum of quality scores across bases
Description
This generic takes a QualityScore or
PhredQuality
object and calculates, for each read, the sum of
the encoded nucleotide probabilities.
Usage
alphabetScore(object, ...)
Arguments
Argument | Description |
---|---|
object | An object of class QualityScore . |
list() | Additional arguments, currently unused. |
Value
A vector of numeric values of length equal to the length of
object
.
Author
Martin Morgan mtmorgan@fhcrc.org
clean()
Remove sequences with ambiguous nucleotides from short read classes
Description
Short reads may contain ambiguous base calls (i.e., IUPAC symbols different from A, T, G, C). This generic removes all sequences containing 1 or more ambiguous bases.
Usage
clean(object, ...)
Arguments
Argument | Description |
---|---|
object | An object for which clean methods exist; see below to discover these methods. |
list() | Additional arguments, perhaps used by methods. |
Details
The following method is defined, in addition to methods described in class-specific documentation: list(" ", " ", " ", list(list("clean"), list(list("signature(x = "DNAStringSet")"), ": ", " Remove all sequences containing non-base (A, C, G, T) IUPAC ", " symbols.")), " ", " ", " ")
Value
An instance of class(object)
, containing only sequences with
non-redundant nucleotides.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showMethods('clean')
countLines()
Count lines in all (text) files in a directory whose file name matches a pattern
Description
countLines
visits all files in a directory path dirPath
whose base (i.e., file) name matches pattern
. Lines in the file
are counted as the number of new line characters.
Usage
countLines(dirPath, pattern=character(0), ..., useFullName=FALSE)
Arguments
Argument | Description |
---|---|
dirPath | A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) of files whose lines are to be counted. |
pattern | The ( grep -style) pattern describing files whose lines are to be counted. The default ( character(0) ) results in line counts for all files in the directory. |
... | Additional arguments, passed internally to list.files. See list.files . |
useFullName | A logical(1) indicating whether elements of the returned vector should be named with the base (file) name (default; useFullName=FALSE ) or the full path name ( useFullName=TRUE ). |
Value
A named integer vector of line counts. Names are paths to the files
whose lines have been counted, excluding dirPath
.
Author
Martin Morgan
Examples
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
countLines(analysisPath(sp))
countLines(experimentPath(sp), recursive=TRUE)
countLines(experimentPath(sp), recursive=TRUE, useFullName=TRUE)
deprecated()
Deprecated and defunct functions
Description
These functions were introduced but are now deprecated or defunct.
Details
Defunct functions:
list(list("srapply")) list(". Use the BiocParallel package instead.")
list(list("readAligned,BamFile-method")) list(". Use the GenomicAlignments ", " package instead.")
list(list("basePath()")) list()
dotQA_class()
Virtual class for representing quality assessment results
Description
Classes derived from .QA-class
represent results of quality
assurance analyses. Details of derived class structure are found on
the help pages of the derived classes.
Seealso
Specific classes derived from .QA
Author
Martin Morgan mtmmorgan@fhcrc.org
Examples
getClass(".QA", where=getNamespace("ShortRead"))
dustyScore()
Summarize low-complexity sequences
Description
dustyScore
identifies low-complexity sequences, in a manner
inspired by the dust
implementation in BLAST
.
Usage
dustyScore(x, batchSize=NA, ...)
Arguments
Argument | Description |
---|---|
x | A DNAStringSet object, or object derived from ShortRead , containing a collection of reads to be summarized. |
batchSize | NA or an integer(1) vector indicating the maximum number of reads to be processed at any one time. |
... | Additional arguments, not currently used. |
Details
The following methods are defined: list(" ", " ", " ", list(list("dustyScore"), list(list("signature(x = "DNAStringSet")"), ": operating on ", " an object derived from class ", list("DNAStringSet"), ".")), " ", " ", " ", list(list("dustyScore"), list(list("signature(x = "ShortRead")"), ": operating on ", " the ", list("sread"), " of an object derived from class ", " ", list("ShortRead"), ".")), " ", " ", " ")
The dust-like calculations used here are as implemented at https://stat.ethz.ch/pipermail/bioc-sig-sequencing/2009-February/000170.html . Scores range from 0 (all triplets unique) to the square of the width of the longest sequence (poly-A, -C, -G, or -T).
The batchSize
argument can be used to reduce the memory
requirements of the algorithm by processing the x
argument in
batches of the specified size. Smaller batch sizes use less memory,
but are computationally less efficient.
Value
A vector of numeric scores, with length equal to the length of
x
.
Seealso
The WindowMasker supplement defining dust
ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/windowmasker/windowmasker_suppl.pdf
Author
Herve Pages (code); Martin Morgan
References
Morgulis, Getz, Schaffer and Agarwala, 2006. WindowMasker: window-based masker for sequenced genomes, Bioinformatics 22: 134-141.
Examples
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")
range(dustyScore(rfq))
filterFastq()
Filter fastq from one file to another
Description
filterFastq
filters reads from source to destination file(s)
applying a filter to reads in each file. The filter can be a function
or FilterRules instance; operations are done in a memory-efficient
manner.
Usage
filterFastq(files, destinations, ..., filter = FilterRules(),
compress=TRUE, yieldSize = 1000000L)
Arguments
Argument | Description |
---|---|
files | a character vector of valid file paths. |
destinations | a character vector of destinations, recycled to be the same length as files . destinations must not already exist. |
... | Additional arguments, perhaps used by a filter function. |
filter | A simple function taking as it's first argument a ShortReadQ instance and returning a modified ShortReadQ instance (e.g., with records or nucleotides removed), or a FilterRules instance specifying which records are to be removed. |
compress | A logical(1) indicating whether the file should be gz-compressed. The default is TRUE . |
yieldSize | Number of fastq records processed in each call to filter ; increase this for (marginally) more efficient I/O at the expense of increased memory use. |
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
## path to a convenient fastq file
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
fl <- file.path(analysisPath(sp), "s_1_sequence.txt")
## filter reads to keep those with GC < 0.7
fun <- function(x) {
gc <- alphabetFrequency(sread(x), baseOnly=TRUE)[,c("G", "C")]
x[rowSums(gc) / width(x) < .7]
}
filterFastq(fl, tempfile(), filter=fun)
## trimEnds,character-method uses filterFastq internally
trimEnds(fl, "V", destinations=tempfile())
polyn()
Utilities for common, simple operations
Description
These functions perform a variety of simple operations.
Usage
polyn(nucleotides, n)
Arguments
Argument | Description |
---|---|
nucleotides | A character vector with all elements having exactly 1 character, typically from the IUPAC alphabet. |
n | An integer(1) vector. |
Details
polyn
returns a character vector with each element having
n
characters. Each element contains a single nucleotide. Thus
polyn("A", 5)
returns AAAAA
.
Value
polyn
returns a character vector of length length(nucleotide)
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
polyn(c("A", "N"), 35)
Perform quality assessment on short reads
Description
This function is a common interface to quality assessment functions
available in ShortRead
. Results from this function may be
displayed in brief, or integrated into reports using, e.g.,
report
.
Usage
qa(dirPath, ...)
list(list("qa"), list("character"))(dirPath, pattern=character(0),
type=c("fastq", "SolexaExport", "SolexaRealign", "Bowtie",
"MAQMap", "MAQMapShort"),
...)
list(list("qa"), list("list"))(dirPath, ...)
Arguments
Argument | Description |
---|---|
dirPath | A character vector or other object (e.g., SolexaPath ; see showMethods , below) locating the data for which quality assessment is to be performed. See help pages for defined methods (by evaluating the example code, below) for details of available methods. |
pattern | A character vector limiting the files in dirPath to be processed, as with list.files . Care should be taken to specify pattern to avoid reading unintended files. |
type | The type of file being parsed; must be a character vector of length 1, selected from one of the types enumerated in the parameter. |
|list()
| Additional arguments used by methods. list("
", "
", " ", list(list(list("sample=TRUE"), ":"), list("Logical(1) indicating whether QA should
", " be performed on a sample (default size 1000000) drawn from each
", " FASTQ file, or from the entire file.")), "
", "
", " ", list(list(list("n"), ":"), list("The number of reads to sample when processing
", " FASTQ files.")), "
", "
", " ", list(list(list("Lpattern"), ", ", list("Rpattern"), ":"), list("A character vector or
", " XString object to be matched to the left end of a sequence. If
", |
" either ", list("Lpattern"), " or ", list("Rpattern"), " are provided,
", " ", list("trimLRPatterns"), " is invoked to produce a measure of adapter ", " contamination. Mismatch rates are 0.1 on the left and 0.2 on ", " the right, with a minimum overlap of 10 nt.")), " ", " ", " ", list(list(list("BPPARAM"), ":"), list("How parallel evalutation will be ", " performed. see ", list(list("BiocParallelParam")), "; the default is ", " ", list(
"BiocParallel::registered()[1]"), ".")), "
", " ", " ")
Details
The most common use of this function provides a directory path and pattern identifying FASTQ files for quality assessment. The default is then to create a quality assessment report based on a random sample of n=1000000 reads from each file.
The following methods are defined, in addition to those on S4 formal classes documented elsewhere:
list(" ", " ", " ", list(list(list("qa,character-method")), list(" ", " ", " Quality assessment is performed on all files in directory ", " ", list("dirPath"), " whose file name matches ", list("pattern"), ". The type of ", " analysis performed is based on the ", list("type"), " argument. Use ", " ", list("SolexaExport"), " when all files matching ", list("pattern"), " are ", " Solexa ", list("_export.txt"), " files. Use ", list("SolexaRealign"), " for ", " Solexa ",
list("_realign.txt"), " files. Use ", list("Bowtie"), " for Bowtie
", " files. Use ", list("MAQMapShort"), " for MAQ ", list("map"), " files produced by
", " MAQ versions below 0.70 and ", list("MAQMap"), " for more recent output.
", " Use ", list("fastq"), " for collections of fastq-format files. Quality
", " assessment details vary depending on data source.
", "
", " ")), "
", "
", " ", list(list(list("qa,list-method")), list("
", "
", " ", list(
"dirPath"), " is a list of objects, all of the same class and
", " typically derived from ", list("ShortReadQ"), ", on which quality ", " assessment is performed. All elements of the list must have names, ", " and these should be unique. ", " ", " ")), " ", " ")
Value
An object derived from class .QA . Values
contained in this object are meant for use by report
Seealso
.QA , SolexaExportQA MAQMapQA FastqQA
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
dirPath <- system.file(package="ShortRead", "extdata", "E-MTAB-1147")
## sample 1M reads / file
qa <- qa(dirPath, "fastq.gz", BPPARAM=SerialParam())
if (interactive())
browseURL(report(qa))
showMethods("qa", where=getNamespace("ShortRead"))
qa2()
(Updated) quality assessment reports on short reads
Description
This page summarizes an updated approach to quality assessment reports
in ShortRead
.
Usage
## Input source for short reads
QAFastqSource(con = character(), n = 1e+06, readerBlockSize = 1e+08,
flagNSequencesRange = NA_integer_, ...,
html = system.file("template", "QASources.html", package="ShortRead"))
QAData(seq = ShortReadQ(), filter = logical(length(seq)), ...)
## Possible QA elements
QAFrequentSequence(useFilter = TRUE, addFilter = TRUE,
n = NA_integer_, a = NA_integer_, flagK=.8, reportSequences = FALSE,
...)
QANucleotideByCycle(useFilter = TRUE, addFilter = TRUE, ...)
QANucleotideUse(useFilter = TRUE, addFilter = TRUE, ...)
QAQualityByCycle(useFilter = TRUE, addFilter = TRUE, ...)
QAQualityUse(useFilter = TRUE, addFilter = TRUE, ...)
QAReadQuality(useFilter = TRUE, addFilter = TRUE,
flagK = 0.2, flagA = 30L, ...)
QASequenceUse(useFilter = TRUE, addFilter = TRUE, ...)
QAAdapterContamination(useFilter = TRUE, addFilter = TRUE,
Lpattern = NA_character_, Rpattern = NA_character_,
max.Lmismatch = 0.1, max.Rmismatch = 0.2, min.trim = 9L, ...)
## Order QA report elements
QACollate(src, ...)
## perform analysis
qa2(object, state, ..., verbose=FALSE)
## Outputs from qa2
QA(src, filtered, flagged, ...)
QAFiltered(useFilter = TRUE, addFilter = TRUE, ...)
QAFlagged(useFilter = TRUE, addFilter = TRUE, ...)
## Summarize results as html report
list(list("report"), list("QA"))(x, ..., dest = tempfile(), type = "html")
## additional methods; 'flag' is not fully implemented
flag(object, ..., verbose=FALSE)
list(list("rbind"), list("QASummary"))(..., deparse.level = 1)
Arguments
Argument | Description |
---|---|
con | character(1) file location of fastq input, as used by FastqSampler . |
n | integer(1) number of records to input, as used by FastqStreamer ( QAFastqSource ). integer(1) number of sequences to tag as frequent ( QAFrequentSequence ). |
readerBlockSize | integer(1) number of bytes to input, as used by FastqStreamer . |
flagNSequencesRange | integer(2) minimum and maximum reads above which source files will be flagged as outliers. |
html | character(1) location of the HTML template for summarizing this report element. |
seq | ShortReadQ representation of fastq data. |
filter | logical() vector with length equal to seq , indicating whether elements of seq are filtered ( TRUE ) or not. |
useFilter, addFilter | logical(1) indicating whether the QA element should be calculating using the filtered ( useFilter=TRUE ) or all reads, and whether reads failing the QA element should be added to the filter used by subsequent steps ( addFilter = TRUE ) or not. |
a | integer(1) count of number of sequences above which a read will be considered frequent ( QAFrequentSequence ). |
flagK, flagA | flagK numeric(1) between 0 and 1 indicating the fraction of frequent sequences greater than or equal to n or a above which a fastq file will be flagged ( QAFrequentSequence ). flagK numeric{1} between 0 and 1 and flagA integer(1) indicating that a run should be flagged when the fraction of reads with quality greater than or equal to flagA falls below threshold flagK . |
reportSequences | logical(1) indicating whether frequent sequences are to be reported. |
Lpattern, Rpattern, max.Lmismatch, max.Rmismatch, |
| Parameters influencing adapter identification, see matchPattern
.|
|src
| The source, e.g., QAFastqSource
, on which the quality assessment report will be based.|
|object
| An instance of class derived from QA
on which quality metrics will be derived; for end users, this is usually the result of QACollate
.|
|state
| The data on which quality assessment will be performed; this is not usually necessary for end-users.|
|verbose
| logical(1)
indicating whether progress reports should be reported.|
|filtered, flagged
| Primarily for internal use, instances of QAFiltered
and QAFlagged
.|
|x
| An instance of QA
on which a report is to be generated.|
|dest
| character(1)
providing the directory in which the report is to be generated.|
|type
| character(1)
indicating the type of report to be generated; only html is supported.|
|deparse.level
| see rbind
.|
|...
| Additional arguments, e.g., html
to specify the location of the html source to use as a template for the report.|
Details
Use QACollate
to specify an order in which components of a QA
report are to be assembled. The first argument is the data source
(e.g., QAFastqSource
).
Functions related to data input include: list(" ", " ", " ", list(list(list("QAFastqSource")), list("defines the location of fastq files to ", " be included in the report. ", list("con"), " is used to construct a ", " ", list(list("FastqSampler")), " instance, and records are processed ", " using ", list("qa2,QAFastqSource-method"), ".")), " ", " ", " ", list(list(list("QAData")), list("is a class for representing the data during the ", " QA report generation pass; it is primarily for internal use.")),
"
", " ", " ")
Possible elements in a QA report are: list(" ", " ", " ", list(list(list("QAFrequentSequence")), list("identifies the most-commonly ", " occuring sequences. One of ", list("n"), " or ", list("a"), " can be non-NA, and ", " determine the number of frequent sequences reported. ", list("n"), " ", " specifies the number of most-frequent sequences to filter, e.g., ", " ", list("n=10"), " would filter the top 10 most commonly occurring ", " sequences; ", list("a"), " provides a threshold frequency (count) above ",
" which reads are filtered. The sample is flagged when a fraction
", " ", list("flagK"), " of the reads are filtered. ", " ", " ", list("reportSequences"), " determines whether the most commonly ", " occuring sequences, as determined by ", list("n"), " or ", list("a"), ", are ", " printed in the html report. ", " ")), " ", " ", " ", list(list(list("QANucleotideByCycle")), list("reports nucleotide frequency as a ", " function of cycle.")), " ", " ",
" ", list(list(list("QAQualityByCycle")), list("reports average quality score as a
", " function of cycle.")), " ", " ", " ", list(list(list("QAQualityUse")), list("summarizes overall nucleotide qualities.")), " ", " ", " ", list(list(list("QAReadQuality")), list("summarizes the distribution of read ", " qualities.")), " ", " ", " ", list(list(list("QASequenceUse")), list("summarizes the cumulative distribution ", " of reads occurring 1, 2, ", list(), " times.")),
"
", " ", " ", list(list(list("QAAdapterContamination")), list("reports the occurrence of ", " ", list("adapter"), " sequences on the left and / or right end of each ", " read.")), " ", " ", " ")
Value
An object derived from class .QA . Values
contained in this object are meant for use by report
Seealso
QA .
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
dirPath <- system.file(package="ShortRead", "extdata", "E-MTAB-1147")
fls <- dir(dirPath, "fastq.gz", full=TRUE)
coll <- QACollate(QAFastqSource(fls), QAReadQuality(),
QAAdapterContamination(), QANucleotideUse(),
QAQualityUse(), QASequenceUse(),
QAFrequentSequence(n=10), QANucleotideByCycle(),
QAQualityByCycle())
x <- qa2(coll, BPPARAM=SerialParam(), verbose=TRUE)
res <- report(x)
if (interactive())
browseURL(res)
readAligned()
(Legacy) Read aligned reads and their quality scores into R representations
Description
Import files containing aligned reads into an internal representation of the alignments, sequences, and quality scores. Most methods (see details for exceptions) read all files into a single R object.
Usage
readAligned(dirPath, pattern=character(0), ...)
Arguments
Argument | Description |
---|---|
dirPath | A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute; some methods also accept a character vector of file names) of aligned read files to be input. |
pattern | The ( grep -style) pattern describing file names to be read. The default ( character(0) ) results in (attempted) input of all files in the directory. |
... | Additional arguments, used by methods. When dirPath is a character vector, the argument type must be provided. Possible values for type and their meaning are described below. Most methods implement filter=srFilter() , allowing objects of SRFilter to selectively returns aligned reads. |
Details
There is no standard aligned read file format; methods parse particular file types.
The readAligned,character-method
interprets file types based
on an additional type
argument. Supported types are:
list("
", "
", " ", list(list(list("type="SolexaExport"")), list("
", "
", " This type parses ", list(".*_export.txt"), " files following the
", " documentation in the Solexa Genome Alignment software manual,
", " version 0.3.0. These files consist of the following columns;
", " consult Solexa documentation for precise descriptions. If parsed,
", " values can be retrieved from ", list(list("AlignedRead")), " as
", " follows:
", "
", " ", list("
",
" ", list(list("Machine"), list("see below")), "
", " ", list(list("Run number"), list("stored in ", list("alignData"))), " ", " ", list(list("Lane"), list("stored in ", list("alignData"))), " ", " ", list(list("Tile"), list("stored in ", list("alignData"))), " ", " ", list(list("X"), list("stored in ", list("alignData"))), " ", " ", list(list("Y"), list("stored in ", list("alignData"))), " ", " ", list(list("Multiplex index"),
list("see below")), "
", " ", list(list("Paired read number"), list("see below")), " ", " ", list(list("Read"), list(list("sread"))), " ", " ", list(list("Quality"), list(list("quality"))), " ", " ", list(list("Match chromosome"), list(list("chromosome"))), " ", " ", list(list("Match contig"), list(list("alignData"))), " ", " ", list(list("Match position"), list(list("position"))), " ", " ", list(list("Match strand"),
list(list("strand"))), "
", " ", list(list("Match description"), list("Ignored")), " ", " ", list(list("Single-read alignment score"), list(list("alignQuality"))), " ", " ", list(list("Paired-read alignment score"), list("Ignored")), " ", " ", list(list("Partner chromosome"), list("Ignored")), " ", " ", list(list("Partner contig"), list("Ignored")), " ", " ", list(list("Partner offset"), list("Ignored")), " ", " ",
list(list("Partner strand"), list("Ignored")), "
", " ", list(list("Filtering"), list(list("alignData"))), "
", " "), "
", "
", " The following optional arguments, set to ", list("FALSE"), " by default,
", " influence data input
", "
", " ", list("
", "
", " ", list(list("withMultiplexIndex"), list("When ", list("TRUE"), ", include the
", " multiplex index as a column ", list("multiplexIndex"), " in
", " ", list("alignData"),
".")), "
", " ", " ", list(list("withPairedReadNumber"), list("When ", list("TRUE"), ", include the paired ", " read number as a column ", list("pairedReadNumber"), " in ", " ", list("alignData"), ".")), " ", " ", " ", list(list("withId"), list("When ", list("TRUE"), ", construct an identifier string ", " as ", " ", list("Machine_Run:Lane:Tile:X:Y#multiplexIndex/pairedReadNumber"), ". The ", " substrings ", list("#multiplexIndex"),
" and
", " ", list("/pairedReadNumber"), " are not present if ", " ", list("withMultiplexIndex=FALSE"), " or ", " ", list("withPairedReadNumber=FALSE"), ".")), " ", " ", " ", list(list("withAll"), list("A convencience which, when ", list("TRUE"), ", sets all ", " ", list("with*"), " values to ", list("TRUE"), ".")), " ", " ", " "), " ", " ", " Note that not all paired read columns are interpreted. Different ", " interfaces to reading alignment files are described in ",
" ", list(list("SolexaPath")), " and
", " ", list(list("SolexaSet")), ". ", " ", " ")), " ", " ", " ", list(list(list("type="SolexaPrealign"")), list("See SolexaRealign")), " ", " ", list(list(list("type="SolexaAlign"")), list("See SolexaRealign")), " ", " ", list(list(list("type="SolexaRealign"")), list(" ", " ", " These types parse ", list("s_L_TTTT_prealign.txt"), ", ", " ", list("s_L_TTTT_align.txt"), " or ", list("s_L_TTTT_realign.txt"), " files ",
" produced by default and eland analyses. From the Solexa
", " documentation, ", list("align"), " corresponds to unfiltered first-pass ", " alignments, ", list("prealign"), " adjusts alignments for error rates ", " (when available), ", list("realign"), " filters alignments to exclude ", " clusters failing to pass quality criteria. ", " ", " Because base quality scores are not stored with alignments, the ", " object returned by ", list("readAligned"),
" scores all base qualities as
", " ", list("-32"), ". ", " ", " If parsed, values can be retrieved from ", " ", list(list("AlignedRead")), " as follows: ", " ", " ", list(" ", " ", list(list("Sequence"), list("stored in ", list("sread"))), " ", " ", list(list("Best score"), list("stored in ", list("alignQuality"))), " ", " ", list(list("Number of hits"), list("stored in ", list("alignData"))), " ", " ", list(list("Target position"), list(
"stored in ", list("position"))), "
", " ", list(list("Strand"), list("stored in ", list("strand"))), " ", " ", list(list("Target sequence"), list("Ignored; parse using ", " ", list(list("readXStringColumns")))), " ", " ", list(list("Next best score"), list("stored in ", list("alignData"))), " ", " "), " ", " ")), " ", " ", " ", list(list(list("type="SolexaResult"")), list(" ", " ", " This parses ", list("s_L_eland_results.txt"), " files, an intermediate ",
" format that does not contain read or alignment quality
", " scores. ", " ", " Because base quality scores are not stored with alignments, the ", " object returned by ", list("readAligned"), " scores all base qualities as ", " ", list("-32"), ". ", " ", " Columns of this file type can be retrieved from ", " ", list(list("AlignedRead")), " as follows (description of ", " columns is from Table 19, Genome Analyzer Pipeline Software User ", " Guide, Revision A, January 2008): ",
"
", " ", list(" ", " ", list(list("Id"), list("Not parsed")), " ", " ", list(list("Sequence"), list("stored in ", list("sread"))), " ", " ", list(list("Type of match code"), list("Stored in ", list("alignData"), " as ", " ", list("matchCode"), ". Codes are (from the Eland manual): NM (no ", " match); QC (no match due to quality control failure); RM (no ", " match due to repeat masking); U0 (best match was unique and ", " exact); U1 (best match was unique, with 1 mismatch); U2 (best ",
" match was unique, with 2 mismatches); R0 (multiple exact
", " matches found); R1 (multiple 1 mismatch matches found, no ", " exact matches); R2 (multiple 2 mismatch matches found, no ", " exact or 1-mismatch matches).")), " ", " ", list(list("Number of exact matches"), list("stored in ", list("alignData"), " as ", " ", list("nExactMatch"))), " ", " ", list(list("Number of 1-error mismatches"), list("stored in ", list("alignData"),
"
", " as ", list("nOneMismatch"))), " ", " ", list(list("Number of 2-error mismatches"), list("stored in ", list("alignData"), " ", " as ", list("nTwoMismatch"))), " ", " ", list(list("Genome file of match"), list("stored in ", list("chromosome"))), " ", " ", list(list("Position"), list("stored in ", list("position"))), " ", " ", list(list("Strand"), list("(direction of match) stored in ", list("strand"))), " ", " ", list(list(list( | "N"), " treatment"), list("stored in ", list("alignData"), ", as ", " ", list("NCharacterTreatment"), ". ", list("."), " indicates treatment of ", " ", list("N"), " was not applicable; ", list("D"), " indicates treatment ", " as deletion; ", list("|"), " indicates treatment as insertion")), " ", " ", list(list("Substitution error"), list("stored in ", list("alignData"), " as ", " ", list("mismatchDetailOne"), " and ", list("mismatchDetailTwo"), |
". Present
", " only for unique inexact matches at one or two ", " positions. Position and type of first substitution error, ", " e.g., 11A represents 11 matches with 12th base an A in ", " reference but not read. The reference manual cited below lists ", " only one field (", list("mismatchDetailOne"), "), but two are present ", " in files seen in the wild.")), " ", " ", " "), " ", " ")), " ", " ", " ", list(list(
list("type="MAQMap", records=-1L")), list("Parse binary ", list("map"), "
", " files produced by MAQ. See details in the next section. The ", " ", list("records"), " option determines how many lines are read; ", " ", list("-1L"), " (the default) means that all records are input. For ", " ", list("type="MAQMap""), ", ", list("dir"), " and ", list("pattern"), " must match a ", " single file.")), " ", " ", " ", list(list(list("type="MAQMapShort", records=-1L")),
list("The same as
", " ", list("type="MAQMap""), " but for map files made with Maq prior to ", " version 0.7.0. (These files use a different maximum read length ", " [64 instead of 128], and are hence incompatible with newer Maq map ", " files.). For ", list("type="MAQMapShort""), ", ", list("dir"), " and ", " ", list("pattern"), " must match a single file.")), " ", " ", " ", list(list(list("type="MAQMapview"")), list(" ", " ", " Parse alignment files created by MAQ's ",
list("mapiew"), " command.
", " Interpretation of columns is based on the description in the MAQ ", " manual, specifically ", " ", " ", list(" ", " ...each line consists of read name, chromosome, position, ", " strand, insert size from the outer coordinates of a pair, ", " paired flag, mapping quality, single-end mapping quality, ", " alternative mapping quality, number of mismatches of the ", " best hit, sum of qualities of mismatched bases of the best ",
" hit, number of 0-mismatch hits of the first 24bp, number
", " of 1-mismatch hits of the first 24bp on the reference, ", " length of the read, read sequence and its quality. ", " "), " ", " ", " The read name, read sequence, and quality are read as ", " ", list("XStringSet"), " objects. Chromosome and strand are read as ", " ", list("factor"), "s. Position is ", list("numeric"), ", while mapping quality is ", " ", list("numeric"), ". These fields are mapped to their corresponding ",
" representation in ", list("AlignedRead"), " objects.
", " ", " Number of mismatches of the best hit, sum of qualities of mismatched ", " bases of the best hit, number of 0-mismatch hits of the first 24bp, ", " number of 1-mismatch hits of the first 24bp are represented in the ", " ", list("AlignedRead"), " object as components of ", list("alignData"), ". ", " ", " Remaining fields are currently ignored. ", " ", " ")), " ", " ", " ", list(list(list(
"type="Bowtie"")), list("
", " ", " Parse alignment files created with the Bowtie alignment ", " algorithm. Parsed columns can be retrieved from ", " ", list(list("AlignedRead")), " as follows: ", " ", " ", list(" ", " ", list(list("Identifier"), list(list("id"))), " ", " ", list(list("Strand"), list(list("strand"))), " ", " ", list(list("Chromosome"), list(list("chromosome"))), " ", " ", list(list("Position"), list(list("position"),
"; see comment below")), "
", " ", list(list("Read"), list(list("sread"), "; see comment below")), " ", " ", list(list("Read quality"), list(list("quality"), "; see comments below")), " ", " ", list(list("Similar alignments"), list(list("alignData"), ", ", list("similar"), " ", " column; Bowtie v. 0.9.9.3 (12 May, 2009) documents this as ", " the number of other instances where the same read aligns against the ", " same reference characters as were aligned against in this ",
" alignment. Previous versions marked this as ", list("Reserved"))), "
", " ", list(list("Alignment mismatch locations"), list(list("alignData"), "
", " ", list("mismatch"), ", column")), "
", "
", " "), "
", "
", " NOTE: the default quality encoding changes to ", list("FastqQuality"), "
", " with ", list("ShortRead"), " version 1.3.24.
", "
", " This method includes the argument ", list("qualityType"), " to specify
", " how quality scores are encoded. Bowtie quality scores are
",
" ", list("Phred"), "-like by default, with
", " ", list("qualityType='FastqQuality'"), ", but can be specified as ", " ", list("Solexa"), "-like, with ", list("qualityType='SFastqQuality'"), ". ", " ", " Bowtie outputs positions that are 0-offset from the left-most end ", " of the ", list("+"), " strand. ", list("ShortRead"), " parses position ", " information to be 1-offset from the left-most end of the ", list("+"), " ", " strand. ", " ", " Bowtie outputs reads aligned to the ",
list("-"), " strand as their
", " reverse complement, and reverses the quality score string of these ", " reads. ", list("ShortRead"), " parses these to their original sequence ", " and orientation. ", " ", " ")), " ", " ", " ", list(list(list("type="SOAP"")), list(" ", " ", " Parse alignment files created with the SOAP alignment ", " algorithm. Parsed columns can be retrieved from ", " ", list(list("AlignedRead")), " as follows: ", " ", " ",
list("
", " ", list(list("id"), list(list("id"))), " ", " ", list(list("seq"), list(list("sread"), "; see comment below")), " ", " ", list(list("qual"), list(list("quality"), "; see comment below")), " ", " ", list(list("number of hits"), list(list("alignData"))), " ", " ", list(list("a/b"), list(list("alignData"), " (", list("pairedEnd"), ")")), " ", " ", list(list("length"), list(list("alignData"), " (", list("alignedLength"), ")")), " ", " ",
list(list("+/-"), list(list("strand"))), "
", " ", list(list("chr"), list(list("chromosome"))), " ", " ", list(list("location"), list(list("position"), "; see comment below")), " ", " ", list(list("types"), list(list("alignData"), " (", list("typeOfHit"), ": integer ", " portion; ", list("hitDetail"), ": text portion)")), " ", " "), " ", " ", " This method includes the argument ", list("qualityType"), " to specify ", " how quality scores are encoded. It is unclear from SOAP ",
" documentation what the quality score is; the default is
", " ", list("Solexa"), "-like, with ", list("qualityType='SFastqQuality'"), ", but ", " can be specified as ", list("Phred"), "-like, with ", " ", list("qualityType='FastqQuality'"), ". ", " ", " SOAP outputs positions that are 1-offset from the left-most end of ", " the ", list("+"), " strand. ", list("ShortRead"), " preserves this ", " representation. ", " ", " SOAP reads aligned to the ",
list("-"), " strand are reported by SOAP as
", " their reverse complement, with the quality string of these reads ", " reversed. ", list("ShortRead"), " parses these to their original sequence ", " and orientation. ", " ", " ")), " ", " ", " ")
Value
A single R object (e.g., AlignedRead ) containing
alignments, sequences and qualities of all files in dirPath
matching pattern
. There is no guarantee of order in which files
are read.
Seealso
The AlignedRead class.
Genome Analyzer Pipeline Software User Guide, Revision A, January 2008.
The MAQ reference manual, http://maq.sourceforge.net/maq-manpage.shtml#5 , 3 May, 2008.
The Bowtie reference manual, http://bowtie-bio.sourceforge.net , 28 October, 2008.
The SOAP reference manual, http://soap.genomics.org.cn/soap1 , 16 December, 2008.
Author
Martin Morgan mtmorgan@fhcrc.org, Simon Anders anders@ebi.ac.uk (MAQ map)
Examples
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
ap <- analysisPath(sp)
## ELAND_EXTENDED
(aln0 <- readAligned(ap, "s_2_export.txt", "SolexaExport"))
## PhageAlign
(aln1 <- readAligned(ap, "s_5_.*_realign.txt", "SolexaRealign"))
## MAQ
dirPath <- system.file('extdata', 'maq', package='ShortRead')
list.files(dirPath)
## First line
readLines(list.files(dirPath, full.names=TRUE)[[1]], 1)
countLines(dirPath)
## two files collapse into one
(aln2 <- readAligned(dirPath, type="MAQMapview"))
## select only chr1-5.fa, '+' strand
filt <- compose(chromosomeFilter("chr[1-5].fa"),
strandFilter("+"))
(aln3 <- readAligned(sp, "s_2_export.txt", filter=filt))
readBaseQuality()
(Legacy) Read short reads and their quality scores into R representations
Description
readBaseQuality
reads all base call files in a directory
dirPath
whose file name matches seqPattern
and all quality
score files whose name matches prbPattern
, returning a compact
internal representation of the sequences, and quality scores in the files.
Methods read all files into a single R object.
Usage
readBaseQuality(dirPath, ...)
list(list("readBaseQuality"), list("character"))(dirPath, seqPattern=character(0),
prbPattern=character(0), type=c("Solexa"), ...)
Arguments
Argument | Description |
---|---|
dirPath | A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) of files to be input. |
seqPattern | The ( grep -style) pattern describing base call file names to be read. The default ( character(0) ) results in (attempted) input of all files in the directory. |
prbPattern | The ( grep -style) pattern describing quality score file names to be read. The default ( character(0) ) results in (attempted) input of all files in the directory. |
type | The type of file to be parsed. Supported types include: Solexa : parse reads and their qualities from _seq.txt and _prb.txt -formatted files, respectively. |
... | Additional arguments, perhaps used by methods. |
Value
A single R object (e.g., ShortReadQ ) containing
sequences and qualities of all files in dirPath
matching
seqPattern
and prbPattern
respectively. There is no
guarantee of order in which files are read.
Seealso
A ShortReadQ object.
Author
Patrick Aboyoun paboyoun@fhcrc.org
Examples
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
readBaseQuality(sp, seqPattern="s_1.*_seq.txt", prbPattern="s_1.*_prb.txt")
readBfaToc()
(Legacy) Get a list of the sequences in a Maq .bfa file
Description
As coverage
needs to know the lengths of the reference sequences,
this function is provided which extracts this information from
a .bfa file (Maq's "binary FASTA" format).
Usage
readBfaToc( bfafile )
Arguments
Argument | Description |
---|---|
bfafile | The file name of the .bfa file. |
Value
An integer vector with one element per reference sequence found in the .bfa file, each vector element named with the sequence name and having the sequence length as value.
Author
Simon Anders, EMBL-EBI, sanders@fs.tum.de
(Note: The C code for this function incorporates code from Li Heng's MAQ software, (c) Li Heng and released by him under GPL 2.
readFasta()
Read and write FASTA files to or from ShortRead objects
Description
readFasta
reads all FASTA-formated files in a directory
dirPath
whose file name matches pattern pattern
,
returning a compact internal representation of the sequences and
quality scores in the files. Methods read all files into a single R
object; a typical use is to restrict input to a single FASTA file.
writeFasta
writes an object to a single file
, using
mode="w"
(the default) to create a new file or mode="a"
append to an existing file. Attempting to write to an existing file
with mode="w"
results in an error.
Usage
readFasta(dirPath, pattern = character(0), ...,
nrec=-1L, skip=0L)
list(list("readFasta"), list("character"))(dirPath, pattern = character(0), ...,
nrec=-1L, skip=0L)
writeFasta(object, file, mode="w", ...)
list(list("writeFasta"), list("DNAStringSet"))(object, file, mode="w", ...)
Arguments
Argument | Description |
---|---|
dirPath | A character vector giving the directory path (relative or absolute) or single file name of FASTA files to be read. |
pattern | The ( grep -style) pattern describing file names to be read. The default ( character(0) ) results in (attempted) input of all files in the directory. |
object | An object to be output in fasta format. |
file | A length 1 character vector providing a path to a file to the object is to be written to. |
mode | A length 1 character vector equal to either w or a to write to a new file or append to an existing file, respectively. |
... | Additional arguments used by methods or, for writeFasta , writeXStringSet . |
nrec | See ?readDNAStringSet . |
skip | See ?readDNAStringSet . |
Value
readFasta
returns a DNAStringSet .
containing sequences and qualities contained in all files in
dirPath
matching pattern
. There is no guarantee of order
in which files are read.
writeFasta
is invoked primarily for its side effect, creating
or appending to file file
. The function returns, invisibly, the
length of object
, and hence the number of records
written. There is a writeFasta
method for any class derived
from ShortRead .
Author
Martin Morgan
Examples
showMethods("readFasta")
showMethods("writeFasta")
f1 <- system.file("extdata", "someORF.fa", package="Biostrings")
rfa <- readFasta(f1)
sread(rfa)
id(rfa)
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")
file <- tempfile()
writeFasta(rfq, file)
readLines(file, 8)
writeFasta(sread(rfq), file) # no 'id's
readFastq()
Read and write FASTQ-formatted files
Description
readFastq
reads all FASTQ-formated files in a directory
dirPath
whose file name matches pattern pattern
,
returning a compact internal representation of the sequences and
quality scores in the files. Methods read all files into a single R
object; a typical use is to restrict input to a single FASTQ file.
writeFastq
writes an object to a single file
, using
mode="w"
(the default) to create a new file or mode="a"
append to an existing file. Attempting to write to an existing file
with mode="w"
results in an error.
Usage
readFastq(dirPath, pattern=character(0), ...)
list(list("readFastq"), list("character"))(dirPath, pattern=character(0), ..., withIds=TRUE)
writeFastq(object, file, mode="w", full=FALSE, compress=TRUE, ...)
Arguments
Argument | Description |
---|---|
dirPath | A character vector (or other object; see methods defined on this generic) giving the directory path (relative or absolute) or single file name of FASTQ files to be read. |
pattern | The ( grep -style) pattern describing file names to be read. The default ( character(0) ) results in (attempted) input of all files in the directory. |
object | An object to be output in fastq format. For methods, use showMethods(object, where=getNamespace("ShortRead")) . |
file | A length 1 character vector providing a path to a file to the object is to be written to. |
mode | A length 1 character vector equal to either w or a to write to a new file or append to an existing file, respectively. |
full | A logical(1) indicating whether the identifier line should be repeated full=TRUE or omitted full=FALSE on the third line of the fastq record. |
compress | A logical(1) indicating whether the file should be gz-compressed. The default is TRUE . |
|...
| Additional arguments. In particular, qualityType
and filter
: list("
", "
", " ", list(list("qualityType:"), list("Representation to be used for quality scores,
", " must be one of ", list("Auto"), " (choose Illumina base 64 encoding
", " ", list("SFastqQuality"), " if all characters are ASCII-encoded as
", " greater than 58 ", list(":"), " and some characters are greater than 74
", " ", list("J"), "), ", list("FastqQuality"), " (Phred-like base 33 encoding),
", " ", list("SFastqQuality"), " (Illumina base 64 encoding).")), "
", "
", " ", |
list(list("filter:"), list("An object of class ", list(list("srFilter")), ", used to
", " filter objects of class ", list(list("ShortReadQ")), " at
", " input.")), "
", "
", " ")
|withIds
| logical(1)
indicating whether identifiers should be read from the fastq file.|
Details
The fastq format is not quite precisely defined. The basic definition used here parses the following four lines as a single record:
list(" ", " @HWI-EAS88_1_1_1_1001_499 ", " GGACTTTGTAGGATACCCTCGCTTTCCTTCTCCTGT ", " +HWI-EAS88_1_1_1_1001_499 ", " ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS ", " ")
The first and third lines are identifiers preceded by a specific
character (the identifiers are identical, in the case of Solexa). The
second line is an upper-case sequence of nucleotides. The parser
recognizes IUPAC-standard alphabet (hence ambiguous nucleotides),
coercing .
to -
to represent missing values. The final
line is an ASCII-encoded representation of quality scores, with one
ASCII character per nucleotide.
The encoding implicit in Solexa-derived fastq files is that each
character code corresponds to a score equal to the ASCII character
value minus 64 (e.g., ASCII @
is decimal 64, and corresponds to
a Solexa quality score of 0). This is different from BioPerl, for
instance, which recovers quality scores by subtracting 33 from the
ASCII character value (so that, for instance, !
, with decimal
value 33, encodes value 0).
The BioPerl description of fastq asserts that the first character of
line 4 is a !
, but the current parser does not support this
convention.
writeFastq
creates files following the specification outlined
above, using the IUPAC-standard alphabet (hence, sequences containing
list(".") when read will be represented by list("-") when written).
Value
readFastq
returns a single R object (e.g.,
ShortReadQ ) containing sequences and qualities
contained in all files in dirPath
matching
pattern
. There is no guarantee of order in which files are
read.
writeFastq
is invoked primarily for its side effect, creating
or appending to file file
. The function returns, invisibly, the
length of object
, and hence the number of records written.
Seealso
The IUPAC alphabet in Biostrings.
http://www.bioperl.org/wiki/FASTQ_sequence_format for the BioPerl definition of fastq.
Solexa documentation `Data analysis - documentation : Pipeline output
and visualisation'.
## Author
Martin Morgan
## Examples
r showMethods(readFastq) showMethods(writeFastq) sp <- SolexaPath(system.file('extdata', package='ShortRead')) rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt") sread(rfq) id(rfq) quality(rfq) ## SolexaPath method 'knows' where FASTQ files are placed rfq1 <- readFastq(sp, pattern="s_1_sequence.txt") rfq1 file <- tempfile() writeFastq(rfq, file) readLines(file, 8)
readIntensities()
(Legacy) Read Illumina image intensity files
Description
readIntensities
reads image intensity' files (such as Illumina's
int.txtand (optionally)
_nse.txt) into a single object. ## Usage ```r readIntensities(dirPath, pattern=character(0), ...) ``` ## Arguments |Argument |Description| |------------- |----------------| |
dirPath| Directory path or other object (e.g., SolexaPath ) for which methods are defined.| |
pattern| A length 1 character vector representing a regular expression to be combined with
dirPath, as described below, to match files to be summarized.| |
list()| Additional arguments used by methods.| ## Details Additional methods are defined on specific classes, see, e.g., list("SolexaPath") . The
readIntensities,character-methodcontains an argument
typethat determines how intensities are parsed. Use the
typeargument to
readIntensities,character-method, as described below. All
readIntensities,character` methods accepts
the folling arguments:
list("
", "
", " ", list(list("withVariability:"), list("Include estimates of variability (i.e., from
", " parsing ", list("_nse"), " files).")), "
", "
", " ", list(list("verbose:"), list("Report on progress when starting to read each file.")), "
", "
", " ")
The supported types and their signatures are:
list("
", "
", " ", list(list(list("type="RtaIntensity"")), list("
", "
", " Intensities are read from Illumina ", list("_cif.txt"), " and
", " ", list("_cnf.txt"), "-style files. The signature for this method is
", "
", " ", list("dirPath, pattern=character(0), ..., type="RtaIntensity",
", " lane=integer(0), cycles=integer(0), cycleIteration=1L,
", " tiles=integer(0),
", " laneName=sprintf("L", "%.3d", lane),", "
", " cycleNames=sprintf("C",
"%d.%d", cycles, cycleIteration),", "
", " tileNames=sprintf("s", "%d%d", lane, tiles),", "
", " posNames=sprintf("s", "%d_%.4d_pos.txt", lane, tiles),", "
", " withVariability=TRUE, verbose=FALSE"), "
", "
", " ", list("
", "
", " ", list(list("lane:"), list(list("integer(1)"), " identifying the lane in which
", " cycles and tiles are to be processed.")), "
", "
", " ", list(list("cycles:"), list(list("integer()"), " enumerating cycles to be
",
" processed.")), "
", "
", " ", list(list("cycleIteration:"), list(list("integer(1)"), " identifying the
", " iteration of the base caller to be summarized")), "
", "
", " ", list(list("tiles:"), list(list("integer()"), " enumerating tile numbers to be
", " summarized.")), "
", "
", " ", list(list("laneName, cycleNames, tileNames,
", " posNames:"), list(list("character()"), " vectors identifying the lane and
", " cycle directories, and the ", list("pos"),
" and tile file names
", " (excluding the ", list(".cif"), " or ", list(".cnf"), " extension) to be
", " processed.")), "
", "
", " "), "
", "
", " The ", list("dirPath"), " and ", list("pattern"), " arguments are combined as
", " ", list("list.files(dirPath, pattern)"), ", and must identify a single
", " directory. Most uses of this function will focus on a single tile
", " (specified with, e.g., ", list("tiles=1L"), "); the ", list("laneName"), ",
", " ",
list("cycleNames"), ", ", list("tileNames"), ", and ", list("posNames"), "
", " parameters are designed to work with the default Illumina pipeline
", " and do not normally need to be specified.
", "
", " ")), "
", "
", " ", list(list(list("type="IparIntensity"")), list("
", "
", " Intensities are read from Solexa ", list("_pos.txt"), ",
", " ", list("_int.txt.p"), ", ", list("_nse.txt.p"), "-style file triplets. The
", " signature for this method is
",
"
", " ", list("dirPath, pattern=character(0), ...,
", " type="IparIntensity",
", " intExtension="_int.txt.p.gz",
", " nseExtension="_nse.txt.p.gz",
", " posExtension="_pos.txt",
", " withVariability=TRUE, verbose=FALSE"), "
", "
", " Files to be parsed are determined as, e.g., ", list("paste(pattern,
", " intExtension, sep="")"), ".
", "
", " ")), "
", "
", " ", list(list(list("type="SolexaIntensity"")), list("
", "
",
" Intensities are read from Solexa ", list("_int.txt"), " and
", " ", list("_nse.txt"), "-style files. The signature for this method is
", "
", " ", list("dirPath, pattern=character(0), ...,
", " type="SolexaIntensity",
", " intExtension="_int.txt",
", " nseExtension="_nse.txt",
", " withVariability=TRUE, verbose=FALSE"), "
", "
", " Files to be parsed are determined as, e.g., ", list("paste(pattern,
", " intExtension, sep="")"), ".
",
" ")), "
", "
", " ")
## Value
An object derived from class Intensity .
## Author
Martin Morgan mtmorgan@fhcrc.org,
Michael Muratet mmuratet@hudsonalpha.org (RTA).
## Examples
r fl <- system.file("extdata", package="ShortRead") sp <- SolexaPath(fl) int <- readIntensities(sp) int intensity(int)[1,,] # one read intensity(int)[[1:2,,]] # two reads, as 'array' head(rowMeans(intensity(int)))# treated as 'array' head(pData(readIntensityInfo(int))) ## RTA Lane 2, cycles 1:80, cycle iteration 1, tile 3 int <- readIntensities("Data/Intensities", type="RtaIntensity", lane=2, cycles=1:80, tiles=3)
readPrb()
(Legacy) Read Solexa prb files as fastq-style quality scores
Description
readPrb
reads all _prb.txt
files in a directory into a
single object. Most methods (see details) do this by identifying the
maximum base call quality for each cycle and read, and representing
this as an ASCII-encoded character string.
Usage
readPrb(dirPath, pattern = character(0), ...)
Arguments
Argument | Description |
---|---|
dirPath | Directory path or other object (e.g., SolexaPath for which methods are defined. |
pattern | Regular expression matching names of _prb files to be summarized. |
list() | Additional arguments, unused. |
Details
The readPrb,character-method
contains an argument as
that determines the value of the returned object, as follows.
list(" ", " ", " ", list(list(list("as="SolexaEncoding"")), list(" ", " ", " The ASCII encoding of the maximum per cycle and read quality score ", " is encoded using Solexa conventions. ", " ", " ")), " ", " ", " ", list(list(list("as="FastqEncoding"")), list(" ", " ", " The ASCII encoding of the maximum per cycle and read quality score ", " is encoded using Fastq conventions, i.e., ", list("!"), " has value 0. ", " ", " ")), " ", " ", " ", list(
list(list("as="IntegerEncoding"")), list("
", " ", " The maximum per cycle and read quality score is returned as a in ", " integer value. Values are collated into a matrix with number of ", " rows equal to number of reads, and number of columns equal to ", " number of cycles. ", " ", " ")), " ", " ", " ", list(list(list("as="array"")), list(" ", " ", " The quality scores are ", list("not"), " summarized; the return value is ", " an integer array with dimensions corresponding to reads, ",
" nucleotides, and cycles.
", " ", " ")), " ", " ", " ")
Value
An object of class QualityScore , or an integer matrix.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
fl <- system.file("extdata", package="ShortRead")
sp <- SolexaPath(fl)
readPrb(sp, "s_1.*_prb.txt") # all tiles to a single file
readQseq()
(Legacy) Read Solexa qseq files as fastq-style quality scores
Description
readQseq
reads all files matching pattern
in a directory
into a single ShortReadQ -class object.
Information on machine, lane, tile, x, and y coordinates, filtering
status, and read number are not returned (although filtering status
can be used to selectively include reads as described below).
Usage
readQseq(dirPath, pattern = character(0), ...,
as=c("ShortReadQ", "DataFrame", "XDataFrame"),
filtered=FALSE,
verbose=FALSE)
Arguments
Argument | Description |
---|---|
dirPath | Directory path or other object (e.g., SolexaPath ) for which methods are defined. |
pattern | Regular expression matching names of _qseq files to be summarized. |
list() | Additional argument, passed to I/O functions. |
as | character(1) indicating the class of the return type. XDataFrame is included for backward compatibility, but is no longer supported. |
filtered | logical(1) indicating whether to include only those reads passing Solexa filtering? |
verbose | logical(1) indicating whether to report on progress during evaluation. |
Value
An object of class ShortReadQ .
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
fl <- system.file("extdata", package="ShortRead")
sp <- SolexaPath(fl)
readQseq(sp)
readXStringColumns()
Read one or more columns into XStringSet (e.g., DNAStringSet) objects
Description
This function allows short read data components such as DNA sequence,
quality scores, and read names to be read in to XStringSet
(e.g., DNAStringSet
, BStringSet
) objects. One or several
files of identical layout can be specified.
Usage
readXStringColumns(dirPath, pattern=character(0),
colClasses=list(NULL),
nrows=-1L, skip=0L,
sep = " ", header = FALSE, comment.char="#")
Arguments
Argument | Description |
---|---|
dirPath | A character vector giving the directory path (relative or absolute) of files to be read. |
pattern | The ( grep -style) pattern describing file names to be read. The default ( character(0) ) reads all files in dirPath . All files are expected to have identical numbers of columns. |
colClasses | A list of length equal to the number of columns in a file. Columns with corresponding colClasses equal to NULL are ignored. Other entries in colClasses are expected to be character strings describing the base class for the XStringSet . For instance a column of DNA sequences would be specified as "DNAString" . The column would be parsed into a DNAStringSet object. |
nrows | A length 1 integer vector describing the maximum number of XString objects to read into the set. Reads may come from more than one file when dirPath and pattern parse several files and nrow is greater than the number of reads in the first file. |
skip | A length 1 integer vector describing how many lines to skip at the start of each file. |
sep | A length 1 character vector describing the column separator. |
header | A length 1 logical vector indicating whether files include a header line identifying columns. If present, the header of the first file is used to name the returned values. |
comment.char | A length 1 character vector, with a single character that, when appearing at the start of a line, indicates that the entire line should be ignored. Currently there is no way to use comment characters in other than the first position of a line. |
Value
A list, with each element containing an XStringSet
object of
the type corresponding to the non-NULL elements of colClasses
.
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
## valid character strings for colClasses
names(slot(getClass("XString"), "subclasses"))
dirPath <- system.file('extdata', 'maq', package='ShortRead')
colClasses <- rep(list(NULL), 16)
colClasses[c(1, 15, 16)] <- c("BString", "DNAString", "BString")
## read one file
readXStringColumns(dirPath, "out.aln.1.txt", colClasses=colClasses)
## read all files into a single object for each column
res <- readXStringColumns(dirPath, colClasses=colClasses)
renew()
Renew (update) a ShortRead object with new values
Description
Use renew
to update an object defined in ShortRead with
new values. Discover update-able classes and values with
renewable
.
Usage
renewable(x, list())
renew(x, list())
Arguments
Argument | Description |
---|---|
x | For renewable : missing , character(1) , or a class defined in the ShortRead package. For renew : an instance of a class defined in the ShortRead package. |
list() | For renewable , ignored. For renew , named arguments identifying which parts of x are to be renewed. |
Details
When invoked with no arguments renewable
returns a character
vector naming classes that can be renewed.
When invoked with a character(1)
or an instance of a
ShortRead class, a list of the names and values of the elements
that can be renewed. When x
is a character vector naming a
virtual class, then each element of the returned list is a non-virtual
descendant of that class that can be used in renewal. This is not
fully recursive.
renew
is always invoked with the x
argument being an
instance of a class identified by renewable()
. Remaining
arguments are name-value pairs identifying the components of x
that are to be renewed (updated). The name-value pairs must be
consistent with renewable(x)
. The resulting object is checked
for validity. Multiple components of the object can be updated in a
single call to renew
, allowing comparatively efficient complex
transformations.
Value
renewable()
returns a character vector of renewable classes.
renewable(x)
returns a named list. The names correspond to
renewable classes, and the elements of the list correspond to
renewable components of the class.
renew(x,
returns an object of the same class as
x
, but with components of x
replaced by the named values
of list() .
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
## discovery
renewable()
renewable("AlignedRead")
renewable("QualityScore") ## instantiable classes
## example data
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
ap <- analysisPath(sp)
filt <- chromosomeFilter("chr[[:digit:]+].fa")
aln <- readAligned(ap, "s_2_export.txt", "SolexaExport",
filter=filt)
## renew chromosomes from 'chr1.fa' to 'chr1', etc
labels <- sub("\.fa", "", levels(chromosome(aln)))
renew(aln, chromosome=factor(chromosome(aln), labels=labels))
## multiple changes -- update chromosome, offset position
renew(aln, chromosome=factor(chromosome(aln), labels=labels),
position=1L+position(aln))
## oops! invalid instances cannot be constructed
try(renew(aln, position=1:10))
report()
Summarize quality assessment results into a report
Description
This generic function summarizes results from evaluation of
qa
into a report. Available report formats vary
depending on the data analysed.
Usage
report(x, ..., dest=tempfile(), type="html")
report_html(x, dest, type, ...)
Arguments
Argument | Description |
---|---|
x | An object returned by qa , usually derived from class .QA |
list() | Additional arguments used by specific methods. All methods with type="html" support the argument cssFile , which is a named, length 1 character vector. The value is a path to a CSS file to be incorporated into the report (e.g., system.file("template", "QA.css", ). The name of cssFile is the name of the CSS file as seen by the html report (e.g., list("QA.css") ). See specific methods for details on additional list() arguments. |
dest | The output destination for the final report. For type="html" this is a directory; for (deprecated) type="pdf" this is a file. |
type | A text string defining the type of report; available report types depend on the type of object x ; usually this is html . |
Details
report_html
is meant for use by package authors wishing to add
methods for creating HTML reports; users should always invoke
report
.
The following methods are defined: list(" ", " ", " ", list(list(list("x="BowtieQA", ..., dest=tempfile(), type="html"")), list(" ", " Produce an HTML-based report from an object of class ", " ", list(list("BowtieQA")), ".")), " ", " ", " ", list(list(list("x="FastqQA", ..., dest=tempfile(), type="html"")), list(" ", " Produce an HTML-based report from an object of class ", " ", list(list("FastqQA")), ".")), " ", " ", " ", list(list(list("x="MAQMapQA", ..., dest=tempfile(), type="html"")),
list("
", " Produce an HTML-based report from an object of class ", " ", list(list("MAQMapQA")), ".")), " ", " ", " ", list(list(list("x="SolexaExportQA", ..., dest=tempfile(), type="html"")), list(" ", " Produce an HTML-based report from an object of class ", " ", list(list("SolexaExportQA")), ".")), " ", " ", " ", list(list(list("x="SolexaExportQA", ..., dest=tempfile(), type="pdf"")), list(" ", " (Deprecated) Produce an PDF report from an object of class ",
" ", list(list("SolexaExportQA")), ".")), "
", " ", " ", list(list(list("x="SolexaPath", ..., dest=tempfile(), type="html"")), list(" ", " Produce an HTML report by first visiting all ", list("_export.txt"), " ", " files in the ", list("analysisPath"), " directory of ", list("x"), " to create a ", " ", list("SolexaExportQA"), " instance.")), " ", " ", " ", list(list(list("x="SolexaPath", ..., dest=tempfile(), type="pdf"")), list(" ", " (Deprecated) Produce an PDF report by first visiting all ",
" ", list("_export.txt"), " files in the ", list("analysisPath"), " directory of
", " ", list("x"), " to create a ", list("SolexaExportQA"), " instance.")), " ", " ", " ", list(list(" ", " ", list("x="ANY", ..., dest=tempfile(), type="ANY""), " ", " "), list("This method is used internally")), " ", " ")
Value
This function is invoked for its side effect; the return value is the name of the directory or file where the report was created.
Seealso
SolexaExportQA
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showMethods("report")
## default CSS file
cssFile <- c(QA.css=system.file("template", "QA.css",
package="ShortRead"))
noquote(readLines(cssFile))
spViewPerFeature()
Tools to visualize genomic data
Description
Use Snapshot
-class to visualize a specific region of genomic data
Usage
spViewPerFeature(GRL, name, files, ignore.strand=FALSE,
multi.levels = FALSE, fac=character(0L), ...)
Arguments
Argument | Description |
---|---|
GRL | Object GRangeList containing annotation of genomic data. It can be generated by applying exonsBy() or transcriptsBy() to a TxDb instance. See examples below. |
name | Character(1) specifying which element in GRL to be visualized. |
files | Charactor() or BamFileList specifying the file(s) to be visualized. If multiple files, local metadata of the files can be hold by setting a DataFrame (values(files) <- DataFrame(...)). See examples below. |
ignore.strand | Logical(1) indicating whether to ignore the strand of the genomic data. |
multi.levels | Logical(1) indicating whether to plot the coverage of multiple files on different panels. If FALSE , the mean coverage of multiple files would be plotted. |
fac | Character(1) indicating which column of local metadata ( elementMetatdata() ) should be used to group the samples. Ignore |
list() | Arguments used for creating a Snapshot object. |
Value
A Snapshot
instance
Seealso
Author
Chao-Jen Wong cwon2@fhcrc.org
Examples
## Example 1
library(GenomicFeatures)
txdbFile <- system.file("extdata", "sacCer2_sgdGene.sqlite",
package="yeastNagalakshmi")
## either use a txdb file quaried from UCSC or use existing TxDb packages.
txdb <- loadDb(txdbFile)
grl <- exonsBy(txdb, by="gene")
file <- system.file("extdata", "SRR002051.chrI-V.bam",
package="yeastNagalakshmi")
s <- spViewPerFeature(GRL=grl, name="YAL001C", files=file)
## Example 2
## multi-files: using 'BamFileList' and setting up the 'DataFrame'
## holding the phenotype data
bfiles <- BamFileList(c(a=file, b=file))
values(bfiles) <- DataFrame(sampleGroup=factor(c("normal", "tumor")))
values(bfiles)
s <- spViewPerFeature(GRL=grl, name="YAL001C",
files=bfiles, multi.levels=TRUE, fac="sampleGroup")
srFilter()
Functions for user-created and built-in ShortRead filters
Description
These functions create user-defined ( srFitler
) or built-in
instances of SRFilter objects. Filters can be
applied to objects from ShortRead
, returning a logical vector
to be used to subset the objects to include only those components
satisfying the filter.
Usage
srFilter(fun, name = NA_character_, ...)
list(list("srFilter"), list("missing"))(fun, name=NA_character_, ...)
list(list("srFilter"), list("function"))(fun, name=NA_character_, ...)
compose(filt, ..., .name)
idFilter(regex=character(0), fixed=FALSE, exclude=FALSE,
.name="idFilter")
occurrenceFilter(min=1L, max=1L,
withSread=c(NA, TRUE, FALSE),
duplicates=c("head", "tail", "sample", "none"),
.name=.occurrenceName(min, max, withSread,
duplicates))
nFilter(threshold=0L, .name="CleanNFilter")
polynFilter(threshold=0L, nuc=c("A", "C", "T", "G", "other"),
.name="PolyNFilter")
dustyFilter(threshold=Inf, batchSize=NA, .name="DustyFilter")
srdistanceFilter(subject=character(0), threshold=0L,
.name="SRDistanceFilter")
##
## legacy filters for ungapped alignments
##
chromosomeFilter(regex=character(0), fixed=FALSE, exclude=FALSE,
.name="ChromosomeFilter")
positionFilter(min=-Inf, max=Inf, .name="PositionFilter")
strandFilter(strandLevels=character(0), .name="StrandFilter")
alignQualityFilter(threshold=0L, .name="AlignQualityFilter")
alignDataFilter(expr=expression(), .name="AlignDataFilter")
Arguments
Argument | Description |
---|---|
fun | An object of class function to be used as a filter. fun must accept a single named argument x , and is expected to return a logical vector such that x[fun(x)] selects only those elements of x satisfying the conditions of fun |
name | A character(1) object to be used as the name of the filter. The name is useful for debugging and reference. |
filt | A SRFilter object, to be used with additional arguments to create a composite filter. |
.name | An optional character(1) object used to over-ride the name applied to default filters. |
regex | Either character(0) or a character(1) regular expression used as grep(regex, chromosome(x)) to filter based on chromosome. The default ( character(0) ) performs no filtering |
fixed | logical(1) passed to grep , influencing how pattern matching occurs. |
exclude | logical(1) which, when TRUE , uses regex to exclude, rather than include, reads. |
min | numeric(1) |
max | numeric(1) . For positionFilter , min and max define the closed interval in which position must be found min <= position <= max . For occurrenceFilter , min and max define the minimum and maximum number of times a read occurs after the filter. |
strandLevels | Either character(0) or character(1) containing strand levels to be selected. ShortRead objects have standard strand levels NA, "+", "-", "*" , with NA meaning strand information not available and "*" meaning strand information not relevant. |
withSread | A logical(1) indicating whether uniqueness includes the read sequence ( withSread=TRUE ), is based only on chromosome, position, and strand ( withSread=FALSE ), or only the read sequence ( withSread=NA ), as described for occurrenceFilter below.. |
duplicates | Either character{1} , a function name , or a function taking a single argument. Influence how duplicates are handled, as described for occurrenceFilter below. |
threshold | A numeric(1) value representing a minimum ( srdistanceFilter , alignQualityFilter ) or maximum ( nFilter , polynFilter , dustyFilter ) criterion for the filter. The minima and maxima are closed-interval (i.e., x >= threshold , x <= threshold for some property x of the object being filtered). |
nuc | A character vector containing IUPAC symbols for nucleotides or the value "other" corresponding to all non-nucleotide symbols, e.g., N . |
batchSize | NA or an integer(1) vector indicating the number of DNA sequences to be processed simultaneously by dustyFilter . By default, all reads are processed simultaneously. Smaller values use less memory but are computationally less efficient. |
subject | A character() of any length, to be used as the corresponding argument to srdistance . |
expr | A expression to be evaluated with pData(alignData(x)) . |
list() | Additional arguments for subsequent methods; these arguments are not currently used. |
Details
srFilter
allows users to construct their own filters. The
fun
argument to srFilter
must be a function accepting a
single argument x
and returning a logical vector that can be
used to select elements of x
satisfying the filter with
x[fun(x)]
The signature(fun="missing")
method creates a default filter
that returns a vector of TRUE
values with length equal to
length(x)
.
compose
constructs a new filter from one or more existing
filter. The result is a filter that returns a logical vector with
indices corresponding to components of x
that pass all
filters. If not provided, the name of the filter consists of the names
of all component filters, each separated by " o "
.
The remaining functions documented on this page are built-in filters
that accept an argument x
and return a logical vector of
length(x)
indicating which components of x
satisfy the
filter.
idFilter
selects elements satisfying
grep(regex, id(x), fixed=fixed)
.
chromosomeFilter
selects elements satisfying
grep(regex, chromosome(x), fixed=fixed)
.
positionFilter
selects elements satisfying
min <= position(x) <= max
.
strandFilter
selects elements satisfying
match(strand(x), strand, nomatch=0) > 0
.
occurrenceFilter
selects elements that occur >=min
and
<=max
times. withSread
determines how reads will be
treated: TRUE
to include the sread, chromosome, strand, and
position when determining occurrence, FALSE
to include
chromosome, strand, and position, and NA
to include only
sread. The default is withSread=NA
. duplicates
determines how reads with more than max
reads are
treated. head
selects the first max
reads of each set of
duplicates, tail
the last max
reads, and sample
a
random sample of max
reads. none
removes all reads
represented more than max
times. The user can also provide a
function (as used by tapply
) of a single argument to
select amongst reads.
nFilter
selects elements with fewer than threshold
'N'
symbols in each element of sread(x)
.
polynFilter
selects elements with fewer than threshold
copies of any nucleotide indicated by nuc
.
dustyFilter
selects elements with high sequence complexity, as
characterized by their dustyScore
. This emulates the
dust
command from WindowMaker
software. Calculations can be memory intensive; use
batchSize
to process the argument to dustyFilter
in
batches of the specified size.
srdistanceFilter
selects elements at an edit distance greater
than threshold
from all sequences in subject
.
alignQualityFilter
selects elements with alignQuality(x)
greater than threshold
.
alignDataFilter
selects elements with
pData(alignData(x))
satisfying expr
. expr
should
be formulated as though it were to be evaluated as
eval(expr, pData(alignData(x)))
.
Value
srFilter
returns an object of SRFilter .
Built-in filters return a logical vector of length(x)
, with
TRUE
indicating components that pass the filter.
Seealso
SRFilter .
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
aln <- readAligned(sp, "s_2_export.txt") # Solexa export file, as example
# a 'chromosome 5' filter
filt <- chromosomeFilter("chr5.fa")
aln[filt(aln)]
# filter during input
readAligned(sp, "s_2_export.txt", filter=filt)
# x- and y- coordinates stored in alignData, when source is SolexaExport
xy <- alignDataFilter(expression(abs(x-500) > 200 & abs(y-500) > 200))
aln[xy(aln)]
# both filters as a single filter
chr5xy <- compose(filt, xy)
aln[chr5xy(aln)]
# both filters as a collection
filters <- c(filt, xy)
subsetByFilter(aln, filters)
summary(filters, aln)
# read, chromosome, strand, position tuples occurring exactly once
aln[occurrenceFilter(withSread=TRUE, duplicates="none")(aln)]
# reads occurring exactly once
aln[occurrenceFilter(withSread=NA, duplicates="none")(aln)]
# chromosome, strand, position tuples occurring exactly once
aln[occurrenceFilter(withSread=FALSE, duplicates="none")(aln)]
# custom filter: minimum calibrated base call quality >20
goodq <- srFilter(function(x) {
apply(as(quality(x), "matrix"), 1, min, na.rm=TRUE) > 20
}, name="GoodQualityBases")
goodq
aln[goodq(aln)]
srdistance()
Edit distances between reads and a small number of short references
Description
srdistance
calculates the edit distance from each read in
pattern
to each read in subject
. The underlying
algorithm pairwiseAlignment
is only efficient when both
reads are short, and when the number of subject
reads is small.
Usage
srdistance(pattern, subject, ...)
Arguments
Argument | Description |
---|---|
pattern | An object of class DNAStringSet containing reads whose edit distance is desired. |
subject | A short character vector, DNAString or (small) DNAStringSet to serve as reference. |
list() | additional arguments, unused. |
Details
The underlying algorithm performs pairwise alignment from each read in
pattern
to each sequence in subject
. The return value is
a list of numeric vectors of distances, one list element for each
sequence in subject
. The vector in each list element contains
for each read in pattern
the edit distance from the read to the
corresponding subject. The weight matrix and gap penalties used to
calculate the distance are structured to weight base substitutions and
single base insert/deletions equally. Edit distance between known and
ambiguous (e.g., N) nucleotides, or between ambiguous nucleotides, are
weighted as though each possible nucleotide in the ambiguity were
equally likely.
Value
A list of length equal to that of subject
. Each element is a
numeric vector equal to the length of pattern
, with values
corresponding to the minimum distance between between the
corresponding pattern and subject sequences.
Seealso
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
aln <- readAligned(sp, "s_2_export.txt")
polyA <- polyn("A", 35)
polyT <- polyn("T", 35)
d1 <- srdistance(clean(sread(aln)), polyA)
d2 <- srdistance(sread(aln), polyA)
d3 <- srdistance(sread(aln), c(polyA, polyT))
srduplicated()
Order, sort, and find duplicates in XStringSet objects
Description
These generics order, rank, sort, and find duplicates in short read
objects, including fastq-encoded qualities. srorder
,
srrank
and srsort
differ from the default functions
rank
, order
and sort
in that sorting is based on
an internally-defined order rather than, e.g., the order implied by
LC_COLLATE
.
Usage
srorder(x, ...)
srrank(x, ...)
srsort(x, ...)
srduplicated(x, ...)
Arguments
Argument | Description |
---|---|
x | The object to be sorted, ranked, ordered, or to have duplicates identified; see the examples below for objects for which methods are defined. |
list() | Additional arguments available for use by methods; usually ignored. |
Details
Unlike sort
and friends, the implementation does not preserve
order of duplicated elements. Like duplicated
, one element in
each set of duplicates is marked as FALSE
.
srrank
settles ties using the list("min") criterion described
in rank
, i.e., identical elements are ranked equal to the
rank of the first occurrence of the sorted element.
The following methods are defined, in addition to methods described in class-specific documentation: list(" ", " ", " ", list(list("srsort"), list(list("signature(x = "XStringSet")"), ":")), " ", " ", list(list("srorder"), list(list("signature(x = "XStringSet")"), ":")), " ", " ", list(list("srduplicated"), list(list("signature(x = "XStringSet")"), ": ", " ", " Apply ", list("srorder"), ", ", list("srrank"), ", ", list("srsort"), ", ", " ", list("srduplicated"), " to ", " ", list(list("XStringSet")), " objects such ", " as those returned by ", list(list(
"sread")), ".")), "
", " ", " ", " ", list(list("srsort"), list(list("signature(x = "ShortRead")"), ":")), " ", " ", list(list("srorder"), list(list("signature(x = "ShortRead")"), ":")), " ", " ", list(list("srduplicated"), list(list("signature(x = "ShortRead")"), ": ", " ", " Apply ", list("srorder"), ", ", list("srrank"), ", ", list("srsort"), ", ", " ", list("srduplicated"), " to ", " ", list(list("XStringSet")), " objects to ", " the ", list("sread"),
" component of ", list(list("ShortRead")), " and
", " derived objects.")), " ", " ", " ")
Value
The functions return the following values:
*
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showMethods("srsort")
showMethods("srorder")
showMethods("srduplicated")
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")
sum(srduplicated(sread(rfq)))
srsort(sread(rfq))
srsort(quality(rfq))
tables()
Summarize XStringSet read frequencies
Description
This generic summarizes the number of times each sequence occurs in an
XStringSet
instance.
Usage
tables(x, n=50, ...)
Arguments
Argument | Description |
---|---|
x | An object for which a tables method is defined. |
n | An integer(1) value determining how many named sequences will be present in the top portion of the return value. |
list() | Additional arguments available to methods |
Details
Methods of this generic summarize the frequency with which each read occurs, There are two components to the summary. The reads are reported from most common to least common; typically a method parameter controls how many reads to report. Methods also return a pair of vectors describing how many reads were represented 1, 2, ... times.
The following methods are defined, in addition to methods described in class-specific documentation: list(" ", " ", " ", list(list("tables"), list(list("signature(x= "XStringSet", n = 50)"), ": ", " Apply ", list("tables"), " to the ", list("XStringSet"), " ", list("x"), ".")), " ", " ", " ")
Value
A list of length two.
*
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showMethods("tables")
sp <- SolexaPath(system.file("extdata", package="ShortRead"))
aln <- readAligned(sp)
tables(sread(aln), n=6)
lattice::xyplot(log10(nReads)~log10(nOccurrences),
tables(sread(aln))$distribution)
trimTails()
Trim ends of reads based on nucleotides or qualities
Description
These generic functions remove leading or trailing nucleotides or
qualities. trimTails
and trimTailw
remove low-quality
reads from the right end using a sliding window ( trimTailw
) or
a tally of (successive) nucleotides falling at or below a quality
threshold ( trimTails
). trimEnds
takes an alphabet of
characters to remove from either left or right end.
Usage
## S4 methods for 'ShortReadQ', 'FastqQuality', or 'SFastqQuality'
trimTailw(object, k, a, halfwidth, ..., ranges=FALSE)
trimTails(object, k, a, successive=FALSE, ..., ranges=FALSE)
trimEnds(object, a, left=TRUE, right=TRUE, relation=c("<=", "=="),
..., ranges=FALSE)
list(list("trimTailw"), list("BStringSet"))(object, k, a, halfwidth, ..., alphabet, ranges=FALSE)
list(list("trimTails"), list("BStringSet"))(object, k, a, successive=FALSE, ...,
alphabet, ranges=FALSE)
list(list("trimTailw"), list("character"))(object, k, a, halfwidth, ..., destinations, ranges=FALSE)
list(list("trimTails"), list("character"))(object, k, a, successive=FALSE, ..., destinations, ranges=FALSE)
list(list("trimEnds"), list("character"))(object, a, left=TRUE, right=TRUE, relation=c("<=", "=="),
..., destinations, ranges=FALSE)
Arguments
Argument | Description |
---|---|
object | An object (e.g., ShortReadQ and derived classes; see below to discover these methods) or character vector of fastq file(s) to be trimmed. |
k | integer(1) describing the number of failing letters required to trigger trimming. |
a | For trimTails and trimTailw , a character(1) with nchar(a) == 1L giving the letter at or below which a nucleotide is marked as failing. For trimEnds a character() with all nchar() == giving the letter at or below which a nucleotide or quality scores marked for removal. |
halfwidth | The half width (cycles before or after the current; e.g., a half-width of 5 would span 5 + 1 + 5 cycles) in which qualities are assessed. |
successive | logical(1) indicating whether failures can occur anywhere in the sequence, or must be successive. If successive=FALSE , then the k'th failed letter and subsequent are removed. If successive=TRUE , the first succession of k failed and subsequent letters are removed. |
left, right | logical(1) indicating whether trimming is from the left or right ends. |
relation | character(1) selected from the argument values, i.e., <= or == indicating whether all letters at or below the alphabet(object) are to be removed, or only exact matches. |
list() | Additional arguments, perhaps used by methods. |
destinations | For object of type character() , an equal-length vector of destination files. Files must not already exist. |
alphabet | character() (ordered low to high) letters on which quality scale is measured. Usually supplied internally (user does not need to specify). If missing, then set to ASCII characters 0-127. |
ranges | logical(1) indicating whether the trimmed object, or only the ranges satisfying the trimming condition, be returned. |
Details
trimTailw
starts at the left-most nucleotide, tabulating the
number of cycles in a window of 2 * halfwidth + 1
surrounding
the current nucleotide with quality scores that fall at or below
a
. The read is trimmed at the first nucleotide for which this
number >= k
. The quality of the first or last nucleotide is
used to represent portions of the window that extend beyond the
sequence.
trimTails
starts at the left-most nucleotide and accumulates
cycles for which the quality score is at or below a
. The read
is trimmed at the first location where this number >= k
. With
successive=TRUE
, failing qualities must occur in strict
succession.
trimEnds
examines the left
, right
, or both ends
of object
, marking for removal letters that correspond to
a
and relation
. The trimEnds,ShortReadQ-method
trims based on quality.
ShortReadQ
methods operate on quality scores; use
sread()
and the ranges
argument to trim based on
nucleotide (see examples).
character
methods transform one or several fastq files to new
fastq files, applying trim operations based on quality scores; use
filterFastq
with your own filter
argument to filter on
nucleotides.
Value
An instance of class(object)
trimmed to contain only those
nucleotides satisfying the trim criterion or, if ranges=TRUE
an
IRanges
instance defining the ranges that would trim
object
.
Note
The trim*
functions use OpenMP threads (when available) during
creation of the return value. This may sometimes create problems when
a process is already running on multiple threads, e.g., with an error
message like list("
", " libgomp: Thread creation failed: Resource temporarily unavailable
", " ") A solution is to precede problematic code
with the following code snippet, to disable threading list("
", " nthreads <- .Call(ShortRead:::.set_omp_threads, 1L)
", " on.exit(.Call(ShortRead:::.set_omp_threads, nthreads))
", " ")
Author
Martin Morgan mtmorgan@fhcrc.org
Examples
showMethods(trimTails)
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
rfq <- readFastq(analysisPath(sp), pattern="s_1_sequence.txt")
## remove leading / trailing quality scores <= 'I'
trimEnds(rfq, "I")
## remove leading / trailing 'N's
rng <- trimEnds(sread(rfq), "N", relation="==", ranges=TRUE)
narrow(rfq, start(rng), end(rng))
## remove leading / trailing 'G's or 'C's
trimEnds(rfq, c("G", "C"), relation="==")