bioconductor v3.9.0 VariantAnnotation
Annotate variants, compute amino acid coding changes,
Link to this section Summary
Functions
Convert genotype likelihoods to genotype probabilities
PROVEANDb objects
PolyPhenDb Columns
PolyPhenDb objects
SIFTDb Columns
SIFTDb objects
Parameters for scanning VCF files
VCFHeader instances
VCF class objects
VRangesList objects
VRanges objects
VariantType subclasses
Manipulate Variant Call Format (Vcf) files.
Defunct Functions in Package VariantAnnotation
Filter VCF files
Convert genotype calls from a VCF file to a SnpMatrix object
Get transcript sequences
Create Index files for VCF file
Identification of genomic variant types.
Locate variants
Predict amino acid coding changes for variants
Convert posterior genotype probability to a SnpMatrix object
Read VCF files
Import VCF files
Get seqinfo for VCF file
Counts and distribution statistics for SNPs in a VCF object
Summarize variants by sample
Write VCF files
Link to this section Functions
GLtoGP()
Convert genotype likelihoods to genotype probabilities
Description
Convert an array of genotype likelihoods to posterior genotype probabilities.
Usage
GLtoGP(gl)
PLtoGP(pl)
Arguments
Argument | Description |
---|---|
gl | Array of genotype likelihoods (log10-scaled). The format can be a matrix of lists, or a three-dimensional array in which the third dimension corresponds to the probabilities for each genotype. |
pl | Array of genotype likelihoods (phred-scaled, i.e. -10*log10). The format can be a matrix of lists, or a three-dimensional array in which the third dimension corresponds to the probabilities for each genotype. |
Details
GLtoGP
computes the probability of each genotype as 10^x / sum(10^x)
.
PLtoGP
first divides by -10 and then proceeds as in GLtoGP
.
Value
An array of posterior genotype probabilities, in the same format as the input (matrix of lists or 3D array).
Seealso
Author
Stephanie Gogarten sdmorris@u.washington.edu
Examples
## Read a vcf file with a "GL" field.
vcfFile <- system.file("extdata", "gl_chr1.vcf", package="VariantAnnotation")
vcf <- readVcf(vcfFile, "hg19")
## extract genotype likelihoods as a matrix of lists
gl <- geno(vcf)$GL
class(gl)
mode(gl)
# convert to posterior probabilities
gp <- GLtoGP(gl)
## Read a vcf file with a "PL" field.
vcfFile <- system.file("extdata", "hapmap_exome_chr22.vcf.gz",
package="VariantAnnotation")
vcf <- readVcf(vcfFile, "hg19")
## extract genotype likelihoods as a matrix of lists
pl <- geno(vcf)$PL
class(pl)
mode(pl)
# convert to posterior probabilities
gp <- PLtoGP(pl)
PROVEANDb_class()
PROVEANDb objects
Description
The PROVEANDb class is a container for storing a connection to a PROVEAN sqlite database.
Details
The SIFT tool is no longer actively maintained. A few of the orginal authors have started the PROVEAN (Protein Variation Effect Analyzer) project. PROVEAN is a software tool which predicts whether an amino acid substitution or indel has an impact on the biological function of a protein. PROVEAN is useful for filtering sequence variants to identify nonsynonymous or indel variants that are predicted to be functionally important.
See the web pages for a complete description of the methods.
PROVEAN Home: http://provean.jcvi.org/index.php/
SIFT Home: http://sift.jcvi.org/
Though SIFT is not under active development, the PROVEAN team still
provids the SIFT scores in the pre-computed downloads. This package,
SIFT.Hsapiens.dbSNP137
, contains both SIFT and PROVEAN scores.
One notable difference between this and the previous SIFT database
package is that keys
in SIFT.Hsapiens.dbSNP132
are
rs IDs whereas in SIFT.Hsapiens.dbSNP137
they are NCBI dbSNP IDs.
Author
Valerie Obenchain
References
The PROVEAN tool has replaced SIFT: http://provean.jcvi.org/about.php
Choi Y, Sims GE, Murphy S, Miller JR, Chan AP (2012) Predicting the Functional Effect of Amino Acid Substitutions and Indels. PLoS ONE 7(10): e46688.
Choi Y (2012) A Fast Computation of Pairwise Sequence Alignment Scores Between a Protein and a Set of Single-Locus Variants of Another Protein. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine (BCB '12). ACM, New York, NY, USA, 414-417.
Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc. 2009;4(7):1073-81
Ng PC, Henikoff S. Predicting the Effects of Amino Acid Substitutions on Protein Function Annu Rev Genomics Hum Genet. 2006;7:61-80.
Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003 Jul 1;31(13):3812-4.
Examples
if (require(SIFT.Hsapiens.dbSNP137)) {
## metadata
metadata(SIFT.Hsapiens.dbSNP137)
## keys are the DBSNPID (NCBI dbSNP ID)
dbsnp <- keys(SIFT.Hsapiens.dbSNP137)
head(dbsnp)
columns(SIFT.Hsapiens.dbSNP137)
## Return all columns. Note that the key, DBSNPID,
## is always returned.
select(SIFT.Hsapiens.dbSNP137, dbsnp[10])
## subset on keys and cols
cols <- c("VARIANT", "PROVEANPRED", "SIFTPRED")
select(SIFT.Hsapiens.dbSNP137, dbsnp[20:23], cols)
}
PolyPhenDbColumns()
PolyPhenDb Columns
Description
Description of the PolyPhen Sqlite Database Columns
Seealso
?PolyPhenDb
Author
Valerie Obenchain
PolyPhenDb_class()
PolyPhenDb objects
Description
The PolyPhenDb class is a container for storing a connection to a PolyPhen sqlite database.
Details
PolyPhen (Polymorphism Phenotyping) is a tool which predicts the possible impact of an amino acid substitution on the structure and function of a human protein by applying empirical rules to the sequence, phylogenetic and structural information characterizing the substitution.
PolyPhen makes its predictions using UniProt features, PSIC profiles scores derived from multiple alignment and matches to PDP or PQS structural databases. The procedure can be roughly outlined in the following steps, see the references for complete details,
sequence-based characterization of substitution site
calculation of PSIC profile scores for two amino acid variants
calculation of structural parameters and contacts
prediction
PolyPhen uses empirically derived rules to predict that a non-synonymous SNP isprobably damaging : it is with high confidence supposed to affect protein function or structure
possibly damaging : it is supposed to affect protein function or structure
benign : most likely lacking any phenotypic effect
unknown : when in some rare cases, the lack of data do not allow PolyPhen to make a prediction
Seealso
?PolyPhenDbColumns
Author
Valerie Obenchain
References
PolyPhen Home: http://genetics.bwh.harvard.edu/pph2/dokuwiki/
Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR. Nat Methods 7(4):248-249 (2010).
Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res 30(17):3894-3900 (2002).
Sunyaev SR, Eisenhaber F, Rodchenkov IV, Eisenhaber B, Tumanyan VG, Kuznetsov EN. PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. Protein Eng 12(5):387-394 (1999).
Examples
library(PolyPhen.Hsapiens.dbSNP131)
## metadata
metadata(PolyPhen.Hsapiens.dbSNP131)
## available rsid's
head(keys(PolyPhen.Hsapiens.dbSNP131))
## column descriptions found at ?PolyPhenDbColumns
columns(PolyPhen.Hsapiens.dbSNP131)
## subset on keys and columns
subst <- c("AA1", "AA2", "PREDICTION")
rsids <- c("rs2142947", "rs4995127", "rs3026284")
select(PolyPhen.Hsapiens.dbSNP131, keys=rsids, columns=subst)
## retrieve substitution scores
subst <- c("IDPMAX", "IDPSNP", "IDQMIN")
select(PolyPhen.Hsapiens.dbSNP131, keys=rsids, columns=subst)
## retrieve the PolyPhen-2 classifiers
subst <- c("PPH2CLASS", "PPH2PROB", "PPH2FPR", "PPH2TPR", "PPH2FDR")
select(PolyPhen.Hsapiens.dbSNP131, keys=rsids, columns=subst)
## duplicate groups of rsid's
duplicateRSID(PolyPhen.Hsapiens.dbSNP131, c("rs71225486", "rs1063796"))
SIFTDbColumns()
SIFTDb Columns
Description
Description of the SIFT Sqlite Database Columns
Seealso
?SIFTDb
Author
Valerie Obenchain
SIFTDb_class()
SIFTDb objects
Description
The SIFTDb class is a container for storing a connection to a SIFT sqlite database.
Details
SIFT is a sequence homology-based tool that sorts intolerant from tolerant amino acid substitutions and predicts whether an amino acid substitution in a protein will have a phenotypic effect. SIFT is based on the premise that protein evolution is correlated with protein function. Positions important for function should be conserved in an alignment of the protein family, whereas unimportant positions should appear diverse in an alignment.
SIFT uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. The procedure can be outlined in the following steps,
search for similar sequences
choose closely related sequences that may share similar function to the query sequence
obtain the alignment of the chosen sequences
calculate normalized probabilities for all possible substitutions from the alignment. Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated.
Author
Valerie Obenchain
References
SIFT Home: http://sift.jcvi.org/
Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc. 2009;4(7):1073-81
Ng PC, Henikoff S. Predicting the Effects of Amino Acid Substitutions on Protein Function Annu Rev Genomics Hum Genet. 2006;7:61-80.
Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003 Jul 1;31(13):3812-4.
Examples
if (interactive()) {
library(SIFT.Hsapiens.dbSNP132)
## metadata
metadata(SIFT.Hsapiens.dbSNP132)
## available rsid's
head(keys(SIFT.Hsapiens.dbSNP132))
## for column descriptions see ?SIFTDbColumns
columns(SIFT.Hsapiens.dbSNP132)
## subset on keys and columns
rsids <- c("rs2142947", "rs17970171", "rs8692231", "rs3026284")
subst <- c("RSID", "PREDICTION", "SCORE")
select(SIFT.Hsapiens.dbSNP132, keys=rsids, columns=subst)
select(SIFT.Hsapiens.dbSNP132, keys=rsids[1:2])
}
ScanVcfParam_class()
Parameters for scanning VCF files
Description
Use ScanVcfParam()
to create a parameter object influencing
which records and fields are imported from a VCF file. Record
parsing is based on genomic coordinates and requires a Tabix index
file. Individual VCF elements can be specified in the fixed ,
info , geno and samples arguments.
Usage
ScanVcfParam(fixed=character(), info=character(), geno=character(),
samples=character(), trimEmpty=TRUE, which, ...)
## Getters and Setters
vcfFixed(object)
vcfFixed(object) <- value
vcfInfo(object)
vcfInfo(object) <- value
vcfGeno(object)
vcfGeno(object) <- value
vcfSamples(object)
vcfSamples(object) <- value
vcfTrimEmpty(object)
vcfTrimEmpty(object) <- value
vcfWhich(object)
vcfWhich(object) <- value
Arguments
Argument | Description |
---|---|
fixed | A character() vector of fixed fields to be returned. Possible values are ALT, QUAL and FILTER. The CHROM, POS, ID and REF fields are needed to create the GRanges of variant locations. Because these are essential fields there is no option to request or omit them. If not specified, all fields are returned; if fixed=NA only REF is returned. |
info | A character() vector naming the INFO fields to return. scanVcfHeader() returns a vector of available fields. If not specified, all fields are returned; if info=NA no fields are returned. |
geno | A character() vector naming the GENO fields to return. scanVcfHeader() returns a vector of available fields. If not specified, all fields are returned; if geno=NA no fields are returned and requests for specific samples are ignored. |
samples | A character() vector of sample names to return. samples(scanVcfHeader()) returns all possible names. If not specified, data for all samples are returned; if either samples=NA or geno=NA no fields are returned. Requests for specific samples when geno=NA are ignored. |
trimEmpty | A logical(1) indicating whether GENO fields with no values should be returned. |
which | A GRanges describing the sequences and ranges to be queried. Variants whose POS lies in the interval(s) [start, end] are returned. If which is not specified all ranges are returned. |
object | An instance of class ScanVcfParam . |
value | An instance of the corresponding slot, to be assigned to object . |
list() | Arguments passed to methods. |
Seealso
Author
Martin Morgan and Valerie Obenchain
Examples
ScanVcfParam()
fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
compressVcf <- bgzip(fl, tempfile())
idx <- indexTabix(compressVcf, "vcf")
tab <- TabixFile(compressVcf, idx)
## ---------------------------------------------------------------------
## 'which' argument
## ---------------------------------------------------------------------
## To subset on genomic coordinates, supply an object
## containing the ranges of interest. These ranges can
## be given directly to the 'param' argument or wrapped
## inside ScanVcfParam() as the 'which' argument.
## When using a list, the outer list names must correspond to valid
## chromosome names in the vcf file. In this example they are "1"
## and "2".
gr1 <- GRanges("1", IRanges(13219, 2827695, name="regionA"))
gr2 <- GRanges(rep("2", 2), IRanges(c(321680, 14477080),
c(321689, 14477090), name=c("regionB", "regionC")))
grl <- GRangesList("1"=gr1, "2"=gr2)
vcf <- readVcf(tab, "hg19", grl)
## Names of the ranges are in the 'paramRangeID' metadata column of the
## GRanges object returned by the rowRanges() accessor.
rowRanges(vcf)
## which can be used for subsetting the VCF object
vcf[rowRanges(vcf)$paramRangeID == "regionA"]
## When using ranges, the seqnames must correspond to valid
## chromosome names in the vcf file.
gr <- unlist(grl, use.names=FALSE)
vcf <- readVcf(tab, "hg19", gr)
## ---------------------------------------------------------------------
## 'fixed', 'info', 'geno' and 'samples' arguments
## ---------------------------------------------------------------------
## This param specifies the "GT" 'geno' field for a single sample
## and the subset of ranges in 'which'. All 'fixed' and 'info' fields
## will be returned.
ScanVcfParam(geno="GT", samples="NA00002", which=gr)
## Here two 'fixed' and one 'geno' field are specified
ScanVcfParam(fixed=c("ALT", "QUAL"), geno="GT", info=NA)
## Return only the 'fixed' fields
ScanVcfParam(geno=NA, info=NA)
VCFHeader_class()
VCFHeader instances
Description
The VCFHeader
class holds Variant Call Format (VCF) file
header information and is produced from a call to scanVcfHeader
.
Details
The VCFHeader
class holds header information from a
VCF file.
Slots : list(" ", " ", list(list(list("reference")), list("character() vector ", " ")), " ", " ", list(list(list("sample")), list("character() vector ", " ")), " ", " ", list(list(list("header")), list(list("DataFrameList"), "-class ", " ")), " ", " ")
Seealso
Author
Valerie Obenchain
Examples
fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
hdr <- scanVcfHeader(fl)
fixed(hdr)
info(hdr)
geno(hdr)
vcfFields(hdr)
VCF_class()
VCF class objects
Description
The VCF class is a virtual class extended from
RangedSummarizedExperiment
. The subclasses, CompressedVCF
and ExtendedVCF
, are containers for holding data from
Variant Call Format files.
Details
The VCF
class is a virtual class with two concrete subclasses,
CollapsedVCF
and ExtendedVCF
.
Slots unique to VCF
and subclasses,
fixed
: A DataFrame containing the REF, ALT, QUAL and FILTER fields from a VCF file.info
: A DataFrame containing the INFO fields from a VCF file.
Slots inherited from RangedSummarizedExperiment
,
metadata
: Alist
containing the file header or other information about the overall experiment.rowRanges
: A GRanges -class instance defining the variant ranges and associated metadata columns of REF, ALT, QUAL and FILTER. While the REF, ALT, QUAL and FILTER fields can be displayed as metadata columns they cannot be modified withrowRanges<-
. To modify these fields usefixed<-
.colData
: A DataFrame -class instance describing the samples and associated metadata.geno
: Theassays
slot fromRangedSummarizedExperiment
has been renamed asgeno
for the VCF class. This slot contains the genotype information immediately following the FORMAT field in a VCF file. Each element of thelist
orSimpleList
is a matrix or array.
It is expected that users will not create instances of the VCF class
but instead one of the concrete subclasses, CollapsedVCF or ExpandVCF.
CollapsedVCF contains the ALT data as a DNAStringSetList
allowing
for multiple alleles per variant. The ExpandedVCF ALT data is a
DNAStringSet
where the ALT column has been expanded to create a
flat form of the data with one row per variant-allele combination. In
the case of strucutral variants, ALT will be a CompressedCharacterList
or character
in the collapsed or expanded forms.
Seealso
GRanges ,
DataFrame ,
SimpleList ,
RangedSummarizedExperiment ,
readVcf
,
writeVcf
isSNV
Author
Valerie Obenchain
Examples
## readVcf() parses data into a VCF object:
fl <- system.file("extdata", "structural.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
## ----------------------------------------------------------------
## Accessors
## ----------------------------------------------------------------
## Variant locations are stored in the GRanges object returned by
## the rowRanges() accessor.
rowRanges(vcf)
## Suppress fixed fields:
rowRanges(vcf, fixed=FALSE)
## Individual fields can be extracted with ref(), alt(), qual(), filt() etc.
qual(vcf)
ref(vcf)
head(info(vcf))
## All available VCF field names can be contracted with vcfFields().
vcfFields(vcf)
## Extract genotype fields with geno(). Access specific fields with
## '$' or '[['.
geno(vcf)
identical(geno(vcf)$GQ, geno(vcf)[[2]])
## ----------------------------------------------------------------
## Renaming seqlevels and subsetting
## ----------------------------------------------------------------
## Overlap and matching operations require that the objects
## being compared have the same seqlevels (chromosome names).
## It is often the case that the seqlevesls in on of the objects
## needs to be modified to match the other. In this VCF, the
## seqlevels are numbers instead of preceded by "chr" or "ch".
seqlevels(vcf)
## Rename the seqlevels to start with 'chr'.
vcf2 <- renameSeqlevels(vcf, paste0("chr", seqlevels(vcf)))
seqlevels(vcf2)
## The VCF can also be subset by seqlevel using 'keepSeqlevels'
## or 'dropSeqlevels'. See ?keepSeqlevels for details.
vcf3 <- keepSeqlevels(vcf2, "chr2", pruning.mode="coarse")
seqlevels(vcf3)
## ----------------------------------------------------------------
## Header information
## ----------------------------------------------------------------
## Header data can be modified in the 'meta', 'info' and 'geno'
## slots of the VCFHeader object. See ?VCFHeader for details.
## Current 'info' fields.
rownames(info(header(vcf)))
## Add a new field to the header.
newInfo <- DataFrame(Number=1, Type="Integer",
Description="Number of Samples With Data",
row.names="NS")
info(header(vcf)) <- rbind(info(header(vcf)), newInfo)
rownames(info(header(vcf)))
## ----------------------------------------------------------------
## Collapsed and Expanded VCF
## ----------------------------------------------------------------
## readVCF() produces a CollapsedVCF object.
fl <- system.file("extdata", "ex2.vcf",
package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
vcf
## The ALT column is a DNAStringSetList to allow for more
## than one alternate allele per variant.
alt(vcf)
## For structural variants ALT is a CharacterList.
fl <- system.file("extdata", "structural.vcf",
package="VariantAnnotation")
vcf <- readVcf(fl, genome="hg19")
alt(vcf)
## ExpandedVCF is the 'flattened' counterpart of CollapsedVCF.
## The ALT and all variables with Number='A' in the header are
## expanded to one row per alternate allele.
vcfLong <- expand(vcf)
alt(vcfLong)
## Also see the ?VRanges class for an alternative form of
## 'flattened' VCF data.
## ----------------------------------------------------------------
## isSNV()
## ----------------------------------------------------------------
## isSNV() returns a subset VCF containing SNVs only.
vcf <- VCF(rowRanges = GRanges("chr1", IRanges(1:4*3, width=c(1, 2, 1, 1))))
alt(vcf) <- DNAStringSetList("A", c("TT"), c("G", "A"), c("TT", "C"))
ref(vcf) <- DNAStringSet(c("G", c("AA"), "T", "G"))
fixed(vcf)[c("REF", "ALT")]
## SNVs are present in rows 1 (single ALT value), 3 (both ALT values)
## and 4 (1 of the 2 ALT values).
vcf[isSNV(vcf, singleAltOnly=TRUE)]
vcf[isSNV(vcf, singleAltOnly=FALSE)] ## all 3 SNVs
VRangesList_class()
VRangesList objects
Description
VRangesList is a virtual class representing a list of
VRanges objects and should behave much like any
other derivative of List
. It has both a simple and
compressed implementation. VRangesList provides conveniences for
manipulating sets of VRanges
objects.
Author
Michael Lawrence
Examples
## construction
example(VRanges)
vrl <- VRangesList(sampleA = vr, sampleB = vr)
stackSamples(vrl)
VRanges_class()
VRanges objects
Description
The VRanges class is a container for variant calls, including SNVs and
indels. It extends GRanges
to provide
special semantics on top of a simple vector of genomic locations. While
it is not as expressive as the VCF object, it is
a simpler alternative that may be convenient for variant
calling/filtering and similar exercises.
Details
VRanges extends GRanges to store the following components. Except where noted, the components are considered columns in the dataset, i.e., their lengths match the number of variants. Many columns can be stored as either an atomic vector or an Rle. list(" ", " ", list(list(list("ref")), list("(", list("character"), "), the reference ", " allele. The range (start/end/width) should always correspond to ", " this sequence.")), " ", " ", list(list(list("alt")), list("(", list("character/Rle"), "), ", " the alternative allele (NA allowed). By convention there is only ", " a single alt allele per element (row) of the VRanges. Many methods, ", " like ", list("match"), ", make this assumption. ", " ")), " ",
" ", list(list(list("refCount")), list("(", list("integer/Rle"), "), read count for the
", " reference allele (NA allowed)")), " ", " ", list(list(list("altCount")), list("(", list("integer/Rle"), "), read count for the ", " alternative allele (NA allowed)")), " ", " ", list(list(list("totalCount")), list("(", list("integer/Rle"), "), total read count at the ", " position, must be at least ", list("refCount+altCount"), " (NA allowed)")), " ", " ", list(list(list(
"sampleNames")), list("(", list("factor/Rle"), "), name of the sample -
", " results from multiple samplse can be combined into the same object ", " (NA allowed)")), " ", " ", list(list(list("softFilterMatrix")), list("(", list("matrix/FilterMatrix"), "), ", " variant by filter matrix, ", list("TRUE"), " where variant passed the ", " filter; use a ", list(list("FilterMatrix")), " to store the ", " actual ", list("FilterRules"), " object that was applied")),
"
", " ", list(list(list("hardFilters")), list("(", list("FilterRules"), ") record of hard
", " filters applied, i.e., only the variants that passed the filters
", " are present in this object; this is the only component that is not
", " a column, i.e., its length does not match the number of variants")), "
", " ")
Except in the special circumstances described here, a VRanges
may be treated like a GRanges
. The range should span the
sequence in ref
. Indels are typically represented by the VCF
convention, i.e., the start position is one upstream of the event. The
strand is always constrained to be positive (+).
Indels, by convention, should be encoded VCF-style, with the upstream
reference base prepended to the indel sequence. The ref/alt for a
deletion of GCGT before A might be AGCGT/A and for an insertion might
be A/AGCGT. Since the range always matches the ref
sequence,
this means a deletion will be the width of the deletion + 1, and an
insertion is always of width 1.
VRanges and the VCF class: The VRanges and VCF classes encode different types of information and are semantically incompatible. While methods exist for converting a VCF object to a VRanges and vice versa, information is lost in the transformation. There is no way to collapse multiple rows of a VRanges at the same genomic position and accurately represent missing data. For this reason, it is not reasonable to assume that an object resulting from multiple conversions (VRanges -> VCF -> VRanges) will be equivalent to the original.
Seealso
VRangesList , a list of VRanges
; bam_tally
in the
gmapR package, which generates a VRanges
.
Author
Michael Lawrence. makeVRangesFromGRanges
was contributed
by Thomas Sandmann.
Examples
## construction
vr <- VRanges(seqnames = c("chr1", "chr2"),
ranges = IRanges(c(1, 10), c(5, 20)),
ref = c("T", "A"), alt = c("C", "T"),
refDepth = c(5, 10), altDepth = c(7, 6),
totalDepth = c(12, 17), sampleNames = letters[1:2],
hardFilters =
FilterRules(list(coverage = function(x) totalDepth > 10)),
softFilterMatrix =
FilterMatrix(matrix = cbind(depth = c(TRUE, FALSE)),
FilterRules(depth = function(x) altDepth(x) > 6)),
tumorSpecific = c(FALSE, TRUE))
## simple accessors
ref(vr)
alt(vr)
altDepth(vr)
vr$tumorSpecific
called(vr)
## coerce to VCF and write
vcf <- as(vr, "VCF")
## writeVcf(vcf, "example.vcf")
## or just
## writeVcf(vr, "example.vcf")
## other utilities
match(vr, vr[2:1])
VariantType_class()
VariantType subclasses
Description
VariantType
subclasses specify the type of variant to be located with
locateVariants
.
Usage
CodingVariants()
IntronVariants()
FiveUTRVariants()
ThreeUTRVariants()
SpliceSiteVariants()
IntergenicVariants(upstream = 1e+06L, downstream = 1e+06L,
idType=c("gene", "tx"))
PromoterVariants(upstream = 2000L, downstream = 200L)
AllVariants(promoter = PromoterVariants(),
intergenic = IntergenicVariants())
Arguments
Argument | Description |
---|---|
upstream, downstream | Single integer values representing the number of base pairs upstream of the 5'-end and downstream of the 3'-end. Used in contructing PromoterVariants() and IntergenicVariants() objects only. |
idType | character indicating if the ids in the PRECEDEID and FOLLOWID metadata columns should be gene ids ("gene") or transcript ids ("tx"). Applicable to IntergenicVariants() objects only. |
promoter | PromoterVariants object with appropriate upstream and downstream values. Used when constructing AllVariants objects only. |
intergenic | IntergenicVariants object with appropriate upstream and downstream values. Used when constructing AllVariants objects only. |
Details
VariantType
is a virtual class inherited by the CodingVariants
,
IntronVariants
, FiveUTRVariants
, ThreeUTRVariants
,
SpliceSiteVariants
, IntergenicVariants
and AllVariants
subclasses.
The subclasses are used as the region
argument to
locateVariants
. They designate the type of variant (i.e., region of
the annotation to match) when calling locateVariants
.
The majority of subclasses have no slots and require no arguments for an
instance to be created. PromoterVariants
and IntergenicVariants
and accept upstream
and downstream
arguments that define
the number of base pairs upstream from the 5'-end and downstream from
the 3'-end of the transcript region. See the ? locateVariants
man
page for details. IntergenicVariants
also accepts a
idType
that controls what IDs are returned in the
PRECEDEID and FOLLOWID metadata columns.
AllVariants
accepts promoter
and
intergenic
arguments which are PromoterVariants()
and
IntergenicVariants()
objects with the appropriate
upstream
and downstream
values.
Seealso
- The promoters function on the intra-range-methods man page in the GenomicRanges package.
Author
Valerie Obenchain
Examples
CodingVariants()
SpliceSiteVariants()
PromoterVariants(upstream=1000, downstream=10000)
## Default values for PromoterVariants and IntergenicVariants
AllVariants()
## Modify 'upstream' and 'downstream' for IntergenicVariants
AllVariants(intergenic=IntergenicVariants(500, 100))
## Reset PromoterVariants on existing AllVariants object
av <- AllVariants()
av
promoter(av) <- PromoterVariants(100, 50)
av
VcfFile_class()
Manipulate Variant Call Format (Vcf) files.
Description
Use VcfFile()
to create a reference to a Vcf file (and its
index). Once opened, the reference remains open across calls to
methods, avoiding costly index re-loading.
VcfFileList()
provides a convenient way of managing a list of
VcfFile
instances.
Author
Valerie Obenchain
Examples
fl <- system.file("extdata", "chr7-sub.vcf.gz", package="VariantAnnotation",
mustWork=TRUE)
vcffile <- VcfFile(fl)
vcffile
vcfFields(fl)
vcfFields(vcffile)
param <- GRanges("7", IRanges(c(55000000, 55900000), width=10000))
vcf <- readVcf(vcffile, "hg19", param=param)
dim(vcf)
## `vcfFields` also works for remote vcf filepath.
chr22url <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
vcfFields(chr22url)
defunct()
Defunct Functions in Package VariantAnnotation
Description
The functions or variables listed here are no longer part
of VariantAnnotation
.
Details
Removed
list(" ", " ", list("refLocsToLocalLocs"), " has been replaced by ", list("mapToTranscripts"), " ", " and ", list("pmapToTranscripts"), ". ", " ")
list(" ", " ", list("readVcfLongForm"), " has been replaced by ", list("expand"), ". ", " ")
list(" ", " ", list("dbSNPFilter"), " and ", list("regionFilter"), " have been replaced by ", " ", list("filterVcf"), ". ", " ")
list(" ", " ", list("regionFilter"), " has been replaced by ", list("filterVcf"), ". ", " ")
list(" ", " ", list("MatrixToSnpMatrix"), " has been replaced by ", list("genotypeToSnpMatrix"), ". ", " ")
list(" ", " ", list("getTranscriptSeqs"), " has been replaced by ", " ", list("extractTranscriptSeqs"), " in the GenomicFeatures package. ", " ")
list(" ", " ", list("VRangesScanVcfParam"), " has been replaced by ", list("ScanVcfParam"), ". ", " ")
list(" ", " ", list("restrictToSVN"), " has been replaced by ", list("isSNV"), ". ", " ")
Seealso
Author
Valerie Obenchain
filterVcf_methods()
Filter VCF files
Description
Filter Variant Call Format (VCF) files from one file to another
Usage
list(list("filterVcf"), list("character"))(file, genome, destination, ..., verbose = TRUE,
index = FALSE, prefilters = FilterRules(), filters = FilterRules(),
param = ScanVcfParam())
list(list("filterVcf"), list("TabixFile"))(file, genome, destination, ..., verbose = TRUE,
index = FALSE, prefilters = FilterRules(), filters = FilterRules(),
param = ScanVcfParam())
Arguments
Argument | Description |
---|---|
file | A character(1) file path or TabixFile specifying the VCF file to be filtered. |
genome | A character(1) identifier |
destination | A character(1) path to the location where the filtered VCF file will be written. |
... | Additional arguments, possibly used by future methods. |
verbose | A logical(1) indicating whether progress messages should be printed. |
index | A logical(1) indicating whether the filtered file should be compressed and indexed (using bgzip and indexTabix ). |
prefilters | A FilterRules instance contains rules for filtering un-parsed lines of the VCF file. |
filters | A FilterRules instance contains rules for filtering fully parsed VCF objects. |
param | A ScanVcfParam instance restricting input of particular info or geno fields, or genomic locations. Applicable when applying a filter only. Prefiltering involves a grep of unparsed lines in the file; indexing is not used. |
Details
This function transfers content of one VCF file to another, removing
records that fail to satisfy prefilters
and
filters
. Filtering is done in a memory efficient manner,
iterating over the input VCF file in chunks of default size 100,000
(when invoked with character(1)
for file
) or as
specified by the yieldSize
argument of TabixFile
(when
invoked with TabixFile
).
There are up to two passes. In the first pass, unparsed lines are
passed to prefilters
for filtering, e.g., searching for a fixed
character string. In the second pass lines successfully passing
prefilters
are parsed into VCF
instances and made
available for further filtering. One or both of prefilter
and
filter
can be present.
Filtering works by removing the rows (variants) that do not meet a criteria. Because this is a row-based approach and samples are column-based most genotype filters are only meaningful for single-sample files. If a single samples fails the criteria the entire row (all samples) are removed. The case where genotype filtering is effective for multiple samples is when the criteria is applied across samples and not to the individual (e.g., keep rows where all samples have DP > 10).
Value
The destination file path as a character(1)
.
Seealso
Author
Martin Morgan and Paul Shannon
Examples
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
## -----------------------------------------------------------------------
## Filter for SNVs in a defined set of ranges:
## -----------------------------------------------------------------------
if (require(TxDb.Hsapiens.UCSC.hg19.knownGene)) {
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
exons <- exons(txdb)
exons22 <- exons[seqnames(exons) == "chr22"]
seqlevelsStyle(exons22) <- "NCBI" ## match chrom names in VCF file
## Range-based filter:
withinRange <- function(rng)
function(x) x%within% rng
## The first filter identifies SNVs and the second applies the
## range restriction.
filters <- FilterRules(list(
isSNV = isSNV,
withinRange = withinRange(exons22)))
## Apply
filt1 <- filterVcf(fl, "hg19", tempfile(), filters=filters, verbose=TRUE)
}
## -----------------------------------------------------------------------
## Using a pre-filter and filter:
## -----------------------------------------------------------------------
## Low coverage exome snp filter:
lowCoverageExomeSNP = function(x) grepl("LOWCOV,EXOME", x, fixed=TRUE)
## The pre-filter identifies low coverage exome snps and the filter
## identifies variants with INFO variable VT = SNP.
pre <- FilterRules(list(lowCoverageExomeSNP = lowCoverageExomeSNP))
filt <- FilterRules(list(VTisSNP = function(x) info(x)$VT == "SNP"))
## Apply
filt2 <- filterVcf(fl, "hg19", tempfile(), prefilters=pre, filters=filt)
## Filtered results
vcf <- readVcf(filt2, "hg19")
genotypeToSnpMatrix_methods()
Convert genotype calls from a VCF file to a SnpMatrix object
Description
Convert an array of genotype calls from the "GT", "GP", "GL" or "PL" FORMAT field of a VCF file to a SnpMatrix .
Usage
list(list("genotypeToSnpMatrix"), list("CollapsedVCF"))(x, uncertain=FALSE, ...)
list(list("genotypeToSnpMatrix"), list("array"))(x, ref, alt, ...)
Arguments
Argument | Description |
---|---|
x | A CollapsedVCF object or a array of genotype data from the "GT", "GP", "GL" or "PL" FORMAT field of a VCF file. This array is created with a call to readVcf and can be accessed with geno(<VCF>) . |
uncertain | A logical indicating whether the genotypes to convert should come from the "GT" field ( uncertain=FALSE ) or the "GP", "GL" or "PL" field ( uncertain=TRUE ). |
ref | A DNAStringSet of reference alleles. |
alt | A DNAStringSetList of alternate alleles. |
list() | Additional arguments, passed to methods. |
Details
genotypeToSnpMatrix
converts an array of genotype calls from the
"GT", "GP", "GL" or "PL" FORMAT field of a VCF file into a
SnpMatrix . The following caveats apply,
no distinction is made between phased and unphased genotypes
variants with >1 ALT allele are set to NA
only single nucleotide variants are included; others are set to NA
only diploid calls are included; others are set to NA
In VCF files, 0 represents the reference allele and integers greater than 0 represent the alternate alleles (i.e., 2, 3, 4 would indicate the 2nd, 3rd or 4th allele in the ALT field for a particular variant). This function only supports variants with a single alternate allele and therefore the alternate values will always be 1. Genotypes are stored in the SnpMatrix as 0, 1, 2 or 3 where 0 = missing, 1 = "0/0", 2 = "0/1" or "1/0" and 3 = "1/1". In SnpMatrix terminology, "A" is the reference allele and "B" is the risk allele. Equivalent statements to those made with 0 and 1 allele values would be 0 = missing, 1 = "A/A", 2 = "A/B" or "B/A" and 3 = "B/B".
The genotype fields are defined as follows:
|* list("GT : genotype, encoded as allele values separated by either of ", " "/" or "|". The allele values are 0 for the reference allele and 1 ", " for the alternate allele.") |
list("GL : genotype likelihoods comprised of comma separated ", " floating point log10-scaled likelihoods for all possible ", " genotypes. In the case of a reference allele A and a single ", " alternate allele B, the likelihoods will be ordered "A/A", "A/B", ", " "B/B".")
list("PL : the phred-scaled genotype likelihoods rounded to the ", " closest integer. The ordering of values is the ", " same as for the GL field.")
list("GP : the phred-scaled genotype posterior probabilities for all ", " possible genotypes; intended to store ", " imputed genotype probabilities. The ordering of values is the ", " same as for the GL field.")
If uncertain=TRUE
, the posterior probabilities of the three
genotypes ("A/A", "A/B", "B/B") are encoded (approximately) as byte
values. This encoding allows uncertain genotypes to be used in
snpStats functions, which in some cases may be more
appropriate than using only the called genotypes. The byte encoding
conserves memory by allowing the uncertain genotypes to be stored in a
two-dimensional raw matrix.
See the snpStats documentation for more details.
Value
A list with the following elements,
*
Seealso
Author
Stephanie Gogarten and Valerie Obenchain
References
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Examples
## ----------------------------------------------------------------
## Non-probability based snp encoding using "GT"
## ----------------------------------------------------------------
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
## This file has no "GL" or "GP" field so we use "GT".
geno(vcf)
## Convert the "GT" FORMAT field to a SnpMatrix.
mat <- genotypeToSnpMatrix(vcf)
## The result is a list of length 2.
names(mat)
## Compare coding in the VCF file to the SnpMatrix.
geno(vcf)$GT
t(as(mat$genotype, "character"))
## The 'ignore' column in 'map' indicates which variants
## were set to NA. Variant rs6040355 was ignored because
## it has multiple alternate alleles, microsat1 is not a
## snp, and chr20:1230237 has no alternate allele.
mat$map
## ----------------------------------------------------------------
## Probability-based encoding using "GL", "PL" or "GP"
## ----------------------------------------------------------------
## Read a vcf file with a "GL" field.
fl <- system.file("extdata", "gl_chr1.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
geno(vcf)
## Convert the "GL" FORMAT field to a SnpMatrix
mat <- genotypeToSnpMatrix(vcf, uncertain=TRUE)
## Only 3 of the 9 variants passed the filters. The
## other 6 variants had no alternate alleles.
mat$map
## Compare genotype representations for a subset of
## samples in variant rs180734498.
## Original called genotype
geno(vcf)$GT["rs180734498", 14:16]
## Original genotype likelihoods
geno(vcf)$GL["rs180734498", 14:16]
## Posterior probability (computed inside genotypeToSnpMatrix)
GLtoGP(geno(vcf)$GL["rs180734498", 14:16, drop=FALSE])[1,]
## SnpMatrix coding.
t(as(mat$genotype, "character"))["rs180734498", 14:16]
t(as(mat$genotype, "numeric"))["rs180734498", 14:16]
## For samples NA11829 and NA11830, one probability is significantly
## higher than the others, so SnpMatrix calls the genotype. These
|## calls match the original coding: "0|1" -> "A/B", "0|0" -> "A/A".|
|## Sample NA11831 was originally called as "0|1" but the probability|
|## of "0|0" is only a factor of 3 lower, so SnpMatrix calls it as|
## "Uncertain" with an appropriate byte-level encoding.
getTranscriptSeqs_methods()
Get transcript sequences
Description
This function is defunct. Use GenomicFeatures::extractTranscriptSeqs() instead.
Extract transcript sequences from a BSgenome object or an FaFile .
Usage
list(list("getTranscriptSeqs"), list("GRangesList,BSgenome"))(query, subject, ...)
list(list("getTranscriptSeqs"), list("GRangesList,FaFile"))(query, subject, ...)
list(list("getTranscriptSeqs"), list("GRanges,FaFile"))(query, subject, ...)
Arguments
Argument | Description |
---|---|
query | A GRangesList object containing exons or cds grouped by transcript. |
subject | A BSgenome object or a FaFile from which the sequences will be taken. |
list() | Additional arguments |
Details
getTranscriptSeqs
is a wrapper for the
extractTranscriptSeqs
and getSeq
functions. The
purpose is to allow sequence extraction from either a
BSgenome or FaFile . Transcript
sequences are extracted based on the boundaries of the feature
provided in the query
(i.e., either exons or cds regions).
Value
A DNAStringSet instance containing the sequences for all transcripts
specified in query
.
Seealso
predictCoding extractTranscriptSeqs getSeq
Author
Valerie Obenchain
Examples
## See ?extractTranscriptSeqs in the GenomicFeatures package.
indexVcf_method()
Create Index files for VCF file
Description
Create Index file for VCF if it does not exist
Usage
list(list("indexVcf"), list("character"))(x, ...)
list(list("indexVcf"), list("VcfFile"))(x, ...)
list(list("indexVcf"), list("VcfFileList"))(x, ...)
Arguments
Argument | Description |
---|---|
x | Either character(), VcfFile , or VcfFileList |
... | Additional arguments to indexTabix |
Details
If a character vector is given, assumes they are the path(s) to the
VCF file. If the index file for that VCF file does not exist, one is
created. A VcfFile
or VcfFileList
is returned.
If a VcfFile
or VcfFileList
is given, the index file is
checked, if it does not exist it will crete one. If the index file was
NA or missing, the path of the associated VCF file is used as the
index file path. An updated VcfFile
or VcfFileList
is
returned.
Value
VcfFile or VcfFileList
Seealso
Author
Lori Shepherd
Examples
fl <- system.file("extdata", "chr7-sub.vcf.gz", package="VariantAnnotation",
mustWork=TRUE)
vcf1 <- indexVcf(fl)
vcf1
isSNV_methods()
Identification of genomic variant types.
Description
Functions for identifying variant types such as SNVs, insertions, deletions, transitions, and structural rearrangements.
Usage
list(list("isSNV"), list("VRanges"))(x, ...)
list(list("isSNV"), list("ExpandedVCF"))(x, ...)
list(list("isSNV"), list("CollapsedVCF"))(x, ..., singleAltOnly = TRUE)
list(list("isInsertion"), list("VRanges"))(x, ...)
list(list("isInsertion"), list("ExpandedVCF"))(x, ...)
list(list("isInsertion"), list("CollapsedVCF"))(x, ..., singleAltOnly = TRUE)
list(list("isDeletion"), list("VRanges"))(x, ...)
list(list("isDeletion"), list("ExpandedVCF"))(x, ...)
list(list("isDeletion"), list("CollapsedVCF"))(x, ..., singleAltOnly = TRUE)
list(list("isIndel"), list("VRanges"))(x, ...)
list(list("isIndel"), list("ExpandedVCF"))(x, ...)
list(list("isIndel"), list("CollapsedVCF"))(x, ..., singleAltOnly = TRUE)
list(list("isDelins"), list("VRanges"))(x, ...)
list(list("isDelins"), list("ExpandedVCF"))(x, ...)
list(list("isDelins"), list("CollapsedVCF"))(x, ..., singleAltOnly = TRUE)
list(list("isTransition"), list("VRanges"))(x, ...)
list(list("isTransition"), list("ExpandedVCF"))(x, ...)
list(list("isTransition"), list("CollapsedVCF"))(x, ..., singleAltOnly = TRUE)
list(list("isSubstitution"), list("VRanges"))(x, ...)
list(list("isSubstitution"), list("ExpandedVCF"))(x, ...)
list(list("isSubstitution"), list("CollapsedVCF"))(x, ..., singleAltOnly = TRUE)
Arguments
Argument | Description |
---|---|
x | A VCF or VRanges object. |
singleAltOnly | A logical only applicable when x is a CollapsedVCF class. When TRUE (default) only variants with a single alternate allele are evaluated; all multi-alt variants evaluate to FALSE . When singleAltOnly=FALSE all ref / alt pairs for each variant are evaluated. If any ref / alt pairs meet the test criteria a value of TRUE is returned for the variant; this may result in a value of TRUE for a variant with a mixture of alternate alleles, some that pass the criteria and some that do not. To retain single ref / alt pairs that pass the critera use expand on the CollapsedVCF and then apply the test. |
list() | Arguments passed to other methods. |
Details
All functions return a logical vector the length of x
.
Variants in gvcf files with NON_REF alt alleles return TRUE;
structural variants return FALSE.
list("isSNV: ") list(" ", " Reference and alternate alleles are both a single nucleotide long. ", " ")
list("isInsertion: ") list(" ", " Reference allele is a single nucleotide and the alternate allele ", " is greater (longer) than a single nucleotide and the first ", " nucleotide of the alternate allele matches the reference. ", " ")
list("isDeletion: ") list(" ", " Alternate allele is a single nucleotide and the reference allele ", " is greater (longer) than a single nucleotide and the first ", " nucleotide of the reference allele matches the alternate. ", " ")
list("isIndel: ") list(" ", " The variant is either a deletion or insertion as determined ", " by ", list("isDeletion"), " and ", list("isInsertion"), ". ", " ")
list("isDelins: ") list(" ", " The variant is a deletion followed by an insertion, either of them ", " involving two or more nucleotides. ", " ")
list("isSubstition: ") list(" ", " Reference and alternate alleles are the same length (1 or ", " more nucleotides long). ", " ")
list("isTransition: ") list(" ", " Reference and alternate alleles are both a single nucleotide long. ", " The reference-alternate pair interchange is of either two-ring ", " purines (A <-> G) or one-ring pyrimidines (C <-> T). ", " ")
Value
A logical
vector the same length as x
.
Author
Michael Lawrence, Valerie Obenchain and Robert Castelo
Examples
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
## ---------------------------------------------------------------------
## VCF objects
## ---------------------------------------------------------------------
vcf <- readVcf(fl, "hg19")
DataFrame(ref(vcf), alt(vcf))
## This vcf has transitions in row 2 and 3. When 'singleAltOnly=TRUE'
## only the row 2 variant is identified:
isTransition(vcf)
## Both row 2 and 3 are identified when 'singleAltOnly=FALSE':
isTransition(vcf, singleAltOnly=FALSE)
## Expand the CollapsedVCF to ExpandedVCF
evcf <- expand(vcf)
## All ref / alt pairs are now expanded and there is no need to
## use 'singleAltOnly'. The return length is now 7 instead of 5:
transition <- isTransition(evcf)
transition
DataFrame(ref(evcf)[transition], alt(evcf)[transition])
## ---------------------------------------------------------------------
## VRanges objects
## ---------------------------------------------------------------------
## A VRanges object holds data from a VCF class in a completely
## 'flat' fashion. INFO and FORMAT variables for all subjects are
## 'repped' out such that each row is a unique combination of data.
vr <- as(vcf, "VRanges")
isSNV(vr, singleAltOnly=FALSE)
locateVariants_methods()
Locate variants
Description
Variant location with respect to gene function
Usage
locateVariants(query, subject, region, ...)
list(list("locateVariants"), list("VCF,TxDb,VariantType"))(query, subject, region, ...,
cache=new.env(parent=emptyenv()), ignore.strand=FALSE, asHits=FALSE)
list(list("locateVariants"), list("GRanges,TxDb,VariantType"))(query, subject, region, ...,
cache=new.env(parent=emptyenv()), ignore.strand=FALSE, asHits=FALSE)
Arguments
Argument | Description |
---|---|
query | A IntegerRanges , GRanges or VCF object containing the variants. Metadata columns are allowed but ignored. NOTE: Zero-width ranges are treated as width-1 ranges; start values are decremented to equal the end value. |
subject | A TxDb or GRangesList object that serves as the annotation. GFF files can be converted to TxDb objects with makeTxDbFromGFF() in the GenomicFeatures package. |
region | An instance of one of the 8 VariantType classes: CodingVariants , IntronVariants , FiveUTRVariants , ThreeUTRVariants , IntergenicVariants , SpliceSiteVariants , PromoterVariants , AllVariants . All objects can be instantiated with no arguments, e.g., CodingVariants() will create an object of CodingVariants . AllVariants , PromoterVariants and IntergenicVariants have upstream and downstream arguments. For PromoterVariants and IntergenicVariants these are single integer values >= 0. For AllVariants these are integer vectors of length 2 named promoter and intergenic . See ? upstream for more details. When using AllVariants , a range in query may fall in multiple regions (e.g., 'intergenic' and 'promoter'). In this case the result will have a row for each match. All data in the row will be equivalent except the LOCATION column. |
list() | Additional arguments passed to methods |
cache | An environment into which required components of subject are loaded. Provide, and re-use, a cache to speed repeated queries to the same subject across different query instances. |
ignore.strand | A logical indicating if strand should be ignored when performing overlaps. |
asHits | A logical indicating if the results should be returned as a Hits object. Not applicable when region is AllVariants or IntergenicVariants. |
Details
list(" ", " ", list(list("Range representation:"), list(" ", " The ranges in ", list("query"), " should reflect the position(s) of the ", " reference allele. For snps the range will be of width 1. For range ", " insertions or deletions the reference allele could be a sequence ", " such as GGTG in which case the width of the range should be 4. ", " ")), " ", " ", list(list("Location:"), list(" ", " Possible locations are ", list("coding"), ", ", list("intron"),
",
", " ", list("threeUTR"), ", ", list("fiveUTR"), ", ", list("intergenic"), ", ", " ", list("spliceSite"), ", or ", list("promoter"), ". ", " ", " Overlap operations for ", list("coding"), ", ", list("intron"), ", ", " ", list("threeUTR"), ", and ", list("fiveUTR"), " require variants to fall ", " completely within the defined region to be classified as such. ", " ", " To be classified as a ", list("spliceSite"), " the variant must overlap ", " with any portion of the first 2 or last 2 nucleotides in an intron. ",
"
", " ", list("intergenic"), " variants are ranges that do not fall within a ", " defined gene region. ", list("transcripts by gene"), " are extracted from ", " the annotation and overlapped with the variant positions. Variants with ", " no overlaps are classified as ", list("intergenic"), ". When available, gene ", " IDs for the flanking genes are provided as ", list("PRECEDEID"), " and ", " ", list("FOLLOWID"), ". ", list("upstream"), " and ", list("downstream"),
" arguments define
", " the acceptable distance from the query for the flanking genes. ", " ", list("PRECEDEID"), " and ", list("FOLLOWID"), " results are lists and contain all ", " genes that fall within the defined distance. See the examples for how ", " to compute the distance from ranges to PRECEDEID and FOLLOWID. ", " ", " ", list("promoter"), " variants fall within a specified range upstream and ", " downstream of the transcription start site. Ranges values can be set ",
" with the ", list("upstream"), " and ", list("downstream"), " arguments when creating
", " the ", list("PromoterVariants()"), " or ", list("AllVariants()"), " classes. ", " ")), " ", " ", list(list("Subject as GRangesList:"), list(" ", " The ", list("subject"), " can be a ", list("TxDb"), " or ", list("GRangesList"), " ", " object. When using a ", list("GRangesList"), " the type of data required ", " is driven by the ", list("VariantType"), " class. Below is a description of ",
" the appropriate ", list("GRangesList"), " for each ", list("VariantType"), ".
", " ", list(" ", " ", list(list("CodingVariants:"), list("coding (CDS) by transcript")), " ", " ", list(list("IntronVariants:"), list("introns by transcript")), " ", " ", list(list("FiveUTRVariants:"), list("five prime UTR by transcript")), " ", " ", list(list("ThreeUTRVariants:"), list("three prime UTR by transcript")), " ", " ", list(list("IntergenicVariants:"),
list("transcripts by gene")), "
", " ", list(list("SpliceSiteVariants:"), list("introns by transcript")), " ", " ", list(list("PromoterVariants:"), list("list of transcripts")), " ", " ", list(list("AllVariants:"), list("no GRangeList method available")), " ", " "), " ", " ")), " ", " ", list(list("Using the cache:"), list(" ", " When processing multiple VCF files performance is enhanced by specifying ", " an environment as the ", list("cache"),
" argument. This cache is used to store
", " and reuse extracted components of the subject (TxDb) required by the ", " function. The first call to the function (i.e., processing the first ", " VCF file in a list of many) populates the cache; repeated calls ", " to ", list("locateVariants"), " will access these objects from the cache vs ", " re-extracting the same information. ", " ")), " ", " ")
Value
A GRanges
object with a row for each variant-transcript match.
Strand of the output is from the subject
hit
except in the case of IntergenicVariants. For intergenic, multiple precede
and follow gene ids are returned for each variant. When
ignore.strand=TRUE
the return strand is *
because
genes on both strands are considered and it is possible to have a mixture.
When ignore.strand=FALSE
the strand will match the query
because only genes on the same strand are considered.
Metadata columns are LOCATION
, QUERYID
,
TXID
, GENEID
, PRECEDEID
, FOLLOWID
and
CDSID
. Results are ordered by QUERYID
, TXID
and
GENEID
. Columns are described in detail below.
list(" ", " ", list(list(list("LOCATION")), list(" ", " Possible locations are ", list("coding"), ", ", list("intron"), ", ", " ", list("threeUTR"), ", ", list("fiveUTR"), ", ", list("intergenic"), ", ", " ", list("spliceSite"), " and ", list("promoter"), ". ", " ", " To be classified as ", list("coding"), ", ", list("intron"), ", ", list("threeUTR"), " ", " or ", list("fiveUTR"), " the variant must fall completely within the region. ", " ", " ", list("intergenic"),
" variants do not fall within a transcript. The
", " ", list("GENEID"), " for these positions are ", list("NA"), ". Lists of flanking ", " genes that fall within the distance defined by ", list("upstream"), " and ", " ", list("downstream"), " are given as ", list("PRECEDEID"), " and ", list("FOLLOWID"), ". ", " By default, the gene ID is returned in the ", list("PRECEDEID"), " and ", " ", list("FOLLOWID"), " columns. To return the transcript ids instead set ", " ",
list("idType = "tx""), " in the ", list("IntergenicVariants()"), "
", " constructor.
", "
", " A ", list("spliceSite"), " variant overlaps any portion of the first 2 or last
", " 2 nucleotides of an intron.
", " ")), "
", " ", list(list(list("LOCSTART, LOCEND")), list("
", " Genomic position in LOCATION-centric coordinates.
", " If LOCATION is intron
, these are intron-centric coordinates,
", " if LOCATION is coding
then cds-centric. All coordinates are
",
" relative to the start of the transcript. SpliceSiteVariants,
", " IntergenicVariants and PromoterVariants have no formal
", " extraction by transcript
so for these variants LOCSTART and
", " LOCEND are NA. Coordinates are computed with ", list("mapToTranscripts"), ";
", " see ?", list("mapToTranscripts"), " in the GenomicFeatures package for details.
", " ")), "
", " ", list(list(list("QUERYID")), list("
", " The ", list("QUERYID"), " column provides a map back to the row in the
",
" original ", list("query"), ". If the ", list("query"), " was a ", list("VCF"), " object this
", " index corresponds to the row in the ", list("GRanges"), " object returned by ", " the ", list("rowRanges"), " accessor. ", " ")), " ", " ", list(list(list("TXID")), list(" ", " The transcript id taken from the ", list("TxDb"), " object. ", " ")), " ", " ", list(list(list("CDSID")), list(" ", " The coding sequence id(s) taken from the ", list("TxDb"),
" object.
", " ")), " ", " ", list(list(list("GENEID")), list(" ", " The gene id taken from the ", list("TxDb"), " object. ", " ")), " ", " ", list(list(list("PRECEDEID")), list(" ", " IDs for all genes the query precedes within the defined ", " ", list("upstream"), " and ", list("downstream"), " distance. Only applicable ", " for ", list("intergenic"), " variants. By default this column contains gene ids; ", " to return transcript ids set ", list(
"idType = "tx""), " in
", " the ", list("IntergenicVariants"), " constructor. ", " ")), " ", " ", list(list(list("FOLLOWID")), list(" ", " IDs for all genes the query follows within the defined ", " ", list("upstream"), " and ", list("downstream"), " distance. Only applicable ", " for ", list("intergenic"), " variants. By default this column contains gene ids; ", " to return transcript ids set ", list("idType = "tx""), " in ", " the ", list("IntergenicVariants"),
" constructor.
", " ")), " ", " All ID values will be ", list("NA"), " for variants with a location of ", " ", list("transcript_region"), " or ", list("NA"), ". ", " ")
Seealso
The readVcf function.
The predictCoding function.
The promoters function on the intra-range-methods man page in the GenomicRanges package.
Author
Valerie Obenchain
Examples
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## ---------------------------------------------------------------------
## Variants in all gene regions
## ---------------------------------------------------------------------
## Read variants from a VCF file.
fl <- system.file("extdata", "gl_chr1.vcf",
package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
## Often the seqlevels in the VCF file do not match those in the TxDb.
head(seqlevels(vcf))
head(seqlevels(txdb))
intersect(seqlevels(vcf), seqlevels(txdb))
## Rename seqlevels with renameSeqlevesl().
vcf <- renameSeqlevels(vcf, paste0("chr", seqlevels(vcf)))
## Confirm.
intersect(seqlevels(vcf), seqlevels(txdb))
## Overlaps for all possible variant locations.
loc_all <- locateVariants(vcf, txdb, AllVariants())
table(loc_all$LOCATION)
## ---------------------------------------------------------------------
## Variants in intergenic regions
## ---------------------------------------------------------------------
## Intergenic variants do not overlap a gene range in the
## annotation and therefore 'GENEID' is always NA. Flanking genes
## that fall within the 'upstream' and 'downstream' distances are
## reported as PRECEDEID and FOLLOWID.
region <- IntergenicVariants(upstream=70000, downstream=70000)
loc_int <- locateVariants(vcf, txdb, region)
mcols(loc_int)[c("LOCATION", "PRECEDEID", "FOLLOWID")]
## Distance to the flanking genes can be computed for variants that
## have PRECEDEID(s) or FOLLOWID(s). Each variant can have multiple
## flanking id's so we first expand PRECEDEID and the corresponding
## variant ranges.
p_ids <- unlist(loc_int$PRECEDEID, use.names=FALSE)
exp_ranges <- rep(loc_int, elementNROWS(loc_int$PRECEDEID))
## Compute distances with the distance method defined in GenomicFeatures.
## Help page can be found at ?`distance,GenomicRanges,TxDb-method`.
## The method returns NA for ids that cannot be collapsed into a single
## range (e.g., genes with ranges on multiple chromosomes).
distance(exp_ranges, txdb, id=p_ids, type="gene")
## To search for distance by transcript id set idType='tx' in the
## IntergenicVariants() constructor, e.g.,
## locateVariants(vcf, txdb, region=IntergenicVariants(idType="tx"))
## Unlist ids and expand ranges as before to get p_ids and exp_ranges.
## Then call distance() with type = "tx":
## distance(exp_ranges, txdb, id=p_ids, type="tx")
## ---------------------------------------------------------------------
## GRangesList as subject
## ---------------------------------------------------------------------
## When 'subject' is a GRangesList the GENEID is unavailable and
## will always be reported as NA. This is because the GRangesList
## objects are extractions of region-by-transcript, not region-by-gene.
cdsbytx <- cdsBy(txdb)
locateVariants(vcf, cdsbytx, CodingVariants())
intbytx <- intronsByTranscript(txdb)
locateVariants(vcf, intbytx, IntronVariants())
## ---------------------------------------------------------------------
## Using the cache
## ---------------------------------------------------------------------
## When processing multiple VCF files, the 'cache' can be used
## to store the extracted components of the TxDb
## (i.e., cds by tx, introns by tx etc.). This avoids having to
## re-extract these GRangesLists during each loop.
myenv <- new.env()
files <- list(vcf1, vcf2, vcf3)
lapply(files,
function(fl) {
vcf <- readVcf(fl, "hg19")
## modify seqlevels to match TxDb
seqlevels(vcf_mod) <- paste0("chr", seqlevels(vcf))
locateVariants(vcf_mod, txdb, AllVariants(), cache=myenv)
})
## ---------------------------------------------------------------------
## Parallel implmentation
## ---------------------------------------------------------------------
library(BiocParallel)
## A connection to a TxDb object is established when
## the package is loaded. Because each process reading from an
## sqlite db must have a unique connection the TxDb
## object cannot be passed as an argument when running in
## parallel. Instead the package must be loaded on each worker.
## The overhead of the multiple loading may defeat the
## purpose of running the job in parallel. An alternative is
## to instead pass the appropriate GRangesList as an argument.
## The details section on this man page under the heading
## 'Subject as GRangesList' explains what GRangesList is
## appropriate for each variant type.
## A. Passing a GRangesList:
fun <- function(x, subject, ...)
locateVariants(x, subject, IntronVariants())
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
grl <- intronsByTranscript(TxDb.Hsapiens.UCSC.hg19.knownGene)
mclapply(c(vcf, vcf), fun, subject=grl)
## B. Passing a TxDb:
## Forking:
## In the case of forking, the TxDb cannot be loaded
## in the current workspace.
## To detach the NAMESPACE:
## unloadNamespace("TxDb.Hsapiens.UCSC.hg19.knownGene")
fun <- function(x) {
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
locateVariants(x, TxDb.Hsapiens.UCSC.hg19.knownGene,
IntronVariants())
}
mclapply(c(vcf, vcf), fun)
## Clusters:
cl <- makeCluster(2, type = "SOCK")
fun <- function(query, subject, region) {
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
locateVariants(query, TxDb.Hsapiens.UCSC.hg19.knownGene, region)
}
parLapply(cl, c(vcf, vcf), fun, region=IntronVariants())
stopCluster(cl)
predictCoding_methods()
Predict amino acid coding changes for variants
Description
Predict amino acid coding changes for variants a coding regions
Usage
list(list("predictCoding"), list("CollapsedVCF,TxDb,ANY,missing"))(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
list(list("predictCoding"), list("ExpandedVCF,TxDb,ANY,missing"))(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
list(list("predictCoding"), list("IntegerRanges,TxDb,ANY,DNAStringSet"))(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
list(list("predictCoding"), list("GRanges,TxDb,ANY,DNAStringSet"))(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
list(list("predictCoding"), list("VRanges,TxDb,ANY,missing"))(query, subject, seqSource, varAllele, ..., ignore.strand=FALSE)
Arguments
Argument | Description |
---|---|
query | A VCF , IntegerRanges , GRanges or VRanges instance containing the variants to be annotated. If query is a IntegerRanges or VRanges it is coerced to a GRanges . If a VCF is provided the GRanges returned by the rowRanges() accessor will be used. All metadata columns are ignored. When query is not a VCF object a varAllele must be provided. The varAllele must be a DNAStringSet the same length as the query . If there are multiple alternate alleles per variant the query must be expanded to one row per each alternate allele. See examples. NOTE: Variants are expected to conform to the VCF specs as described on the 1000 Genomes page (see references). Indels must include the reference allele; zero-width ranges are not valid and return no matches. |
subject | A TxDb object that serves as the annotation. GFF files can be converted to TxDb objects with makeTxDbFromGFF() in the GenomicFeatures package. |
seqSource | A BSgenome instance or an FaFile to be used for sequence extraction. |
varAllele | A DNAStringSet containing the variant (alternate) alleles. The length of varAllele must equal the length of query . Missing values are represented by a zero width empty element. Ranges with missing varAllele values are ignored; those with an N character are not translated. When the query is a VCF object the varAllele argument will be missing. This value is taken internally from the VCF with alt(<VCF>) . |
list() | Additional arguments passed to methods. Arguments genetic.code and if.fuzzy.codon are supported for the translate function. |
ignore.strand | A logical indicating if strand should be ignored when performing overlaps. When ignore.strand == TRUE the query strand is set to '*' and can overlap with any strand of the subject . The return GRanges reflects the strand of the subject hit, however, the positions and alleles reported are computed as if both were from the '+' strand. ignore.strand == FALSE requires the query and subject to have compatible strand. Again the return GRanges reports the strand of the subject hit but in this case positions and alleles are computed according to the strand of the subject . |
Details
This function returns the amino acid coding for variants that fall
completely within' a coding region. The reference sequences are taken from a fasta file or [BSgenome](#bsgenome) . The width of the reference is determined from the start position and width of the range in the
query. For guidance on how to represent an insertion, deletion or substitution see the 1000 Genomes VCF format (references). Variant alleles are taken from the
varAllelewhen supplied. When the
queryis a
VCFobject the
varAllelewill be missing. This value is taken internally from the
VCFwith
alt(. The variant allele is substituted into the reference sequences and transcribed. Transcription only occurs if the substitution, insertion or deletion results in a new sequence length divisible by 3. When the
queryis an unstranded (*)
GRangespredictCoding
will attempt to find overlaps on both the positive and negative strands of the
subject. When the subject hit is on the negative strand the return
varAlleleis reverse complemented. The strand of the returned
GRangesrepresents the strand of the subject hit. ## Value A [GRanges](#granges) with a row for each variant-transcript match. The result includes only variants that fell within coding regions. The strand of the output
GRangesrepresents the strand of the
subjecthit. At a minimum, the metadata columns (accessible with
mcols) include, list(" ", " ", list(list(list("varAllele")), list(" ", " Variant allele. This value is reverse complemented for an unstranded ", " ", list("query"), " that overlaps a ", list("subject"), " on the negative strand. ", " ")), " ", " ", list(list(list("QUERYID")), list(" ", " Map back to the row in the original query ", " ")), " ", " ", list(list(list("TXID")), list(" ", " Internal transcript id from the annotation ", " ")), " ", " ", list(list(list( "CDSID")), list(" ", " Internal coding region id from the annotation ", " ")), " ", " ", list(list(list("GENEID")), list(" ", " Internal gene id from the annotation ", " ")), " ", " ", list(list(list("CDSLOC")), list(" ", " Variant location in coding region-based coordinates. This position is ", " relative to the start of the coding (cds) region defined in the ", " ", list("subject"), " annotation. ", " ")), " ", " ", list(list(list("PROTEINLOC")), list(" ", " Variant codon triplet location in coding region-based coordinates. ", " This position is relative to the start of the coding (cds) region ", " defined in the ", list("subject"), " annotation. ", " ")), " ", " ", list(list(list("CONSEQUENCE")), list(" ", " Possible values are
synonymous', nonsynonymous',
frameshift',
", " nonsense' and
not translated'. Variant sequences are translated only
", " when the codon sequence is a multiple of 3. The value will be frameshift' ", " when a sequence is of incompatible length.
not translated' is used
", " when ", list("varAllele"), " is missing or there is an ", list("N"), " in the
", " sequence. `nonsense' is used for premature stop codons.
", " ")), "
", " ", list(list(list("REFCODON")), list("
", " The reference codon sequence. This range is typically greater
", " than the width of the range in the ", list("GRanges"), " because it includes
", " all codons involved in the sequence modification. If the reference
",
" sequence is of width 2 but the alternate allele is of width 4 then at
", " least two codons must be included in the ", list("REFSEQ"), ".
", " ")), "
", " ", list(list(list("VARCODON")), list("
", " This sequence is the result of inserting, deleting or replacing the
", " position(s) in the reference sequence alternate allele. If the result
", " of this substitution is not a multiple of 3 it will not be translated.
", " ")), "
", " ", list(list(
list("REFAA")), list("
", " The reference amino acid column contains the translated ", list("REFSEQ"), ".
", " When translation is not possible this value is missing.
", " ")), "
", " ", list(list(list("VARAA")), list("
", " The variant amino acid column contains the translated ", list("VARSEQ"), ". When
", " translation is not possible this value is missing.
", " ")), "
", " ")
## Seealso
readVcf ,
locateVariants ,
refLocsToLocalLocs
getTranscriptSeqs
## Author
Michael Lawrence and Valerie Obenchain
## References
http://www.1000genomes.org/wiki/analysis/variant-call-format/
http://vcftools.sourceforge.net/specs.html
## Examples
r library(BSgenome.Hsapiens.UCSC.hg19) library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene ## ---------------------------- ## VCF object as query ## ---------------------------- ## Read variants from a VCF file fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation") vcf <- readVcf(fl, "hg19") ## Rename seqlevels in the VCF object to match those in the TxDb. vcf <- renameSeqlevels(vcf, "chr22") ## Confirm common seqlevels intersect(seqlevels(vcf), seqlevels(txdb)) ## When 'query' is a VCF object the varAllele argument is missing. coding1 <- predictCoding(vcf, txdb, Hsapiens) head(coding1, 3) ## Exon-centric or cDNA locations: exonsbytx <- exonsBy(txdb, "tx") cDNA <- mapToTranscripts(coding1, exonsbytx) mcols(cDNA)$TXID <- names(exonsbytx)[mcols(cDNA)$transcriptsHits] cDNA <- cDNA[mcols(cDNA)$TXID == mcols(coding1)$TXID[mcols(cDNA)$xHits]] ## Make sure cDNA is parallel to coding1 stopifnot(identical(mcols(cDNA)$xHits, seq_along(coding1))) coding1$cDNA <- ranges(cDNA) ## ---------------------------- ## GRanges object as query ## ---------------------------- ## A GRanges can also be used as the 'query'. The seqlevels in the VCF ## were adjusted in previous example so the GRanges extracted with ## has the correct seqlevels. rd <- rowRanges(vcf) ## The GRanges must be expanded to have one row per alternate allele. ## Variants 1, 2 and 10 have two alternate alleles. altallele <- alt(vcf) eltROWS <- elementNROWS(altallele) rd_exp <- rep(rd, eltROWS) ## Call predictCoding() with the expanded GRanges and the unlisted ## alternate allele as the 'varAllele'. coding2 <- predictCoding(rd_exp, txdb, Hsapiens, unlist(altallele))
probabilityToSnpMatrix()
Convert posterior genotype probability to a SnpMatrix object
Description
Convert a matrix of posterior genotype probabilites P(AA), P(AB), P(BB) to a SnpMatrix .
Usage
probabilityToSnpMatrix(probs)
Arguments
Argument | Description |
---|---|
probs | Matrix with three columns for the posterior probabilities of the three genotypes: "P(A/A)", "P(A/B)", "P(B/B)". Each row must sum to 1. |
Details
probabilityToSnpMatrix
converts a matrix of posterior
probabilites of genotype calls into a
SnpMatrix .
Value
An object of class "SnpMatrix"
with one row (one sample).
Posterior probabilities are encoded (approximately) as byte
values, one per SNP. See
the help page for SnpMatrix for complete
details of the class structure.
Seealso
genotypeToSnpMatrix , SnpMatrix
Author
Stephanie Gogarten sdmorris@u.washington.edu
Examples
probs <- matrix(c(1,0,0,
0,1,0,
0,0,1,
NA,NA,NA),
ncol=3, byrow=TRUE,
dimnames=list(1:4,c("A/A","A/B","B/B")))
sm <- probabilityToSnpMatrix(probs)
as(sm, "character")
readVcf_methods()
Read VCF files
Description
Read Variant Call Format (VCF) files
Usage
list(list("readVcf"), list("TabixFile,ScanVcfParam"))(file, genome, param,
..., row.names=TRUE)
list(list("readVcf"), list("TabixFile,IntegerRangesList"))(file, genome, param,
..., row.names=TRUE)
list(list("readVcf"), list("TabixFile,GRanges"))(file, genome, param,
..., row.names=TRUE)
list(list("readVcf"), list("TabixFile,GRangesList"))(file, genome, param,
..., row.names=TRUE)
list(list("readVcf"), list("TabixFile,missing"))(file, genome, param,
..., row.names=TRUE)
list(list("readVcf"), list("character,ANY"))(file, genome, param,
..., row.names=TRUE)
list(list("readVcf"), list("character,missing"))(file, genome, param,
..., row.names=TRUE)
list(list("readVcf"), list("character,missing"))(file, genome, param,
..., row.names=TRUE)
## Lightweight functions to read a single variable
readInfo(file, x, param=ScanVcfParam(), ..., row.names=TRUE)
readGeno(file, x, param=ScanVcfParam(), ..., row.names=TRUE)
readGT(file, nucleotides=FALSE, param=ScanVcfParam(), ..., row.names=TRUE)
## Import wrapper
list(list("import"), list("VcfFile,ANY,ANY"))(con, format, text, ...)
Arguments
Argument | Description |
---|---|
file | A VcfFile (synonymous with TabixFile ) instance or character() name of the VCF file to be processed. When ranges are specified in param , file must be a VcfFile . Use of the VcfFile methods are encouraged as they are more efficient than the character() methods. See ? VcfFile , and ? indexVcf for help creating a VcfFile . |
genome | A character or Seqinfo object. |
list(list("character"), ":") list(" Genome identifier as a single string or named ", " character vector. Names of the character vector correspond to ", " chromosome names in the file. This identifier replaces the genome ", " information in the VCF ", list("Seqinfo"), " (i.e., ", list("seqinfo(vcf)"), "). ", " When not provided, ", list("genome"), " is taken from the VCF file header. ", " ")
list(list("Seqinfo"), ":") list(" When ", list("genome"), " is provided as a ", list("Seqinfo"), " ", " it is propagated to the VCF output. If seqinfo information can be ", " obtained from the file, ", " (i.e., seqinfo(scanVcfHeader(fl)) is not empty), the output ", " ", list("Seqinfo"), " is a product of merging the two. ", " ", " If a param (i.e., ScanVcfParam) is used in the call to ", list("readVcf"), ", ", " the seqlevels of the param ranges must be present in ", list( "genome"), ". ", " ")
|param
| An instance of ScanVcfParam ,GRanges
,GRangesList
orIntegerRangesList
. VCF files can be subset on genomic coordinates (ranges) or elements in the VCF fields. Both genomic coordinates and VCF elements can be specified in a ScanVcfParam . See ?ScanVcfParam
for details. | |x
|character
name of singleinfo
orgeno
field to import. Applicable toreadInfo
andreadGeno
only. | |row.names
| Alogical
specifying if rownames should be returned. In the case ofreadVcf
, rownames appear on theGRanges
returned by therowRanges
accessor. | |nucleotides
| Alogical
indicating if genotypes should be returned as nucleotides instead of the numeric representation. Applicable toreadGT
only. | |con
| TheVcfFile
object to import.| |format, text
| Ignored.| |list()
| Additional arguments, passed to methods. Forimport
, the arguments are passed toreadVcf
. |
Details
list(" ", " ", list(list("Data Import: "), list(" ", " ", list(" ", " ", list(list("VCF object: "), list(" ", " ", list("readVcf"), " imports records from bzip compressed or uncompressed ", " VCF files. Data are parsed into a ", list(list("VCF")), " object ", " using the file header information if available. To import a subset ", " of ranges the VCF must have an index file. An index file can be ", " created with ", list("bzip"),
" and ", list("indexVcf"), " functions.
", " ", " The ", list("readInfo"), ", ", list("readGeno"), " and ", list("readGT"), " functions ", " are lightweight versions of ", list("readVcf"), " and import a single ", " variable. The return object is a vector, matrix or CompressedList ", " instead of a VCF class. ", " ")), " ", " "), " ", " ", list("readVcf"), " calls ", list(list("scanVcf")), ", the details of which can be ", " found with ",
list("?scanVcf"), ".
", " ")), " ", " ", list(list("Header lines (aka Meta-information): "), list(" ", " readVcf() reads and parses fields according to the multiplicity and ", " data type specified in the header lines. Fields without header lines are ", " skipped (not read or parsed). To see what fields are present in the ", " header use ", list("scanVcfHeader()"), ". See ?", list("VCFHeader"), " for more details. ", " ", " Passing ", list("verbose = TRUE"),
" to ", list("readVcf()"), " prints the fields
", " with header lines that will be parsed by ", list("readVcf"), ". ", " ")), " ", " ", list(list("Data type: "), list(" ", " CHROM, POS, ID and REF fields are used to create the ", list("GRanges"), " ", " stored in the ", list("VCF"), " object and accessible with the ", list("rowRanges"), " ", " accessor. ", " ", " REF, ALT, QUAL and FILTER are parsed into the ", list("DataFrame"), " in the ", " ",
list("fixed"), " slot. Because ALT can have more than one value per variant
", " it is represented as a ", list("DNAStringSetList"), ". REF is a ", list("DNAStringSet"), ", ", " QUAL is ", list("numeric"), " and FILTER is a ", list("character"), ". Accessors include ", " ", list("fixed"), ", ", list("ref"), ", ", list("alt"), ", ", list("qual"), ", and ", list("filt"), ". ", " ", " Data from the INFO field can be accessed with the ", list("info"), " accessor. ", " Genotype data (i.e., data immediately following the FORMAT field in the ",
" VCF) can be accessed with the ", list("geno"), " accessor. INFO and genotype data
", " types are determined according to the ", list("Number"), " and ", list("Type"), " ", " information in the file header as follows: ", " ", " ", list("Number"), " should only be 0 when ", list("Type"), " is 'flag'. These ", " fields are parsed as logical vectors. ", " ", " If ", list("Number"), " is 1, ", list("info"), " data are parsed into a ", " ", list("vector"),
" and ", list("geno"), " into a ", list("matrix"), ".
", " ", " If ", list("Number"), " is >1, ", list("info"), " data are parsed into a ", " ", list("DataFrame"), " with the same number of columns. ", list("geno"), " are ", " parsed into an ", list("array"), " with the same dimensions as ", list("Number"), ". ", " Columns of the ", list("geno"), " matrices are the samples. ", " ", " If ", list("Number"), " is ", list("."), ", ", list("A"), " or ", list("G"),
",
", " both ", list("info"), " and ", list("geno"), " data are parsed into a ", list("matrix"), ". ", " ", " When the header does not contain any ", list("INFO"), " lines, the data are ", " returned as a single, unparsed column. ", " ")), " ", " ", list(list("Missing data: "), list(" ", " Missing data in VCF files on disk are represented by a dot ("."). ", " ", list("readVcf"), " retains the dot as a character string for data type ", " character and converts it to ",
list("NA"), " for data types numeric or double.
", " ", " Because the data are stored in rectangular data structures there is a ", " value for each ", list("info"), " and ", list("geno"), " field element in the ", list("VCF"), " ", " class. If the element was missing or was not collected for a particular ", " variant the value will be ", list("NA"), ". ", " ", " In the case of the ALT field we have the following treatment of ", " special characters / missing values: ",
" ", list("
", " ", list(), " '.' true missings become empty characters ", " ", list(), " '*' are treated as missing and become empty characters ", " ", list(), " 'I' are treated as undefined and become '.' ", " "), " ", " ")), " ", " ", list(list("Efficient Usage: "), list(" ", " Subsets of data (i.e., specific variables, positions or samples) can ", " be read from a VCF file by providing a ", list("ScanVcfParam"), " object in ", " the call to ",
list("readVcf"), ". Other lightweight options are the
", " ", list("readGT"), ", ", list("readInfo"), " and ", list("readGeno"), " functions which ", " return data as a matrix instead of the ", list("VCF"), " class. ", " ", " Another option for handling large files is to iterate through the ", " data in chunks by setting the ", list("yieldSize"), " parameter in a ", " ", list("VcfFile"), " object. Iteration can be through all data fields or ", " a subset defined by a ",
list("ScanVcfParam"), ". See example below,
", " Iterating through VCF with yieldSize
.
", " ")), "
", " ")
Value
readVcf
returns a list("VCF") object. See ? VCF
for
complete details of the class structure. readGT
, readInfo
and
readGeno
return a matrix
.
list(" ", " ", list(list("rowRanges: "), list(" ", " The CHROM, POS, ID and REF fields are used to create a ", list("GRanges"), " ", " object. Ranges are created using POS as the start value and width of ", " the reference allele (REF). By default, the IDs become the rownames ", " ('row.names = FALSE' to turn this off). If IDs are ", " missing (i.e., ", list("."), ") a string of CHROM:POS_REF/ALT is used instead. ", " The ", list("genome"), " argument is stored in the seqinfo of the ",
list("GRanges"), "
", " and can be accessed with ", list("genome(
" ", list("DataFrame"), ".
", " ", " REF is returned as a DNAStringSet. ", " ", " ALT is a CharacterList when it contains structural variants and a ", " DNAStringSetList otherwise. See also the 'Details' section ", " for 'Missing data'. ", " ")), " ", " ", list(list("info: "), list(" ", " Data from the INFO field of the VCF is parsed into a ", list("DataFrame"), ". ", " ")), " ", " ", list(list("geno: "), list(" ", " If present, the genotype data are parsed into a list of ",
list("matrices"), "
", " or ", list("arrays"), ". Each list element represents a field in the FORMAT ", " column of the VCF file. Rows are the variants, columns are the samples. ", " ")), " ", " ", list(list("colData: "), list(" ", " This slot contains a ", list("DataFrame"), " describing the samples. If present, ", " the sample names following FORMAT in the VCF file become the row names. ", " ")), " ", " ", list(list("metadata: "), list(" ", " Header information present in the file is put into a ",
list("list"), "
", " in ", list("metadata"), ". ", " ")), " ", " ") See references for complete details of the VCF file format.
Seealso
indexVcf
,
VcfFile
,
indexTabix
,
TabixFile
,
scanTabix
,
scanBcf
,
expand,CollapsedVCF-method
Author
Valerie Obenchain>
References
http://vcftools.sourceforge.net/specs.html outlines the VCF specification.
http://samtools.sourceforge.net/mpileup.shtml contains
information on the portion of the specification implemented by
bcftools
.
http://samtools.sourceforge.net/ provides information on
samtools
.
Examples
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
## vcf <- readVcf(fl, c("20"="hg19")) ## 'genome' as named vector
## ---------------------------------------------------------------------
## Header and genome information
## ---------------------------------------------------------------------
vcf
## all header information
hdr <- header(vcf)
## header information for 'info' and 'fixed' fields
info(hdr)
fixed(hdr)
## ---------------------------------------------------------------------
## Accessors
## ---------------------------------------------------------------------
## fixed fields together
head(fixed(vcf), 5)
## fixed fields separately
filt(vcf)
ref(vcf)
## info data
info(hdr)
info(vcf)
info(vcf)$DP
## geno data
geno(hdr)
geno(vcf)
head(geno(vcf)$GT)
## genome
unique(genome(rowRanges(vcf)))
## ---------------------------------------------------------------------
## Data subsets with lightweight read* functions
## ---------------------------------------------------------------------
## Import a single 'info' or 'geno' variable
DP <- readInfo(fl, "DP")
HQ <- readGeno(fl, "HQ")
## Import GT as numeric representation
GT <- readGT(fl)
## Import GT as nucleotides
GT <- readGT(fl, nucleotides=TRUE)
## ---------------------------------------------------------------------
## Data subsets with ScanVcfParam
## ---------------------------------------------------------------------
## Subset on genome coordinates:
## 'file' must have an index
rngs <- GRanges("20", IRanges(c(14370, 1110000), c(17330, 1234600)))
names(rngs) <- c("geneA", "geneB")
param <- ScanVcfParam(which=rngs)
compressVcf <- bgzip(fl, tempfile())
tab <- indexVcf(compressVcf)
vcf <- readVcf(tab, "hg19", param)
## When data are subset by range ('which' argument in ScanVcfParam),
## the 'paramRangeID' column provides a map back to the original
## range in 'param'.
rowRanges(vcf)[,"paramRangeID"]
vcfWhich(param)
## Subset on samples:
## Consult the header for the sample names.
samples(hdr)
## Specify one or more names in 'samples' in a ScanVcfParam.
param <- ScanVcfParam(samples="NA00002")
vcf <- readVcf(tab, "hg19", param)
geno(vcf)$GT
## Subset on 'fixed', 'info' or 'geno' fields:
param <- ScanVcfParam(fixed="ALT", geno=c("GT", "HQ"), info=c("NS", "AF"))
vcf_tab <- readVcf(tab, "hg19", param)
info(vcf_tab)
geno(vcf_tab)
## No ranges are specified in the 'param' so tabix file is not
## required. Instead, the uncompressed VCF can be used as 'file'.
vcf_fname <- readVcf(fl, "hg19", param)
## The header will always contain information for all variables
## in the original file reguardless of how the data were subset.
## For example, all 'geno' fields are listed in the header
geno(header(vcf_fname))
## but only 'GT' and 'HQ' are present in the VCF object.
geno(vcf_fname)
## Subset on both genome coordinates and 'info', 'geno' fields:
param <- ScanVcfParam(geno="HQ", info="AF", which=rngs)
vcf <- readVcf(tab, "hg19", param)
## When any of 'fixed', 'info' or 'geno' are omitted (i.e., no
## elements specified) all records are retrieved. Use NA to indicate
## that no records should be retrieved. This param specifies
## all 'fixed fields, the "GT" 'geno' field and none of 'info'.
ScanVcfParam(geno="GT", info=NA)
## ---------------------------------------------------------------------
## Iterate through VCF with 'yieldSize'
## ---------------------------------------------------------------------
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"), info=c("LDAF"))
tab <- VcfFile(fl, yieldSize=4000)
open(tab)
while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
cat("vcf dim:", dim(vcf_yield), "
")
close(tab)
## ---------------------------------------------------------------------
## Debugging with 'verbose'
## ---------------------------------------------------------------------
## readVcf() uses information in the header lines to parse the data to
## the correct number and type. Fields without header lines are skipped.
## If a call to readVcf() results in no info(VCF) or geno(VCF) data the
## file may be missing header lines. Set 'verbose = TRUE' to get
## a listing of fields found in the header.
## readVcf(myfile, "mygenome", verbose=TRUE)
## Header fields can also be discovered with scanVcfHeader().
hdr <- scanVcfHeader(fl)
geno(hdr)
scanVcf_methods()
Import VCF files
Description
Import Variant Call Format (VCF) files in text or binary format
Usage
scanVcfHeader(file, ...)
list(list("scanVcfHeader"), list("character"))(file, ...)
scanVcf(file, ..., param)
list(list("scanVcf"), list("character,ScanVcfParam"))(file, ..., param)
list(list("scanVcf"), list("character,missing"))(file, ..., param)
list(list("scanVcf"), list("connection,missing"))(file, ..., param)
list(list("scanVcfHeader"), list("TabixFile"))(file, ...)
list(list("scanVcf"), list("TabixFile,missing"))(file, ..., param)
list(list("scanVcf"), list("TabixFile,ScanVcfParam"))(file, ..., param)
list(list("scanVcf"), list("TabixFile,GRanges"))(file, ..., param)
list(list("scanVcf"), list("TabixFile,IntegerRangesList"))(file, ..., param)
Arguments
Argument | Description |
---|---|
file | For scanVcf and scanVcfHeader , the character() file name, TabixFile , or class connection ( file() or bgzip() ) of the VCF file to be processed. |
param | A instance of ScanVcfParam influencing which records are parsed and the INFO and GENO information returned. |
... | Additional arguments for methods |
Details
The argument param
allows portions of the file to be input, but
requires that the file be bgzip'd and indexed as a
TabixFile .
scanVcf
with param="missing"
and file="character"
or file="connection"
scan the entire file. With
file="connection"
, an argument n
indicates the number of
lines of the VCF file to input; a connection open at the beginning of
the call is open and incremented by n
lines at the end of the
call, providing a convenient way to stream through large VCF files.
The INFO field of the scanned VCF file is returned as a single packed vector, as in the VCF file. The GENO field is a list of matrices, each matrix corresponds to a field as defined in the FORMAT field of the VCF header. Each matrix has as many rows as scanned in the VCF file, and as many columns as there are samples. As with the INFO field, the elements of the matrix are packed . The reason that INFO and GENO are returned packed is to facilitate manipulation, e.g., selecting particular rows or samples in a consistent manner across elements.
Value
scanVcfHeader
returns a VCFHeader
object with
header information parsed into five categories, samples
,
meta
, fixed
, info
and geno
. Each
can be accessed with a getter' of the same name (e.g., info(<VCFHeader>)). If the file header has multiple rows with the same name (e.g., 'source') the row names of the DataFrame are made unique in the usual way, 'source', 'source.1' etc.
scanVcfreturns a list, with one element per range. Each list has 7 elements, obtained from the columns of the VCF specification: list(" ", " ", list(list("rowRanges"), list(" ", " ", list("GRanges"), " instance derived from ", list("CHROM"), ", ", list("POS"), ", ", " ", list("ID"), ", and the width of ", list("REF"), " ", " ")), " ", " ", list(list("REF"), list(" ", " reference allele ", " ")), " ", " ", list(list("ALT"), list(" ", " alternate allele ", " ")), " ", " ", list(list("QUAL"), list(" ", " phred-scaled quality score for the assertion made in ALT ", " ")), " ", " ", list(list("FILTER"), list(" ", " indicator of whether or not the position passed all filters applied ", " ")), " ", " ", list(list("INFO"), list(" ", " additional information ", " ")), " ", " ", list(list("GENO"), list(" ", " genotype information immediately following the FORMAT field in the VCF ", " ")), " ", " ") The
GENOelement is itself a list, with elements corresponding to those defined in the VCF file header. For
scanVcf, elements of GENO are returned as a matrix of records x samples; if the description of the element in the file header indicated multiplicity other than 1 (e.g., variable number for list("A") , list("G") , or list(".") ), then each entry in the matrix is a character string with sub-entries comma-delimited. ## Seealso [
readVcf](#readvcf) [
BcfFile](#bcffile) [
TabixFile](#tabixfile) ## Author Martin Morgan and Valerie Obenchain> ## References [http://vcftools.sourceforge.net/specs.html](http://vcftools.sourceforge.net/specs.html) outlines the VCF specification. [http://samtools.sourceforge.net/mpileup.shtml](http://samtools.sourceforge.net/mpileup.shtml) contains information on the portion of the specification implemented by
bcftools. [http://samtools.sourceforge.net/](http://samtools.sourceforge.net/) provides information on
samtools` .
## Examples
r fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") scanVcfHeader(fl) vcf <- scanVcf(fl) ## value: list-of-lists str(vcf) names(vcf[[1]][["GENO"]]) vcf[[1]][["GENO"]][["GT"]]
seqinfo_method()
Get seqinfo for VCF file
Description
Get seqinfo for VCF file
Usage
list(list("seqinfo"), list("VcfFile"))(x)
list(list("seqinfo"), list("VcfFileList"))(x)
Arguments
Argument | Description |
---|---|
x | Either character(), VcfFile , or VcfFileList |
Details
If a VcfFile
The file header is scanned an appropriate seqinfo
object in given. If a VcfFileList
is given, all file headers are
scanned, and appropriate combined seqinfo object is given.
Value
Seqinfo object
Seealso
Author
Lori Shepherd
Examples
fl <- system.file("extdata", "chr7-sub.vcf.gz", package="VariantAnnotation",
mustWork=TRUE)
vcf <- VcfFile(fl)
seqinfo(vcf)
snpSummary()
Counts and distribution statistics for SNPs in a VCF object
Description
Counts and distribution statistics for SNPs in a VCF object
Usage
list(list("snpSummary"), list("CollapsedVCF"))(x, ...)
Arguments
Argument | Description |
---|---|
x | A CollapsedVCF object. |
list() | Additional arguments to methods. |
Details
Genotype counts, allele counts and Hardy Weinberg equilibrium (HWE) statistics are calculated for single nucleotide variants in a CollapsedVCF object. HWE has been established as a useful quality filter on genotype data. This equilibrium should be attained in a single generation of random mating. Departures from HWE are indicated by small p values and are almost invariably indicative of a problem with genotype calls.
The following caveats apply:
No distinction is made between phased and unphased genotypes.
Only diploid calls are included.
Only
valid' SNPs are included. A
valid' SNP is defined as having a reference allele of length 1 and a single alternate allele of length 1.
Variants that do not meet these criteria are set to NA.
Value
The object returned is a data.frame
with seven columns.
list("
", " ", list(list("g00"), list("
", " Counts for genotype 00 (homozygous reference).
", " ")), "
", " ", list(list("g01"), list("
", " Counts for genotype 01 or 10 (heterozygous).
", " ")), "
", " ", list(list("g11"), list("
", " Counts for genotype 11 (homozygous alternate).
", " ")), "
", " ", list(list("a0Freq"), list("
", " Frequency of the reference allele.
", " ")), "
", " ", list(list("a1Freq"), list("
", " Frequency of the alternate allele.
",
" ")), "
", " ", list(list("HWEzscore"), list(" ", " Z-score for departure from a null hypothesis of Hardy Weinberg equilibrium. ", " ")), " ", " ", list(list("HWEpvalue"), list(" ", " p-value for departure from a null hypothesis of Hardy Weinberg equilibrium. ", " ")), " ", " ")
Seealso
genotypeToSnpMatrix , probabilityToSnpMatrix
Author
Chris Wallace cew54@cam.ac.uk
Examples
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
## The return value is a data.frame with genotype counts
## and allele frequencies.
df <- snpSummary(vcf)
df
## Compare to ranges in the VCF object:
rowRanges(vcf)
## No statistics were computed for the variants in rows 3, 4
## and 5. They were omitted because row 3 has two alternate
## alleles, row 4 has none and row 5 is not a SNP.
summarizeVariants_methods()
Summarize variants by sample
Description
Variants in a VCF file are overlapped with an annotation region and summarized by sample. Genotype information in the VCF is used to determine which samples express each variant.
Usage
list(list("summarizeVariants"), list("TxDb,VCF,CodingVariants"))(query, subject, mode, ...)
list(list("summarizeVariants"), list("TxDb,VCF,FiveUTRVariants"))(query, subject, mode, ...)
list(list("summarizeVariants"), list("TxDb,VCF,ThreeUTRVariants"))(query, subject, mode, ...)
list(list("summarizeVariants"), list("TxDb,VCF,SpliceSiteVariants"))(query, subject, mode, ...)
list(list("summarizeVariants"), list("TxDb,VCF,IntronVariants"))(query, subject, mode, ...)
list(list("summarizeVariants"), list("TxDb,VCF,PromoterVariants"))(query, subject, mode, ...)
list(list("summarizeVariants"), list("GRangesList,VCF,VariantType"))(query, subject, mode, ...)
list(list("summarizeVariants"), list("GRangesList,VCF,function"))(query, subject, mode, ...)
Arguments
Argument | Description |
---|---|
query | A TxDb or GRangesList object that serves as the annotation. GFF files can be converted to TxDb objects with makeTxDbFromGFF() in the GenomicFeatures package. |
subject | A VCF object containing the variants. |
mode | mode can be a VariantType class or the name of a function. When mode is a VariantType class, counting is done with locateVariants and counts are summarized transcript-by-sample. Supported VariantType classes include CodingVariants , IntronVariants , FiveUTRVariants , ThreeUTRVariants , SpliceSiteVariants or PromoterVariants . AllVariants() and IntergenicVariants are not supported. See ? locateVariants for more detail on the variant classes. mode can also be the name of any counting function that outputs a Hits object. Variants will be summarized by the length of the GRangesList annotation (i.e., 'length-of-GRangesList'-by-sample). |
|list()
| Additional arguments passed to methods such as list("
", " ", list(list("ignore.strand"), list("A ", list("logical"), " indicating if strand should be
", " igored when performing overlaps.
", " ")), "
", " ") |
Details
summarizeVariants
uses the genotype information in a VCF
file to determine which samples are positive for each variant.
Variants are overlapped with the annotation and the counts
are summarized annotation-by-sample. If the annotation is a
GRangesList
of transcripts, the count matrix will
be transcripts-by-sample. If the GRangesList
is genes,
the count matrix will be gene-by-sample.
list("Counting with locateVariants() :") list(" ", " ", " Variant counts are always summarized transcript-by-sample. ", " When ", list("query"), " is a ", list("GRangesList"), ", it must be compatible ", " with the ", list("VariantType"), "-class given as the ", list("mode"), " argument. ", " The list below specifies the appropriate ", list("GRangesList"), " for each ", " ", list("mode"), ". ", " ", list(" ", " ", list(list("CodingVariants :"), list("coding (CDS) by transcript")), " ", " ", list(list("IntronVariants :"), list("introns by transcript")), " ", " ", list(list("FiveUTRVariants :"), list("five prime UTR by transcript")), " ", " ", list(list("ThreeUTRVariants :"), list("three prime UTR by transcript")), " ", " ", list(list("SpliceSiteVariants :"), list("introns by transcript")), " ", " ", list(list("PromoterVariants :"), list("list of transcripts")), " ", " "), " ", " ", " When ", list("query"), " is a ", list("TxDb"), ", the appropriate ", " region-by-transcript ", list("GRangesList"), " listed above is extracted ", " internally and used as the annotation. ", " ", " ")
list("Counting with a user-supplied function :") list(" ", " ", " ", list("subject"), " must be a ", list("GRangesList"), " and ", list("mode"), " must ", " be the name of a function. The count function must take 'query' ", " and 'subject' arguments and return a ", list("Hits"), " object. Counts are ", " summarized by the outer list elements of the ", list("GRangesList"), ". ", " ")
Value
A RangedSummarizedExperiment
object with count summaries in the
assays
slot. The rowRanges
contains the annotation
used for counting. Information in colData
and metadata
are taken from the VCF file.
Seealso
readVcf
,
predictCoding
,
locateVariants
Author
Valerie Obenchain
Examples
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
## Read variants from VCF.
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
## Rename seqlevels to match TxDb; confirm the match.
seqlevels(vcf) <- paste0("chr", seqlevels(vcf))
intersect(seqlevels(vcf), seqlevels(txdb))
## ----------------------------------------
## Counting with locateVariants()
## ----------------------------------------
## TxDb as the 'query'
coding1 <- summarizeVariants(txdb, vcf, CodingVariants())
colSums(assays(coding1)$counts)
## GRangesList as the 'query'
cdsbytx <- cdsBy(txdb, "tx")
coding2 <- summarizeVariants(cdsbytx, vcf, CodingVariants())
stopifnot(identical(assays(coding1)$counts, assays(coding2)$counts))
## Promoter region variants summarized by transcript
tx <- transcripts(txdb)
txlst <- splitAsList(tx, seq_len(length(tx)))
promoter <- summarizeVariants(txlst, vcf,
PromoterVariants(upstream=100, downstream=10))
colSums(assays(promoter)$counts)
## ----------------------------------------
## Counting with findOverlaps()
## ----------------------------------------
## Summarize all variants by transcript
allvariants <- summarizeVariants(txlst, vcf, findOverlaps)
colSums(assays(allvariants)$counts)
writeVcf_methods()
Write VCF files
Description
Write Variant Call Format (VCF) files to disk
Usage
list(list("writeVcf"), list("VCF,character"))(obj, filename, index = FALSE, ...)
list(list("writeVcf"), list("VCF,connection"))(obj, filename, index = FALSE, ...)
Arguments
Argument | Description |
---|---|
obj | Object containing data to be written out. At present only accepts VCF . |
filename | The character() name of the VCF file, or a connection (e.g., file ), to be written out. A connection opened with open = "a" will have header information written only if the file does not already exist. |
index | Whether to bgzip the output file and generate a tabix index. |
list() | Additional arguments, passed to methods. |
- nchunk: Integer or NA. When provided this argument overrides the default chunking behavior of
writeVcf
, see Details section. An integer value specifies the number of records in each chunk; NA disables chunking.
Details
A VCF file can be written out from data in a VCF
object. More
general methods to write out from other objects may be added in the future.
writeVcf
writes out the header fields in a VCF
object
'as-is' with the exception of these key-value pairs:
fileformat: When missing, a line is added at the top of the file with the current supported version.
VariantAnnotation
>=1.27.6 supports VCFv4.3.fileDate: When missing, a line is added with today's date. If the key-value pair exists, the date is overwritten with today's date.
contig: When missing,
VariantAnnotation
attempts to use theSeqinfo
of theVCF
object to determine the contig information.
Large VCF files (i.e., > 1e5 records) are written out in
chunks; VCF files with < 1e5 records are not chunked. The optimal number of
records per chunk depends on both the number of records and complexity of the
data. Currently writeVcf
determines records per chunk based on the
total number of records only. To override this behavior or experiment with
other values use nchunk
as an integer or NA. An integer value
represents the number of records per chunk regardless of the size of the
VCF; NA disables all chunking.
writeVcf(vcf, tempfile()) ## default chunking
writeVcf(vcf, tempfile(), nchunk = 1e6) ## chunk by 1e6
writeVcf(vcf, tempfile(), nchunk = NA) ## no chunking
Value
VCF file
Seealso
Note
NOTE: VariantAnnotation
>= 1.27.6 supports VCFv4.3. See the NOTE on
the ?VCFHeader
man page under the meta()
extractor for
a description of how header parsing has changed to accommodate the new
header lines with key name of 'META'.
Author
Valerie Obenchain and Michael Lawrence
References
http://vcftools.sourceforge.net/specs.html outlines the VCF specification.
http://samtools.sourceforge.net/mpileup.shtml contains
information on the portion of the specification implemented by
bcftools
.
http://samtools.sourceforge.net/ provides information on
samtools
.
Examples
fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
out1.vcf <- tempfile()
out2.vcf <- tempfile()
in1 <- readVcf(fl, "hg19")
writeVcf(in1, out1.vcf)
in2 <- readVcf(out1.vcf, "hg19")
writeVcf(in2, out2.vcf)
in3 <- readVcf(out2.vcf, "hg19")
stopifnot(all(in2 == in3))
## write incrementally
out3.vcf <- tempfile()
con <- file(out3.vcf, open="a")
writeVcf(in1[1:2,], con)
writeVcf(in1[-(1:2),], con)
close(con)
readVcf(out3.vcf, "hg19")