bioconductor v3.9.0 GenomicFeatures

The DEFAULT_CIRC_SEQS character vector contains strings that are normally used by major repositories as the names of chromosomes that are typically circular, it is available as a convenience so that users can us it as a default value for circ_seqs arguments, and append to it as needed.

Usage

DEFAULT_CIRC_SEQS

Examples

DEFAULT_CIRC_SEQS

FeatureDb_class()

FeatureDb objects

Description

The FeatureDb class is a generic container for storing genomic locations of an arbitrary type of genomic features.

See ? for a container for storing transcript annotations.

See ? for a convenient way to make FeatureDb objects from BioMart online resources.

The TxDb class for storing transcript annotations.
makeFeatureDbFromUCSC for a convenient way to make a FeatureDb object from UCSC online resources.
saveDb and loadDb for saving and loading the database content of a FeatureDb object.
features for how to extract genomic features from a FeatureDb object.

Author

Marc Carlson

Examples

fdb_file <- system.file("extdata", "FeatureDb.sqlite",
package="GenomicFeatures")
fdb <- loadDb(fdb_file)
fdb

TxDb_class()

TxDb objects

Description

The TxDb class is a container for storing transcript annotations.

See ? for a more generic container for storing genomic locations of an arbitrary type of genomic features.

makeTxDbFromUCSC , makeTxDbFromBiomart , and makeTxDbFromEnsembl , for making a TxDb object from online resources.
makeTxDbFromGRanges and makeTxDbFromGFF for making a TxDb object from a GRanges object, or from a GFF or GTF file.
saveDb and loadDb in the AnnotationDbi package for saving and loading a TxDb object as an SQLite file.
transcripts , transcriptsBy , and transcriptsByOverlaps , for extracting genomic feature locations from a TxDb -like object.
transcriptLengths for extracting the transcript lengths (and other metrics) from a TxDb object.
select-methods for how to use the simple "select" interface to extract information from a TxDb object.
The FeatureDb class for storing genomic locations of an arbitrary type of genomic features.
The Seqinfo class in the GenomeInfoDb package.

Author

Hervé Pagès, Marc Carlson

Examples

txdb_file <- system.file("extdata", "Biomart_Ensembl_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(txdb_file)
txdb

## Use of seqinfo():
seqlevelsStyle(txdb)
seqinfo(txdb)
seqlevels(txdb)
seqlengths(txdb)  # shortcut for 'seqlengths(seqinfo(txdb))'
isCircular(txdb)  # shortcut for 'isCircular(seqinfo(txdb))'
names(which(isCircular(txdb)))

## You can set user-supplied seqlevels on 'txdb' to restrict any further
## operations to a subset of chromosomes:
seqlevels(txdb) <- c("Y", "6")
## Then you can restore the seqlevels stored in the db:
seqlevels(txdb) <- seqlevels0(txdb)

## Use of as.list():
txdb_dump <- as.list(txdb)
txdb_dump
txdb1 <- do.call(makeTxDb, txdb_dump)
stopifnot(identical(as.list(txdb1), txdb_dump))

as_format_methods()

Coerce to file format structures

Description

These functions coerce a TxDb object to a GRanges object with metadata columns encoding transcript structures according to the model of a standard file format. Currently, BED and GFF models are supported. If a TxDb is passed to export , when targeting a BED or GFF file, this coercion occurs automatically.

Usage

list(list("asBED"), list("TxDb"))(x)
list(list("asGFF"), list("TxDb"))(x)

Arguments

Argument	Description
`x`	A `TxDb` object to coerce to a `GRanges` , structured as BED or GFF.

Value

For asBED , a GRanges , with the columns name , thickStart , thickEnd , blockStarts , blockSizes added. The thick regions correspond to the CDS regions, and the blocks represent the exons. The transcript IDs are stored in the name column. The ranges are the transcript bounds.

For asGFF , a GRanges , with columns type , Name , ID ,, and Parent . The gene structures are expressed according to the conventions defined by the GFF3 spec. There are elements of each type of feature: gene , mRNA exon and cds . The Name column contains the gene_id for genes, tx_name for transcripts, and exons and cds regions are NA . The ID column uses gene_id and tx_id , with the prefixes GeneID and TxID to ensure uniqueness across types. The exons and cds regions have NA for ID . The Parent column contains the ID s of the parent features. A feature may have multiple parents (the column is a CharacterList ). Each exon belongs to one or more mRNAs, and mRNAs belong to a gene.

Author

Michael Lawrence

Examples

txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(txdb_file)

asBED(txdb)
asGFF(txdb)

coordinate_mapping_methods()

Map range coordinates between transcripts and genome space

Description

Map range coordinates between features in the transcriptome and genome (reference) space.

See ? in the GenomicAlignments package for mapping coordinates between reads (local) and genome (reference) space using a CIGAR alignment.

Usage

## mapping to transcripts
list(list("mapToTranscripts"), list("GenomicRanges,GenomicRanges"))(x, transcripts,
          ignore.strand = FALSE)
list(list("mapToTranscripts"), list("GenomicRanges,GRangesList"))(x, transcripts,
          ignore.strand = FALSE, intronJunctions=FALSE)
list(list("mapToTranscripts"), list("ANY,TxDb"))(x, transcripts, ignore.strand = FALSE,
          extractor.fun = GenomicFeatures::transcripts, ...)
list(list("pmapToTranscripts"), list("GenomicRanges,GRangesList"))(x, transcripts,
          ignore.strand = FALSE)
## mapping from transcripts
list(list("mapFromTranscripts"), list("GenomicRanges,GRangesList"))(x, transcripts,
          ignore.strand = FALSE)
list(list("pmapFromTranscripts"), list("GenomicRanges,GRangesList"))(x, transcripts,
          ignore.strand = FALSE)
list(list("pmapFromTranscripts"), list("IntegerRanges,GRangesList"))(x, transcripts)

Arguments

Argument	Description
`x`	GenomicRanges object of positions to be mapped. The seqnames of `x` are used in `mapFromTranscripts` , i.e., when mapping from transcripts to the genome. In the case of `pmapFromTranscripts` , `x` can be an IntegerRanges object.
`transcripts`	A named GenomicRanges or GRangesList object used to map between `x` and the result. The ranges can be any feature in the transcriptome extracted from a `TxDb` (e.g., introns, exons, cds regions). See ? `transcripts` and ? `transcriptsBy` for a list of extractor functions. The `transcripts` object must have names. When mapping from transcripts to the genome, they are used to determine mapping pairs; in the reverse direction they become the seqlevels of the output object.
`ignore.strand`	When `ignore.strand` is TRUE, strand is ignored in overlaps operations (i.e., all strands are considered "+") and the strand in the output is '*'. When `ignore.strand` is FALSE strand in the output is taken from the `transcripts` argument. When `transcripts` is a `GRangesList` , all inner list elements of a common list element must have the same strand or an error is thrown. Mapped position is computed by counting from the transcription start site (TSS) and is not affected by the value of `ignore.strand` .
`intronJunctions`	Logical to indicate if intronic ranges in `x` should be reported. This argument is only supported in `mapToTranscripts` when `transcripts` is a GRangesList. When `transcripts` is a GRangesList, individual ranges can be thought of as exons and the spaces between the ranges as introns. When `intronJunctions=TRUE` , ranges that fall completely "within" an intron are reported as a zero-width range (start and end are taken from the ranges they fall between). A metadata column called "intronic" is returned with the GRanges and marked as `TRUE` for these ranges. By default, `intronJunctions=FALSE` and these ranges are not mapped. Ranges that have either the start or end in an intron are considered "non hits" and are never mapped. Ranges that span introns are always mapped. Neither of these range types are controlled by the `intronJunctions` argument.
`extractor.fun`	Function to extract genomic features from a `TxDb` object. This argument is only applicable to `mapToTranscripts` when `transcripts` is a `TxDb` object. The `extractor` should be the name of a function (not a character()) described on the `?transcripts` , `?transcriptsBy` , or `?microRNAs` man page. Valid `extractor` functions:

transcripts ## default
exons
cds
genes
promoters
exonicParts
intronicParts
transcriptsBy
exonsBy
cdsBy
intronsByTranscript
fiveUTRsByTranscript
threeUTRsByTranscript
microRNAs
tRNAs
|list() | Additional arguments passed to extractor.fun functions. |

Details

In GenomicFeatures >= 1.21.10, the default for ignore.strand was changed to FALSE for consistency with other methods in the GenomicRanges and GenomicAlignments packages. Additionally, the mapped position is computed from the TSS and does not depend on the ignore.strand argument. See the section on ignore.strand for details.

list(list("mapToTranscripts"), ", ", list("pmapToTranscripts")) list(" ", " The genomic range in ", list("x"), " is mapped to the local position in the ", " ", list("transcripts"), " ranges. A successful mapping occurs when ", list("x"), " ", " is completely within the ", list("transcripts"), " range, equivalent to: ", " ", list(" ", " findOverlaps(..., type="within") ", " "), " ", " Transcriptome-based coordinates start counting at 1 at the beginning ", " of the ", list("transcripts"), " range and return positions where ", list("x"), " ", " was aligned. The seqlevels of the return object are taken from the ", " ", list("transcripts"), " object and should be transcript names. In this ", " direction, mapping is attempted between all elements of ", list("x"), " and ", " all elements of ", list("transcripts"), ". ", " ", " ", list("mapToTranscripts"), " uses ", list("findOverlaps"), " to map ranges in ", " ", list("x"), " to ranges in ", list("transcripts"), ". This method does not return ", " unmapped ranges. ", " ", " ", list("pmapToTranscripts"), " maps the i-th range in ", list("x"), " to the ", " i-th range in ", list("transcripts"), ". Recycling is supported for both ", " ", list("x"), " and ", list("transcripts"), " when either is length == 1L; otherwise ", " the lengths must match. Ranges in ", list("x"), " that do not map (out of bounds ", " or strand mismatch) are returned as zero-width ranges starting at 0. ", " These ranges are given the seqname of "UNMAPPED". ", " ")
list(list("mapFromTranscripts"), ", ", list("pmapFromTranscripts")) list(" ", " The transcript-based position in ", list("x"), " is mapped to genomic coordinates ", " using the ranges in ", list("transcripts"), ". A successful mapping occurs when ", " the following is TRUE: ", " ", list(" ", " width(transcripts) >= start(x) + width(x) ", " "), " ", " ", list("x"), " is aligned to ", list("transcripts"), " by moving in ", list("start(x)"), " ", " positions in from the beginning of the ", list("transcripts"), " range. The ", " seqlevels of the return object are chromosome names. ", " ", " ", list("mapFromTranscripts"), " uses the seqname of ", list("x"), " and the names ", " of ", list("transcripts"), " to determine mapping pairs (vs attempting to match ", " all possible pairs). Name matching is motivated by use cases such as ", " differentially expressed regions where the expressed regions in ", list("x"), " ", " would only be related to a subset of regions in ", list("transcripts"), ". ", " This method does not return unmapped ranges. ", " ", " ", list("pmapFromTranscripts"), " maps the i-th range in ", list("x"), " to the i-th ", " range in ", list("transcripts"), " and therefore does not use name matching. ", " Recycling is supported in ", list("pmapFromTranscripts"), " when either ", " ", list("x"), " or ", list("transcripts"), " is length == 1L; otherwise the lengths ", " must match. Ranges in ", list("x"), " that do not map (out of bounds or strand ", " mismatch) are returned as zero-width ranges starting at 0. These ranges ", " are given the seqname of "UNMAPPED". ", " ")

Value

pmapToTranscripts returns a GRanges the same length as x .

pmapFromTranscripts returns a GRanges when transcripts is a GRanges and a GRangesList when transcripts is a GRangesList . In both cases the return object is the same length as x . The rational for returning the GRangesList is to preserve exon structure; ranges in a list element that are not overlapped by x are returned as a zero-width range. The GRangesList return object will have no seqlevels called "UNMAPPED"; those will only occur when a GRanges is returned.

mapToTranscripts and mapFromTranscripts return GRanges objects that vary in length similar to a Hits object. The result contains mapped records only; strand mismatch and out of bound ranges are not returned. xHits and transcriptsHits metadata columns (similar to the queryHits and subjectHits of a Hits object) indicate elements of x and transcripts used in the mapping.

When intronJunctions is TRUE, mapToTranscripts returns an extra metdata column named intronic to identify the intron ranges.

When mapping to transcript coordinates, seqlevels of the output are the names on the transcripts object and most often these will be transcript names. When mapping to the genome, seqlevels of the output are the seqlevels of transcripts which are usually chromosome names.

Author

V. Obenchain, M. Lawrence and H. Pagès

Examples

## -------------------------------------------------------------------------
## A. Basic Use: Conversion between CDS and Exon coordinates and the genome
## -------------------------------------------------------------------------

## Gene "Dgkb" has ENTREZID "217480":
library(org.Mm.eg.db)
Dgkb_geneid <- get("Dgkb", org.Mm.egSYMBOL2EG)

## The gene is on the positive strand, chromosome 12:
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene
tx_by_gene <- transcriptsBy(txdb, by="gene")
Dgkb_transcripts <- tx_by_gene[[Dgkb_geneid]]
Dgkb_transcripts  # all 7 Dgkb transcripts are on chr12, positive strand

## To map coordinates from local CDS or exon space to genome
## space use mapFromTranscripts().

## When mapping CDS coordinates to genome space the 'transcripts'
## argument is the collection of CDS regions by transcript.
coord <- GRanges("chr12", IRanges(4, width=1))
## Get the names of the transcripts in the gene:
Dgkb_tx_names <- mcols(Dgkb_transcripts)$tx_name
Dgkb_tx_names
## Use these names to isolate the region of interest:
cds_by_tx <- cdsBy(txdb, "tx", use.names=TRUE)
Dgkb_cds_by_tx <- cds_by_tx[intersect(Dgkb_tx_names, names(cds_by_tx))]
Dgkb_cds_by_tx  # Dgkb CDS grouped by transcript (no-CDS transcripts omitted)
lengths(Dgkb_cds_by_tx)  # nb of CDS per transcript
## A requirement for mapping from transcript space to genome space
## is that seqnames in 'x' match the names in 'transcripts'.
names(Dgkb_cds_by_tx) <- rep(seqnames(coord), length(Dgkb_cds_by_tx))
## There are 6 results, one for each transcript.
mapFromTranscripts(coord, Dgkb_cds_by_tx)

## To map exon coordinates to genome space the 'transcripts'
## argument is the collection of exon regions by transcript.
coord <- GRanges("chr12", IRanges(100, width=1))
ex_by_tx <- exonsBy(txdb, "tx", use.names=TRUE)
Dgkb_ex_by_tx <- ex_by_tx[Dgkb_tx_names]
names(Dgkb_ex_by_tx) <- rep(seqnames(coord), length(Dgkb_ex_by_tx))
## Again the output has 6 results, one for each transcript.
mapFromTranscripts(coord, Dgkb_ex_by_tx)

## To go the reverse direction and map from genome space to
## local CDS or exon space, use mapToTranscripts().

## Genomic position 37981944 maps to CDS position 4:
coord <- GRanges("chr12", IRanges(37981944, width=1))
mapToTranscripts(coord, Dgkb_cds_by_tx)

## Genomic position 37880273 maps to exon position 100:
coord <- GRanges("chr12", IRanges(37880273, width=1))
mapToTranscripts(coord, Dgkb_ex_by_tx)


## The following examples use more than 2GB of memory, which is more
## than what 32-bit Windows can handle:
is_32bit_windows <- .Platform$OS.type == "windows" &&
.Platform$r_arch == "i386"
if (!is_32bit_windows) {
## ---------------------------------------------------------------------
## B. Map sequence locations in exons to the genome
## ---------------------------------------------------------------------

## NAGNAG alternative splicing plays an essential role in biological processes
## and represents a highly adaptable system for posttranslational regulation
## of gene function. The majority of NAGNAG studies largely focus on messenger
## RNA. A study by Sun, Lin, and Yan
## (http://www.hindawi.com/journals/bmri/2014/736798/) demonstrated that
## NAGNAG splicing is also operative in large intergenic noncoding RNA
## (lincRNA).

## One finding of interest was that linc-POLR3G-10 exhibited two NAGNAG
## acceptors located in two distinct transcripts: TCONS_00010012 and
## TCONS_00010010.

## Extract the exon coordinates of TCONS_00010012 and TCONS_00010010:
lincrna <- c("TCONS_00010012", "TCONS_00010010")
library(TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts)
txdb <- TxDb.Hsapiens.UCSC.hg19.lincRNAsTranscripts
exons <- exonsBy(txdb, by="tx", use.names=TRUE)[lincrna]
exons

## The two NAGNAG acceptors were identified in the upstream region of
## the fourth and fifth exons located in TCONS_00010012.
## Extract the sequences for transcript TCONS_00010012:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
exons_seq <- getSeq(genome, exons[[1]])

## TCONS_00010012 has 4 exons:
exons_seq

## The most common triplet among the lincRNA sequences was CAG. Identify
## the location of this pattern in all exons.
cag_loc <- vmatchPattern("CAG", exons_seq)

## Convert the first occurance of CAG in each exon back to genome coordinates.
first_loc <- do.call(c, sapply(cag_loc, "[", 1, simplify=TRUE))
pmapFromTranscripts(first_loc, exons[[1]])

## -----------------------------------------------------------------------
## C. Map dbSNP variants to CDS or cDNA coordinates
## -----------------------------------------------------------------------

## The GIPR gene encodes a G-protein coupled receptor for gastric inhibitory
## polypeptide (GIP). Originally GIP was identified to inhibited gastric acid
## secretion and gastrin release but was later demonstrated to stimulate
## insulin release in the presence of elevated glucose.

## In this example 5 SNPs located in the GIPR gene are mapped to cDNA
## coordinates. A list of SNPs in GIPR can be downloaded from dbSNP or NCBI.
rsids <- c("rs4803846", "rs139322374", "rs7250736", "rs7250754", "rs9749185")

## Extract genomic coordinates with a SNPlocs package.
library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
snps <- snpsById(SNPlocs.Hsapiens.dbSNP144.GRCh38, rsids)

## Gene regions of GIPR can be extracted from a TxDb package of compatible
## build. The TxDb package uses Entrez gene identifiers and GIPR is a gene
## symbol. Let's first lookup its Entrez gene ID.
library(org.Hs.eg.db)
GIPR_geneid <- get("GIPR", org.Hs.egSYMBOL2EG)

## The transcriptsBy() extractor returns a range for each transcript that
## includes the UTR and exon regions (i.e., cDNA).
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
tx_by_gene <- transcriptsBy(txdb, "gene")
GIPR_transcripts <- tx_by_gene[GIPR_geneid]
GIPR_transcripts  # all 8 GIPR transcripts are on chr19, positive strand

## Before mapping, the chromosome names (seqlevels) in the two
## objects must be harmonized. The style is NCBI for 'snps' and
## UCSC for 'GIPR_transcripts'.
seqlevelsStyle(snps)
seqlevelsStyle(GIPR_transcripts)

## Modify the style and genome in 'snps' to match 'GIPR_transcripts'.
seqlevelsStyle(snps) <- seqlevelsStyle(GIPR_transcripts)
genome(snps) <- genome(GIPR_transcripts)

## The 'GIPR_transcripts' object is a GRangesList of length 1. This single
## list element contains the cDNA range for 8 different transcripts. To map
## to each transcript individually 'GIPR_transcripts' must be unlisted
## before mapping.

## Map all 5 SNPS to all 8 transcripts:
mapToTranscripts(snps, unlist(GIPR_transcripts))

## Map the first SNP to transcript "ENST00000590918.5" and the second to
## "ENST00000263281.7".
pmapToTranscripts(snps[1:2], unlist(GIPR_transcripts)[1:2])

## The cdsBy() extractor returns coding regions by gene or by transcript.
## Extract the coding regions for transcript "ENST00000263281.7".
cds <- cdsBy(txdb, "tx", use.names=TRUE)["ENST00000263281.7"]
cds

## The 'cds' object is a GRangesList of length 1 containing all CDS ranges
## for the single transcript "ENST00000263281.7".

## To map to the concatenated group of ranges leave 'cds' as a GRangesList.
mapToTranscripts(snps, cds)

## Only the second SNP could be mapped. Unlisting the 'cds' object maps the
## SNPs to the individual cds ranges (vs the concatenated range).
mapToTranscripts(snps[2], unlist(cds))

## The location is the same because the SNP hit the first CDS range. If the
## transcript were on the "-" strand the difference in concatenated vs
## non-concatenated position would be more obvious.

## Change strand:
strand(cds) <- strand(snps) <- "-"
mapToTranscripts(snps[2], unlist(cds))
}

coverageByTranscript()

Compute coverage by transcript (or CDS) of a set of ranges

Description

coverageByTranscript computes the transcript (or CDS) coverage of a set of ranges.

pcoverageByTranscript is a version of coverageByTranscript that operates element-wise.

Usage

coverageByTranscript(x, transcripts, ignore.strand=FALSE)
pcoverageByTranscript(x, transcripts, ignore.strand=FALSE, ...)

Arguments

Argument	Description
`x`	An object representing a set of ranges (typically aligned reads). GRanges , GRangesList , GAlignments , GAlignmentPairs , and GAlignmentsList objects are supported. More generally, for `coverageByTranscript` `x` can be any object for which `seqinfo` and `coverage` are supported (e.g. a BamFile object). Note that, for such objects, `coverage()` is expected to return an RleList object whose names are `seqlevels(x)` ). More generally, for `pcoverageByTranscript` `x` can be any object for which `grglist` is supported. It should have the length of `transcripts` or length 1. If the latter, it is recycled to the length of `transcripts` .
`transcripts`	A GRangesList object representing the exons of each transcript for which to compute coverage. For each transcript, the exons must be ordered by ascending rank , that is, by their position in the transcript. This means that, for a transcript located on the minus strand, the exons should typically be ordered by descending position on the reference genome. If `transcripts` was obtained with `exonsBy` , then the exons are guaranteed to be ordered by ascending rank. See `?` for more information. Alternatively, `transcripts` can be a TxDb object, or any TxDb -like object that supports the `exonsBy` extractor (e.g. an EnsDb object). In this case it is replaced with the GRangesList object returned by `exonsBy` . For `pcoverageByTranscript` , `transcripts` should have the length of `x` or length 1. If the latter, it is recycled to the length of `x` .
`ignore.strand`	TRUE or FALSE. If FALSE (the default) then the strand of a range in `x` and exon in `transcripts` must be the same in order for the range to contribute coverage to the exon. If TRUE then the strand is ignored.
`...`	Additional arguments passed to the internal call to `grglist` . More precisely, when `x` is not a GRanges or GRangesList object, `pcoverageByTranscript` replace it with the GRangesList object returned by `grglist` .

Value

An RleList object parallel to transcripts , that is, the i-th element in it is an integer- Rle representing the coverage of the i-th transcript in transcripts . Its lengths() is guaranteed to be identical to sum(width(transcripts)) . The names and metadata columns on transcripts are propagated to it.

transcripts , transcriptsBy , and transcriptsByOverlaps , for extracting genomic feature locations from a TxDb -like object.
transcriptLengths for extracting the transcript lengths (and other metrics) from a TxDb object.
extractTranscriptSeqs for extracting transcript (or CDS) sequences from chromosome sequences.
The RleList class defined and documented in the IRanges package.
The GRangesList class defined and documented in the GenomicRanges package.
The coverage methods defined in the GenomicRanges package.
The exonsBy function for extracting exon ranges grouped by transcript.
findCompatibleOverlaps in the GenomicAlignments package for finding which reads are compatible with the splicing of which transcript.

Author

Hervé Pagès

Examples

## ---------------------------------------------------------------------
## 1. A SIMPLE ARTIFICIAL EXAMPLE WITH ONLY ONE TRANSCRIPT
## ---------------------------------------------------------------------

## Get some transcripts:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
dm3_transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
dm3_transcripts

## Let's pick up the 1st transcript: FBtr0300689. It as 2 exons and 1
## intron:
my_transcript <- dm3_transcripts["FBtr0300689"]

## Let's create 3 artificial aligned reads. We represent them as a
## GRanges object of length 3 that contains the genomic positions of
## the 3 reads. Note that these reads are simple alignments i.e. each
## of them can be represented with a single range. This would not be
## the case if they were junction reads.
my_reads <- GRanges(c("chr2L:7531-7630",
"chr2L:8101-8200",
"chr2L:8141-8240"))

## The coverage of the 3 reads on the reference genome is:
coverage(my_reads)

## As you can see, all the genomic positions in the 3 ranges participate
## to the coverage. This can be confirmed by comparing:
sum(coverage(my_reads))
## with:
sum(width(my_reads))
## They should always be the same.

## When computing the coverage on a transcript, only the part of the
## read that overlaps with the transcript participates to the coverage.
## Let's look at the individual coverage of each read on transcript
## FBtr0300689:

## The 1st read is fully contained within the 1st exon:
coverageByTranscript(my_reads[1], my_transcript)

## Note that the length of the Rle (1880) is the length of the transcript.

## The 2nd and 3rd reads overlap the 2 exons and the intron. Only the
## parts that overlap the exons participate to coverage:
coverageByTranscript(my_reads[2], my_transcript)
coverageByTranscript(my_reads[3], my_transcript)

## The coverage of the 3 reads together is:
coverageByTranscript(my_reads, my_transcript)

## Note that this is the sum of the individual coverages. This can be
## checked with:
stopifnot(all(
coverageByTranscript(my_reads, my_transcript)
==
Reduce("+", lapply(seq_along(my_reads),
function(i) coverageByTranscript(my_reads[i], my_transcript)), 0L)
))

## ---------------------------------------------------------------------
## 2. COMPUTE THE FULL TRANSCRIPTOME COVERAGE OF A SET OF ALIGNED READS
## ---------------------------------------------------------------------

## Load the aligned reads:
library(pasillaBamSubset)
library(GenomicAlignments)
reads <- readGAlignments(untreated1_chr4())

## Compute the full transcriptome coverage by calling
## coverageByTranscript() on 'dm3_transcripts':
tx_cvg <- coverageByTranscript(reads, dm3_transcripts, ignore.strand=TRUE)
tx_cvg

## A sanity check:
stopifnot(identical(lengths(tx_cvg), sum(width(dm3_transcripts))))

## We can also use pcoverageByTranscript() to compute 'tx_cvg'.
## For this we first create a GAlignmentsList object "parallel" to
## 'dm3_transcripts' where the i-th list element contains the aligned
## reads that overlap with the i-th transcript:
hits <- findOverlaps(reads, dm3_transcripts, ignore.strand=TRUE)
tx2reads <- setNames(as(t(hits), "List"), names(dm3_transcripts))
reads_by_tx <- extractList(reads, tx2reads)  # GAlignmentsList object
reads_by_tx

## Call pcoverageByTranscript():
tx_cvg2 <- pcoverageByTranscript(reads_by_tx, dm3_transcripts,
ignore.strand=TRUE)
stopifnot(identical(tx_cvg, tx_cvg2))

## A more meaningful coverage is obtained by counting for each
## transcript only the reads that are *compatible* with its splicing:
compat_hits <- findCompatibleOverlaps(reads, dm3_transcripts)
tx2reads <- setNames(as(t(compat_hits), "List"), names(dm3_transcripts))
compat_reads_by_tx <- extractList(reads, tx2reads)

tx_compat_cvg <- pcoverageByTranscript(compat_reads_by_tx,
dm3_transcripts,
ignore.strand=TRUE)
## A sanity check:
stopifnot(all(all(tx_compat_cvg <= tx_cvg)))

## ---------------------------------------------------------------------
## 3. COMPUTE CDS COVERAGE OF A SET OF ALIGNED READS
## ---------------------------------------------------------------------

## coverageByTranscript() can also be used to compute CDS coverage:
cds <- cdsBy(txdb, by="tx", use.names=TRUE)
cds_cvg <- coverageByTranscript(reads, cds, ignore.strand=TRUE)
cds_cvg

## A sanity check:
stopifnot(identical(lengths(cds_cvg), sum(width(cds))))

## ---------------------------------------------------------------------
## 4. ALTERNATIVELY, THE CDS COVERAGE CAN BE OBTAINED FROM THE
##    TRANSCRIPT COVERAGE BY TRIMMING THE 5' AND 3' UTRS
## ---------------------------------------------------------------------

tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE)
stopifnot(identical(tx_lens$tx_name, names(tx_cvg)))  # sanity

## Keep the rows in 'tx_lens' that correspond to a list element in
## 'cds_cvg' and put them in the same order as in 'cds_cvg':
m <- match(names(cds_cvg), names(tx_cvg))
tx_lens <- tx_lens[m, ]
utr5_width <- tx_lens$utr5_len
utr3_width <- tx_lens$utr3_len
cds_cvg2 <- windows(tx_cvg[m], start=1L+utr5_width, end=-1L-utr3_width)

## A sanity check:
stopifnot(identical(cds_cvg2, cds_cvg))

disjointExons()

Extract non-overlapping exon parts from an object

Description

disjointExons extracts the non-overlapping exon parts from a TxDb object or any other supported object.

WARNING: disjointExons is superseded by exonicParts and will be deprecated soon. Please use improved exonicParts instead.

Usage

disjointExons(x, ...)
list(list("disjointExons"), list("TxDb"))(x, aggregateGenes=FALSE, 
                 includeTranscripts=TRUE, ...)

Arguments

Argument	Description
`x`	A TxDb object or any other supported object.
`...`	Arguments to be passed to methods.
`aggregateGenes`	For `disjointExons` : A `logical` . When `FALSE` (default) exon fragments that overlap multiple genes are dropped. When `TRUE` , all fragments are kept and the `gene_id` metadata column includes all gene ids that overlap the exon fragment.
`includeTranscripts`	For `disjointExons` : A `logical` . When `TRUE` (default) a `tx_name` metadata column is included that lists all transcript names that overlap the exon fragment.

Details

disjointExons creates a GRanges of non-overlapping exon parts with metadata columns of gene_id and exonic_part. Exon parts that overlap more than 1 gene can be dropped with aggregateGenes=FALSE . When includeTranscripts=TRUE a tx_name metadata column is included that lists all transcript names that overlap the exon fragment. This function replaces prepareAnnotationForDEXSeq in the DEXSeq package.

Value

A GRanges object.

Author

disjointExons was originally implemented by Mike Love and Alejandro Reyes and then moved (and adapted) to GenomicFeatures by Valerie Obenchain.

Examples

## TODO

exonicParts()

Extract non-overlapping exonic or intronic parts from a TxDb-like object

Description

exonicParts and intronicParts extract the non-overlapping (a.k.a. disjoint) exonic or intronic parts from a TxDb -like object.

Usage

exonicParts(txdb, linked.to.single.gene.only=FALSE)
intronicParts(txdb, linked.to.single.gene.only=FALSE)
## 3 helper functions used internally by exonicParts() and intronicParts():
tidyTranscripts(txdb, drop.geneless=FALSE)
tidyExons(txdb, drop.geneless=FALSE)
tidyIntrons(txdb, drop.geneless=FALSE)

Arguments

Argument	Description
`txdb`	A TxDb object, or any TxDb -like object that supports the `transcripts` and `exonsBy` extractors (e.g. an EnsDb object).
`linked.to.single.gene.only`	`TRUE` or `FALSE` . If `FALSE` (the default), then the disjoint parts are obtained by calling `disjoin` on all the exons (or introns) in `txdb` , including on exons (or introns) not linked to a gene or linked to more than one gene. If `TRUE` , then the disjoint parts are obtained in 2 steps:

call disjoin on the exons (or introns) linked to at least one gene ,
then drop the parts linked to more than one gene from the set of exonic (or intronic) parts obtained previously.
|drop.geneless | If FALSE (the default), then all the transcripts (or exons, or introns) get extracted from the TxDb object. If TRUE , then only the transcripts (or exons, or introns) that are linked to a gene get extracted from the TxDb object. Note that drop.geneless also impacts the order in which the features are returned: |
Transcripts: If drop.geneless is FALSE then transcripts are returned in the same order as with transcripts , which is expected to be by internal transcript id ( tx_id ). Otherwise they are ordered first by gene id ( gene_id ), then by internal transcript id.
Exons: If drop.geneless is FALSE then exons are ordered first by internal transcript id ( tx_id ), then by exon rank ( exon_rank ). Otherwise they are ordered first by gene id ( gene_id ), then by internal transcript id, and then by exon rank.
Introns: If drop.geneless is FALSE then introns are ordered by internal transcript id ( tx_id ). Otherwise they are ordered first by gene id ( gene_id ), then by internal transcript id.

Value

exonicParts returns a disjoint and strictly sorted GRanges object with 1 range per exonic part and with metadata columns tx_id , tx_name , gene_id , exon_id , exon_name , and exon_rank .

intronicParts returns a disjoint and strictly sorted GRanges object with 1 range per intronic part and with metadata columns tx_id , tx_name , and gene_id .

tidyTranscripts returns a GRanges object with 1 range per transcript and with metadata columns tx_id , tx_name , and gene_id .

tidyExons returns a GRanges object with 1 range per exon and with metadata columns tx_id , tx_name , gene_id , exon_id , exon_name , and exon_rank .

tidyIntrons returns a GRanges object with 1 range per intron and with metadata columns tx_id , tx_name , and gene_id .

disjoin in the IRanges package.
transcripts , transcriptsBy , and transcriptsByOverlaps , for extracting genomic feature locations from a TxDb -like object.
transcriptLengths for extracting the transcript lengths (and other metrics) from a TxDb object.
extractTranscriptSeqs for extracting transcript (or CDS) sequences from chromosome sequences.
coverageByTranscript for computing coverage by transcript (or CDS) of a set of ranges.
The TxDb class.

Note

exonicParts is a replacement for disjointExons with the following differences/improvements:

Argument linked.to.single.gene.only in exonicParts replaces argument aggregateGenes in disjointExons , but has opposite meaning i.e. exonicParts(txdb, linked.to.single.gene.only=TRUE) returns the same exonic parts as disjointExons(txdb, aggregateGenes=FALSE) .
Unlike disjointExons(txdb, aggregateGenes=TRUE) , exonicParts(txdb, linked.to.single.gene.only=FALSE) does NOT discard exon parts that are not linked to a gene.
exonicParts is almost 2x more efficient than disjointExons .
exonicParts works out-of-the-box on any TxDb -like object that supports the transcripts and exonsBy extractors (e.g. on an EnsDb object).

Author

Hervé Pagès

Examples

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## ---------------------------------------------------------------------
## exonicParts()
## ---------------------------------------------------------------------

exonic_parts1 <- exonicParts(txdb)
exonic_parts1

## Mapping from exonic parts to genes is many-to-many:
mcols(exonic_parts1)$gene_id
table(lengths(mcols(exonic_parts1)$gene_id))
## A Human exonic part can be linked to 0 to 22 known genes!

exonic_parts2 <- exonicParts(txdb, linked.to.single.gene.only=TRUE)
exonic_parts2

## Mapping from exonic parts to genes now is many-to-one:
class(mcols(exonic_parts2)$gene_id)

## Sanity checks:
stopifnot(isDisjoint(exonic_parts1), isStrictlySorted(exonic_parts1))
stopifnot(isDisjoint(exonic_parts2), isStrictlySorted(exonic_parts2))
stopifnot(all(exonic_parts2 %within% reduce(exonic_parts1)))
stopifnot(identical(
lengths(mcols(exonic_parts1)$gene_id) == 1L,
exonic_parts1 %within% exonic_parts2
))

## ---------------------------------------------------------------------
## intronicParts()
## ---------------------------------------------------------------------

intronic_parts1 <- intronicParts(txdb)
intronic_parts1

## Mapping from intronic parts to genes is many-to-many:
mcols(intronic_parts1)$gene_id
table(lengths(mcols(intronic_parts1)$gene_id))
## A Human intronic part can be linked to 0 to 22 known genes!

intronic_parts2 <- intronicParts(txdb, linked.to.single.gene.only=TRUE)
intronic_parts2

## Mapping from intronic parts to genes now is many-to-one:
class(mcols(intronic_parts2)$gene_id)

## Sanity checks:
stopifnot(isDisjoint(intronic_parts1), isStrictlySorted(intronic_parts1))
stopifnot(isDisjoint(intronic_parts2), isStrictlySorted(intronic_parts2))
stopifnot(all(intronic_parts2 %within% reduce(intronic_parts1)))
stopifnot(identical(
lengths(mcols(intronic_parts1)$gene_id) == 1L,
intronic_parts1 %within% intronic_parts2
))

## ---------------------------------------------------------------------
## Helper functions
## ---------------------------------------------------------------------

tidyTranscripts(txdb)                      # Ordered by 'tx_id'.
tidyTranscripts(txdb, drop.geneless=TRUE)  # Ordered first by 'gene_id',
# then by 'tx_id'.

tidyExons(txdb)                            # Ordered first by 'tx_id',
# then by 'exon_rank'.
tidyExons(txdb, drop.geneless=TRUE)        # Ordered first by 'gene_id',
# then by 'tx_id',
# then by 'exon_rank'.

tidyIntrons(txdb)                          # Ordered by 'tx_id'.
tidyIntrons(txdb, drop.geneless=TRUE)      # Ordered first by 'gene_id',
# then by 'tx_id'.

extractTranscriptSeqs()

Extract transcript (or CDS) sequences from chromosome sequences

Description

extractTranscriptSeqs extracts transcript (or CDS) sequences from an object representing a single chromosome or a collection of chromosomes.

Usage

extractTranscriptSeqs(x, transcripts, ...)
list(list("extractTranscriptSeqs"), list("DNAString"))(x, transcripts, strand="+")
list(list("extractTranscriptSeqs"), list("ANY"))(x, transcripts, ...)

Arguments

Argument	Description
`x`	An object representing a single chromosome or a collection of chromosomes. More precisely, `x` can be a DNAString object (single chromosome), or a BSgenome object (collection of chromosomes). Other objects representing a collection of chromosomes are supported (e.g. FaFile objects in the Rsamtools package) as long as `seqinfo` and `getSeq` work on them.
`transcripts`	An object representing the exon ranges of each transcript to extract. More precisely:

If x is a DNAString object, then transcripts must be an IntegerRangesList object.
If x is a BSgenome object or any object representing a collection of chromosomes, then transcripts must be a GRangesList object or any object for which exonsBy is implemented (e.g. a TxDb or EnsDb object). If the latter, then it's first turned into a GRangesList object with exonsBy . Note that, for each transcript, the exons must be ordered by ascending rank , that is, by ascending position in the transcript (when going in the 5' to 3' direction). This generally means (but not always) that they are also ordered from 5' to 3' on the reference genome. More precisely:
For a transcript located on the plus strand, the exons will typically (but not necessarily) be ordered by ascending position on the reference genome.
For a transcript located on the minus strand, the exons will typically (but not necessarily) be ordered by descending position on the reference genome. If transcripts was obtained with exonsBy (see above), then the exons are guaranteed to be ordered by ascending rank. See ? for more information. |... | Additional arguments, for use in specific methods. For the default method, additional arguments are allowed only when transcripts is not a GRangesList object, in which case they are passed to the internal call to exonsBy (see above). | |strand | Only supported when x is a DNAString object. Can be an atomic vector, a factor, or an Rle object, in which case it indicates the strand of each transcript (i.e. all the exons in a transcript are considered to be on the same strand). More precisely: it's turned into a factor (or factor- Rle ) that has the "standard strand levels" (this is done by calling the strand function on it). Then it's recycled to the length of IntegerRangesList object transcripts if needed. In the resulting object, the i-th element is interpreted as the strand of all the exons in the i-th transcript. strand can also be a list-like object, in which case it indicates the strand of each exon, individually. Thus it must have the same shape as IntegerRangesList object transcripts (i.e. same length plus strand[[i]] must have the same length as transcripts[[i]] for all i ). strand can only contain "+" and/or "-" values. "*" is not allowed. |

Value

A DNAStringSet object parallel to transcripts , that is, the i-th element in it is the sequence of the i-th transcript in transcripts .

coverageByTranscript for computing coverage by transcript (or CDS) of a set of ranges.
transcriptLengths for extracting the transcript lengths (and other metrics) from a TxDb object.
The transcriptLocs2refLocs function for converting transcript-based locations into reference-based locations.
The available.genomes function in the BSgenome package for checking avaibility of BSgenome data packages (and installing the desired one).
The DNAString and DNAStringSet classes defined and documented in the Biostrings package.
The translate function in the Biostrings package for translating DNA or RNA sequences into amino acid sequences.
The GRangesList class defined and documented in the GenomicRanges package.
The IntegerRangesList class defined and documented in the IRanges package.
The exonsBy function for extracting exon ranges grouped by transcript.
The TxDb class.

Author

Hervé Pagès

Examples

## ---------------------------------------------------------------------
## 1. A TOY EXAMPLE
## ---------------------------------------------------------------------

library(Biostrings)

## A chromosome of length 30:
x <- DNAString("ATTTAGGACACTCCCTGAGGACAAGACCCC")

## 2 transcripts on 'x':
tx1 <- IRanges(1, 8)            # 1 exon
tx2 <- c(tx1, IRanges(12, 30))  # 2 exons
transcripts <- IRangesList(tx1=tx1, tx2=tx2)
extractTranscriptSeqs(x, transcripts)

## By default, all the exons are considered to be on the plus strand.
## We can use the 'strand' argument to tell extractTranscriptSeqs()
## to extract them from the minus strand.

## Extract all the exons from the minus strand:
extractTranscriptSeqs(x, transcripts, strand="-")

## Note that, for a transcript located on the minus strand, the exons
## should typically be ordered by descending position on the reference
## genome in order to reflect their rank in the transcript:
extractTranscriptSeqs(x, IRangesList(tx1=tx1, tx2=rev(tx2)), strand="-")

## Extract the exon of the 1st transcript from the minus strand:
extractTranscriptSeqs(x, transcripts, strand=c("-", "+"))

## Extract the 2nd exon of the 2nd transcript from the minus strand:
extractTranscriptSeqs(x, transcripts, strand=list("-", c("+", "-")))

## ---------------------------------------------------------------------
## 2. A REAL EXAMPLE
## ---------------------------------------------------------------------

## Load a genome:
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19

## Load a TxDb object:
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(txdb_file)

## Check that 'txdb' is based on the hg19 assembly:
txdb

## Extract the exon ranges grouped by transcript from 'txdb':
transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)

## Extract the transcript sequences from the genome:
tx_seqs <- extractTranscriptSeqs(genome, transcripts)
tx_seqs

## A sanity check:
stopifnot(identical(width(tx_seqs), unname(sum(width(transcripts)))))

## Note that 'tx_seqs' can also be obtained with:
extractTranscriptSeqs(genome, txdb, use.names=TRUE)

## ---------------------------------------------------------------------
## 3. USING extractTranscriptSeqs() TO EXTRACT CDS SEQUENCES
## ---------------------------------------------------------------------

cds <- cdsBy(txdb, by="tx", use.names=TRUE)
cds_seqs <- extractTranscriptSeqs(genome, cds)
cds_seqs

## A sanity check:
stopifnot(identical(width(cds_seqs), unname(sum(width(cds)))))

## Note that, alternatively, the CDS sequences can be obtained from the
## transcript sequences by removing the 5' and 3' UTRs:
tx_lens <- transcriptLengths(txdb, with.utr5_len=TRUE, with.utr3_len=TRUE)
stopifnot(identical(tx_lens$tx_name, names(tx_seqs)))  # sanity
## Keep the rows in 'tx_lens' that correspond to a sequence in 'cds_seqs'
## and put them in the same order as in 'cds_seqs':
m <- match(names(cds_seqs), names(tx_seqs))
tx_lens <- tx_lens[m, ]
utr5_width <- tx_lens$utr5_len
utr3_width <- tx_lens$utr3_len
cds_seqs2 <- narrow(tx_seqs[m],
start=utr5_width+1L, end=-(utr3_width+1L))
stopifnot(identical(as.character(cds_seqs2), as.character(cds_seqs)))

## ---------------------------------------------------------------------
## 4. TRANSLATE THE CDS SEQUENCES
## ---------------------------------------------------------------------

prot_seqs <- translate(cds_seqs, if.fuzzy.codon="solve")

## Note that, by default, translate() uses The Standard Genetic Code to
## translate codons into amino acids. However, depending on the organism,
## a different genetic code might be needed to translate CDS sequences
## located on the mitochodrial chromosome. For example, for vertebrates,
## the following code could be used to correct 'prot_seqs':
SGC1 <- getGeneticCode("SGC1")
chrM_idx <- which(all(seqnames(cds) == "chrM"))
prot_seqs[chrM_idx] <- translate(cds_seqs[chrM_idx], genetic.code=SGC1,
if.fuzzy.codon="solve")

extractUpstreamSeqs()

Extract sequences upstream of a set of genes or transcripts

Description

extractUpstreamSeqs is a generic function for extracting sequences upstream of a supplied set of genes or transcripts.

Usage

extractUpstreamSeqs(x, genes, width=1000, ...)
## Dispatch is on the 2nd argument!
list(list("extractUpstreamSeqs"), list("GenomicRanges"))(x, genes, width=1000)
list(list("extractUpstreamSeqs"), list("TxDb"))(x, genes, width=1000, exclude.seqlevels=NULL)

Arguments

Argument	Description
`x`	An object containing the chromosome sequences from which to extract the upstream sequences. It can be a BSgenome , TwoBitFile , or FaFile object, or any genome sequence container . More formally, `x` must be an object for which `seqinfo` and `getSeq` are defined.
`genes`	An object containing the locations (i.e. chromosome name, start, end, and strand) of the genes or transcripts with respect to the reference genome. Only GenomicRanges and TxDb objects are supported at the moment. If the latter, the gene locations are obtained by calling the `genes` function on the TxDb object internally.
`width`	How many bases to extract upstream of each TSS (transcription start site).
`...`	Additional arguments, for use in specific methods.
`exclude.seqlevels`	A character vector containing the chromosome names (a.k.a. sequence levels) to exclude when the genes are obtained from a TxDb object.

Value

A DNAStringSet object containing one upstream sequence per gene (or per transcript if genes is a GenomicRanges object containing transcript ranges).

More precisely, if genes is a GenomicRanges object, the returned object is parallel to it, that is, the i-th element in the returned object is the upstream sequence corresponding to the i-th gene (or transcript) in genes . Also the names on the GenomicRanges object are propagated to the returned object.

If genes is a TxDb object, the names on the returned object are the gene IDs found in the TxDb object. To see the type of gene IDs (i.e. Entrez gene ID or Ensembl gene ID or ...), you can display genes with show(genes) .

In addition, the returned object has the following metadata columns (accessible with mcols ) that provide some information about the gene (or transcript) corresponding to each upstream sequence:

gene_seqnames : the chromosome name of the gene (or transcript);
gene_strand : the strand of the gene (or transcript);
gene_TSS : the transcription start site of the gene (or transcript).

makeTxDbFromUCSC , makeTxDbFromBiomart , and makeTxDbFromEnsembl , for making a TxDb object from online resources.
makeTxDbFromGRanges and makeTxDbFromGFF for making a TxDb object from a GRanges object, or from a GFF or GTF file.
The available.genomes function in the BSgenome package for checking avaibility of BSgenome data packages (and installing the desired one).
The BSgenome , TwoBitFile , and FaFile classes, defined and documented in the BSgenome , rtracklayer , and Rsamtools packages, respectively.
The TxDb class.
The genes function for extracting gene ranges from a TxDb object.
The GenomicRanges class defined and documented in the GenomicRanges package.
The DNAStringSet class defined and documented in the Biostrings package.
The seqinfo getter defined and documented in the GenomeInfoDb package.
The getSeq function for extracting subsequences from a sequence container.

Note

IMPORTANT: Always make sure to use a TxDb package (or TxDb object) that contains a gene model compatible with the list("genome sequence ", " container") x , that is, a gene model based on the exact same reference genome as x .

See http://bioconductor.org/packages/release/BiocViews.html#___TxDb for the list of TxDb packages available in the current release of Bioconductor. Note that you can make your own custom TxDb object from various annotation resources by using one of the makeTxDbFrom*() functions listed in the "See also" section below.

Author

Hervé Pagès

Examples

## Load a genome:
library(BSgenome.Dmelanogaster.UCSC.dm3)
genome <- BSgenome.Dmelanogaster.UCSC.dm3
genome

## Use a TxDb object:
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
txdb  # contains Ensembl gene IDs

## Because the chrU and chrUextra sequences are made of concatenated
## scaffolds (see http://genome.ucsc.edu/cgi-bin/hgGateway?db=dm3),
## extracting the upstream sequences for genes located on these
## scaffolds is not reliable. So we exclude them:
exclude <- c("chrU", "chrUextra")
up1000seqs <- extractUpstreamSeqs(genome, txdb, width=1000,
exclude.seqlevels=exclude)
up1000seqs  # the names are Ensembl gene IDs
mcols(up1000seqs)

## Upstream sequences for genes close to the chromosome bounds can be
## shorter than 1000 (note that this does not happen for circular
## chromosomes like chrM):
table(width(up1000seqs))
mcols(up1000seqs)[width(up1000seqs) != 1000, ]

features()

Extract simple features from a FeatureDb object

Description

Generic function to extract genomic features from a FeatureDb object.

Usage

features(x)
list(list("features"), list("FeatureDb"))(x)

Arguments

Argument	Description
`x`	A FeatureDb object.

Value

a GRanges object

Author

M. Carlson

Examples

fdb <- loadDb(system.file("extdata", "FeatureDb.sqlite",
package="GenomicFeatures"))
features(fdb)

getPromoterSeq_methods()

Get gene promoter sequences

Description

Extract sequences for the genes or transcripts specified in the query (a GRanges or GRangesList object) from a BSgenome object or an FaFile .

Usage

list(list("getPromoterSeq"), list("GRangesList"))(query, subject, upstream=2000, downstream=200, ...)
  list(list("getPromoterSeq"), list("GRangesList"))(query, subject, upstream=2000, downstream=200, ...)
  list(list("getPromoterSeq"), list("GRanges"))(query, subject, upstream=2000, downstream=200, ...)

Arguments

Argument	Description
`query`	A GRanges or GRangesList object containing genes grouped by transcript.
`subject`	A BSgenome object or a FaFile from which the sequences will be taken.
`upstream`	The number of DNA bases to include upstream of the TSS (transcription start site)
`downstream`	The number of DNA bases to include downstream of the TSS (transcription start site)
`list()`	Additional arguments

Details

getPromoterSeq is an overloaded method dispatching on query, which is either a GRanges or a GRangesList. It is a wrapper for the promoters and getSeq functions. The purpose is to allow sequence extraction from either a BSgenome or FaFile .

Default values for upstream and downstream were chosen based on our current understanding of gene regulation. On average, promoter regions in the mammalian genome are 5000 bp upstream and downstream of the transcription start site.

Value

A DNAStringSet or DNAStringSetList instance corresponding to the GRanges or GRangesList supplied in the query.

Author

Paul Shannon

Examples

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)

e2f3 <- "1871"  # entrez geneID for a cell cycle control transcription
# factor, chr6 on the plus strand

transcriptCoordsByGene.GRangesList <-
transcriptsBy (TxDb.Hsapiens.UCSC.hg19.knownGene, by = "gene") [e2f3]
# a GrangesList of length one, describing three transcripts

promoter.seqs <- getPromoterSeq (transcriptCoordsByGene.GRangesList,
Hsapiens, upstream=10, downstream=0)
# DNAStringSetList of length 1
# [["1871"]] GCTTCCTGGA GCTTCCTGGA CGGAGCCAGG

id2name()

Map internal ids to external names for a given feature type

Description

Utility function for retrieving the mapping from the internal ids to the external names of a given feature type.

Usage

id2name(txdb, feature.type=c("tx", "exon", "cds"))

Arguments

Argument	Description
`txdb`	A TxDb object.
`feature.type`	The feature type for which the mapping must be retrieved.

Details

Transcripts, exons and CDS in a TxDb object are stored in seperate tables where the primary key is an integer called list("feature internal id") . This id is stored in the "tx_id" column for transcripts, in the "exon_id" column for exons, and in the "cds_id" column for CDS. Unlike other commonly used ids like Entrez Gene IDs or Ensembl IDs, this internal id was generated at the time the TxDb object was created and has no meaning outside the scope of this object.

The id2name function can be used to translate this internal id into a more informative id or name called list("feature external ", " name") . This name is stored in the "tx_name" column for transcripts, in the "exon_name" column for exons, and in the "cds_name" column for CDS.

Note that, unlike the feature internal id, the feature external name is not guaranteed to be unique or even defined (the column can contain NA s).

Value

A named character vector where the names are the internal ids and the values the external names.

Author

Hervé Pagès

Examples

txdb1_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb1 <- loadDb(txdb1_file)
id2name(txdb1, feature.type="tx")[1:4]
id2name(txdb1, feature.type="exon")[1:4]
id2name(txdb1, feature.type="cds")[1:4]

txdb2_file <- system.file("extdata", "Biomart_Ensembl_sample.sqlite",
package="GenomicFeatures")
txdb2 <- loadDb(txdb2_file)
id2name(txdb2, feature.type="tx")[1:4]
id2name(txdb2, feature.type="exon")[1:4]
id2name(txdb2, feature.type="cds")[1:4]

makeFeatureDbFromUCSC()

Making a FeatureDb object from annotations available at the UCSC Genome Browser

Description

The makeFeatureDbFromUCSC function allows the user to make a FeatureDb object from simple annotation tracks at UCSC. The tracks in question must (at a minimum) have a start, end and a chromosome affiliation in order to be made into a FeatureDb . This function requires a precise declaration of its first three arguments to indicate which genome, track and table wish to be imported. There are discovery functions provided to make this process go smoothly.

Usage

supportedUCSCFeatureDbTracks(genome)
supportedUCSCFeatureDbTables(genome, track)
 
UCSCFeatureDbTableSchema(genome,
                         track,
                         tablename)
makeFeatureDbFromUCSC(
        genome,
        track,
        tablename,
        columns = UCSCFeatureDbTableSchema(genome,track,tablename),
        url="http://genome.ucsc.edu/cgi-bin/",
        goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
        chromCol, 
        chromStartCol, 
        chromEndCol,
        taxonomyId=NA)

Arguments

Argument	Description
`genome`	genome abbreviation used by UCSC and obtained by `ucscGenomes` . For example: `"hg18"` .
`track`	name of the UCSC track. Use `supportedUCSCFeatureDbTracks` to get the list of available tracks for a particular genome
`tablename`	name of the UCSC table containing the annotations to retrieve. Use the `supportedUCSCFeatureDbTables` utility function to get the list of supported tables for a track.
`columns`	a named character vector to list out the names and types of the other columns that the downloaded track should have. Use `UCSCFeatureDbTableSchema` to retrieve this information for a particular table.
`url,goldenPath_url`	use to specify the location of an alternate UCSC Genome Browser.
`chromCol`	If the schema comes back and the 'chrom' column has been labeled something other than 'chrom', use this argument to indicate what that column has been labeled as so we can properly designate it. This could happen (for example) with the knownGene track tables, which has no 'chromStart' or 'chromEnd' columns, but which DOES have columns that could reasonably substitute for these columns under particular circumstances. Therefore we allow these three columns to have arguments so that their definition can be re-specified
`chromStartCol`	Same thing as chromCol, but for renames of 'chromStart'
`chromEndCol`	Same thing as chromCol, but for renames of 'chromEnd'
`taxonomyId`	By default this value is NA and the organism inferred will be used to look up the correct value for this. But you can use this argument to override that and supply your own valid taxId here.

Details

makeFeatureDbFromUCSC list("is a convenience function that builds a ", " tiny database from one of the UCSC track tables.") supportedUCSCFeatureDbTracks list("a convenience function that ", " returns potential track names that could be used to make ", " FeatureDb objects") supportedUCSCFeatureDbTables list("a convenience function that ", " returns potential table names for FeatureDb objects (table names ", " go with a track name)") UCSCFeatureDbTableSchema list("A convenience function that creates a ", " named vector of types for all the fields that can potentially be ", " supported for a given track. By default, this will be called on ", " your specified tablename to include all of the fields in a track.")

% code{link{makeFeatureDb}} % function. % See code{?link{makeFeatureDbFromBiomart}} for a similar function % that feeds data from a BioMart database.

Value

A FeatureDb object for makeFeatureDbFromUCSC . Or in the case of supportedUCSCFeatureDbTracks and UCSCFeatureDbTableSchema a named character vector

Author

M. Carlson

Examples

## Display the list of genomes available at UCSC:
library(GenomicFeatures)
library(rtracklayer)
ucscGenomes()[ , "db"]

## Display the list of Tracks supported by makeFeatureDbFromUCSC():
# supportedUCSCFeatureDbTracks("mm10")

## Display the list of tables supported by your track:
supportedUCSCFeatureDbTables(genome="mm10",
track="qPCR Primers")

## Display fields that could be passed in to colnames:
UCSCFeatureDbTableSchema(genome="mm10",
track="qPCR Primers",
tablename="qPcrPrimers")

## Retrieving a full transcript dataset for Mouse from UCSC:
fdb <- makeFeatureDbFromUCSC(genome="mm10",
track="qPCR Primers",
tablename="qPcrPrimers")
fdb

makeTxDb()

Making a TxDb object from user supplied annotations

Description

makeTxDb is a low-level constructor for making a TxDb object from user supplied transcript annotations.

Note that the end-user will rarely need to use makeTxDb directly but will typically use one of the high-level constructors makeTxDbFromUCSC , makeTxDbFromEnsembl , or makeTxDbFromGFF .

Usage

makeTxDb(transcripts, splicings, genes=NULL,
         chrominfo=NULL, metadata=NULL,
         reassign.ids=FALSE, on.foreign.transcripts=c("error", "drop"))

Arguments

Argument	Description
`transcripts`	Data frame containing the genomic locations of a set of transcripts.
`splicings`	Data frame containing the exon and CDS locations of a set of transcripts.
`genes`	Data frame containing the genes associated to a set of transcripts.
`chrominfo`	Data frame containing information about the chromosomes hosting the set of transcripts.
`metadata`	2-column data frame containing meta information about this set of transcripts like organism, genome, UCSC table, etc... The names of the columns must be `"name"` and `"value"` and their type must be character.
`reassign.ids`	`TRUE` or `FALSE` . Controls how internal ids should be assigned for each type of feature i.e. for transcripts, exons, and CDS. For each type, if `reassign.ids` is `FALSE` (the default) and if the ids are supplied, then they are used as the internal ids, otherwise the internal ids are assigned in a way that is compatible with the order defined by ordering the features first by chromosome, then by strand, then by start, and finally by end.
`on.foreign.transcripts`	Controls what to do when the input contains foreign transcripts i.e. transcripts that are on sequences not in `chrominfo` . If set to `"error"` (the default)

Details

The transcripts (required), splicings (required) and genes (optional) arguments must be data frames that describe a set of transcripts and the genomic features related to them (exons, CDS and genes at the moment). The chrominfo (optional) argument must be a data frame containing chromosome information like the length of each chromosome.

transcripts must have 1 row per transcript and the following columns:

tx_id : Transcript ID. Integer vector. No NAs. No duplicates.
tx_chrom : Transcript chromosome. Character vector (or factor) with no NAs.
tx_strand : Transcript strand. Character vector (or factor) with no NAs where each element is either "+" or "-" .
tx_start , tx_end : Transcript start and end. Integer vectors with no NAs.
tx_name : [optional] Transcript name. Character vector (or factor). NAs and/or duplicates are ok.
tx_type : [optional] Transcript type (e.g. mRNA, ncRNA, snoRNA, etc...). Character vector (or factor). NAs and/or duplicates are ok.
gene_id : [optional] Associated gene. Character vector (or factor). NAs and/or duplicates are ok.
Other columns, if any, are ignored (with a warning).

splicings must have N rows per transcript, where N is the nb of exons in the transcript. Each row describes an exon plus, optionally, the CDS contained in this exon. Its columns must be:
tx_id : Foreign key that links each row in the splicings data frame to a unique row in the transcripts data frame. Note that more than 1 row in splicings can be linked to the same row in transcripts (many-to-one relationship). Same type as transcripts$tx_id (integer vector). No NAs. All the values in this column must be present in transcripts$tx_id .
exon_rank : The rank of the exon in the transcript. Integer vector with no NAs. ( tx_id , exon_rank ) pairs must be unique.
exon_id : [optional] Exon ID. Integer vector with no NAs.
exon_name : [optional] Exon name. Character vector (or factor). NAs and/or duplicates are ok.
exon_chrom : [optional] Exon chromosome. Character vector (or factor) with no NAs. If missing then transcripts$tx_chrom is used. If present then exon_strand must also be present.
exon_strand : [optional] Exon strand. Character vector (or factor) with no NAs. If missing then transcripts$tx_strand is used and exon_chrom must also be missing.
exon_start , exon_end : Exon start and end. Integer vectors with no NAs.
cds_id : [optional] CDS ID. Integer vector. If present then cds_start and cds_end must also be present. NAs are allowed and must match those in cds_start and cds_end .
cds_name : [optional] CDS name. Character vector (or factor). If present then cds_start and cds_end must also be present. NAs and/or duplicates are ok. Must contain NAs at least where cds_start and cds_end contain them.
cds_start , cds_end : [optional] CDS start and end. Integer vectors. If one of the 2 columns is missing then all cds_* columns must be missing. NAs are allowed and must occur at the same positions in cds_start and cds_end .
cds_phase : [optional] CDS phase. Integer vector. If present then cds_start and cds_end must also be present. NAs are allowed and must match those in cds_start and cds_end .
Other columns, if any, are ignored (with a warning).

genes should not be supplied if transcripts has a gene_id column. If supplied, it must have N rows per transcript, where N is the nb of genes linked to the transcript (N will be 1 most of the time). Its columns must be:
tx_id : [optional] genes must have either a tx_id or a tx_name column but not both. Like splicings$tx_id , this is a foreign key that links each row in the genes data frame to a unique row in the transcripts data frame.
tx_name : [optional] Can be used as an alternative to the genes$tx_id foreign key.
gene_id : Gene ID. Character vector (or factor). No NAs.
Other columns, if any, are ignored (with a warning).

chrominfo must have 1 row per chromosome and the following columns:
chrom : Chromosome name. Character vector (or factor) with no NAs and no duplicates.
length : Chromosome length. Integer vector with either all NAs or no NAs.
is_circular : [optional] Chromosome circularity flag. Logical vector. NAs are ok.
Other columns, if any, are ignored (with a warning).

Value

A TxDb object.

makeTxDbFromUCSC , makeTxDbFromBiomart , and makeTxDbFromEnsembl , for making a TxDb object from online resources.
makeTxDbFromGRanges and makeTxDbFromGFF for making a TxDb object from a GRanges object, or from a GFF or GTF file.
The TxDb class.
saveDb and loadDb in the AnnotationDbi package for saving and loading a TxDb object as an SQLite file.

Author

Hervé Pagès

Examples

transcripts <- data.frame(
tx_id=1:3,
tx_chrom="chr1",
tx_strand=c("-", "+", "+"),
tx_start=c(1, 2001, 2001),
tx_end=c(999, 2199, 2199))
splicings <-  data.frame(
tx_id=c(1L, 2L, 2L, 2L, 3L, 3L),
exon_rank=c(1, 1, 2, 3, 1, 2),
exon_start=c(1, 2001, 2101, 2131, 2001, 2131),
exon_end=c(999, 2085, 2144, 2199, 2085, 2199),
cds_start=c(1, 2022, 2101, 2131, NA, NA),
cds_end=c(999, 2085, 2144, 2193, NA, NA),
cds_phase=c(0, 0, 2, 0, NA, NA))

txdb <- makeTxDb(transcripts, splicings)

makeTxDbFromBiomart()

Make a TxDb object from annotations available on a BioMart database

Description

The makeTxDbFromBiomart function allows the user to make a TxDb object from transcript annotations available on a BioMart database.

Usage

makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
                    dataset="hsapiens_gene_ensembl",
                    transcript_ids=NULL,
                    circ_seqs=DEFAULT_CIRC_SEQS,
                    filter=NULL,
                    id_prefix="ensembl_",
                    host="www.ensembl.org",
                    port=80,
                    taxonomyId=NA,
                    miRBaseBuild=NA)
getChromInfoFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
                        dataset="hsapiens_gene_ensembl",
                        id_prefix="ensembl_",
                        host="www.ensembl.org",
                        port=80)

Arguments

Argument	Description
`biomart`	which BioMart database to use. Get the list of all available BioMart databases with the `listMarts` function from the biomaRt package. See the details section below for a list of BioMart databases with compatible transcript annotations.
`dataset`	which dataset from BioMart. For example: `"hsapiens_gene_ensembl"` , `"mmusculus_gene_ensembl"` , `"dmelanogaster_gene_ensembl"` , `"celegans_gene_ensembl"` , `"scerevisiae_gene_ensembl"` , etc in the ensembl database. See the examples section below for how to discover which datasets are available in a given BioMart database.
`transcript_ids`	optionally, only retrieve transcript annotation data for the specified set of transcript ids. If this is used, then the meta information displayed for the resulting TxDb object will say 'Full dataset: no'. Otherwise it will say 'Full dataset: yes'.
`circ_seqs`	a character vector to list out which chromosomes should be marked as circular.
`filter`	Additional filters to use in the BioMart query. Must be a named list. An example is `filter=list(source="entrez")`
`id_prefix`	Specifies the prefix used in BioMart attributes. For example, some BioMarts may have an attribute specified as `"ensembl_transcript_id"` whereas others have the same attribute specified as `"transcript_id"` . Defaults to `"ensembl_"` .
`host`	The host URL of the BioMart. Defaults to www.ensembl.org.
`port`	The port to use in the HTTP communication with the host.
`taxonomyId`	By default this value is NA and the dataset selected will be used to look up the correct value for this. But you can use this argument to override that and supply your own taxId here (which will be independently checked to make sure its a real taxonomy id). Normally you should never need to use this.
`miRBaseBuild`	specify the string for the appropriate build Information from mirbase.db to use for microRNAs. This can be learned by calling `supportedMiRBaseBuildValues` . By default, this value will be set to `NA` , which will inactivate the `microRNAs` accessor.

Details

makeTxDbFromBiomart is a convenience function that feeds data from a BioMart database to the lower level makeTxDb function. See ? for a similar function that feeds data from the UCSC source.

Here is a list of datasets known to be compatible with makeTxDbFromBiomart (list updated on September 18, 2017):

All the datasets in the main Ensembl database. Get the list with: list(" mart <- biomaRt::useMart(biomart="ENSEMBL_MART_ENSEMBL", ", " host="www.ensembl.org") ", " biomaRt::listDatasets(mart)")
All the datasets in the Ensembl Fungi database. Get the list with: list(" mart <- biomaRt::useMart(biomart="fungi_mart", ", " host="fungi.ensembl.org") ", " biomaRt::listDatasets(mart)")
All the datasets in the Ensembl Metazoa database. Get the list with: list(" mart <- biomaRt::useMart(biomart="metazoa_mart", ", " host="metazoa.ensembl.org") ", " biomaRt::listDatasets(mart)")
All the datasets in the Ensembl Plants database. Get the list with: list(" mart <- biomaRt::useMart(biomart="plants_mart", ", " host="plants.ensembl.org") ", " biomaRt::listDatasets(mart)")
All the datasets in the Ensembl Protists database. Get the list with: list(" mart <- biomaRt::useMart(biomart="protists_mart", ", " host="protists.ensembl.org") ", " biomaRt::listDatasets(mart)")
All the datasets in the Gramene Mart. Get the list with: list(" mart <- biomaRt::useMart(biomart="ENSEMBL_MART_PLANT", ", " host="ensembl.gramene.org") ", " biomaRt::listDatasets(mart)")

Note that BioMart is not currently available for Ensembl Bacteria.

Also please note that not all these datasets have CDS information.

Value

A TxDb object for makeTxDbFromBiomart .

A data frame with 1 row per chromosome (or scaffold) and with columns chrom and length for getChromInfoFromBiomart .

makeTxDbFromUCSC and makeTxDbFromEnsembl for making a TxDb object from other online resources.
makeTxDbFromGRanges and makeTxDbFromGFF for making a TxDb object from a GRanges object, or from a GFF or GTF file.
The listMarts , useMart , listDatasets , and listFilters functions in the biomaRt package.
DEFAULT_CIRC_SEQS .
The supportedMiRBaseBuildValues function for listing all the possible values for the miRBaseBuild argument.
The TxDb class.
makeTxDb for the low-level function used by the makeTxDbFrom* functions to make the TxDb object returned to the user.

Author

M. Carlson and H. Pagès

Examples

## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------

## We can use listDatasets() from the biomaRt package to list the
## datasets available in the "ENSEMBL_MART_ENSEMBL" BioMart database:
library(biomaRt)
listMarts(host="www.ensembl.org")
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="www.ensembl.org")
datasets <- listDatasets(mart)
head(datasets)
subset(datasets, grepl("elegans", dataset, ignore.case=TRUE))

## Retrieve the full transcript dataset for Worm:
txdb1 <- makeTxDbFromBiomart(dataset="celegans_gene_ensembl")
txdb1

## Retrieve an incomplete transcript dataset for Human:
transcript_ids <- c(
"ENST00000013894",
"ENST00000268655",
"ENST00000313243",
"ENST00000435657",
"ENST00000384428",
"ENST00000478783"
)

if (interactive()) {
txdb2 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
transcript_ids=transcript_ids)
txdb2  # note that these annotations match the GRCh38 genome assembly
}

## ---------------------------------------------------------------------
## B. ACCESSING THE EnsemblGenomes MARTS
## ---------------------------------------------------------------------

library(biomaRt)

## Note that BioMart is not currently available for Ensembl Bacteria.

## ---------------------
## --- Ensembl Fungi ---

mart <- useMart(biomart="fungi_mart", host="fungi.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
yeast_txdb <- makeTxDbFromBiomart(biomart="fungi_mart",
dataset="scerevisiae_eg_gene",
host="fungi.ensembl.org")
yeast_txdb

## Note that the dataset for Yeast on Ensembl Fungi is not necessarily
## the same as on the main Ensembl database:
yeast_txdb0 <- makeTxDbFromBiomart(dataset="scerevisiae_gene_ensembl")
all(transcripts(yeast_txdb0) %in% transcripts(yeast_txdb))
all(transcripts(yeast_txdb) %in% transcripts(yeast_txdb0))

## -----------------------
## --- Ensembl Metazoa ---

## The metazoa mart is slow and at the same time it doesn't seem to
## support requests that take more than 1 min at the moment. So a call to
## biomaRt::getBM() will fail with a "Timeout was reached" error if the
## requested data takes more than 1 min to download. This unfortunately
## happens with the example below so we don't try to run it for now.

mart <- useMart(biomart="metazoa_mart", host="metazoa.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
worm_txdb <- makeTxDbFromBiomart(biomart="metazoa_mart",
dataset="celegans_eg_gene",
host="metazoa.ensembl.org")
worm_txdb

## Note that even if the dataset for Worm on Ensembl Metazoa contains
## the same transcript as on the main Ensembl database, the transcript
## type might be annotated with slightly different terms (e.g. antisense
## vs antisense_RNA):
filter <- list(tx_name="Y71G12B.44")
transcripts(worm_txdb, filter=filter, columns=c("tx_name", "tx_type"))
transcripts(txdb1, filter=filter, columns=c("tx_name", "tx_type"))
## ----------------------
## --- Ensembl Plants ---

## Like the metazoa mart (see above), the plants mart is also slow and
## doesn't seem to support requests that take more than 1 min either.
## So we don't try to run the example below for now.

mart <- useMart(biomart="plants_mart", host="plants.ensembl.org")
datasets <- listDatasets(mart)
datasets[ , 1:2]
athaliana_txdb <- makeTxDbFromBiomart(biomart="plants_mart",
dataset="athaliana_eg_gene",
host="plants.ensembl.org")
athaliana_txdb
## ------------------------
## --- Ensembl Protists ---

mart <- useMart(biomart="protists_mart", host="protists.ensembl.org")
datasets <- listDatasets(mart)
datasets$dataset
tgondii_txdb <- makeTxDbFromBiomart(biomart="protists_mart",
dataset="tgondii_eg_gene",
host="protists.ensembl.org")
tgondii_txdb

## ---------------------------------------------------------------------
## C. USING AN Ensembl MIRROR
## ---------------------------------------------------------------------

## You can use the 'host' argument to access the "ENSEMBL_MART_ENSEMBL"
## BioMart database at a mirror (e.g. at uswest.ensembl.org). A gotcha
## when doing this is that the name of the database on the mirror might
## be different! We can check this with listMarts() from the biomaRt
## package:
listMarts(host="useast.ensembl.org")

## Therefore in addition to setting 'host' to "uswest.ensembl.org" we
## might also need to specify the 'biomart' argument:
if (interactive()) {
txdb3 <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gene_ensembl",
transcript_ids=transcript_ids,
host="useast.ensembl.org")
txdb3
}

## ---------------------------------------------------------------------
## D. USING FILTERS
## ---------------------------------------------------------------------

## We can use listFilters() from the biomaRt package to get valid filter
## names:
mart <- useMart(biomart="ENSEMBL_MART_ENSEMBL",
dataset="hsapiens_gene_ensembl",
host="www.ensembl.org")
head(listFilters(mart))

## Retrieve transcript dataset for Ensembl gene ENSG00000011198:
my_filter <- list(ensembl_gene_id="ENSG00000011198")

if (interactive()) {
txdb4 <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl",
filter=my_filter)
txdb4
transcripts(txdb4, columns=c("tx_id", "tx_name", "gene_id"))
transcriptLengths(txdb4)
}

## ---------------------------------------------------------------------
## E. RETRIEVING CHROMOSOME INFORMATION ONLY
## ---------------------------------------------------------------------

chrominfo <- getChromInfoFromBiomart(dataset="celegans_gene_ensembl")
chrominfo

makeTxDbFromEnsembl()

Make a TxDb object from an Ensembl database

Description

The makeTxDbFromEnsembl function creates a TxDb object for a given organism by importing the genomic locations of its transcripts, exons, CDS, and genes from an Ensembl database.

Note that it uses the RMariaDB package internally so make sure that this package is installed.

Usage

makeTxDbFromEnsembl(organism="Homo sapiens",
                    release=NA,
                    circ_seqs=DEFAULT_CIRC_SEQS,
                    server="ensembldb.ensembl.org",
                    username="anonymous", password=NULL, port=0L,
                    tx_attrib=NULL)

Arguments

Argument	Description
`organism`	The scientific name (i.e. genus and species, or genus and species and subspecies) of the organism for which to import the data. Case is not sensitive. Underscores can be used instead of white spaces e.g. `"homo_sapiens"` is accepted.
`release`	The Ensembl release to query e.g. 89. If set to `NA` (the default), the current release is used.
`circ_seqs`	A character vector to list out which chromosomes should be marked as circular.
`server`	The name of the MySQL server to query. See https://www.ensembl.org/info/data/mysql.html for the list of Ensembl public MySQL servers. Make sure to use the server nearest to you. It can make a big difference!
`username`	Login username for the MySQL server.
`password`	Login password for the MySQL server.
`port`	Port of the MySQL server.
`tx_attrib`	If not `NULL` , only select transcripts with an attribute of the given code, a string, like `"gencode_basic"` .

Value

A TxDb object.

makeTxDbFromUCSC and makeTxDbFromBiomart for making a TxDb object from other online resources.
makeTxDbFromGRanges and makeTxDbFromGFF for making a TxDb object from a GRanges object, or from a GFF or GTF file.
DEFAULT_CIRC_SEQS .
The TxDb class.
makeTxDb for the low-level function used by the makeTxDbFrom* functions to make the TxDb object returned to the user.

Note

makeTxDbFromEnsembl tends to be faster and more reliable than makeTxDbFromBiomart .

Author

H. Pagès

Examples

txdb <- makeTxDbFromEnsembl("Saccharomyces cerevisiae",
server="useastdb.ensembl.org")
txdb

makeTxDbFromGFF()

Make a TxDb object from annotations available as a GFF3 or GTF file

Description

The makeTxDbFromGFF function allows the user to make a TxDb object from transcript annotations available as a GFF3 or GTF file.

Usage

makeTxDbFromGFF(file,
                format=c("auto", "gff3", "gtf"),
                dataSource=NA,
                organism=NA,
                taxonomyId=NA,
                circ_seqs=DEFAULT_CIRC_SEQS,
                chrominfo=NULL,
                miRBaseBuild=NA,
                metadata=NULL,
                dbxrefTag)

Arguments

Argument	Description
`file`	Input GFF3 or GTF file. Can be a path to a file, or an URL, or a connection object, or a GFF3File or GTFFile object.
`format`	Format of the input file. Accepted values are: `"auto"` (the default) for auto-detection of the format, `"gff3"` , or `"gtf"` . Use `"gff3"` or `"gtf"` only if auto-detection failed.
`dataSource`	A single string describing the origin of the data file. Please be as specific as possible.
`organism`	What is the Genus and species of this organism. Please use proper scientific nomenclature for example: "Homo sapiens" or "Canis familiaris" and not "human" or "my fuzzy buddy". If properly written, this information may be used by the software to help you out later.
`taxonomyId`	By default this value is NA and the organism provided will be used to look up the correct value for this. But you can use this argument to override that and supply your own taxonomy id here (which will be separately validated). Since providing a valid taxonomy id will not require us to look up one based on your organism: this is one way that you can loosen the restrictions about what is and isn't a valid value for the organism.
`circ_seqs`	A character vector to list out which chromosomes should be marked as circular.
`chrominfo`	Data frame containing information about the chromosomes. Will be passed to the internal call to `makeTxDb` . See `?` for more information. Alternatively, can be a Seqinfo object.
`miRBaseBuild`	Specify the string for the appropriate build Information from mirbase.db to use for microRNAs. This can be learned by calling `supportedMiRBaseBuildValues` . By default, this value will be set to `NA` , which will inactivate the `microRNAs` accessor.
`metadata`	A 2-column data frame containing meta information to be included in the TxDb object. See `?` for more information about the format of `metadata` .
`dbxrefTag`	If not missing, the values in the `Dbxref` attribute with the specified tag (like GeneID ) are used for the feature names.

Details

makeTxDbFromGFF is a convenience function that feeds data from the parsed file to the makeTxDbFromGRanges function.

Value

A TxDb object.

makeTxDbFromGRanges , which makeTxDbFromGFF is based on, for making a TxDb object from a GRanges object.
The import function in the rtracklayer package (also used by makeTxDbFromGFF internally).
makeTxDbFromUCSC , makeTxDbFromBiomart , and makeTxDbFromEnsembl , for making a TxDb object from online resources.
DEFAULT_CIRC_SEQS .
The supportedMiRBaseBuildValues function for listing all the possible values for the miRBaseBuild argument.
The TxDb class.
makeTxDb for the low-level function used by the makeTxDbFrom* functions to make the TxDb object returned to the user.

Author

M. Carlson and H. Pagès

Examples

## TESTING GFF3
gffFile <- system.file("extdata","GFF3_files","a.gff3",package="GenomicFeatures")
txdb <- makeTxDbFromGFF(file=gffFile,
dataSource="partial gtf file for Tomatoes for testing",
organism="Solanum lycopersicum")

## TESTING GTF, this time specifying the chrominfo
gtfFile <- system.file("extdata","GTF_files","Aedes_aegypti.partial.gtf",
package="GenomicFeatures")
chrominfo <- data.frame(chrom = c('supercont1.1','supercont1.2'),
length=c(5220442, 5300000),
is_circular=c(FALSE, FALSE))
metadata <- data.frame(name="Resource URL",
value=paste0("ftp://ftp.ensemblgenomes.org/pub/metazoa/",
"release-13/gtf/aedes_aegypti/"))
txdb2 <- makeTxDbFromGFF(file=gtfFile,
chrominfo=chrominfo,
dataSource="ensemblgenomes",
organism="Aedes aegypti",
metadata=metadata)

makeTxDbFromGRanges()

Make a TxDb object from a GRanges object

Description

The makeTxDbFromGRanges function allows the user to extract gene, transcript, exon, and CDS information from a GRanges object structured as GFF3 or GTF, and to return that information in a TxDb object.

Usage

makeTxDbFromGRanges(gr, drop.stop.codons=FALSE, metadata=NULL, taxonomyId=NA)

Arguments

Argument	Description
`gr`	A GRanges object structured as GFF3 or GTF, typically obtained with `rtracklayer::` .
`drop.stop.codons`	`TRUE` or `FALSE` . If `TRUE` , then features of type `stop_codon` are ignored. Otherwise (the default) the stop codons are considered to be part of the CDS and merged to them.
`metadata`	A 2-column data frame containing meta information to be included in the TxDb object. This data frame is just passed to `makeTxDb` , which `makeTxDbFromGRanges` calls at the end to make the TxDb object from the information extracted from `gr` . See `?` for more information about the format of `metadata` .
`taxonomyId`	By default this value is NA which will result in an NA field since there is no reliable way to infer this from a GRanges object. But you can use this argument to supply your own valid taxId here and if you do, then the Organism can be filled in as well

Value

A TxDb object.

makeTxDbFromUCSC , makeTxDbFromBiomart , and makeTxDbFromEnsembl , for making a TxDb object from online resources.
makeTxDbFromGFF for making a TxDb object from a GFF or GTF file.
The import function in the rtracklayer package.
The asGFF method for TxDb objects ( asGFF,TxDb-method ) for the reverse of makeTxDbFromGRanges , that is, for turning a TxDb object into a GRanges object structured as GFF.
The TxDb class.
makeTxDb for the low-level function used by the makeTxDbFrom* functions to make the TxDb object returned to the user.

Author

Hervé Pagès

Examples

library(rtracklayer)  # for the import() function

## ---------------------------------------------------------------------
## WITH A GRanges OBJECT STRUCTURED AS GFF3
## ---------------------------------------------------------------------
GFF3_files <- system.file("extdata", "GFF3_files",
package="GenomicFeatures")

path <- file.path(GFF3_files, "a.gff3")
gr <- import(path)
txdb <- makeTxDbFromGRanges(gr)
txdb

## Reverse operation:
gr2 <- asGFF(txdb)

## Sanity check (asGFF() does not propagate the CDS phase at the moment):
target <- as.list(txdb)
target$splicings$cds_phase <- NULL
stopifnot(identical(target, as.list(makeTxDbFromGRanges(gr2))))

## ---------------------------------------------------------------------
## WITH A GRanges OBJECT STRUCTURED AS GTF
## ---------------------------------------------------------------------
GTF_files <- system.file("extdata", "GTF_files", package="GenomicFeatures")

## test1.gtf was grabbed from http://mblab.wustl.edu/GTF22.html (5 exon
## gene with 3 translated exons):
path <- file.path(GTF_files, "test1.gtf")
gr <- import(path)
txdb <- makeTxDbFromGRanges(gr)
txdb

path <- file.path(GTF_files, "Aedes_aegypti.partial.gtf")
gr <- import(path)
txdb <- makeTxDbFromGRanges(gr)
txdb

makeTxDbFromUCSC()

Make a TxDb object from annotations available at the UCSC Genome Browser

Description

The makeTxDbFromUCSC function allows the user to make a TxDb object from transcript annotations available at the UCSC Genome Browser.

Note that it uses the RMariaDB package internally so make sure that this package is installed.

Usage

makeTxDbFromUCSC(
        genome="hg19",
        tablename="knownGene",
        transcript_ids=NULL,
        circ_seqs=DEFAULT_CIRC_SEQS,
        url="http://genome.ucsc.edu/cgi-bin/",
        goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
        taxonomyId=NA,
        miRBaseBuild=NA)
supportedUCSCtables(genome="hg19", url="http://genome.ucsc.edu/cgi-bin/")
browseUCSCtrack(genome="hg19", tablename="knownGene",
                url="http://genome.ucsc.edu/cgi-bin/")
getChromInfoFromUCSC(
        genome,
        goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath")

Arguments

Argument	Description
`genome`	genome abbreviation used by UCSC and obtained by `ucscGenomes` . For example: `"hg19"` .
`tablename`	name of the UCSC table containing the transcript annotations to retrieve. Use the `supportedUCSCtables` utility function to get the list of tables known to work with `makeTxDbFromUCSC` .
`transcript_ids`	optionally, only retrieve transcript annotation data for the specified set of transcript ids. If this is used, then the meta information displayed for the resulting TxDb object will say 'Full dataset: no'. Otherwise it will say 'Full dataset: yes'.
`circ_seqs`	a character vector to list out which chromosomes should be marked as circular.
`url,goldenPath_url`	use to specify the location of an alternate UCSC Genome Browser.
`taxonomyId`	By default this value is NA and the organism inferred will be used to look up the correct value for this. But you can use this argument to supply your own valid taxId here.
`miRBaseBuild`	specify the string for the appropriate build Information from mirbase.db to use for microRNAs. This can be learned by calling `supportedMiRBaseBuildValues` . By default, this value will be set to `NA` , which will inactivate the `microRNAs` accessor.

Details

makeTxDbFromUCSC is a convenience function that feeds data from the UCSC source to the lower level makeTxDb function. See ? for a similar function that feeds data from a BioMart database.

Value

For makeTxDbFromUCSC : A TxDb object.

For supportedUCSCtables : A data frame with 3 columns ( tablename , track , and subtrack ) and 1 row per table known to work with makeTxDbFromUCSC . IMPORTANT NOTE: In the returned data frame, the set of tables associated with a track with subtracks might contain tables that don't exist for the specified genome.

For getChromInfoFromUCSC : A data frame with 1 row per chromosome (or scaffold) and with columns chrom and length .

makeTxDbFromBiomart and makeTxDbFromEnsembl for making a TxDb object from other online resources.
makeTxDbFromGRanges and makeTxDbFromGFF for making a TxDb object from a GRanges object, or from a GFF or GTF file.
ucscGenomes in the rtracklayer package.
DEFAULT_CIRC_SEQS .
The supportedMiRBaseBuildValues function for listing all the possible values for the miRBaseBuild argument.
The TxDb class.
makeTxDb for the low-level function used by the makeTxDbFrom* functions to make the TxDb object returned to the user.

Author

M. Carlson and H. Pagès

Examples

## ---------------------------------------------------------------------
## A. BASIC USAGE
## ---------------------------------------------------------------------

## Use ucscGenomes() from the rtracklayer package to display the list of
## genomes available at UCSC:
library(rtracklayer)
ucscGenomes()[ , "db"]

## Display the list of tables known to work with makeTxDbFromUCSC():
supportedUCSCtables()

## Browse the UCSC track page for a given organism/table:
browseUCSCtrack(genome="sacCer3", tablename="ensGene")

## Retrieve a full transcript dataset for Yeast from UCSC:
txdb1 <- makeTxDbFromUCSC(genome="sacCer3", tablename="ensGene",
circ_seqs="chrM")
txdb1

## Retrieve an incomplete transcript dataset for Mouse from UCSC (only
## transcripts linked to Entrez Gene ID 22290):
transcript_ids <- c(
"uc009uzf.1",
"uc009uzg.1",
"uc009uzh.1",
"uc009uzi.1",
"uc009uzj.1"
)

txdb2 <- makeTxDbFromUCSC(genome="mm10", tablename="knownGene",
transcript_ids=transcript_ids,
circ_seqs="chrM")
txdb2

## ---------------------------------------------------------------------
## B. IMPORTANT NOTE ABOUT supportedUCSCtables()
## ---------------------------------------------------------------------

## In the data frame returned by supportedUCSCtables(), the set of
## tables associated with a track with subtracks might contain tables
## that don't exist for the specified genome:
supported_tables <- supportedUCSCtables("hg38")
browseUCSCtrack(genome="hg38", tablename="augustusHints")  # no such table

## ---------------------------------------------------------------------
## C. RETRIEVING CHROMOSOME INFORMATION ONLY
## ---------------------------------------------------------------------

chrominfo <- getChromInfoFromUCSC(genome="hg38")
chrominfo

makeTxDbPackage()

Making a TxDb package from annotations available at the UCSC Genome Browser, biomaRt or from another source.

Description

A TxDb package is an annotation package containing a TxDb object.

The makeTxDbPackageFromUCSC function allows the user to make a TxDb package from transcript annotations available at the UCSC Genome Browser.

The makeTxDbPackageFromBiomart function allows the user to do the same thing as makeTxDbPackageFromUCSC except that the annotations originate from biomaRt.

Finally, the makeTxDbPackage function allows the user to make a TxDb package directly from a TxDb object.

Usage

makeTxDbPackageFromUCSC(
    version=,
    maintainer,
    author,
    destDir=".",
    license="Artistic-2.0",
    genome="hg19",
    tablename="knownGene",
    transcript_ids=NULL,
    circ_seqs=DEFAULT_CIRC_SEQS,
    url="http://genome.ucsc.edu/cgi-bin/",
    goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
    taxonomyId=NA,
    miRBaseBuild=NA)
makeFDbPackageFromUCSC(
    version,
    maintainer,
    author,
    destDir=".",
    license="Artistic-2.0",
    genome="hg19",
    track="tRNAs",
    tablename="tRNAs",
    columns = UCSCFeatureDbTableSchema(genome, track, tablename),
    url="http://genome.ucsc.edu/cgi-bin/",
    goldenPath_url="http://hgdownload.cse.ucsc.edu/goldenPath",
    chromCol=NULL,
    chromStartCol=NULL,
    chromEndCol=NULL,
    taxonomyId=NA)
  makeTxDbPackageFromBiomart(
    version,
    maintainer,
    author,
    destDir=".",
    license="Artistic-2.0",
    biomart="ENSEMBL_MART_ENSEMBL",
    dataset="hsapiens_gene_ensembl",
    transcript_ids=NULL,
    circ_seqs=DEFAULT_CIRC_SEQS,
    filter=NULL,
    id_prefix="ensembl_",
    host="www.ensembl.org",
    port=80,
    taxonomyId=NA,
    miRBaseBuild=NA)
  makeTxDbPackage(txdb,
                  version,
          maintainer,
                  author,
              destDir=".",
                  license="Artistic-2.0",
                  pkgname=NULL,
                  provider=NULL,
                  providerVersion=NULL)
  supportedMiRBaseBuildValues()
  makePackageName(txdb)

Arguments

Argument	Description
`version`	What is the version number for this package?
`maintainer`	Who is the package maintainer? (must include email to be valid). Should be a `person` object, or something coercible to one, like a string. May be omitted if the `author` argument is a `person` containing someone with the maintainer role.
`author`	Who is the creator of this package? Should be a `person` object, or something coercible to one, like a character vector of names. The `maintainer` argument will be merged into this list.
`destDir`	A path where the package source should be assembled.
`license`	What is the license (and it's version)
`biomart`	which BioMart database to use. Get the list of all available BioMart databases with the `listMarts` function from the biomaRt package. See the details section below for a list of BioMart databases with compatible transcript annotations.
`dataset`	which dataset from BioMart. For example: `"hsapiens_gene_ensembl"` , `"mmusculus_gene_ensembl"` , `"dmelanogaster_gene_ensembl"` , `"celegans_gene_ensembl"` , `"scerevisiae_gene_ensembl"` , etc in the ensembl database. See the examples section below for how to discover which datasets are available in a given BioMart database.
`genome`	genome abbreviation used by UCSC and obtained by `ucscGenomes` . For example: `"hg18"` .
`track`	name of the UCSC track. Use `supportedUCSCFeatureDbTracks` to get the list of available tracks for a particular genome
`tablename`	name of the UCSC table containing the transcript annotations to retrieve. Use the `supportedUCSCtables` utility function to get the list of tables known to work with `makeTxDbFromUCSC` .
`transcript_ids`	optionally, only retrieve transcript annotation data for the specified set of transcript ids. If this is used, then the meta information displayed for the resulting TxDb object will say 'Full dataset: no'. Otherwise it will say 'Full dataset: yes'.
`circ_seqs`	a character vector to list out which chromosomes should be marked as circular.
`filter`	Additional filters to use in the BioMart query. Must be a named list. An example is `filter=as.list(c(source="entrez"))`
`host`	The host URL of the BioMart. Defaults to www.ensembl.org.
`port`	The port to use in the HTTP communication with the host.
`id_prefix`	Specifies the prefix used in BioMart attributes. For example, some BioMarts may have an attribute specified as `"ensembl_transcript_id"` whereas others have the same attribute specified as `"transcript_id"` . Defaults to `"ensembl_"` .
`columns`	a named character vector to list out the names and types of the other columns that the downloaded track should have. Use `UCSCFeatureDbTableSchema` to retrieve this information for a particular table.
`url,goldenPath_url`	use to specify the location of an alternate UCSC Genome Browser.
`chromCol`	If the schema comes back and the 'chrom' column has been labeled something other than 'chrom', use this argument to indicate what that column has been labeled as so we can properly designate it. This could happen (for example) with the knownGene track tables, which has no 'chromStart' or 'chromEnd' columns, but which DOES have columns that could reasonably substitute for these columns under particular circumstances. Therefore we allow these three columns to have arguments so that their definition can be re-specified
`chromStartCol`	Same thing as chromCol, but for renames of 'chromStart'
`chromEndCol`	Same thing as chromCol, but for renames of 'chromEnd'
`txdb`	A TxDb object that represents a handle to a transcript database. This object type is what is returned by `makeTxDbFromUCSC` , `makeTxDbFromUCSC` or `makeTxDb`
`taxonomyId`	By default this value is NA and the organism provided (or inferred) will be used to look up the correct value for this. But you can use this argument to override that and supply your own valid taxId here
`miRBaseBuild`	specify the string for the appropriate build Information from mirbase.db to use for microRNAs. This can be learned by calling `supportedMiRBaseBuildValues` . By default, this value will be set to `NA` , which will inactivate the `microRNAs` accessor.
`pkgname`	By default this value is NULL and does not need to be filled in (a package name will be generated for you). But if you override this value, then the package and it's object will be instead named after this value. Be aware that the standard rules for package names will apply, (so don't include spaces, underscores or dashes)
`provider`	If not given, a default is taken from the 'Data source' field of the metadata table.
`providerVersion`	If not given, a default is taken from one of 'UCSC table', 'BioMart version' or 'Data source' fields of the metadata table.

Details

makeTxDbPackageFromUCSC is a convenience function that calls both the makeTxDbFromUCSC and the makeTxDbPackage functions. The makeTxDbPackageFromBiomart follows a similar pattern and calls the makeTxDbFromBiomart and makeTxDbPackage functions. supportedMiRBaseBuildValues is a convenience function that will list all the possible values for the miRBaseBuild argument. makePackageName creates a package name from a TxDb object. This function is also used by OrganismDbi.

Value

A TxDb object.

Author

M. Carlson

Examples

## First consider relevant helper/discovery functions:
## Get the list of tables known to work with makeTxDbPackageFromUCSC():
supportedUCSCtables(genome="hg19")

## Can also list all the possible values for the miRBaseBuild argument:
supportedMiRBaseBuildValues()

## Next are examples of actually building a package:
## Makes a transcript package for Yeast from the ensGene table at UCSC:
makeTxDbPackageFromUCSC(version="0.01",
maintainer="Some One <so@someplace.org>",
author="Some One <so@someplace.com>",
genome="sacCer2",
tablename="ensGene")

## Makes a transcript package from Human by using biomaRt and limited to a
## small subset of the transcripts.
transcript_ids <- c(
"ENST00000400839",
"ENST00000400840",
"ENST00000478783",
"ENST00000435657",
"ENST00000268655",
"ENST00000313243",
"ENST00000341724")

makeTxDbPackageFromBiomart(version="0.01",
maintainer="Some One <so@someplace.org>",
author="Some One <so@someplace.com>",
transcript_ids=transcript_ids)

mapIdsToRanges()

Map IDs to Genomic Ranges

Description

Map IDs to Genomic Ranges

Usage

mapIdsToRanges(x, ...)
list(list("mapIdsToRanges"), list("TxDb"))(x, keys, type = c("cds", "exon", "tx",
  "gene"), columns = NULL)

Arguments

Argument	Description
`x`	Database to use for mapping
`keys`	Values to lookup, passed to `transcripts` et. al.
`type`	Types of feature to return
`columns`	Additional metadata columns to include in the output
`...`	Additional arguments passed to methods

Value

GRangesList corresponding to the keys

Examples

fl <- system.file(package = "GenomicFeatures", "extdata", "sample_ranges.rds")
txdb <- makeTxDbFromGRanges(readRDS(fl))

keys <- list(tx_name = c("ENST00000371582", "ENST00000371588",
"ENST00000494752", "ENST00000614008", "ENST00000496771"))
mapIdsToRanges(txdb, keys = keys, type = "tx")

mapRangesToIds()

Map Genomic Ranges to IDs

Description

Map Genomic Ranges to IDs

Usage

mapRangesToIds(x, ...)
list(list("mapRangesToIds"), list("TxDb"))(x, ranges, type = c("cds", "exon", "tx",
  "gene"), columns = NULL, ...)

Arguments

Argument	Description
`x`	Database to use for mapping
`ranges`	range object used to subset
`type`	of feature to return
`columns`	additional metadata columns to include in the output.
`...`	Additional arguments passed to `findOverlaps`

Value

DataFrame of mcols from the database.

Examples

fl <- system.file(package = "GenomicFeatures", "extdata", "sample_ranges.rds")
txdb <- makeTxDbFromGRanges(readRDS(fl))

keys <- list(tx_name = c("ENST00000371582", "ENST00000371588",
"ENST00000494752", "ENST00000614008", "ENST00000496771"))
res <- mapIdsToRanges(txdb, keys = keys, type = "tx")
mapRangesToIds(txdb, res, "tx")

microRNAs()

Extract microRNA or tRNA genomic ranges from an object

Description

Generic functions to extract microRNA or tRNA genomic ranges from an object. This page documents the methods for TxDb objects only.

Usage

microRNAs(x)
list(list("microRNAs"), list("TxDb"))(x)
tRNAs(x)
list(list("tRNAs"), list("TxDb"))(x)

Arguments

Argument	Description
`x`	A TxDb object.

Value

A GRanges object.

Author

M. Carlson

Examples

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(mirbase.db)
microRNAs(TxDb.Hsapiens.UCSC.hg19.knownGene)

nearest_methods()

Finding the nearest genomic range neighbor in a TxDb

Description

The distance methods for TxDb objects and subclasses.

Usage

list(list("distance"), list("GenomicRanges,TxDb"))(x, y, ignore.strand=FALSE,
    ..., id, type=c("gene", "tx", "exon", "cds"))

Arguments

Argument	Description
`x`	The query GenomicRanges instance.
`y`	For `distance` , a TxDb instance. The `id` is used to extract ranges from the TxDb which are then used to compute the distance from `x` .
`id`	A `character` vector the same length as `x` . The `id` must be identifiers in the TxDb object. `type` indicates what type of identifier `id` is.
`type`	A `character(1)` describing the `id` . Must be one of gene , tx , exon or cds .
`ignore.strand`	A `logical` indicating if the strand of the ranges should be ignored. When `TRUE` , strand is set to `'+'` .
`...`	Additional arguments for methods.

Details

list("distance: ") list(" ", " Returns the distance for each range in ", list("x"), " to the range ", " extracted from the ", list("TxDb"), " object ", list("y"), ". Values in ", " ", list("id"), " are matched to one of ", list("gene_id"), ", ", list("tx_id"), ", ", " ", list("exon_id"), " or ", list("cds_id"), " identifiers in the ", list("TxDb"), " ", " and the corresponding ranges are extracted. The ", list("type"), " argument ", " specifies which identifier is represented in ", list("id"), ". The extracted ", " ranges are used in the distance calculation with the ranges in ", " ", list("x"), ". ", " ", " The method returns ", list("NA"), " values when the genomic region defined ", " by ", list("id"), " cannot be collapsed into a single range (e.g., ", " when a gene spans multiple chromosomes) or if the ", list("id"), " ", " is not found in ", list("y"), ". ", " ", " The behavior of ", list("distance"), " with respect to zero-width ranges ", " has changed in Bioconductor 2.12. See the man page ", list("?distance"), " ", " in IRanges for details. ", " ")

Value

For distance , an integer vector of distances between the ranges in x and y .

Author

Valerie Obenchain vobencha@fhcrc.org

Examples

## -----------------------------------------------------------
## distance()
## -----------------------------------------------------------

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
gr <- GRanges(c("chr2L", "chr2R"),
IRanges(c(100000, 200000),  width=100))
distance(gr, txdb, id=c("FBgn0259717", "FBgn0261501"), type="gene")
distance(gr, txdb, id=c("10000", "23000"), type="cds")

## The id's must be in the appropriate order with respect to 'x'.
distance(gr, txdb, id=c("4", "4097"), type="tx")

## 'id' "4" is on chr2L and "4097" is on chr2R.
transcripts(txdb, filter=list(tx_id=c("4", "4097")))

## If we reverse the 'id' the chromosomes are incompatable with gr.
distance(gr, txdb, id=c("4097", "4"), type="tx")

## distance() compares each 'x' to the corresponding 'y'.
## If an 'id' is not found in the TxDb 'y' will not
## be the same lenth as 'x' and an error is thrown.
distance(gr, txdb, id=c("FBgn0000008", "INVALID"), type="gene") ## will fail

select_methods()

Using the "select" interface on TxDb objects

Description

select , columns and keys can be used together to extract data from a TxDb object.

Details

In the code snippets below, x is a TxDb object.

list(" ", " ", list(list(), list(" ", " ", list("keytypes(x)"), ": ", " allows the user to discover which keytypes can be passed in to ", " ", list("select"), " or ", list("keys"), " and the ", list("keytype"), " argument. ", " ")), " ", " ", list(list(), list(" ", " ", list("keys(x, keytype, pattern, column, fuzzy)"), ": Return keys for ", " the database contained in the ", list("TxDb"), " object . ", " ", " The ", list("keytype"), " argument specifies the kind of keys that will ",

"      be returned. By default ", list("keys"), " will return the "GENEID" keys

", " for the database. ", " ", " If ", list("keys"), " is used with ", list("pattern"), ", it will pattern match ", " on the ", list("keytype"), ". ", " ", " But if the ", list("column"), " argument is also provided along with the ", " ", list("pattern"), " argument, then ", list("pattern"), " will be matched ", " against the values in ", list("column"), " instead. ", " ",

"      And if ", list("keys"), " is called with ", list("column"), " and no

", " ", list("pattern"), " argument, then it will return all keys that have ", " corresponding values in the ", list("column"), " argument. ", " ", " Thus, the behavior of ", list("keys"), " all depends on how many arguments are ", " specified. ", " ", " Use of the ", list("fuzzy"), " argument will toggle fuzzy matching to ", " TRUE or FALSE. If ", list("pattern"), " is not used, fuzzy is ignored. ",

"    ")), "

", " ", list(list(), list(" ", " ", list("columns(x)"), ": ", " Show which kinds of data can be returned for the ", " ", list("TxDb"), " object. ", " ")), " ", " ", list(list(), list(" ", " ", list("select(x, keys, columns, keytype)"), ": ", " When all the appropriate arguments are specified ", list("select"), " ", " will retrieve the matching data as a data.frame based on ", " parameters for selected ", list("keys"), " and ",

list("columns"), " and

", " ", list("keytype"), " arguments. ", " ")), " ", " ")

Author

Marc Carlson

Examples

txdb_file <- system.file("extdata", "Biomart_Ensembl_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(txdb_file)
txdb

## find key types
keytypes(txdb)

## list IDs that can be used to filter
head(keys(txdb, "GENEID"))
head(keys(txdb, "TXID"))
head(keys(txdb, "TXNAME"))

## list columns that can be returned by select
columns(txdb)

## call select
res <- select(txdb, head(keys(txdb, "GENEID")),
columns=c("GENEID","TXNAME"),
keytype="GENEID")
head(res)

transcriptLengths()

Extract the transcript lengths (and other metrics) from a TxDb object

Description

The transcriptLengths function extracts the transcript lengths from a TxDb object. It also returns the CDS and UTR lengths for each transcript if the user requests them.

Usage

transcriptLengths(txdb, with.cds_len=FALSE,
                  with.utr5_len=FALSE, with.utr3_len=FALSE, ...)

Arguments

Argument	Description
`txdb`	A TxDb object.
`with.cds_len, with.utr5_len, with.utr3_len`	`TRUE` or `FALSE` . Whether or not to also extract and return the CDS, 5' UTR, and 3' UTR lengths for each transcript.
`list()`	Additional arguments used by `transcripts` and other accessor functions.

Details

All the lengths are counted in number of nucleotides.

The length of a processed transcript is just the sum of the lengths of its exons. This should not be confounded with the length of the stretch of DNA transcribed into RNA (a.k.a. transcription unit), which can be obtained with width(transcripts(txdb)) .

Value

A data frame with 1 row per transcript. The rows are guaranteed to be in the same order as the elements of the GRanges object returned by transcripts . The data frame has between 5 and 8 columns, depending on what the user requested via the with.cds_len , with.utr5_len , and with.utr3_len arguments.

The first 3 columns are the same as the metadata columns of the object returned by transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))

that is:

tx_id : The internal transcript ID. This ID is unique within the scope of the TxDb object. It is not an official or public ID (like an Ensembl or FlyBase ID) or an Accession number, so it cannot be used to lookup the transcript in public data bases or in other TxDb objects. Furthermore, this ID could change when re-running the code that was used to make the TxDb object.
tx_name : An official/public transcript name or ID that can be used to lookup the transcript in public data bases or in other TxDb objects. This column is not guaranteed to contain unique values and it can contain NAs.
gene_id : The official/public ID of the gene that the transcript belongs to. Can be NA if the gene is unknown or if the transcript is not considered to belong to a gene.

The other columns are quantitative:

nexon : The number of exons in the transcript.
tx_len : The length of the processed transcript.
cds_len : [optional] The length of the CDS region of the processed transcript.
utr5_len : [optional] The length of the 5' UTR region of the processed transcript.
utr3_len : [optional] The length of the 3' UTR region of the processed transcript.

transcripts , transcriptsBy , and transcriptsByOverlaps , for extracting genomic feature locations from a TxDb -like object.
exonicParts and intronicParts for extracting non-overlapping exonic or intronic parts from a TxDb-like object.
extractTranscriptSeqs for extracting transcript (or CDS) sequences from chromosome sequences.
coverageByTranscript for computing coverage by transcript (or CDS) of a set of ranges.
makeTxDbFromUCSC , makeTxDbFromBiomart , and makeTxDbFromEnsembl , for making a TxDb object from online resources.
makeTxDbFromGRanges and makeTxDbFromGFF for making a TxDb object from a GRanges object, or from a GFF or GTF file.
The TxDb class.

Author

Hervé Pagès

Examples

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
dm3_txlens <- transcriptLengths(txdb)
head(dm3_txlens)

dm3_txlens <- transcriptLengths(txdb, with.cds_len=TRUE,
with.utr5_len=TRUE,
with.utr3_len=TRUE)
head(dm3_txlens)

## When cds_len is 0 (non-coding transcript), utr5_len and utr3_len
## must also be 0:
non_coding <- dm3_txlens[dm3_txlens$cds_len == 0, ]
stopifnot(all(non_coding[6:8] == 0))

## When cds_len is not 0 (coding transcript), cds_len + utr5_len +
## utr3_len must be equal to tx_len:
coding <- dm3_txlens[dm3_txlens$cds_len != 0, ]
stopifnot(all(rowSums(coding[6:8]) == coding[[5]]))

## A sanity check:
stopifnot(identical(dm3_txlens$tx_id, mcols(transcripts(txdb))$tx_id))

transcriptLocs2refLocs()

Converting transcript-based locations into reference-based locations

Description

transcriptLocs2refLocs converts transcript-based locations into reference-based (aka chromosome-based or genomic) locations.

transcriptWidths computes the lengths of the transcripts (called the "widths" in this context) based on the boundaries of their exons.

Usage

transcriptLocs2refLocs(tlocs,
        exonStarts=list(), exonEnds=list(), strand=character(0),
        decreasing.rank.on.minus.strand=FALSE, error.if.out.of.bounds=TRUE)
transcriptWidths(exonStarts=list(), exonEnds=list())

Arguments

Argument	Description
`tlocs`	A list of integer vectors of the same length as `exonStarts` and `exonEnds` . Each element in `tlocs` must contain transcript-based locations.
`exonStarts, exonEnds`	The starts and ends of the exons, respectively. Each argument can be a list of integer vectors, an IntegerList object, or a character vector where each element is a comma-separated list of integers. In addition, the lists represented by `exonStarts` and `exonEnds` must have the same shape i.e. have the same lengths and have elements of the same lengths. The length of `exonStarts` and `exonEnds` is the number of transcripts.
`strand`	A character vector of the same length as `exonStarts` and `exonEnds` specifying the strand ( `"+"` or `"-"` ) from which the transcript is coming.
`decreasing.rank.on.minus.strand`	`TRUE` or `FALSE` . Describes the order of exons in transcripts located on the minus strand: are they ordered by increasing (default) or decreasing rank?
`error.if.out.of.bounds`	`TRUE` or `FALSE` . Controls how out of bound `tlocs` are handled: an error is thrown (default) or `NA` is returned.

Value

For transcriptLocs2refLocs : A list of integer vectors of the same shape as tlocs .

For transcriptWidths : An integer vector with one element per transcript.

Author

Hervé Pagès

Examples

## ---------------------------------------------------------------------
## GOING FROM TRANSCRIPT-BASED TO REFERENCE-BASED LOCATIONS
## ---------------------------------------------------------------------
library(BSgenome.Hsapiens.UCSC.hg19)  # load the genome
genome <- BSgenome.Hsapiens.UCSC.hg19
txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(txdb_file)
transcripts <- exonsBy(txdb, by="tx", use.names=TRUE)
tx_seqs <- extractTranscriptSeqs(genome, transcripts)

## Get the reference-based locations of the first 4 (5' end)
## and last 4 (3' end) nucleotides in each transcript:
tlocs <- lapply(width(tx_seqs), function(w) c(1:4, (w-3):w))
tx_strand <- sapply(strand(transcripts), runValue)
## Note that, because of how we made them, 'tlocs', 'start(exbytx)',
## 'end(exbytx)' and 'tx_strand' have the same length, and, for any
## valid positional index, elements at this position are corresponding
## to each other. This is how transcriptLocs2refLocs() expects them
## to be!
rlocs <- transcriptLocs2refLocs(tlocs,
start(transcripts), end(transcripts),
tx_strand, decreasing.rank.on.minus.strand=TRUE)

## ---------------------------------------------------------------------
## EXTRACTING WORM TRANSCRIPTS ZC101.3 AND F37B1.1
## ---------------------------------------------------------------------
## Transcript ZC101.3 (is on + strand):
##   Exons starts/ends relative to transcript:
rstarts1 <- c(1, 488, 654, 996, 1365, 1712, 2163, 2453)
rends1 <- c(137, 578, 889, 1277, 1662, 1870, 2410, 2561)
##   Exons starts/ends relative to chromosome:
starts1 <- 14678410 + rstarts1
ends1 <- 14678410 + rends1

## Transcript F37B1.1 (is on - strand):
##   Exons starts/ends relative to transcript:
rstarts2 <- c(1, 325)
rends2 <- c(139, 815)
##   Exons starts/ends relative to chromosome:
starts2 <- 13611188 - rends2
ends2 <- 13611188 - rstarts2

exon_starts <- list(as.integer(starts1), as.integer(starts2))
exon_ends <- list(as.integer(ends1), as.integer(ends2))
transcripts <- IRangesList(start=exon_starts, end=exon_ends)

library(BSgenome.Celegans.UCSC.ce2)
## Both transcripts are on chrII:
chrII <- Celegans$chrII
tx_seqs <- extractTranscriptSeqs(chrII, transcripts, strand=c("+","-"))

## Same as 'width(tx_seqs)':
transcriptWidths(exonStarts=exon_starts, exonEnds=exon_ends)

transcriptLocs2refLocs(list(c(1:6, 135:140, 1555:1560),
c(1:6, 137:142, 625:630)),
exonStarts=exon_starts,
exonEnds=exon_ends,
strand=c("+","-"))

## A sanity check:
ref_locs <- transcriptLocs2refLocs(list(1:1560, 1:630),
exonStarts=exon_starts,
exonEnds=exon_ends,
strand=c("+","-"))
stopifnot(chrII[ref_locs[[1]]] == tx_seqs[[1]])
stopifnot(complement(chrII)[ref_locs[[2]]] == tx_seqs[[2]])

transcripts()

Extract genomic features from a TxDb-like object

Description

Generic functions to extract genomic features from a TxDb-like object. This page documents the methods for TxDb objects only.

Usage

transcripts(x, ...)
list(list("transcripts"), list("TxDb"))(x, columns=c("tx_id", "tx_name"), filter=NULL, use.names=FALSE)
exons(x, ...)
list(list("exons"), list("TxDb"))(x, columns="exon_id", filter=NULL, use.names=FALSE)
cds(x, ...)
list(list("cds"), list("TxDb"))(x, columns="cds_id", filter=NULL, use.names=FALSE)
genes(x, ...)
list(list("genes"), list("TxDb"))(x, columns="gene_id", filter=NULL, single.strand.genes.only=TRUE)
list(list("promoters"), list("TxDb"))(x, upstream=2000, downstream=200, use.names=TRUE, ...)

Arguments

Argument	Description
`x`	A TxDb object.
`...`	For the `transcripts` , `exons` , `cds` , and `genes` generic functions: arguments to be passed to methods. For the `promoters` method for TxDb objects: arguments to be passed to the internal call to `transcripts` .
`columns`	Columns to include in the output. Must be `NULL` or a character vector as given by the `columns` method. With the following restrictions:

"TXCHROM" and "TXSTRAND" are not allowed for transcripts .
"EXONCHROM" and "EXONSTRAND" are not allowed for exons .
"CDSCHROM" and "CDSSTRAND" are not allowed for cds . If the vector is named, those names are used for the corresponding column in the element metadata of the returned object. |filter | Either NULL or a named list of vectors to be used to restrict the output. Valid names for this list are: "gene_id" , "tx_id" , "tx_name" , "tx_chrom" , "tx_strand" , "exon_id" , "exon_name" , "exon_chrom" , "exon_strand" , "cds_id" , "cds_name" , "cds_chrom" , "cds_strand" and "exon_rank" . | |use.names | TRUE or FALSE . If TRUE , the feature names are set as the names of the returned object, with NAs being replaced with empty strings. | |single.strand.genes.only | TRUE or FALSE . If TRUE (the default), then genes that have exons located on both strands of the same chromosome or on two different chromosomes are dropped. In that case, the genes are returned in a GRanges object. Otherwise, all genes are returned in a GRangesList object with the columns specified thru the columns argument set as top level metadata columns. (Please keep in mind that the top level metadata columns of a GRangesList object are not displayed by the show method.) | |upstream | For promoters : An integer(1) value indicating the number of bases upstream from the transcription start site. For additional details see ?promoters,GRanges-method. | |`downstream` | For `promoters` : An `integer(1)` value indicating the number of bases downstream from the transcription start site. For additional details see `?`promoters,GRanges-method . |

Details

These are the main functions for extracting transcript information from a TxDb -like object. These methods can restrict the output based on categorical information. To restrict the output based on interval information, use the transcriptsByOverlaps , exonsByOverlaps , and cdsByOverlaps functions.

The promoters function computes user-defined promoter regions for the transcripts in a TxDb -like object. The return object is a GRanges of promoter regions around the transcription start site the span of which is defined by upstream and downstream . For additional details on how the promoter range is computed and the handling of + and - strands see ?promoters,GRanges-method` . ## Value A [GRanges](#granges) object. The only exception being whengenesis used withsingle.strand.genes.only=FALSE, in which case a [GRangesList](#grangeslist) object is returned. ## Seealso * [transcriptsBy](#transcriptsby) and [transcriptsByOverlaps](#transcriptsbyoverlaps) for more ways to extract genomic features from a [TxDb](#txdb) -like object. * [transcriptLengths](#transcriptlengths) for extracting the transcript lengths (and other metrics) from a [TxDb](#txdb) object. * [exonicParts](#exonicparts) and [intronicParts](#intronicparts) for extracting non-overlapping exonic or intronic parts from a TxDb-like object. * [extractTranscriptSeqs](#extracttranscriptseqs) for extracting transcript (or CDS) sequences from chromosome sequences. * [coverageByTranscript](#coveragebytranscript) for computing coverage by transcript (or CDS) of a set of ranges. * [select-methods](#select-methods) for how to use the simple "select" interface to extract information from a [TxDb](#txdb) object. * [microRNAs](#micrornas) and [tRNAs](#trnas) for extracting microRNA or tRNA genomic ranges from a [TxDb](#txdb) object. * [id2name`](#id2name) for mapping TxDb internal ids to external names for a given feature type.
* The TxDb class. ## Author M. Carlson, P. Aboyoun and H. Pagès ## Examples r txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite", package="GenomicFeatures") txdb <- loadDb(txdb_file) ## --------------------------------------------------------------------- ## transcripts() ## --------------------------------------------------------------------- tx1 <- transcripts(txdb) tx1 transcripts(txdb, use.names=TRUE) transcripts(txdb, columns=NULL, use.names=TRUE) filter <- list(tx_chrom = c("chr3", "chr5"), tx_strand = "+") tx2 <- transcripts(txdb, filter=filter) tx2 ## Sanity checks: stopifnot( identical(mcols(tx1)$tx_id, seq_along(tx1)), identical(tx2, tx1[seqnames(tx1) == "chr3" & strand(tx1) == "+"]) ) ## --------------------------------------------------------------------- ## exons() ## --------------------------------------------------------------------- exons(txdb, columns=c("EXONID", "TXNAME"), filter=list(exon_id=1)) exons(txdb, columns=c("EXONID", "TXNAME"), filter=list(tx_name="uc009vip.1")) ## --------------------------------------------------------------------- ## genes() ## --------------------------------------------------------------------- genes(txdb) # a GRanges object cols <- c("tx_id", "tx_chrom", "tx_strand", "exon_id", "exon_chrom", "exon_strand") single_strand_genes <- genes(txdb, columns=cols) ## Because we've returned single strand genes only, the "tx_chrom" ## and "exon_chrom" metadata columns are guaranteed to match ## 'seqnames(single_strand_genes)': stopifnot(identical(as.character(seqnames(single_strand_genes)), as.character(mcols(single_strand_genes)$tx_chrom))) stopifnot(identical(as.character(seqnames(single_strand_genes)), as.character(mcols(single_strand_genes)$exon_chrom))) ## and also the "tx_strand" and "exon_strand" metadata columns are ## guaranteed to match 'strand(single_strand_genes)': stopifnot(identical(as.character(strand(single_strand_genes)), as.character(mcols(single_strand_genes)$tx_strand))) stopifnot(identical(as.character(strand(single_strand_genes)), as.character(mcols(single_strand_genes)$exon_strand))) all_genes <- genes(txdb, columns=cols, single.strand.genes.only=FALSE) all_genes # a GRangesList object multiple_strand_genes <- all_genes[elementNROWS(all_genes) >= 2] multiple_strand_genes mcols(multiple_strand_genes) ## --------------------------------------------------------------------- ## promoters() ## --------------------------------------------------------------------- ## This: promoters(txdb, upstream=100, downstream=50) ## is equivalent to: promoters(transcripts(txdb, use.names=TRUE), upstream=100, downstream=50) ## Extra arguments are passed to transcripts(). So this: columns <- c("tx_name", "gene_id") promoters(txdb, upstream=100, downstream=50, columns=columns) ## is equivalent to: promoters(transcripts(txdb, columns=columns, use.names=TRUE), upstream=100, downstream=50)

transcriptsBy()

Extract and group genomic features of a given type from a TxDb-like object

Description

Generic functions to extract genomic features of a given type grouped based on another type of genomic feature. This page documents the methods for TxDb objects only.

Usage

transcriptsBy(x, by=c("gene", "exon", "cds"), ...)
list(list("transcriptsBy"), list("TxDb"))(x, by=c("gene", "exon", "cds"), use.names=FALSE)
exonsBy(x, by=c("tx", "gene"), ...)
list(list("exonsBy"), list("TxDb"))(x, by=c("tx", "gene"), use.names=FALSE)
cdsBy(x, by=c("tx", "gene"), ...)
list(list("cdsBy"), list("TxDb"))(x, by=c("tx", "gene"), use.names=FALSE)
intronsByTranscript(x, ...)
list(list("intronsByTranscript"), list("TxDb"))(x, use.names=FALSE)
fiveUTRsByTranscript(x, ...)
list(list("fiveUTRsByTranscript"), list("TxDb"))(x, use.names=FALSE)
threeUTRsByTranscript(x, ...)
list(list("threeUTRsByTranscript"), list("TxDb"))(x, use.names=FALSE)

Arguments

Argument	Description
`x`	A TxDb object.
`...`	Arguments to be passed to or from methods.
`by`	One of `"gene"` , `"exon"` , `"cds"` or `"tx"` . Determines the grouping.
`use.names`	Controls how to set the names of the returned GRangesList object. These functions return all the features of a given type (e.g. all the exons) grouped by another feature type (e.g. grouped by transcript) in a GRangesList object. By default (i.e. if `use.names` is `FALSE` ), the names of this GRangesList object (aka the group names) are the internal ids of the features used for grouping (aka the grouping features), which are guaranteed to be unique. If `use.names` is `TRUE` , then the names of the grouping features are used instead of their internal ids. For example, when grouping by transcript ( `by="tx"` ), the default group names are the transcript internal ids ( `"tx_id"` ). But, if `use.names=TRUE` , the group names are the transcript names ( `"tx_name"` ). Note that, unlike the feature ids, the feature names are not guaranteed to be unique or even defined (they could be all `NA` s). A warning is issued when this happens. See `?` for more information about feature internal ids and feature external names and how to map the formers to the latters. Finally, `use.names=TRUE` cannot be used when grouping by gene `by="gene"` . This is because, unlike for the other features, the gene ids are external ids (e.g. Entrez Gene or Ensembl ids) so the db doesn't have a `"gene_name"` column for storing alternate gene names.

Details

These functions return a GRangesList object where the ranges within each of the elements are ordered according to the following rule:

When using exonsBy or cdsBy with by = "tx" , the returned exons or CDS are ordered by ascending rank for each transcript, that is, by their position in the transcript. In all other cases, the ranges will be ordered by chromosome, strand, start, and end values.

Value

A GRangesList object.

transcripts and transcriptsByOverlaps for more ways to extract genomic features from a TxDb -like object.
transcriptLengths for extracting the transcript lengths (and other metrics) from a TxDb object.
exonicParts and intronicParts for extracting non-overlapping exonic or intronic parts from a TxDb-like object.
extractTranscriptSeqs for extracting transcript (or CDS) sequences from chromosome sequences.
coverageByTranscript for computing coverage by transcript (or CDS) of a set of ranges.
select-methods for how to use the simple "select" interface to extract information from a TxDb object.
id2name for mapping TxDb internal ids to external names for a given feature type.
The TxDb class.

Author

M. Carlson, P. Aboyoun and H. Pagès

Examples

txdb_file <- system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures")
txdb <- loadDb(txdb_file)

## Get the transcripts grouped by gene:
transcriptsBy(txdb, "gene")

## Get the exons grouped by gene:
exonsBy(txdb, "gene")

## Get the CDS grouped by transcript:
cds_by_tx0 <- cdsBy(txdb, "tx")
## With more informative group names:
cds_by_tx1 <- cdsBy(txdb, "tx", use.names=TRUE)
## Note that 'cds_by_tx1' can also be obtained with:
names(cds_by_tx0) <- id2name(txdb, feature.type="tx")[names(cds_by_tx0)]
stopifnot(identical(cds_by_tx0, cds_by_tx1))

## Get the introns grouped by transcript:
intronsByTranscript(txdb)

## Get the 5' UTRs grouped by transcript:
fiveUTRsByTranscript(txdb)
fiveUTRsByTranscript(txdb, use.names=TRUE)  # more informative group names

transcriptsByOverlaps()

Extract genomic features from a TxDb-like object based on their genomic location

Description

Generic functions to extract genomic features for specified genomic locations. This page documents the methods for TxDb objects only.

Usage

transcriptsByOverlaps(x, ranges,
                      maxgap = -1L, minoverlap = 0L,
                      type = c("any", "start", "end"), ...)
list(list("transcriptsByOverlaps"), list("TxDb"))(x, ranges,
                      maxgap = -1L, minoverlap = 0L,
                      type = c("any", "start", "end"),
                      columns = c("tx_id", "tx_name"))
exonsByOverlaps(x, ranges,
                maxgap = -1L, minoverlap = 0L,
                type = c("any", "start", "end"), ...)
list(list("exonsByOverlaps"), list("TxDb"))(x, ranges,
                maxgap = -1L, minoverlap = 0L,
                type = c("any", "start", "end"),
                columns = "exon_id")
cdsByOverlaps(x, ranges,
              maxgap = -1L, minoverlap = 0L,
              type = c("any", "start", "end"), ...)
list(list("cdsByOverlaps"), list("TxDb"))(x, ranges,
              maxgap = -1L, minoverlap = 0L,
              type = c("any", "start", "end"),
              columns = "cds_id")

Arguments

Argument	Description
`x`	A TxDb object.
`ranges`	A GRanges object to restrict the output.
`maxgap,minoverlap,type`	Used in the internal call to `findOverlaps()` to detect overlaps. See `?` in the IRanges package for a description of these arguments.
`...`	Arguments to be passed to or from methods.
`columns`	Columns to include in the output. See `?` for the possible values.

Details

These functions subset the results of transcripts , exons , and cds function calls with using the results of findOverlaps calls based on the specified ranges .

Value

a GRanges object

transcripts and transcriptsBy for more ways to extract genomic features from a TxDb -like object.
transcriptLengths for extracting the transcript lengths (and other metrics) from a TxDb object.
exonicParts and intronicParts for extracting non-overlapping exonic or intronic parts from a TxDb-like object.
extractTranscriptSeqs for extracting transcript (or CDS) sequences from chromosome sequences.
coverageByTranscript for computing coverage by transcript (or CDS) of a set of ranges.
select-methods for how to use the simple "select" interface to extract information from a TxDb object.
id2name for mapping TxDb internal ids to external names for a given feature type.
The TxDb class.

Author

P. Aboyoun

Examples

txdb <- loadDb(system.file("extdata", "hg19_knownGene_sample.sqlite",
package="GenomicFeatures"))
gr <- GRanges(Rle("chr1", 2),
IRanges(c(500,10500), c(10000,30000)),
strand = Rle("-", 2))
transcriptsByOverlaps(txdb, gr)