bioconductor v3.9.0 Ensembldb

The package provides functions to create and use

Link to this section Summary

Functions

Deprecated functionality

Connect to an EnsDb object

Integration into the AnnotationDbi framework

Basic usage of an Ensembl based annotation database

Retrieve annotation data from an Ensembl based package

Calculating lengths of features

Support for other than Ensembl seqlevel style

Functionality related to DNA/RNA sequences

Utility functions

Filters supported by ensembldb

Protein related functionality

Map positions within the CDS to coordinates relative to the start of the transcript

Convert an AnnotationFilter to a SQL WHERE condition for EnsDb

Map genomic coordinates to protein coordinates

Map genomic coordinates to transcript coordinates

Globally add filters to an EnsDb database

Determine whether protein data is available in the database

List EnsDb databases in a MariaDB/MySQL server

Generating a Ensembl annotation package from Ensembl

Map within-protein coordinates to genomic coordinates

Map protein-relative coordinates to positions within the transcript

Search annotations interactively

Map transcript-relative coordinates to positions within the CDS

Map transcript-relative coordinates to genomic coordinates

Map transcript-relative coordinates to amino acid residues of the encoded protein

Use a MariaDB/MySQL backend

Link to this section Functions

Deprecated functionality

Description

All functions, methods and classes listed on this page are deprecated and might be removed in future releases.

GeneidFilter creates a GeneIdFilter . Use GeneIdFilter from the AnnotationFilter package instead.

GenebiotypeFilter creates a GeneBiotypeFilter . Use GeneBiotypeFilter from the AnnotationFilter package instead.

EntrezidFilter creates a EntrezFilter . Use EntrezFilter from the AnnotationFilter package instead.

TxidFilter creates a TxIdFilter . Use TxIdFilter from the AnnotationFilter package instead.

TxbiotypeFilter creates a TxBiotypeFilter . Use TxBiotypeFilter from the AnnotationFilter package instead.

ExonidFilter creates a ExonIdFilter . Use ExonIdFilter from the AnnotationFilter package instead.

ExonrankFilter creates a ExonRankFilter . Use ExonRankFilter from the AnnotationFilter package instead.

SeqNameFilter creates a SeqNameFilter . Use SeqNameFilter from the AnnotationFilter package instead.

SeqstrandFilter creates a SeqStrandFilter . Use SeqStrandFilter from the AnnotationFilter instead.

SeqstartFilter creates a GeneStartFilter , TxStartFilter or ExonStartFilter depending on the value of the parameter feature . Use GeneStartFilter , TxStartFilter and ExonStartFilter instead.

SeqendFilter creates a GeneEndFilter , TxEndFilter or ExonEndFilter depending on the value of the parameter feature . Use GeneEndFilter , TxEndFilter and ExonEndFilter instead.

Usage

GeneidFilter(value, condition = "==")
GenebiotypeFilter(value, condition = "==")
EntrezidFilter(value, condition = "==")
TxidFilter(value, condition = "==")
TxbiotypeFilter(value, condition = "==")
ExonidFilter(value, condition = "==")
ExonrankFilter(value, condition = "==")
SeqnameFilter(value, condition = "==")
SeqstrandFilter(value, condition = "==")
SeqstartFilter(value, condition = ">", feature = "gene")
SeqendFilter(value, condition = "<", feature = "gene")

Arguments

ArgumentDescription
valueThe value for the filter.
conditionThe condition for the filter.
featureFor SeqstartFilter and SeqendFilter : on what type of feature should the filter be applied? Supported are "gene" , "tx" and "exon" .

Connect to an EnsDb object

Description

The EnsDb constructor function connects to the database specified with argument x and returns a corresponding EnsDb object.

Usage

EnsDb(x)

Arguments

ArgumentDescription
xEither a character specifying the SQLite database file, or a DBIConnection to e.g. a MariaDB/MySQL database.

Details

By providing the connection to a MariaDB/MySQL database, it is possible to use MariaDB/MySQL as the database backend and queries will be performed on that database. Note however that this requires the package RMariaDB to be installed. In addition, the user needs to have access to a MySQL server providing already an EnsDb database, or must have write privileges on a MySQL server, in which case the useMySQL method can be used to insert the annotations from an EnsDB package into a MySQL database.

Value

A EnsDb object.

Author

Johannes Rainer

Examples

## "Standard" way to create an EnsDb object:
library(EnsDb.Hsapiens.v86)
EnsDb.Hsapiens.v86

## Alternatively, provide the full file name of a SQLite database file
dbfile <- system.file("extdata/EnsDb.Hsapiens.v86.sqlite", package = "EnsDb.Hsapiens.v86")
edb <- EnsDb(dbfile)
edb

## Third way: connect to a MySQL database
library(RMariaDB)
dbcon <- dbConnect(MySQL(), user = my_user, pass = my_pass,
host = my_host, dbname = "ensdb_hsapiens_v86")
edb <- EnsDb(dbcon)
Link to this function

EnsDb_AnnotationDbi()

Integration into the AnnotationDbi framework

Description

Several of the methods available for AnnotationDbi objects are also implemented for EnsDb objects. This enables to extract data from EnsDb objects in a similar fashion than from objects inheriting from the base annotation package class AnnotationDbi . In addition to the standard usage, the select and mapIds for EnsDb objects support also the filter framework of the ensembdb package and thus allow to perform more fine-grained queries to retrieve data.

Usage

list(list("columns"), list("EnsDb"))(x)
list(list("keys"), list("EnsDb"))(x, keytype, filter,...)
list(list("keytypes"), list("EnsDb"))(x)
list(list("mapIds"), list("EnsDb"))(x, keys, column, keytype, ..., multiVals)
list(list("select"), list("EnsDb"))(x, keys, columns, keytype, ...)

Arguments

ArgumentDescription
columnFor mapIds : the column to search on, i.e. from which values should be retrieved.
columnsFor select : the columns from which values should be retrieved. Use the columns method to list all possible columns.
keysThe keys/ids for which data should be retrieved from the database. This can be either a character vector of keys/IDs, a single filter object extending AnnotationFilter , an combination of filters AnnotationFilterList or a formula representing a filter expression (see AnnotationFilter for more details).
keytypeFor mapIds and select : the type (column) that matches the provided keys. This argument does not have to be specified if argument keys is a filter object extending AnnotationFilter or a list of such objects. For keys : which keys should be returned from the database.
filterFor keys : either a single object extending AnnotationFilter or a list of such object to retrieve only specific keys from the database.
multiValsWhat should mapIds do when there are multiple values that could be returned? Options are: "first" (default), "list" , "filter" , "asNA" . See mapIds in the AnnotationDbi package for a detailed description.
xThe EnsDb object.
...Not used.

Value

See method description above.

Seealso

listColumns transcripts

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

## List all supported keytypes.
keytypes(edb)

## List all supported columns for the select and mapIds methods.
columns(edb)

## List /real/ database column names.
listColumns(edb)

## Retrieve all keys corresponding to transcript ids.
txids <- keys(edb, keytype = "TXID")
length(txids)
head(txids)

## Retrieve all keys corresponding to gene names of genes encoded on chromosome X
gids <- keys(edb, keytype = "GENENAME", filter = SeqNameFilter("X"))
length(gids)
head(gids)

## Get a mapping of the genes BCL2 and BCL2L11 to all of their
## transcript ids and return the result as list
maps <- mapIds(edb, keys = c("BCL2", "BCL2L11"), column = "TXID",
keytype = "GENENAME", multiVals = "list")
maps

## Perform the same query using a combination of a GeneNameFilter and a
## TxBiotypeFilter to just retrieve protein coding transcripts for these
## two genes.
mapIds(edb, keys = list(GeneNameFilter(c("BCL2", "BCL2L11")),
TxBiotypeFilter("protein_coding")), column = "TXID",
multiVals = "list")

## select:
## Retrieve all transcript and gene related information for the above example.
select(edb, keys = list(GeneNameFilter(c("BCL2", "BCL2L11")),
TxBiotypeFilter("protein_coding")),
columns = c("GENEID", "GENENAME", "TXID", "TXBIOTYPE", "TXSEQSTART",
"TXSEQEND", "SEQNAME", "SEQSTRAND"))

## Get all data for genes encoded on chromosome Y
Y <- select(edb, keys = "Y", keytype = "SEQNAME")
head(Y)
nrow(Y)

## Get selected columns for all lincRNAs encoded on chromosome Y. Here we use
## a filter expression to define what data to retrieve.
Y <- select(edb, keys = ~ seq_name == "Y" & gene_biotype == "lincRNA",
columns = c("GENEID", "GENEBIOTYPE", "TXID", "GENENAME"))
head(Y)
nrow(Y)

Basic usage of an Ensembl based annotation database

Description

The EnsDb class provides access to an Ensembl-based annotation package. This help page describes functions to get some basic informations from such an object.

Usage

list(list("dbconn"), list("EnsDb"))(x)
list(list("ensemblVersion"), list("EnsDb"))(x)
list(list("listColumns"), list("EnsDb"))(x, table, skip.keys=TRUE, ...)
list(list("listGenebiotypes"), list("EnsDb"))(x, ...)
list(list("listTxbiotypes"), list("EnsDb"))(x, ...)
list(list("listTables"), list("EnsDb"))(x, ...)
list(list("metadata"), list("EnsDb"))(x, ...)
list(list("organism"), list("EnsDb"))(object)
list(list("returnFilterColumns"), list("EnsDb"))(x)
list(list("returnFilterColumns"), list("EnsDb"))(x)
list(list("returnFilterColumns"), list("EnsDb"))(x) <- value
list(list("seqinfo"), list("EnsDb"))(x)
list(list("seqlevels"), list("EnsDb"))(x)
list(list("updateEnsDb"), list("EnsDb"))(x, ...)

Arguments

ArgumentDescription
...Additional arguments. Not used.
objectFor organism : an EnsDb instance.
skip.keysfor listColumns : whether primary and foreign keys (not being e.g. "gene_id" or alike) should be returned or not. By default these will not be returned.
tableFor listColumns : optionally specify the table name(s) for which the columns should be returned.
valueFor returnFilterColumns : a logical of length one specifying whether columns that are used for eventual filters should also be returned.
xAn EnsDb instance.

Value

list(" ", " ", list(list("For ", list("connection")), list(" ", " The SQL connection to the RSQLite database. ", " ")), " ", " ", " ", list(list("For ", list("EnsDb")), list(" ", " An ", list("EnsDb"), " instance. ", " ")), " ", " ", " ", list(list("For ", list("lengthOf")), list(" ", " A named integer vector with the length of the genes or transcripts. ", " ")), " ", " ", " ", list(list("For ", list("listColumns")), list(" ", " A character vector with the column names. ",

"    ")), "

", " ", " ", list(list("For ", list("listGenebiotypes")), list(" ", " A character vector with the biotypes of the genes in the database. ", " ")), " ", " ", " ", list(list("For ", list("listTxbiotypes")), list(" ", " A character vector with the biotypes of the transcripts in the database. ", " ")), " ", " ", " ", list(list("For ", list("listTables")), list(" ", " A list with the names corresponding to the database table names ", " and the elements being the attribute (column) names of the table. ",

"    ")), "

", " ", " ", list(list("For ", list("metadata")), list(" ", " A ", list("data.frame"), ". ", " ")), " ", " ", " ", list(list("For ", list("organism")), list(" ", " A character string. ", " ")), " ", " ", " ", list(list("For ", list("returnFilterColumns")), list(" ", " A logical of length 1. ", " ")), " ", " ", " ", list(list("For ", list("seqinfo")), list(" ", " A ", list("Seqinfo"), " class. ", " ")), " ", " ", " ",

list(list("For ", list("updateEnsDb")), list("

", " A ", list("EnsDb"), " object. ", " ")), " ", " ")

Seealso

EnsDb , makeEnsembldbPackage , exonsBy , genes , transcripts , makeEnsemblSQLiteFromTables

addFilter for globally adding filters to an EnsDb object.

Note

While a column named "tx_name" is listed by the listTables and listColumns method, no such column is present in the database. Transcript names returned by the methods are actually the transcript IDs. This virtual column was only introduced to be compliant with TxDb objects (which provide transcript names).

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)

## Display some information:
EnsDb.Hsapiens.v86

## Show the tables along with its columns
listTables(EnsDb.Hsapiens.v86)

## For what species is this database?
organism(EnsDb.Hsapiens.v86)

## What Ensembl version if the database based on?
ensemblVersion(EnsDb.Hsapiens.v86)

## Get some more information from the database
metadata(EnsDb.Hsapiens.v86)

## Get all the sequence names.
seqlevels(EnsDb.Hsapiens.v86)

## List all available gene biotypes from the database:
listGenebiotypes(EnsDb.Hsapiens.v86)

## List all available transcript biotypes:
listTxbiotypes(EnsDb.Hsapiens.v86)

## Update the EnsDb; this is in most instances not necessary at all.
updateEnsDb(EnsDb.Hsapiens.v86)

######    returnFilterColumns
returnFilterColumns(EnsDb.Hsapiens.v86)

## Get protein coding genes on chromosome X, specifying to return
## only columns gene_name as additional column.
genes(EnsDb.Hsapiens.v86, filter=list(SeqNameFilter("X"),
GeneBiotypeFilter("protein_coding")),
columns=c("gene_name"))
## By default we get also the gene_biotype column as the data was filtered
## on this column.

## This can be changed using the returnFilterColumns option
returnFilterColumns(EnsDb.Hsapiens.v86) <- FALSE
genes(EnsDb.Hsapiens.v86, filter=list(SeqNameFilter("X"),
GeneBiotypeFilter("protein_coding")),
columns=c("gene_name"))
Link to this function

EnsDb_exonsBy()

Retrieve annotation data from an Ensembl based package

Description

Retrieve gene/transcript/exons annotations stored in an Ensembl based database package generated with the makeEnsembldbPackage function. Parameter filter enables to define filters to retrieve only specific data. Alternatively, a global filter might be added to the EnsDb object using the addFilter method.

Usage

list(list("exons"), list("EnsDb"))(x, columns = listColumns(x,"exon"),
        filter = AnnotationFilterList(), order.by,
        order.type = "asc", return.type = "GRanges")
list(list("exonsBy"), list("EnsDb"))(x, by = c("tx", "gene"),
        columns = listColumns(x, "exon"), filter =
        AnnotationFilterList(), use.names = FALSE)
list(list("intronsByTranscript"), list("EnsDb"))(x, ..., use.names = FALSE)
list(list("exonsByOverlaps"), list("EnsDb"))(x, ranges, maxgap = -1L, minoverlap = 0L,
        type = c("any", "start", "end"), columns = listColumns(x, "exon"),
        filter = AnnotationFilterList())
list(list("transcripts"), list("EnsDb"))(x, columns = listColumns(x, "tx"),
        filter = AnnotationFilterList(), order.by, order.type = "asc",
        return.type = "GRanges")
list(list("transcriptsBy"), list("EnsDb"))(x, by = c("gene", "exon"),
        columns = listColumns(x, "tx"), filter = AnnotationFilterList())
list(list("transcriptsByOverlaps"), list("EnsDb"))(x, ranges, maxgap = -1L,
        minoverlap = 0L, type = c("any", "start", "end"),
        columns = listColumns(x, "tx"), filter = AnnotationFilterList())
list(list("promoters"), list("EnsDb"))(x, upstream = 2000, downstream = 200,
        use.names = TRUE, ...)
list(list("genes"), list("EnsDb"))(x, columns = c(listColumns(x, "gene"), "entrezid"),
        filter = AnnotationFilterList(), order.by, order.type = "asc",
        return.type = "GRanges")
list(list("disjointExons"), list("EnsDb"))(x, aggregateGenes = FALSE,
        includeTranscripts = TRUE, filter = AnnotationFilterList(), ...)
list(list("cdsBy"), list("EnsDb"))(x, by = c("tx", "gene"), columns = NULL,
        filter = AnnotationFilterList(), use.names = FALSE)
list(list("fiveUTRsByTranscript"), list("EnsDb"))(x, columns = NULL,
        filter = AnnotationFilterList())
list(list("threeUTRsByTranscript"), list("EnsDb"))(x, columns = NULL,
        filter = AnnotationFilterList())
list(list("toSAF"), list("GRangesList"))(x, ...)

Arguments

ArgumentDescription
...For promoters : additional arguments to be passed to the transcripts method. For intronsByTranscript : additional arguments such as filter .
aggregateGenesFor disjointExons : When FALSE (default) exon fragments that overlap multiple genes are dropped. When TRUE , all fragments are kept and the gene_id metadata column includes all gene IDs that overlap the exon fragment.
byFor exonsBy : wheter exons sould be fetched by genes or by transcripts; as in the corresponding function of the GenomicFeatures package. For transcriptsBy : whether transcripts should be fetched by genes or by exons; fetching transcripts by cds as supported by the transcriptsBy method in the GenomicFeatures package is currently not implemented. For cdsBy : whether cds should be fetched by transcript of by gene.
columnsColumns to be retrieved from the database tables. Default values for genes are all columns from the gene database table, for exons and exonsBy the column names of the exon database table table and for transcript and transcriptBy the columns of the tx data base table (see details below for more information). Note that any of the column names of the database tables can be submitted to any of the methods (use listTables or listColumns methods for a complete list of allowed column names). For cdsBy : this argument is only supported for for by="tx" .
downstreamFor method promoters : the number of nucleotides downstream of the transcription start site that should be included in the promoter region.
filterA filter describing which results to retrieve from the database. Can be a single object extending AnnotationFilter , an AnnotationFilterList object combining several such objects or a formula representing a filter expression (see examples below or AnnotationFilter for more details). Use the supportedFilters method to get an overview of supported filter classes and related fields.
includeTranscriptsFor disjointExons : When TRUE (default) a tx_name metadata column is included that lists all transcript IDs that overlap the exon fragment. Note: this is different to the disjointExons function in the GenomicFeatures package, that lists the transcript names, not IDs.
maxgapFor exonsByOverlaps and transcriptsByOverlaps : see exonsByOverlaps in GenomicFeatures for more information.
minoverlapFor exonsByOverlaps and transcriptsByOverlaps : see exonsByOverlaps in GenomicFeatures for more information.
order.byCharacter vector specifying the column(s) by which the result should be ordered. This can be either in the form of "gene_id, seq_name" or c("gene_id", "seq_name") .
order.typeIf the results should be ordered ascending ( asc , default) or descending ( desc ).
rangesFor exonsByOverlaps and transcriptsByOverlaps : a GRanges object specifying the genomic regions.
return.typeType of the returned object. Can be either "data.frame" , "DataFrame" or "GRanges" . In the latter case the return object will be a GRanges object with the GRanges specifying the chromosomal start and end coordinates of the feature (gene, transcript or exon, depending whether genes , transcripts or exons was called). All additional columns are added as metadata columns to the GRanges object.
typeFor exonsByOverlaps and transcriptsByOverlaps : see exonsByOverlaps in GenomicFeatures for more information.
upstreamFor method promoters : the number of nucleotides upstream of the transcription start site that should be included in the promoter region.
use.namesFor cdsBy and exonsBy : only for by="gene" : use the names of the genes instead of their IDs as names of the resulting GRangesList .
xFor toSAF a GRangesList object. For all other methods an EnsDb instance.

Details

A detailed description of all database tables and the associated attributes/column names is also given in the vignette of this package. An overview of the columns is given below: list(" ", " ", list(list("gene_id"), list("the Ensembl gene ID of the gene.")), " ", " ", list(list("gene_name"), list("the name of the gene (in most cases its official symbol).")), " ", " ", list(list("entrezid"), list("the NCBI Entrezgene ID of the gene. Note that this ", " column contains a ", list("list"), " of Entrezgene identifiers to ", " accommodate the potential 1:n mapping between Ensembl genes and ", " Entrezgene IDs.")), " ", " ", list(list("gene_biotype"),

list("the biotype of the gene.")), "

", " ", list(list("gene_seq_start"), list("the start coordinate of the gene on the ", " sequence (usually a chromosome).")), " ", " ", list(list("gene_seq_end"), list("the end coordinate of the gene.")), " ", " ", list(list("seq_name"), list("the name of the sequence the gene is encoded ", " (usually a chromosome).")), " ", " ", list(list("seq_strand"), list("the strand on which the gene is encoded")), " ", " ", list(list("seq_coord_system"),

list("the coordinate system of the sequence.")), "

", " ", list(list("tx_id"), list("the Ensembl transcript ID.")), " ", " ", list(list("tx_biotype"), list("the biotype of the transcript.")), " ", " ", list(list("tx_seq_start"), list("the chromosomal start coordinate of the transcript.")), " ", " ", list(list("tx_seq_end"), list("the chromosomal end coordinate of the transcript.")), " ", " ", list(list("tx_cds_seq_start"), list("the start coordinate of the coding region of ",

"      the transcript (NULL for non-coding transcripts).")), "

", " ", list(list("tx_cds_seq_end"), list("the end coordinate of the coding region.")), " ", " ", list(list("exon_id"), list("the ID of the exon. In Ensembl, each exon specified ", " by a unique chromosomal start and end position has its own ", " ID. Thus, the same exon might be part of several transcripts.")), " ", " ", list(list("exon_seq_start"), list("the chromosomal start coordinate of the exon.")), " ",

"    ", list(list("exon_seq_end"), list("the chromosomal end coordinate of the exon.")), "

", " ", list(list("exon_idx"), list("the index of the exon in the transcript model. As ", " noted above, an exon can be part of several transcripts and thus ", " its position inside these transcript might differ.")), " ", " ")

Many EnsDb databases provide also protein related annotations. See listProteinColumns for more information.

Value

For exons , transcripts and genes , a data.frame , DataFrame or a GRanges , depending on the value of the return.type parameter. The result is ordered as specified by the parameter order.by or, if not provided, by seq_name and chromosomal start coordinate, but NOT by any ordering of values in eventually submitted filter objects.

For exonsBy , transcriptsBy : a GRangesList , depending on the value of the return.type parameter. The results are ordered by the value of the by parameter.

For exonsByOverlaps and transcriptsByOverlaps : a GRanges with the exons or transcripts overlapping the specified regions.

For toSAF : a data.frame with column names "GeneID" (the group name from the GRangesList , i.e. the ID by which the GRanges are split), "Chr" (the seqnames from the GRanges ), "Start" (the start coordinate), "End" (the end coordinate) and "Strand" (the strand).

For disjointExons : a GRanges of non-overlapping exon parts.

For cdsBy : a GRangesList with GRanges per either transcript or exon specifying the start and end coordinates of the coding region of the transcript or gene.

For fiveUTRsByTranscript : a GRangesList with GRanges for each protein coding transcript representing the start and end coordinates of full or partial exons that constitute the 5' untranslated region of the transcript.

For threeUTRsByTranscript : a GRangesList with GRanges for each protein coding transcript representing the start and end coordinates of full or partial exons that constitute the 3' untranslated region of the transcript.

Seealso

supportedFilters to get an overview of supported filters. makeEnsembldbPackage , listColumns , lengthOf

addFilter for globally adding filters to an EnsDb object.

Note

Ensembl defines genes not only on standard chromosomes, but also on patched chromosomes and chromosome variants. Thus it might be advisable to restrict the queries to just those chromosomes of interest (e.g. by specifying a SeqNameFilter(c(1:22, "X", "Y")) ). In addition, also so called LRG genes (Locus Reference Genomic) are defined in Ensembl. Their gene id starts with LRG instead of ENS for Ensembl genes, thus, a filter can be applied to specifically select those genes or exclude those genes (see examples below).

Depending on the value of the global option "ucscChromosomeNames" (use getOption(ucscChromosomeNames, FALSE) to get its value or option(ucscChromosomeNames=TRUE) to change its value) the sequence/chromosome names of the returned GRanges objects or provided in the returned data.frame or DataFrame correspond to Ensembl chromosome names (if value is FALSE ) or UCSC chromosome names (if TRUE ). This ensures a better integration with the Gviz package, in which this option is set by default to TRUE .

Author

Johannes Rainer, Tim Triche

Examples

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

######   genes
##
## Get all genes encoded on chromosome Y
AllY <- genes(edb, filter = SeqNameFilter("Y"))
AllY

## Return the result as a DataFrame; also, we use a filter expression here
## to define which features to extract from the database.
AllY.granges <- genes(edb,
filter = ~ seq_name == "Y",
return.type="DataFrame")
AllY.granges

## Include all transcripts of the gene and their chromosomal
## coordinates, sort by chrom start of transcripts and return as
## GRanges.
AllY.granges.tx <- genes(edb,
filter = SeqNameFilter("Y"),
columns = c("gene_id", "seq_name",
"seq_strand", "tx_id", "tx_biotype",
"tx_seq_start", "tx_seq_end"),
order.by = "tx_seq_start")
AllY.granges.tx



######   transcripts
##
## Get all transcripts of a gene
Tx <- transcripts(edb,
filter = GeneIdFilter("ENSG00000184895"),
order.by = "tx_seq_start")
Tx

## Get all transcripts of two genes along with some information on the
## gene and transcript
Tx <- transcripts(edb,
filter = GeneIdFilter(c("ENSG00000184895",
"ENSG00000092377")),
columns = c("gene_id", "gene_seq_start", "gene_seq_end",
"gene_biotype", "tx_biotype"))
Tx

######   promoters
##
## Get the bona-fide promoters (2k up- to 200nt downstream of TSS)
promoters(edb, filter = GeneIdFilter(c("ENSG00000184895",
"ENSG00000092377")))

######   exons
##
## Get all exons of protein coding transcript for the gene ENSG00000184895
Exon <- exons(edb,
filter = ~ gene_id == "ENSG00000184895" &
tx_biotype == "protein_coding",
columns = c("gene_id", "gene_seq_start", "gene_seq_end",
"tx_biotype", "gene_biotype"))
Exon



#####    exonsBy
##
## Get all exons for transcripts encoded on chromosomes X and Y.
ETx <- exonsBy(edb, by = "tx",
filter = SeqNameFilter(c("X", "Y")))
ETx
## Get all exons for genes encoded on chromosome 1 to 22, X and Y and
## include additional annotation columns in the result
EGenes <- exonsBy(edb, by = "gene",
filter = SeqNameFilter(c("X", "Y")),
columns = c("gene_biotype", "gene_name"))
EGenes

## Note that this might also contain "LRG" genes.
length(grep(names(EGenes), pattern="LRG"))

## to fetch just Ensemblgenes, use an GeneIdFilter with value
## "ENS%" and condition "like"
eg <- exonsBy(edb, by = "gene",
filter = AnnotationFilterList(SeqNameFilter(c("X", "Y")),
GeneIdFilter("ENS", "startsWith")),
columns = c("gene_biotype", "gene_name"))
eg
length(grep(names(eg), pattern="LRG"))

#####    transcriptsBy
##
TGenes <- transcriptsBy(edb, by = "gene",
filter = SeqNameFilter(c("X", "Y")))
TGenes

## convert this to a SAF formatted data.frame that can be used by the
## featureCounts function from the Rsubreader package.
head(toSAF(TGenes))


#####   transcriptsByOverlaps
##
ir <- IRanges(start = c(2654890, 2709520, 28111770),
end = c(2654900, 2709550, 28111790))
gr <- GRanges(rep("Y", length(ir)), ir)

## Retrieve all transcripts overlapping any of the regions.
txs <- transcriptsByOverlaps(edb, gr)
txs

## Alternatively, use a GRangesFilter
grf <- GRangesFilter(gr, type = "any")
txs <- transcripts(edb, filter = grf)
txs


####    cdsBy
## Get the coding region for all transcripts on chromosome Y.
## Specifying also additional annotation columns (in addition to the default
## exon_id and exon_rank).
cds <- cdsBy(edb, by = "tx", filter = SeqNameFilter("Y"),
columns = c("tx_biotype", "gene_name"))

####    the 5' untranslated regions:
fUTRs <- fiveUTRsByTranscript(edb, filter = SeqNameFilter("Y"))

####    the 3' untranslated regions with additional column gene_name.
tUTRs <- threeUTRsByTranscript(edb, filter = SeqNameFilter("Y"),
columns = "gene_name")
Link to this function

EnsDb_lengths()

Calculating lengths of features

Description

These methods allow to calculate the lengths of features (transcripts, genes, CDS, 3' or 5' UTRs) defined in an EnsDb object or database.

Usage

list(list("lengthOf"), list("EnsDb"))(x, of="gene", filter = AnnotationFilterList())

Arguments

ArgumentDescription
filterA filter describing which results to retrieve from the database. Can be a single object extending AnnotationFilter , an AnnotationFilterList object combining several such objects or a formula representing a filter expression (see examples below or AnnotationFilter for more details).
offor lengthOf : whether the length of genes or transcripts should be retrieved from the database.
xFor lengthOf : either an EnsDb or a GRangesList object. For all other methods an EnsDb instance.

Value

For lengthOf : see method description above.

Seealso

exonsBy transcripts transcriptLengths

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

#####    lengthOf
##
## length of a specific gene.
lengthOf(edb, filter = GeneIdFilter("ENSG00000000003"))

## length of a transcript
lengthOf(edb, of = "tx", filter = TxIdFilter("ENST00000494424"))

## Average length of all protein coding genes encoded on chromosomes X
mean(lengthOf(edb, of = "gene",
filter = ~ gene_biotype == "protein_coding" &
seq_name == "X"))

## Average length of all snoRNAs
mean(lengthOf(edb, of = "gene",
filter = ~ gene_biotype == "snoRNA" &
seq_name == "X"))

##### transcriptLengths
##
## Calculate the length of transcripts encoded on chromosome Y, including
## length of the CDS, 5' and 3' UTR.
len <- transcriptLengths(edb, with.cds_len = TRUE, with.utr5_len = TRUE,
with.utr3_len = TRUE, filter = SeqNameFilter("Y"))
head(len)
Link to this function

EnsDb_seqlevels()

Support for other than Ensembl seqlevel style

Description

The methods and functions on this help page allow to integrate EnsDb objects and the annotations they provide with other Bioconductor annotation packages that base on chromosome names (seqlevels) that are different from those defined by Ensembl.

Usage

list(list("seqlevelsStyle"), list("EnsDb"))(x)
list(list("seqlevelsStyle"), list("EnsDb"))(x) <- value
list(list("supportedSeqlevelsStyles"), list("EnsDb"))(x)

Arguments

ArgumentDescription
valueFor seqlevelsStyle<- : a character string specifying the seqlevels style that should be set. Use the supportedSeqlevelsStyle to list all available and supported seqlevel styles.
xAn EnsDb instance.

Value

For seqlevelsStyle : see method description above.

For supportedSeqlevelsStyles : see method description above.

Seealso

EnsDb transcripts

Note

The mapping between different seqname styles is performed based on data provided by the GenomeInfoDb package. Note that in most instances no mapping is provided for seqnames other than for primary chromosomes. By default functions from the ensembldb package return the original seqname is in such cases. This behaviour can be changed with the ensembldb.seqnameNotFound global option. For the special keyword "ORIGINAL" (the default), the original seqnames are returned, for "MISSING" an error is thrown if a seqname can not be mapped. In all other cases, the value of the option is returned as seqname if no mapping is available (e.g. setting options(ensembldb.seqnameNotFound=NA) returns an NA if the seqname is not mappable).

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

## Get the internal, default seqlevel style.
seqlevelsStyle(edb)

## Get the seqlevels from the database.
seqlevels(edb)

## Get all supported mappings for the organism of the EnsDb.
supportedSeqlevelsStyles(edb)

## Change the seqlevels to UCSC style.
seqlevelsStyle(edb) <- "UCSC"
seqlevels(edb)

## Change the option ensembldb.seqnameNotFound to return NA in case
## the seqname can not be mapped form Ensembl to UCSC.
options(ensembldb.seqnameNotFound = NA)

seqlevels(edb)

## Restoring the original setting.
options(ensembldb.seqnameNotFound = "ORIGINAL")
Link to this function

EnsDb_sequences()

Functionality related to DNA/RNA sequences

Description

Utility functions related to RNA/DNA sequences, such as extracting RNA/DNA sequences for features defined in Ensb .

Usage

list(list("getGenomeFaFile"), list("EnsDb"))(x, pattern="dna.toplevel.fa")
list(list("getGenomeTwoBitFile"), list("EnsDb"))(x)

Arguments

ArgumentDescription
patternFor method getGenomeFaFile : the pattern to be used to identify the fasta file representing genomic DNA sequence.
xAn EnsDb instance.

Value

For getGenomeFaFile : a FaFile-class object with the genomic DNA sequence.

For getGenomeTwoBitFile : a TwoBitFile-class object with the genome sequence.

Seealso

transcripts exonsBy

Author

Johannes Rainer

Examples

## Loading an EnsDb for Ensembl version 86 (genome GRCh38):
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

## Retrieve a TwoBitFile with the gneomic DNA sequence matching the organism,
## genome release version and, if possible, the Ensembl version of the
## EnsDb object.
Dna <- getGenomeTwoBitFile(edb)
## Extract the transcript sequence for all transcripts encoded on chromosome
## Y.
##extractTranscriptSeqs(Dna, edb, filter=SeqNameFilter("Y"))

Utility functions

Description

Utility functions integrating EnsDb objects with other Bioconductor packages.

Usage

list(list("getGeneRegionTrackForGviz"), list("EnsDb"))(x,
        filter = AnnotationFilterList(), chromosome = NULL,
        start = NULL, end = NULL, featureIs = "gene_biotype")

Arguments

ArgumentDescription
chromosomeFor getGeneRegionTrackForGviz : optional chromosome name to restrict the returned entry to a specific chromosome.
endFor getGeneRegionTrackForGviz : optional chromosomal end coordinate specifying, together with start , the chromosomal region from which features should be retrieved.
featureIsFor getGeneRegionTrackForGviz : whether the gene ( "gene_biotype" ) or the transcript biotype ( "tx_biotype" ) should be returned in column "feature" .
filterA filter describing which results to retrieve from the database. Can be a single object extending AnnotationFilter , an AnnotationFilterList object combining several such objects or a formula representing a filter expression (see examples below or AnnotationFilter for more details).
startFor getGeneRegionTrackForGviz : optional chromosomal start coordinate specifying, together with end , the chromosomal region from which features should be retrieved.
xFor toSAF a GRangesList object. For all other methods an EnsDb instance.

Value

For getGeneRegionTrackForGviz : see method description above.

Seealso

transcripts

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
######   getGeneRegionTrackForGviz
##
## Get all genes encoded on chromosome Y in the specifyed region.
AllY <- getGeneRegionTrackForGviz(edb, chromosome = "Y", start = 5131959,
end = 7131959)
## We could plot this now using plotTracks(GeneRegionTrack(AllY))
Link to this function

Filter_classes()

Filters supported by ensembldb

Description

ensembldb supports most of the filters from the AnnotationFilter package to retrieve specific content from EnsDb databases. These filters can be passed to the methods such as genes() with the filter parameter or can be added as a global filter to an EnsDb object (see addFilter() for more details). Use supportedFilters() to get an overview of all filters supported by EnsDb object.

seqnames : accessor for the sequence names of the GRanges object within a GRangesFilter .

seqnames : accessor for the seqlevels of the GRanges object within a GRangesFilter .

supportedFilters returns a data.frame with the names of all filters and the corresponding field supported by the EnsDb object.

Usage

OnlyCodingTxFilter()
ProtDomIdFilter(value, condition = "==")
ProteinDomainIdFilter(value, condition = "==")
ProteinDomainSourceFilter(value, condition = "==")
UniprotDbFilter(value, condition = "==")
UniprotMappingTypeFilter(value, condition = "==")
TxSupportLevelFilter(value, condition = "==")
list(list("seqnames"), list("GRangesFilter"))(x)
list(list("seqlevels"), list("GRangesFilter"))(x)
list(list("supportedFilters"), list("EnsDb"))(object, ...)

Arguments

ArgumentDescription
valueThe value(s) for the filter. For GRangesFilter it has to be a GRanges object.
conditioncharacter(1) specifying the condition of the filter. For character -based filters (such as GeneIdFilter ) "==" , "!=" , "startsWith" and "endsWith" are supported. Allowed values for integer -based filters (such as GeneStartFilter ) are "==" , "!=" , "<" . "<=" , ">" and ">=" .
xFor seqnames , seqlevels : a GRangesFilter object.
objectFor supportedFilters : an EnsDb object.
...For supportedFilters : currently not used.

Details

ensembldb supports the following filters from the AnnotationFilter package:

  • GeneIdFilter : filter based on the Ensembl gene ID.

  • GeneNameFilter : filter based on the name of the gene as provided Ensembl. In most cases this will correspond to the official gene symbol.

  • SymbolFilter filter based on the gene names. EnsDb objects don't have a dedicated symbol column, the filtering is hence based on the gene names.

  • GeneBiotype : filter based on the biotype of genes (e.g. "protein_coding" ).

  • GeneStartFilter : filter based on the genomic start coordinate of genes.

  • GeneEndFilter : filter based on the genomic end coordinate of genes.

  • EntrezidFilter : filter based on the genes' NCBI Entrezgene ID.

  • TxIdFilter : filter based on the Ensembld transcript ID.

  • TxNameFilter : filter based on the Ensembld transcript ID; no transcript names are provided in EnsDb databases.

  • TxBiotypeFilter : filter based on the transcripts' biotype.

  • TxStartFilter : filter based on the genomic start coordinate of the transcripts.

  • TxEndFilter : filter based on the genonic end coordinates of the transcripts.

  • ExonIdFilter : filter based on Ensembl exon IDs.

  • ExonRankFilter : filter based on the index/rank of the exon within the transcrips.

  • ExonStartFilter : filter based on the genomic start coordinates of the exons.

  • ExonEndFilter : filter based on the genomic end coordinates of the exons.

  • GRangesFilter : Allows to fetch features within or overlapping specified genomic region(s)/range(s). This filter takes a GRanges object as input and, if type = "any" (the default) will restrict results to features (genes, transcripts or exons) that are partially overlapping the region. Alternatively, by specifying condition = "within" it will return features located within the range. In addition, the GRangesFilter condition = "start" , condition = "end" and condition = "equal" filtering for features with the same start or end coordinate or that are equal to the GRanges . Note that the type of feature on which the filter is applied depends on the method that is called, i.e. genes() will filter on the genomic coordinates of genes, transcripts() on those of transcripts and exons() on exon coordinates. Calls to the methods exonsBy() , cdsBy() and transcriptsBy() use the start and end coordinates of the feature type specified with argument by (i.e. "gene" , "transcript" or "exon" ) for the filtering. If the specified GRanges object defines multiple regions, all features within (or overlapping) any of these regions are returned. Chromosome names/seqnames can be provided in UCSC format (e.g. "chrX" ) or Ensembl format (e.g. "X" ); see seqlevelsStyle() for more information.

  • SeqNameFilter : filter based on chromosome names.

  • SeqStrandFilter : filter based on the chromosome strand. The strand can be specified with value = "+" , value = "-" , value = -1 or value = 1 .

  • ProteinIdFilter : filter based on Ensembl protein IDs. This filter is only supported if the EnsDb provides protein annotations; use the hasProteinData() method to check.

  • UniprotFilter : filter based on Uniprot IDs. This filter is only supported if the EnsDb provides protein annotations; use the hasProteinData() method to check.

In addition, the following filters are defined by ensembldb :

  • TxSupportLevel : allows to filter results using the provided transcript support level. Support levels for transcripts are defined by Ensembl based on the available evidences for a transcript with 1 being the highest evidence grade and 5 the lowest level. This filter is only supported on EnsDb databases with a db schema version higher 2.1.

  • UniprotDbFilter : allows to filter results based on the specified Uniprot database name(s).

  • UniprotMappingTypeFilter : allows to filter results based on the mapping method/type that was used to assign Uniprot IDs to Ensembl protein IDs.

  • ProtDomIdFilter , ProteinDomainIdFilter : allows to retrieve entries from the database matching the provided filter criteria based on their protein domain ID ( protein_domain_id ).

  • ProteinDomainSourceFilter : filter results based on the source (database/method) defining the protein domain (e.g. "pfam" ).

  • OnlyCodingTxFilter : allows to retrieve entries only for protein coding transcripts, i.e. transcripts with a CDS. This filter does not take any input arguments.

Value

For ProtDomIdFilter : A ProtDomIdFilter object.

For ProteinDomainIdFilter : A ProteinDomainIdFilter object.

For ProteinDomainSourceFilter : A ProteinDomainSourceFilter object.

For UniprotDbFilter : A UniprotDbFilter object.

For UniprotMappingTypeFilter : A UniprotMappingTypeFilter object.

For TxSupportLevel : A TxSupportLevel object.

For supportedFilters : a data.frame with the names and the corresponding field of the supported filter classes.

Seealso

supportedFilters() to list all filters supported for EnsDb objects.

listUniprotDbs() and listUniprotMappingTypes() to list all Uniprot database names respectively mapping method types from the database.

GeneIdFilter() in the AnnotationFilter package for more details on the filter objects.

genes() , transcripts() , exons() , listGenebiotypes() , listTxbiotypes() .

addFilter() and filter() for globally adding filters to an EnsDb .

Note

For users of ensembldb version < 2.0: in the GRangesFilter from the AnnotationFilter package the condition parameter was renamed to type (to be consistent with the IRanges package). In addition, condition = "overlapping" is no longer recognized. To retrieve all features overlapping the range type = "any" has to be used.

Protein annotation based filters can only be used if the EnsDb database contains protein annotations, i.e. if hasProteinData is TRUE . Also, only protein coding transcripts will have protein annotations available, thus, non-coding transcripts/genes will not be returned by the queries using protein annotation filters.

Author

Johannes Rainer

Examples

## Create a filter that could be used to retrieve all informations for
## the respective gene.
gif <- GeneIdFilter("ENSG00000012817")
gif

## Create a filter for a chromosomal end position of a gene
sef <- GeneEndFilter(10000, condition = ">")
sef

## For additional examples see the help page of "genes".


## Example for GRangesFilter:
## retrieve all genes overlapping the specified region
grf <- GRangesFilter(GRanges("11", ranges = IRanges(114129278, 114129328),
strand = "+"), type = "any")
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
genes(edb, filter = grf)

## Get also all transcripts overlapping that region.
transcripts(edb, filter = grf)

## Retrieve all transcripts for the above gene
gn <- genes(edb, filter = grf)
txs <- transcripts(edb, filter = GeneNameFilter(gn$gene_name))
## Next we simply plot their start and end coordinates.
plot(3, 3, pch=NA, xlim=c(start(gn), end(gn)), ylim=c(0, length(txs)),
yaxt="n", ylab="")
## Highlight the GRangesFilter region
rect(xleft=start(grf), xright=end(grf), ybottom=0, ytop=length(txs),
col="red", border="red")
for(i in 1:length(txs)){
current <- txs[i]
rect(xleft=start(current), xright=end(current), ybottom=i-0.975, ytop=i-0.125, border="grey")
text(start(current), y=i-0.5,pos=4, cex=0.75, labels=current$tx_id)
}
## Thus, we can see that only 4 transcripts of that gene are indeed
## overlapping the region.


## No exon is overlapping that region, thus we're not getting anything
exons(edb, filter = grf)


## Example for ExonRankFilter
## Extract all exons 1 and (if present) 2 for all genes encoded on the
## Y chromosome
exons(edb, columns = c("tx_id", "exon_idx"),
filter=list(SeqNameFilter("Y"),
ExonRankFilter(3, condition = "<")))


## Get all transcripts for the gene SKA2
transcripts(edb, filter = GeneNameFilter("SKA2"))

## Which is the same as using a SymbolFilter
transcripts(edb, filter = SymbolFilter("SKA2"))


## Create a ProteinIdFilter:
pf <- ProteinIdFilter("ENSP00000362111")
pf
## Using this filter would retrieve all database entries that are associated
## with a protein with the ID "ENSP00000362111"
if (hasProteinData(edb)) {
res <- genes(edb, filter = pf)
res
}

## UniprotFilter:
uf <- UniprotFilter("O60762")
## Get the transcripts encoding that protein:
if (hasProteinData(edb)) {
transcripts(edb, filter = uf)
## The mapping Ensembl protein ID to Uniprot ID can however be 1:n:
transcripts(edb, filter = TxIdFilter("ENST00000371588"),
columns = c("protein_id", "uniprot_id"))
}

## ProtDomIdFilter:
pdf <- ProtDomIdFilter("PF00335")
## Also here we could get all transcripts related to that protein domain
if (hasProteinData(edb)) {
transcripts(edb, filter = pdf, columns = "protein_id")
}
Link to this function

ProteinFunctionality()

Protein related functionality

Description

This help page provides information about most of the functionality related to protein annotations in ensembldb .

The proteins method retrieves protein related annotations from an EnsDb database.

The listUniprotDbs method lists all Uniprot database names in the EnsDb .

The listUniprotMappingTypes method lists all methods that were used for the mapping of Uniprot IDs to Ensembl protein IDs.

The listProteinColumns function allows to conveniently extract all database columns containing protein annotations from an EnsDb database.

Usage

list(list("proteins"), list("EnsDb"))(object, columns = listColumns(object,
  "protein"), filter = AnnotationFilterList(), order.by = "",
  order.type = "asc", return.type = "DataFrame")
list(list("listUniprotDbs"), list("EnsDb"))(object)
list(list("listUniprotMappingTypes"), list("EnsDb"))(object)
listProteinColumns(object)

Arguments

ArgumentDescription
objectThe EnsDb object.
columnsFor proteins : character vector defining the columns to be extracted from the database. Can be any column(s) listed by the listColumns method.
filterFor proteins : A filter object extending AnnotationFilter or a list of such objects to select specific entries from the database. See Filter-classes for a documentation of available filters and use supportedFilters to get the full list of supported filters.
order.byFor proteins : a character vector specifying the column(s) by which the result should be ordered.
order.typeFor proteins : if the results should be ordered ascending ( order.type = "asc" ) or descending ( order.type = "desc" )
return.typeFor proteins : character of lenght one specifying the type of the returned object. Can be either "DataFrame" , "data.frame" or "AAStringSet" .

Details

The proteins method performs the query starting from the protein tables and can hence return all annotations from the database that are related to proteins and transcripts encoding these proteins from the database. Since proteins does thus only query annotations for protein coding transcripts, the genes or transcripts methods have to be used to retrieve annotations for non-coding transcripts.

Value

The proteins method returns protein related annotations from an EnsDb object with its return.type argument allowing to define the type of the returned object. Note that if return.type = "AAStringSet" additional annotation columns are stored in a DataFrame that can be accessed with the mcols method on the returned object.

The listProteinColumns function returns a character vector with the column names containing protein annotations or throws an error if no such annotations are available.

Author

Johannes Rainer

Examples

library(ensembldb)
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
## Get all proteins from tha database for the gene ZBTB16, if protein
## annotations are available
if (hasProteinData(edb))
proteins(edb, filter = GeneNameFilter("ZBTB16"))

## List the names of all Uniprot databases from which Uniprot IDs are
## available in the EnsDb
if (hasProteinData(edb))
listUniprotDbs(edb)


## List the type of all methods that were used to map Uniprot IDs to Ensembl
## protein IDs
if (hasProteinData(edb))
listUniprotMappingTypes(edb)


## List all columns containing protein annotations
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
if (hasProteinData(edb))
listProteinColumns(edb)
Link to this function

cdsToTranscript()

Map positions within the CDS to coordinates relative to the start of the transcript

Description

Converts CDS-relative coordinates to positions within the transcript, i.e. relative to the start of the transcript and hence including its 5' UTR.

Usage

cdsToTranscript(x, db, id = "name")

Arguments

ArgumentDescription
xIRanges with the coordinates within the CDS. Coordinates are expected to be relative to the transcription start (the first nucleotide of the transcript). The Ensembl IDs of the corresponding transcripts have to be provided either as names of the IRanges , or in one of its metadata columns.
dbEnsDb object.
idcharacter(1) specifying where the transcript identifier can be found. Has to be either "name" or one of colnames(mcols(prng)) .

Value

IRanges with the same length (and order) than the input IRanges x . Each element in IRanges provides the coordinates within the transcripts CDS. The transcript-relative coordinates are provided as metadata columns. IRanges with a start coordinate of -1 is returned for transcripts that are not known in the database, non-coding transcripts or if the provided start and/or end coordinates are not within the coding region.

Seealso

Other coordinate mapping functions: genomeToProtein , genomeToTranscript , proteinToGenome , proteinToTranscript , transcriptToCds , transcriptToGenome , transcriptToProtein

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
## Defining transcript-relative coordinates for 4 transcripts of the gene
## BCL2
txcoords <- IRanges(start = c(4, 3, 143, 147), width = 1,
names = c("ENST00000398117", "ENST00000333681",
"ENST00000590515", "ENST00000589955"))

cdsToTranscript(txcoords, EnsDb.Hsapiens.v86)

## Next we map the coordinate for variants within the gene PKP2 to the
## genome. The variants is PKP2 c.1643DelG and the provided
## position is thus relative to the CDS. We have to convert the
## position first to transcript-relative coordinates.
pkp2 <- IRanges(start = 1643, width = 1, name = "ENST00000070846")

## Map the coordinates by first converting the CDS- to transcript-relative
## coordinates
transcriptToGenome(cdsToTranscript(pkp2, EnsDb.Hsapiens.v86),
EnsDb.Hsapiens.v86)
Link to this function

convertFilter()

Convert an AnnotationFilter to a SQL WHERE condition for EnsDb

Description

convertFilter converts an AnnotationFilter::AnnotationFilter or AnnotationFilter::AnnotationFilterList to an SQL where condition for an EnsDb database.

Usage

list(list("convertFilter"), list("AnnotationFilter,EnsDb"))(object, db,
  with.tables = character())
list(list("convertFilter"), list("AnnotationFilterList,EnsDb"))(object, db,
  with.tables = character())

Arguments

ArgumentDescription
objectAnnotationFilter or AnnotationFilterList objects (or objects extending these classes).
dbEnsDb object.
with.tablesoptional character vector specifying the names of the database tables that are being queried.

Value

A character(1) with the SQL where condition.

Note

This function might be used in direct SQL queries on the SQLite database underlying an EnsDb but is more thought to illustrate the use of AnnotationFilter objects in combination with SQL databases. This method is used internally to create the SQL calls to the database.

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

## Define a filter
flt <- AnnotationFilter(~ gene_name == "BCL2")

## Use the method from the AnnotationFilter package:
convertFilter(flt)

## Create a combination of filters
flt_list <- AnnotationFilter(~ gene_name %in% c("BCL2", "BCL2L11") &
tx_biotype == "protein_coding")
flt_list

convertFilter(flt_list)

## Use the filters in the context of an EnsDb database:
convertFilter(flt, edb)

convertFilter(flt_list, edb)
Link to this function

genomeToProtein()

Map genomic coordinates to protein coordinates

Description

Map positions along the genome to positions within the protein sequence if a protein is encoded at the location. The provided coordinates have to be completely within the genomic position of an exon of a protein coding transcript (see genomeToTranscript() for details). Also, the provided positions have to be within the genomic region encoding the CDS of a transcript (excluding its stop codon; soo transcriptToProtein() for details).

For genomic positions for which the mapping failed an IRanges with negative coordinates (i.e. a start position of -1) is returned.

Usage

genomeToProtein(x, db)

Arguments

ArgumentDescription
xGRanges with the genomic coordinates that should be mapped to within-protein coordinates.
dbEnsDb object.

Details

genomeToProtein combines calls to genomeToTranscript() and transcriptToProtein() .

Value

An IRangesList with each element representing the mapping of one of the GRanges in x (i.e. the length of the IRangesList is length(x) ). Each element in IRanges provides the coordinates within the protein sequence, names being the (Ensembl) IDs of the protein. The ID of the transcript encoding the protein, the ID of the exon within which the genomic coordinates are located and its rank in the transcript are provided in metadata columns "tx_id" , "exon_id" and "exon_rank" . Metadata columns "cds_ok" indicates whether the length of the CDS matches the length of the encoded protein. Coordinates for which cds_ok = FALSE should be taken with caution, as they might not be correct. Metadata columns "seq_start" , "seq_end" , "seq_name" and "seq_strand" provide the provided genomic coordinates.

For genomic coordinates that can not be mapped to within-protein sequences an IRanges with a start coordinate of -1 is returned.

Seealso

Other coordinate mapping functions: cdsToTranscript , genomeToTranscript , proteinToGenome , proteinToTranscript , transcriptToCds , transcriptToGenome , transcriptToProtein

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome x to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## In the example below we define 4 genomic regions:
## 630898: corresponds to the first nt of the CDS of ENST00000381578
## 644636: last nt of the CDS of ENST00000381578
## 644633: last nt before the stop codon in ENST00000381578
## 634829: position within an intron.
gnm <- GRanges("X", IRanges(start = c(630898, 644636, 644633, 634829),
width = c(5, 1, 1, 3)))
res <- genomeToProtein(gnm, edbx)

## The result is an IRangesList with the same length as gnm
length(res)
length(gnm)

## The first element represents the mapping for the first GRanges:
## the coordinate is mapped to the first amino acid of the protein(s).
## The genomic coordinates can be mapped to several transcripts (and hence
## proteins).
res[[1]]

## The stop codon is not translated, thus the mapping for the second
## GRanges fails
res[[2]]

## The 3rd GRanges is mapped to the last amino acid.
res[[3]]

## Mapping of intronic positions fail
res[[4]]
Link to this function

genomeToTranscript()

Map genomic coordinates to transcript coordinates

Description

genomeToTranscript maps genomic coordinates to positions within the transcript (if at the provided genomic position a transcript is encoded). The function does only support mapping of genomic coordinates that are completely within the genomic region at which an exon is encoded. If the genomic region crosses the exon boundary an empty IRanges is returned. See examples for details.

Usage

genomeToTranscript(x, db)

Arguments

ArgumentDescription
xGRanges object with the genomic coordinates that should be mapped.
dbEnsDb object.

Details

The function first retrieves all exons overlapping the provided genomic coordinates and identifies then exons that are fully containing the coordinates in x . The transcript-relative coordinates are calculated based on the relative position of the provided genomic coordinates in this exon.

Value

An IRangesList with length equal to length(x) . Each element providing the mapping(s) to position within any encoded transcripts at the respective genomic location as an IRanges object. An IRanges with negative start coordinates is returned, if the provided genomic coordinates are not completely within the genomic coordinates of an exon.

The ID of the exon and its rank (index of the exon in the transcript) are provided in the result's IRanges metadata columns as well as the genomic position of x .

Seealso

Other coordinate mapping functions: cdsToTranscript , genomeToProtein , proteinToGenome , proteinToTranscript , transcriptToCds , transcriptToGenome , transcriptToProtein

Note

The function throws a warning and returns an empty IRanges object if the genomic coordinates can not be mapped to a transcript.

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)

## Subsetting the EnsDb object to chromosome X only to speed up execution
## time of examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Define a genomic region and calculate within-transcript coordinates
gnm <- GRanges("X:107716399-107716401")

res <- genomeToTranscript(gnm, edbx)
## Result is an IRanges object with the start and end coordinates within
## each transcript that has an exon at the genomic range.
res

## An IRanges with negative coordinates is returned if at the provided
## position no exon is present. Below we use the same coordinates but
## specify that the coordinates are on the forward (+) strand
gnm <- GRanges("X:107716399-107716401:+")
genomeToTranscript(gnm, edbx)

## Next we provide multiple genomic positions.
gnm <- GRanges("X", IRanges(start = c(644635, 107716399, 107716399),
end = c(644639, 107716401, 107716401)), strand = c("*", "*", "+"))

## The result of the mapping is an IRangesList each element providing the
## within-transcript coordinates for each input region
genomeToTranscript(gnm, edbx)
Link to this function

global_filters()

Globally add filters to an EnsDb database

Description

These methods allow to set, delete or show globally defined filters on an EnsDb object.

addFilter : adds an annotation filter to the EnsDb object.

dropFilter deletes all globally set filters from the EnsDb object.

activeFilter returns the globally set filter from an EnsDb object.

filter filters an EnsDb object. filter is an alias for the addFilter function.

Usage

list(list("addFilter"), list("EnsDb"))(x, filter = AnnotationFilterList())
list(list("dropFilter"), list("EnsDb"))(x)
list(list("activeFilter"), list("EnsDb"))(x)
filter(x, filter = AnnotationFilterList())

Arguments

ArgumentDescription
xThe EnsDb object to which the filter should be added.
filterThe filter as an AnnotationFilter , AnnotationFilterList or filter expression. See

Details

Adding a filter to an EnsDb object causes this filter to be permanently active. The filter will be used for all queries to the database and is added to all additional filters passed to the methods such as genes .

Value

addFilter and filter return an EnsDb object with the specified filter added.

activeFilter returns an AnnotationFilterList object being the active global filter or NA if no filter was added.

dropFilter returns an EnsDb object with all eventually present global filters removed.

Seealso

Filter-classes for a list of all supported filters.

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

## Add a global SeqNameFilter to the database such that all subsequent
## queries will be applied on the filtered database.
edb_y <- addFilter(edb, SeqNameFilter("Y"))

## Note: using the filter function is equivalent to a call to addFilter.

## Each call returns now only features encoded on chromosome Y
gns <- genes(edb_y)

seqlevels(gns)

## Get all lincRNA gene transcripts on chromosome Y
transcripts(edb_y, filter = ~ gene_biotype == "lincRNA")

## Get the currently active global filter:
activeFilter(edb_y)

## Delete this filter again.
edb_y <- dropFilter(edb_y)

activeFilter(edb_y)
Link to this function

hasProteinData_EnsDb_method()

Determine whether protein data is available in the database

Description

Determines whether the EnsDb provides protein annotation data.

Usage

list(list("hasProteinData"), list("EnsDb"))(x)

Arguments

ArgumentDescription
xThe EnsDb object.

Value

A logical of length one, TRUE if protein annotations are available and FALSE otherwise.

Seealso

listTables

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
## Does this database/package have protein annotations?
hasProteinData(EnsDb.Hsapiens.v86)

List EnsDb databases in a MariaDB/MySQL server

Description

The listEnsDbs function lists EnsDb databases in a MariaDB/MySQL server.

Usage

listEnsDbs(dbcon, host, port, user, pass)

Arguments

ArgumentDescription
dbconA DBIConnection object providing access to a MariaDB/MySQL database. Either dbcon or all of the other arguments have to be specified.
hostCharacter specifying the host on which the MySQL server is running.
portThe port of the MariaDB/MySQL server (usually 3306 ).
userThe username for the MariaDB/MySQL server.
passThe password for the MariaDB/MySQL server.

Details

The use of this function requires the RMariaDB package to be installed. In addition user credentials to access a MySQL server (with already installed EnsDb databases), or with write access are required. For the latter EnsDb databases can be added with the useMySQL method. EnsDb databases in a MariaDB/MySQL server follow the same naming conventions than EnsDb packages, with the exception that the name is all lower case and that each "." is replaced by "_" .

Value

A data.frame listing the database names, organism name and Ensembl version of the EnsDb databases found on the server.

Seealso

useMySQL

Author

Johannes Rainer

Examples

library(RMariaDB)
dbcon <- dbConnect(MariaDB(), host = "localhost", user = my_user, pass = my_pass)
listEnsDbs(dbcon)
Link to this function

makeEnsemblDbPackage()

Generating a Ensembl annotation package from Ensembl

Description

The functions described on this page allow to build EnsDb annotation objects/databases from Ensembl annotations. The most complete set of annotations, which include also the NCBI Entrezgene identifiers for each gene, can be retrieved by the functions using the Ensembl Perl API (i.e. functions fetchTablesFromEnsembl , makeEnsemblSQLiteFromTables ). Alternatively the functions ensDbFromAH , ensDbFromGRanges , ensDbFromGff and ensDbFromGtf can be used to build EnsDb objects using GFF or GTF files from Ensembl, which can be either manually downloaded from the Ensembl ftp server, or directly form within R using AnnotationHub . The generated SQLite database can be packaged into an R package using the makeEnsembldbPackage .

Usage

ensDbFromAH(ah, outfile, path, organism, genomeVersion, version)
ensDbFromGRanges(x, outfile, path, organism, genomeVersion,
                 version, ...)
ensDbFromGff(gff, outfile, path, organism, genomeVersion,
             version, ...)
ensDbFromGtf(gtf, outfile, path, organism, genomeVersion,
             version, ...)
fetchTablesFromEnsembl(version, ensemblapi, user="anonymous",
                       host="ensembldb.ensembl.org", pass="",
                       port=5306, species="human")
makeEnsemblSQLiteFromTables(path=".", dbname)
makeEnsembldbPackage(ensdb, version, maintainer, author,
                     destDir=".", license="Artistic-2.0")

Arguments

ArgumentDescription
ahFor ensDbFromAH : an AnnotationHub object representing a single resource (i.e. GTF file from Ensembl) from AnnotationHub .
authorThe author of the package.
dbnameThe name for the database (optional). By default a name based on the species and Ensembl version will be automatically generated (and returned by the function).
destDirWhere the package should be saved to.
ensdbThe file name of the SQLite database generated by makeEnsemblSQLiteFromTables .
ensemblapiThe path to the Ensembl perl API installed locally on the system. The Ensembl perl API version has to fit the version.
genomeVersionFor ensDbFromAH , ensDbFromGtf and ensDbFromGff : the version of the genome (e.g. "GRCh37" ). If not provided the function will try to guess it from the file name (assuming file name convention of Ensembl GTF files).
gffThe GFF file to import.
gtfThe GTF file name.
hostThe hostname to access the Ensembl database.
licenseThe license of the package.
maintainerThe maintainer of the package.
organismFor ensDbFromAH , ensDbFromGff and ensDbFromGtf : the organism name (e.g. "Homo_sapiens" ). If not provided the function will try to guess it from the file name (assuming file name convention of Ensembl GTF files).
outfileThe desired file name of the SQLite file. If not provided the name of the GTF file will be used.
passThe password for the Ensembl database.
pathThe directory in which the tables retrieved by fetchTablesFromEnsembl or the SQLite database file generated by ensDbFromGtf are stored.
portThe port to be used to connect to the Ensembl database.
speciesThe species for which the annotations should be retrieved.
userThe username for the Ensembl database.
versionFor fetchTablesFromEnsembl , ensDbFromGRanges and ensDbFromGtf : the Ensembl version for which the annotation should be retrieved (e.g. 75). The ensDbFromGtf function will try to guess the Ensembl version from the GTF file name if not provided. For makeEnsemblDbPackage : the version for the package.
xFor ensDbFromGRanges : the GRanges object.
...Currently not used.

Details

The fetchTablesFromEnsembl function internally calls the perl script get_gene_transcript_exon_tables.pl to retrieve all required information from the Ensembl database using the Ensembl perl API.

As an alternative way, a EnsDb database file can be generated by the ensDbFromGtf or ensDbFromGff from a GTF or GFF file downloaded from the Ensembl ftp server or using the ensDbFromAH to build a database directly from corresponding resources from the AnnotationHub. The returned database file name can then be used as an input to the makeEnsembldbPackage or it can be directly loaded and used by the EnsDb constructor.

Value

makeEnsemblSQLiteFromTables , ensDbFromAH , ensDbFromGRanges and ensDbFromGtf : the name of the SQLite file.

Seealso

EnsDb , genes

Note

A local installation of the Ensembl perl API is required for the fetchTablesFromEnsembl . See http://www.ensembl.org/info/docs/api/api_installation.html for installation inscructions.

A database generated from a GTF/GFF files lacks some features as they are not available in the GTF files from Ensembl. These are: NCBI Entrezgene IDs.

Author

Johannes Rainer

Examples

## get all human gene/transcript/exon annotations from Ensembl (75)
## the resulting tables will be stored by default to the current working
## directory; if the correct Ensembl api (version 75) is defined in the
## PERL5LIB environment variable, the ensemblapi parameter can also be omitted.
fetchTablesFromEnsembl(75,
ensemblapi="/home/bioinfo/ensembl/75/API/ensembl/modules",
species="human")

## These tables can then be processed to generate a SQLite database
## containing the annotations
DBFile <- makeEnsemblSQLiteFromTables()

## and finally we can generate the package
makeEnsembldbPackage(ensdb=DBFile, version="0.0.1",
maintainer="Johannes Rainer <johannes.rainer@eurac.edu>",
author="J Rainer")

## Build an annotation database form a GFF file from Ensembl.
## ftp://ftp.ensembl.org/pub/release-83/gff3/rattus_norvegicus
gff <- "Rattus_norvegicus.Rnor_6.0.83.gff3.gz"
DB <- ensDbFromGff(gff=gff)
edb <- EnsDb(DB)
edb

## Build an annotation file from a GTF file.
## the GTF file can be downloaded from
## ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/
gtffile <- "Homo_sapiens.GRCh37.75.gtf.gz"
## generate the SQLite database file
DB <- ensDbFromGtf(gtf=paste0(ensemblhost, gtffile))

## load the DB file directly
EDB <- EnsDb(DB)

## Alternatively, we could fetch a GTF file directly from AnnotationHub
## and build the database from that:
library(AnnotationHub)
ah <- AnnotationHub()
## Query for all GTF files from Ensembl for Ensembl version 81
query(ah, c("Ensembl", "release-81", "GTF"))
## We could get the one from e.g. Bos taurus:
DB <- ensDbFromAH(ah["AH47941"])
edb <- EnsDb(DB)
edb

## Generate a sqlite database for genes encoded on chromosome Y
chrY <- system.file("chrY", package="ensembldb")
DBFile <- makeEnsemblSQLiteFromTables(path=chrY ,dbname=tempfile())
## load this database:
edb <- EnsDb(DBFile)

edb

## Generate a sqlite database from a GRanges object specifying
## genes encoded on chromosome Y
load(system.file("YGRanges.RData", package="ensembldb"))

Y

DB <- ensDbFromGRanges(Y, path=tempdir(), version=75,
organism="Homo_sapiens")
edb <- EnsDb(DB)
Link to this function

proteinToGenome()

Map within-protein coordinates to genomic coordinates

Description

proteinToGenome maps protein-relative coordinates to genomic coordinates based on the genomic coordinates of the CDS of the encoding transcript. The encoding transcript is identified using protein-to-transcript annotations (and eventually Uniprot to Ensembl protein identifier mappings) from the submitted EnsDb object (and thus based on annotations from Ensembl).

Not all coding regions for protein coding transcripts are complete, and the function thus checks also if the length of the coding region matches the length of the protein sequence and throws a warning if that is not the case.

The genomic coordinates for the within-protein coordinates, the Ensembl protein ID, the ID of the encoding transcript and the within protein start and end coordinates are reported for each input range.

Usage

proteinToGenome(x, db, id = "name", idType = "protein_id")

Arguments

ArgumentDescription
xIRanges with the coordinates within the protein(s). The object has also to provide some means to identify the protein (see details).
dbEnsDb object to be used to retrieve genomic coordinates of encoding transcripts.
idcharacter(1) specifying where the protein identifier can be found. Has to be either "name" or one of colnames(mcols(prng)) .
idTypecharacter(1) defining what type of IDs are provided. Has to be one of "protein_id" (default), "uniprot_id" or "tx_id" .

Details

Protein identifiers (supported are Ensembl protein IDs or Uniprot IDs) can be passed to the function as names of the x IRanges object, or alternatively in any one of the metadata columns ( mcols ) of x .

Value

list , each element being the mapping results for one of the input ranges in x and names being the IDs used for the mapping. Each element can be either a:

  • GRanges object with the genomic coordinates calculated on the protein-relative coordinates for the respective Ensembl protein (stored in the "protein_id" metadata column.

  • GRangesList object, if the provided protein identifier in x was mapped to several Ensembl protein IDs (e.g. if Uniprot identifiers were used). Each element in this GRangesList is a GRanges with the genomic coordinates calculated for the protein-relative coordinates from the respective Ensembl protein ID.

The following metadata columns are available in each GRanges in the result:

  • "protein_id" : the ID of the Ensembl protein for which the within-protein coordinates were mapped to the genome.

  • "tx_id" : the Ensembl transcript ID of the encoding transcript.

  • "exon_id" : ID of the exons that have overlapping genomic coordinates.

  • "exon_rank" : the rank/index of the exon within the encoding transcript.

  • "cds_ok" : contains TRUE if the length of the CDS matches the length of the amino acid sequence and FALSE otherwise.

  • "protein_start" : the within-protein sequence start coordinate of the mapping.

  • "protein_end" : the within-protein sequence end coordinate of the mapping.

Genomic coordinates are returned ordered by the exon index within the transcript.

Seealso

Other coordinate mapping functions: cdsToTranscript , genomeToProtein , genomeToTranscript , proteinToTranscript , transcriptToCds , transcriptToGenome , transcriptToProtein

Note

While the mapping for Ensembl protein IDs to encoding transcripts (and thus CDS) is 1:1, the mapping between Uniprot identifiers and encoding transcripts (which is based on Ensembl annotations) can be one to many. In such cases proteinToGenome calculates genomic coordinates for within-protein coordinates for all of the annotated Ensembl proteins and returns all of them. See below for examples.

Mapping using Uniprot identifiers needs also additional internal checks that have a significant impact on the performance of the function. It is thus strongly suggested to first identify the Ensembl protein identifiers for the list of input Uniprot identifiers (e.g. using the proteins() function and use these as input for the mapping function.

A warning is thrown for proteins which sequence does not match the coding sequence length of any encoding transcripts. For such proteins/transcripts a FALSE is reported in the respective "cds_ok" metadata column. The most common reason for such discrepancies are incomplete 3' or 5' ends of the CDS. The positions within the protein might not be correclty mapped to the genome in such cases and it might be required to check the mapping manually in the Ensembl genome browser.

Author

Johannes Rainer based on initial code from Laurent Gatto and Sebastian Gibb

Examples

library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome x to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Define an IRange with protein-relative coordinates within a protein for
## the gene SYP
syp <- IRanges(start = 4, end = 17)
names(syp) <- "ENSP00000418169"
res <- proteinToGenome(syp, edbx)
res
## Positions 4 to 17 within the protein span two exons of the encoding
## transcript.

## Perform the mapping for multiple proteins identified by their Uniprot
## IDs.
ids <- c("O15266", "Q9HBJ8", "unexistant")
prngs <- IRanges(start = c(13, 43, 100), end = c(21, 80, 100))
names(prngs) <- ids

res <- proteinToGenome(prngs, edbx, idType = "uniprot_id")

## The result is a list, same length as the input object
length(res)
names(res)

## No protein/encoding transcript could be found for the last one
res[[3]]

## The first protein could be mapped to multiple Ensembl proteins. The
## mapping result using all of their encoding transcripts are returned
res[[1]]

## The coordinates within the second protein span two exons
res[[2]]
Link to this function

proteinToTranscript()

Map protein-relative coordinates to positions within the transcript

Description

proteinToTranscript maps protein-relative coordinates to positions within the encoding transcript. Note that the returned positions are relative to the complete transcript length, which includes the 5' UTR.

Similar to the proteinToGenome() function, proteinToTranscript compares for each protein whether the length of its sequence matches the length of the encoding CDS and throws a warning if that is not the case. Incomplete 3' or 5' CDS of the encoding transcript are the most common reasons for a mismatch between protein and transcript sequences.

Usage

proteinToTranscript(x, db, id = "name", idType = "protein_id")

Arguments

ArgumentDescription
xIRanges with the coordinates within the protein(s). The object has also to provide some means to identify the protein (see details).
dbEnsDb object to be used to retrieve genomic coordinates of encoding transcripts.
idcharacter(1) specifying where the protein identifier can be found. Has to be either "name" or one of colnames(mcols(prng)) .
idTypecharacter(1) defining what type of IDs are provided. Has to be one of "protein_id" (default), "uniprot_id" or "tx_id" .

Details

Protein identifiers (supported are Ensembl protein IDs or Uniprot IDs) can be passed to the function as names of the x IRanges object, or alternatively in any one of the metadata columns ( mcols ) of x .

Value

IRangesList , each element being the mapping results for one of the input ranges in x . Each element is a IRanges object with the positions within the encoding transcript (relative to the start of the transcript, which includes the 5' UTR). The transcript ID is reported as the name of each IRanges . The IRanges can be of length > 1 if the provided protein identifier is annotated to more than one Ensembl protein ID (which can be the case if Uniprot IDs are provided). If the coordinates can not be mapped (because the protein identifier is unknown to the database) an IRanges with negative coordinates is returned.

The following metadata columns are available in each IRanges in the result:

  • "protein_id" : the ID of the Ensembl protein for which the within-protein coordinates were mapped to the genome.

  • "tx_id" : the Ensembl transcript ID of the encoding transcript.

  • "cds_ok" : contains TRUE if the length of the CDS matches the length of the amino acid sequence and FALSE otherwise.

  • "protein_start" : the within-protein sequence start coordinate of the mapping.

  • "protein_end" : the within-protein sequence end coordinate of the mapping.

Seealso

Other coordinate mapping functions: cdsToTranscript , genomeToProtein , genomeToTranscript , proteinToGenome , transcriptToCds , transcriptToGenome , transcriptToProtein

Note

While mapping of Ensembl protein IDs to Ensembl transcript IDs is 1:1, a single Uniprot identifier can be annotated to several Ensembl protein IDs. proteinToTranscript calculates in such cases transcript-relative coordinates for each annotated Ensembl protein.

Mapping using Uniprot identifiers needs also additional internal checks that can have a significant impact on the performance of the function. It is thus strongly suggested to first identify the Ensembl protein identifiers for the list of input Uniprot identifiers (e.g. using the proteins() function and use these as input for the mapping function.

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome x to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Define an IRange with protein-relative coordinates within a protein for
## the gene SYP
syp <- IRanges(start = 4, end = 17)
names(syp) <- "ENSP00000418169"
res <- proteinToTranscript(syp, edbx)
res
## Positions 4 to 17 within the protein span are encoded by the region
## from nt 23 to 64.

## Perform the mapping for multiple proteins identified by their Uniprot
## IDs.
ids <- c("O15266", "Q9HBJ8", "unexistant")
prngs <- IRanges(start = c(13, 43, 100), end = c(21, 80, 100))
names(prngs) <- ids

res <- proteinToTranscript(prngs, edbx, idType = "uniprot_id")

## The result is a list, same length as the input object
length(res)
names(res)

## No protein/encoding transcript could be found for the last one
res[[3]]

## The first protein could be mapped to multiple Ensembl proteins. The
## region within all transcripts encoding the region in the protein are
## returned
res[[1]]

## The result for the region within the second protein
res[[2]]

Search annotations interactively

Description

This function starts the interactive EnsDb shiny web application that allows to look up gene/transcript/exon annotations from an EnsDb annotation package installed locally.

Usage

runEnsDbApp(...)

Arguments

ArgumentDescription
...Additional arguments passed to the runApp function from the shiny package.

Details

The shiny based web application allows to look up any annotation available in any of the locally installed EnsDb annotation packages.

Value

If the button Return & close is clicked, the function returns the results of the present query either as data.frame or as GRanges object.

Seealso

EnsDb , genes

Author

Johannes Rainer

Link to this function

transcriptToCds()

Map transcript-relative coordinates to positions within the CDS

Description

Converts transcript-relative coordinates to positions within the CDS (if the transcript encodes a protein).

Usage

transcriptToCds(x, db, id = "name")

Arguments

ArgumentDescription
xIRanges with the coordinates within the transcript. Coordinates are expected to be relative to the transcription start (the first nucleotide of the transcript). The Ensembl IDs of the corresponding transcripts have to be provided either as names of the IRanges , or in one of its metadata columns.
dbEnsDb object.
idcharacter(1) specifying where the transcript identifier can be found. Has to be either "name" or one of colnames(mcols(prng)) .

Value

IRanges with the same length (and order) than the input IRanges x . Each element in IRanges provides the coordinates within the transcripts CDS. The transcript-relative coordinates are provided as metadata columns. IRanges with a start coordinate of -1 is returned for transcripts that are not known in the database, non-coding transcripts or if the provided start and/or end coordinates are not within the coding region.

Seealso

Other coordinate mapping functions: cdsToTranscript , genomeToProtein , genomeToTranscript , proteinToGenome , proteinToTranscript , transcriptToGenome , transcriptToProtein

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
## Defining transcript-relative coordinates for 4 transcripts of the gene
## BCL2
txcoords <- IRanges(start = c(1463, 3, 143, 147), width = 1,
names = c("ENST00000398117", "ENST00000333681",
"ENST00000590515", "ENST00000589955"))

## Map the coordinates.
transcriptToCds(txcoords, EnsDb.Hsapiens.v86)

## ENST00000590515 does not encode a protein and thus -1 is returned
## The coordinates within ENST00000333681 are outside the CDS and thus also
## -1 is reported.
Link to this function

transcriptToGenome()

Map transcript-relative coordinates to genomic coordinates

Description

transcriptToGenome maps transcript-relative coordinates to genomic coordinates. Provided coordinates are expected to be relative to the first nucleotide of the transcript , not the CDS . CDS-relative coordinates have to be converted to transcript-relative positions first with the cdsToTranscript() function.

Usage

transcriptToGenome(x, db, id = "name")

Arguments

ArgumentDescription
xIRanges with the coordinates within the transcript. Coordinates are counted from the start of the transcript (including the 5' UTR). The Ensembl IDs of the corresponding transcripts have to be provided either as names of the IRanges , or in one of its metadata columns.
dbEnsDb object.
idcharacter(1) specifying where the transcript identifier can be found. Has to be either "name" or one of colnames(mcols(prng)) .

Value

GRangesList with the same length (and order) than the input IRanges x . Each GRanges in the GRangesList provides the genomic coordinates corresponding to the provided within-transcript coordinates. The original transcript ID and the transcript-relative coordinates are provided as metadata columns as well as the ID of the individual exon(s). An empty GRanges is returned for transcripts that can not be found in the database.

Seealso

cdsToTranscript() and transcriptToCds() for the mapping between CDS- and transcript-relative coordinates.

Other coordinate mapping functions: cdsToTranscript , genomeToProtein , genomeToTranscript , proteinToGenome , proteinToTranscript , transcriptToCds , transcriptToProtein

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome x to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Below we map positions 1 to 5 within the transcript ENST00000381578 to
## the genome. The ID of the transcript has to be provided either as names
## or in one of the IRanges' metadata columns
txpos <- IRanges(start = 1, end = 5, names = "ENST00000381578")

transcriptToGenome(txpos, edbx)
## The object returns a GRangesList with the genomic coordinates, in this
## example the coordinates are within the same exon and map to a single
## genomic region.

## Next we map nucleotides 501 to 505 of ENST00000486554 to the genome
txpos <- IRanges(start = 501, end = 505, names = "ENST00000486554")

transcriptToGenome(txpos, edbx)
## The positions within the transcript are located within two of the
## transcripts exons and thus a `GRanges` of length 2 is returned.

## Next we map multiple regions, two within the same transcript and one
## in a transcript that does not exist.
txpos <- IRanges(start = c(501, 1, 5), end = c(505, 10, 6),
names = c("ENST00000486554", "ENST00000486554", "some"))

res <- transcriptToGenome(txpos, edbx)

## The length of the result GRangesList has the same length than the
## input IRanges
length(res)

## The result for the last region is an empty GRanges, because the
## transcript could not be found in the database
res[[3]]

res
Link to this function

transcriptToProtein()

Map transcript-relative coordinates to amino acid residues of the encoded protein

Description

transcriptToProtein maps within-transcript coordinates to the corresponding coordinates within the encoded protein sequence. The provided coordinates have to be within the coding region of the transcript (excluding the stop codon) but are supposed to be relative to the first nucleotide of the transcript (which includes the 5' UTR). Positions relative to the CDS of a transcript (e.g. /PKP2 c.1643delg/) have to be first converted to transcript-relative coordinates. This can be done with the cdsToTranscript() function.

Usage

transcriptToProtein(x, db, id = "name")

Arguments

ArgumentDescription
xIRanges with the coordinates within the transcript. Coordinates are counted from the start of the transcript (including the 5' UTR). The Ensembl IDs of the corresponding transcripts have to be provided either as names of the IRanges , or in one of its metadata columns.
dbEnsDb object.
idcharacter(1) specifying where the transcript identifier can be found. Has to be either "name" or one of colnames(mcols(prng)) .

Details

Transcript-relative coordinates are mapped to the amino acid residues they encode. As an example, positions within the transcript that correspond to nucleotides 1 to 3 in the CDS are mapped to the first position in the protein sequence (see examples for more details).

Value

IRanges with the same length (and order) than the input IRanges x . Each element in IRanges provides the coordinates within the protein sequence, names being the (Ensembl) IDs of the protein. The original transcript ID and the transcript-relative coordinates are provided as metadata columns. Metadata columns "cds_ok" indicates whether the length of the transcript's CDS matches the length of the encoded protein. IRanges with a start coordinate of -1 is returned for transcript coordinates that can not be mapped to protein-relative coordinates (either no transcript was found for the provided ID, the transcript does not encode a protein or the provided coordinates are not within the coding region of the transcript).

Seealso

cdsToTranscript() and transcriptToCds() for conversion between CDS- and transcript-relative coordinates.

Other coordinate mapping functions: cdsToTranscript , genomeToProtein , genomeToTranscript , proteinToGenome , proteinToTranscript , transcriptToCds , transcriptToGenome

Author

Johannes Rainer

Examples

library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome x to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Define an IRanges with the positions of the first 2 nucleotides of the
## coding region for the transcript ENST00000381578
txpos <- IRanges(start = 692, width = 2, names = "ENST00000381578")

## Map these to the corresponding residues in the protein sequence
## The protein-relative coordinates are returned as an [`IRanges`](IRanges.html) object,
## with the original, transcript-relative coordinates provided in metadata
## columns tx_start and tx_end
transcriptToProtein(txpos, edbx)

## We can also map multiple ranges. Note that for any of the 3 nucleotides
## encoding the same amino acid the position of this residue in the
## protein sequence is returned. To illustrate this we map below each of the
## first 4 nucleotides of the CDS to the corresponding position within the
## protein.
txpos <- IRanges(start = c(692, 693, 694, 695),
width = rep(1, 4), names = rep("ENST00000381578", 4))
transcriptToProtein(txpos, edbx)

## If the mapping fails, an IRanges with negative start position is returned.
## Mapping can fail (as below) because the ID is not known.
transcriptToProtein(IRanges(1, 1, names = "unknown"), edbx)

## Or because the provided coordinates are not within the CDS
transcriptToProtein(IRanges(1, 1, names = "ENST00000381578"), edbx)
Link to this function

useMySQL_EnsDb_method()

Use a MariaDB/MySQL backend

Description

Change the SQL backend from SQLite to MySQL . When first called on an EnsDb object, the function tries to create and save all of the data into a MySQL database. All subsequent calls will connect to the already existing MySQL database.

Usage

list(list("useMySQL"), list("EnsDb"))(x, host = "localhost", port = 3306, user,
  pass)

Arguments

ArgumentDescription
xThe EnsDb object.
hostCharacter vector specifying the host on which the MariaDB/MySQL server runs.
portThe port on which the MariaDB/MySQL server can be accessed.
userThe user name for the MariaDB/MySQL server.
passThe password for the MariaDB/MySQL server.

Details

This functionality requires that the RMariaDB package is installed and that the user has (write) access to a running MySQL server. If the corresponding database does already exist users without write access can use this functionality.

Value

A EnsDb object providing access to the data stored in the MySQL backend.

Note

At present the function does not evaluate whether the versions between the SQLite and MariaDB/MySQL database differ.

Author

Johannes Rainer

Examples

## Load the EnsDb database (SQLite backend).
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86
## Now change the backend to MySQL; my_user and my_pass should
## be the user name and password to access the MySQL server.
edb_mysql <- useMySQL(edb, host = "localhost", user = my_user, pass = my_pass)