biopython v1.71.0 Bio.SeqIO.InsdcIO

Bio.SeqIO support for the “genbank” and “embl” file formats.

You are expected to use this module via the Bio.SeqIO functions. Note that internally this module calls Bio.GenBank to do the actual parsing of GenBank, EMBL and IMGT files.

See Also: International Nucleotide Sequence Database Collaboration http://www.insdc.org/

GenBank http://www.ncbi.nlm.nih.gov/Genbank/

EMBL Nucleotide Sequence Database http://www.ebi.ac.uk/embl/

DDBJ (DNA Data Bank of Japan) http://www.ddbj.nig.ac.jp/

IMGT (uses a variant of EMBL format with longer feature indents) http://imgt.cines.fr/download/LIGM-DB/userman_doc.html http://imgt.cines.fr/download/LIGM-DB/ftable_doc.html http://www.ebi.ac.uk/imgt/hla/docs/manual.html

Summary

Functions

EmblCdsFeatureIterator(): Breaks up an EMBL file into SeqRecord objects for each CDS feature

EmblIterator(): Breaks up an EMBL file into SeqRecord objects

GenBankCdsFeatureIterator(): Breaks up a GenBank file into SeqRecord objects for each CDS feature

GenBankIterator(): Breaks up a GenBank file into SeqRecord objects

ImgtIterator(): Breaks up an IMGT file into SeqRecord objects

_insdc_feature_position_string(): Build a GenBank/EMBL position string (PRIVATE)

_insdc_location_string(): Build a GenBank/EMBL location from a (Compound) FeatureLocation (PRIVATE)

Functions

EmblCdsFeatureIterator()

Breaks up an EMBL file into SeqRecord objects for each CDS feature.

Every section from the ID line (the EMBL counterpart of the GenBank LOCUS line) to the terminating // can contain many CDS features. These are returned as SeqRecord objects with the stated amino acid translation sequence (if given).

EmblIterator()

Breaks up an EMBL file into SeqRecord objects.

Every section from the ID line (the EMBL counterpart of the GenBank LOCUS line) to the terminating // becomes a single SeqRecord with associated annotation and features.

Note that for genomes or chromosomes, there is typically only one record.

This gets called internally by Bio.SeqIO for the EMBL file format:

 >>> from Bio import SeqIO
 >>> for record in SeqIO.parse("EMBL/epo_prt_selection.embl", "embl"):
 ...     print(record.id)
 ...
 A00022.1
 A00028.1
 A00031.1
 A00034.1
 A00060.1
 A00071.1
 A00072.1
 A00078.1
 CQ797900.1

Equivalently,

 >>> from Bio.SeqIO.InsdcIO import EmblIterator
 >>> with open("EMBL/epo_prt_selection.embl") as handle:
 ...     for record in EmblIterator(handle):
 ...         print(record.id)
 ...
 A00022.1
 A00028.1
 A00031.1
 A00034.1
 A00060.1
 A00071.1
 A00072.1
 A00078.1
 CQ797900.1

GenBankCdsFeatureIterator()

Breaks up a GenBank file into SeqRecord objects for each CDS feature.

Every section from the LOCUS line to the terminating // can contain many CDS features. These are returned as SeqRecord objects with the stated amino acid translation sequence (if given).
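As a rough illustration of what "one record per CDS feature" means, the sketch below pulls every /translation qualifier out of a GenBank-style feature table using only the standard library. It is not how Bio.GenBank parses files: the helper name cds_translations and the toy record are hypothetical, and the regular expression assumes a translation never contains a double quote.

```python
import re

# Toy GenBank-style record (hypothetical data, heavily abbreviated).
RECORD = '''LOCUS       TOY00001     30 bp    DNA     linear   UNK
FEATURES             Location/Qualifiers
     CDS             1..12
                     /translation="MKV"
     CDS             complement(13..30)
                     /translation="MPK
                     PQ"
ORIGIN
        1 atgaaagtgt aatcactgtg gcttaggcat
//
'''


def cds_translations(record_text):
    """Return the amino acid string of every /translation qualifier."""
    # A qualifier value may wrap over several lines, so match up to the
    # closing quote, then drop the line breaks and indentation.
    matches = re.findall(r'/translation="([^"]*)"', record_text)
    return ["".join(m.split()) for m in matches]


print(cds_translations(RECORD))  # ['MKV', 'MPKPQ']
```

A CDS feature without a /translation qualifier simply contributes nothing here, which mirrors the "(if given)" caveat above only loosely.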

GenBankIterator()

Breaks up a GenBank file into SeqRecord objects.

Every section from the LOCUS line to the terminating // becomes a single SeqRecord with associated annotation and features.

Note that for genomes or chromosomes, there is typically only one record.
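The record boundaries described above can be pictured with a small standard-library sketch. It only groups lines from each LOCUS line to the next // terminator; the function name split_records and the toy text are hypothetical, and the real parsing in Bio.GenBank does far more (annotation, features, sequence).

```python
def split_records(lines):
    """Yield one list of lines per record, from LOCUS to the closing '//'."""
    current = None
    for line in lines:
        if current is None:
            # Skip anything before the next LOCUS line.
            if line.startswith("LOCUS"):
                current = [line]
            continue
        current.append(line)
        if line.rstrip() == "//":
            yield current
            current = None
    # A trailing record with no '//' terminator is dropped in this sketch.


TEXT = """leading line that belongs to no record
LOCUS       TOY00001
ORIGIN
//
LOCUS       TOY00002
ORIGIN
//
"""

records = list(split_records(TEXT.splitlines()))
print(len(records))      # 2
print(records[1][0])     # LOCUS       TOY00002
```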

This gets called internally by Bio.SeqIO for the GenBank file format:

 >>> from Bio import SeqIO
 >>> for record in SeqIO.parse("GenBank/cor6_6.gb", "gb"):
 ...     print(record.id)
 ...
 X55053.1
 X62281.1
 M81224.1
 AJ237582.1
 L31939.1
 AF297471.1

Equivalently,

 >>> from Bio.SeqIO.InsdcIO import GenBankIterator
 >>> with open("GenBank/cor6_6.gb") as handle:
 ...     for record in GenBankIterator(handle):
 ...         print(record.id)
 ...
 X55053.1
 X62281.1
 M81224.1
 AJ237582.1
 L31939.1
 AF297471.1

ImgtIterator()

Breaks up an IMGT file into SeqRecord objects.

Every section from the ID line (IMGT uses a variant of EMBL format) to the terminating // becomes a single SeqRecord with associated annotation and features.

Note that for genomes or chromosomes, there is typically only one record.

_insdc_feature_position_string()

Build a GenBank/EMBL position string (PRIVATE).

Use offset=1 to add one to a start position, converting it from Python's zero-based counting to the one-based counting used in GenBank/EMBL files.
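A minimal sketch of that offset arithmetic, assuming plain integer positions. The names position_string and range_string are hypothetical stand-ins, not the private Biopython function, which also has to deal with fuzzy positions that are not modelled here.

```python
def position_string(pos, offset=0):
    """Format an integer position, shifted by ``offset``."""
    return str(pos + offset)


def range_string(start, end):
    """Turn a zero-based half-open span into a one-based 'start..end' string."""
    # Only the start is shifted: the half-open end already equals the
    # one-based inclusive end.
    return position_string(start, offset=1) + ".." + position_string(end)


print(range_string(19, 100))  # 20..100
print(range_string(0, 10))    # 1..10
```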

_insdc_location_string()

Build a GenBank/EMBL location from a (Compound) FeatureLocation (PRIVATE).

There is a choice of how to show joins on the reverse complement strand. GenBank uses “complement(join(1..10,20..100))”, while EMBL used to use “join(complement(20..100),complement(1..10))” instead (but appears to have now adopted the GenBank convention). Notice that the order of the entries is reversed! This function therefore uses the first form. In this situation we expect the CompoundFeatureLocation and its parts to all be marked as strand == -1, and to be in the order 19:100 then 0:10.
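The ordering rule can be sketched with plain tuples in place of FeatureLocation objects. This is an illustration under stated assumptions, not the Biopython implementation: location_string is a hypothetical name, each part is a zero-based half-open (start, end) pair, and all parts share one strand.

```python
def location_string(parts, strand):
    """Build an INSDC-style location from (start, end) spans on one strand."""
    def span(start, end):
        # Zero-based half-open -> one-based inclusive.
        return "%i..%i" % (start + 1, end)

    if strand == -1:
        # Reverse-strand parts arrive in biological order (for example
        # 19:100 then 0:10), so flip them back to ascending coordinates
        # before wrapping the whole join in complement().
        parts = parts[::-1]
    spans = [span(s, e) for s, e in parts]
    inner = spans[0] if len(spans) == 1 else "join(%s)" % ",".join(spans)
    return "complement(%s)" % inner if strand == -1 else inner


print(location_string([(19, 100), (0, 10)], -1))
# complement(join(1..10,20..100))
```

Passing the same spans in ascending order with strand +1 yields join(1..10,20..100), without the complement() wrapper.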