biopython v1.71.0 Bio.Seq

Provide objects to represent biological sequences with alphabets.

See also the Seq wiki (http://biopython.org/wiki/Seq) and the chapter in our tutorial:

  • HTML Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html
  • PDF Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Summary

Functions

  • _maketrans() - Make a Python string translation table (PRIVATE)
  • _test() - Run the Bio.Seq module’s doctests (PRIVATE)
  • _translate_str() - Translate nucleotide string into a protein string (PRIVATE)
  • back_transcribe() - Return the RNA sequence back-transcribed into DNA
  • complement() - Return the complement sequence of a nucleotide string
  • reverse_complement() - Return the reverse complement sequence of a nucleotide string
  • transcribe() - Transcribe a DNA sequence into RNA
  • translate() - Translate a nucleotide sequence into amino acids

Functions

_maketrans()

Make a Python string translation table (PRIVATE).

Arguments:

  • complement_mapping - a dictionary such as ambiguous_dna_complement and ambiguous_rna_complement from Data.IUPACData.

Returns a translation table (a string of length 256) for use with the Python string’s translate method when computing a complement or reverse complement.

Compatible with lower case and upper case sequences.

For internal use only.
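
As a rough illustration of the idea, the sketch below builds a complement table with Python 3's built-in str.maketrans, which returns a mapping rather than the 256-character string described above. The AMBIGUOUS_DNA_COMPLEMENT dictionary and the make_complement_table helper are hypothetical stand-ins written for this example; they are not the library's own code.

```python
# Hypothetical stand-in for a complement mapping such as
# ambiguous_dna_complement (IUPAC nucleotide codes and their complements).
AMBIGUOUS_DNA_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S",
    "Y": "R", "K": "M", "V": "B", "H": "D",
    "D": "H", "B": "V", "X": "X", "N": "N",
}


def make_complement_table(complement_mapping):
    """Build a table for str.translate covering upper and lower case."""
    before = "".join(complement_mapping.keys())
    after = "".join(complement_mapping.values())
    # Append lower-case copies so mixed-case sequences keep their case.
    return str.maketrans(before + before.lower(), after + after.lower())


table = make_complement_table(AMBIGUOUS_DNA_COMPLEMENT)

# Characters without an entry in the table (such as "-") pass through unchanged.
complemented = "ACTG-NH".translate(table)
reverse_complemented = complemented[::-1]
mixed_case = "acTG".translate(table)
```

Building the table once and reusing it keeps each complement operation to a single str.translate call.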

_test()

Run the Bio.Seq module’s doctests (PRIVATE).

_translate_str()

Translate nucleotide string into a protein string (PRIVATE).

Arguments:

  • sequence - a string
  • table - a CodonTable object (NOT a table name or id number)
  • stop_symbol - a single character string, what to use for terminators.
  • to_stop - boolean, should translation terminate at the first in frame stop codon? If there is no in-frame stop codon then translation continues to the end.
  • pos_stop - a single character string for a possible stop codon (e.g. TAN or NNN)
  • cds - Boolean, indicates this is a complete CDS. If True, this checks the sequence starts with a valid alternative start codon (which will be translated as methionine, M), that the sequence length is a multiple of three, and that there is a single in frame stop codon at the end (this will be excluded from the protein sequence, regardless of the to_stop option). If these tests fail, an exception is raised.
  • gap - Single character string to denote symbol used for gaps. Defaults to None.

Returns a string.

e.g.

 >>> from Bio.Data import CodonTable
 >>> table = CodonTable.ambiguous_dna_by_id[1]
 >>> _translate_str("AAA", table)
 'K'
 >>> _translate_str("TAR", table)
 '*'
 >>> _translate_str("TAN", table)
 'X'
 >>> _translate_str("TAN", table, pos_stop="@")
 '@'
 >>> _translate_str("TA?", table)
 Traceback (most recent call last):
    ...
 TranslationError: Codon 'TA?' is invalid
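
One way to picture the handling of ambiguous codons shown above is to expand each ambiguity code into the bases it can stand for and look at the set of possible translations. The sketch below is a simplified assumption about that logic, not the library's implementation: TOY_CODE covers only the codons used here, translate_codon_sketch is a hypothetical helper, and every mixed result is reported as "X".

```python
from itertools import product

# Toy subset of the standard genetic code ("*" marks a stop codon).
TOY_CODE = {
    "AAA": "K",
    "TAA": "*", "TAG": "*", "TAC": "Y", "TAT": "Y",
    "GCA": "A", "GCC": "A", "GCG": "A", "GCT": "A",
}

# IUPAC nucleotide ambiguity codes and the bases each one can represent.
AMBIGUOUS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon_sketch(codon, stop_symbol="*", pos_stop="X"):
    """Translate one codon, expanding any ambiguity codes it contains."""
    try:
        choices = [AMBIGUOUS[base] for base in codon.upper()]
    except KeyError:
        raise ValueError("Codon %r is invalid" % codon) from None
    # Translate every concrete codon the ambiguous codon could represent.
    residues = {TOY_CODE["".join(bases)] for bases in product(*choices)}
    if residues == {"*"}:
        return stop_symbol   # every possibility is a stop codon
    if "*" in residues:
        return pos_stop      # could be a stop codon or an amino acid
    if len(residues) == 1:
        return residues.pop()  # every possibility gives the same amino acid
    return "X"               # several different amino acids are possible
```

With this toy table, "TAR" expands to TAA and TAG and so yields the stop symbol, while "TAN" also covers TAC and TAT and yields the pos_stop character.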

Unlike older versions of Biopython, partial codons are now always treated as an error (previously they were only checked if cds=True). They currently trigger a warning, which is likely to become an exception in a future release.

If cds=True, the start and stop codons are checked, and the start codon will be translated as methionine. The sequence must be a whole number of codons.

 >>> _translate_str("ATGCCCTAG", table, cds=True)
 'MP'
 >>> _translate_str("AAACCCTAG", table, cds=True)
 Traceback (most recent call last):
    ...
 TranslationError: First codon 'AAA' is not a start codon
 >>> _translate_str("ATGCCCTAGCCCTAG", table, cds=True)
 Traceback (most recent call last):
    ...
 TranslationError: Extra in frame stop codon found.
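
The cds=True checks described above can be sketched in plain Python. This is an illustrative outline under stated assumptions, not the library's code: FORWARD_TABLE is a toy codon table holding only a few codons, the TranslationError class is a local stand-in, and translate_cds_sketch is a hypothetical helper name.

```python
# Toy data for the sketch; a real codon table covers all 64 codons.
FORWARD_TABLE = {"ATG": "M", "CCC": "P", "AAA": "K", "GTG": "V"}
START_CODONS = {"ATG", "TTG", "CTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


class TranslationError(ValueError):
    """Local stand-in for the exception raised on a failed check."""


def translate_cds_sketch(sequence):
    """Translate a complete CDS, applying the three checks in order."""
    sequence = sequence.upper()
    if len(sequence) % 3 != 0:
        raise TranslationError(
            "Sequence length %i is not a multiple of three" % len(sequence))
    if sequence[:3] not in START_CODONS:
        raise TranslationError(
            "First codon '%s' is not a start codon" % sequence[:3])
    if sequence[-3:] not in STOP_CODONS:
        raise TranslationError(
            "Final codon '%s' is not a stop codon" % sequence[-3:])
    # Any valid start codon is translated as methionine.
    protein = ["M"]
    # Translate the codons between the start codon and the final stop codon.
    body = sequence[3:-3]
    for i in range(0, len(body), 3):
        codon = body[i:i + 3]
        if codon in STOP_CODONS:
            raise TranslationError("Extra in frame stop codon found.")
        protein.append(FORWARD_TABLE[codon])
    return "".join(protein)
```

The final stop codon is checked but never appended, which is why the result excludes it regardless of any to_stop setting.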

back_transcribe()

Return the RNA sequence back-transcribed into DNA.

If given a string, returns a new string object.

Given a Seq or MutableSeq, returns a new Seq object with a DNA alphabet.

Trying to back-transcribe a protein or DNA sequence raises an exception.

e.g.

 >>> back_transcribe("ACUGN")
 'ACTGN'

complement()

Return the complement sequence of a nucleotide string.

If given a string, returns a new string object. Given a Seq or a MutableSeq, returns a new Seq object with the same alphabet.

Supports unambiguous and ambiguous nucleotide sequences.

e.g.

 >>> complement("ACTG-NH")
 'TGAC-ND'

reverse_complement()

Return the reverse complement sequence of a nucleotide string.

If given a string, returns a new string object. Given a Seq or a MutableSeq, returns a new Seq object with the same alphabet.

Supports unambiguous and ambiguous nucleotide sequences.

e.g.

 >>> reverse_complement("ACTG-NH")
 'DN-CAGT'

transcribe()

Transcribe a DNA sequence into RNA.

If given a string, returns a new string object.

Given a Seq or MutableSeq, returns a new Seq object with an RNA alphabet.

Trying to transcribe a protein or RNA sequence raises an exception.

e.g.

 >>> transcribe("ACTGN")
 'ACUGN'
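
For plain strings, transcription and back-transcription amount to swapping T and U while preserving case. The helpers below are a minimal sketch of that idea under that assumption; the names transcribe_str and back_transcribe_str are hypothetical, and the sketch does not check that the input really is DNA or RNA.

```python
def transcribe_str(dna):
    """Return an RNA string by replacing T with U (case preserved)."""
    return dna.replace("T", "U").replace("t", "u")


def back_transcribe_str(rna):
    """Return a DNA string by replacing U with T (case preserved)."""
    return rna.replace("U", "T").replace("u", "t")


rna = transcribe_str("ACTGN")
dna = back_transcribe_str(rna)
```

Because only T and U are touched, ambiguity codes such as N pass through unchanged and the two helpers invert each other on valid input.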

translate()

Translate a nucleotide sequence into amino acids.

If given a string, returns a new string object. Given a Seq or MutableSeq, returns a Seq object with a protein alphabet.

Arguments:

  • table - Which codon table to use? This can be either a name (string), an NCBI identifier (integer), or a CodonTable object (useful for non-standard genetic codes). Defaults to the “Standard” table.
  • stop_symbol - Single character string, what to use for any terminators, defaults to the asterisk, “*”.
  • to_stop - Boolean, defaults to False meaning do a full translation continuing on past any stop codons (translated as the specified stop_symbol). If True, translation is terminated at the first in frame stop codon (and the stop_symbol is not appended to the returned protein sequence).
  • cds - Boolean, indicates this is a complete CDS. If True, this checks the sequence starts with a valid alternative start codon (which will be translated as methionine, M), that the sequence length is a multiple of three, and that there is a single in frame stop codon at the end (this will be excluded from the protein sequence, regardless of the to_stop option). If these tests fail, an exception is raised.
  • gap - Single character string to denote symbol used for gaps. Defaults to None.

A simple string example using the default (standard) genetic code:

 >>> coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
 >>> translate(coding_dna)
 'VAIVMGR*KGAR*'
 >>> translate(coding_dna, stop_symbol="@")
 'VAIVMGR@KGAR@'
 >>> translate(coding_dna, to_stop=True)
 'VAIVMGR'

Now using NCBI table 2, where TGA is not a stop codon:

 >>> translate(coding_dna, table=2)
 'VAIVMGRWKGAR*'
 >>> translate(coding_dna, table=2, to_stop=True)
 'VAIVMGRWKGAR'

In fact, this example begins with GTG, an alternative start codon that is valid under NCBI table 2. The sequence is therefore a complete, valid CDS, and its translation should really start with methionine (not valine):

 >>> translate(coding_dna, table=2, cds=True)
 'MAIVMGRWKGAR'

Note that if the sequence has no in-frame stop codon, then the to_stop argument has no effect:

 >>> coding_dna2 = "GTGGCCATTGTAATGGGCCGC"
 >>> translate(coding_dna2)
 'VAIVMGR'
 >>> translate(coding_dna2, to_stop=True)
 'VAIVMGR'

NOTE - Ambiguous codons like “TAN” or “NNN” could be an amino acid or a stop codon. These are translated as “X”. Any invalid codon (e.g. “TA?” or “T-A”) will raise a TranslationError.

The function will, however, translate either DNA or RNA.
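
To make the effect of stop_symbol and to_stop concrete, here is a small self-contained sketch that reproduces the standard-table examples above without Biopython. It is a simplified assumption of the behaviour, not the library's implementation: translate_sketch is a hypothetical helper, the table is built from the standard genetic code written in TCAG order, and ambiguous or invalid codons are not handled (they raise KeyError here).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
STANDARD_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(sequence, stop_symbol="*", to_stop=False):
    """Translate unambiguous DNA or RNA using the standard genetic code."""
    # Treat RNA as DNA so that one table serves both.
    sequence = sequence.upper().replace("U", "T")
    protein = []
    # Trailing bases that do not fill a codon are ignored in this sketch.
    for i in range(0, len(sequence) - len(sequence) % 3, 3):
        amino_acid = STANDARD_CODE[sequence[i:i + 3]]
        if amino_acid == "*":
            if to_stop:
                break  # stop before appending any stop symbol
            amino_acid = stop_symbol
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "GTGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
full = translate_sketch(coding_dna)
custom_stop = translate_sketch(coding_dna, stop_symbol="@")
truncated = translate_sketch(coding_dna, to_stop=True)
```

When the sequence has no in-frame stop codon, the loop never reaches the break, so to_stop makes no difference to the result.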