biopython v1.71.0 Bio.Seq.UnknownSeq

Read-only sequence object of known length but unknown contents.

If you have an unknown sequence, you can represent this with a normal Seq object, for example:

 >>> from Bio.Seq import Seq, UnknownSeq
 >>> my_seq = Seq("N"*5)
 >>> my_seq
 Seq('NNNNN', Alphabet())
 >>> len(my_seq)
 5
 >>> print(my_seq)
 NNNNN

However, this is rather wasteful of memory (especially for large sequences), which is where this class is most useful:

 >>> unk_five = UnknownSeq(5)
 >>> unk_five
 UnknownSeq(5, alphabet = Alphabet(), character = '?')
 >>> len(unk_five)
 5
 >>> print(unk_five)
 ?????
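
To see why storing only the length helps, the following sketch compares a fully materialized string with a small stand-in object that records just a length and a character. LengthOnlySeq is a hypothetical class written for illustration; it is not Biopython's implementation, and the exact byte counts depend on the Python build.

```python
import sys


class LengthOnlySeq:
    """Hypothetical stand-in: stores a length and one character, not the text."""

    __slots__ = ("_length", "_character")

    def __init__(self, length, character="?"):
        if length < 0:
            raise ValueError("length must not be negative")
        if len(character) != 1:
            raise ValueError("character must be a single letter")
        self._length = length
        self._character = character

    def __len__(self):
        return self._length

    def __str__(self):
        # The full string is only built on demand.
        return self._character * self._length


full = "N" * 1000000
compact = LengthOnlySeq(1000000, "N")

# The materialized string needs at least one byte per letter, while the
# compact object's own size does not grow with the sequence length.
print(sys.getsizeof(full), sys.getsizeof(compact))
```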

You can add unknown sequences together, provided their alphabets and characters are compatible, and get another memory-saving UnknownSeq:

 >>> unk_four = UnknownSeq(4)
 >>> unk_four
 UnknownSeq(4, alphabet = Alphabet(), character = '?')
 >>> unk_four + unk_five
 UnknownSeq(9, alphabet = Alphabet(), character = '?')

If the alphabet or characters don’t match up, the addition gives an ordinary Seq object:

 >>> unk_nnnn = UnknownSeq(4, character = "N")
 >>> unk_nnnn
 UnknownSeq(4, alphabet = Alphabet(), character = 'N')
 >>> unk_nnnn + unk_four
 Seq('NNNN????', Alphabet())

Combining with a real Seq gives a new Seq object:

 >>> known_seq = Seq("ACGT")
 >>> unk_four + known_seq
 Seq('????ACGT', Alphabet())
 >>> known_seq + unk_four
 Seq('ACGT????', Alphabet())
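
The addition rule described above can be sketched with plain tuples. This is only an illustration: add_unknown is a hypothetical helper, each operand is modelled as a (length, character) pair, and alphabet compatibility is deliberately ignored.

```python
def add_unknown(left, right):
    """Combine two (length, character) pairs.

    Returns a (length, character) pair when the characters match, otherwise
    the expanded string. Alphabet compatibility is ignored in this sketch.
    """
    left_len, left_char = left
    right_len, right_char = right
    if left_char == right_char:
        # Same placeholder: the result can stay compact.
        return (left_len + right_len, left_char)
    # Different placeholders: the letters must be written out in full.
    return left_char * left_len + right_char * right_len


print(add_unknown((4, "?"), (5, "?")))   # (9, '?')
print(add_unknown((4, "N"), (4, "?")))   # NNNN????
```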

Summary

Functions

Add another sequence or string to this sequence

Get a subsequence from the UnknownSeq object

Create a new UnknownSeq object

Return the stated length of the unknown sequence

Add a sequence on the left

Return (truncated) representation of the sequence for debugging

Return the unknown sequence as full string of the given length

Return an unknown DNA sequence from an unknown RNA sequence

Return the complement of an unknown nucleotide sequence, which equals itself

Return a non-overlapping count, like that of a Python string

Return an overlapping count

Return a lower case copy of the sequence

Return the reverse complement of an unknown sequence

Return an unknown RNA sequence from an unknown DNA sequence

Translate an unknown nucleotide sequence into an unknown protein

Return a copy of the sequence without the gap character(s)

Return an upper case copy of the sequence

Functions

Add another sequence or string to this sequence.

Adding two UnknownSeq objects returns another UnknownSeq object provided the character is the same and the alphabets are compatible.

 >>> from Bio.Seq import UnknownSeq
 >>> from Bio.Alphabet import generic_protein
 >>> UnknownSeq(10, generic_protein) + UnknownSeq(5, generic_protein)
 UnknownSeq(15, alphabet = ProteinAlphabet(), character = 'X')

If the characters differ, an UnknownSeq object cannot be used, so a Seq object is returned:

 >>> from Bio.Seq import UnknownSeq
 >>> from Bio.Alphabet import generic_protein
 >>> UnknownSeq(10, generic_protein) + UnknownSeq(5, generic_protein,
 ...                                              character="x")
 Seq('XXXXXXXXXXxxxxx', ProteinAlphabet())

If adding a string to an UnknownSeq, a new Seq is returned with the same alphabet:

 >>> from Bio.Seq import UnknownSeq
 >>> from Bio.Alphabet import generic_protein
 >>> UnknownSeq(5, generic_protein) + "LV"
 Seq('XXXXXLV', ProteinAlphabet())

Get a subsequence from the UnknownSeq object.

 >>> unk = UnknownSeq(8, character="N")
 >>> print(unk[:])
 NNNNNNNN
 >>> print(unk[5:3])
 <BLANKLINE>
 >>> print(unk[1:-1])
 NNNNNN
 >>> print(unk[1:-1:2])
 NNN
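
Because every letter is the same, a slice of an unknown sequence is fully described by its length. One way to compute that length without building any string is shown below; sliced_length is a hypothetical helper, and nothing here claims to be how the class does it internally.

```python
def sliced_length(length, index):
    """Return how many letters a slice would select from `length` letters."""
    # slice.indices() resolves negative and missing bounds for this length.
    return len(range(*index.indices(length)))


print(sliced_length(8, slice(None)))        # 8
print(sliced_length(8, slice(5, 3)))        # 0
print(sliced_length(8, slice(1, -1)))       # 6
print(sliced_length(8, slice(1, -1, 2)))    # 3
```

The four results match the lengths of the slices printed in the example above.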

Create a new UnknownSeq object.

If character is omitted, it is determined from the alphabet, “N” for nucleotides, “X” for proteins, and “?” otherwise.
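
The default-character rule can be pictured as a simple lookup. In this sketch the alphabet is replaced by a plain string label, which is an assumption made for illustration; the class itself works from the alphabet object.

```python
def default_character(kind):
    """Pick a placeholder letter for a sequence kind.

    `kind` is a hypothetical plain-string label used only in this sketch.
    """
    if kind in ("dna", "rna", "nucleotide"):
        return "N"
    if kind == "protein":
        return "X"
    return "?"


print(default_character("dna"), default_character("protein"), default_character("other"))
```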

Return the stated length of the unknown sequence.

Add a sequence on the left.

Return (truncated) representation of the sequence for debugging.

Return the unknown sequence as full string of the given length.

back_transcribe()

Return an unknown DNA sequence from an unknown RNA sequence.

 >>> my_rna = UnknownSeq(20, character="N")
 >>> my_rna
 UnknownSeq(20, alphabet = Alphabet(), character = 'N')
 >>> print(my_rna)
 NNNNNNNNNNNNNNNNNNNN
 >>> my_dna = my_rna.back_transcribe()
 >>> my_dna
 UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
 >>> print(my_dna)
 NNNNNNNNNNNNNNNNNNNN

Return the complement of an unknown nucleotide sequence, which equals itself.

 >>> my_nuc = UnknownSeq(8)
 >>> my_nuc
 UnknownSeq(8, alphabet = Alphabet(), character = '?')
 >>> print(my_nuc)
 ????????
 >>> my_nuc.complement()
 UnknownSeq(8, alphabet = Alphabet(), character = '?')
 >>> print(my_nuc.complement())
 ????????

Return a non-overlapping count, like that of a Python string.

This behaves like the Python string (and Seq object) method of the same name, which does a non-overlapping count.

For an overlapping search use the newer count_overlap() method.

Returns an integer, the number of occurrences of substring argument sub in the (sub)sequence given by [start:end]. Optional arguments start and end are interpreted as in slice notation.

Arguments:

  • sub - a string or another Seq object to look for
  • start - optional integer, slice start
  • end - optional integer, slice end

 >>> "NNNN".count("N")
 4
 >>> Seq("NNNN").count("N")
 4
 >>> UnknownSeq(4, character="N").count("N")
 4
 >>> UnknownSeq(4, character="N").count("A")
 0
 >>> UnknownSeq(4, character="N").count("AA")
 0

HOWEVER, please note that because Python strings and Seq objects (and MutableSeq objects) do a non-overlapping search, this may not give the answer you expect:

 >>> UnknownSeq(4, character="N").count("NN")
 2
 >>> UnknownSeq(4, character="N").count("NNN")
 1
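
For a run of one repeated letter, the non-overlapping count has a simple closed form, which the sketch below checks against the built-in string method. count_in_run is a hypothetical helper; the optional start and end arguments are left out, and this is not presented as the library's code.

```python
def count_in_run(length, character, sub):
    """Non-overlapping count of `sub` in `character` repeated `length` times."""
    if not sub:
        raise ValueError("sub must not be empty")
    if set(sub) != {character}:
        # Any other letter can never match a run of one repeated character.
        return 0
    return length // len(sub)


for sub in ("N", "NN", "NNN", "A", "AA"):
    # Cross-check the closed form against the built-in string method.
    assert count_in_run(4, "N", sub) == ("N" * 4).count(sub)

print(count_in_run(4, "N", "NN"), count_in_run(4, "N", "NNN"))  # 2 1
```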

count_overlap()

Return an overlapping count.

For a non-overlapping search use the count() method.

Returns an integer, the number of occurrences of substring argument sub in the (sub)sequence given by [start:end]. Optional arguments start and end are interpreted as in slice notation.

Arguments:

  • sub - a string or another Seq object to look for
  • start - optional integer, slice start
  • end - optional integer, slice end

e.g.

 >>> from Bio.Seq import UnknownSeq
 >>> UnknownSeq(4, character="N").count_overlap("NN")
 3
 >>> UnknownSeq(4, character="N").count_overlap("NNN")
 2

Where substrings do not overlap, count_overlap() should behave the same as the count() method:

 >>> UnknownSeq(4, character="N").count_overlap("N")
 4
 >>> UnknownSeq(4, character="N").count_overlap("N") == UnknownSeq(4, character="N").count("N")
 True
 >>> UnknownSeq(4, character="N").count_overlap("A")
 0
 >>> UnknownSeq(4, character="N").count_overlap("A") == UnknownSeq(4, character="N").count("A")
 True
 >>> UnknownSeq(4, character="N").count_overlap("AA")
 0
 >>> UnknownSeq(4, character="N").count_overlap("AA") == UnknownSeq(4, character="N").count("AA")
 True
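
The difference between the two kinds of count can be reproduced on an ordinary string. The sketch below is a generic overlapping search, not the method's actual implementation; overlapping_count is a hypothetical name and the start and end arguments are omitted.

```python
def overlapping_count(text, sub):
    """Count occurrences of `sub` in `text`, allowing overlaps."""
    if not sub:
        raise ValueError("sub must not be empty")
    total = 0
    position = text.find(sub)
    while position != -1:
        total += 1
        # Advance by one letter only, so overlapping matches are found.
        position = text.find(sub, position + 1)
    return total


run = "N" * 4
print(overlapping_count(run, "NN"), run.count("NN"))    # 3 2
print(overlapping_count(run, "NNN"), run.count("NNN"))  # 2 1
```

For a run of n identical letters and a matching sub of length k (with k no larger than n), the overlapping count is n - k + 1.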

Return a lower case copy of the sequence.

This will adjust the alphabet if required:

 >>> from Bio.Alphabet import IUPAC
 >>> from Bio.Seq import UnknownSeq
 >>> my_seq = UnknownSeq(20, IUPAC.extended_protein)
 >>> my_seq
 UnknownSeq(20, alphabet = ExtendedIUPACProtein(), character = 'X')
 >>> print(my_seq)
 XXXXXXXXXXXXXXXXXXXX
 >>> my_seq.lower()
 UnknownSeq(20, alphabet = ProteinAlphabet(), character = 'x')
 >>> print(my_seq.lower())
 xxxxxxxxxxxxxxxxxxxx

See also the upper method.

reverse_complement()

Return the reverse complement of an unknown sequence.

The reverse complement of an unknown nucleotide equals itself:

 >>> from Bio.Seq import UnknownSeq
 >>> from Bio.Alphabet import generic_dna
 >>> example = UnknownSeq(6, generic_dna)
 >>> print(example)
 NNNNNN
 >>> print(example.reverse_complement())
 NNNNNN
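
The reason an all-N sequence is its own reverse complement can be seen with plain strings: N complements to N, and reversing a run of identical letters changes nothing. The translation table below is a small assumption for this sketch and covers only A, C, G, T and N.

```python
# A small complement table for this sketch; it covers only A, C, G, T and N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(text):
    """Complement each letter, then reverse the result."""
    return text.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACG"))    # CGTT
print(reverse_complement("NNNNNN"))  # NNNNNN
```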

Return an unknown RNA sequence from an unknown DNA sequence.

 >>> my_dna = UnknownSeq(10, character="N")
 >>> my_dna
 UnknownSeq(10, alphabet = Alphabet(), character = 'N')
 >>> print(my_dna)
 NNNNNNNNNN
 >>> my_rna = my_dna.transcribe()
 >>> my_rna
 UnknownSeq(10, alphabet = RNAAlphabet(), character = 'N')
 >>> print(my_rna)
 NNNNNNNNNN

Translate an unknown nucleotide sequence into an unknown protein.

e.g.

 >>> my_seq = UnknownSeq(9, character="N")
 >>> print(my_seq)
 NNNNNNNNN
 >>> my_protein = my_seq.translate()
 >>> my_protein
 UnknownSeq(3, alphabet = ProteinAlphabet(), character = 'X')
 >>> print(my_protein)
 XXX

In comparison, using a normal Seq object:

 >>> my_seq = Seq("NNNNNNNNN")
 >>> print(my_seq)
 NNNNNNNNN
 >>> my_protein = my_seq.translate()
 >>> my_protein
 Seq('XXX', ExtendedIUPACProtein())
 >>> print(my_protein)
 XXX
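
In both examples nine nucleotides give three residues, one per codon. A minimal sketch of that arithmetic follows; unknown_protein_length is a hypothetical helper, and dropping a trailing partial codon is an assumption of the sketch rather than documented behaviour.

```python
def unknown_protein_length(nucleotide_length):
    """Number of whole codons in a nucleotide sequence of the given length."""
    # Assumption: a trailing partial codon is dropped by integer division.
    return nucleotide_length // 3


print("X" * unknown_protein_length(9))  # XXX
```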

Return a copy of the sequence without the gap character(s).

The gap character can be specified in two ways - either as an explicit argument, or via the sequence’s alphabet. For example:

 >>> from Bio.Seq import UnknownSeq
 >>> from Bio.Alphabet import Gapped, generic_dna
 >>> my_dna = UnknownSeq(20, Gapped(generic_dna, "-"))
 >>> my_dna
 UnknownSeq(20, alphabet = Gapped(DNAAlphabet(), '-'), character = 'N')
 >>> my_dna.ungap()
 UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
 >>> my_dna.ungap("-")
 UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')

If the UnknownSeq is using the gap character, then an empty Seq is returned:

 >>> my_gap = UnknownSeq(20, Gapped(generic_dna, "-"), character="-")
 >>> my_gap
 UnknownSeq(20, alphabet = Gapped(DNAAlphabet(), '-'), character = '-')
 >>> my_gap.ungap()
 Seq('', DNAAlphabet())
 >>> my_gap.ungap("-")
 Seq('', DNAAlphabet())

Notice that the returned sequence’s alphabet is adjusted to remove any explicit gap character declaration.
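
Since an unknown sequence holds a single repeated letter, removing gaps has only two possible outcomes. The sketch below models them on plain strings; ungap_run is a hypothetical helper and the alphabet adjustment is not modelled.

```python
def ungap_run(length, character, gap="-"):
    """Remove `gap` from `character` repeated `length` times."""
    if character == gap:
        # Every letter is a gap, so nothing is left.
        return ""
    # No letter is a gap, so the run is unchanged.
    return character * length


print(repr(ungap_run(20, "N")))
print(repr(ungap_run(20, "-")))
```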

Return an upper case copy of the sequence.

 >>> from Bio.Alphabet import generic_dna
 >>> from Bio.Seq import UnknownSeq
 >>> my_seq = UnknownSeq(20, generic_dna, character="n")
 >>> my_seq
 UnknownSeq(20, alphabet = DNAAlphabet(), character = 'n')
 >>> print(my_seq)
 nnnnnnnnnnnnnnnnnnnn
 >>> my_seq.upper()
 UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
 >>> print(my_seq.upper())
 NNNNNNNNNNNNNNNNNNNN

This will adjust the alphabet if required. See also the lower method.