biopython v1.71.0 Bio.Seq.UnknownSeq
Read-only sequence object of known length but unknown contents.
If you have an unknown sequence, you can represent this with a normal Seq object, for example:
>>> from Bio.Seq import Seq, UnknownSeq
>>> my_seq = Seq("N"*5)
>>> my_seq
Seq('NNNNN', Alphabet())
>>> len(my_seq)
5
>>> print(my_seq)
NNNNN
However, this is rather wasteful of memory (especially for large sequences), which is where this class is most useful:
>>> unk_five = UnknownSeq(5)
>>> unk_five
UnknownSeq(5, alphabet = Alphabet(), character = '?')
>>> len(unk_five)
5
>>> print(unk_five)
?????
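The memory argument can be illustrated without Biopython. The following is a minimal sketch, not Biopython's implementation: `CompactUnknown` is a hypothetical stand-in that stores only a length and a character, and `sys.getsizeof` is used as a rough, shallow size measure.

```python
import sys


class CompactUnknown:
    """Hypothetical stand-in for UnknownSeq: stores a length and one character."""

    __slots__ = ("length", "character")

    def __init__(self, length, character="?"):
        self.length = length
        self.character = character

    def __len__(self):
        return self.length

    def __str__(self):
        # The full string is only built on demand.
        return self.character * self.length


n = 1_000_000
full_size = sys.getsizeof("N" * n)                    # grows with n
compact_size = sys.getsizeof(CompactUnknown(n, "N"))  # does not depend on n
```

Under these assumptions the compact object stays the same size however long the sequence is, while the string grows with every extra letter.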
You can add unknown sequences together, provided their alphabets and characters are compatible, and get another memory-saving UnknownSeq:
>>> unk_four = UnknownSeq(4)
>>> unk_four
UnknownSeq(4, alphabet = Alphabet(), character = '?')
>>> unk_four + unk_five
UnknownSeq(9, alphabet = Alphabet(), character = '?')
If the alphabet or characters don’t match up, the addition gives an ordinary Seq object:
>>> unk_nnnn = UnknownSeq(4, character = "N")
>>> unk_nnnn
UnknownSeq(4, alphabet = Alphabet(), character = 'N')
>>> unk_nnnn + unk_four
Seq('NNNN????', Alphabet())
Combining with a real Seq gives a new Seq object:
>>> known_seq = Seq("ACGT")
>>> unk_four + known_seq
Seq('????ACGT', Alphabet())
>>> known_seq + unk_four
Seq('ACGT????', Alphabet())
Summary
Functions
__add__ - Add another sequence or string to this sequence
__getitem__ - Get a subsequence from the UnknownSeq object
__init__ - Create a new UnknownSeq object
__len__ - Return the stated length of the unknown sequence
__radd__ - Add a sequence on the left
__repr__ - Return (truncated) representation of the sequence for debugging
__str__ - Return the unknown sequence as full string of the given length
back_transcribe - Return an unknown DNA sequence from an unknown RNA sequence
complement - Return the complement of an unknown sequence (an unknown nucleotide is its own complement)
count - Return a non-overlapping count, like that of a Python string
count_overlap - Return an overlapping count
lower - Return a lower case copy of the sequence
reverse_complement - Return the reverse complement of an unknown sequence
transcribe - Return an unknown RNA sequence from an unknown DNA sequence
translate - Translate an unknown nucleotide sequence into an unknown protein
ungap - Return a copy of the sequence without the gap character(s)
upper - Return an upper case copy of the sequence
Functions
Add another sequence or string to this sequence.
Adding two UnknownSeq objects returns another UnknownSeq object provided the character is the same and the alphabets are compatible.
>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import generic_protein
>>> UnknownSeq(10, generic_protein) + UnknownSeq(5, generic_protein)
UnknownSeq(15, alphabet = ProteinAlphabet(), character = 'X')
If the characters differ, an UnknownSeq object cannot be used, so a Seq object is returned:
>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import generic_protein
>>> UnknownSeq(10, generic_protein) + UnknownSeq(5, generic_protein,
... character="x")
Seq('XXXXXXXXXXxxxxx', ProteinAlphabet())
If adding a string to an UnknownSeq, a new Seq is returned with the same alphabet:
>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import generic_protein
>>> UnknownSeq(5, generic_protein) + "LV"
Seq('XXXXXLV', ProteinAlphabet())
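The addition rule can be summarised in a few lines of plain Python. This is only a sketch of the rule as described here, not Biopython's code: `add_unknown` is a hypothetical helper, sequences are modelled as `(length, character)` pairs, and alphabet compatibility is ignored entirely.

```python
def add_unknown(left, right):
    """Combine two (length, character) pairs.

    Returns a (length, character) pair when the characters match,
    otherwise the fully expanded string. Alphabet checks are omitted.
    """
    left_len, left_char = left
    right_len, right_char = right
    if left_char == right_char:
        # Same character: the result can stay in compact form.
        return (left_len + right_len, left_char)
    # Different characters: the result has to be spelled out.
    return left_char * left_len + right_char * right_len


same = add_unknown((10, "X"), (5, "X"))
mixed = add_unknown((10, "X"), (5, "x"))
```

Here `same` stays compact, while `mixed` has to fall back to an explicit string, mirroring the UnknownSeq versus Seq results in the examples.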
Get a subsequence from the UnknownSeq object.
>>> unk = UnknownSeq(8, character="N")
>>> print(unk[:])
NNNNNNNN
>>> print(unk[5:3])
<BLANKLINE>
>>> print(unk[1:-1])
NNNNNN
>>> print(unk[1:-1:2])
NNN
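Because every position holds the same character, only the length of a slice matters. One way to get that length without building the full string is sketched below; `sliced_length` is a hypothetical helper, and whether Biopython computes the length this way is not stated here.

```python
def sliced_length(length, index):
    """Number of positions a slice selects from a sequence of `length`."""
    # slice.indices() normalises negative and missing bounds for this length.
    return len(range(*index.indices(length)))


whole = sliced_length(8, slice(None))         # unk[:]
empty = sliced_length(8, slice(5, 3))         # unk[5:3]
inner = sliced_length(8, slice(1, -1))        # unk[1:-1]
stepped = sliced_length(8, slice(1, -1, 2))   # unk[1:-1:2]
```

The four values correspond to the lengths of the four slices printed in the example.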
Create a new UnknownSeq object.
If character is omitted, it is determined from the alphabet: "N" for nucleotides, "X" for proteins, and "?" otherwise.
Return the stated length of the unknown sequence.
Add a sequence on the left.
Return (truncated) representation of the sequence for debugging.
Return the unknown sequence as full string of the given length.
Return an unknown DNA sequence from an unknown RNA sequence.
>>> my_rna = UnknownSeq(20, character="N")
>>> my_rna
UnknownSeq(20, alphabet = Alphabet(), character = 'N')
>>> print(my_rna)
NNNNNNNNNNNNNNNNNNNN
>>> my_dna = my_rna.back_transcribe()
>>> my_dna
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
>>> print(my_dna)
NNNNNNNNNNNNNNNNNNNN
Return the complement of an unknown sequence; the complement of an unknown nucleotide equals itself.
>>> my_nuc = UnknownSeq(8)
>>> my_nuc
UnknownSeq(8, alphabet = Alphabet(), character = '?')
>>> print(my_nuc)
????????
>>> my_nuc.complement()
UnknownSeq(8, alphabet = Alphabet(), character = '?')
>>> print(my_nuc.complement())
????????
Return a non-overlapping count, like that of a Python string.
This behaves like the Python string (and Seq object) method of the same name, which does a non-overlapping count!
For an overlapping search use the newer count_overlap() method.
Returns an integer, the number of occurrences of substring argument sub in the (sub)sequence given by [start:end]. Optional arguments start and end are interpreted as in slice notation.
Arguments:
- sub - a string or another Seq object to look for
- start - optional integer, slice start
- end - optional integer, slice end
>>> "NNNN".count("N")
4
>>> Seq("NNNN").count("N")
4
>>> UnknownSeq(4, character="N").count("N")
4
>>> UnknownSeq(4, character="N").count("A")
0
>>> UnknownSeq(4, character="N").count("AA")
0
HOWEVER, please note that because Python strings and Seq objects (and MutableSeq objects) do a non-overlapping search, this may not give the answer you expect:
>>> UnknownSeq(4, character="N").count("NN")
2
>>> UnknownSeq(4, character="N").count("NNN")
1
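For a sequence made of a single repeated character, the non-overlapping count reduces to integer division. The sketch below shows that arithmetic under stated assumptions: `count_nonoverlapping` is a hypothetical helper, `character` is a single letter, `sub` is non-empty, and the optional start and end arguments are left out.

```python
def count_nonoverlapping(length, character, sub):
    """Non-overlapping count of `sub` in `character * length`."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    if set(sub) != {character}:
        # Any other letter in `sub` means it can never match.
        return 0
    # Each match consumes len(sub) positions.
    return length // len(sub)


pairs = count_nonoverlapping(4, "N", "NN")
triples = count_nonoverlapping(4, "N", "NNN")
```

The results agree with what `str.count` gives on the expanded string, which is the behaviour this method is described as following.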
Return an overlapping count.
For a non-overlapping search use the count() method.
Returns an integer, the number of occurrences of substring argument sub in the (sub)sequence given by [start:end]. Optional arguments start and end are interpreted as in slice notation.
Arguments:
- sub - a string or another Seq object to look for
- start - optional integer, slice start
- end - optional integer, slice end
e.g.
>>> from Bio.Seq import UnknownSeq
>>> UnknownSeq(4, character="N").count_overlap("NN")
3
>>> UnknownSeq(4, character="N").count_overlap("NNN")
2
Where substrings do not overlap, count_overlap() should behave the same as the count() method:
>>> UnknownSeq(4, character="N").count_overlap("N")
4
>>> UnknownSeq(4, character="N").count_overlap("N") == UnknownSeq(4, character="N").count("N")
True
>>> UnknownSeq(4, character="N").count_overlap("A")
0
>>> UnknownSeq(4, character="N").count_overlap("A") == UnknownSeq(4, character="N").count("A")
True
>>> UnknownSeq(4, character="N").count_overlap("AA")
0
>>> UnknownSeq(4, character="N").count_overlap("AA") == UnknownSeq(4, character="N").count("AA")
True
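The overlapping count on a single-character sequence also has a closed form: a match can start at every position that leaves room for the whole substring. The following sketch assumes a single-letter `character` and a non-empty `sub`, omits start and end, and uses hypothetical helper names; it is not taken from Biopython.

```python
def count_overlapping(length, character, sub):
    """Overlapping count of `sub` in `character * length`."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    if set(sub) != {character}:
        return 0
    # One match per start position that still fits the substring.
    return max(0, length - len(sub) + 1)


def brute_force_overlap(text, sub):
    """Reference implementation: test every start position."""
    return sum(text.startswith(sub, i) for i in range(len(text) - len(sub) + 1))


pairs = count_overlapping(4, "N", "NN")
triples = count_overlapping(4, "N", "NNN")
```

The brute-force helper is included only as a cross-check on the closed form.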
Return a lower case copy of the sequence.
This will adjust the alphabet if required:
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import UnknownSeq
>>> my_seq = UnknownSeq(20, IUPAC.extended_protein)
>>> my_seq
UnknownSeq(20, alphabet = ExtendedIUPACProtein(), character = 'X')
>>> print(my_seq)
XXXXXXXXXXXXXXXXXXXX
>>> my_seq.lower()
UnknownSeq(20, alphabet = ProteinAlphabet(), character = 'x')
>>> print(my_seq.lower())
xxxxxxxxxxxxxxxxxxxx
See also the upper method.
Return the reverse complement of an unknown sequence.
The reverse complement of an unknown nucleotide equals itself:
>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import generic_dna
>>> example = UnknownSeq(6, generic_dna)
>>> print(example)
NNNNNN
>>> print(example.reverse_complement())
NNNNNN
Return an unknown RNA sequence from an unknown DNA sequence.
>>> my_dna = UnknownSeq(10, character="N")
>>> my_dna
UnknownSeq(10, alphabet = Alphabet(), character = 'N')
>>> print(my_dna)
NNNNNNNNNN
>>> my_rna = my_dna.transcribe()
>>> my_rna
UnknownSeq(10, alphabet = RNAAlphabet(), character = 'N')
>>> print(my_rna)
NNNNNNNNNN
Translate an unknown nucleotide sequence into an unknown protein.
e.g.
>>> my_seq = UnknownSeq(9, character="N")
>>> print(my_seq)
NNNNNNNNN
>>> my_protein = my_seq.translate()
>>> my_protein
UnknownSeq(3, alphabet = ProteinAlphabet(), character = 'X')
>>> print(my_protein)
XXX
In comparison, using a normal Seq object:
>>> my_seq = Seq("NNNNNNNNN")
>>> print(my_seq)
NNNNNNNNN
>>> my_protein = my_seq.translate()
>>> my_protein
Seq('XXX', ExtendedIUPACProtein())
>>> print(my_protein)
XXX
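Since the result is again an unknown sequence, the only thing translation needs to work out is the new length: one amino acid per three nucleotides. A minimal sketch of that arithmetic follows; `translated_length` is a hypothetical helper, and it assumes that an incomplete trailing codon is simply dropped, which the examples here do not show.

```python
def translated_length(nucleotide_length):
    """Protein length for an unknown nucleotide sequence of the given length."""
    # Assumption: leftover bases that do not form a full codon are ignored.
    return nucleotide_length // 3


protein_length = translated_length(9)
protein = "X" * protein_length
```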
Return a copy of the sequence without the gap character(s).
The gap character can be specified in two ways - either as an explicit argument, or via the sequence’s alphabet. For example:
>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import Gapped, generic_dna
>>> my_dna = UnknownSeq(20, Gapped(generic_dna, "-"))
>>> my_dna
UnknownSeq(20, alphabet = Gapped(DNAAlphabet(), '-'), character = 'N')
>>> my_dna.ungap()
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
>>> my_dna.ungap("-")
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
If the UnknownSeq is using the gap character, then an empty Seq is returned:
>>> my_gap = UnknownSeq(20, Gapped(generic_dna, "-"), character="-")
>>> my_gap
UnknownSeq(20, alphabet = Gapped(DNAAlphabet(), '-'), character = '-')
>>> my_gap.ungap()
Seq('', DNAAlphabet())
>>> my_gap.ungap("-")
Seq('', DNAAlphabet())
Notice that the returned sequence’s alphabet is adjusted to remove any explicit gap character declaration.
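The two ungap cases come down to one comparison. The sketch below models only the resulting length; `ungapped_length` is a hypothetical helper, and the alphabet adjustment described above is not modelled.

```python
def ungapped_length(length, character, gap="-"):
    """Length left after removing `gap` from `character * length`."""
    if character == gap:
        # The whole sequence is gap characters, so nothing remains.
        return 0
    # Otherwise the sequence contains no gaps and is unchanged.
    return length


kept = ungapped_length(20, "N")
removed = ungapped_length(20, "-")
```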
Return an upper case copy of the sequence.
>>> from Bio.Alphabet import generic_dna
>>> from Bio.Seq import UnknownSeq
>>> my_seq = UnknownSeq(20, generic_dna, character="n")
>>> my_seq
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'n')
>>> print(my_seq)
nnnnnnnnnnnnnnnnnnnn
>>> my_seq.upper()
UnknownSeq(20, alphabet = DNAAlphabet(), character = 'N')
>>> print(my_seq.upper())
NNNNNNNNNNNNNNNNNNNN
This will adjust the alphabet if required. See also the lower method.