Genome assembly algorithms through jigsaw puzzles - II

This is the second installment of “genome assembly algorithms through jigsaw puzzles”. We post them here every Tuesday. You can find all those pieces in one place (and some more) at this link.

We are developing this tutorial to explain genome assembly algorithms in a simple manner. In fact, rather than explaining, we expect you to discover the answer by manually solving a jigsaw puzzle. Later we show you how your solutions are related to the commonly used algorithms and their variations.

A simpler puzzle

If last week’s puzzle appears too complex, let us try an even simpler one. The following strings are taken from a sentence of length 100-200 letters. What is the sentence?

possible-copyin, tely-suggests-a, our-notice-that, ice-that-the-sp, ible-copying-me,
uggests-a-possi, -suggests-a-pos, anism-for-the-g, lated-immediate, ped-our-notice-,
ly-suggests-a-p, e-copying-mecha, ying-mechanism-, ts-a-possible-c, ossible-copying,
pying-mechanism, the-genetic-mat, -immediately-su, pecific-pairing, uggests-a-possi,
e-copying-mecha, copying-mechani, tely-suggests-a, aped-our-notice, immediately-sug,
ated-immediatel, ng-we-have-post, ing-mechanism-f, mediately-sugge, iately-suggests,
suggests-a-poss, ur-notice-that-, -have-postulate, ur-notice-that-, fic-pairing-we-,
-we-have-postul, e-genetic-mater, ostulated-immed, lated-immediate, e-genetic-mater,
ostulated-immed, ed-immediately-, ce-that-the-spe, notice-that-the, -pairing-we-hav,
ing-we-have-pos, -for-the-geneti, as-not-escaped-, t-has-not-escap, -postulated-imm,
e-postulated-im, t-the-specific-, ests-a-possible, -not-escaped-ou, immediately-sug,
e-that-the-spec, ossible-copying, iately-suggests, otice-that-the-, ng-we-have-post,
-pairing-we-hav, has-not-escaped, e-have-postulat, ely-suggests-a-, ely-suggests-a-,
e-specific-pair, ble-copying-mec, -genetic-materi, -our-notice-tha, mmediately-sugg,
pying-mechanism, -our-notice-tha, notice-that-the, caped-our-notic, tice-that-the-s,
-that-the-speci, e-copying-mecha, -copying-mechan, possible-copyin, ve-postulated-i,
ted-immediately, -postulated-imm, ng-mechanism-fo, specific-pairin, e-specific-pair,
ice-that-the-sp, -for-the-geneti, ulated-immediat, -the-specific-p, escaped-our-not,
ediately-sugges, at-the-specific, ped-our-notice-, -that-the-speci, ests-a-possible,
ggests-a-possib, ecific-pairing-, not-escaped-our, sm-for-the-gene, ve-postulated-i,
for-the-genetic, the-genetic-mat, ely-suggests-a-, ely-suggests-a-, -copying-mechan,
our-notice-that, ing-we-have-pos, ave-postulated-, stulated-immedi, -not-escaped-ou,
caped-our-notic, e-that-the-spec, a-possible-copy, notice-that-the, he-specific-pai,
e-have-postulat, ic-pairing-we-h, -possible-copyi, ng-we-have-post, It-has-not-esca,
r-the-genetic-m, the-genetic-mat, stulated-immedi, g-mechanism-for, sible-copying-m,
that-the-specif, e-specific-pair, has-not-escaped, ve-postulated-i, ic-pairing-we-h,
ur-notice-that-, ing-we-have-pos, -notice-that-th, ible-copying-me, otice-that-the-,
ve-postulated-i, -postulated-imm, iately-suggests, d-our-notice-th, cific-pairing-w,
diately-suggest, pairing-we-have, -postulated-imm, t-the-specific-, g-mechanism-for,
stulated-immedi, ur-notice-that-, a-possible-copy, e-postulated-im, anism-for-the-g,
or-the-genetic-, anism-for-the-g, we-have-postula, ely-suggests-a-, r-notice-that-t,
t-escaped-our-n, -copying-mechan, chanism-for-the, t-escaped-our-n, we-have-postula,
specific-pairin, r-notice-that-t, hanism-for-the-, ce-that-the-spe, copying-mechani,
pying-mechanism, for-the-genetic, ecific-pairing-, ostulated-immed, ly-suggests-a-p,
t-has-not-escap, we-have-postula, otice-that-the-, ng-we-have-post, has-not-escaped,
ice-that-the-sp, ble-copying-mec, tice-that-the-s, r-the-genetic-m, as-not-escaped-,
the-genetic-mat, e-specific-pair, immediately-sug, aped-our-notice, mediately-sugge,
ice-that-the-sp, has-not-escaped, c-pairing-we-ha, -the-genetic-ma, diately-suggest,
that-the-specif, mmediately-sugg, -copying-mechan, ying-mechanism-, pying-mechanism,
-for-the-geneti, e-postulated-im, g-we-have-postu, y-suggests-a-po, fic-pairing-we-,
hat-the-specifi, scaped-our-noti, pying-mechanism, specific-pairin, pying-mechanism,
as-not-escaped-, tulated-immedia, not-escaped-our, -copying-mechan, lated-immediate,
ce-that-the-spe, ism-for-the-gen, ulated-immediat, ng-we-have-post, chanism-for-the,
anism-for-the-g, ic-pairing-we-h, -genetic-materi, ble-copying-mec, diately-suggest,
possible-copyin, y-suggests-a-po, -possible-copyi, possible-copyin, pairing-we-have,
m-for-the-genet, postulated-imme, copying-mechani, ice-that-the-sp, sible-copying-m,
possible-copyin, ately-suggests-, ssible-copying-, opying-mechanis, escaped-our-not,

You will quickly find the large words - ‘possible’, ‘copying’, ‘immediately’, ‘specific’, ‘pairing’, ‘suggests’, ‘genetic’, ‘material’, ‘escaped’, ‘notice’, ‘postulated’. Sometimes the words are whole, but fragmented in most other cases.

Let us try to order those large words based on the overlaps between the fragmented words in different strings. If you try yourself, you will soon notice the following sequence of words -

  1. ‘escaped’, ‘our’, ‘notice’,
  2. ‘specific’, ‘pairing’
  3. ‘postulated’, ‘immediately’, ‘suggests’, ‘a’, ‘possible’, ‘copying’, ‘mechanism’
  4. ‘genetic’, ‘material’

A little more effort takes you to the following sentence from the famous paper of Watson and Crick -

It-has-not-escaped-our-notice-that-the-specific-pairing-we-have-postulated-immediately-suggests-a-possible-copying-mechanism-for-the-genetic-material"

We will use similar ideas in solving the original puzzle, and then connect them to the algorithmic concepts (de Bruijn graph, string graph, etc.) used for genome assembly.

Python model of a sequencing instrument

A code that starts from a long sentence and generates many small fragments is like a sequencing instrument. Let us build one using Python.

import random

seq="It-has-not-escaped-our-notice-that-the-specific-pairing-we-have-postulated-immediately-suggests-a-possible-copying-mechanism-for-the-genetic-material"

L=len(seq)

for i in range(350):
        pos=random.random()
        pos=int(L*pos)
        puzzle_piece=seq[pos:pos+15]
        if len(puzzle_piece)==15:
                print puzzle_piece

The above code uses Python library ‘random’ to generate a random number. It loops for 350 times, and each time creates a new random number as a position within the original sequence. A fifteen-letter word fragment is generated from that position, and printed if the length is indeed fifteen. This check is to remove all short words from the end of the sequence.


We will be continuing this series on every Tuesday. The fully assembled tutorial is available here.

Written by M. //