This is the third installment of “genome assembly algorithms through jigsaw puzzles”. We usually post them here every Tuesday, although we are late this week. You can find all those pieces in one place (and some more) at this link.
We are developing this tutorial to explain genome assembly algorithms in a simple manner. In fact, rather than explaining, we expect you to discover the answer by manually solving a jigsaw puzzle. Later we show you how your solutions are related to the commonly used algorithms and their variations.
Simplifying the Original Puzzle
You may have noticed that instead of solving the original puzzle, we took a big detour last week and chose to solve a different puzzle. Why did we do that?
I have often seen that those new to programming and those with experience attempt to solve a new differently. Beginners tend to go to the computer and start writing code, while the experienced ones turn off all electronic devices to scribble on paper. In their efforts, the later ones try to come up with a set of rules as the solution, and those collection of rules is called an algorithm.
How does one find a new algorithm? There are no set rules to do that, or should I say there is no algorithm to discover algorithms. Nonetheless, the practitioners usually find one method helpful, and that is to gain insights by solving simpler versions of the large or complex problem. For example, those writing genome assembly programs for human genomes first try to assemble E. coli genome or even smaller pieces of sequence. That is the rationale behind our last week’s discussion. In fact, our original puzzle itself is a simpler version of nucleotide assembly problem, where we made considerable simplification by using amino acids and removing all complications from two strands.
The first observation we can make from last week’s simpler puzzle is that the identical strings do not provide any additional information and can be safely removed. For example, the first string ‘possible-copyin’ is present four other times in the collection of the given strings. We can remove all occurrences except one and still lose nothing.
Going back to our original puzzle, the set shrinks from 185 words to only 95 words after removing duplicates. They are given below in the alphabetical order.
AYLMNKVVNVVEGKV, CDILCHWCKRNVGWK, CHWCKRNVGWKYLQS, CKRNVGWKYLQSSND, CRHCKTHLSSSFQII, DDQQYKEGKFILELK, DILCHWCKRNVGWKY, DQQYKEGKFILELKN, DYLVCDILCHWCKRN, DYRGRTGTAYLMNKV, EGKFILELKNICKCT, ENPLSSPSSSYKSIN, EQRRMLTGDYLVCDI, FHSQHRSQKNVSFIT, FITYGCRHCKTHLSS, FQIISRDYRGRTGTA, GDYLVCDILCHWCKR, GKVEQRRMLTGDYLV, GLRYSIYIENPLSSP, GRTGTAYLMNKVVNV, GTAYLMNKVVNVVEG, HCKTHLSSSFQIISR, HLSSSFQIISRDYRG, IISRDYRGRTGTAYL, ISRDYRGRTGTAYLM, ITYGCRHCKTHLSSS, IYIENPLSSPSSSYK, KEGKFILELKNICKC, KRNVGWKYLQSSNDD, KSINDPLFHSQHRSQ, KVEQRRMLTGDYLVC, KVVNVVEGKVEQRRM, KYLQSSNDDQQYKEG, LFHSQHRSQKNVSFI, LMNKVVNVVEGKVEQ, LQSSNDDQQYKEGKF, LSSPSSSYKSINDPL, LSSSFQIISRDYRGR, LTGDYLVCDILCHWC, LVCDILCHWCKRNVG, MGLRYSIYIENPLSS, MLTGDYLVCDILCHW, MNKVVNVVEGKVEQR, NDDQQYKEGKFILEL, NDPLFHSQHRSQKNV, NKVVNVVEGKVEQRR, NPLSSPSSSYKSIND, NVVEGKVEQRRMLTG, PLSSPSSSYKSINDP, PSSSYKSINDPLFHS, QHRSQKNVSFITYGC, QIISRDYRGRTGTAY, QKNVSFITYGCRHCK, QYKEGKFILELKNIC, RDYRGRTGTAYLMNK, RGRTGTAYLMNKVVN, RHCKTHLSSSFQIIS, RMLTGDYLVCDILCH, RRMLTGDYLVCDILC, RSQKNVSFITYGCRH, RTGTAYLMNKVVNVV, RYSIYIENPLSSPSS, SFITYGCRHCKTHLS, SINDPLFHSQHRSQK, SIYIENPLSSPSSSY, SPSSSYKSINDPLFH, SQHRSQKNVSFITYG, SQKNVSFITYGCRHC, SRDYRGRTGTAYLMN, SSFQIISRDYRGRTG, SSNDDQQYKEGKFIL, SSPSSSYKSINDPLF, SSSFQIISRDYRGRT, SSSYKSINDPLFHSQ, SSYKSINDPLFHSQH, SYKSINDPLFHSQHR, TAYLMNKVVNVVEGK, TGTAYLMNKVVNVVE, THLSSSFQIISRDYR, TYGCRHCKTHLSSSF, VCDILCHWCKRNVGW, VGWKYLQSSNDDQQY, VNVVEGKVEQRRMLT, VSFITYGCRHCKTHL, VVEGKVEQRRMLTGD, VVNVVEGKVEQRRML, WKYLQSSNDDQQYKE, YGCRHCKTHLSSSFQ, YKEGKFILELKNICK, YKSINDPLFHSQHRS, YLMNKVVNVVEGKVE, YLQSSNDDQQYKEGK, YLVCDILCHWCKRNV, YRGRTGTAYLMNKVV, YSIYIENPLSSPSSS,
To solve last week’s puzzle, we identified English words like ‘possible’, ‘copying’, ‘immediately’, ‘specific’ etc. and used them to reconstruct the original sentence. The same approach works here as well, except that we do not recognize the patterns as easily as from English words. However, you can definitely find them, if you look closely enough. Check the subword ‘AYLMNKVVNV’ for example. It is part of the first string and is also present in five other places.
How can we build a set of rules for solving the puzzle from the above observation, and how can we write a code to automatically do the assembly? We will cover this next week, and discover de Bruijn graph algorithm for genome assembly.
Another observation that can be made from last week’s exercise is that the large words like ‘immediately’ often get split between multiple fragments. Therefore, it is possible to use ‘immediately’ as an anchor to join two pieces into a super-word. By repeating the same exercise on other words or super-words, the entire sentence can be reconstructed.
Our original puzzle can be solved in a similar manner. You will find that the word ‘AYLMNKVVNV’ is split between ‘DYRGRTGTAYLMNKV’ and ‘LMNKVVNVVEGKVEQ’ in the collection shown above. Therefore, those two pieces can be merged into the super-word ‘DYRGRTGTAYLMNKVVNVVEGKVEQ’. In the week after the following one, we will generalize this observation to come up with string graph algorithm for genome assembly.
Are assembly algorithms that simple?
Yes and no. If you are struggling to understand various genome assembly papers, please note that many of them are written using a complex language that the computer scientists get trained to speak. Once you go beyond the complex symbols and try out the underlying algorithm on simple test cases, you will recognize them as fairly straightforward.
The real challenges of computer science comes from two aspects, as understood by asking the following two questions.
Is the simplification step removing information?
Will the solution derived from simplified case run fast enough, when applied to larger problem?
In later sections, we will go through plenty of examples to answer both questions in the context of assembly algorithms.