Genome assembly algorithms through jigsaw puzzles

We are developing this tutorial to explain genome assembly algorithms in a simple manner. In fact, rather than explaining, we expect you to discover the answer by manually solving a jigsaw puzzle. Later we show you how your solutions are related to the commonly used algorithms and their variations.

There are two reasons behind our developing this tutorial.

First, we like to show you the inner workings of the assembly algorithms. That way you can intelligently use the kmer-based tools to get more insight into the data. At present many users of the genome and transcriptome assembly programs run them as black boxes. Simple explanations of de Bruijn graph-based assemblers are available in some books and tutorials, but they do not go far enough to help you construct the assemblers from scratch. When it comes to long reads, hardly any simple tutorial exists. Our puzzle will help you understand (discover) both de Bruijn graph and string graph algorithms.

That brings us to the second point, and it is especially important for those learning to code (for example in Python). ‘Coding’ is often presented as a panacea, but coming up with algorithms is more important than coding itself. When it comes to algorithms, textbooks usually explain what already exists and not how to come up with new ones. Our example will show how to come up with the algorithms by manually solving (or trying to solve) the problem first and then replacing the most laborious steps one by one with code. In our opinion, this approach of ‘thinking using computer’ is more effective for genomic data, but is quite different from how bioinformatics is practiced today.

Solve a jigsaw puzzle

The puzzle in this chapter is a simplified version of the typical assembly problems in bioinformatics.

Please try to solve it yourself without looking at the solution, because that will help you in understanding various aspects of assembly algorithms. We will refer to it later in our discussions.

The problem is fairly staightforward and comparable to a jigsaw puzzle with less than 100 pieces. We expect you to solve it without help, but in case you need it, you may check the hint section. We also provide the solution and a bonus problem in a later section.

The following short words are taken randomly different parts of a 150-200 character long word. Every part of the original word got sampled multiple times in the set below, and that is why you see many duplicates and overlaps between the short words. Can you reconstruct the original long word from these pieces?

GRTGTAYLMNKVVNV, SQHRSQKNVSFITYG, SPSSSYKSINDPLFH, VCDILCHWCKRNVGW, DDQQYKEGKFILELK, 
TYGCRHCKTHLSSSF, FITYGCRHCKTHLSS, QYKEGKFILELKNIC, SSNDDQQYKEGKFIL, IISRDYRGRTGTAYL, 
LSSSFQIISRDYRGR, SSPSSSYKSINDPLF, KYLQSSNDDQQYKEG, YKEGKFILELKNICK, TYGCRHCKTHLSSSF, 
YKEGKFILELKNICK, SSSFQIISRDYRGRT, LVCDILCHWCKRNVG, VNVVEGKVEQRRMLT, SSSYKSINDPLFHSQ, 
SYKSINDPLFHSQHR, HLSSSFQIISRDYRG, LMNKVVNVVEGKVEQ, RDYRGRTGTAYLMNK, KRNVGWKYLQSSNDD, 
YSIYIENPLSSPSSS, RHCKTHLSSSFQIIS, GDYLVCDILCHWCKR, SSFQIISRDYRGRTG, QYKEGKFILELKNIC, 
SSSYKSINDPLFHSQ, VVNVVEGKVEQRRML, THLSSSFQIISRDYR, RMLTGDYLVCDILCH, PLSSPSSSYKSINDP, 
SSFQIISRDYRGRTG, THLSSSFQIISRDYR, YLMNKVVNVVEGKVE, ITYGCRHCKTHLSSS, SFITYGCRHCKTHLS, 
VVNVVEGKVEQRRML, SSNDDQQYKEGKFIL, TGTAYLMNKVVNVVE, ITYGCRHCKTHLSSS, DYLVCDILCHWCKRN, 
QIISRDYRGRTGTAY, DYLVCDILCHWCKRN, TGTAYLMNKVVNVVE, DQQYKEGKFILELKN, GKVEQRRMLTGDYLV, 
WKYLQSSNDDQQYKE, SSNDDQQYKEGKFIL, DILCHWCKRNVGWKY, TYGCRHCKTHLSSSF, MLTGDYLVCDILCHW, 
DDQQYKEGKFILELK, YLVCDILCHWCKRNV, YLQSSNDDQQYKEGK, EGKFILELKNICKCT, PLSSPSSSYKSINDP, 
YLVCDILCHWCKRNV, RMLTGDYLVCDILCH, QYKEGKFILELKNIC, CRHCKTHLSSSFQII, NKVVNVVEGKVEQRR, 
EQRRMLTGDYLVCDI, SSSFQIISRDYRGRT, YRGRTGTAYLMNKVV, SRDYRGRTGTAYLMN, SPSSSYKSINDPLFH, 
GKVEQRRMLTGDYLV, VSFITYGCRHCKTHL, SIYIENPLSSPSSSY, KSINDPLFHSQHRSQ, NPLSSPSSSYKSIND, 
SQKNVSFITYGCRHC, GTAYLMNKVVNVVEG, FHSQHRSQKNVSFIT, AYLMNKVVNVVEGKV, YKSINDPLFHSQHRS, 
FHSQHRSQKNVSFIT, RRMLTGDYLVCDILC, RMLTGDYLVCDILCH, TAYLMNKVVNVVEGK, CHWCKRNVGWKYLQS, 
TAYLMNKVVNVVEGK, KRNVGWKYLQSSNDD, SYKSINDPLFHSQHR, KVEQRRMLTGDYLVC, SINDPLFHSQHRSQK, 
SQHRSQKNVSFITYG, EGKFILELKNICKCT, MLTGDYLVCDILCHW, VGWKYLQSSNDDQQY, SSPSSSYKSINDPLF, 
CKRNVGWKYLQSSND, RGRTGTAYLMNKVVN, LMNKVVNVVEGKVEQ, FQIISRDYRGRTGTA, LFHSQHRSQKNVSFI, 
SSSFQIISRDYRGRT, AYLMNKVVNVVEGKV, DYRGRTGTAYLMNKV, MNKVVNVVEGKVEQR, MGLRYSIYIENPLSS, 
KSINDPLFHSQHRSQ, LSSPSSSYKSINDPL, NDPLFHSQHRSQKNV, PSSSYKSINDPLFHS, NDDQQYKEGKFILEL, 
CKRNVGWKYLQSSND, SSFQIISRDYRGRTG, VCDILCHWCKRNVGW, RDYRGRTGTAYLMNK, QKNVSFITYGCRHCK, 
LSSPSSSYKSINDPL, LSSSFQIISRDYRGR, KVVNVVEGKVEQRRM, QKNVSFITYGCRHCK, YGCRHCKTHLSSSFQ, 
FITYGCRHCKTHLSS, HCKTHLSSSFQIISR, WKYLQSSNDDQQYKE, CDILCHWCKRNVGWK, RTGTAYLMNKVVNVV, 
SPSSSYKSINDPLFH, SSSFQIISRDYRGRT, SYKSINDPLFHSQHR, TAYLMNKVVNVVEGK, ENPLSSPSSSYKSIN, 
GLRYSIYIENPLSSP, SIYIENPLSSPSSSY, DILCHWCKRNVGWKY, KSINDPLFHSQHRSQ, KEGKFILELKNICKC, 
LFHSQHRSQKNVSFI, YKEGKFILELKNICK, KEGKFILELKNICKC, LSSPSSSYKSINDPL, YLQSSNDDQQYKEGK, 
SSSYKSINDPLFHSQ, NDDQQYKEGKFILEL, LMNKVVNVVEGKVEQ, VNVVEGKVEQRRMLT, ISRDYRGRTGTAYLM, 
RDYRGRTGTAYLMNK, LSSSFQIISRDYRGR, SSPSSSYKSINDPLF, NVVEGKVEQRRMLTG, VGWKYLQSSNDDQQY, 
YGCRHCKTHLSSSFQ, YLVCDILCHWCKRNV, YKEGKFILELKNICK, LQSSNDDQQYKEGKF, GDYLVCDILCHWCKR, 
RDYRGRTGTAYLMNK, VVEGKVEQRRMLTGD, YKSINDPLFHSQHRS, KVVNVVEGKVEQRRM, YSIYIENPLSSPSSS, 
WKYLQSSNDDQQYKE, QHRSQKNVSFITYGC, QHRSQKNVSFITYGC, YLQSSNDDQQYKEGK, SSYKSINDPLFHSQH, 
SFITYGCRHCKTHLS, VCDILCHWCKRNVGW, LTGDYLVCDILCHWC, TGTAYLMNKVVNVVE, RDYRGRTGTAYLMNK, 
SSNDDQQYKEGKFIL, EQRRMLTGDYLVCDI, RYSIYIENPLSSPSS, DILCHWCKRNVGWKY, YGCRHCKTHLSSSFQ, 
IYIENPLSSPSSSYK, RSQKNVSFITYGCRH, DQQYKEGKFILELKN, TGTAYLMNKVVNVVE, QHRSQKNVSFITYGC, 
SSYKSINDPLFHSQH, NPLSSPSSSYKSIND, YKSINDPLFHSQHRS, EQRRMLTGDYLVCDI, DQQYKEGKFILELKN

Hint

  1. Try to solve the problem manually before writing any code.
  2. Look for common subwords in the above set and see whether you can join the words based on those common subwords.

We will be continuing this series on every Tuesday. The complete tutorial is available here.

Written by M. //