
References

We created this list for our own convenience. However, it took us some time to get all the pieces together, and we thought posting the full list here would help someone else go to sleep early. Wherever we could, we added short narration about the papers or the section (again, for our own convenience). There is no guarantee that the texts describe the papers accurately. Neither is there a guarantee that the texts describe the papers inaccurately.

The papers mentioned here relate to bioinformatics problems where no reference genome exists. We moved the papers on alignment, SNP calling, etc. into a separate list.

1. Pre-NGS Genome Assemblers

This section includes old assembly-related papers that we may need to cite from time to time. The first subgroup has links to base-calling and error correction programs such as phred, phrap and consed. The second subgroup has more sophisticated assemblers, but they mostly belong to the overlap-layout-consensus type. Those dinosaurs used to rule the world in the not too distant past. CAP3 and Celera are the only programs that we like. They are our pet dinosaurs. That does not mean the others are bad. We heard very good opinions about them from, well, the genome centers that nurture them.

Most programs are associated with one genome center or another. TIGR Assembler is from TIGR. ARACHNE belongs to Broad. Phusion is from the Sanger Institute. Atlas came from Baylor. Celera was written by J. C. Venter’s company, but is maintained by the bioinformatics group from Maryland.

Base-calling and error detection

Krawetz SA (1989) Sequence errors described in GenBank: a means to determine the accuracy of DNA sequence interpretation. Nucleic Acids Res. 17(10):3951-7. Link
Bonfield JK, Staden R (1995) The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23(8):1406-10. Link
Ewing B, Hillier L, Wendl M, Green P (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8:175-185. Link
Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8:186-194. Link
Gordon D, Abajian C, Green P (1998) Consed: a graphical tool for sequence finishing. Genome Research 8:195-202. Link

Assembler

Sutton GG, White O, Adams MD, Kerlavage AR (1995) TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science and Technology 1(1):9-19. Link
Huang X, Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res. 9(9):868-77. Link
Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, et al. (2000) A whole-genome assembly of Drosophila. Science 287(5461):2196-2204. Link
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES (2002) ARACHNE: A whole-genome shotgun assembler. Genome Res 12(1):177-189. Link
Mullikin JC, Ning ZM (2003) The phusion assembler. Genome Res 13(1):81-90. Link
Istrail S et al. (2004) Whole-Genome Shotgun Assembly and Comparison of Human Genome Assemblies. Proc. Nat. Acad. Sci. USA 101:1916-1921. Link
David B. Jaffe, Jonathan Butler, Sante Gnerre, Evan Mauceli, Kerstin Lindblad-Toh, Jill P. Mesirov, Michael C. Zody, and Eric S. Lander (2003) Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2. Genome Res. 13(1):91–96. Link
Havlak P, Chen R, Durbin KJ, Egan A, Ren YR, Song XZ, Weinstock GM, Gibbs RA (2004) The atlas genome assembly system. Genome Res 14(4):721-732. Link
Chapman JA, Ho I, Sunkara S, Luo S, Schroth GP, et al. (2011) Meraculous: De Novo Genome Assembly with Short Paired-End Reads. PLoS ONE 6(8): e23501. Link
Mihai Pop, Adam Phillippy, Arthur L. Delcher, Steven L. Salzberg (2004) Comparative Genome Assembly. Briefings in Bioinformatics 5(3):237-248. Link
R. Xia and A. Kim (2012) MERmaid: A Parallel Genome Assembler for the Cloud. Link
Weber JL, Myers EW (1997) Human whole-genome shotgun sequencing. Genome Res 7:401–409. Link

2. NGS Genome Assemblers

Non-de Bruijn, k-mer based

You need a video to understand this category (pay attention to the guys who jumped early).

Sundquist A, Ronaghi M, Tang HX, Pevzner P, Batzoglou S (2007) Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies. PLoS ONE 2(5). Link
Warren RL, Sutton GG, Jones SJM, Holt RA: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 2007, 23(4):500-501. Link
SHORTY (Chen and Skiena, 2007) [specialised in localising the use of paired-end reads.]
Daniel D Sommer, Arthur L Delcher, Steven L Salzberg and Mihai Pop (2007) Minimus: a fast, lightweight genome assembler. BMC Bioinformatics 2007, 8:64. Link

de Bruijn graph-based assemblers

Many articles on our blog explain this class of assemblers.

Idury RM, Waterman MS (1995) A new algorithm for DNA sequence assembly. J Comput Biol 2:291–306. Link
Pevzner, Pavel A.; Tang, Haixu (2001) Fragment Assembly with Double-Barreled Data. Bioinformatics/ISMB 1:1–9. Link
Chaisson M, Pevzner P, Tang H (2004) Fragment assembly with short reads. Bioinformatics 20:2067-74. Link
Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc. Nat. Acad. Sci. USA 2001, 98(17):9748-9753. Link
Zerbino DR, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 2008, 18(5):821-829. Link
Zerbino D: Genome assembly and comparison using de Bruijn graphs. Ph.D. Thesis, EBI, UK. Link
Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research 2008, 18(5):810-820. Link
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: A parallel assembler for short read sequence data. Genome Research 2009, 19(6):1117-1123. Link
Boisvert S, Laviolette F, Corbeil J (2010) Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol 17(11):1519-33. Epub 2010 Oct 20. Link

Applications

Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Research 20(2):265-272. Link
Li R, Fan W, Tian G, et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463(7279):311-317. Link
Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, et al. (2011) High-quality draft assemblies of mammalian genomes from massively parallel sequence data. P Natl Acad Sci USA 108(4):1513-1518. Link
Chitsaz H, Yee-Greenbaum J, Tesler G, Lombardo M, Dupont C, et al. (2011) Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol 29:915-21. Link
Rodrigue S, Malmstrom R, Berlin A, Birren B, Henn M, et al. (2009) Whole genome amplification and de novo assembly of single bacterial cells. PLoS One 4: e6864. Link
Schuster SC, Miller W, Ratan A, Tomsho LP, Giardine B, et al. (2010) Complete Khoisan and Bantu genomes from southern Africa. Nature 463:943–947. Link

Comparison

Deng HW, Lin Y, Li J, Shen H, Zhang L, Papasian CJ (2011) Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27(15):2031-2037. Link
Salzberg SL, Phillippy AM, Zimin AV, Puiu D, Magoc T, Koren S, Treangen T, Schatz MC, Delcher AL, Roberts M, et al. (2011) GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. Link
Zhang WY, Chen JJ, Yang Y, Tang YF, Shang J, Shen BR (2011) A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. PLoS ONE 6(3). Link
Magoč T, Salzberg SL (2011) FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies. Bioinformatics. Link
Salzberg SL, Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C (2004) Versatile and open software for comparing large genomes. Genome Biol 5(2). Link
D. Earl et al. (2011) Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research 21:2224-2241. Link

3. Exomes, Transcriptomes, Metagenomes and Highly Polymorphic Genomes

Transcriptome Assemblers

Robertson D, Schein J, Chiu R, Corbett R, Field M et al. (2010) De novo assembly and analysis of RNA-seq data. Nature Methods 7:909–912. Link
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA et al. (2011) Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol. 29(7):644-52. Link
Schulz M, Zerbino D, Vingron M, Birney E (2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28:1086-92. Link

Metagenomes

Namiki T, Hachiya T, Tanaka H, Sakakibara Y (2011) MetaVelvet: An extension of Velvet assembler to de novo metagenome assembly from short sequence reads. ACM Conference on Bioinformatics, Computational Biology and Biomedicine. Link
Peng Y, Leung H, Yiu S, Chin F (2011) Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics 27: i94-101. Link
Vaughn Iverson, Robert M. Morris, Christian D. Frazar, Chris T. Berthiaume, Rhonda L. Morales, E. Virginia Armbrust (2012) Untangling Genomes from Metagenomes: Revealing an Uncultured Class of Marine Euryarchaeota. Science 335(6068):587-590. Link
C. T. B. on scaling metagenome

Polymorphic Genomes

Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet 44:226-32. Link
S. Huang et al. (2012) HaploMerger: Reconstructing allelic relationships for polymorphic diploid genome assemblies. Genome Research. Link

Targeted Assembly

P. Peterlongo and R. Chikhi (2011) Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer. BMC Bioinformatics 2012, 13:48. Link
René L. Warren, Robert A. Holt (2011) Targeted Assembly of Short Sequence Reads. PLOS ONE. Link

4. Faster, better, cheaper

k-mer counting

Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13:422-426. Link
Marçais G, Kingsford C (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 2011, 27(6):764-770. Link
Melsted P, Pritchard JK (2011) Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12. Link
C. T. Brown, khmer. https://github.com/ctb/

Storage

Christley S, Lu Y, Li C, Xie X (2009) Human genomes as email attachments. Bioinformatics 25:274-275.
Conway TC, Bromage AJ (2011) Succinct data structures for assembling large genomes. Bioinformatics 27(4):479-486. Link
Fritz MH-Y, Leinonen R, Cochrane G, Birney E (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21:734-740.
Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18):3363-3369. Link
Pinho A, Pratas D, Garcia S (2012) GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res 40: e27. Link
Daniel C. Jones, W. L. Ruzzo, X. Peng, M. G. Katze (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research. Link https://github.com/dcjones/quip#readme
Rayan Chikhi and Guillaume Rizk (2012) Space-efficient and exact de Bruijn graph representation based on a Bloom Filter. Link
Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G (2011) Compressing genomic sequence fragments using SlimGene. J Comput Biol. (3):401-13. Link
Pell J et al. (2012) Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Nat. Acad. Sci. USA. Link
C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. http://arxiv.org/abs/1203.4802

Error correction

Kelley D, Schatz M, Salzberg S (2010) Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11(11):R116.
Medvedev P, Scott E, Kakaradov B, Pevzner P (2011) Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics 27: i137-41.
C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. http://arxiv.org/abs/1203.4802

Hadoop

Schatz M (2009) Cloudburst: highly sensitive read mapping with MapReduce. Bioinformatics 25:1363-9.
Contrail. http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail

Hardware-accelerators

Shi H, Schmidt B, Liu W, Müller-Wittig W (2010) A Parallel Algorithm for Error Correction in High-Throughput Short-Read Data on CUDA-Enabled Graphics Hardware. Journal of Computational Biology 17(4):603-615.
Liu Y, Schmidt B, Maskell DL (2011) Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics 12:354. Link

String graph assembler

Myers EW (2005) The fragment assembly string graph. Bioinformatics 21:79-85. Link
Simpson JT, Durbin R: Efficient de novo assembly of large genomes using compressed data structures. Genome Res 2011. Link
Simpson J, Durbin R (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26: i367-73. Link
Simpson J, Durbin R (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome Res 22:549-56. Link

Scaffolding

Zerbino DR, McEwen GK, Margulies EH, Birney E (2009) Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4: e8407. Link
Koren S, Treangen T, Pop M (2011) Bambus 2: scaffolding metagenomes. Bioinformatics 27:2964-71. Link
Alexey A. Gritsenko et al. (2012) GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. Bioinformatics 28(11):1429-1437. Abstract

Repeats

Zerbino DR, McEwen GK, Margulies EH, Birney E (2009) Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4: e8407. Link
Do H, Choi K, Preparata F, Sung W, Zhang L (2008) Spectrum-based de novo repeat detection in genomic sequences. J Comput Biol 15:469-87. Link
Novak P, Neumann P, Macas J (2010) Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11:378. Link
Gu W, Castoe T, Hedges D, Batzer M, Pollock D (2008) Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem 380:77-83. Link

6. Reviews, visions and IMs

Metzker M (2010) Sequencing technologies – the next generation. Nat Rev Genet 11:31-46. Link
Shendure J and Ji H (2008) Next-generation DNA sequencing. Nature Biotechnol. 26:1135–1145. Link
Phillippy AM, Schatz MC, Pop M (2008) Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 9(3). Link
Miller J, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data. Genomics 95:315-27. Link
Trapnell C, Salzberg S (2009) How to map billions of short reads onto genomes. Nat Biotechnol 27:455-7. Link
Stein L (2010) The case for cloud computing in genome informatics. Genome Biol 11:207. Link
Li Y, Hu Y, Bolund L, Wang J (2010) State of the art de novo assembly of human genomes from massively parallel sequencing data. Hum Genomics 4:271-7. Link
Compeau P, Pevzner P, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol. 29:987-91. Link
Nagarajan N, Pop M (2009) Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J Comput Biol 16:897-908. Link
Carl Kingsford, Michael C Schatz and Mihai Pop (2010) Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 2010, 11:21. Link
Pop M, Salzberg SL (2008) Bioinformatics challenges of new sequencing technology. Trends Genet 24:142–149. Link
Flicek P, Birney E (2009) Sense from sequence reads: methods for alignment and assembly. Nat Methods 6: S6–S12. Link (pdf), Link

