Building an NGS Reference List (de novo assembly category)
Did we miss any important category/paper?
Not all t’s are crossed and i’s are dotted yet. We will also add hyperlinks soon.
We created this list for our own convenience. However, it took us some time to get all the pieces together, and we thought posting the full list here would help someone else go to sleep early. Wherever we could, we added short narration about the papers or the section (again for our own convenience). There is no guarantee that the texts describe the papers accurately. Neither is there guarantee that the texts describe the papers inaccurately.
The papers listed here address bioinformatics problems that arise when no reference genome exists. Papers on alignment, SNP calling, etc. are kept in a separate list.
1. Pre-NGS Genome Assemblers
This section includes old assembly-related papers that we may need to cite from time to time. The first subgroup has links to base-calling and error correction programs such as phred, phrap and consed. The second subgroup has more sophisticated assemblers, but they mostly belong to the overlap-layout-consensus type. Those dinosaurs used to rule over the world in the not too distant past.
CAP3 and Celera are the only programs that we like. They are our pet dinosaurs. That does not mean the others are bad. We heard very good opinions about them from, well, the genome centers that nurture them. Most programs are associated with one genome center or another. TIGR Assembler is from TIGR. ARACHNE belongs to Broad. Phusion is from the Sanger Institute. Atlas came from Baylor. Celera was written by J. C. Venter’s company, but is maintained by the bioinformatics group from Maryland.
Base-calling and error detection
Krawetz SA (1989) Sequence errors described in GenBank: a means to determine the accuracy of DNA sequence interpretation. Nucleic Acids Res. 17(10):3951-7. Link
Bonfield JK, Staden R (1995) The application of numerical estimates of base calling accuracy to DNA sequencing projects. Nucleic Acids Res. 23(8):1406-10. Link
Ewing B, Hillier L, Wendl M, Green P: Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8:175-185 (1998). Link
Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research 8:186-194 (1998). Link
Gordon D, Abajian C, Green P: (1998) Consed: a graphical tool for sequence finishing. Genome Research 8:195-202. Link
Assembler
Sutton, G. G., White, O., Adams, M. D., Kerlavage, A. R. (1995) TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Science and Technology. 1(1): 9-19. Link
Huang X, Madan A, (1999) CAP3: A DNA sequence assembly program. Genome Res. 9(9):868-77. Link
Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, et al.: (2000) A whole-genome assembly of Drosophila. Science 287(5461):2196-2204. Link
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES: (2002) ARACHNE: A whole-genome shotgun assembler. Genome Res 12(1):177-189. Link
Mullikin JC, Ning ZM: (2003) The phusion assembler. Genome Res 13(1):81-90. Link
Istrail, S. et al. (2004) Whole-Genome Shotgun Assembly and Comparison of Human Genome Assemblies. Proc. Nat. Acad. Sci. USA 101:1916-1921. Link
David B. Jaffe, Jonathan Butler, Sante Gnerre, Evan Mauceli, Kerstin Lindblad-Toh, Jill P. Mesirov, Michael C. Zody, and Eric S. Lander (2003) Whole-Genome Sequence Assembly for Mammalian Genomes: Arachne 2. Genome Res. 13(1):
Havlak P, Chen R, Durbin KJ, Egan A, Ren YR, Song XZ, Weinstock GM, Gibbs RA (2004) The atlas genome assembly system. Genome Res 14(4):721-732. Link
Chapman JA, Ho I, Sunkara S, Luo S, Schroth GP, et al. (2011) Meraculous: De Novo Genome Assembly with Short Paired-End Reads. PLoS ONE 6(8): e23501. Link
Mihai Pop, Adam Phillippy, Arthur L. Delcher, Steven L. Salzberg (2004) Comparative Genome Assembly, Briefings in Bioinformatics 5 (3):237-248. Link
R. Xia and A. Kim (2012) MERmaid: A Parallel Genome Assembler for the Cloud. Link
Weber JL, Myers EW (1997) Human whole-genome shotgun sequencing. Genome Res 7: 401-409. Link
2. NGS Genome Assemblers
non de Bruijn, k-mer based
You need a video to understand this category (pay attention to the guys who jumped early).
Sundquist A, Ronaghi M, Tang HX, Pevzner P, Batzoglou S: (2007) Whole-Genome Sequencing and Assembly with High-Throughput, Short-Read Technologies. PLoS ONE 2(5). Link
Warren RL, Sutton GG, Jones SJM, Holt RA: Assembling millions of short DNA sequences using SSAKE. Bioinformatics 2007, 23(4):500-501. Link
SHORTY (Chen and Skiena, 2007) [specialised in localising the use of paired-end reads.]
Daniel D Sommer, Arthur L Delcher, Steven L Salzberg and Mihai Pop (2007) Minimus: a fast, lightweight genome assembler. BMC Bioinformatics 8:64. Link
de Bruijn graph-based assemblers
Many articles in our blog explained this class of assemblers.
Idury RM, Waterman MS (1995) A new algorithm for DNA sequence assembly. J Comput Biol 2: 291-306. Link
Pevzner, Pavel A.; Tang, Haixu (2001). Fragment Assembly with Double-Barreled Data. Bioinformatics/ISMB 1: 19. Link
Chaisson M, Pevzner P, Tang H (2004) Fragment assembly with short reads. Bioinformatics 20: 2067-74. Link
Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proc. Nat. Acad. Sci. USA 2001, 98(17):9748-9753. Link
Zerbino DR, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 2008, 18(5):821-829. Link
Zerbino, D., Genome assembly and comparison using de Bruijn graphs. Ph.D. Thesis, EBI, UK. Link
Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB: ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Research 2008, 18(5):810-820. Link
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: A parallel assembler for short read sequence data. Genome Research 2009, 19(6):1117-1123. Link
Boisvert S, Laviolette F, Corbeil J (2010) Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol 17(11):1519-33. Link
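For readers who want something concrete before opening the papers, here is a minimal sketch of the idea this subgroup shares: every k-mer in a read becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix. The function name `build_de_bruijn`, the toy reads and the choice of k=4 are ours and purely illustrative. The assemblers listed above add error correction, reverse complements, graph simplification and much more on top of this.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Return {prefix: {suffix: count}} with one edge per k-mer occurrence."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the (k-1)-mer prefix to the (k-1)-mer suffix.
            graph[kmer[:-1]][kmer[1:]] += 1
    return graph

# Toy reads (hypothetical); they overlap by five bases.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=4)

for prefix in sorted(graph):
    for suffix, count in sorted(graph[prefix].items()):
        print(prefix, "->", suffix, "x", count)
```

Edges seen in both reads get a count of 2, which is the raw material for the coverage-based cleanup steps described in the papers.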
Applications
Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Research 20(2):265-272. Link
Li R, Fan W, Tian G, et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463(7279):311-317. Link
Gnerre S, MacCallum I, Przybylski D, Ribeiro FJ, Burton JN, Walker BJ, Sharpe T, Hall G, Shea TP, Sykes S, et al. (2011) High-quality draft assemblies of mammalian genomes from massively parallel sequence data. P Natl Acad Sci USA 108(4):1513-1518. Link
Chitsaz H, Yee-Greenbaum J, Tesler G, Lombardo M, Dupont C, et al. (2011) Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol 29: 915-21. Link
Rodrigue S, Malmstrom R, Berlin A, Birren B, Henn M, et al. (2009) Whole genome amplification and de novo assembly of single bacterial cells. PLoS One 4: e6864. Link
Schuster SC, Miller W, Ratan A, Tomsho LP, Giardine B, et al. (2010) Complete Khoisan and Bantu genomes from southern Africa. Nature 463:
Comparison
Deng HW, Lin Y, Li J, Shen H, Zhang L, Papasian CJ (2011) Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27(15):2031-2037. Link
Salzberg SL, Phillippy AM, Zimin AV, Puiu D, Magoc T, Koren S, Treangen T, Schatz MC, Delcher AL, Roberts M, et al. (2011) GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. Link
Zhang WY, Chen JJ, Yang Y, Tang YF, Shang J, Shen BR (2011) A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies. PLoS ONE 6(3). Link
Magoč T, Salzberg SL (2011) FLASH: Fast Length Adjustment of Short Reads to Improve Genome Assemblies. Bioinformatics. Link
Salzberg SL, Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C (2004) Versatile and open software for comparing large genomes. Genome Biol 5(2). Link
D. Earl et al. (2011) Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research 21:2224-2241. Link
3. Exomes, Transcriptomes, Metagenomes and Highly Polymorphic Genomes
Transcriptome Assemblers
Robertson D, Schein J, Chiu R, Corbett R, Field M et al. (2010) De novo assembly and analysis of RNA-seq data. Nature Methods 7, 909-912. Link
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA et al. (2011) Full-length transcriptome assembly from RNA-seq data without a reference genome. Nat Biotechnol. 29(7):644-52. Link
Schulz M, Zerbino D, Vingron M, Birney E (2012) Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28: 1086-92. Link
Metagenomes
Namiki T, Hachiya T, Tanaka H, Sakakibara Y (2011) MetaVelvet: An extension of Velvet assembler to de novo metagenome assembly from short sequence reads. ACM Conference on Bioinformatics, Computational Biology and Biomedicine. Link
Peng Y, Leung H, Yiu S, Chin F (2011) Meta-IDBA: a de novo assembler for metagenomic data. Bioinformatics 27: i94-101. Link
Vaughn Iverson, Robert M. Morris, Christian D. Frazar, Chris T. Berthiaume, Rhonda L. Morales, E. Virginia Armbrust (2012) Untangling Genomes from Metagenomes: Revealing an Uncultured Class of Marine Euryarchaeota, Science 335(6068):587-590. Link
C. T. B. on scaling metagenome
Polymorphic Genomes
Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet 44: 226-32. Link
S. Huang et al. (2012) HaploMerger: Reconstructing allelic relationships for polymorphic diploid genome assemblies. Genome Research. Link
Targeted Assembly
P. Peterlongo and R. Chikhi (2012) Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer. BMC Bioinformatics 13:48. Link
René L. Warren and Robert A. Holt (2011) Targeted Assembly of Short Sequence Reads. PLoS ONE. Link
4. Faster, better, cheaper
k-mer counting
Bloom BH (1970) Space/time trade-offs in hash coding with allowable errors. Commun ACM 13:422-426. Link
Marais G, Kingsford C (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27(6):764-770. Link
Melsted P, Pritchard JK (2011) Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics 12. Link
C. T. Brown, khmer. https://github.com/ctb/
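To make the Bloom-filter trick in this subgroup less abstract, here is a small sketch in the spirit of the Melsted and Pritchard paper, not a copy of their implementation. A Bloom filter remembers k-mers seen once; only k-mers that show up again are promoted to an exact table, and a second pass removes the filter's false positives. The class `BloomFilter`, the function `count_repeated_kmers`, the SHA-256 based hashing and the default sizes are all our own assumptions for illustration.

```python
import hashlib
from collections import Counter

class BloomFilter:
    """Tiny Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(reads, k):
    for read in reads:
        for i in range(len(read) - k + 1):
            yield read[i:i + k]

def count_repeated_kmers(reads, k, num_bits=1 << 16, num_hashes=3):
    """Exact counts for k-mers occurring at least twice."""
    seen = BloomFilter(num_bits, num_hashes)
    candidates = set()
    # Pass 1: a k-mer already (probably) in the filter becomes a candidate.
    for kmer in kmers(reads, k):
        if kmer in seen:
            candidates.add(kmer)
        else:
            seen.add(kmer)
    # Pass 2: count candidates exactly; false positives end up with count 1.
    counts = Counter(kmer for kmer in kmers(reads, k) if kmer in candidates)
    return {kmer: n for kmer, n in counts.items() if n >= 2}

repeated = count_repeated_kmers(["ACGTAC", "CGTACG"], k=4)  # toy reads
print(repeated)
```

The memory saving comes from never storing singleton k-mers, which in real data are mostly sequencing errors, in the exact table.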
Storage
Christley S, Lu Y, Li C, Xie X (2009) Human genomes as email attachments. Bioinformatics 25:274-275.
Conway TC, Bromage AJ (2011) Succinct data structures for assembling large genomes. Bioinformatics 27(4):479-486. Link
Fritz MH-Y, Leinonen R, Cochrane G, Birney E. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res 21:734-740.
Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics 20(18):3363-3369. Link
Pinho A, Pratas D, Garcia S (2012) GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res 40: e27. Link
Daniel C. Jones, W. L. Ruzzo, X. Peng, M. G. Katze (2012) Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research. Link https://github.com/dcjones/quip#readme
Rayan Chikhi and Guillaume Rizk (2012) Space-efficient and exact de Bruijn graph representation based on a Bloom Filter. Link
Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G. (2011) Compressing genomic sequence fragments using SlimGene. J Comput Biol. (3):401-13. Link
Pell J. et al. (2012) Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Nat. Acad. Sci. USA. Link
C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. http://arxiv.org/abs/1203.4802
Error correction
Kelley D, Schatz M, Salzberg S: (2010) Quake: quality-aware detection and correction of sequencing errors. Genome Biology 11(11):R116.
Medvedev P, Scott E, Kakaradov B, Pevzner P (2011) Error correction of high-throughput sequencing datasets with non-uniform coverage. Bioinformatics 27: i137-41.
C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. http://arxiv.org/abs/1203.4802
Hadoop
Schatz M (2009) Cloudburst: highly sensitive read mapping with mapreduce. Bioinformatics 25: 1363-9.
Contrail <http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail>
Hardware-accelerators
Shi H, Schmidt B, Liu W, Müller-Wittig W (2010) A Parallel Algorithm for Error Correction in High-Throughput Short-Read Data on CUDA-Enabled Graphics Hardware. Journal of Computational Biology 17(4):603-615.
Liu Y, Schmidt B, Maskell DL (2011) Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinformatics, 12:354. Link
String graph assembler
Myers EW (2005) The fragment assembly string graph. Bioinformatics 21:79-85. Link
Simpson JT, Durbin R: Efficient de novo assembly of large genomes using compressed data structures. Genome Res 2011. Link
Simpson J, Durbin R (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics 26: i367-73. Link
Simpson J, Durbin R (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome Res 22: 549-56. Link
Scaffolding
Zerbino DR, McEwen GK, Margulies EH, Birney E (2009) Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4: e8407. Link
Koren S, Treangen T, Pop M (2011) Bambus 2: scaffolding metagenomes. Bioinformatics 27: 2964-71. Link
Alexey A. Gritsenko et al. (2012) GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. Bioinformatics 28(11): 1429-1437. Abstract
Repeats
Zerbino DR, McEwen GK, Margulies EH, Birney E (2009) Pebble and rock band: heuristic resolution of repeats and scaffolding in the velvet short-read de novo assembler. PLoS One 4: e8407. Link
Do H, Choi K, Preparata F, Sung W, Zhang L (2008) Spectrum-based de novo repeat detection in genomic sequences. J Comput Biol 15: 469-87. Link
Novak P, Neumann P, Macas J (2010) Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data. BMC Bioinformatics 11: 378. Link
Gu W, Castoe T, Hedges D, Batzer M, Pollock D (2008) Identification of repeat structure in large genomes using repeat probability clouds. Anal Biochem 380: 77-83. Link
6. Reviews, visions and IMs
Metzker M (2010) Sequencing technologies - the next generation. Nat Rev Genet 11: 31-46. Link
Shendure J. and Ji H. (2008) Next-generation DNA sequencing, Nature Biotechnol. 26, 1135-1145. Link
Phillippy AM, Schatz MC, Pop M (2008) Genome assembly forensics: finding the elusive mis-assembly. Genome Biol 9(3). Link
Miller J, Koren S, Sutton G (2010) Assembly algorithms for next-generation sequencing data. Genomics 95: 315-27. Link
Trapnell C, Salzberg S (2009) How to map billions of short reads onto genomes. Nat Biotechnol 27: 455-7. Link
Stein L. (2010) The case for cloud computing in genome informatics. Genome Biol 11: 207. Link
Li Y, Hu Y, Bolund L, Wang J (2010) State of the art de novo assembly of human genomes from massively parallel sequencing data. Hum Genomics 4: 271-7. Link
Compeau P, Pevzner P, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol. 29: 987-91. Link
Nagarajan N, Pop M (2009) Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J Comput Biol 16: 897-908. Link
Carl Kingsford, Michael C Schatz and Mihai Pop (2010) Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11:21. Link
Pop M, Salzberg SL (2008) Bioinformatics challenges of new sequencing technology. Trends Genet 24: 142-149. Link
Flicek P, Birney E (2009) Sense from sequence reads: methods for alignment and assembly. Nat Methods 6: S6-S12. Link (pdf), Link