Three Helpful Guides for Those Working on Genome Assembly
A. Rayan Chikhi’s slides - comprehensive yet introductory
Conclusions -
What is a good assembly?
- No total order
- Main metrics: N50, coverage, accuracy
- Use QUAST
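Since N50 is the first metric listed, a minimal sketch of how it is computed may help. N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name `n50` and the contig lengths below are made up for illustration; in practice QUAST reports this value for you.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Made-up contig lengths (total = 300); 80 + 70 = 150 reaches half the total.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # prints 70
```

Note that N50 says nothing about correctness: joining contigs wrongly also raises it, which is one reason it is listed alongside coverage and accuracy.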
How are assemblies made?
- Typically, using a de Bruijn graph or a string graph.
- Errors and small variants are removed from the graph.
- Contigs are just simple paths from the graph.
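To make the "contigs are simple paths" point concrete, here is a toy sketch of a de Bruijn graph in which nodes are (k-1)-mers and each k-mer in a read contributes an edge. The function names `build_de_bruijn` and `simple_paths` are hypothetical, the reads are invented, and the sketch ignores reverse complements and skips the error and variant removal step entirely, so it is an illustration rather than how any particular assembler works.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def simple_paths(graph):
    """Spell out maximal non-branching paths; each one is a toy contig.

    Isolated cycles (no branching node at all) are not reported.
    """
    indeg = defaultdict(int)
    for succs in graph.values():
        for node in succs:
            indeg[node] += 1
    nodes = set(graph) | set(indeg)

    def is_simple(node):
        # Exactly one way in and one way out: the path passes straight through.
        return indeg[node] == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_simple(node):
            continue  # paths only start at branching or terminal nodes
        for nxt in sorted(graph.get(node, ())):
            contig = node + nxt[-1]
            while is_simple(nxt):
                (nxt,) = graph[nxt]
                contig += nxt[-1]
            contigs.append(contig)
    return contigs


# Two overlapping, error-free toy reads from the sequence ATGGCGTGCA.
reads = ["ATGGCGT", "GGCGTGCA"]
print(simple_paths(build_de_bruijn(reads, k=4)))  # prints ['ATGGCGTGCA']
```

With a sequencing error or a repeat in the reads, the graph branches and the single contig breaks into several shorter ones, which is why the cleaning step in the list above matters.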
Assembly software
- Recommended software for Illumina data: SOAPdenovo2, Allpaths-LG
- Plethora of other software for custom needs: Minia for low memory, SGA for very accurate assembly, etc.
- Recommended software for 454 data: Newbler, Celera
A few tips
- How to choose k: always try many values
- Put the assembler inside a pipeline: error correction, scaffolding, gap-filling
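One cheap way to see why k matters before launching full assemblies is to look at how the k-mer content of the reads changes with k. The sketch below is only a rough proxy under stated assumptions: the helper `kmer_summary` and the toy reads are invented here, and k-mers seen only once hint at errors only when coverage is high. It does not replace running the assembler at several values of k and comparing the resulting assemblies.

```python
from collections import Counter


def kmer_summary(reads, k):
    """Return (distinct k-mers, k-mers seen exactly once) for a set of reads."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return distinct, singletons


# Invented toy reads; real data would come from a FASTQ file.
toy_reads = ["ATGGCGTGCAATG", "GGCGTGCAATGGC", "CGTGCAATGGCGT"]
for k in (3, 5, 7, 9):
    distinct, singletons = kmer_summary(toy_reads, k)
    print(f"k={k}: {distinct} distinct k-mers, {singletons} seen once")
```

Small k tends to collapse repeats into tangled graphs, while large k needs more coverage for consecutive k-mers to overlap, so no single value is safe to assume in advance.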
Case study
- How to assemble a human genome with Minia
B. "New High Throughput Sequencing technologies at the Norwegian Sequencing Centre - and beyond" from Lex Nederbragt
C. A Good Thread in the SEQanswers Forum
Original question -
I have some new Illumina data (HiSeq 100b reads- one paired-end (94xe6 reads) and one mate-pair (54xe6 reads) lib.) for a fungal genome (ca. 30MB) for which a pretty good reference is already assembled/available.
My coverage is about 400X, and I have de novo assembled the new data with both Velvet (VelvetOptimiser) and Soapdenovo, but based on simple metrics, e.g. # scaffolds, largest scaffold, N50, this new assembly doesn’t appear to be quite as good as the reference.
I don’t have access to the read data used to assemble the original reference, and I would like to see if I can improve it with this additional data. It looks like you can give Velvet a -long switch for a reference seq, but the documentation isn’t very clear on this. And, I’m not sure how to go about generating a “new” reference sequence/scaffolds after, for example, using an aligner, e.g. Bowtie or BWA, to align the new read data to the reference seq.
Can someone suggest/describe the best approach or a pipeline to get where I want to go with this dataset?
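The coverage figure in the question can be sanity-checked with the usual estimate: number of reads times read length, divided by genome size. The sketch below assumes the quoted counts are individual 100-base reads and that the genome is exactly 30 Mb; the function name `nominal_coverage` is made up for illustration.

```python
def nominal_coverage(num_reads, read_length, genome_size):
    """Expected average depth if every base of every read maps to the genome."""
    return num_reads * read_length / genome_size


genome_size = 30e6  # ca. 30 Mb fungal genome, as stated in the question
paired_end = nominal_coverage(94e6, 100, genome_size)
mate_pair = nominal_coverage(54e6, 100, genome_size)

print(f"paired-end: {paired_end:.0f}X")             # prints paired-end: 313X
print(f"mate-pair:  {mate_pair:.0f}X")              # prints mate-pair:  180X
print(f"total:      {paired_end + mate_pair:.0f}X")  # prints total:      493X
```

Under these assumptions the nominal total is somewhat above the "about 400X" quoted, which would be consistent with some reads being trimmed or discarded, although the question does not say so. Either way, the depth is far beyond what a 30 Mb genome needs.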