Three Helpful Guides for Those Working on Genome Assembly

A. Rayan Chikhi’s slides - comprehensive yet introductory

Conclusions -

What is a good assembly?

No total order: no single metric ranks one assembly above another
Main metrics: N50, coverage, accuracy
Use QUAST to evaluate assemblies
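
Since N50 is the first metric listed, here is a minimal sketch of how it is usually computed: sort the contig lengths from longest to shortest and report the length at which the running sum first reaches half of the total assembly size. The function name and the contig lengths are made up for illustration. For real assemblies, QUAST reports this value along with many others.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that pushes the running sum to >= 50%.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (total = 300 bases).
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # 80 + 70 = 150 reaches half of 300, so prints 70
```

Note that N50 says nothing about correctness: joining contigs wrongly raises N50, which is one reason the slides list accuracy as a separate metric.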

How are assemblies made?

Typically, using a de Bruijn graph or a string graph.
Errors and small variants are removed from the graph.
Contigs are just simple paths from the graph.
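
To make the three points above concrete, here is a toy sketch of the de Bruijn approach: every k-mer in the reads becomes an edge between its (k-1)-mer prefix and suffix, and contigs are read off as maximal non-branching paths. This is an illustration under simplifying assumptions, not how any particular assembler is implemented: it ignores reverse complements, skips the error and variant removal step entirely, and does not report isolated cycles. The function names and the reads are invented for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are distinct k-mers."""
    succ = defaultdict(set)  # node -> set of successor nodes
    pred = defaultdict(set)  # node -> set of predecessor nodes
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            u, v = kmer[:-1], kmer[1:]
            succ[u].add(v)
            pred[v].add(u)
    return succ, pred


def simple_paths(succ, pred):
    """Return contigs spelled by maximal non-branching paths of the graph."""
    nodes = set(succ) | set(pred)

    def is_inner(node):
        # An inner node has exactly one way in and one way out.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in sorted(nodes):
        if is_inner(node):
            continue  # paths start only at branching or terminal nodes
        for nxt in sorted(succ.get(node, ())):
            path = node + nxt[-1]
            cur = nxt
            while is_inner(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            contigs.append(path)
    return contigs


# Three error-free, overlapping toy reads from the sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
succ, pred = build_de_bruijn(reads, k=4)
print(simple_paths(succ, pred))  # ['ATGGCGTGCAA']
```

With a sequencing error or a repeat in the reads, the graph branches and the single contig above breaks into several shorter ones, which is why real assemblers clean the graph before extracting paths.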

Assembly software

Recommended software for Illumina data: SOAPdenovo2, Allpaths-LG
Plethora of other software for custom needs: Minia for low-memory, SGA for very accurate assembly, etc.

Recommended software for 454 data: Newbler, Celera

A few tips

How to choose k: always try many values
Put the assembler inside a pipeline: error correction, scaffolding, gap-filling

Case study

How to assemble a human genome with Minia

B. New High Throughput Sequencing technologies at the Norwegian Sequencing Centre - and beyond, from Lex Nederbragt

C. A Good Thread in SeqAnswers Forum

Original question -

I have some new Illumina data (HiSeq 100b reads- one paired-end (94xe6 reads) and one mate-pair (54xe6 reads) lib.) for a fungal genome (ca. 30MB) for which a pretty good reference is already assembled/available.

My coverage is about 400X, and I have de novo assembled the new data with both Velvet (VelvetOptimiser) and Soapdenovo, but based on simple metrics, e.g. # scaffolds, largest scaffold, N50, this new assembly doesn’t appear to be quite as good as the reference.

I don’t have access to the read data used to assemble the original reference, and I would like to see if I can improve it with this additional data. It looks like you can give Velvet a -long switch for a reference seq, but the documentation isn’t very clear on this. And, I’m not sure how to go about generating a “new” reference sequence/scaffolds after, for example, using an aligner, e.g. Bowtie or BWA, to align the new read data to the reference seq.

Can someone suggest/describe the best approach or a pipeline to get where I want to go with this dataset?
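
As a rough sanity check on the coverage figure in the question, expected coverage can be estimated as (number of reads × read length) / genome size. The sketch below assumes that the quoted counts are individual reads, that every read is exactly 100 bases, and that the genome is exactly 30 Mb. The poster's own inputs are not stated, so the result need not match the quoted "about 400X" exactly.

```python
def expected_coverage(n_reads, read_length, genome_size):
    """Average number of times each base is expected to be sequenced."""
    return n_reads * read_length / genome_size


paired_end_reads = 94e6  # from the question
mate_pair_reads = 54e6   # from the question
read_length = 100        # assumption: every read is a full 100 bases
genome_size = 30e6       # assumption: "ca. 30MB" taken as exactly 30 Mb

coverage = expected_coverage(paired_end_reads + mate_pair_reads,
                             read_length, genome_size)
print(f"{coverage:.0f}X")  # prints 493X
```

Either way, the data set is several hundred-fold deep.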
