I Often Repeat Repeat Myself, I Often Repeat Repeat

A new Hindawi paper on assembling the repeat regions was forwarded to us (h/t: @srbehera11), and we decided to check what else is available in the same genre. First, the Hindawi paper -

<a href="http://www.hindawi.com/journals/bmri/aip/736473/">A de novo genome assembly algorithm for repeats and non-repeat</a>

They claim to assemble repeat regions from short reads and to do a better job than all other assemblers. Sadly, they did not compare against SPAdes, the SOAPdenovo repeat-resolution module, or Ray, three assemblers we expected to do well with repeats based on their algorithms. A comparison with SPAdes would have been especially nice, given that Pevzner has been writing on repeat resolution for almost a decade now. The lack of comparison is not entirely their fault, because they used the GAGE benchmarks and not GAGE-B.

Going through the algorithm, we do not understand what is innovative about it, and we would like our readers to comment. Here is a short snippet; the paper has a lot more details.

To this end, we proposed a new genome assembly algorithm aiming for assembling repeats and non-repeats, named SWA (Sliding Window Assembly), which can assemble repeats and non-repeats completely and accurately. In SWA, sliding window function is used to filter out the sequencing bias caused by sequencing process and improve the confidence of separating repeats and non-repeats.

[snip]

The main contributions of our approach are as follows: 1) Assembling repeats and non-repeats completely and accurately rather than only detecting where repeats or non-repeats are. Complex repeats structures have very important biomedical functions. Consequently, the completeness and accuracy of assembling repeats are what SWA mainly concerned rather than the continuity of whole genome assembly. 2) Sliding window functions to filter out the sequencing bias are used in genome assembling process. Filtering noise by window function is very common in information processing but is rare in genome assembly process. SWA adopts sliding window to filter out NGS data bias and improve the statistical significance of read counts. Whats more, a compensational mechanism based on sliding window was embedded in SWA. This mechanism can improve the significance of read counts under the condition of low coverage.
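To make the sliding-window idea more concrete, here is a minimal Python sketch of our own. It is not the SWA authors' code, and the paper should be consulted for their actual window function. We assume a list of per-position read counts, smooth it with a centered moving average, and call a position a repeat when the smoothed count exceeds a multiple of the median raw count. The function names, the window size of 5, and the ratio of 2 are all hypothetical choices for illustration.

```python
from statistics import median


def sliding_window_mean(counts, window=5):
    """Smooth per-position read counts with a centered moving average.

    Windows are truncated at the ends of the sequence.
    """
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed


def label_repeats(counts, window=5, ratio=2.0):
    """Return True for positions whose smoothed count exceeds ratio * baseline.

    Assumption: the median raw count approximates single-copy coverage,
    which only holds when most positions are non-repeat.
    """
    baseline = median(counts)
    smoothed = sliding_window_mean(counts, window)
    return [s > ratio * baseline for s in smoothed]


# Toy data: ~10x single-copy coverage, one noisy spike (22) at position 4,
# and a 3-copy repeat covering positions 8-13.
counts = [10, 10, 10, 10, 22, 10, 10, 10,
          30, 30, 30, 30, 30, 30,
          10, 10, 10, 10, 10, 10]

raw_labels = label_repeats(counts, window=1)       # window of 1 = no smoothing
smoothed_labels = label_repeats(counts, window=5)
```

On this toy input the spike at position 4 is called a repeat from the raw counts alone, but not after smoothing, while the six-position high-coverage stretch is still called. A wider window suppresses more noise but can blur the edges of short repeats.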

While on the topic of repeat assembly, we would like to point out a number of related earlier papers.

De Novo Repeat Classification and Fragment Assembly

Repetitive sequences make up a significant fraction of almost any genome, and an important and still open question in bioinformatics is how to represent all repeats in DNA sequences. We propose a new approach to repeat classification that represents all repeats in a genome as a mosaic of sub-repeats. Our key algorithmic idea also leads to new approaches to multiple alignment and fragment assembly. In particular, we show that our FragmentGluer assembler improves on Phrap and ARACHNE in assembly of BACs and bacterial genomes.

That is the first paper by Pevzner, Tang and Tesler, but a number of other papers from the group, including the pathset graph by Son Pham and the rectangular graph by Vyahhi, were also geared toward resolving repeats and assembling scaffolds properly.

Apart from Pevzner’s group, the following paper could be relevant.

RepARK - de novo creation of repeat libraries from whole-genome NGS reads

Generation of repeat libraries is a critical step for analysis of complex genomes. In the era of next-generation sequencing (NGS), such libraries are usually produced using a whole-genome shotgun (WGS) derived reference sequence whose completeness greatly influences the quality of derived repeat libraries. We describe here a de novo repeat assembly method, RepARK (Repetitive motif detection by Assembly of Repetitive K-mers), which avoids potential biases by using abundant k-mers of NGS WGS reads without requiring a reference genome. For validation, repeat consensuses derived from simulated and real Drosophila melanogaster NGS WGS reads were compared to repeat libraries generated by four established methods. RepARK is orders of magnitude faster than the other methods and generates libraries that are: (i) composed almost entirely of repetitive motifs, (ii) more comprehensive and (iii) almost completely annotated by TEclass. Additionally, we show that the RepARK method is applicable to complex genomes like human and can even serve as a diagnostic tool to identify repetitive sequences contaminating NGS datasets.
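The "abundant k-mers" idea in that abstract can be illustrated with a few lines of Python. This is only our sketch of the counting-and-filtering step, not the RepARK pipeline: the function names and the fixed `min_count` threshold are hypothetical, reverse complements are ignored for simplicity, and the assembly of the selected k-mers into repeat consensuses is not shown.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundant_kmers(reads, k, min_count):
    """Return the set of k-mers seen at least min_count times.

    Assumption: min_count is supplied by the caller; a real tool would
    derive an abundance threshold from the k-mer count distribution.
    """
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy reads: "ACGTAC" occurs in three of them (twice in the first),
# while the fourth read shares nothing with the others.
reads = ["ACGTACGTAC", "TTACGTACGG", "GGACGTACTT", "CCCCAGTGAT"]
repeat_kmers = abundant_kmers(reads, k=6, min_count=3)
```

Here only `ACGTAC` passes the threshold, with four occurrences. On real data the k-mer table would be far too large for a plain `Counter`, which is why dedicated k-mer counters exist.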

Please feel free to add any others in the comments section.

-----

The title of the post is from a poem by Jack Prelutsky.

I often repeat repeat myself,

I often repeat repeat.

I don’t don’t know why know why,

I simply know that I I I

am am inclined to say to say

a lot a lot this way this way-

I often repeat repeat myself,

I often repeat repeat.

I often repeat repeat myself,

I often repeat repeat.

My mom my mom gets mad gets mad,

it irritates my dad my dad,

it drives them up a tree a tree,

that’s what they tell they tell me me-

I often repeat repeat myself,

I often repeat repeat.

I often repeat repeat myself,

I often repeat repeat.

It gets me in a jam a jam,

but that’s the way I am I am,

in fact I think it’s neat it’s neat

to to to to repeat repeat-

I often repeat repeat myself,

I often repeat repeat.
