# WABIstan Report

The 14th International Workshop on Algorithms in Bioinformatics (WABI) takes place in Wroclaw, Poland between September 810, 2014. A few days back, a reader forwarded us the official link to its proceedings. Thankfully, we managed to download it in time, because Springer decided to change access policy for the Proceedings. Here is a brief summary of some of the papers.

The proceeding included 25+ papers. We already wrote about Gene Myers’ submission, but here is the entire table of contents.

QCluster: Extending Alignment-Free Measures with Quality Values for Reads Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Matteo Comin, Andrea Leoni, and Michele Schimd

Abstract. The data volume generated by Next-Generation Sequencing

(NGS) technologies is growing at a pace that is now challenging the storage and data processing capacities of modern computer systems. In this context an important aspect is the reduction of data complexity by collapsing redundant reads in a single cluster to improve the run time, memory requirements, and quality of post-processing steps like assembly and error correction. Several alignment-free measures, based on k-mers counts, have been used to cluster reads.

Quality scores produced by NGS platforms are fundamental for various analysis of NGS data like reads mapping and error detection. Moreover future-generation sequencing platforms will produce long reads but with a large number of erroneous bases (up to 15%). Thus it will be fundamental to exploit quality value information within the alignment-free framework. In this paper we present a family of alignment-free measures, called Dq-type, that incorporate quality value information and k-mers counts for the comparison of reads data. A set of experiments on simulated and real reads data confirms that the new measures are superior to other classical alignment-free statistics, especially when erroneous reads are considered. These measures are implemented in a software called QCluster (http://www.dei.unipd.it/~ciompin/main/qcluster.html).

-—————————

Improved Approximation for the Maximum Duo-Preservation String

Mapping Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

Nicolas Boria, Adam Kurpisz, Samuli Leppanen, and

Monaldo Mastrolilli

-—————————

A Faster 1.375Approximation Algorithm for Sorting by

Transpositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

Lu?s Felipe I. Cunha, Luis Antonio B. Kowada, Rodrigo de A. Hausen, and Celina M.H. de Figueiredo

-—————————

A Generalized Cost Model for DCJ-Indel Sorting . . . . . . . . . . . . . . . . . . . . . 38

Phillip E.C. Compeau

-—————————

Efficient Local Alignment Discovery amongst Noisy Long Reads. . . . . . . . 52

Gene Myers

-—————————

Efficient Indexed Alignment of Contigs to Optical Maps . . . . . . . . . . . . . . 68

Martin D. Muggli, Simon J. Puglisi, and Christina Boucher

-—————————

Navigating in a Sea of Repeats in RNA-seq without Drowning . . . . . . . . . 82

Gustavo Sacomoto, Blerina Sinaimeri, Camille Marchet,

Vincent Miele, Marie-France Sagot, and Vincent Lacroix

-—————————

Linearization of Median Genomes under DCJ . . . . . . . . . . . . . . . . . . . . . . . . 97

Shuai Jiang and Max A. Alekseyev

-—————————

An LP-Rounding Algorithm for Degenerate Primer Design . . . . . . . . . . . . 107

Yu-Ting Huang and Marek Chrobak

-—————————

GAML: Genome Assembly by Maximum Likelihood . . . . . . . . . . . . . . . . . . 122

Vladim?r Bo?za, Bro?na Brejova, and Toma?s Vina?r

-—————————

A Common Framework for Linear and Cyclic Multiple Sequence

Alignment Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

Sebastian Will and Peter F. Stadler

-—————————

Entropic Profiles, Maximal Motifs and the Discovery of Significant Repetitions in Genomic Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

Laxmi Parida, Cinzia Pizzi, and Simona E. Rombo

-—————————

Estimating Evolutionary Distances from Spaced-Word Matches . . . . . . . . 161

Burkhard Morgenstern, Binyao Zhu, Sebastian Horwege, and

Chris-Andre Leimeister

-—————————

On the Family-Free DCJ Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

Fabio Viduani Martinez, Pedro Feijao, Mar?lia D.V. Braga, and

Jens Stoye

-—————————

New Algorithms for Computing Phylogenetic Biodiversity . . . . . . . . . . . . . 187

Constantinos Tsirogiannis, Brody Sandel, and Adrija Kalvisa

-—————————

The Divisible Load Balance Problem and Its Application to

Phylogenetic Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204

Kassian Kobert, Toma?s Flouri, Andre Aberer, and

Alexandros Stamatakis

-—————————

Multiple Mass Spectrometry Fragmentation Trees Revisited: Boosting Performance and Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217

Kerstin Scheubert, Franziska Hufsky, and Sebastian Bocker

-—————————

An Online Peak Extraction Algorithm for Ion Mobility Spectrometry

Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

Dominik Kopczynski and Sven Rahmann

-—————————

Best-Fit in Linear Time for Non-generative Population Simulation

(Extended Abstract) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

Niina Haiminen, Claude Lebreton, and Laxmi Parida

-—————————

GDNorm: An Improved Poisson Regression Model for Reducing Biases

in Hi-C Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

Ei-Wen Yang and Tao Jiang

-—————————

Pacemaker Partition Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281

Sagi Snir

Abstract. The universally observed conservation of the distribution of

evolution rates across the complete sets of orthologous genes in pairs of

related genomes can be explained by the model of the Universal Pacemaker

(UPM) of genome evolution. Under UPM, the relative evolutionary

rates of all genes remain nearly constant whereas the absolute rates

can change arbitrarily. It was shown on several taxa groups spanning the

entire tree of life that the UPM model describes the evolutionary process

better than the traditional molecular clock model [26,25]. Here we extend

this analysis and ask: how many pacemakers are there and which genes

are affected by which pacemakers? The answer to this question induces

a partition of the gene set such that all the genes in one part are affected

by the same pacemaker. The input to the problem comes with arbitrary

amount of statistical noise, hindering the solution even more. In this

work we devise a novel heuristic procedure, relying on statistical and geometrical

tools, to solve the pacemaker partition identification problem

and demonstrate by simulation that this approach can cope satisfactorily

with considerable noise and realistic problem sizes. We applied this procedure

to a set of over 2000 genes in 100 prokaryotes and demonstrated

the significant existence of two pacemakers.

-—————————

Manifold de Bruijn Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296

Yu Lin and Pavel A. Pevzner

Abstract. Genome assembly is usually abstracted as the problem of

reconstructing a string from a set of its k-mers. This abstraction naturally leads to the classical de Bruijn graph approachthe key algorithmic technique in genome assembly. While each vertex in this approach is labeled by a string of the fixed length k, the recent genome assembly studies suggest that it would be useful to generalize the notion of the de Bruijn graph to the case when vertices are labeled by strings of variable lengths. Ideally, we would like to choose larger values of k in

high-coverage regions to reduce repeat collapsing and smaller values of k in the low-coverage regions to avoid fragmentation of the de Bruijn graph. To address this challenge, the iterative de Bruijn graph assembly (IDBA) approach allows one to increase k at each iterations of the graph construction. We introduce the Manifold de Bruijn (M-Bruijn) graph (that generalizes the concept of the de Bruijn graph) and show that it can provide benefits similar to the IDBA approach in a single iteration that considers the entire range of possible k-mer sizes rather than varies k from one iteration to another.

-—————————

Constructing String Graphs in External Memory . . . . . . . . . . . . . . . . . . . . . 311

Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola,

Marco Previtali, and Raffaella Rizzi

Abstract. In this paper we present an efficient external memory algorithm to compute the string graph from a collection of reads, which is a fundamental data representation used for sequence assembly. Our algorithm builds upon some recent results on lightweight Burrows-Wheeler Transform (BWT) and Longest Common Prefix (LCP) construction providing, as a by-product, an efficient procedure to extend intervals of the BWT that could be of independent interest. We have implemented our algorithm and compared its efficiency against SGAthe most advanced assembly string graph construction program.

-—————————

Topology-Driven Trajectory Synthesis with an Example on Retinal Cell Motions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326

Chen Gu, Leonidas Guibas, and Michael Kerber

Abstract. We design a probabilistic trajectory synthesis algorithm for generating time-varying sequences of geometric configuration data. The algorithm takes a set of observed samples (each may come from a different trajectory) and simulates the dynamic evolution of the patterns in O(n2 log n) time. To synthesize geometric configurations with indistinct identities, we use the pair correlation function to summarize point distribution, and ?-shapes to maintain topological shape features based on a fast persistence matching approach. We apply our method to build a computational model for the geometric transformation of the cone mosaic in retinitis pigmentosa an inherited and currently untreatable retinal degeneration.

-—————————

A Graph Modification Approach for Finding CorePeriphery Structures in Protein Interaction Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340

Sharon Bruckner, Falk Huffner, and Christian Komusiewicz

Abstract. The coreperiphery model for protein interaction (PPI) networks assumes that protein complexes in these networks consist of a dense core and a possibly sparse periphery that is adjacent to vertices in the core of the complex. In this work, we aim at uncovering a global coreperiphery structure for a given PPI network. We propose two exact graph-theoretic formulations for this task, which aim to fit the input network to a hypothetical ground truth network by a minimum number of edge modifications. In one model each cluster has its own periphery, and in the other the periphery is shared. We first analyze both models from a theoretical point of view, showing their NP- hardness. Then, we devise efficient exact and heuristic algorithms for both models and finally perform an evaluation on subnetworks of the S. cerevisiae PPI network.

-—————————

Interpretable Per Case Weighted Ensemble Method for Cancer

Associations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352

Adrin Jalali and Nico Pfeifer

Abstract. Over the past decades, biology has transformed into a high throughput research field both in terms of the number of different measurement techniques as well as the amount of variables measured by each technique (e.g., from Sanger sequencing to deep sequencing) and is more and more targeted to individual cells [3]. This has led to an unprecedented growth of biological information. Consequently, techniques that can help researchers find the important insights of the data are becoming

more and more important. Molecular measurements from cancer

patients such as gene expression and DNA methylation are usually very noisy. Furthermore, cancer types can be very heterogeneous. Therefore, one of the main assumptions for machine learning, that the underlying unknown distribution is the same for all samples in training and test data, might not be completely fulfilled.

-—————————

Reconstructing Mutational History in Multiply Sampled Tumors Using Perfect Phylogeny Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354

Iman Hajirasouliha and Benjamin J. Raphael

Abstract. High-throughput sequencing of cancer genomes have motivated the problem of inferring the ancestral history of somatic mutations that accumulate in cells during cancer progression. While the somatic mutation process in cancer cells meets the requirements of the classic Perfect Phylogeny problem, nearly all cancer sequencing studies do not sequence single cancerous cells, but rather thousands-millions of cells in

a tumor sample. In this paper, we formulate the Perfect Phylogeny Mixture problem of inferring a perfect phylogeny given somatic mutation data from multiple tumor samples, each of which is a superposition of cells, or species. We prove that the Perfect Phylogeny Mixture problem is NP-hard, using a reduction from the graph coloring problem. Finally, we derive an algorithm to solve the problem.

We will cover a few papers in future commentaries, but here is a comment about WABIstan in general. We do not understand, why its organizers decided to turn it into a ‘land-locked country’ by removing all aspects of asking biological questions from the definition of bioinformatics. If bioinformatics is redefined as asking questions about how living systems function using computers, the conference (and the field of bioinformatics) would benefit.