Monday review - KMC3 and other seXY topics

1. KMC3 is out

KMC2 is the best k-mer counting tool and is included in our Pandora’s Toolbox. The newly published KMC3 packs in many improvements that make the program even better. Here are the updates:


Summary: Counting all k-mers in a given dataset is a standard procedure in many bioinformatics applications. We introduce KMC3, a significant improvement of the former KMC2 algorithm together with KMC tools for manipulating k-mer databases. Usefulness of the tools is shown on a few real problems. Availability: Program is freely available at this http URL Contact:
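For readers new to the term, counting k-mers means tallying every length-k substring of the reads. The toy sketch below shows the idea with an in-memory dictionary. It is not how KMC3 works internally (KMC is disk-based and far more elaborate), and the function names, the choice to count canonical k-mers, and the rule of skipping k-mers with non-ACGT characters are our own assumptions for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_canonical_kmers(reads, k):
    """Naive in-memory k-mer counter (illustrative only, not KMC's algorithm).

    A k-mer and its reverse complement are counted together under the
    lexicographically smaller of the two (the "canonical" form).
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) - set("ACGT"):
                continue  # assumption: skip k-mers containing N or other codes
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts
```

On a real dataset the dictionary would not fit in memory, which is exactly the problem tools like KMC are built to solve.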

Speaking of new papers on k-mer analysis tools, readers may also check out KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. This topic has been popular since 2013-2014, and we discuss several existing tools (e.g. Kmergenie, Kmerstream, KmerMagic) in our tutorials.

2. Fast and scalable minimal perfect hashing for massive key sets

We will definitely include BBhash in our collection of elegant bioinformatics algorithms.


Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of 10^10 elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^12. Source code: this https URL
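The "simple algorithm" can be pictured as a cascade of bit arrays: at each level every remaining key is hashed into an array, slots hit by exactly one key are set to 1, and colliding keys move on to the next level. A key's final index is the rank of its 1-bit across all levels. The Python sketch below is our own reading of that idea, not the BBhash C++ code. The class name `CascadeMPHF`, the parameters `gamma` and `max_levels`, the use of `blake2b` as the hash, and the dictionary fallback are assumptions for illustration, and plain lists stand in for compact bit vectors.

```python
import hashlib


def _hash(key, level, size):
    """Level-dependent hash of a string key into [0, size)."""
    digest = hashlib.blake2b(f"{level}:{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "little") % size


class CascadeMPHF:
    """Toy minimal perfect hash over distinct string keys (illustrative only)."""

    def __init__(self, keys, gamma=2.0, max_levels=32):
        self.levels = []    # one (bits, ranks, offset) tuple per level
        self.fallback = {}  # keys still colliding after max_levels
        remaining = list(keys)
        assigned = 0
        level = 0
        while remaining and level < max_levels:
            size = max(1, int(gamma * len(remaining)))
            hits = [0] * size
            for key in remaining:
                hits[_hash(key, level, size)] += 1
            # A slot is usable only if exactly one key landed in it.
            bits = [1 if h == 1 else 0 for h in hits]
            ranks, running = [], 0
            for bit in bits:
                ranks.append(running)  # number of 1-bits before this slot
                running += bit
            self.levels.append((bits, ranks, assigned))
            assigned += running
            remaining = [k for k in remaining
                         if hits[_hash(k, level, size)] != 1]
            level += 1
        for key in remaining:
            self.fallback[key] = assigned
            assigned += 1
        self.size = assigned

    def lookup(self, key):
        """Return the index of a key from the build set.

        Keys outside the build set get an arbitrary index or a KeyError.
        """
        for level, (bits, ranks, offset) in enumerate(self.levels):
            pos = _hash(key, level, len(bits))
            if bits[pos]:
                return offset + ranks[pos]
        return self.fallback[key]
```

Every key in the build set maps to a distinct integer in the range 0 to n-1. A larger `gamma` means fewer levels and faster lookups at the cost of more bits per key, which is the kind of trade-off behind the bits/element figure in the abstract.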

3. Is programming the new blue-collar job?


Politicians routinely bemoan the loss of good blue-collar jobs. Work like that is correctly seen as a pillar of civil middle-class society. And it may yet be again. What if the next big blue-collar job category is already here—and it’s programming? What if we regarded code not as a high-stakes, sexy affair, but the equivalent of skilled work at a Chrysler plant?

Among other things, it would change training for programming jobs—and who gets encouraged to pursue them. As my friend Anil Dash, a technology thinker and entrepreneur, notes, teachers and businesses would spend less time urging kids to do expensive four-year computer-­science degrees and instead introduce more code at the vocational level in high school. You could learn how to do it at a community college; midcareer folks would attend intense months-long programs like Dev Bootcamp. There’d be less focus on the wunderkinds and more on the proletariat.

These sorts of coders won’t have the deep knowledge to craft wild new algorithms for flash trading or neural networks. Why would they need to? That level of expertise is rarely necessary at a job. But any blue-collar coder will be plenty qualified to sling Java­Script for their local bank. That’s a solidly middle-class job, and middle-class jobs are growing: The national average salary for IT jobs is about $81,000 (more than double the national average for all jobs), and the field is set to expand by 12 percent from 2014 to 2024, faster than most other occupations.

Readers may find the following links amusing.

Programming Is the New Literacy (2008)

Learning to code is NOT the new literacy (2015)

Coding is not the new literacy (2015)

Programming is the New Literacy (2016)

We look forward to the ‘algorithm is the new mathematics’ series.

4. ‘Drain the swamp’ Fail - ENCODE gets more money

Biochemists are unhappy.

What did ENCODE researchers say on Reddit?

ENCODE researchers answered a bunch of questions on Reddit a few days ago. I asked them to give their opinion on how much junk DNA is in our genome but they declined to answer that question. However, I think we can get some idea about the current thinking in the leading labs by looking at the questions they did choose to answer. I don’t think the picture is very encouraging. It’s been almost five years since the ENCODE publicity disaster of September 2012. You’d think the researchers might have learned a thing or two about junk DNA since that fiasco.

Intelligent designers are happy.

With Fresh Funding, ENCODE Team Continues Demolition of “Junk DNA” Myth

Darwinians don’t give up easily, though, as we have often noted. Transcription is not proof of function, they argue. But why use costly resources to transcribe junk for no purpose? In the intervening years, more and more functions have come to light.

Meanwhile, the US healthcare system continues to suffer from ENCODE-like waste by the NIH. Check out U.S. Healthcare Is A Global Outlier (And Not In A Good Way).

5. Human genome assembly from nanopore reads

Koren and Phillippy assembled human nanopore reads using their Canu program. The results are not encouraging.

Assembly of a human genome from nanopore sequencing data

Overall this first Nanopore human assembly was a success. The continuity is that (or better) of a similar coverage PacBio run using current chemistries. The encountered issues boil down to systematic base-call error and inefficiencies caused by the super-long read lengths. We are testing improvements for the latter, but it is up to Nanopore to fix the former. Notably, the consensus accuracy we see here is significantly lower than for bacterial genomes like E. coli, which can reach 99%. This suggests that the Nanopore base caller is underperforming on human, either due to DNA modifications or sequence contexts not seen in the training data. We are also testing improvements to the Canu correction module to make better use of the data we have. Until these issues can be resolved, polishing Nanopore assemblies with Illumina data can improve accuracy in the unique regions of the genome. However, this approach does leave some sequence uncorrected because the Illumina reads cannot be uniquely mapped to the entire genome. This is the main limitation of Nanopore assembly at this time, compared to PacBio, which can produce high base accuracy across the entire genome.

Another bioRxiv paper compares assemblies of the C. elegans genome from Illumina and Nanopore reads. It is unclear why the authors did not include PacBio in their analysis.

Whole genome sequencing and assembly of a Caenorhabditis elegans genome with complex genomic rearrangements using the MinION sequencing device

Advances in 3rd generation sequencing have opened new possibilities for ‘benchtop’ whole genome sequencing. The MinION is a portable device that uses nanopore technology and can sequence long DNA molecules. MinION long reads are well suited for sequencing and de novo assembly of complex genomes with large repetitive elements. Long reads also facilitate the identification of complex genomic rearrangements such as those observed in tumor genomes. To assess the feasibility of the de novo assembly of large complex genomes using both MinION and Illumina platforms, we sequenced the genome of a Caenorhabditis elegans strain that contains a complex acetaldehyde-induced rearrangement and a biolistic bombardment-mediated insertion of a GFP containing plasmid. Using ~5.8 gigabases of MinION sequence data, we were able to assemble a C. elegans genome containing 145 contigs (N50 contig length = 1.22 Mb) that covered >99% of the 100,286,401 bp reference genome. In contrast, using ~8.04 gigabases of Illumina sequence data, we were able to assemble a C. elegans genome in 38,645 contigs (N50 contig length = ~26 kb) containing 117 Mb. From the MinION genome assembly we identified the complex structures of both the acetaldehyde-induced mutation and the biolistic-mediated insertion. To date, this is the largest genome to be assembled exclusively from MinION data and is the first demonstration that the long reads of MinION sequencing can be used for whole genome assembly of large (100 Mb) genomes and the elucidation of complex genomic rearrangements.
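The abstract leans on N50 contig length to compare the two assemblies. As a reminder of what the statistic means, here is a minimal sketch under the usual definition: the length of the contig at which the running total of lengths, sorted longest first, reaches half of the assembly size. The function name `n50` is ours.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the cumulative length first reaches at least half
    of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("N50 is undefined for an empty list of contigs")
```

For example, `n50([2, 2, 2, 3, 3, 4, 8, 8])` returns 8, because the two longest contigs already cover half of the 32 bases. This is why 145 long contigs can give a far higher N50 than tens of thousands of short ones.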

6. seXY: a tool for sex inference from genotype arrays

In recent years, we have seen MEGAHIT, COCACOLA, BIGMAC and AVOCADO as creative names for new bioinformatics tools. Here is another eye-catching one.


Motivation: Checking concordance between reported sex and genotype-inferred sex is a crucial quality control measure in genome-wide association studies (GWAS). However, limited insights exist regarding the true accuracy of software that infer sex from genotype array data. Results: We present seXY, a logistic regression model trained on both X chromosome heterozygosity and Y chromosome missingness, that consistently demonstrated >99.5% sex inference accuracy in cross-validation for 889 males and 5,361 females enrolled in prostate cancer and ovarian cancer GWAS. Compared to PLINK, one of the most popular tools for sex inference in GWAS that assesses only X chromosome heterozygosity, seXY achieved marginally better male classification and 3% more accurate female classification.
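To make "a logistic regression model trained on both X chromosome heterozygosity and Y chromosome missingness" concrete, here is a small sketch in plain Python. It is not the seXY implementation. The feature values are made-up toy numbers rather than data from the paper, and the label coding (1 for female), the learning rate, the step count and all function names are our assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(samples, labels, lr=1.0, steps=500):
    """Fit a two-feature logistic regression by batch gradient descent.

    samples: list of (x_heterozygosity, y_missingness) pairs
    labels:  1 for female, 0 for male (coding assumed for this sketch)
    """
    w = [0.0, 0.0]
    b = 0.0
    n = len(samples)
    for _ in range(steps):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (x_het, y_miss), label in zip(samples, labels):
            err = sigmoid(w[0] * x_het + w[1] * y_miss + b) - label
            grad_w[0] += err * x_het
            grad_w[1] += err * y_miss
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def predict_female_probability(model, x_het, y_miss):
    w, b = model
    return sigmoid(w[0] * x_het + w[1] * y_miss + b)


# Made-up toy values, NOT data from the paper: males have almost no
# X heterozygosity and few missing Y calls; females show the opposite.
toy_samples = [(0.01, 0.02), (0.00, 0.05), (0.02, 0.01), (0.01, 0.03),
               (0.25, 0.97), (0.30, 0.99), (0.22, 0.95), (0.28, 0.98)]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_logistic(toy_samples, toy_labels)
```

Calling `predict_female_probability(model, 0.27, 0.96)` gives a probability above 0.5 on this toy set, and a sample whose reported sex disagrees with the prediction would be flagged for QC. Using both features is what separates this approach from checking X heterozygosity alone.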

7. Today’s revealing genome paper

The house spider genome reveals an ancient whole-genome duplication during arachnid evolution

The duplication of genes can occur through various mechanisms and is thought to make a major contribution to the evolutionary diversification of organisms. There is increasing evidence for a large-scale duplication of genes in some chelicerate lineages including two rounds of whole genome duplication (WGD) in horseshoe crabs. To investigate this further we sequenced and analyzed the genome of the common house spider Parasteatoda tepidariorum. We found pervasive duplication of both coding and non-coding genes in this spider, including two clusters of Hox genes. Analysis of synteny conservation across the P. tepidariorum genome suggests that there has been an ancient WGD in spiders. Comparison with the genomes of other chelicerates, including that of the newly sequenced bark scorpion Centruroides sculpturatus, suggests that this event occurred in the common ancestor of spiders and scorpions and is probably independent of the WGDs in horseshoe crabs. Furthermore, characterization of the sequence and expression of the Hox paralogs in P. tepidariorum suggests that many have been subject to neofunctionalization and subfunctionalization since their duplication, and therefore may have contributed to the diversification of spiders and other pulmonate arachnids.

If so many bioinformatics tools and algorithms confuse you, we are simplifying them for our members.

Written by M.