kSNP v2 - Alignment-Free SNP Discovery and Phylogenetics of Hundreds of Microbial Genomes
This paper is everything that the TIGRA paper is not. In twitter, @pathogenomenick forwarded the link and we were immediately able to glance through it without being asked to pay $20 ! Thank you Shea Gardner and Barry Hall for making the paper open access.
The authors use k-mer counting (Jellyfish) and follow-up analysis to align large number of closely-related microbial genomes. The workflow is clear from the above chart. Speaking of k-mer-based methods, readers may also look at Sailfish, which used k-mers to quickly estimate gene expressions in RNAseq.
Since many of our readers are also interested in algorithm development, they will find the following extracts from two sections helpful.
Advantages and Disadvantages
kSNP cannot find SNPs that are too close together (closer than one half k).
K is usually in the range of 1331. For viruses, we have found that k = 13 or 15 works well, and for bacteria, k = 19 or 21, and have included the Kchooser script to assist the user in selecting an optimal value of k for a given data set.
Repetitive elements like gene duplications can contain SNPs so long as the duplicate kmer locus does not create an allele conflict within a given genome. Even if such regions create allele conflicts within a subset of genomes, the SNP locus can still be detected as a SNP in other genomes without an allele conflict. This facilitates identification of SNPs on regions that may be duplicated or horizontally transferred, such as phage, plasmids, or other mobile elements, in those genomes for which the duplication does not create a SNP allele conflict. But the SNP will not be reported in the genomes with allele conflicts, which would require a longer value of k, i.e. more sequence context, in order to tell the duplicates apart. So running k with a longer value of k should be better at distinguishing loci in homologous regions by detecting some of the SNPs that would be considered allele conflicts with a shorter value of k. But the tradeoff is that a longer k will miss all those high density SNPs in which there is sequence variation within half k of the SNP.
kSNP cannot distinguish true SNPs from sequencing errors. It is advised that for raw read data, some quality filters are imposed on the reads prior to running kSNP (e.g. replace bases with quality below Q20Q30 with N, and remove adaptors, barcodes, or other non-biological portions of reads).
kSNP v2 does not find indels. Indel sequencing errors that occur in the kmer sequence flanking a SNP will cause a SNP detection failure for that locus in that genome.
Some unique features of kSNP v2 are that it scales better for large data sets (hundreds of bacterial or viral genomes) than other SNP finding approaches (Table 1). It can handle many genomes as unassembled raw reads. For example, we have run it in 6.9 hours on 5.8 GB of input for 212 Salmonella genomes, including many in raw reads from multiple sequencing technologies, on a node with 48 GB of RAM and 12 CPU. It does not require a multiple sequence alignment or a reference sequence, so avoids biases stemming from the choice of a reference.
kSNP finds SNPs that are not in the core genome, as well as those that are. It phylogenetically analyzes both core SNPs only, and all SNPs, and allows users to investigate cases intermediate between these ends of the spectrum, as SNP loci shared by at least a user-specified fraction of the genomes.
One application of kSNP could be a quick initial look at a large data set to determine clades, prior to full genome multiple sequence alignments of genomes within clades to look at strain differences including indels in more detail.
Improvements from version 1.
For better speed, v2 uses MUMmer instead of BLAST, jellyfish instead of sa (suffix array) for k-mers<32, and FastTreeMP and Parsimonator instead of RAxML [19] and PHYLIP [11].
There are algorithmic changes as well: In version 1, k-mers were initially computed for all genomes at once, and these k-mer lists were used to find candidate SNPs. BLAST was run to compare all candidate k-mers against all genomes to identify SNPs (allele variation among genomes), conflicting alleles (allele variation within a genome), and identify the allele variant within each genome. This use of BLAST was more memory intensive because all candidate SNP loci and all possible allele variants had to be compared to each genome, and positions even in raw read or merged contig genomes were found, even though that positional information was irrelevant. When run against GB of genomes in raw reads in v1, this step was more likely to run out of memory. In version 2, k-mer comparisons are used much more extensively and BLAST is replaced by MUMmer, which is called very minimally. First, jellyfish is run against each genome individually, and PERL and Unix scripts are used to parse the k-mer lists to determine SNPs, alleles within each genome, and conflicting alleles. Forward and reverse complement k-mers and counts are summed and only the orientation occurring first in an alphabetic sort is stored, saving time and space compared to v1. However, this means that more of the loci are reported in the reverse direction than in kSNP v1. MUMmer is only used to determine the position of the allele in finished genomes specified in the -p option input file. Also, k-mer calculations are performed in subsets by prefix, enabling better memory management for extremely large data, and better parallelization.
Relevant twitter discussion -