KMC tools tutorial - II

Yesterday we looked into the newly released ‘kmc tools’. Today we will work out another simple problem so that you feel familiar with it. We really love this powerful program, because, as the authors have shown, they could reproduce the results of many previously published bioinformatics papers with only a few commands.

Yesterday we mentioned the four kmc tools operations - ‘transform’, ‘simple’, ‘filter’ and ‘complex’. Let us take a look at the suboptions of ‘simple’ and ‘transform’.

For kmc_tools transform, which operates on a single database, the commands are -

Command	Use
1. sort	Sorts the order of kmers
2. reduce	Removes too rare or too frequent kmers
3. compact	Removes kmer counters
4. histogram	Produces histogram based on counts
5. dump	Text dump of kmer db, similar to kmc_dump

For kmc_tools simple, which combines two databases, the commands are -

Command	Use
1. intersect	Keeps kmers common to both databases
2. union	Combines kmers of both databases
3. kmers_subtract	Removes kmers of second db from first
4. reverse_kmers_subtract	Removes kmers of first db from second
5. counters_subtract	Subtracts kmer counts of second db from first
6. reverse_counters_subtract	Subtracts kmer counts of first db from second
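
To make the set operations concrete, here is a toy sketch that mimics three of them with awk on small text tables. It is only an illustration and not how kmc_tools works internally: the files ‘a.txt’ and ‘b.txt’ and their contents are made up, and we assume that equal kmers get the smaller count under ‘intersect’ and the summed count under ‘union’, which is our reading of the kmc_tools defaults.

```shell
# Work in a scratch directory so nothing in your account is overwritten.
cd "$(mktemp -d)"

# Made-up tables standing in for two kmc databases: "<kmer> <count>" per line.
printf 'AAA 3\nAAC 1\nACG 5\n' > a.txt
printf 'AAC 4\nACG 2\nTTT 7\n' > b.txt

# intersect: kmers present in both tables, keeping the smaller count (assumed default).
awk 'FILENAME == ARGV[1] {a[$1] = $2; next}
     ($1 in a) {print $1, (a[$1]+0 < $2+0 ? a[$1] : $2)}' a.txt b.txt | sort > intersect.txt

# union: kmers present in either table, summing the counts (assumed default).
awk '{c[$1] += $2} END {for (k in c) print k, c[k]}' a.txt b.txt | sort > union.txt

# kmers_subtract: kmers of the first table that are absent from the second.
awk 'FILENAME == ARGV[1] {b[$1]; next} !($1 in b)' b.txt a.txt | sort > subtract.txt

cat intersect.txt   # AAC 1, ACG 2
cat union.txt       # AAA 3, AAC 5, ACG 7, TTT 7
cat subtract.txt    # AAA 3
```

The ‘reverse’ variants simply swap the roles of the two tables.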

The options of kmc_tools filter, which filters reads from FASTA or FASTQ files based on a subset of kmers, are straightforward. For kmc_tools complex, which performs complex processing of multiple kmc databases, the options will be discussed in a future post.

Here is another simple example demonstrating the utility of ‘kmc tools’. If you are looking for more complicated uses, you will be pleased with three such cases in the latest kmc3 paper. In today’s example, we will merge kmer counts from two FASTQ files. This can be done in two ways using kmc -

Compute the kmer distribution for each file separately, and then merge using kmc tools.
Use the ‘@’ option to compute the kmer distribution from multiple files together.

Let us try both methods and see whether the results are identical. You can also try them yourself by logging into your account and using the same data sets as ours. For details, please check yesterday’s post.

ssh -p 2230 [user_id]@coding4medicine.com

Kmc requires a temporary directory to run. If you do not have one yet, create it with the name ‘tmp’ (or something else you like).

mkdir tmp

First method - separate computation of kmer spectrums

Next we run kmc on two E. coli files in the /share/data/D-ecoli-10K-ilmn directory. Those trying the program yourself can pick any two fastq files.

Let us explain the syntax of the following command.

The part ‘-m1’ means using no more than 1GB RAM.
The part ‘/share/data/D-ecoli-10K-ilmn/z1.fastq’ is the input file in FASTQ format. Please note that kmc can read gzipped files as well.
The next word (‘out1’ or ‘out2’) is the name of the output database.
The final word (‘tmp’) is where the temporary files are stored. It is cleaned after kmc completes execution.

kmc -m1 /share/data/D-ecoli-10K-ilmn/z1.fastq out1 tmp

kmc -m1 /share/data/D-ecoli-10K-ilmn/z2.fastq out2 tmp

After you run the above two commands, you will see the files ‘out1.kmc_pre’, ‘out1.kmc_suf’, ‘out2.kmc_pre’, ‘out2.kmc_suf’ in your directory. We will use the ‘union’ option in ‘kmc_tools simple’ to merge them.

kmc_tools simple out1 out2 union merge

You will now see the kmc database ‘merge’, made up of the files ‘merge.kmc_pre’ and ‘merge.kmc_suf’, in your directory. The binary database can be converted to a text file using the following command.

kmc_dump merge merge-kmers.txt
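
Once you have a text dump, ordinary command-line tools are enough to summarize it. The sketch below assumes the dump has one kmer per line followed by a tab and its count; the file ‘sample-kmers.txt’ and its three lines are made up so that the example runs without kmc.

```shell
# Work in a scratch directory so nothing in your account is overwritten.
cd "$(mktemp -d)"

# Hypothetical sample in the assumed "kmer<TAB>count" layout.
printf 'AAAAC\t2\nAAACG\t5\nACGTT\t2\n' > sample-kmers.txt

# Number of distinct kmers and the sum of all counts.
summary=$(awk -F'\t' '{n++; total += $2} END {print n, total}' sample-kmers.txt)
echo "$summary"     # 3 9

# Histogram: how many distinct kmers occur with each count.
histogram=$(awk -F'\t' '{h[$2]++} END {for (c in h) print c, h[c]}' sample-kmers.txt | sort -n)
echo "$histogram"   # 2 2, then 5 1
```

To try it on the real dump, replace ‘sample-kmers.txt’ with ‘merge-kmers.txt’.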

Second method - combined computation of kmer spectrum

Create a file ‘fq’ with the following lines -

/share/data/D-ecoli-10K-ilmn/z1.fastq
/share/data/D-ecoli-10K-ilmn/z2.fastq

Run the ‘kmc’ commands to get the binary database ‘out’ and then the text file ‘out-kmers.txt’.

kmc -m1 @fq out tmp
kmc_dump out out-kmers.txt

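We confirmed that the files ‘merge-kmers.txt’ and ‘out-kmers.txt’ obtained by the two methods are identical for all kmers except those with low counts. The counts for rare kmers differ because kmc, by default, removes all singleton kmers. In the separate method this filter is applied to each file before merging, so a kmer that occurs once in each file is dropped from both databases and never reaches the merged result, whereas the combined method sees it twice and keeps it with a count of 2. Likewise, a kmer seen once in one file and several times in the other loses one from its merged count. That means the separate method loses many more kmers with frequency 2 than the combined method.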
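
The effect of the singleton filter can be reproduced with a toy model that needs only sort, uniq and awk. Everything here is made up for illustration: ‘reads1.txt’ and ‘reads2.txt’ list kmer occurrences one per line instead of real FASTQ reads, and the helper functions ‘count_min’, ‘separate’ and ‘combined’ are ours, not part of kmc.

```shell
# Work in a scratch directory so nothing in your account is overwritten.
cd "$(mktemp -d)"

# Made-up kmer occurrences standing in for two FASTQ files.
printf 'AAA\nAAA\nCCC\nGGG\n' > reads1.txt   # AAA x2, CCC x1, GGG x1
printf 'AAA\nCCC\nGGG\nGGG\n' > reads2.txt   # AAA x1, CCC x1, GGG x2

# Count kmers in the given files and drop those seen fewer than $1 times.
count_min() {
  min=$1; shift
  cat "$@" | sort | uniq -c | awk -v min="$min" '$1 >= min {print $2, $1}'
}

# Separate method: filter each file on its own, then sum the surviving counts.
separate() {
  { count_min "$1" reads1.txt; count_min "$1" reads2.txt; } |
    awk '{c[$1] += $2} END {for (k in c) print k, c[k]}' | sort
}

# Combined method: count both files together, then filter once.
combined() {
  count_min "$1" reads1.txt reads2.txt | sort
}

separate 2   # AAA 2, GGG 2
combined 2   # AAA 3, CCC 2, GGG 3
separate 1   # AAA 3, CCC 2, GGG 3 (same as combined 1)
```

With a cutoff of 2 the separate method loses CCC entirely and undercounts AAA and GGG, while with a cutoff of 1 the two methods agree. If we read the kmc documentation correctly, the cutoff is set with the ‘-ci’ option, so running both kmc commands of the first method with ‘-ci1’ and applying the cutoff only when dumping the merged database should bring the two results in line. We have not rerun the example with that setting, so please treat it as a suggestion to try.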

The posts from yesterday and today will be included in the kmc section of our Pandora’s toolbox tutorial.
