Typically, distributed systems in computing clusters and supercomputing centers implement an MPI-based architecture for parallel computing. Another type of distributed architecture, Hadoop/MapReduce, has become popular among internet companies processing terabytes of data. Hadoop is accessible to bioinformaticians through the Amazon cloud (Elastic MapReduce), but many researchers do not understand what advantage Hadoop would provide over a conventional parallel architecture. Here we explain the difference with simple examples.
Instead of the BLAST example mentioned above, let us think about another problem. We have a large library of 100 million 99 nt Solexa reads from a transcriptome study, and we want to find the distribution of all 25-mers within the reads. As we explained before, the de Bruijn graph approach for transcriptome assembly requires keeping tabs on all K-mers (K=25). If we have one computer, the problem can be solved in the following manner: we go over the entire library of reads one by one, split both strands of each read into 25-mers and add the 25-mers to a giant hash table held in memory. After the entire library is processed, we print or save the counts for all 25-mers. This computation requires a large amount of RAM, because we are storing all 25-mers in memory. If our computer does not have enough RAM, or if we want to split the computation among many nodes, the following steps can be used: (i) split the 100 million reads into 100 files with 1 million reads in each, (ii) do the counting for each file on a single node and save the results, (iii) sort each saved file according to the 25-mers, and (iv) combine the 100 files into one to get the distribution for all reads.
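For concreteness, a minimal single-machine sketch of this hash-table approach is shown below. The one-read-per-line input format, the handling of non-ACGT characters and the class name KmerCount are illustrative assumptions, not part of the original description.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

// Minimal single-machine sketch of 25-mer counting.
// Assumes a plain-text input file with one read per line (file name given
// as the first command-line argument); these details are illustrative.
public class KmerCount {
    static final int K = 25;

    // reverse complement, so that both strands of each read are counted
    static String reverseComplement(String s) {
        StringBuilder rc = new StringBuilder(s.length());
        for (int i = s.length() - 1; i >= 0; i--) {
            switch (s.charAt(i)) {
                case 'A': rc.append('T'); break;
                case 'C': rc.append('G'); break;
                case 'G': rc.append('C'); break;
                case 'T': rc.append('A'); break;
                default:  rc.append('N'); break;
            }
        }
        return rc.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String read;
        while ((read = in.readLine()) != null) {
            // split both strands of the read into K-mers and count them
            for (String strand : new String[] { read, reverseComplement(read) }) {
                for (int i = 0; i + K <= strand.length(); i++) {
                    String kmer = strand.substring(i, i + K);
                    Integer c = counts.get(kmer);
                    counts.put(kmer, c == null ? 1 : c + 1);
                }
            }
        }
        in.close();
        // print the distribution of all K-mers
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}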
Conceptually, both tasks are done by splitting the input file among 100 machines, running the computation on the partitioned input in each node and combining the results. However, there is one major difference. For BLAST, the calculation on each gene is complex and CPU-intensive, but it does not require much RAM or disk space. Moreover, combining the results from 100 BLAST runs is straightforward. In the case of counting 25-mers, the task on each read is very simple, but the real challenge comes from doing it on a large data set. Moreover, the output from each node is a very large file. Therefore, sorting 100 such files and combining their results into one file is a major challenge. The second problem is exactly what Hadoop is designed for. This article will not go into the details of how Hadoop is implemented except at a conceptual level. Solving a problem with Hadoop requires two steps – (i) writing the code in Hadoop form, and (ii) finding a cluster of machines where the Hadoop architecture is implemented, and running the code there. Writing the code in Hadoop form means we have to partition the tasks into a Mapper and a Reducer function. You already know how to do this, although you have not heard the words map and reduce used explicitly. In terms of the above example – i) Map step – partitioning the input reads into 100 subsets and dividing the job of counting among 100 machines (nodes); ii) Reduce step – combining the sorted results from the 100 nodes into one file. If you want to use Hadoop to solve a problem, you have to come up with a way to split your large input data set among many machines, perform the task on each machine and combine the results. That is all there is to Hadoop. I hope the above description helps you get started with Hadoop. In the next article of this series, we will try a hands-on example and present the code for solving the problem described above.
In our previous commentary, we explained the advantages provided by Hadoop in distributed analysis of large data sets. Today we will work on a real-life problem, namely, finding all K-mers in a set of short reads. Being true to our style (standing on the shoulders of giants), we shall only explain things that are not explained elsewhere on the web by other experts. When others have done a better job than we ever could, we are content to provide links. Hadoop is a framework developed primarily for Java programmers. Those who are familiar with Java or C++ but have never used Hadoop will find the following example most useful. Others will learn the general concepts, but may have to do a bit of work if they want to port our code to their favorite language. It is possible to write Hadoop code in Python by using APIs developed for Python. Let us get started. Finding all K-mers from a large set of sequences using Hadoop requires two steps – (i) installing Hadoop or finding computers where Hadoop is already installed, and (ii) writing, compiling and running Java code for the problem at hand. If you follow the example presented here mechanically, you should be able to install and run the code in Hadoop in less than 30 minutes. Then the hard work is to understand all the steps. We tested our code in Windows (cygwin) and Linux (Fedora release 9, 64 bit), and encountered every possible pitfall a human being could come across. Hopefully, you will not make the same mistakes we did, but if you do, the 'Pitfalls' section in the next commentary will be helpful. The Hadoop documentation written by the Apache Foundation is another excellent source of information, and we urge you to read it. One point of caution – some examples provided in the official documentation are not updated for the latest version of Hadoop. We will mention those differences in our 'Pitfalls' section.
i) Installation of Hadoop

In our work, we used hadoop-0.20.2, which can be downloaded from here or here. We encountered some difficulties with a newer version – hadoop-0.20.203; please check the 'Pitfalls' section for details. Hadoop is meant to be run on a computer cluster in a distributed manner, but one can also set it up on a single node in standalone or pseudo-distributed mode. Our example will be based on a single-node standalone installation. Although single-node operation does not take any advantage of Hadoop, it is great for testing code and learning the general concepts, and the Java code itself need not be changed to run it on a distributed cluster. We expect that you will not have to install Hadoop on a cluster yourself, and that you will find help from either your local computer administrator or the Amazon cloud. If you do have to do a multi-node installation, the Apache documentation is an excellent place to get started. Please follow these steps for a standalone installation of Hadoop:

a) Make sure Java is installed properly on your machine by compiling ('javac') and running ('java') HelloWorldApp.java from here. You will see 'Hello World!' printed on your screen.

b) Download a stable version of Hadoop from an Apache Download Mirror. We used hadoop-0.20.2.tar.gz.

c) Uncompress and untar the hadoop-0.20.2.tar.gz file. Your installation is complete. No need to run 'configure', 'make', etc. It is that easy (thanks to Java).

d) At this point, you will have a 'hadoop-0.20.2' folder in your directory. Go inside the 'hadoop-0.20.2' folder and run 'ls'. You will see the files and folders listed below.
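The listing should look roughly like this (the exact set of files may differ slightly between builds):

bin  build.xml  c++  CHANGES.txt  conf  contrib  docs  hadoop-0.20.2-ant.jar  hadoop-0.20.2-core.jar  hadoop-0.20.2-examples.jar  hadoop-0.20.2-test.jar  hadoop-0.20.2-tools.jar  ivy  ivy.xml  lib  librecordio  LICENSE.txt  NOTICE.txt  README.txt  src  webapps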
The Hadoop executable is located in the 'bin' directory. The 'txt' files contain general information. The 'jar' files are compiled versions of various codes and libraries that we will use for testing and for building our own code. The 'lib' directory has additional libraries. The 'conf' directory has configuration files. The 'docs' directory has a copy of the Apache documentation that you will find useful if you are stuck in Yellowstone NP without any internet. The 'src' directory has various source codes. Among them, you can check 'src/examples' for many excellent Hadoop examples.

e) Edit the file conf/hadoop-env.sh to set 'JAVA_HOME' to the root of your Java installation. Alternatively, you can define JAVA_HOME as an environment variable in your .cshrc or .bashrc file.

f) From inside the 'hadoop-0.20.2' directory, run the commands shown below.
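These are essentially the standalone-mode quickstart commands, with 'This' used as the search pattern for the grep example (the exact copy command may vary; copying only the xml files from 'conf' also works):

mkdir input
cp conf/* input
bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'This'
cat output/*

The last command prints the result of the run, i.e., how many times 'This' appears in the input files.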
In the above steps, you executed an example provided by the Hadoop creators. You copied the contents of the 'conf' directory to 'input', then ran the example in the jar file 'hadoop-0.20.2-examples.jar' to count the occurrences of the word 'This' within all files of the 'input' directory. The output of your run is stored in the 'output' directory. Here is a term-by-term explanation of the command on the third line (bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'This'):

bin/hadoop – the hadoop executable
jar – option telling the hadoop executable that we are running a jar file
hadoop-0.20.2-examples.jar – the jar file being run
grep – the class within the jar file that is run
input – location of the input files
output – location of the output files

If we can create a jar file for our K-mer application, we can run it in the same manner to find all K-mers in a set of sequences. That is explained in the next section.
ii) Code for finding all K-mers from a short read library

If you have never used Hadoop, Hadoop code may appear daunting. We will make the task easy for you, and we suggest that you learn Hadoop coding in two steps. First, below we have included a Hadoop 0.20 template (borrowed from this link) that can be used for most simple codes. You can cut and paste the template and fill in a few sections with code for your favorite application; you will have to fill in the class names (1, 2, 3) and add code for the map and reduce functions. Second, the K-mer application that follows shows how to do that. We used the template and filled in the map and reduce functions. The map function goes through the sequences one by one and extracts all K-mers (K=10). The reduce function takes the outputs of the map function from various files as key-value pairs and adds up the counts.
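A minimal sketch of what such a Hadoop 0.20 (new API) template looks like is given below. Since the template we used was borrowed from an external page, this version is an approximation modeled on the standard WordCount skeleton; the names MyApp, MyMapper and MyReducer are placeholders that you replace with your own class names.

// Hadoop 0.20 template (new "mapreduce" API). Fill in the class names
// marked (1), (2), (3) and the bodies of map() and reduce().
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyApp {                                          // (1) driver class

    public static class MyMapper                              // (2) mapper class
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // your map code goes here: read 'value' and emit key-value
            // pairs with context.write(...)
        }
    }

    public static class MyReducer                             // (3) reducer class
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // your reduce code goes here: combine all 'values' seen for 'key'
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "my app");
        job.setJarByClass(MyApp.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Filling in the class names and the bodies of map and reduce for the K-mer problem gives a sketch along the following lines. The class name Kmer, the hard-coded K = 10 and the assumption that the input contains one sequence per line are illustrative choices.

// K-mer counting with Hadoop 0.20: the mapper emits (K-mer, 1) for every
// K-mer in a sequence, and the reducer adds up the counts for each K-mer.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Kmer {

    public static class KmerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int K = 10;                  // K-mer length
        private final static IntWritable one = new IntWritable(1);
        private final Text kmer = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // 'value' is one line of input, assumed to be one sequence
            String seq = value.toString().trim();
            for (int i = 0; i + K <= seq.length(); i++) {
                kmer.set(seq.substring(i, i + K));        // extract one K-mer
                context.write(kmer, one);                 // emit (K-mer, 1)
            }
        }
    }

    public static class KmerReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {              // add up counts for this K-mer
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);                   // emit (K-mer, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "kmer count");
        job.setJarByClass(Kmer.class);
        job.setMapperClass(KmerMapper.class);
        job.setCombinerClass(KmerReducer.class);          // local pre-aggregation
        job.setReducerClass(KmerReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

To compile and run it on the standalone installation described above (from inside the hadoop-0.20.2 directory, with your sequence files placed in an 'input' directory), the usual sequence of commands is along these lines:

mkdir kmer_classes
javac -classpath hadoop-0.20.2-core.jar -d kmer_classes Kmer.java
jar -cvf kmer.jar -C kmer_classes/ .
bin/hadoop jar kmer.jar Kmer input output

The last command has exactly the structure explained in the previous section: the jar file, the class to run, the location of the input files and the location of the output files.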