Must-have Tools for a Bioinformatician
A friend of mine got so overwhelmed after receiving seven large hiSEQ libraries (to be assembled in a week) that she deleted everything from her hard-drive. Trying to show the bright side of things, I told her that this would be a great opportunity to clean all clutter from her server (done :) ) and install only the top quality programs that she will use again and again. Can you suggest some programs that you use very frequently? Here is the list I came up with. Please feel free to add.
A. Sequence Search
Sequence search is one of the tasks bioinformaticians do day in and out. The purpose of those searches vary. Let me list a few here.
(i) You have a protein sequence and you want to find out whether it matches any other protein that people have already studied.
(ii) You have a large library (100M) of Illumina reads from human genome and you like to match them against the reference genome to find SNPs.
(iii) You have a long EST from an Arabidopsis sample and you want to find its coordinates on the Arabidopsis genome.
(iv) You have a peptide sequence from mouse, but do not know its gene sequence. You like to find out the gene sequence by searching the peptide against the mouse genome.
You can see that those four examples need different capabilities from the search program and your computer. For example, the second one can assume that the matches will be near exact, and use that information to speed up the search process. On the other hand, the first example cannot make that compromise, but speed may not be an issue, when you are searching for only one or few proteins instead of 100M.
A bioinformatician needs to have many types of search programs to fit various purposes. Here is the core set I recommend -
NCBI BLAST
Purpose
This is the all-purpose sequence homlogy search program that every biologist is familiar with. One can use BLAST to search nucleotide sequence against protein database, nucleotide against nucleotide database, peptide against nucleotide database or peptide against protein database.
Download from
Installation
Installation is easy. Download the right executable for your type of computer and nothing else needs to be done other than setting path.
Common commands
preparing database: formatdb -i [FASTA reference dbname] -p F -o T
nucl against nucl: blastall -i [FASTA seq] -d [ref dbname] -p blastn -e 1e-10 -o [output]
nucl against nucl: blastall -i [FASTA seq] -d [ref dbname] -p blastp -e 1e-10 -o [output]
nucl against nucl: blastall -i [FASTA seq] -d [ref dbname] -p blastn -e 1e-10 -o [output]
nucl against nucl: blastall -i [FASTA seq] -d [ref dbname] -p blastn -e 1e-10 -o [output]
Online references
Weakness
BLAST is too slow for searching sequences that map exactly on a genome, and especially for mapping large libraries of short reads on a reference genome.
Alternative - HMMER (claims to be far more accurate than BLAST or FASTA for protein homology search), FASTA. Also note that many variations of BLAST exist, such as megaBLAST, WU-BLAST, etc.
BLAT
Purpose
BLAT is used for mapping a long sequence that matches a genome near exactly.
Download from
Installation
Installation is easy. Download the right executable for your type of computer and nothing else needs to be done other than setting path.
Common command
blat [database] [query] [output]
Online references
Weakness
BLAT is best for mapping ESTs on a reference genome, and it reports all splice junctions properly. Don’t use it for any other purpose.
Alternative - exonerate from EBI.
Bowtie/Tophat/Cufflink
Purpose
Quick mapping of large number of short read sequences on a reference genome, and constructing gene sequences.
Download from
Installation
Installation is easy. Download the right executable for your type of computer and nothing else needs to be done other than setting path.
Common command
building index: bowtie-build [reference] [index]
unpaired reads: bowtie [index] [read file] [output]
paired reads: bowtie [index] -1 [left read] -2 [right read] [output]
Online references
Weakness
Primarily for short reads. Requires near perfect match.
Alternatives - BWA, MAQ, Shrimp.
B. Multiple Sequence Alignment
Clustal (clustalw2)
Purpose
Clustal is perfect for aligning large number of similar sequences (nucleotide or protein) and find their common segments.
Download from
Installation
Installation is easy. Download the right executable for your type of computer and nothing else needs to be done other than setting path.
Common command
clustalw2 [FASTA sequence]
Online references
Weakness
Performs poorly, if any sequence has large N block.
Alternative - BLAST.
C. SNP Analysis
Samtools
Purpose
Samtools is the most efficient program to store and access large number of short read alignments. SAM format has almost become a standard now.
Download from
Installation
Installation is easy. Download the right executable for your type of computer and nothing else needs to be done other than setting path.
Common command
Check the synopsis section at the top of this page.
Online references
Weakness
Alternative - SOAPsnp.
D. Assembly
CAP3
Purpose
A very versatile assembly program, when the number of reads is not very high.
Download from
Installation
Installation is easy. Download the right executable for your type of computer. That is all.
Common command
cap3 [fasta file]
Online references
Weakness
This is an overlap and extend type of program. De Bruijn assemblers are better for large library of short reads.
Newbler
Among all programs listed here, Newbler is the only one I never used. However, everyone using 454 data recommends it. The software comes from Roche and here is how you get it in various countries of the world. Source code is proprietary and it is not a de Bruijn assembler.
SOAPdenovo
Purpose
One of the best de Bruijn assemblers around, both in terms of performance and memory requirement.
Download from
Installation
Installation is easy. You can download the executable from their website.
Common command
Check here.
Online references
Weakness
Source code unavailable.
Velvet
Purpose
De Bruijn assembler for genomic data. Works great for color space.
Download from
Installation
‘make’ or ‘make color’
Common command
velveth out 21 -shortPaired reads.fa
velvetg out
Online references
Manual. Also Velvet mailing list is very active.
Weakness
Requires large RAM. Also, it is not good for transcriptome data, unless you also use Oases.
Oases
Purpose
De Bruijn assembler for transcriptome that works on Velvet output.
Download from
Installation
make ‘VELVET_DIR=/path/to/velvet’ [check manual for color space installation]
Common command
velveth out 21 -shortPaired reads.fa
velvetg out -read_trkg yes
oases out -ins_length 200
Online references -
UCSC. Also, Oases mailing list is very active.
Weakness
Needs even more RAM than Velvet.
Trinity
Purpose
De Bruijn assembler for transcriptome.
Download from
sourceforge website for trinity
Installation
You need Java in your machine. C/C++ part needs to be compiled with ‘make’. Java part is already compiled.
Common command
Trinity.sh –seqType fq –left l.fq –right r.fq –output outdir
Online references
Weakness
Inchworm step is very slow.
Alternatives - Abyss (parallel assembler for MPI machines), Contrail (parallel assembler for hadoop), Euler, ALLpaths- LG.
E. Protein Function Analysis
BLAST against NCBI NR
Interpro scan
F. Statistical Analysis
R
G. Microarray data
Bioconductor module in R
limma package
Common commands: lmFit, eBayes, topTable
H. Sharing and Visualization
IGV
I. Colorspace
ABI Color space library
We covered all of four above topics in various posts here.
J. Bonus tools
ssaha2
hmmer
Mummer
Stampy
repeatmasker
Hadoop
-———————-
Here are few places I go to to find information on latest software packages -
ii) Bioinformatics tools at EBI
iii) List of all resources at NCBI, and sub list of bioinformatic tools.
iv) Computational tools at BROAD institute