Simple Examples to Learn Bioinformatics Programming

Simple Examples to Learn Bioinformatics Programming

C. Titus Brown asked in his blog - What’s the equivalent of prime number calculations for biologists learning to program?. New students of bioinformatics should check his commentary and responses to find interesting ideas.

If we had to design a course on programming for biologists, it will have the following progression.

‘Hello world’

It will start with GC-counting of a genome as ‘Hello World’ example.

Warm up questions

b) Early sessions will have the following examples -

i) Writing code to extract nucleotides 12000-14500 on chromosome 3 from human genome,

ii) Finding reverse complement of a sequence,

iii) Translating a nucleotide sequence in six frames and finding longest peptide,

‘Prime number’

Counting k-mers in sequences is equivalent to identifying all prime numbers up to 10 million in C-programming. Why do we say so? Anyone can write a lousy program for k-mer counting or prime number counting that does not scale well. So, both examples are good for teaching the differences between O(N), O(N logN), O(N^2) types of algorithms.

For the above reason, when we first wrote about Hadoop, we used the k-mer counting example. Here we described other efficient methods for counting k-mers. The good thing about k-mer counting is that the problem itself is so easy to state that it can be used to explain aspects of hardware, memory and hashing.

‘search/sort algorithms’

In our first course of programming in computer science classes, we started with simple sort algorithms like bubble-sort, quick-sort, etc.

For bioinformaticians, the equivalent topics are Smith-Waterman alignment and various other alignment programs.

‘graph algorithms’

After mastering search and sort, we learned few graph-algorithms. If you read our blog for the last few months, you know what the bioinformatics equivalent would be.

Use of bio libraries**

We will bring the bioperl, biopython, etc. libraries at the end. Only after someone appreciates how time-consuming it is to write code for simple-sounding tasks (finding every match from a BLAST output), it is good to learn that others already solved the problems.


Question for readers:

If you want to estimate GC in a 300 million nucleotide genome and computing is so expensive that checking each nucleotide costs $3, what is the most logical solution?

Written by M. //