Tutorials

Enjoy This Site? Join Our Remote R/Bioinformatics Classes

Note: These tutorials are incomplete. More complete versions are being made available for our members. Sign up for free.

The Mapping Problem

Whether you are searching for words within a text document or nucleotides within a reference sequence, the mapping problem comes with three levels of difficulties.

Perfect match

Suppose you have a large document and you like to determine the number of times the word ‘analysis’ appears in the text. You have seen such search capabilies implemented in ‘search and replace’ tools of popular word-processors. Here is a good example of perfect search in the biological context. Suppose a researcher independently sequenced a human gene and is interested to find out its location within the human genome. In the first order, a perfect search algorithm matching pieces of the gene with the genome will work. As another example, suppose a bioinformatician likes to find out the frequency of the sequence motif ATGTCA within 100 Kb of all transcription factors. A perfect search algorithm can be applied in his case. Perfect match problem has becomes even more important in the next-generation sequencing (NGS) era, because an overwhelming majority of short reads generated by NGS match the reference genome perfectly. Tabulating their relative mapping frequencies is helpful for downstream statistical analysis in the first order.

Imperfect match

‘Analyse’ and ‘analyze’ are two spelling forms of the same word, and it is often desirable to look for both variations in a document. By counting leading capital and small letters, the number of possibilities go up to four. In the biologial context, imperfect matches are useful to account for sequencing errors as well as naturally-occuring mutations. The degree of variation is an important factor for the search algorithms. Sequencing errors typically lead to at most one or two nucleotide changes within high-quality sequences, and it is enough to restrict for such small changes. When a researcher is looking for the ortholog of a human gene within the mouse genome, he will have to account for larger degree of variation. All such cases require implementation of imperfect match algorithms.

Gapped match

Americans spell the English word ‘colour’ as ‘color’. A document search algorithm to identify both of them not only needs to include mismatches, but also the presence of gaps. Gapped search algorithms find extensive use in biology as well, because nucleotide insertions and deletions are quite common in the evolutionary context. Gapped searches are useful for technological reasons as well. The sequencing instruments using PacBio technology provide long reads, but insertion-deletion is the primary mode of error in them.