
Advanced Hashing

Cuckoo hashing

The basic idea is to use two hash functions instead of only one. This provides two possible locations in the hash table for each key. In one commonly used variant of the algorithm, the hash table is split into two smaller tables of equal size, and each hash function provides an index into one of these two tables.

"When a new key is inserted, a greedy algorithm is used: The new key is inserted in one of its two possible locations, 'kicking out', that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there, until a vacant position is found, or the procedure enters an infinite loop. In the latter case, the hash table is rebuilt in-place using new hash functions: There is no need to allocate new tables for the rehashing: We may simply run through the tables to delete and perform the usual insertion procedure on all keys found not to be at their intended position in the table." —Pagh & Rodler, "Cuckoo Hashing"[1]

Lookup requires inspecting just two locations in the hash table, which takes constant time in the worst case (see Big O notation). This is in contrast to many other hash table algorithms, which may not have a constant worst-case bound on the time to do a lookup.

Bloom filters (discussed here and here) also use multiple hashes, but that is the only similarity between the two methods. Here is a good discussion of their differences -

"Which do you prefer, wine or cheese? A Bloom filter is for when you have limited space, high query cost, and mostly negative queries. In that case, a BF with 8 bits per key and 4 hash functions gives you a 2.5% false positive rate; you process queries nearly 40 times faster than before, at the cost of 1 byte per key. On the other hand, if any of the previous conditions do not hold, a hash table acting as a cache makes sense, though it obviously will take a lot more than one byte per entry. You can even skip over the hard edge cases of cuckoo hashing if it's a cache. That also makes the size-increase problems of cuckoo hash tables (or anything other than linear hash) moot. my2c"
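To make the greedy "kicking out" procedure concrete, here is a minimal Python sketch. It is illustrative only: the class name CuckooHash, the max_kicks cutoff, and the simple tuple-based hash functions are our own assumptions, standing in for the independent hash functions a real implementation would use.

import random

class CuckooHash:
    def __init__(self, size=11, max_kicks=50):
        self.size = size
        self.max_kicks = max_kicks          # give up after this many displacements
        self.tables = [[None] * size, [None] * size]

    def _hash(self, key, which):
        # Two cheap, distinct hash functions; a real implementation would
        # pick independent hashes and draw fresh ones when rehashing.
        return hash((which, key)) % self.size

    def lookup(self, key):
        # Worst-case constant time: only two slots can ever hold the key.
        return (self.tables[0][self._hash(key, 0)] == key or
                self.tables[1][self._hash(key, 1)] == key)

    def insert(self, key):
        if self.lookup(key):
            return True
        which = 0
        for _ in range(self.max_kicks):
            idx = self._hash(key, which)
            if self.tables[which][idx] is None:
                self.tables[which][idx] = key
                return True
            # Kick out the resident key and try it in its other table.
            key, self.tables[which][idx] = self.tables[which][idx], key
            which = 1 - which
        return False    # likely a cycle: rebuild with new hash functions

An insertion that returns False corresponds to the "infinite loop" case in the quote above, where the table is rebuilt with new hash functions.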

Perfect Hash

In a regular hashing scheme, the hash functions are fixed and the data is dynamic. For an NGS library, we know beforehand which k-mers will go into the hash, because we can run a k-mer counting step before assembly. So the set of k-mers going into the hash table is fixed. Can we choose the hash functions dynamically to avoid all collisions?

The scheme works in the following manner. First, a hash function is chosen, all k-mers are run through it, and those k-mers not hitting a collision are used up (allocated). Then a second hash function is chosen, all remaining k-mers are run through it, and those not hitting a collision are allocated. The steps continue with a third, fourth, fifth hash function, and so on, until all k-mers are allocated. If that point is reached with, say, 10 hash functions, a 'perfect' hash built from those 10 functions will avoid all collisions for the fixed set of k-mers. Real simple, huh? [To be continued]
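Here is a minimal Python sketch of that layered construction, under our own assumptions: the function names and the seeded hash((seed, kmer)) stand-ins are illustrative, not taken from any particular assembler.

import random

def build_layered_hash(kmers, table_size, max_layers=32):
    layers = []                      # one (seed, table) pair per hash function
    remaining = list(kmers)
    while remaining:
        if len(layers) == max_layers:
            raise RuntimeError("too many layers; try a larger table_size")
        seed = random.randrange(2**32)
        slots = {}
        for kmer in remaining:
            slots.setdefault(hash((seed, kmer)) % table_size, []).append(kmer)
        table = [None] * table_size
        remaining = []
        for slot, bucket in slots.items():
            if len(bucket) == 1:     # no collision: allocate this k-mer
                table[slot] = bucket[0]
            else:                    # collision: defer to the next hash function
                remaining.extend(bucket)
        layers.append((seed, table))
    return layers

def lookup(layers, kmer, table_size):
    # At most len(layers) probes, with no collisions among the fixed k-mer set.
    return any(table[hash((seed, kmer)) % table_size] == kmer
               for seed, table in layers)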

Google sparsehash

In chapter 2, we showed the mathematically complete de Bruijn graphs for any k-mer size. Typical de Bruijn graphs of an NGS library cover only a tiny subspace of the entire k-mer space. Therefore, it is not necessary to store all permutations of k-mers, and hashes turn out to be the best data structures for storing de Bruijn graphs for assembly. The Velvet assembler, which was the first de Bruijn graph-based program for NGS data, used its own hashing approach for storing the graph. The ABySS assembler instead used Google sparsehash, an extremely memory-efficient hash implementation with only 2 bits/entry overhead, and managed to reduce overall memory utilization by a significant amount.

Linear Probing

The topics covered today are not directly related to any specific bioinformatics algorithm, but given their widespread use in non-bioinformatics applications, it is only a matter of time before someone finds a use for them in the bioinformatics world. Together with Bloom filters and cuckoo hashing, they form a complete toolbox of hashing techniques.

Linear probing is an old scheme for filling up buckets when a collision occurs in a hash table. The idea behind it is very simple: when a collision occurs, check the next slot in the hash table, then the one after that, and so on. One can also generalize the concept of 'next' by using a stepsize.

"Linear probing is a scheme in computer programming for resolving hash collisions of values of hash functions by sequentially searching the hash table for a free location.[1] This is accomplished using two values - one as a starting value and one as an interval between successive values in modular arithmetic. The second value, which is the same for all keys and known as the stepsize, is repeatedly added to the starting value until a free space is found, or the entire table is traversed. (In order to traverse the entire table the stepsize should be relatively prime to the arraysize, which is why the array size is often chosen to be a prime number.)

newLocation = (startingValue + stepSize) % arraySize"
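The probe sequence is easy to express in code. Below is a minimal Python sketch (the helper names are our own); note that the line above gives a single step of the iteration, which in practice is repeated until a free slot turns up:

def probe_sequence(starting_value, step_size, array_size):
    # With step_size relatively prime to array_size, this visits
    # every slot exactly once before wrapping around.
    location = starting_value % array_size
    for _ in range(array_size):
        yield location
        location = (location + step_size) % array_size

def insert(table, key, step_size=1):
    for idx in probe_sequence(hash(key), step_size, len(table)):
        if table[idx] is None or table[idx] == key:
            table[idx] = key        # claim the first free (or matching) slot
            return idx
    raise RuntimeError("hash table is full")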

Power of Two Choices

'Power of two choices' came from the 1996 PhD thesis of Michael David Mitzenmacher, and is applicable to load balancing.

"Suppose that n balls are placed into n bins, each ball being placed into a bin chosen independently and uniformly at random. Then, with high probability, the maximum load in any bin is approximately log n / log log n. Suppose instead that each ball is placed sequentially into the least full of d bins chosen independently and uniformly at random. It has recently been shown that the maximum load is then only log log n / log d + O(1) with high probability. Thus giving each ball two choices instead of just one leads to an exponential improvement in the maximum load. This result demonstrates the power of two choices, and it has several applications to load balancing in distributed systems."
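The effect is easy to verify by simulation. The small Python sketch below (our own, not from the thesis) throws n balls into n bins with d random choices per ball and reports the maximum load:

import random

def max_load(n, d):
    bins = [0] * n
    for _ in range(n):
        # Pick d bins uniformly at random; place the ball in the least full.
        choices = [random.randrange(n) for _ in range(d)]
        bins[min(choices, key=lambda i: bins[i])] += 1
    return max(bins)

# d = 1 grows like log n / log log n; d = 2 like log log n / log 2.
# for d in (1, 2): print(d, max_load(100000, d))

Running this for large n shows the gap plainly: the d = 2 maximum load stays in the low single digits while the d = 1 load grows noticeably faster.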

Here are two useful links, if you want to learn more -

The Power of Two Choices – Benny van Houdt
The Power of Two Random Choices: A Survey of Techniques and Results – Michael Mitzenmacher, Andrea W. Richa

Skip List

http://www.homolog.us/blogs/blog/2013/03/18/skip-lists-and-other-efficient-data-structures/

Sparse Arrays, Splay Trees, Hash Tables with Double Hashing

