Consistent Hashing

Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web

Something about hash functions #

The discussion of hash quality below is credited to the blog post introducing the F14 hash table.

Some hash functions distribute information evenly across all their bits — any change to the input causes an avalanche of changes to the hash code. Ideally, any one-bit change in the input is expected to change about half the bits of the output. These hash functions can be mapped onto a power-of-two range by just zeroing the top bits. For example, we might map hash codes across 32 chunks by zeroing all but the bottom five bits.

At the other end of the spectrum are hash functions like std::hash for integral types, which is often the identity function. It meets the C++ standard’s requirements for hash quality (hash codes are very unlikely to be equal unless the inputs are also equal), but it doesn’t spread information at all. If we just took the bottom five bits, for example, then we would experience a large number of collisions if all the integer keys were multiples of 32. If the hash function doesn’t distribute information, then F14 must postprocess the hash code with a bit mixer.
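To make this concrete, here is a minimal sketch (not taken from the F14 post) of what a bit mixer buys you. The mixer shown is the MurmurHash3 64-bit finalizer, used purely as an example; the input hash codes are identity-style values that are all multiples of 32, so masking the bottom five bits directly puts every key in chunk 0, while mixing first spreads them out.

```cpp
#include <cstdint>
#include <cstdio>

// MurmurHash3's 64-bit finalizer: a bit mixer in which every input bit
// influences every output bit, so keeping only the bottom bits is safe.
uint64_t mix64(uint64_t x) {
  x ^= x >> 33;
  x *= 0xff51afd7ed558ccdULL;
  x ^= x >> 33;
  x *= 0xc4ceb9fe1a85ec53ULL;
  x ^= x >> 33;
  return x;
}

int main() {
  for (uint64_t key = 0; key < 5; ++key) {
    uint64_t h = key * 32;  // stand-in for an identity hash, all multiples of 32
    std::printf("key %llu: raw chunk %llu, mixed chunk %llu\n",
                (unsigned long long)key,
                (unsigned long long)(h & 31),          // always 0: every key collides
                (unsigned long long)(mix64(h) & 31));  // spread over 0..31
  }
}
```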

Introduction #

The paper contains a lot of mathematics, but here we focus on the definition and implementation of consistent hashing. We skip the mathematical details and theorem proofs and concentrate on the paper’s core contributions.

Definitions #

  • Balance: items are spread evenly over the buckets, so with high probability each bucket receives roughly its fair share of the items.
  • Monotonicity: if items are initially assigned to a set of buckets \(\mathcal{V}_1\) and then new buckets are added to form \(\mathcal{V}_2\), an item may move from an old bucket to a new bucket, but never from one old bucket to another old bucket.
  • Spread: different clients may each see a different subset of the buckets (a different view). Spread means that, across all these views, a given item is mapped to only a small number of distinct buckets, so clients agree on where an item lives with high probability.
  • Load: the counterpart of spread seen from the bucket’s side: across all views, a given bucket is assigned only a small number of distinct items, so no single bucket is overloaded.

Implementation #

The implementation is simple; the proofs are hard. If you want the proofs, read the paper. Here we walk through the implementation.

Let’s say we have two random functions \(r_{\mathcal{B}}\) and \(r_{\mathcal{I}}\) (in industry a hash function such as MurmurHash usually plays this role). The function \(r_{\mathcal{B}}\) maps buckets randomly to the unit interval, and \(r_{\mathcal{I}}\) does the same for items. Consistent hashing is then the function that assigns each item \(i\) to the bucket whose point is closest to \(r_{\mathcal{I}}(i)\) on the interval. In industry, each bucket is usually mapped to multiple points (“virtual nodes”) by hashing the bucket name concatenated with a suffix.
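A minimal sketch of this construction in C++, assuming std::hash<std::string> as a stand-in for \(r_{\mathcal{B}}\) and \(r_{\mathcal{I}}\) (the paper assumes truly random functions; a real system would use something like MurmurHash). It uses the common variant that assigns an item to the next bucket point clockwise rather than the strictly closest one, and the numeric suffix per bucket is the virtual-node trick mentioned above.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <string>

// A minimal consistent-hash ring. The 64-bit hash range plays the role of
// the unit interval, and std::hash<std::string> stands in for the random
// functions r_B and r_I.
class ConsistentHashRing {
 public:
  explicit ConsistentHashRing(int replicas = 100) : replicas_(replicas) {}

  // Place `replicas_` points ("virtual nodes") for this bucket on the ring
  // by hashing the bucket name concatenated with a numeric suffix.
  void AddBucket(const std::string& bucket) {
    for (int i = 0; i < replicas_; ++i) {
      ring_[Hash(bucket + "#" + std::to_string(i))] = bucket;
    }
  }

  void RemoveBucket(const std::string& bucket) {
    for (int i = 0; i < replicas_; ++i) {
      ring_.erase(Hash(bucket + "#" + std::to_string(i)));
    }
  }

  // An item is assigned to the first bucket point at or after its own hash,
  // wrapping around to the beginning of the ring if necessary.
  std::optional<std::string> GetBucket(const std::string& item) const {
    if (ring_.empty()) return std::nullopt;
    auto it = ring_.lower_bound(Hash(item));
    if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
    return it->second;
  }

 private:
  static uint64_t Hash(const std::string& s) {
    return std::hash<std::string>{}(s);
  }

  int replicas_;
  std::map<uint64_t, std::string> ring_;  // point on the ring -> bucket name
};
```

Because the points live in an ordered std::map, a lookup is a single lower_bound in O(log n). When a bucket is added, only items whose points fall just before the new bucket’s points change assignment, and they move to the new bucket, which is the monotonicity property in action.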

Thoughts #

  • What is the most innovative part of consistent hashing?
    • I would say the main idea of consistent hashing is to use random mapping functions to place both buckets and items on the unit interval. Because an item’s position does not depend on which buckets exist, adding or removing a bucket only moves the items that land near that bucket’s points; most items keep their assignment.
  • Can we make any improvements on consistent hashing?
    • Yes. Google later published jump consistent hash, which computes an item’s bucket directly, with no ring data structure, making lookups faster and more memory-efficient (see the sketch after this list).
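For reference, jump consistent hash is short enough to sketch in full. The code below is adapted from the Lamping and Veach paper (“A Fast, Minimal Memory, Consistent Hash Algorithm”); it needs no ring data structure at all, but in exchange buckets must be numbered 0 .. num_buckets - 1 and can only be added or removed at the high end.

```cpp
#include <cstdint>

// Jump consistent hash, adapted from Lamping & Veach. Maps a 64-bit key to
// one of num_buckets buckets without storing any ring state.
int32_t JumpConsistentHash(uint64_t key, int32_t num_buckets) {
  int64_t b = -1, j = 0;
  while (j < num_buckets) {
    b = j;
    // Advance a linear congruential generator seeded by the key, then decide
    // how far forward the key "jumps" among the buckets.
    key = key * 2862933555777941757ULL + 1;
    j = static_cast<int64_t>(
        (b + 1) * (static_cast<double>(1LL << 31) /
                   static_cast<double>((key >> 33) + 1)));
  }
  return static_cast<int32_t>(b);
}
```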