
Per-Key Actor Epochs In Riak

Last time I wrote about how Dotted Version Vectors fix Sibling Explosion. Sibling Explosion was a relatively common bug, often seen in the wild. This post is about a subtle bug that manifests in a number of different ways, all with the same result: silent data loss. Don't panic: we believe this to be a very uncommon edge case, and it is now fixed in Riak >= 2.1.

In order for the bug to occur we need a complex interplay of features, so this post will have to cover Deletes, Read Repair and Handoff in Riak.

The bug itself is closely related to the Read Your Own Writes requirement for Client Version Vectors, so if you haven't read part one, now is a good time to do so.

Why RYOW, again?

If you recall, an actor in a version vector must increment its counter each time it updates a value. If an actor cannot read its own latest update in the version vector, it cannot issue a new event that is certain to be larger than the last. The first post in this series showed how this can lead to data loss. But how can some vnode (acting as the actor) fail to read its own last write? Isn't a vnode where the database stores data? The simplest answer is: deletes.
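
To make the requirement concrete, here is a minimal sketch, not Riak's internal API, of how an actor updates a version vector. The module and function names are illustrative, and lower-case atoms stand in for the actor IDs used in the rest of this post.

    -module(vv_increment).
    -export([increment/2]).

    %% VV is a version vector: a list of {Actor, Counter} pairs. The actor
    %% must be able to read its current entry before issuing a new event,
    %% otherwise the new counter may not be larger than the last one issued.
    increment(Actor, VV) ->
        Counter = proplists:get_value(Actor, VV, 0),
        lists:keystore(Actor, 1, VV, {Actor, Counter + 1}).

For example, increment(a, [{a, 2}]) yields [{a, 3}]; but if actor a cannot see its last write and reads an empty clock, the same call yields [{a, 1}], an event it has already issued, which is exactly the failure described below.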

Deletes are hard

Before moving on we need at least a passing understanding of how deletes in Riak work. Deletes are hard in an eventually consistent distributed system. Riak allows concurrent writes to the same value, and it also allows a value to be concurrently deleted and updated. Which operation "wins"? As with concurrent updates to a key, version vectors arbitrate. You can read more about deletes in the Riak documentation, but this post will cover what is needed to understand the bug.

Tombstones

Rather than physically deleting a key, we can logically delete it. We read the key to get its version vector, and write a special tombstone value back to Riak with that version vector as the context. A delete then behaves like any other write. If the key was concurrently updated, the version vector detects this, and the tombstone and updated value become siblings. If there was no concurrent update, only the tombstone is recorded on disk. A request to read a key that only has a tombstone value will return not_found, as Riak understands tombstones to mean logical deletion. A client can also ask Riak for the "deleted vclock" on a get, to ensure that a new write supersedes a tombstone.
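
As a toy illustration only (this is not Riak's code, and the module name is mine): a delete can be modelled as an ordinary write of a special tombstone marker carrying the version vector from the preceding read, and a read that finds only a tombstone reports not_found.

    -module(tombstone_sketch).
    -export([delete/1, read/1]).

    %% The preceding GET returned {Value, VV}; the delete reuses VV as the
    %% causal context for a write of the tombstone marker.
    delete({_OldValue, VV}) ->
        {tombstone, VV}.

    %% Reading a key whose only value is a tombstone is reported as not_found.
    read({tombstone, _VV}) -> not_found;
    read({Value, _VV})     -> {ok, Value}.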

Reaping: I want my disk space back!

Disks are cheap and disks are big, but they're neither free nor infinitely large, and customers would like to see the deletion of values reflected in reclaimed disk space (I know! Crazy!) Even though a tombstone is just a version vector and a small special value, actual removal of the key data is a requirement. The process by which Riak reclaims space is called tombstone reaping. It's reasonably complex:

  1. Client issues delete command
  2. Riak writes special tombstone with Version Vector
  3. Riak internally performs a GET on the key, with the quorum value set to all (which means "ask all the replicas for the value, please.")
  4. If the result of the get is a tombstone AND all vnodes that reply are primaries, the vnodes are told they may physically remove the key
  5. The vnode will read the key, check that it is a tombstone, check the delete_mode, and finally delete the tombstone.

Step 4 above amounts to Riak requiring that all primary replicas unanimously agree on a tombstone value before issuing the final delete that reaps the tombstone.
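
A hedged sketch of that unanimity check (the module and the shape of the replies are mine, not riak_kv's): the reap may only proceed if every vnode that answered the internal quorum-all GET is a primary and holds the tombstone.

    -module(reap_sketch).
    -export([may_reap/1]).

    %% Replies is one {PrimaryOrFallback, Value} pair per vnode that
    %% answered the internal quorum-all GET.
    may_reap([]) ->
        false;
    may_reap(Replies) ->
        lists:all(fun({primary, tombstone}) -> true;
                     (_NotPrimaryTombstone) -> false
                  end, Replies).

A single fallback reply, or a single non-tombstone value, is enough to abort the reap.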

What is the delete_mode mentioned in step 5? It is a setting that can take one of the following values:

  • keep - never delete the tombstone
  • immediate - remove the tombstone at once
  • integer - remove the tombstone after a delay, e.g. 3 seconds

Read the documentation for more details. I'll cover these settings as they pertain to the bug later.
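
For illustration only, and hedged against version differences (check the docs for your release): the setting is typically placed in the riak_kv section of advanced.config, with the integer form given in milliseconds.

    [
     {riak_kv, [
         %% delete_mode: keep | immediate | an integer delay in milliseconds
         {delete_mode, 3000}
     ]}
    ].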

Clearly deletes are hard. The summary, though, is that Riak needs to see all primary replicas agreeing on a logical delete before a physical delete is considered.

Read Repair

As mentioned in the introduction, the process of Read Repair has a part to play.

Read repair is an opportunistic anti-entropy mechanism. When a client reads data from Riak, each replica responds with its local copy. Using the Version Vector, Riak can detect when some replica is behind, or in conflict, and send the most up-to-date value to that replica. The replica then stores the correct value. It is a curious fact that all the replying vnodes may need read repair, since the "most up-to-date value" could be the result of merging all the responses. Imagine, for example, 3 clients each writing a different value, concurrently, to 3 vnodes.
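
Here is a minimal sketch of that logic, not riak_kv's implementation (module and function names are mine): merge the clocks returned by the replicas, then repair every replica whose own clock does not already cover the merged clock.

    -module(read_repair_sketch).
    -export([stale_replicas/1]).

    %% Merge two version vectors by taking the per-actor maximum.
    merge(VVa, VVb) ->
        Actors = lists:usort([A || {A, _} <- VVa ++ VVb]),
        [{A, max(proplists:get_value(A, VVa, 0),
                 proplists:get_value(A, VVb, 0))} || A <- Actors].

    %% true when VVa has seen every event that VVb has seen.
    descends(VVa, VVb) ->
        lists:all(fun({Actor, Cb}) ->
                          proplists:get_value(Actor, VVa, 0) >= Cb
                  end, VVb).

    %% Replies is a list of {ReplicaId, VV}; returns the replicas whose
    %% clocks are behind the merged, most up-to-date clock.
    stale_replicas(Replies) ->
        Merged = lists:foldl(fun({_Id, VV}, Acc) -> merge(VV, Acc) end,
                             [], Replies),
        [Id || {Id, VV} <- Replies, not descends(VV, Merged)].

With three concurrent single-actor writes, say [{a, 1}], [{b, 1}] and [{c, 1}], the merged clock is [{a, 1}, {b, 1}, {c, 1}] and stale_replicas/1 reports all three replicas, which is the "curious fact" mentioned above.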

Hand Off

Hand Off is a process that restores data after some failure or network partition.

If the node A that some key X should be stored on is unavailable when a client wishes to write X, some other node A' will step in and handle that write as a fallback. When node A is available again, the fallback node will hand off any data it stored. All this means is that node A' sends any data that it stored for node A back to node A. Of course, the magic of Version Vectors is how node A knows whether it should add the data from A' as a sibling, discard it, or overwrite its local data.
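
A hedged sketch of that decision, not the actual handoff code: compare the incoming object's clock with the local one and either discard the incoming object, let it overwrite, or keep both as siblings.

    -module(handoff_sketch).
    -export([decide/2]).

    %% true when VVa has seen every event that VVb has seen.
    descends(VVa, VVb) ->
        lists:all(fun({Actor, Cb}) ->
                          proplists:get_value(Actor, VVa, 0) >= Cb
                  end, VVb).

    %% Local and Incoming are {VersionVector, Value} pairs held by node A
    %% and handed off by node A' respectively.
    decide({LocalVV, _} = Local, {IncomingVV, _} = Incoming) ->
        case {descends(LocalVV, IncomingVV), descends(IncomingVV, LocalVV)} of
            {true, _}      -> {keep_local, Local};          %% already seen
            {false, true}  -> {overwrite, Incoming};        %% incoming is newer
            {false, false} -> {siblings, [Local, Incoming]} %% concurrent
        end.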

Doomstones

Finally we have all the knowledge we need to understand the issue.

The example I'm going to use involves deletes, read repair, and fallbacks. It's a reasonably complex example, but one that has been seen in the wild.

A client decides to delete key X. It reads key X and gets the Version Vector

    [{A, 2}]

Which means that actor A has issued two updates to X. The client sends the delete command and version vector to Riak. Riak creates a tombstone value and writes it. However, only primary nodes A and B are available; node C appears offline, maybe because of congestion at a network switch, or some other problem. Maybe an operator took C offline to replace a faulty NIC. Whatever the reason, node C' handles the write of the tombstone as a fallback. The client is notified that the delete succeeded, and its part is done.

Our cluster is in this state: the tombstone is on A, B, and C'. As stated above, Riak now performs a GET operation to see if all primaries unanimously agree on the tombstone value. By now C is back online and returns not_found. That's OK: Read Repair kicks in, and sends the tombstone to C.

Now our cluster has the tombstone on A, B, C, and C'. A read of key X occurs. The client receives a not_found, and since all nodes have the tombstone, and all are primaries, the reap logic is run.

Nodes A, B, and C remove the tombstone. A client writes a new value for X. Node A coordinates. The new value ends up with the Version Vector

    [{A, 1}]

This would be fine, except lingering on that fallback node C' is a tombstone with the version vector

    [{A, 2}]

Handoff kicks in, and C' sends its value to C. Node C looks at its local Version Vector for key X, sees that the incoming handoff value dominates it, and writes the tombstone.

Later, a read will cause read repair to spread the tombstone to nodes A and B. The tombstone looks like it is from a later causal time, but it's actually from an earlier time. The tombstone has managed to silently delete the new value of key X, which is bad.

This example is convoluted, but the essence of the problem is much like the RYOW problem from part 1. If a vnode forgets the version vector for a key (say by deleting the key), it re-issues an event it has already used, in this case {A, 1}.

In this case the answer could be as simple as using delete_mode = keep. But that's not the whole answer.

Fault Tolerant?

There are other ways for this issue to manifest. In the real world, disk errors occur. If your database is a replicated, fault tolerant database, and it gets an error from a disk, should it fail a write operation? When a Riak vnode can't read a local value for a key, it treats the value as not_found; after all, there are n replicas of the data, and a write shouldn't fail if one is lost.

Turns out this can be bad, too. Failing to read the local version vector, for whatever reason, leads to a new version vector being created for the key. Imagine some key X is on replicas A, B, and C with version vector

    [{A, 3}, {B, 2}, {C, 5}]

For whatever reason, a coordinating write on node A fails to read X. Solar flare, disk error, operator removes the data directory 'cos "3 replicas, it's OK!", anything. At this point we assign the Version Vector

    [{A, 1}]

To the new write. When it reaches replicas B and C, they will discard the new value as "already seen". Next, Read Repair will cause A to remove the value too. Same outcome (silent data loss), and from a similar cause: a vnode failing to read its own writes.

Epochs are the answer

As with so many things, the answer is conceptually very simple, but the engineering is a little harder.

In each of the cases above, and in other more complex manifestations of the issue, the problem arises when a local not_found leads to the creation of a new Version Vector while an older copy, whose Version Vector appears to be from a later logical time, still exists in the system.

One possible answer is to have an Epoch for each key. When key X is created the first time, call that "Epoch 1". Then, if it is later deleted and re-created, the new X will be in "Epoch 2". If some replica has key X in Epoch 1 and we try to compare it with Epoch 2, we can ensure that Epoch 1's Version Vector of [{A, 3}] does not dominate Epoch 2's Version Vector of [{A, 1}].
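
A toy illustration of that comparison, not Riak's actual representation: tag each object's clock with its epoch, and only let one clock dominate another within the same epoch.

    -module(key_epoch_sketch).
    -export([descends/2]).

    %% An object's clock is modelled as {Epoch, VersionVector}. Within an
    %% epoch the normal version vector rules apply; across different epochs
    %% neither clock is allowed to dominate the other.
    descends({Epoch, VVa}, {Epoch, VVb}) ->
        vv_descends(VVa, VVb);
    descends({_EpochA, _}, {_EpochB, _}) ->
        false.

    vv_descends(VVa, VVb) ->
        lists:all(fun({Actor, Cb}) ->
                          proplists:get_value(Actor, VVa, 0) >= Cb
                  end, VVb).

With this rule, descends({1, [{a, 3}]}, {2, [{a, 1}]}) is false, so the Epoch 1 clock can no longer silently swallow the Epoch 2 value.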

In practical terms this means that whenever some vnode gets a local not_found when coordinating a write, it creates a new epoch for that key. Why might a vnode get a local not_found? As above, maybe a delete, maybe disk error, maybe operator error. Maybe this key has never been written before. It doesn't matter. A local not_found means a new Epoch for the key.

Does Epoch Two Dominate Epoch One?

NO! We've already seen an example above where Epochs 1 and 2 contain siblings. So what do we do then? How do we merge a clock like

    [{A, 2}, {B, 4}]

From Epoch 1 with one like

    [{A, 1}]

From Epoch 2?

Per Key Actor Epochs

Instead of having an epoch per key, we have an actor epoch per key. Each vnode keeps a counter. Every time the vnode gets a local not_found when coordinating a write, it increments its counter and, for that key only, creates an actor ID from the pair of values (vnode ID, counter). This means that every time a key is (re)created it gets a unique actor ID. This turns the clocks above into the pair:

    [{A:1, 2}, {B, 4}]
    [{A:2, 1}]

This means we can merge the clocks into a single version vector, and we can treat the values assigned to each clock as concurrent. No more silent data loss. We treat the pair A:N as a single actor ID, so each vnode gets a per-key epoch ID without a general explosion in the number of actors in the system. This scheme isolates the growth of actors to individual keys.
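
To ground that claim, here is a small, self-contained demonstration using toy helpers, not riak_kv's code, with lower-case atoms standing in for the actor IDs ({a, 1} plays the role of A:1): with plain actor IDs the old clock [{a, 2}] dominates the new [{a, 1}], but with per-key epoch actors the two clocks are concurrent and still merge cleanly.

    -module(epoch_merge_demo).
    -export([demo/0]).

    descends(VVa, VVb) ->
        lists:all(fun({Actor, Cb}) ->
                          proplists:get_value(Actor, VVa, 0) >= Cb
                  end, VVb).

    concurrent(VVa, VVb) ->
        not descends(VVa, VVb) andalso not descends(VVb, VVa).

    merge(VVa, VVb) ->
        Actors = lists:usort([A || {A, _} <- VVa ++ VVb]),
        [{A, max(proplists:get_value(A, VVa, 0),
                 proplists:get_value(A, VVb, 0))} || A <- Actors].

    demo() ->
        %% without epochs the doomstone clock silently dominates the new write
        true = descends([{a, 2}], [{a, 1}]),
        %% with per-key actor epochs the two clocks are concurrent, so both
        %% values are kept as siblings instead of one silently winning
        Old = [{{a, 1}, 2}, {b, 4}],    %% [{A:1, 2}, {B, 4}]
        New = [{{a, 2}, 1}],            %% [{A:2, 1}]
        true = concurrent(Old, New),
        %% and they still merge into a single version vector
        merge(Old, New).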

In the tombstone example above, the new Version Vector for the re-created 'X' key would be [{A:2, 1}] and would therefore conflict with the "doomstone" version vector. This ensures the new value survives. Is this strictly correct? No! Ideally, in the "doomstone" case the new epoch would dominate, while in the failed-local-read case the new epoch would be concurrent. But in either case no data is lost with this scheme.

This scheme has the benefit of being entirely backwards compatible and requiring no logical changes elsewhere in Riak: all Version Vector logic remains the same.

Updating A Version Vector

How does vnode A update its entry in the Version Vector now? It finds the entry A:N with the highest N in the Version Vector and updates that.
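
A hedged sketch of that rule (names are mine, not riak_kv's code): pick out this vnode's epoch entries, take the one with the highest epoch, and increment it.

    -module(epoch_update_sketch).
    -export([increment/2]).

    %% VV entries are {Actor, Counter} where Actor is either a plain id or
    %% an {Id, Epoch} pair. Vnode A updates its entry with the highest epoch.
    increment(VnodeId, VV) ->
        Mine = [{Epoch, Counter} || {{Id, Epoch}, Counter} <- VV,
                                    Id =:= VnodeId],
        case lists:sort(Mine) of
            [] ->
                %% no entry at all: the vnode would mint a new epoch instead
                no_entry;
            Sorted ->
                {Epoch, Counter} = lists:last(Sorted),
                lists:keystore({VnodeId, Epoch}, 1, VV,
                               {{VnodeId, Epoch}, Counter + 1})
        end.

For example, increment(a, [{{a, 1}, 2}, {{a, 2}, 1}, {b, 4}]) updates the A:2 entry, giving [{{a, 1}, 2}, {{a, 2}, 2}, {b, 4}].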

Mechanics

It turns out keeping a simple, durable, strictly increasing counter is not so simple. For the scheme to work, no vnode id:counter pair can ever be used twice for a new key. Which means the counter must be durable. Durable data is expensive, as this talk illustrates. In Riak each vnode uses a leased counter approach. At a configurable interval each vnode will asynchronously flush a lease (say 10k) to disk, and will continue to increment its counter up to that ceiling. As the ceiling approaches, the vnode will store a new value (+10k) to disk, and continue to issue new writes. If the vnode is restarted or crashes between flushes, it reads the ceiling value as the starting point and creates a new lease from there. This ensures that a vnode never re-uses a counter value. The size of the lease, and therefore the frequency of flushing/leasing, is a configurable parameter.
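
A hedged sketch of the leased-counter idea (the module and record are mine, not riak_kv's implementation): hand out counter values from memory below a persisted ceiling, and extend the ceiling before it is exhausted.

    -module(leased_counter_sketch).
    -export([new/2, next/1]).

    -record(ctr, {current, ceiling, lease}).

    %% Persisted is the last ceiling that was flushed to disk (0 for a
    %% brand new vnode); LeaseSize is the configurable lease, e.g. 10000.
    new(Persisted, LeaseSize) ->
        #ctr{current = Persisted,
             ceiling = Persisted + LeaseSize,
             lease   = LeaseSize}.

    %% Hand out the next counter value. This sketch extends the lease
    %% synchronously when it runs out; the real vnode flushes a new ceiling
    %% asynchronously, ahead of need, and persists it before relying on it.
    next(#ctr{current = Cur, ceiling = Ceil, lease = Lease} = C)
      when Cur + 1 >= Ceil ->
        {Cur + 1, C#ctr{current = Cur + 1, ceiling = Ceil + Lease}};
    next(#ctr{current = Cur} = C) ->
        {Cur + 1, C#ctr{current = Cur + 1}}.

After a crash, restarting with new(LastFlushedCeiling, Lease) resumes above every value handed out before the crash, so no vnode id:counter pair is ever reused.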

Summary

In this series of blog posts we've gone from 1970s seminal works to modern academic/industry collaboration. We've seen how one of the simplest and most venerable data structures in computer science can be complex and error prone in the real world, and how even the classics can be improved with some necessary innovation from time to time.

If you've read the whole series, thank you! You're fully caught up with logical clocks in Riak from back in client-side vclock days to the present world of per-key-actor-epoch-vnode-dotted-version-vectors!