Massive Technical Interviews Tips: CRDT - Conflict-Free Replicated Data Types

Wednesday, February 14, 2018

CRDT - Conflict-Free Replicated Data Types

http://www.baeldung.com/java-conflict-free-replicated-data-types
https://github.com/netopyr/wurmloch-crdt

When we have a cluster of N replica nodes in a distributed system, we may encounter a network partition — some nodes are temporarily unable to communicate with each other. This situation is called a split-brain.

When we have a split-brain in our system, some write requests — even for the same user — can go to different replicas that are not connected with each other. When such a situation occurs, our system is still available but is not consistent.

We need to decide what to do with writes and data that are not consistent when the network between two split clusters starts working again.

Using a similar rule for the increment-only counter, we can create a counter that can be both incremented and decremented. The PNCounter stores all increments and decrements separately.

When replicas synchronize, the resulting value will be equal to the sum of all increments minus the sum of all decrements:

https://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks

Resolving conflicts with vector clocks

The original Dynamo, like the open-source Voldemort and Riak, was a key/value database. Thus, objects would need to be serialized in a format such as json. For example, I might have a user object with key jbellis and value of {'email': 'jbellis@example.com', 'phone': '555-5555'}. We'll call this initial value V0.

Next, suppose we update the email address, changing the value to V1 of {'email': 'jbellis@illustration.com', 'phone': '555-5555'}. Some failure causes this to only be written to one replica. Later, we update the phone number, but we read from a different replica so we start from the original value V0 (with the original email address), so we write V2 {'email': 'jbellis@example.com', 'phone': '444-4444'}.

(Note that failure -- whether actual machine failure, network failure, or even load shedding -- can cause "conflicting" updates even with a single client and no concurrency.)

Since our object values are opaque blobs to this system, a naive last-write-wins conflict resolution policy will result in discarding the V1 email address change in favor of the V2 phone number update. This is why it's so easy to lose data using last-write-wins conflict resolution in a key/value system like Riak.

Vector clocks solve this problem by allowing the database to push conflict resolution back out to the client. Skipping a lot of details, the database would retain both V1 and V2, and when a client next reads key jbellis, it would return both versions and tell the client, "you figure out what you want the value to be now." The client can then deserialize the objects and merge the separately updated fields without data loss to the intended value of {'email': 'jbellis@illustration.com', 'phone': '444-4444'}.

Performance: as I alluded to earlier this year, updating a single field in an object stored in a key/value database requires three steps: read and deserialize the exiting object, update the desired field, and serialize and write the resulting object as a new value. Updating an object in Cassandra requires only communicating the changed fields, no more.
Siblings -- multiple versions generated by conflicting updates -- are difficult to deal with in practice, to the point that Riak makes last-write-wins the default despite the high potential for data loss.
Vector clocks are good at helping clients with simple merges like the above user object, but it's important to understand that vector clocks only tell you that a conflict occurred, and not how to resolve it; as Basho put it, even with perfect implementation you can’t have perfect information about causality in an open system without unbounded information growth. This is why Cassandra and later Riak both had to go beyond vector clocks when implementing counters.

People who have been burned by last-write-wins in other systems are justifiably nervous when approaching Cassandra. But Cassandra breaks a row up into columns that can be updated independently. Here's what that looks like for our example:


CREATE TABLE users (
    username text PRIMARY KEY,
    email text,
    phone text
);

INSERT INTO users (username, email, phone)
VALUES ('jbellis', 'jbellis@example.com', '555-5555');

UPDATE users SET email = 'jbellis@illustration.com' WHERE username = 'jbellis';

UPDATE users SET phone = '444-4444' WHERE username = 'jbellis';

This way, the storage engine can resolve changes to email and phone columns automatically. Conversely, if there are concurrent changes to a single field, only one will be retained, which is also what we want. (Cassandra extends this fine-grained conflict resolution to Collection elements as well.)

Thus, clock synchronization is nice to have in a Cassandra cluster but not critical; timestamps are only used to pick a "winning" update within a single column or collection element. (A timestamp tie will also result in a deterministic, commutative result.) Lightweight transactions are available when linearizability is important.

Cassandra addresses the problem that vector clocks were designed to solve by breaking up documents/objects/rows into units of data that can be updated and merged independently

Wednesday, February 14, 2018

CRDT - Conflict-Free Replicated Data Types

Resolving conflicts with vector clocks

People who have been burned by last-write-wins in other systems are justifiably nervous when approaching Cassandra. But Cassandra breaks a row up into columns that can be updated independently. Here's what that looks like for our example:

Labels

Popular Posts