Monday, October 10, 2016

Eventual Consistency



http://www.allthingsdistributed.com/2007/12/eventually_consistent.html
There are two ways of looking at consistency. One is from the developer / client point of view; how they observe data updates. The second way is from the server side; how updates flow through the system and what guarantees systems can give with respect to updates.
Client-side consistency
At the client side there are four components:
  • A storage system. For the moment we'll treat it as a black box, but if you want you should assume that under the covers it is something big and distributed and built to guarantee durability and availability.
  • Process A. A process that writes to and reads from the storage system.
  • Process B & C. Two processes independent of process A that also write to and read from the storage system. It is irrelevant whether these are really processes or threads within the same process; what is important is that they are independent and need to communicate to share information.
At the client side, consistency has to do with how and when an observer (in this case process A, B or C) sees updates made to a data object in the storage system. In the following examples Process A has made an update to a data object.
  • Strong consistency. After the update completes any subsequent access (by A, B or C) will return the updated value.
  • Weak consistency. The system does not guarantee that subsequent accesses will return the updated value. A number of conditions need to be met before the value will be returned. Often this condition is the passing of time. The period between the update and the moment when it is guaranteed that any observer will always see the updated value is dubbed the inconsistency window.
  • Eventual consistency. The storage system guarantees that if no new updates are made to the object, eventually (after the inconsistency window closes) all accesses will return the last updated value. The most popular system that implements eventual consistency is DNS, the domain name system. Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually all clients will see the update.
There are a number of variations on the eventual consistency model that are important to consider:
  • Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write by process B is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.
  • Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model.
  • Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of certain failure scenarios a new session needs to be created, and the guarantees do not overlap the sessions.
  • Monotonic read consistency. If a process has seen a particular value for the object any subsequent accesses will never return any previous values.
  • Monotonic write consistency. In this case the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously hard to program.
A number of these properties can be combined. For example one can get monotonic reads combined with session level consistency. From a practical point of view these two properties (monotonic reads and read-your-writes) are most desirable in an eventually consistent system, but not always required.
Many modern RDBMS systems that provide primary-backup reliability implement their replication techniques in both synchronous and asynchronous modes. In synchronous mode the replica update is part of the transaction; in asynchronous mode the updates arrive at the backup in a delayed manner, often through log shipping. In the latter mode, if the primary fails before the logs are shipped, reading from the promoted backup will produce old, inconsistent values. Also, to support better scalable read performance, RDBMS systems have started to provide reading from the backup, which is a classical case of providing eventual consistency guarantees, where the inconsistency window depends on the periodicity of the log shipping.


Server-side consistency
We need to establish a few definitions before we can get started:
  • N - the number of nodes that store replicas of the data
  • W - the number of replicas that need to acknowledge the receipt of the update before the update completes
  • R - the number of replicas that are contacted when a data object is accessed through a read operation
If W+R > N then the write set and the read set always overlap and one can guarantee strong consistency. In the primary-backup RDBMS scenario which implements synchronous replication, N=2, W=2 and R=1. No matter from which replica the client reads, it will always get a consistent answer. In the asynchronous replication case with reading from the backup enabled, N=2, W=1 and R=1. In this case R+W=N and consistency cannot be guaranteed.
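To make the arithmetic concrete, here is a minimal sketch (not from the original article) of the W+R > N overlap check, applied to the two primary-backup configurations above:

# Minimal sketch: an (N, W, R) configuration can guarantee strong consistency
# only if every read quorum must overlap every write quorum, i.e. W + R > N.
def quorum_is_strongly_consistent(n, w, r):
    return w + r > n

# Synchronous primary-backup replication: N=2, W=2, R=1 -> overlapping sets.
print(quorum_is_strongly_consistent(2, 2, 1))  # True
# Asynchronous replication with reads from the backup: N=2, W=1, R=1 -> W+R = N,
# so a read may hit the replica that has not yet received the write.
print(quorum_is_strongly_consistent(2, 1, 1))  # False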
The problem with these configurations, which are basic quorum protocols, is that when, because of failures, the system cannot write to W nodes, the write operation has to fail, marking the unavailability of the system. With N=3 and W=3 and only 2 nodes available, the system will have to fail the write.
In distributed storage systems that need to address high performance and high availability the number of replicas is in general higher than 2. Systems that focus solely on fault tolerance often use N=3 (with W=2 and R=2 configurations). Systems that need to serve very high read loads often replicate their data beyond what is required for fault tolerance, where N can be tens or even hundreds of nodes and with R configured to 1 such that a single read will return a result. Systems that are concerned about consistency set W=N for updates, which may decrease the probability of the write succeeding. A common configuration for systems in this category that are concerned about fault tolerance, but not consistency, is to run with W=3 to get basic durability of the update and then rely on a lazy (epidemic) technique to update the other replicas.
How to configure N, W and R depends on what the common case is and which performance path needs to be optimized. With R=1 and W=N we optimize for the read case, and with W=1 and R=N we optimize for a very fast write. Of course in the latter case, durability is not guaranteed in the presence of failures, and if W < (N+1)/2 there is the possibility of conflicting writes because write sets do not overlap.
Weak/eventual consistency arises when W+R <= N, meaning that there is no guaranteed overlap between the read and write sets. If this configuration is deliberate and not based on a failure case, then it hardly makes sense to set R to anything other than 1.
There are two very common cases where this happens: the first is the massive replication for read scaling mentioned earlier, and the second is where data access is more complicated. In a simple key-value model it is easy to compare versions to determine which is the latest value, but in systems that return sets of objects it is more difficult to determine what the correct latest set should be. In these systems, where the write set is smaller than the replica set, a mechanism is in place that in a lazy manner applies the updates to the remaining nodes in the replica set. The period until all replicas have been updated is the inconsistency window discussed before. If W+R <= N then the system is vulnerable to reading from nodes that have not yet received the updates.
Whether or not read-your-writes, session and monotonic consistency can be achieved depends in general on the "stickiness" of clients to the server that executes the distributed protocol for them. If this is the same server every time, then it is relatively easy to guarantee read-your-writes and monotonic reads. This makes it slightly harder to manage load balancing and fault tolerance, but it is a simple solution. Using sessions, which are sticky, makes this explicit and provides an exposure level that clients can reason about.
Sometimes read-your-writes and monotonic reads are implemented by the client. By adding versions on writes, the client discards reads of values with versions that precede the last seen version.
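A minimal sketch of this client-side technique, assuming a hypothetical storage client whose put() returns the version it wrote and whose get() returns a (value, version) pair:

# Client-enforced read-your-writes / monotonic reads: remember the highest
# version written or observed per key and reject reads that are older.
class VersionedClient:
    def __init__(self, store):
        self.store = store          # hypothetical store exposing get()/put()
        self.last_seen = {}         # key -> highest version seen so far

    def write(self, key, value):
        version = self.store.put(key, value)   # assumed to return a version
        self.last_seen[key] = max(self.last_seen.get(key, 0), version)
        return version

    def read(self, key):
        value, version = self.store.get(key)   # assumed to return (value, version)
        if version < self.last_seen.get(key, 0):
            # Stale replica: discard the result and retry another replica.
            raise RuntimeError("stale read; retry against another replica")
        self.last_seen[key] = version
        return value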
Partitions happen when some nodes in the system cannot reach other nodes, but all can be reached by clients. 
If you use a classical majority quorum approach, then the partition that has W nodes of the replica set can continue to take updates while the other partition becomes unavailable. The same holds for the read set. Given that these two sets overlap, by definition the minority set becomes unavailable. Partitions don't happen that frequently, but they do occur, between datacenters as well as inside datacenters.
Inconsistency can be tolerated for two reasons: for improving read and write performance under highly concurrent conditions and for handling partition cases where a majority model would render part of the system unavailable even though the nodes are up and running.
Whether or not inconsistencies are acceptable depends on the client application. A specific popular case is a website scenario in which we can have the notion of user-perceived consistency: the inconsistency window needs to be smaller than the time expected for the customer to return for the next page load. This allows updates to propagate through the system before the next read is expected.

http://stackoverflow.com/questions/6865545/read-your-own-writes-consistency-in-cassandra
Read-your-own-writes consistency is a great improvement over so-called eventual consistency: if I change my profile picture I don't care whether others see the change a minute later, but it looks weird if I still see the old one after a page reload.
Can this be achieved in Cassandra without having to do a full read-check on more than one node?
Using ConsistencyLevel.QUORUM is fine when reading unspecified data and n>1 nodes are actually being read. However, when a client reads from the same node it writes to (and actually uses the same connection) it can be wasteful: some databases will in this case always ensure that the previously written (my) data is returned, and not something older. Using ConsistencyLevel.ONE does not ensure this, and assuming it does leads to race conditions. Some tests showed this: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/per-connection-quot-read-after-my-write-quot-consistency-td6018377.html
We've had http://issues.apache.org/jira/browse/CASSANDRA-876 open for a while to add this, but nobody's bothered finishing it because
  1. CL.ONE is just fine for a LOT of workloads without any extra gymnastics
  2. Reads are so fast anyway that doing the extra one is not a big deal (and in fact Read Repair, which is on by default, means all the nodes get checked anyway, so the difference between CL.ONE and higher is really more about availability than performance)
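As a hedged illustration of the quorum approach mentioned above: with a replication factor of 3, writing and reading at QUORUM makes the write and read sets overlap (2 + 2 > 3), which gives read-your-writes without relying on node stickiness. The keyspace, table and column names below are hypothetical; the calls are from the DataStax Python driver.

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('profiles')        # hypothetical keyspace

user_id = 'alice'                            # hypothetical values
new_avatar = 'avatar-v2.png'

# Write at QUORUM (2 of 3 replicas must acknowledge)...
write = SimpleStatement(
    "UPDATE users SET avatar = %s WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write, (new_avatar, user_id))

# ...and read at QUORUM: with RF=3 the two sets must overlap, so this read
# sees the write, at the cost of contacting more than one node.
read = SimpleStatement(
    "SELECT avatar FROM users WHERE user_id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(read, (user_id,)).one())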
https://docs.oracle.com/cd/E17276_01/html/gsg_db_rep/C/rywc.html

RYW (Read-Your-Writes) consistency


http://developer.couchbase.com/documentation/server/current/developer-guide/query-consistency.html

Forcing query to wait until the index has been updated

It is possible to ensure that prior mutations are always considered and indexed when performing the query. This option works by instructing the query service to wait for all pending mutations to be indexed. The option is off by default to increase performance: in write-heavy environments many pending writes may be active, and waiting for them all to be indexed can slow down query execution. This option is called consistency.

from couchbase.n1ql import N1QLQuery, CONSISTENCY_REQUEST

# Ask the query service to wait until all pending mutations have been
# indexed before running the query (request-level consistency).
query = N1QLQuery(
    'SELECT names, emails FROM default WHERE $1 IN names', 'Brass')
query.consistency = CONSISTENCY_REQUEST
https://msdn.microsoft.com/en-us/library/dn589800.aspx
When a customer places an order, the application instance performs the following operations across a collection of heterogeneous data stores held in various locations:
  1. Update the stock level of the item ordered.
  2. Record the details of the order.
  3. Verify payment details for the order.
Instead, implementing the order process as an eventually consistent series of steps, where each step in the process is essentially an autonomous operation, is a much more scalable solution. While these steps are progressing, the state of the overall system is inconsistent. For example, after the stock level has been updated but before the details of the order have been recorded, the system has temporarily lost some stock. However, when all the steps have been completed, the system returns to a consistent state and all stock items can be accounted for.

The application is responsible either for guaranteeing that all three steps in the order process complete, or for determining the actions to take if any of the steps fail.

Retrying Failing Steps

In a distributed environment, the inability to complete an operation is often due to some type of temporary error (communication failure is always a possibility). If such a failure occurs, an application might assume that the situation is transient and simply attempt to repeat the step that failed. Less transient exceptions, such as database or virtual machine failure, may also occur, and the remedy might be similar: wait for the system to recover and then try the failing operation again. This approach could result in the same step actually being run twice, possibly resulting in multiple updates. It is very difficult to design a solution to prevent this repetition from occurring, but the application should attempt to render such repetition harmless.
One strategy is to design each step in an operation to be idempotent. This means that a step that had previously succeeded can be repeated without actually changing the state of the system. 
A common technique is to associate the message sent to the service with a unique identifier. The service can store the identifier for each message it receives locally, and only process a message if the identifier does not match that of a message it received earlier. This technique is known as de-duping (the removal of duplicate messages). This strategy, exemplified by the Idempotent Receiver pattern, depends on the service being able to store message identifiers successfully.
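A minimal sketch of such a de-duping receiver; the in-memory set below stands in for the durable message-identifier store described above, which a real service would persist.

# Idempotent receiver: process a message only if its identifier has not
# been seen before; duplicates are acknowledged but ignored.
class IdempotentReceiver:
    def __init__(self, handler):
        self.handler = handler
        self.processed_ids = set()   # stand-in for a durable message store

    def receive(self, message_id, payload):
        if message_id in self.processed_ids:
            return "duplicate-ignored"          # already handled once
        result = self.handler(payload)
        self.processed_ids.add(message_id)
        return result

receiver = IdempotentReceiver(lambda payload: f"processed {payload}")
print(receiver.receive("order-42", {"item": "book"}))   # processed ...
print(receiver.receive("order-42", {"item": "book"}))   # duplicate-ignored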

Partitioning Data and Using Idempotent Commands

Multiple instances of an application competing to modify the same data at the same time are another common cause of failure to achieve eventual consistency. If possible, you should design your system to minimize these situations. You should try to partition your system to ensure that concurrent instances of an application attempting to perform the same operations simultaneously do not conflict with each other.

Implementing Compensating Logic

There may ultimately be situations where the logic in the application determines that an operation cannot or should not be allowed to complete (this could be for a variety of business-specific reasons). In these cases you can implement compensating logic that undoes the work performed by the operation, as described by the Compensating Transaction pattern.
In the ecommerce example described earlier, as the application performs each step of the order process it can record the tasks necessary to undo this step. If the order process fails, the application can apply the “undo” steps for each step that had previously completed to restore the system to a consistent state. This technique may be complicated by the fact that undoing a step may not be as simple as performing the exact opposite of the original step, and there may be additional business rules that the application must apply. For example, undoing the step that records the details of an order in the document database may not be as straightforward as removing the document. For auditing purposes, it may be necessary to leave the original order document in place but change the status of the order in this document to “cancelled.”
Even when a sender application only sends a message once, the receiver application may receive the message more than once.
How can a message receiver deal with duplicate messages?
Design a receiver to be an Idempotent Receiver--one that can safely receive the same message multiple times.
The term idempotent is used in mathematics to describe a function that produces the same result if it is applied to itself, i.e. f(x) = f(f(x)). In messaging this concept translates into a message that has the same effect whether it is received once or multiple times. This means that a message can safely be resent without causing any problems, even if the receiver receives duplicates of the same message.
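A tiny illustration of the distinction, with hypothetical field names: the first handler leaves the system in the same state no matter how many duplicates arrive, the second does not.

state = {"order-42": {"status": "new", "retries": 0}}

def handle_idempotent(order_id):
    state[order_id]["status"] = "confirmed"   # same outcome however often it runs

def handle_non_idempotent(order_id):
    state[order_id]["retries"] += 1           # every duplicate changes the state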

Compensating Transaction Pattern
https://msdn.microsoft.com/en-us/library/dn589804.aspx
Undo the work performed by a series of steps, which together define an eventually consistent operation, if one or more of the steps fail. 

A significant challenge in the eventual consistency model is how to handle a step that has failed irrecoverably. In this case it may be necessary to undo all of the work completed by the previous steps in the operation. However, the data cannot simply be rolled back because other concurrent instances of the application may have since changed it. Even in cases where the data has not been changed by a concurrent instance, undoing a step might not simply be a matter of restoring the original state. It may be necessary to apply various business-specific rules.

Implement a compensating transaction. The steps in a compensating transaction must undo the effects of the steps in the original operation. A compensating transaction might not be able to simply replace the current state with the state the system was in at the start of the operation, because this approach could overwrite changes made by other concurrent instances of an application. Rather, it must be an intelligent process that takes into account any work done by concurrent instances.



A common approach to implementing an eventually consistent operation that requires compensation is to use a workflow. As the original operation proceeds, the system records information about each step and how the work performed by that step can be undone. If the operation fails at any point, the workflow rewinds through the steps it has completed and performs the work that reverses each step. Note that a compensating transaction might not have to undo the work in the exact mirror-opposite order of the original operation, and it may be possible to perform some of the undo steps in parallel.
A compensating transaction is itself an eventually consistent operation and it could also fail. The system should be able to resume the compensating transaction at the point of failure and continue. It may be necessary to repeat a step that has failed, so the steps in a compensating transaction should be defined as idempotent commands. For more information about idempotency, see Idempotency Patterns on Jonathan Oliver’s blog.
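A minimal sketch of such a workflow, assuming each step is registered together with an idempotent undo action (the step and undo functions named in the usage comment are hypothetical placeholders):

# Run a series of autonomous steps; if one fails, rewind through the
# completed steps and apply their recorded compensations in reverse order.
def run_with_compensation(steps):
    # steps: list of (do, undo) callables; undo actions should be idempotent.
    completed = []
    try:
        for do, undo in steps:
            do()
            completed.append(undo)
    except Exception:
        for undo in reversed(completed):
            undo()   # a failed compensation should itself be resumable/retried
        raise

# Hypothetical usage for the order example:
# run_with_compensation([(update_stock, restore_stock),
#                        (record_order, cancel_order),
#                        (verify_payment, release_payment_hold)])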
  • It might not be easy to determine when a step in an operation that implements eventual consistency has failed. A step might not fail immediately, but instead it could block. It may be necessary to implement some form of time-out mechanism.


  • Placing a short-term timeout-based lock on each resource that is required to complete an operation, and obtaining these resources in advance, can help increase the likelihood that the overall activity will succeed. The work should be performed only after all the resources have been acquired. All actions must be finalized before the locks expire.
  • Consider using retry logic that is more forgiving than usual to minimize failures that trigger a compensating transaction. If a step in an operation that implements eventual consistency fails, try handling the failure as a transient exception and repeat the step. Only abort the operation and initiate a compensating transaction if a step fails repeatedly or irrecoverably.
Use this pattern only for operations that must be undone if they fail. If possible, design solutions to avoid the complexity of requiring compensating transactions (for more information, see the Data Consistency Primer).

Example

A travel website enables customers to book itineraries. A single itinerary may comprise a series of flights and hotels. A customer traveling from Seattle to London and then on to Paris could perform the following steps when creating an itinerary:
  1. Book a seat on flight F1 from Seattle to London.
  2. Book a seat on flight F2 from London to Paris.
  3. Book a seat on flight F3 from Paris to Seattle.
  4. Reserve a room at hotel H1 in London.
  5. Reserve a room at hotel H2 in Paris.

These steps constitute an eventually consistent operation, although each step is essentially a separate atomic action in its own right. Therefore, as well as performing these steps, the system must also record the counter operations necessary to undo each step in case the customer decides to cancel the itinerary. The steps necessary to perform the counter operations can then run as a compensating transaction if necessary.
In many business solutions, failure of a single step does not always necessitate rolling the system back by using a compensating transaction. For example, if—after having booked flights F1, F2, and F3 in the travel website scenario—the customer is unable to reserve a room at hotel H1, it is preferable to offer the customer a room at a different hotel in the same city rather than cancelling the flights. The customer may still elect to cancel (in which case the compensating transaction runs and undoes the bookings made on flights F1, F2, and F3), but this decision should be made by the customer rather than by the system.

One requirement for creating idempotency in messages that aren't idempotent is to establish a message store. This message store is keyed off the application-level identifier of the message being processed. It contains a list of messages that have been dispatched as a result of processing the message being received. Hereafter we shall simply call this a message store.
When the computational overhead of reprocessing a message is high compared to the cost of querying whether it has already been processed, and the probability of receiving a message more than once is significant, we will want to proactively filter messages that have already been processed.
  • Circuit Breaker Pattern. The Retry pattern is ideally suited to handling transient faults. If a failure is expected to be more long lasting, it may be more appropriate to implement the Circuit Breaker Pattern. The Retry pattern can also be used in conjunction with a circuit breaker to provide a comprehensive approach to handling faults.
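A hedged sketch of combining the two patterns: retry a step a few times for transient faults, but fail fast once the breaker has opened after repeated failures (thresholds and delays below are illustrative, not recommendations).

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None              # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # open the circuit
            raise
        self.failures = 0                      # success closes the circuit
        return result

def call_with_retries(breaker, fn, attempts=3, delay=1.0):
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except Exception:
            if attempt == attempts - 1:
                raise                          # give up; caller may compensate
            time.sleep(delay)                  # treat the failure as transient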

http://danielwhittaker.me/2014/10/27/4-ways-handle-eventual-consistency-ui/

https://cloudant.com/blog/coding-for-eventual-consistency/#.V_xnRGQrJE4
Store State Locally
Other systems, like Cloudant and CouchDB, never use locks. This makes every part of the system always available to handle requests. In exchange, try to afford the cluster a moment to bring itself up to date after any change. Alternatively, store relevant state locally, such as by using PouchDB, which will keep itself up to date with your cluster automatically.

Insert, Don't Update

Instead of updating a vote count on the post document, store each vote independently, like this:
{
 "_id": "...",
 "_rev": "...",
 "type": "vote",
 "post_id": "...",
 "user_id": "...",
 "created_at": "...",
 "magnitude": 1
}
When someone votes on a post, create a document with a field like "type": "vote", and other data reflecting the vote. To prevent duplicates, make the "_id" some unique, deterministic value like "user_id+post_id", so that attempts by a single user to vote more than once on a post will be rejected as conflicts. Then, use MapReduce to count up the number of votes for a particular post, like this:
{
 map: function (doc) {
  if (doc.type === 'vote') {
   emit(doc.post_id, doc.magnitude);
  }
 },
 reduce: '_count'
}
Now, any number of folks can vote on a post at the same time, and none of them would experience write failures or data loss. (For more on MapReduce indexes, see our docs)
So, insert. Avoid updates. Where you can't avoid updates, try to reduce the number of agents that might be interacting with that document at once. If you need to update information about a user, for example, try to ensure only the user, and/or only admins, can edit that document.
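As a sketch of the pattern against CouchDB's plain HTTP API (the database URL and values are hypothetical): giving each vote a deterministic _id means a duplicate vote by the same user fails with 409 Conflict instead of updating anything.

import requests

DB = "http://localhost:5984/votes"   # hypothetical database URL

def cast_vote(user_id, post_id, magnitude=1):
    doc_id = f"{user_id}+{post_id}"  # deterministic _id: one vote per user/post
    resp = requests.put(f"{DB}/{doc_id}", json={
        "type": "vote",
        "post_id": post_id,
        "user_id": user_id,
        "magnitude": magnitude,
    })
    if resp.status_code == 409:
        return "already voted"       # duplicate rejected as a conflict
    resp.raise_for_status()
    return "vote recorded"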

http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html
Netflix considers S3 the “source of truth” for all data warehousing.  There are many attractive features that draw us to this service including: 99.999999999% durability, 99.99% availability, effectively infinite storage, versioning (data recovery), and ubiquitous access. In combination with AWS’s EMR, we can dynamically expand/shrink clusters, provision/decommission clusters based on need or availability of reserved capacity, perform live cluster swapping without interrupting processing, and explore new technologies all utilizing the same data warehouse.

The consistency guarantees for S3 vary by region and operation (details here), but in general, any list or read operation is susceptible to inconsistent information depending on preceding operations.
https://github.com/Netflix/s3mper

How Inconsistency Impacts Processing



The Netflix ETL Process is predominantly Pig and Hive jobs scheduled through enterprise workflow software that resolves dependencies and manages task execution.  To understand how eventual consistency affects processing, we can distill the process down to a simple example of two jobs where the results of one feed into another.  If we take a look at Pig-1 from the diagram, it consists of two MapReduce jobs in a pipeline.  The initial dataset is loaded from S3 due to the source location referencing an S3 path. All intermediate data is stored in HDFS since that is the default file system.  Consistency is not a concern for these intermediate stages.  However, the results from Pig-1 are stored directly back to S3 so the information is immediately available for any other job across all clusters to consume.

Pig-2 is activated based on the completion of Pig-1 and immediately lists the output directories of the previous task.  If the S3 listing is incomplete when the second job starts, it will proceed with incomplete data.  This is particularly problematic, as we stated earlier, because there is no indication that a problem occurred.  The integrity of resulting data is entirely at the mercy of how consistent the S3 listing was when the second job started.

A variety of other scenarios may result in consistency issues, but inconsistent listing is our primary concern.  If the input data is incomplete, there is no indication anything is wrong with the result.  Obviously it is noticeable when the expected results vary significantly from long standing patterns or emit no data at all, but if only a small portion of input is missing the results will appear convincing.  Data loss occurring at the beginning of a pipeline will have a cascading effect where the end product is wildly inaccurate.  Due to the potential impact, it is essential to understand the risks and methods to mitigate loss of data integrity.


Approaches to Managing Consistency


The Impractical



When faced with eventual consistency, the most obvious (and naive) approach is to simply wait a set amount of time before a job starts with the expectation that data will show up.  The problem is knowing how long “eventual” will last.  Injecting an artificial delay is detrimental because it defers processing even if requisite data is available and still misses data if it fails to materialize in time.  The result is a net loss for both processing time and confidence in the resulting data.

Staging Data


A more common approach to processing in the cloud is to load all necessary data into HDFS, complete all processing, and store the final results to S3 before terminating the cluster.  This approach works well if processing is isolated to a single cluster and performed in batches.  As we discussed earlier, having the ability to decouple the data from the computing resources provides flexibility that cannot be achieved within a single cluster.  Persistent clusters also make this approach difficult.  Data in S3 may far exceed the capacity of the HDFS cluster and tracking what data needs to be staged and when it expires is a particularly complex problem to solve.

Eliminating update inconsistency is achievable by imposing a convention where the same location is never overwritten. Here at Netflix, we encourage the use of a batching pattern, where results are written into partitioned batches and the Hive metastore only references the valid batches. This approach removes the possibility of inconsistency due to update or delete. For the AWS regions other than US Standard that provide “read-after-write” consistency, this approach may be sufficient, but it relies on strict adherence to the convention.
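A small sketch of the convention with made-up bucket and table names: every run writes into a fresh, uniquely named batch location and the Hive partition is then pointed at that new location, so an existing location is never overwritten.

import uuid
from datetime import datetime, timezone

def new_batch_location(table, dateint):
    # Unique batch id: timestamp plus a random suffix, so reruns never collide.
    batch_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S") + "-" + uuid.uuid4().hex[:8]
    return f"s3://example-warehouse/{table}/dateint={dateint}/batch={batch_id}/"

location = new_batch_location("playback_events", 20161010)
# Write the job output to `location`, then register only this batch, e.g.:
# ALTER TABLE playback_events PARTITION (dateint=20161010) SET LOCATION '<location>';
print(location)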

Secondary Index


S3 is designed with an eventually consistent index, which is understandable in the context of the scale and the guarantees it provides. At smaller scale, it is possible to achieve consistency through use of a consistent secondary index to catalog file metadata while backing the raw data on S3. This approach becomes more difficult as the scale increases, but as long as the secondary index can handle the request rate and still provide guaranteed consistency, it will suffice. There are costs to this approach: the probability of data loss and the complexity increase, while performance degrades due to relying on two separate systems.

S3mper: A Hybrid Approach

S3mper is an experimental approach to tracking file metadata through use of a secondary index that provides consistent reads and writes.  The intent is to identify when an S3 list operation returns inconsistent results and provide options to respond.  We implemented s3mper using aspects to advise methods on the Hadoop FileSystem interface and track file metadata with DynamoDB as the secondary index.  The reason we chose DynamoDB is that it provides capabilities similar to S3 (e.g. high availability, durability through replication), but also adds consistent operations and high performance.


What makes s3mper a hybrid approach is its use of the S3 listing for comparison and only maintaining a window of consistency. The “source of truth” is still S3, but with an additional layer of checking added. The window of consistency allows for falling back to the S3 listing without concern that the secondary index will fail and lose important information or risk consistency issues that arise from using tools outside the hadoop stack to modify data in S3.
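An illustrative sketch (not s3mper's actual code) of the core check: compare the keys that a consistent index says should exist with what an S3 listing actually returns, and delay and retry within a bounded window before either proceeding or raising the alarm.

import time
import boto3

s3 = boto3.client("s3")

def list_keys(bucket, prefix):
    keys = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys

def consistent_listing(bucket, prefix, expected_keys, timeout=300, interval=10):
    # expected_keys stands in for the consistent secondary index (DynamoDB in s3mper).
    deadline = time.time() + timeout
    while True:
        listed = list_keys(bucket, prefix)
        missing = expected_keys - listed
        if not missing:
            return listed                       # the listing has caught up
        if time.time() >= deadline:
            # Surface the inconsistency instead of silently processing partial data.
            raise RuntimeError(f"listing still missing {len(missing)} objects")
        time.sleep(interval)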
The key features s3mper provides include (see here for more detailed design and options):
  • Recovery: When an inconsistent listing is identified, s3mper will optionally delay the listing and retry until consistency is achieved. This will delay a job only long enough for data to become available without unnecessarily impacting job performance.
  • Notification: If a consistent listing cannot be achieved, a notification is sent immediately and a determination can be made as to whether to kill the job or let it proceed with incomplete data.
  • Reporting: A variety of events are sent to track the number of recoveries, files missed, what jobs were affected, etc.
  • Configurability: Options are provided to control how long a job should wait, how frequently to recheck a listing, and whether to fail a job if the listing is inconsistent.
  • Modularity: The implementations for the metastore and notifications can be overridden based on the environment and services at your disposal.
  • Administration: Utilities are provided for inspecting the metastore and resolving conflicts between the secondary index in DynamoDB and the S3 index.
S3mper is not intended to solve every possible case where inconsistency can occur. Deleting data from S3 outside of the hadoop stack will result in divergence of the secondary index and jobs being delayed unnecessarily. Directory support is also limited such that recursive listings are still prone to inconsistency, but since we currently derive all our data locations from a Hive metastore, this does not impact us. While this library is still in its infancy and does not support every case, using it in combination with the conventions discussed earlier will alleviate the concern for our workflow and allow for further investigation and development of new capabilities.


