https://puncsky.com/hacking-the-software-engineer-interview#designing-robust-and-predictable-apis-with-idempotency
https://stripe.com/blog/idempotency
Consider a call between any two nodes. There are a variety of failures that can occur:
The initial connection could fail as the client tries to connect to a server.
The call could fail midway while the server is fulfilling the operation, leaving the work in limbo.
The call could succeed, but the connection break before the server can tell its client about it.
The easiest way to address inconsistencies in distributed state caused by failures is to implement server endpoints so that they’re idempotent, which means that they can be called any number of times while guaranteeing that side effects only occur once.
When a client sees any kind of error, it can ensure the convergence of its own state with the server’s by retrying, and can continue to retry until it verifiably succeeds. This fully addresses the problem of an ambiguous failure because the client knows that it can safely handle any failure using one simple technique.
This is where idempotency keys come into play. When performing a request, a client generates a unique ID to identify just that operation and sends it up to the server along with the normal payload. The server receives the ID and correlates it with the state of the request on its end. If the client notices a failure, it retries the request with the same ID, and from there it’s up to the server to figure out what to do with it.
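As a rough client-side sketch of this pattern, the snippet below generates one key per logical operation and reuses it across retries; the Idempotency-Key header name follows Stripe's convention, and `perform_request` is a hypothetical HTTP helper standing in for a real client.

```ruby
require "securerandom"

# Generate one idempotency key per logical operation and reuse it on every
# retry, so the server can deduplicate the side effects.
# NOTE: `perform_request` is a hypothetical HTTP helper, not a real library call.
def create_charge_with_retries(payload, max_attempts: 5)
  idempotency_key = SecureRandom.uuid
  attempts = 0

  begin
    attempts += 1
    perform_request(
      "/v1/charges",
      body:    payload,
      headers: { "Idempotency-Key" => idempotency_key }
    )
  rescue StandardError
    retry if attempts < max_attempts
    raise
  end
end
```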
Make sure that failures are handled consistently. Have clients retry operations against remote services. Not doing so could leave data in an inconsistent state that will lead to problems down the road.
Make sure that failures are handled safely. Use idempotency and idempotency keys to allow clients to pass a unique value and retry requests as needed.
Make sure that failures are handled responsibly. Use techniques like exponential backoff and random jitter. Be considerate of servers that may be stuck in a degraded state.
https://www.bennadel.com/blog/3390-considering-strategies-for-idempotency-without-distributed-locking-with-ben-darfler.htm
How to design robust and predictable APIs with idempotency?
How could APIs be un-robust and un-predictable?
- Networks are unreliable.
- Servers are more reliable but may still fail.
How to solve the problem? 3 Principles:
- Client retries to ensure consistency.
- Retry with idempotency and idempotency keys to allow clients to pass a unique value.
  - In RESTful APIs, the PUT and DELETE verbs are idempotent.
  - However, POST may cause a “double-charge” problem in payments, so we use an idempotency key to identify the request.
    - If the failure happens before the request reaches the server, then the retry arrives as if for the first time, and the server processes it normally.
    - If the failure happens inside the server, then the ACID database guarantees that the transaction associated with the idempotency key either commits fully or rolls back.
    - If the failure happens after the server’s reply, then the client retries, and the server simply replies with a cached result of the earlier successful operation.
- Retry with exponential backoff and random jitter. Be considerate of the thundering herd problem: servers may be stuck in a degraded state, and a burst of synchronized retries may hurt the system further.
For example, Stripe’s client retry calculates the delay like this…
```ruby
def self.sleep_time(retry_count)
  # Apply exponential backoff with initial_network_retry_delay on the
  # number of attempts so far as inputs. Do not allow the number to exceed
  # max_network_retry_delay.
  sleep_seconds = [Stripe.initial_network_retry_delay * (2 ** (retry_count - 1)),
                   Stripe.max_network_retry_delay].min

  # Apply some jitter by randomizing the value in the range of
  # (sleep_seconds / 2) to (sleep_seconds).
  sleep_seconds = sleep_seconds * (0.5 * (1 + rand()))

  # But never sleep less than the base sleep seconds.
  sleep_seconds = [Stripe.initial_network_retry_delay, sleep_seconds].max

  sleep_seconds
end
```
The Stripe API implements idempotency keys on mutating endpoints (i.e. anything under POST in our case) by allowing clients to pass a unique value in with the special Idempotency-Key header, which allows a client to guarantee the safety of distributed operations.
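For example, with the stripe Ruby gem, a charge request carrying an idempotency key looks roughly like the following (the key value and charge parameters here are illustrative placeholders):

```ruby
require "stripe"

Stripe.api_key = "sk_test_..." # illustrative placeholder

# Retrying this exact call with the same idempotency_key is safe: the charge
# is created at most once, and subsequent attempts replay the stored result.
Stripe::Charge.create(
  {
    amount:   2000,
    currency: "usd",
    customer: "cus_...", # illustrative placeholder
  },
  { idempotency_key: "a-unique-key-for-this-operation" }
)
```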
Being a good distributed citizen
It’s usually recommended that clients follow something akin to an exponential backoff algorithm as they see errors. The client blocks for a brief initial wait time on the first failure, but as the operation continues to fail, it waits proportionally to 2^n, where n is the number of failures that have occurred. By backing off exponentially, we can ensure that clients aren’t hammering on a downed server and contributing to the problem.
Furthermore, it’s also a good idea to mix in an element of randomness. If a problem with a server causes a large number of clients to fail at close to the same time, then even with back off, their retry schedules could be aligned closely enough that the retries will hammer the troubled server. This is known as the thundering herd problem.
Once the server knows that a request has definitively finished by either succeeding or failing in a way that’s not recoverable, it stores the request’s results and associates them with the idempotency key. If a client makes another request with the same key, the server simply short circuits and returns the stored results.
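A minimal in-memory sketch of that server-side short circuit (in production the stored results would live in the same ACID database as the rest of the request’s state; all names here are illustrative):

```ruby
# In-memory stand-in for the idempotency key store.
class IdempotentEndpoint
  def initialize
    @results_by_key = {} # idempotency key => stored response
  end

  def handle(idempotency_key, params)
    # If this request already finished, short circuit and return the stored result.
    stored = @results_by_key[idempotency_key]
    return stored if stored

    # Otherwise do the real work once and remember its outcome for future retries.
    response = process(params)
    @results_by_key[idempotency_key] = response
    response
  end

  private

  def process(params)
    { status: 201, body: { charged: params[:amount] } }
  end
end
```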
Keys are not meant to be used as a permanent request archive but rather as a mechanism for ensuring near-term correctness. Servers should recycle them out of the system beyond a horizon where they won’t be of much use – say 24 hours or so.
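Assuming an ActiveRecord-style IdempotencyKey model (an illustrative name, not a real library), the recycling can be a small periodic job:

```ruby
# Hypothetical cleanup job: recycle keys once they're past the retention horizon.
RETENTION_PERIOD = 24 * 60 * 60 # 24 hours, in seconds

IdempotencyKey.where("created_at < ?", Time.now - RETENTION_PERIOD).delete_all
```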
The request lifecycle
When a new ride comes in, we’ll perform this set of operations (sketched in code after the list):
- Insert an idempotency key record.
- Create a ride record to track the ride that’s about to happen.
- Create an audit record referencing the ride.
- Make an API call to Stripe to charge the user for the ride (here we’re leaving our own stack, and this presents some risk).
- Update the ride record with the created charge ID.
- Send the user a receipt via email.
- Update idempotency key with results.
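An illustrative outline of that lifecycle in code follows; the model, mailer, and helper names are invented for the sketch and are not the article’s actual implementation.

```ruby
# Linear sketch of the Rocket Rides request lifecycle; each numbered comment
# maps to a step in the list above.
def create_ride(idempotency_key, user, ride_params)
  key  = IdempotencyKey.create!(key: idempotency_key, params: ride_params) # 1
  ride = Ride.create!(user_id: user.id, **ride_params)                     # 2
  AuditRecord.create!(action: "ride.create", ride_id: ride.id)             # 3

  charge = Stripe::Charge.create(                                          # 4: leaving our own stack
    { amount: ride_params[:amount], currency: "usd", customer: user.stripe_customer_id },
    { idempotency_key: idempotency_key }
  )

  ride.update!(stripe_charge_id: charge.id)                                # 5
  ReceiptMailer.send_receipt(user_id: user.id, ride_id: ride.id)           # 6: ideally a background job
  key.update!(recovery_point: "finished", response_code: 201)              # 7
end
```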
Foreign state mutations
To shore up our backend, it’s key to identify where we’re making foreign state mutations; that is, calling out and manipulating data on another system. This might be creating a charge on Stripe, adding a DNS record, or sending an email.
Some foreign state mutations are idempotent by nature (e.g. adding a DNS record), some are not idempotent but can be made idempotent with the help of an idempotency key (e.g. charge on Stripe, sending an email), and some operations are not idempotent, most often because a foreign service hasn’t designed them that way and doesn’t provide a mechanism like an idempotency key.
The reason that the local vs. foreign distinction matters is that unlike a local set of operations where we can leverage an ACID store to roll back a result that we didn’t like, once we make our first foreign state mutation, we’re committed one way or another. We’ve pushed data into a system beyond our own boundaries and we shouldn’t lose track of it.
Atomic phases
An atomic phase is a set of local state mutations that occur in transactions between foreign state mutations. We say that they’re atomic because we can use an ACID-compliant database like Postgres to guarantee that either all of them will occur, or none will.
Atomic phases should be safely committed before initiating any foreign state mutation. If the call fails, our local state will still have a record of it happening that we can use to retry the operation.
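Sketched with ActiveRecord-style models (names invented for illustration), the pattern looks like this; the recovery point written inside the transaction is described in the next subsection.

```ruby
# Atomic phase: the local writes and the recovery-point bump commit together
# *before* we touch the foreign service.
ActiveRecord::Base.transaction do
  ride = Ride.create!(user_id: user.id, origin: params[:origin])
  key.update!(recovery_point: "ride_created", ride_id: ride.id)
end

# The foreign state mutation happens only after the atomic phase is durably
# committed; if this call fails, a retry can resume from "ride_created".
Stripe::Charge.create(
  { amount: 2000, currency: "usd", customer: user.stripe_customer_id },
  { idempotency_key: key.idempotency_key }
)
```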
Recovery points
A recovery point is the name of a checkpoint that we reach after successfully executing an atomic phase or a foreign state mutation. Its purpose is to allow a request that’s being retried to jump back to the point in the lifecycle just before the last attempt failed.
For convenience, we’re going to store the name of the recovery point reached right onto the idempotency key relation that we’ll build. All requests will initially get a recovery point of `started`, and after any request is complete (again, through either a success or definitive error) it’ll be assigned a recovery point of `finished`. When in an atomic phase, the transition to a new recovery point should be committed as part of that phase’s transaction.
Background jobs and staging
In-band foreign state mutations make a request slower and more difficult to reason about, so they should be avoided when possible. In many cases it’s possible to defer this type of work to after the request is complete by sending it to a background job queue.
In our Rocket Rides example the charge to Stripe probably can’t be deferred – we want to know whether it succeeded right away so that we can deny the request if it didn’t. Sending an email, however, can and should be deferred to the background.
By using a transactionally-staged job drain, we can hide jobs from workers until we’ve confirmed that they’re ready to be worked by isolating them in a transaction. This also means that the background work becomes part of an atomic phase and greatly simplifies its operational properties. Work should always be offloaded to background queues wherever possible.
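A minimal sketch of a transactionally-staged job drain (table, model, and queue names invented for illustration; not the article’s exact implementation):

```ruby
# The job row is written inside the same transaction as the rest of the atomic
# phase, so it becomes visible to workers only if the phase actually commits.
ActiveRecord::Base.transaction do
  ride.update!(stripe_charge_id: charge.id)
  StagedJob.create!(job_name: "send_receipt", job_args: { ride_id: ride.id })
  key.update!(recovery_point: "charge_recorded")
end

# A separate enqueuer process periodically drains committed rows into the real
# job queue and removes them from the staging table.
StagedJob.order(:id).limit(100).each do |staged|
  JobQueue.enqueue(staged.job_name, staged.job_args) # illustrative queue API
  staged.destroy!
end
```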
The idempotency key record itself carries a few notable fields:
- `locked_at`: A field that indicates whether this idempotency key is actively being worked. The first API request that creates the key will lock it automatically, but subsequent retries will also set it to make sure that they’re the only request doing the work.
- `params`: The input parameters of the request. This is stored mostly so that we can error if the user sends two requests with the same idempotency key but with different parameters, but can also be used for our own backend to push unfinished requests to completion (see the completionist below).
- `recovery_point`: A text label for the last phase completed for the idempotent request (see recovery points above). Gets an initial value of `started` and is set to `finished` when the request is considered to be complete.
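Put together, the idempotency key record might be shaped roughly like the Rails-style migration below; the columns beyond the three described above (user, response, timestamps) are illustrative additions, not the article’s exact schema.

```ruby
class CreateIdempotencyKeys < ActiveRecord::Migration[7.0]
  def change
    create_table :idempotency_keys do |t|
      t.string   :idempotency_key, null: false
      t.bigint   :user_id,         null: false
      t.datetime :locked_at                     # non-NULL while a request is working the key
      t.jsonb    :params,          null: false  # original request params, for mismatch checks
      t.string   :recovery_point,  null: false, default: "started"
      t.integer  :response_code                 # stored result, replayed verbatim on retries
      t.jsonb    :response_body
      t.timestamps
      t.index [:user_id, :idempotency_key], unique: true
    end
  end
end
```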
Loosely speaking, in a web application, an idempotent action is one that can be safely executed against the application multiple times without producing an undesired result. This doesn't mean that each subsequent action will necessarily be accepted - in fact, it may very well be rejected. But, an "idempotent action" implies that it is safe for a consumer to try and run the action against the application multiple times, even in parallel.
The distributed lock was put in place to perform one (or both) of the following duties:
- Enforce a constraint across multiple resources.
- Reduce processing demands on the server.
For the latter, consider something like generating a PDF document, resizing an image, or performing an rdiff between two binaries. In such cases, a distributed lock was used to ensure that parallel requests for the same action would "fail fast", reducing load on the server. When we move to implement idempotency without distributed locks, we are, to some degree, dismissing this requirement out of hand. We're embracing the idea that processing power is cheap, especially at scale; and, if a server has to perform redundant processing in certain edge-cases, then so be it.
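Concretely, dropping the lock means two racing requests may both do the expensive work, but as long as the final write is idempotent, the duplicate effort is only wasted CPU, never a correctness problem. A tiny sketch of that idea (all names invented for illustration):

```ruby
require "digest"

STORE = {} # stands in for a blob store or cache

# Two concurrent requests for the same image may both render the thumbnail;
# because the write is keyed by the content hash, the duplicate write simply
# overwrites an identical value. No distributed lock is needed for correctness.
def thumbnail_for(image_bytes)
  key = "thumb:#{Digest::SHA256.hexdigest(image_bytes)}"
  STORE[key] = render_thumbnail(image_bytes) # idempotent keyed write
end

def render_thumbnail(image_bytes)
  image_bytes.reverse # placeholder for genuinely expensive work
end
```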