Sunday, March 31, 2019

Taming The Long Latency Tail - Handling slow requests



http://highscalability.com/blog/2012/3/12/google-taming-the-long-latency-tail-when-more-machines-equal.html
Luiz André Barroso, Distinguished Engineer at Google, talks about this fundamental property of scaling systems in his fascinating talk, Warehouse-Scale Computing: Entering the Teenage Decade. Google found that the larger the scale, the greater the impact of latency variability. When a request is implemented by work done in parallel, as is common with today's service-oriented systems, the overall response time is dominated by the long tail distribution of the parallel operations. Every response must have a consistent and low latency or the overall operation response time will be tragically slow. The implication: high performance equals high tolerances, which means your entire system must be designed to exacting standards.
What is forcing a deeper look into latency variability is the advent of interactive real-time computing. Responsiveness becomes key. Good average response times aren't good enough. You simply can't naively scale up techniques to build larger systems. The reason is surprising and has deep implications for how we design service-dominated systems:
Google Likes Request Level Parallelism, Which Is Easy

Mom And Apple Pie Techniques For Managing Latency

These techniques are the “general good engineering practices” for managing latency:
  • Prioritize request queues and network traffic. Do the most important work first.
  • Reduce head-of-line blocking. Break large requests into a sequence of small requests. This time-slices a large request rather than letting it block all other requests.
  • Rate limit activity. Drop or delay traffic that exceeds a specified rate (a small rate-limiter sketch follows this list).
  • Defer expensive activity until load is lower.
  • Synchronize disruptions. Regular maintenance and monitoring tasks should not be randomized. Randomization means at any given time there will be slow machines in a computation. Instead, run tasks at the same time so the latency hit is only taken during a small window.
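
As a rough illustration of the rate-limiting item above, here is a minimal token-bucket sketch in Go; the rate and burst numbers are made up for the example and are not from the talk.

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // tokenBucket is a minimal token-bucket rate limiter: traffic that exceeds
    // the configured rate is dropped (it could just as well be delayed).
    type tokenBucket struct {
        mu       sync.Mutex
        tokens   float64
        capacity float64   // maximum burst size
        rate     float64   // tokens added per second
        last     time.Time // last refill time
    }

    func newTokenBucket(rate, capacity float64) *tokenBucket {
        return &tokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
    }

    // allow reports whether one more request fits under the configured rate.
    func (b *tokenBucket) allow() bool {
        b.mu.Lock()
        defer b.mu.Unlock()
        now := time.Now()
        b.tokens += now.Sub(b.last).Seconds() * b.rate
        if b.tokens > b.capacity {
            b.tokens = b.capacity
        }
        b.last = now
        if b.tokens >= 1 {
            b.tokens--
            return true
        }
        return false // over the limit: drop or defer this request
    }

    func main() {
        limiter := newTokenBucket(100, 20) // 100 requests/second, bursts of 20
        for i := 0; i < 25; i++ {
            fmt.Println(i, limiter.allow()) // the first ~20 are admitted, the rest dropped
        }
    }

In practice a limiter like this would sit in front of each request queue, with different rates per traffic class so the most important work is admitted first.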

Cross Request Adaptation Strategies

The idea behind these strategies is to examine recent behavior and take action to improve latency of future requests within tens of seconds or minutes. The strategies are:
  • Fine-grained dynamic partitioning. Partition large datasets and computations. Keep more than 1 partition per machine (often 10-100/machine). Partitions make it easy to assign work to machines and react to changing situations as the partitions can be recovered on failure or replicated or moved as needed.
  • Load balancing. Load is shed in few-percent increments. Shifting load can be prioritized when the imbalance is severe. Different resource dimensions can be overloaded: memory, disk, or CPU. You can't just look at CPU load, because machines with the same CPU load may be bottlenecked on different resources. Collect distributions of each dimension, build histograms, and even out work across machines by looking at standard deviations and distributions (see the sketch after this list).
  • Selective partitioning. Make more replicas of heavily used items. This works for static or dynamic content. Important documents or Chinese documents, for example, can be replicated to handle greater query loads.
  • Latency-induced probation. When a server is slow to respond it could be because of interference caused by jobs running on the machine. So make a copy of the partition and move it to another machine, still sending shadow copies of the requests to the server. Keep measuring latency and when the latency improves return the partition to service.
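
As a rough sketch of the load-balancing idea above: collect a sample of each resource dimension per machine, and flag machines that are outliers in any dimension as candidates to shed a few percent of their partitions. The machine names, utilization numbers, and the two-standard-deviation threshold are all made up for illustration.

    package main

    import (
        "fmt"
        "math"
    )

    // usage is one machine's load along several resource dimensions,
    // each expressed as a fraction of capacity in use (0.0 to 1.0).
    type usage struct {
        name              string
        cpu, memory, disk float64
    }

    // meanStd returns the mean and standard deviation of a sample.
    func meanStd(xs []float64) (float64, float64) {
        var mean, sumSq float64
        for _, x := range xs {
            mean += x
        }
        mean /= float64(len(xs))
        for _, x := range xs {
            sumSq += (x - mean) * (x - mean)
        }
        return mean, math.Sqrt(sumSq / float64(len(xs)))
    }

    // overloaded flags machines more than two standard deviations above the mean
    // in any dimension; looking at CPU alone would miss a machine that is hot on
    // memory or disk while its CPU load matches everyone else's.
    func overloaded(machines []usage) []string {
        dims := map[string]func(usage) float64{
            "cpu":    func(u usage) float64 { return u.cpu },
            "memory": func(u usage) float64 { return u.memory },
            "disk":   func(u usage) float64 { return u.disk },
        }
        var hot []string
        for dim, value := range dims {
            xs := make([]float64, len(machines))
            for i, m := range machines {
                xs[i] = value(m)
            }
            mean, std := meanStd(xs)
            for _, m := range machines {
                if value(m) > mean+2*std {
                    hot = append(hot, m.name+" ("+dim+")") // shed a few percent of its partitions
                }
            }
        }
        return hot
    }

    func main() {
        fleet := []usage{
            {"m1", 0.55, 0.40, 0.30}, {"m2", 0.60, 0.92, 0.35}, {"m3", 0.58, 0.42, 0.33},
            {"m4", 0.57, 0.38, 0.31}, {"m5", 0.61, 0.41, 0.34}, {"m6", 0.56, 0.44, 0.29},
        }
        fmt.Println("shed load from:", overloaded(fleet)) // m2 is hot on memory only
    }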

Within-Request Adaptation Strategies

The idea behind these strategies is to fix a slow request as it is happening. The strategies are:
  • Canary requests
  • Backup requests with cross-server cancellation
  • Tainted results

Backup Requests With Cross-Server Cancellation

Backup requests are the idea of sending a request to multiple replicas, but in a particular way. Here’s the example for a read operation from a distributed file system client:
  • send the request to the first replica
  • wait 2 ms, then send the same request to the second replica
  • whichever server starts the read cancels the outstanding request on the other replica
A request could be stuck in a queue behind an expensive query, or a packet could be dropped, so if a reply is not returned quickly other replicas are tried. With the request sitting in multiple queues, some response is likely to come back fast.
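
A minimal Go sketch of the pattern: send the read to the first replica, hedge to the second after 2 ms, take whichever response comes back first, and cancel the other. In the talk the servers cancel each other's outstanding request directly; this client-side version approximates that with a shared context, and queryReplica is a stand-in for the real RPC.

    package main

    import (
        "context"
        "fmt"
        "math/rand"
        "time"
    )

    // readBlock sends the same read to each replica in turn, staggered by the
    // hedge delay, returns the first response, and cancels the rest.
    func readBlock(ctx context.Context, replicas []string, hedge time.Duration) string {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // once a result is in, the slower replica's request is cancelled

        results := make(chan string, len(replicas)) // buffered so late repliers never block
        for i, addr := range replicas {
            go func(i int, addr string) {
                // The backup request waits out the hedge delay instead of fanning out immediately.
                select {
                case <-time.After(time.Duration(i) * hedge):
                case <-ctx.Done():
                    return
                }
                results <- queryReplica(ctx, addr)
            }(i, addr)
        }
        return <-results // first response wins
    }

    // queryReplica fakes a replica whose latency is usually low but occasionally
    // lands in the long tail, e.g. when stuck in a queue behind an expensive query.
    func queryReplica(ctx context.Context, addr string) string {
        delay := time.Duration(1+rand.Intn(2)) * time.Millisecond
        if rand.Intn(100) == 0 {
            delay = 500 * time.Millisecond // the unlucky tail case the hedge protects against
        }
        select {
        case <-time.After(delay):
            return "data from " + addr
        case <-ctx.Done():
            return "" // cancelled: the other replica already answered
        }
    }

    func main() {
        data := readBlock(context.Background(), []string{"replica-a", "replica-b"}, 2*time.Millisecond)
        fmt.Println(data)
    }

The cross-server variant would also pass each replica the identity of its peer, so whichever server starts the read can cancel the request on the other one directly.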

Cancellation reduces the frequency at which redundant work occurs.

You might think this is just a lot of extra traffic, but remember, the goal is to squeeze down the 99th percentile of the distribution, so the backup requests, even with what seems like a long wait time, really bring down the tail latency and standard deviation. Requests in the 99th percentile take so long that the wait time is short in comparison.

Bigtable, for example, used backup requests after two milliseconds, which dramatically dropped the 99th percentile latency by 43 percent on an idle system. With a loaded system the reduction was 38 percent, with only one percent extra disk seeks. Backup requests with cancellation give roughly the same latency distribution as an unloaded cluster.

With these latency tolerance techniques you are taking a loaded cluster with high variability and making it perform like an unloaded cluster.

There are many variations of this strategy. A backup request could be marked with a lower priority so it won't block real work. The request could be sent to a third cluster after a longer delay. Wait times could be adjusted so backup requests are sent only when the wait time hits the 90th percentile of expected latency (a sketch of this follows below).
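
One way to realize that last variation: keep a window of recent latencies and derive the hedge delay from its 90th percentile, so roughly 90 percent of requests never trigger the backup at all. A minimal sketch; the window size and the sample latencies are made up, and a production system would use a streaming histogram rather than sorting on every call.

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // latencyWindow keeps the most recent request latencies so the hedge delay
    // can track the observed distribution instead of using a fixed 2 ms.
    type latencyWindow struct {
        samples []time.Duration
        max     int
    }

    func (w *latencyWindow) record(d time.Duration) {
        w.samples = append(w.samples, d)
        if len(w.samples) > w.max {
            w.samples = w.samples[1:] // drop the oldest sample
        }
    }

    // percentile returns the p-th percentile (0.0-1.0) of the recorded samples.
    func (w *latencyWindow) percentile(p float64) time.Duration {
        if len(w.samples) == 0 {
            return 2 * time.Millisecond // fall back to a fixed delay until there is data
        }
        sorted := append([]time.Duration(nil), w.samples...)
        sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
        return sorted[int(p*float64(len(sorted)-1))]
    }

    func main() {
        w := &latencyWindow{max: 1000}
        for _, ms := range []int{1, 1, 2, 2, 3, 40} { // made-up recent latencies in milliseconds
            w.record(time.Duration(ms) * time.Millisecond)
        }
        // Send the backup request only after the 90th-percentile wait has elapsed.
        fmt.Println("hedge after:", w.percentile(0.9)) // 3ms for this sample
    }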

Tainted Results - Proactively Abandon Slow Subsystems

  • Trade off completeness for responsiveness. Under load, noncritical subcomponents can be dropped. It’s better, for example, to search a smaller subset of pages or skip spelling corrections than it is to be slow (see the sketch after this list).
  • Do not cache tainted results. You don’t want incomplete results served later as if they were complete.
  • Set cutoffs dynamically based on recent measurements.
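
A minimal sketch of the trade-off, assuming a hypothetical search handler where spelling correction is the noncritical subcomponent; the 5 ms budget is an arbitrary illustration of a cutoff that would really be set from recent measurements. A response that skipped the subcomponent is marked tainted so it is returned to the user but never cached.

    package main

    import (
        "fmt"
        "time"
    )

    // search always returns the critical results; the spelling suggestion is
    // noncritical and is dropped if it misses its latency budget. A response
    // missing noncritical pieces is marked tainted and must not be cached.
    func search(query string, budget time.Duration) (results, suggestion string, tainted bool) {
        results = "results for " + query // assume the critical path always completes

        spellCh := make(chan string, 1)
        go func() { spellCh <- correctSpelling(query) }()

        select {
        case suggestion = <-spellCh:
            // Noncritical work finished in time; the response is complete.
        case <-time.After(budget):
            tainted = true // respond now, without the spelling correction
        }
        return results, suggestion, tainted
    }

    // correctSpelling stands in for an expensive noncritical subcomponent.
    func correctSpelling(query string) string {
        time.Sleep(20 * time.Millisecond) // slower than the 5 ms budget in this run
        return "did you mean: " + query + "?"
    }

    func main() {
        results, suggestion, tainted := search("latencey tail", 5*time.Millisecond)
        fmt.Println(results, suggestion)
        if !tainted {
            fmt.Println("cacheable") // only complete, untainted responses would go into the cache
        }
    }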
http://highscalability.com/blog/2010/11/22/strategy-google-sends-canary-requests-into-the-data-mine.html
Google runs queries against thousands of in-memory index nodes in parallel and then merges the results. One of the interesting problems with this approach, explains Google's Jeff Dean in this lecture at Stanford, is the Query of Death.
A query can cause a program to fail because of bugs or various other issues. Because each query fans out to thousands of machines in parallel, a single bad query can take down an entire cluster, which is not good for availability or response times, as it takes quite a while for thousands of machines to recover. Thus the Query of Death. New queries are always coming into the system, and when you are always rolling out new software, it's impossible to get rid of the problem completely.
Two solutions:
  • Test against logs. Google replays a month's worth of logs to see if any of those queries kill anything. That helps, but Queries of Death may still happen.
  • Send a canary request. A request is sent to one machine first. If the request succeeds there, it will probably succeed on all machines, so go ahead with the full query. If the request fails, only one machine is down, which is no big deal. Try the request again on another machine to verify that it really is a query of death. If the request fails a certain number of times, it is rejected and logged for further debugging (see the sketch below).
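
A minimal sketch of the canary flow, assuming a hypothetical execute call that runs the query on one index node and an arbitrary limit of three canary attempts:

    package main

    import (
        "errors"
        "fmt"
    )

    const maxCanaryAttempts = 3 // arbitrary limit for this sketch

    // fanOut runs the query on single machines first; only after a canary
    // succeeds does it go to the full set of index nodes, so a query of death
    // crashes at most a few servers instead of thousands.
    func fanOut(query string, nodes []string) error {
        for attempt := 0; attempt < maxCanaryAttempts && attempt < len(nodes); attempt++ {
            if err := execute(query, nodes[attempt]); err == nil {
                // The canary survived, so the query will probably succeed everywhere.
                for _, node := range nodes[attempt+1:] {
                    execute(query, node) // full fan-out; results get merged elsewhere
                }
                return nil
            }
            // The canary failed: retry on another machine to confirm it is the
            // query that is bad, not the node.
        }
        return errors.New("rejected likely query of death: " + query) // log for debugging
    }

    // execute stands in for sending the query to one in-memory index node.
    func execute(query, node string) error {
        if query == "crashy query" {
            return errors.New("crashed " + node)
        }
        return nil
    }

    func main() {
        nodes := []string{"node-1", "node-2", "node-3", "node-4"}
        fmt.Println(fanOut("normal query", nodes)) // <nil>: canary passed, query fanned out
        fmt.Println(fanOut("crashy query", nodes)) // rejected after three failed canaries
    }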


The result is that only a few servers crash instead of thousands. This is a pretty clever technique, especially given the combined trends of scale-out and continuous deployment. It could also be a useful strategy for others.

https://landing.google.com/sre/sre-book/chapters/addressing-cascading-failures/
