http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
New systems engineers will find the Fallacies of Distributed Computing and the CAP theorem as part of their self-education.
Distributed systems are different because they fail often.
What sets distributed systems engineering apart is the probability of failure and, worse, the probability of partial failure.
Systems engineers that haven’t worked in distributed computation will come up with ideas like “well, it’ll just send the write to both machines” or “it’ll just keep retrying the write until it succeeds”. These engineers haven’t completely accepted (though they usually intellectually recognize) that networked systems fail more than systems that exist on only a single machine and that failures tend to be partial instead of total. One of the writes may succeed while the other fails, and so now how do we get a consistent view of the data? These partial failures are much harder to reason about.
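A toy sketch of that dual-write idea, with the network failures simulated in-process (the replica names and failure rate are invented for illustration):

```python
import random

def write_to_replica(name, key, value):
    """Stand-in for a networked write; in a real system this is an RPC that can time out."""
    if random.random() < 0.2:                      # simulated partial failure
        raise ConnectionError(f"write to {name} failed (and may or may not have applied)")
    return True

def write_both(key, value):
    """Naive dual-write: each write fails independently, so the replicas can diverge."""
    results = {}
    for replica in ("replica-a", "replica-b"):
        try:
            results[replica] = write_to_replica(replica, key, value)
        except ConnectionError:
            results[replica] = False
    return results

# A run may print {'replica-a': True, 'replica-b': False} -- which copy is the truth now?
print(write_both("user:42:email", "new@example.com"))
```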
Switches go down, garbage collection pauses make masters “disappear”, socket writes seem to succeed but have actually failed on the other machine, a slow disk drive on one machine causes a communication protocol in the whole cluster to crawl, and so on. Reading from local memory is simply more stable than reading across a few switches.
Design for failure.
Writing robust distributed systems costs more than writing robust single-machine systems.
There are failure conditions that are difficult to replicate on a single machine. Whether it’s because they only occur on dataset sizes much larger than can fit on a shared machine, or in the network conditions found in datacenters, distributed systems tend to need actual, not simulated, distribution to flush out their bugs. Simulation is, of course, very useful.

Robust, open source distributed systems are much less common than robust, single-machine systems.
Coordination is very hard.
Avoid coordinating machines wherever possible. This is often described as “horizontal scalability”. The real trick of horizontal scalability is independence – being able to get data to machines such that communication and consensus between those machines is kept to a minimum. Every time two machines have to agree on something, the service is harder to implement.
Learning about the Two Generals and Byzantine Generals problems is useful here. (Oh, and Paxos really is very hard to implement; that’s not grumpy old engineers thinking they know better than you.)
“It’s slow” is the hardest problem you’ll ever debug.
Dapper and Zipkin were built for a reason.
Implement backpressure throughout your system.
Backpressure is the signaling of failure from a serving system to the requesting system and how the requesting system handles those failures to prevent overloading itself and the serving system. Designing for backpressure means bounding resource utilization during times of overload and times of system failure. This is one of the basic building blocks of creating a robust distributed system.
Common versions include dropping new messages on the floor (and incrementing a metric) if the system’s resources are already over-scheduled, and shipping errors back to users when the system determines it will be unable to finish the request in a given amount of time. Timeouts and exponential back-offs on connections and requests to other systems are also useful.
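One of those mechanisms, load-shedding with a bounded queue, might look roughly like this (the queue bound and counter are arbitrary choices for illustration):

```python
import queue

MAX_PENDING = 100                       # arbitrary bound chosen for illustration
pending = queue.Queue(maxsize=MAX_PENDING)
dropped_requests = 0                    # in a real system, an exported counter metric

def enqueue_request(request):
    """Shed load instead of letting the backlog grow without bound."""
    global dropped_requests
    try:
        pending.put_nowait(request)
        return True
    except queue.Full:
        dropped_requests += 1           # count the drop...
        return False                    # ...and push the failure back to the caller
```

A caller that gets this rejection can then time out, back off exponentially, or surface an error to the user instead of piling more load onto an already overloaded system.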
Without backpressure mechanisms in place, cascading failure or unintentional message loss become likely.
When a system is not able to handle the failures of another, it tends to emit failures to another system that depends on it.
Find ways to be partially available.
Partial availability is being able to return some results even when parts of your system are failing.
A typical search system sets a time limit on how long it will search its documents and, if that time limit expires before all of its documents are searched, it will return whatever results it has gathered. This makes search easier to scale in the face of intermittent slowdowns and errors, because those failures are treated the same as not being able to search all of the documents. The system allows partial results to be returned to the user, and its resilience is increased.
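A minimal sketch of that deadline-and-partial-results pattern, with the shard searches simulated in-process (the shard function, deadline, and shard count are invented for illustration):

```python
import concurrent.futures
import random
import time

def search_shard(shard_id, query):
    """Stand-in for searching one partition; sometimes it is slow."""
    time.sleep(random.uniform(0.01, 0.3))        # simulated variable latency
    return [f"shard{shard_id}:{query}:hit"]

def search_all(query, deadline=0.1, num_shards=8):
    """Return whatever results arrive before the deadline and ignore stragglers."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=num_shards)
    futures = [pool.submit(search_shard, s, query) for s in range(num_shards)]
    done, not_done = concurrent.futures.wait(futures, timeout=deadline)
    pool.shutdown(wait=False)                    # don't block on the slow shards
    results = []
    for f in done:
        results.extend(f.result())
    return results                               # partial, but better than an error

print(search_all("cats"))
```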
Metrics are the only way to get your job done.
Exposing metrics (such as latency percentiles, increasing counters on certain actions, rates of change) is the only way to cross the gap between what you believe your system does in production and what it is actually doing.
Prefer logging as if someone who has not seen the code will be reading the logs.
Use percentiles, not averages.
Percentiles (50th, 99th, 99.9th, 99.99th) are more accurate and informative than averages in the vast majority of distributed systems. Using a mean assumes that the metric under evaluation follows a bell curve but, in practice, this describes very few metrics an engineer cares about. “Average latency” is a commonly reported metric, but I’ve never once seen a distributed system whose latency followed a bell curve. If the metric doesn’t follow a bell curve, the average is meaningless and leads to incorrect decisions and understanding. Avoid the trap by talking in percentiles. Default to percentiles, and you’ll better understand how users really see your system.
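A quick sketch of the difference, using made-up latency samples:

```python
import statistics

# made-up latency samples in milliseconds: mostly fast, with a heavy tail
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 450, 11, 14, 13, 900, 15, 12, 13, 14]

cuts = statistics.quantiles(latencies_ms, n=100)   # 99 percentile cut points
mean, p50, p99 = statistics.mean(latencies_ms), cuts[49], cuts[98]

print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms")
# The mean is dragged far from the typical request by a few outliers;
# the median and p99 describe what users actually experience.
```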
Learn to estimate your capacity.
Jeff Dean’s Numbers Everyone Should Know slide is a good expectation-setter.
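A back-of-the-envelope sketch in the spirit of that slide (the constants below are rough, widely quoted approximations, not measurements; they shift with hardware generations):

```python
# Rough figures in the spirit of the "Numbers Everyone Should Know" slide.
# These are approximations; measure your own systems before trusting them.
DISK_SEEK_S      = 10e-3    # ~10 ms per random seek on a spinning disk
READ_1MB_MEM_S   = 250e-6   # ~250 microseconds to read 1 MB sequentially from memory
DATACENTER_RTT_S = 500e-6   # ~0.5 ms round trip within a datacenter

# A single spindle can service roughly this many random reads per second:
print("random seeks/sec per disk:", int(1 / DISK_SEEK_S))                 # ~100

# A request that makes 5 serial calls to another service pays at least:
print("network floor for 5 serial calls (ms):", 5 * DATACENTER_RTT_S * 1e3)
```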
Feature flags are how infrastructure is rolled out.
“Feature flags” are a common way product engineers roll out new features in a system.

Feature flags sound like a terrible mess of conditionals to a classically trained object-oriented developer or a new engineer with well-intentioned training. And the use of feature flags means accepting that having multiple versions of infrastructure and data is a norm, not a rarity. This is a deep lesson. What works well for single-machine systems sometimes falters in the face of distributed problems.
Feature flags are best understood as a trade-off, trading local complexity (in the code, in one system) for global simplicity and resilience.
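A minimal sketch of a percentage-based flag, assuming flag state is loaded from configuration (the flag name, bucketing scheme, and backend functions are invented for illustration):

```python
import hashlib

# Flag state that would normally be loaded from configuration or a flag service.
FLAGS = {"use_new_storage_backend": 5}        # percent of users routed to the new path

def flag_enabled(flag_name, user_id):
    """Deterministically bucket users so each user always sees the same code path."""
    rollout_pct = FLAGS.get(flag_name, 0)
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_pct

def write_to_new_backend(record): ...         # placeholder for the new code path
def write_to_old_backend(record): ...         # placeholder for the existing code path

def save_record(user_id, record):
    if flag_enabled("use_new_storage_backend", user_id):
        write_to_new_backend(record)
    else:
        write_to_old_backend(record)
```

Raising the percentage rolls the new infrastructure out gradually; setting it back to zero rolls it back without a deploy.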
Choose id spaces wisely.
The space of ids you choose for your system will shape your system.
The more ids required to get to a piece of data, the more options you have in partitioning the data. The fewer ids required to get a piece of data, the easier it is to consume your system’s output.
Consider version 1 of the Twitter API. All operations to get, create, and delete tweets were done with respect to a single numeric id for each tweet. The tweet id is a simple 64-bit number that is not connected to any other piece of data. As the number of tweets goes up, it becomes clear that creating user tweet timelines and the timelines of other users’ subscriptions could be constructed efficiently if all of the tweets by the same user were stored on the same machine.
But the public API requires that every tweet be addressable by just the tweet id. To partition tweets by user, a lookup service would have to be constructed, one that knows which user owns which tweet id. Doable, if necessary, but with a non-trivial cost.
An alternative API could have required the user id in any tweet look up and, initially, simply used the tweet id for storage until user-partitioned storage came online. Another alternative would have included the user id in the tweet id itself at the cost of tweet ids no longer being k-sortable and numeric.
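A hedged sketch of that last alternative, packing the user id into a 64-bit tweet id (the bit layout is invented for illustration and is not Twitter’s actual scheme):

```python
USER_BITS = 32                      # invented split: high 32 bits user, low 32 bits sequence

def make_tweet_id(user_id, per_user_sequence):
    """Pack the owning user into the id so storage can be partitioned by user."""
    assert user_id < (1 << USER_BITS) and per_user_sequence < (1 << (64 - USER_BITS))
    return (user_id << (64 - USER_BITS)) | per_user_sequence

def user_of(tweet_id):
    """Any service holding only a tweet id can recover the partition key."""
    return tweet_id >> (64 - USER_BITS)

tid = make_tweet_id(user_id=12345, per_user_sequence=678)
assert user_of(tid) == 12345
# The trade-off from the text: ids are no longer globally k-sortable by creation time.
```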
Watch out for what kind of information you encode in your ids, explicitly and implicitly. Clients may use the structure of your ids to de-anonymize private data, crawl your system in ways you didn’t expect (auto-incrementing ids are a typical sore point), or a host of other things you won’t expect.
Exploit data-locality.
Data-locality implies locality in space, but also locality in time. If multiple users are making the same expensive request at nearly the same time, perhaps their requests can be joined into one. If multiple instances of requests for the same kind of data are made near to one another, they could be joined into one larger request. Doing so often affords lower communication overhead and easier fault management.
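A minimal sketch of joining concurrent identical requests into one (the in-flight table and the expensive_fetch stand-in are invented for illustration):

```python
import threading
from concurrent.futures import Future

_inflight = {}                 # request key -> Future for the single outstanding fetch
_lock = threading.Lock()

def expensive_fetch(key):
    """Stand-in for the costly backend call we want to issue only once."""
    return f"result-for-{key}"

def coalesced_get(key):
    """Callers asking for the same key at the same time share one backend request."""
    with _lock:
        fut = _inflight.get(key)
        leader = fut is None
        if leader:                            # we are first: do the real work
            fut = Future()
            _inflight[key] = fut
    if leader:
        try:
            fut.set_result(expensive_fetch(key))
        except Exception as exc:
            fut.set_exception(exc)
        finally:
            with _lock:
                _inflight.pop(key, None)
    return fut.result()                       # followers block on the leader's result
```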
Writing cached data back to persistent storage is bad.
A common presentation of this flaw is user information (e.g. screennames, emails, and hashed passwords) mysteriously reverting to a previous value.
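A toy sketch of how that reversion happens, with plain dicts standing in for the cache and the persistent store:

```python
# Plain dicts stand in for a cache and a database.
cache = {"user:42:email": "old@example.com"}     # stale copy, cached long ago
database = {"user:42:email": "old@example.com"}

# 1. The user updates their email; the write goes to the database,
#    but the cache entry is never invalidated.
database["user:42:email"] = "new@example.com"

# 2. Later, a "helpful" job writes cached data back to persistent storage...
for key, value in cache.items():
    database[key] = value                        # ...silently reverting the newer value

print(database["user:42:email"])                 # "old@example.com" again
```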
Computers can do more than you think they can.
Use the CAP theorem to critique systems.
It is well-suited for critiquing a distributed system design and for understanding what trade-offs need to be made. Taking a system design and iterating through the constraints CAP puts on its subsystems will leave you with a better design at the end.
One last note: Out of C, A, and P, you can’t choose CA.
Extract services.