http://book.mixu.net/distsys/single-page.html
the core of distributed programming is dealing with distance and having more than one thing.
how distance, time and consistency models interact.
some exciting new ways to look at eventual consistency that still haven't made it into college textbooks - such as CRDTs and the CALM theorem.
Distributed programming is the art of solving the same problem that you can solve on a single computer using multiple computers - usually, because the problem no longer fits on a single computer.
the best value is in mid-range, commodity hardware - as long as the maintenance costs can be kept down through fault-tolerant software.
Scalability
is the ability of a system, network, or process, to handle a growing amount of work in a capable manner or its ability to be enlarged to accommodate that growth.
Size scalability: adding more nodes should make the system linearly faster; growing the dataset should not increase latency
Geographic scalability: it should be possible to use multiple data centers to reduce the time it takes to respond to user queries, while dealing with cross-data center latency in some sensible manner.
Administrative scalability: adding more nodes should not increase the administrative costs of the system (e.g. the administrators-to-machines ratio).
Performance
is characterized by the amount of useful work accomplished by a computer system compared to the time and resources used.
Latency
The state of being latent; the delay between the initiation of something and its occurrence.
Availability
the proportion of time a system is in a functioning condition. If a user cannot access the system, it is said to be unavailable.
Availability from a technical perspective is mostly about being fault tolerant. Because the probability of a failure occurring increases with the number of components, the system should be able to compensate so as not to become less reliable as the number of components increases.
Availability % | How much downtime is allowed per year?
90% ("one nine") | More than a month
99% ("two nines") | Less than 4 days
99.9% ("three nines") | Less than 9 hours
99.99% ("four nines") | Less than an hour
99.999% ("five nines") | ~5 minutes
99.9999% ("six nines") | ~31 seconds
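The figures in the table follow directly from Availability = uptime / (uptime + downtime). A minimal Python sketch (not from the text; the names are mine) that reproduces them:

```python
# Derive the allowed downtime per year from an availability percentage.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def downtime_per_year(availability_percent: float) -> float:
    """Return the maximum allowed downtime per year, in seconds."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * SECONDS_PER_YEAR

for availability in (90, 99, 99.9, 99.99, 99.999, 99.9999):
    seconds = downtime_per_year(availability)
    print(f"{availability}% -> {seconds:.0f} s/year (~{seconds / 86400:.2f} days)")
```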
Fault tolerance
ability of a system to behave in a well-defined manner once faults occur
Fault tolerance boils down to this: define what faults you expect and then design a system or an algorithm that is tolerant of them. You can't tolerate faults you haven't considered.
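As a narrow illustration of that point (a sketch of my own, not from the text): the failure model below assumes only transient, crash-like failures of a remote call, and tolerates them with retries and failover to other (hypothetical) replicas. A fault outside that model - say, a Byzantine node returning wrong answers - is not handled, because it was never considered.

```python
import random
import time

class TransientFailure(Exception):
    """The single fault type this sketch is designed to tolerate."""

def call_replica(replica: str, request: str) -> str:
    # Hypothetical remote call; it fails randomly to simulate crash-like faults.
    if random.random() < 0.3:
        raise TransientFailure(f"{replica} did not respond")
    return f"{replica} handled {request}"

def fault_tolerant_call(replicas: list[str], request: str, retries: int = 2) -> str:
    """Tolerate transient failures by retrying, then failing over to the next replica."""
    for replica in replicas:
        for _ in range(retries):
            try:
                return call_replica(replica, request)
            except TransientFailure:
                time.sleep(0.01)  # back off briefly before retrying
    # Anything outside the assumed failure model surfaces as a plain error.
    raise RuntimeError("request failed: fault not covered by the failure model")

print(fault_tolerant_call(["node-a", "node-b", "node-c"], "read key=42"))
```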
Abstractions make things more manageable by removing real-world aspects that are not relevant to solving a problem. Models describe the key properties of a distributed system in a precise manner.
System model (asynchronous / synchronous)
Failure model (crash-fail, partitions, Byzantine)
Consistency model (strong, eventual)
Often, the most familiar model (for example, implementing a shared memory abstraction on a distributed system) is too expensive.
One can often gain performance by exposing more details about the internals of the system.
Network latency and network partitions (e.g. total network failure between some nodes) mean that a system needs to sometimes make hard choices about whether it is better to stay available but lose some crucial guarantees that cannot be enforced, or to play it safe and refuse clients when these types of failures occur.
Design techniques: partition and replicate (consistency)
The data can be split over multiple nodes (partitioning) to allow for more parallel processing. It can also be copied or cached on different nodes to reduce the distance between the client and the server and for greater fault tolerance (replication).
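A minimal sketch of both techniques together (hypothetical node names and a simple hash-based scheme of my own, not something prescribed by the text): each key is hashed to a partition, and each partition is placed on several nodes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
NUM_PARTITIONS = 8
REPLICATION_FACTOR = 2

def partition_for(key: str) -> int:
    """Partitioning: hash the key to one of a fixed number of partitions."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def replicas_for(partition: int) -> list[str]:
    """Replication: place each partition on REPLICATION_FACTOR consecutive nodes."""
    start = partition % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

key = "user:42"
p = partition_for(key)
print(key, "-> partition", p, "-> replicas", replicas_for(p))
```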
Only one consistency model for replication - strong consistency - allows you to program as if the underlying data were not replicated. Other consistency models expose some internals of the replication to the programmer. However, weaker consistency models can provide lower latency and higher availability - and are not necessarily harder to understand, just different.
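A toy simulation of the difference this makes to the programmer (in-memory stand-ins for replicas, not a real replication protocol): with synchronous, strongly consistent writes every replica returns the latest value, while with asynchronous writes a read may see a stale value until the replicas converge.

```python
class ReplicaSet:
    """Toy in-memory replicas; real systems replicate over a network."""
    def __init__(self, n: int = 3):
        self.replicas = [dict() for _ in range(n)]

    def write_sync(self, key, value):
        # Strong-consistency flavour: the write reaches every replica before
        # it is acknowledged, so any subsequent read sees it.
        for replica in self.replicas:
            replica[key] = value

    def write_async(self, key, value):
        # Eventual-consistency flavour: acknowledge after updating one replica;
        # the others would converge later (simulated here by not updating them).
        self.replicas[0][key] = value

    def read(self, key, replica_index: int):
        return self.replicas[replica_index].get(key)

rs = ReplicaSet()
rs.write_sync("x", 1)
print([rs.read("x", i) for i in range(3)])   # [1, 1, 1] - every replica agrees

rs.write_async("x", 2)
print([rs.read("x", i) for i in range(3)])   # [2, 1, 1] - stale reads until convergence
```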