Wednesday, February 14, 2018

Paxos



https://blog.acolyer.org/2015/03/01/cant-we-all-just-agree/
https://blog.acolyer.org/2015/03/04/paxos-made-simple/
If the majority of members choose some value v : V, then it is impossible for there to be a majority agreeing on any value other than v. Now let the members of S each choose a new value, and suppose that a majority choose some new value v’. At least one of the members in the new majority must also have been a member of the earlier majority. This follows straightforwardly from the definition of majority: if you substract from S the set of members in the original majority there are (by definition) not enough members left to form a majority. Therefore any new majority that does form, must do so by including a member from a previous majority

An asynchronous, non-Byzantine communications model is assumed.
before issuing a proposal with proposal number
n, a proposer must ask a majority of the acceptors for the proposal value of highest numbered proposal.

Instead of trying to predict the future, the proposer controls it by extracting a promise that there won’t be any such acceptances. In other words, the proposer requests that the acceptors not accept any more proposals numbered less than n

Prepare phase

  1. proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors. “If I make a proposal with number n, are there any constraints on the value I must propose?”
  2. If an acceptor receives a prepare request with number n, where n is greater than any of the prepare requests it has already responded to, then it responds with a promise not to accept any more proposals numbered less than n, and with the highest numbered proposal (if any) that it has accepted. (Acceptors therefore need to maintain as reliable state the highest numbered proposal they have accepted, and the high watermark value of the largest n it has responded to in a prepare request).

Accept phase

Pre-condition: a proposer has received promise responses to its prepare request numbered n from a majority of acceptors.


  1. The proposer sends an accept message for proposal (n,v), where v is the proposal value of the highest numbered accepted proposal amongst the promise responses, or any value the proposer chooses if no prior acceptances are returned.
  2. If an acceptor receives an accept message for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request with a value higher than n. (Several proposals may be circulating concurrently).


A less reliable model, but one that reduces communication, is to have one or more nominated ‘distinguished learners’ to which acceptors send their acceptance notifications, and these then broadcast to the rest of the learners.

Progress

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals. If the distinguished proposer can communicate successfully with a majority of acceptors, and if it uses a proposal with number greater than any already used, then it will succeed in issuing a proposal that is accepted. By abandoning a proposal and trying again if it learns about some request with a higher proposal number, the distinguished proposer will eventually choose a high enough proposal number

The Paxos algorithm assumes a network of processes. In its consensus algorithm, each process plays the role of proposer, acceptor, and learner. The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner.
Clients send commands to the leader, which is responsible for deciding the overall sequence of commands. The selected command is given the next available sequence number, and the leader runs an instance of the Paxos consensus protocol proposing this command as the next ‘value’ to be agreed upon by the participants. Most of the time this will succeed, and the state machines in each participant can advance. Each round of the paxos protocol involves a number of round-trip messages, and the leader does not have to wait for each round to complete before initiating the next one – so long as there are client commands coming in the leader can keep assigning new sequence numbers and initiating rounds. We therefore will often have multiple rounds of consensus algorithm operating in parallel – each reaching agreement on “the command at sequence number s is c“.
http://the-paper-trail.org/blog/consensus-protocols-paxos/
Every proposal is tagged with a unique sequence number that we assume can be generated by any proposer. These sequence numbers are used to totally order the proposals so that all the acceptors agree on which proposals came ‘before’ and ‘after’. When a proposal arrives, the acceptor checks to see what the highest numbered proposal that it has already received is. If the new proposal is ordered after the highest current proposal, the acceptor returns a promise that guarantees that the acceptor will not accept any more proposals that are ordered before the new proposal. If instead the new proposal is ordered before the highest current proposal the acceptors will reject it and return the sequence number of the current proposal. This allows the proposer to choose a large enough sequence number at the next time of asking, rather than having to guess.
This ordering is used so that no matter what order the messages containing the prepare requests arrive in, the acceptors can agree – without further communication – on which one to agree, tentatively, to accepting. This helps cope with one of the artifacts of an asynchronous system – the possibility of messages arriving in different orders at different hosts.
How can we ensure that all proposals are uniquely numbered? The easiest way is to have all proposers draw from disjoint sets of sequence numbers. In particular, one practical way is to construct a pair (seq. number, address) where the address value is the proposer’s unique network address. These pairs can be totally ordered and at the same time all proposers can ‘outbid’ all others if they choose a sufficiently large sequence number.

Legitimate proposals

Once the proposer receives responses to its prepare message from a majority of acceptors, it can go ahead and ask the acceptors to commit to a value it proposes. Again, this is very like 2PC, except again there are constraints on which values a proposer may legitimately propose. Remember that there are potentially many proposers proposing values at any one time. Consider the case where a proposer has committed his proposal to the smallest possible majority of acceptors, at which point a single acceptor fails. A majority of accept confirmation messages will not reach the proposer, and therefore the protocol will no terminate. A second proposer might then try to propose a value – which is accepted by a majority since the proposer orders its request after the first. The second proposer then commits its proposal, and a majority of acceptors respond. The second proposer considers the protocol completed. At this point, the failed acceptor can recover and send the final accept message to the original proposer, which then considers the protocol completed. If the first and second proposer both propose different values, correctness is violated. This is a problem.
This execution cannot be avoided in an asynchronous network. Therefore, the only way around is to somehow make sure that both proposers propose the same value. This avoids the complications above by ensuring that all committed values are the same at every acceptor – no matter which proposer proposed them. It’s easy to ensure that all proposed values are the same. When acceptors respond to a prepare request, they reply with the value of the highest numbered proposal that they have already accepted. The proposer is then bound only to ask that this value be committed. This way the protocol informs the proposer about other completed proposals, and forces it to commit their values, not the one it originally proposed

https://angus.nyc/2012/paxos-by-example/
Distributed consensus algorithms are used to enable a set of computers to agree on a single value, such as the commit or rollbackdecision typically made using a two- or three-phase commit. It doesn’t matter to the algorithm what this value is, as long as only a single value is ever chosen.
In distributed systems this is hard, because messages between machines can be lost or indefinitely delayed, or the machines themselves can fail.
proposer proposes a value that it wants agreement upon. It does this by sending a proposal containing a value to the set of all acceptors, which decide whether to accept the value. Each acceptor chooses a value independently — it may receive multiple proposals, each from a different proposer — and sends its decision to learners, which determine whether any value has been accepted. For a value to be accepted by Paxos, a majority of acceptors must choose the same valu

Paxos algorithm proposers send two types of messages to acceptors: prepare and accept requests. In the first stage of this algorithm a proposer sends a prepare request to each acceptor containing a proposed value, v, and a proposal number, n. Each proposer’s proposal number must be a positive, monotonically increasing, unique, natural number, with respect to other proposers’ proposal numbers



Figure 2: Paxos. Proposers A and B each send prepare requests to every acceptor. In this example proposer A’s request reaches acceptors X and Y first, and proposer B’s request reaches acceptor Z first.
If the acceptor receiving a prepare request has not seen another proposal, the acceptor responds with a prepare response which promises never to accept another proposal with a lower proposal number. This is illustrated in Figure 3 below, which shows the responses from each acceptor to the first prepare request they receive.


Figure 3: Paxos. Each acceptor responds to the first prepare request message that it receives.
Once a proposer has received prepare responses from a majority of acceptors it can issue an accept request. Since proposer A only received responses indicating that there were no previous proposals, it sends an accept request to every acceptor with the same proposal number and value as its initial proposal (n=2, v=8). However, these requests are ignored by every acceptor because they have all promised not to accept requests with a proposal number lower than (in response to the prepare request from proposer B).
Proposer B sends an accept request to each acceptor containing the proposal number it previously used (n=4) and the value associated with the highest proposal number among the prepare response messages it received (v=8)[3]. Note that this is not the value that proposer B initially proposed, but the highest value from the prepare response messages it saw.


Figure 5: Paxos. Proposer B sends an accept request to each acceptor, with its previous proposal number (4), and the value of the highest numbered proposal it has seen (8, from [n=2, v=8
If an acceptor receives an accept request for a higher or equal proposal number than it has already seen, it accepts and sends a notification to every learner node. A value is chosen by the Paxos algorithm when a learner discovers that a majority of acceptors have accepted a value, as is illustrated below:
Once a value has been chosen by Paxos, further communication with other proposers cannot change this value. If another proposer, proposer C, sends a prepare request with a higher proposal number than has previously been seen, and a different value (for example, n=6, v=7), each acceptor responds with the previous highest proposal (n=4, v=8). This requires proposer C to send an accept requestcontaining [n=6, v=8], which only confirms the value that has already been chosen. Furthermore, if some minority of acceptors have not yet chosen a value, this process ensures that they eventually reach consensus on the same value.
Various efficiency improvements to the standard Paxos algorithm are discussed in the papers by Lamport and Baker et al.. For example, a prepare request is not necessary if the proposer knows that it is the first to suggest a value. The proposal for such a request is numbered 0, so that it will be ignored if any higher numbered requests have been received.

Note that this is the highest proposal number that it received from prepare response messages. In this example, proposer B has a higher numbered proposal (n=4) than proposer A (n=2), but it has only received proposer A’s proposal in response to its prepare request. If no previous proposals were returned by the prepare response messages, proposer B would use its own proposal (n=4).
https://www.beyondthelines.net/algorithm/basic-paxos/
As we added more servers we still needed the clients to see the whole system as a single server. That means the behaviour of the distributed system should be the same (from the clients point of view) as if there was only a single server.




When a service is provided by multiple server they need to agree on the order in which the operations are processed in order to provide a consistent state across the system,


In this section we’re focusing on a single round of consensus. This is basic Paxos. We can then apply another round for every command to be processed by the system.

Paxos is the name of a greek island and the Paxos algorithm is inspired by the way the parliament was run on this island. Basically a law needed a majority of the votes to be adopted. The algorithm relies on this same idea that a proposed value must be accepted by a majority to be chosen.

  • The proposers: these are the ones who propose new values to agree on. Typically the servers who handle client requests.
  • The acceptors: these are the ones who vote or “accept” a value. They also keep track of the decision process.
  • The learners: these are the ones who want to know the “chosen” values.

The assumptions

In each round:
  • One or more servers proposes a value
  • The participants agree on a single value
  • Only one value may be chosen
  • Participants never learn a value unless it has been chosen
The following assumptions are also made on the system:
  • No byzantine failures: the servers behave correctly or fail completely. The algorithm doesn’t support malicious behaviour of the participants.
  • The system is eventually synchronous (most messages are successfully delivered and the number of lost messages is bounded).
Each round comprises 2 (or 3) phases:
  • a “prepare” phase: where the proposer checks that the acceptors are going to vote for his proposal.
  • an “accept” phase: where the acceptors vote for the proposal and “choose” a value.
  • and possibly a “learning” phase: where the learners are notified of the outcome of the round. (It doesn’t influence the way a value is chosen so let’s ignore this phase for now).
Each proposal has a unique number and higher numbers take priority over lower numbers.
To avoid number collision the proposal number should include some sort of server id. (e.g. <proposal #>.<server id>).
THE PROPOSER
The proposer must maintain and persist one value: the max proposal number. Its behaviour is defined as follow:
  1. Choose a new proposal number higher the proposer’s max proposal number.
  2. Send a “prepare” message to all the acceptors with the new proposal number.
  3. when responses received from the majority of the acceptors:
    • if it received a response with an accepted proposal number  the acceptor replaces the proposed value with the received value.
  4. Send “accept” message with the  proposal number and the value.
  5. When responses received from the majority of the acceptors:
    • If there is a rejection (response with a number higher than the current proposal number). The value is not “chosen” so the proposer updates its max proposal number and start again from 1.
    • If no rejection the value is “chosen”.
THE ACCEPTOR
The acceptor must maintain 3 values:
  • min proposal number
  • accepted proposal number
  • accepted value
Its behaviour is the counter-part of the proposer behaviour:
  1. When it receives a “prepare” message
    • If the received proposal number is greater than the acceptor’s min proposal value then update its min proposal
    • Then replies with a message containing the highest accepted proposal number and its associated  value (if any).
  2. When receives an “accept” message:
    • if the proposal number is greater or equal to the acceptor’s min proposal value, it updates its min proposal and its accepted proposal number with the associated value and then replies ok.
    • Then it replies with a message containing its min proposal value.
An interesting fact is that only the proposer knows that a value has been chosen. The acceptors only knows the value it has accepted but it doesn’t know wether the value is accepted by a majority. That’s why there is an additional “learning” phase where the chosen values gets dispatch to the “learners”.

http://www.ux.uis.no/~meling/papers/2013-paxostutorial-opodis.pdf

http://blog.csdn.net/dellme99/article/details/14162159
Paxos对这类问题的解决就是试图对各Server上的状态进行全局编号,如果能编号成功,那么所有操作都按照编号顺序执行,一致性就不言而喻。当Cluster中的Server都接收了一些数据,如何进行编号?就是表决,让所有的Server进行表决,看哪个Server上的哪个数据应该排第一,哪个排第二...,只要多数Server同意某个数据该排第几,那就排第几。
很显然,为了给每个数据唯一编号,每次表决只能产生一个数据,否则表决就没有任何意义。Paxos的算法的所有精力都放在如何在一次表决只产生一个数据。再进一步,我们称表决的数据叫Value,Paxos算法的核心和精华就是确保每次表决只产生一个Value。下文中的编号不是这里的编号,下文中的编号是为了只产生一个value的这次(其实是一轮或者多轮)表决中的不同次数提交议案(value)的流水号。

3.Paxos算法

我们对原文的概念加以补充:
  • promise:Acceptor对proposer承诺,如果没有更大编号的proposal会accept它提交的proposal
  • accept:Acceptor没有发现有比之前的proposal更大编号的proposal,就批准了该proposal
  • chosen:当Acceptor的多数派都accept一个proposal时,该proposal就被最终选择,也称为决议
也就是说,Acceptor对proposer有两个动作:promise和accept
下面的解释也主要围绕着”Only a single value is chosen,“,再看下条件P1,

P1:An acceptor must accept the first proposal that it receives.

P2:If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.


    参考转载的《Paxos算法在大型系统中常见的应用场景》,在chubby中paxos用于保持chubby cell内部所有主机操作序列的一致性,同时也用于选举出chubby cell中的master或者说是leader。

https://zh.wikipedia.org/wiki/Paxos%E7%AE%97%E6%B3%95


https://stackoverflow.com/questions/47967772/how-to-derive-a-sequence-number-in-paxos
In paxos, every proposer independently generates a sequence number for its proposal. So let's say a proposer keeps on generating higher numbered sequence number. Won't this proposer because he is generating higher sequence number nullify other proposers proposals? i.e. is there a possibility in paxos where one proposer always dominate?
That would be a problem if your proposer was acting badly. In practice the proposers follow a simple protocol for choosing sequencers.
For example, in one system I maintained a list of allowed proposers was replicated along with the data, so all proposers had a position p. A proposer would always choose its ith sequence number like this: seqno(i, p) = i * len(proposers) + p. Thus every proposer had a unique set of (interleaved) sequence numbers to choose from.

http://www.ux.uis.no/~meling/papers/PaxosTutorial-Meling-OPODIS2013.pdf
We Need to Order Client Requests!
Let’s Designate a Leader to Order Requests
Add Sequence Numbers
Discard Out-of-Order Messages


- GAE
  • Paxos - A consensus protocol where a group of independent nodes reach a majority consensus on a decision.
    - Protocol: there's a propose step and then an agree step. You only need a majority of nodes to agree to say something is persisted for it to be considered persisted.
    - Unlike 2PC it is fully distributed. There's no single master coordinator.
    - Multiple transactions can be run in parallel. There's less serialization.
    - Writes are high latency because of the 2 extra round coordination trips required in the protocol.
    - Wanted to do this, but the they didn't want to pay the 150msec latency hit to writes, especially when competing against 5msec writes for RDBMSes. 
    - They tried using physcially close datacenters but the built-in multi-datacenter overhead (routers, etc) was too high. Even in the same datacenter was too slow.
    - Paxos is still used a ton within Google. Especially for lock servers. For coordinating anything they do across datacenters. Especially when state is moved between datacenters. If your app is serving data in one datacenter and it should be moved to another that coordination is done through Paxos. It's used also in managing memcache and offline processing. 


  • Labels

    Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

    Popular Posts