https://raft.github.io/raft.pdf
https://raft.github.io/
It's equivalent to Paxos in fault-tolerance and performance. The difference is that it's decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems
Consensus typically arises in the context of replicated state machines, a general approach to building fault-tolerant systems. Each server has a state machine and a log. The state machine is the component that we want to make fault-tolerant, such as a hash table.
http://thesecretlivesofdata.com/raft/
http://container-solutions.com/raft-explained-part-1-the-consenus-problem/
With services having global coverage, one might want to reduce latency by having requests served by the replicated instance that is geographically closest.
Fundamentally, the goal of consensus is not to negotiate an optimal value of some kind, but just to reach collective agreement on some value that was previously proposed by one of the participating servers in that round of the consensus algorithm. With the help of consensus, the distributed system is made to act as though it were a single entity.
To this end, they propose the following techniques to improve understandability: Decomposition – the separation of functional components like leader election, log replication, and safety – and state space reduction – the reduction of possible states of the protocol to a minimal functional subset.
The needs of real-world systems cause the implemented algorithm to deviate so far from the theoretical description that the theoretical proof may no longer apply in practice. These issues are well described in Paxos Made Live – An Engineering Perspective by Chandra et al. (2007), where they describe their experiences implementing Paxos within Chubby (which Google uses within Google File System, BigTable, and MapReduce to synchronize access to shared resources).
Replicated State Machine Model
Raft does not apply the idea of consensus to agreeing to single values (like classic Paxos), but instead to the goal of finding agreement on the order with which operations are committed to a single replicated log.
Within Raft, the authors draw a clear distinction between the parts of such a replicated state machine: The state machine, the replicated log, and the consensus module.
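To make that separation concrete, here is a minimal sketch (my own illustration, not code from the paper; the kvStateMachine type and the "set key value" command format are made up) of the state machine piece – a simple hash table – applying committed log entries strictly in log order:

```go
package main

import "fmt"

// A log entry as seen by the state machine: an opaque command
// plus the term and index Raft assigned to it.
type Entry struct {
	Term    int
	Index   int
	Command string // e.g. "set x 1"
}

// kvStateMachine is the component we want to make fault-tolerant:
// here, a trivial key-value store (a hash table).
type kvStateMachine struct {
	data        map[string]string
	lastApplied int
}

// Apply feeds committed entries to the state machine strictly in
// log order; every replica that applies the same log ends up in
// the same state.
func (m *kvStateMachine) Apply(e Entry) {
	if e.Index != m.lastApplied+1 {
		panic("entries must be applied in order")
	}
	var key, val string
	fmt.Sscanf(e.Command, "set %s %s", &key, &val)
	m.data[key] = val
	m.lastApplied = e.Index
}

func main() {
	sm := &kvStateMachine{data: map[string]string{}}
	committed := []Entry{
		{Term: 1, Index: 1, Command: "set x 1"},
		{Term: 1, Index: 2, Command: "set y 2"},
	}
	for _, e := range committed {
		sm.Apply(e)
	}
	fmt.Println(sm.data) // map[x:1 y:2]
}
```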
http://www.zenlife.tk/raft.md
http://thesecretlivesofdata.com/raft/
Raft works by keeping a replicated log. This log is an append-only data structure where new entries are added, and only a single server, the leader, is responsible for managing this log. Every write request is sent to the leader node, and this node will distribute it to the follower nodes and make sure the client receives a confirmation for this write only once the data is safely stored. Let's get into the details.
The consensus problem is divided into three sub-problems: Leader election, Log replication and Safety.
Every node will always be in one of these three states: Leader, Follower or Candidate, and there should never be more than one leader at the same time. Time in Raft is divided into terms, which are basically arbitrary periods of time, identified by a number that is sequentially incremented.
A server always starts as a follower, and it expects a heartbeat from the leader. The follower will wait for this heartbeat for some time (defined as the election timeout), and if it does not receive it, it will assume the leader is dead and transition to the Candidate state. After it goes to this state, the first thing it will do is vote for itself, and then send a vote request to all the other nodes (this request is an RPC called RequestVote). If it receives a confirmation for this request from the majority of the nodes in the cluster (e.g. 3 out of 5), it transitions to the Leader state.
If several followers time out at the same moment, the vote can be split so that no candidate reaches a majority. Raft mitigates this issue by using a randomized election timeout for each node, meaning one of the followers will usually time out before the others, likely becoming the new leader.
When this timeout happens and the term doesn't have a leader, a new term will be initiated, and each node will get a new random timeout value for the next election, which is probably not the same again. We pay a performance penalty because of that, but this timeout is usually just a few milliseconds, and a split vote situation should be quite rare.
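A rough sketch of the follower's timer loop under these assumptions (the 150–300 ms range is the one suggested in the Raft paper; the channel-based "leader" is just a stand-in for real heartbeat RPCs):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh timeout in a fixed range so that
// followers rarely time out at exactly the same moment, making split
// votes unlikely. Real systems tune the range to their network.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

func main() {
	heartbeats := make(chan struct{})

	// Simulated leader: sends a few heartbeats, then "crashes".
	go func() {
		for i := 0; i < 3; i++ {
			time.Sleep(50 * time.Millisecond)
			heartbeats <- struct{}{}
		}
	}()

	// Follower loop: every heartbeat resets the timer; if the timer
	// fires first, the follower becomes a candidate.
	timer := time.NewTimer(randomElectionTimeout())
	for {
		select {
		case <-heartbeats:
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(randomElectionTimeout())
			fmt.Println("heartbeat received, staying follower")
		case <-timer.C:
			fmt.Println("election timeout elapsed, becoming candidate")
			return
		}
	}
}
```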
When the leader receives a request, it first appends it to its log, and then sends a request to every follower so they can do the same thing. This RPC is called AppendEntries. Although the message was appended to the log, it is not committed yet, and the client hasn't received a confirmation that the operation succeeded. Only after the leader gets a confirmation from the majority of the nodes can it actually commit the message, knowing it's safely stored, and then respond to the client. When the followers receive the next heartbeat message (which is just an empty AppendEntries RPC) they know they can also commit this message.
Other than the command sent by the client, each log entry also has a term number and an index. The term just defines a unit of time (and, remember, each term has no more than one leader), and the index is its position in the log.
Raft maintains the Log Matching Property, which says that if two distinct log entries have the same term number and the same index, then they will:
- Store the exact same command;
- Be identical in all the preceding entries.
The second property, guaranteeing that all the preceding entries are identical, is achieved by a consistency check that the followers perform when they receive an AppendEntries RPC.
It works like this: for each follower, the leader includes in every AppendEntries RPC (even heartbeats) the index and term of the entry that immediately precedes the new entries. If the follower does not find a matching entry at that index and term in its local log, it will reject the request, so if the AppendEntries RPC returns successfully, the leader knows that the follower's log is identical to its own up to that point.
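A sketch of the follower-side check, assuming the paper's prevLogIndex/prevLogTerm fields (the handleAppendEntries function and its exact shape are my own simplification):

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

// AppendEntriesArgs carries, besides the new entries, the index and
// term of the entry that immediately precedes them in the leader's
// log (field names follow the Raft paper's RPC description).
type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int // index of the entry preceding the new ones (1-based, 0 = none)
	PrevLogTerm  int
	Entries      []LogEntry
}

// handleAppendEntries is the follower-side consistency check: the
// request is rejected unless the follower's log contains an entry
// at PrevLogIndex whose term matches PrevLogTerm.
func handleAppendEntries(log []LogEntry, args AppendEntriesArgs) ([]LogEntry, bool) {
	if args.PrevLogIndex > 0 {
		if len(log) < args.PrevLogIndex ||
			log[args.PrevLogIndex-1].Term != args.PrevLogTerm {
			return log, false // logs diverge: tell the leader to back up
		}
	}
	// Truncate anything past the match point and append the leader's
	// entries, so the follower's log mirrors the leader's.
	log = append(log[:args.PrevLogIndex], args.Entries...)
	return log, true
}

func main() {
	follower := []LogEntry{{Term: 1, Command: "x"}} // has index 1 only

	// Leader tries to append index 3; the preceding entry (index 2, term 1)
	// is missing on the follower, so the check fails.
	_, ok := handleAppendEntries(follower, AppendEntriesArgs{
		Term: 2, PrevLogIndex: 2, PrevLogTerm: 1,
		Entries: []LogEntry{{Term: 2, Command: "z"}},
	})
	fmt.Println("append at index 3 accepted?", ok) // false

	// After backing up, the leader resends starting from index 2.
	follower, ok = handleAppendEntries(follower, AppendEntriesArgs{
		Term: 2, PrevLogIndex: 1, PrevLogTerm: 1,
		Entries: []LogEntry{{Term: 1, Command: "y"}, {Term: 2, Command: "z"}},
	})
	fmt.Println("resend from index 2 accepted?", ok, "log length:", len(follower))
}
```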
When the nodes are operating normally, these logs will always be consistent. When a leader crashes, though, the logs can be left inconsistent, and that's when AppendEntries' consistency check will help us. Imagine this scenario:
- We have three nodes, N1, N2 and N3, N1 being the leader;
- N1 replicates the entries term=1; index=1; command=x and term=1; index=2; command=y to N2, but N3 never gets these messages;
- Now N1 crashes and N2 becomes the new leader;
- If N2 tries to replicate the entry term=2; index=3; command=z to N3, the consistency check will reject this message, as the entry preceding index 3 is not present in N3's log;
- N2 will then go back in the log and transmit all the entries after the latest entry present in N3, making the logs consistent again.
Election Restriction
This property guarantees that a candidate will never win the leader election if it does not have all the committed entries in its own log. As an entry needs to be present in the majority of the nodes to be considered committed, when an election is taking place at least one node will have the latest committed entry. If a follower node receives a RequestVote RPC from a candidate that is behind in the log (meaning a smaller term number, or the same term number but a smaller index), it will not grant its vote to this candidate.
In summary:
- Raft is divided into 3 parts: leader election, log replication and safety;
- A node can be in one of these three states: Follower, Candidate or Leader;
- Every node starts as a Follower, and after an election timeout transitions to the Candidate state;
- A Candidate will vote for itself and send RequestVote RPCs to all the other nodes;
- If it gets votes from the majority of the nodes, it becomes the new Leader;
- The leader is the only node responsible for managing the log; followers just add new entries to their logs in response to the leader's AppendEntries RPC;
- When the leader receives a command from the client, it first saves this uncommitted entry, then sends it to every follower;
- When it gets a successful response from the majority of nodes, the command is committed and the client gets a confirmation;
- In the next AppendEntries RPC sent to the followers (which can carry a new entry or just be a heartbeat), the followers also commit the message;
- The AppendEntries RPC implements a consistency check, to guarantee the follower's local log is consistent with the leader's;
- A follower will only grant its vote to a candidate whose log is at least as up to date as its own.
http://container-solutions.com/raft-explained-part-1-the-consenus-problem/
The acronyms under usage patterns stand for server replication (SR), log replication (LR), synchronisation service (SS), barrier orchestration (BO), service discovery (SD), leader election (LE), metadata management (MM), and Message Queues (Q).
The consensus system of Chubby is based on Multi-Paxos (a variation on Paxos that decides on more than a single value), Zookeeper is based on Zab (a protocol similar but not the same as Paxos), and etcd is built on top of Raft – the protocol about which this blog post speaks.
Another two good examples of the Raft protocol in practice are Consul and Nomad from HashiCorp. Consul is a "consensus system" in the language of the above table and is widely used for service discovery in distributed systems. Nomad – HashiCorp's orchestration platform – uses Raft for state machine replication and integrates with Consul for service discovery of scheduled workloads. A comparison of Consul with some of the other consensus systems in the above table can be found here: https://www.consul.io/intro/vs/zookeeper.html.
The clients interact with the replicated state machine through the consensus module, which – once a command has been committed to the log, that is, agreed upon by consensus – guarantees that eventually the command will be applied in the same order (the one specified in the leader's log) on every live state machine.
Raft integrates consensus with the very common use case of the replicated state machine, giving the advantage that any guarantees proven theoretically will hold in the implementations, if the specification is closely followed.
A fundamental difference between Raft and Paxos as well as Viewstamped Replication, two of the main influences on the protocol, is that Raft implements strong leadership. Contrary to other protocols, which may defer leader election to an oracle (a PhD way of saying magic black box), Raft integrates leader election as an essential part of the consensus protocol. Once a leader has been elected, all decision-making within the protocol is driven only by the leader. Only one leader can exist at a time and log entries can only flow from the leader to the followers.
Strong leadership extends classical leader-driven consensus by adding the following constraints:
- All message passing can only be initialized by the leader or a server attempting to become leader. This is enforced in the protocol's specification through the modeling of all communications as RPCs, implicitly encoding a server's role as either active or passive.
- The leader takes care of all communication with clients of the application and needs to be contacted directly by them.
- The system is only available when a leader has been elected and is alive. Otherwise, a new leader will be elected and the system will remain unavailable for the duration of the vote.
To mitigate problems with clock synchronization in asynchronous systems, where the messages – through which clock synchronization may be negotiated – can have arbitrary delays, Raft uses a logical clock in the form of terms. Logical time uses the insight that no exact notion of time is needed to keep track of causality in a system. Each server has its own local view of time that is represented by its currentTerm. This currentTerm number increases monotonically over time, meaning that it can only go up.
Every communication in Raft includes an exchange and comparison of currentTerms. A term is only updated when a server (re-)starts an election or when the currentTerm of the party that a server communicates with is higher than its own, in which case the term gets updated with that higher value. Any communication attempt from a server with an outdated term is rejected, and when a candidate or leader learns of a higher term than its own, it immediately returns to being a follower.
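A compact sketch of these term rules (the onRPCTerm helper is made up; real implementations fold this logic into every RPC handler):

```go
package main

import "fmt"

type role int

const (
	follower role = iota
	candidate
	leader
)

type server struct {
	currentTerm int
	role        role
}

// onRPCTerm applies Raft's term rules to an incoming message:
// a higher term makes us adopt it and step down to follower;
// a lower term means the message is stale and gets rejected.
func (s *server) onRPCTerm(msgTerm int) (accept bool) {
	switch {
	case msgTerm > s.currentTerm:
		s.currentTerm = msgTerm
		s.role = follower
		return true
	case msgTerm < s.currentTerm:
		return false // reject the stale sender
	default:
		return true
	}
}

func main() {
	s := &server{currentTerm: 5, role: leader}

	fmt.Println(s.onRPCTerm(4)) // false: stale message from an old term
	fmt.Println(s.onRPCTerm(7)) // true: learn of term 7 and revert to follower
	fmt.Println(s.role == follower, s.currentTerm)
}
```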
Roles
The client of the application makes requests only to and gets responses only from the leader server. This means that the replicated state machine service can only be available when a leader has been successfully elected and is alive.
Each server participating in the protocol can only take one of the following roles at a time:
Follower: Followers only respond to RPCs, but do not initiate any communication.
Candidate: Candidates start a new election, incrementing the term, requesting votes, and voting for themselves. Depending on the outcome of the election, a candidate becomes leader, reverts to follower (if it is outvoted or receives an RPC from a valid leader), or restarts these steps (within a new term). Only a candidate with a log that contains all committed commands can become leader.
Leader: The leader sends heartbeats (empty AppendEntries RPCs) to all followers, thereby preventing timeouts in idle periods. For every command from the client, it appends the command to its local log and starts replicating that log entry; once the entry is replicated on at least a majority of the servers, it commits, applies the committed entry to its own state machine, and then returns the result to the client. If logIndex is higher than the nextIndex of a follower, the leader appends all log entries at the follower using RPCs, starting from that follower's nextIndex.
Followers and candidates use a randomized time-out; when it elapses without word from a leader, they assume that the leader has crashed and convert to candidates, triggering a new election and incrementing the current term.
Log and State Machine Replication
Once a server is established as leader, it becomes available to serve the clients' requests. These requests are commands that are to be committed to the replicated state machine. For every received command, the leader assigns a term and index to the command, which gives it a unique identifier within the servers' logs, and appends the command to its own log.
Commitment
If the leader is then able to replicate the command (through the AppendEntries RPC) across the logs of a strict majority of servers, the command is committed, applied to the leader's state machine, and the result of the operation returned to the client. Once a command is safe to be applied to all replicated state machines – that is, when a leader replicates a command from its current term to a majority of servers – it as well as, implicitly, all of the leader's preceding entries are considered committed.
The unidirectional flow of information from the leader to the followers, and therefore the guarantee of identical ordering across all replicated logs on any participating server that is alive, leads to eventually consistent state across all replicated state machines: if a message gets delayed or lost, or a server is temporarily down and later comes back up, the follower will catch back up once it receives the next RPC from the leader. Once the leader becomes aware of the current state of that server through the RPC, it appends all missing commands, starting from the next expected index up to the leader's current log index, the latter being the last appended position in the leader's log.
For every AppendEntries RPC performed on the follower, it performs a consistency check and accepts the new entry only if the logs match in their previous entry. This creates an inductive property: if a follower accepts a new entry from the leader, it has checked the consistency of the leader's previous entry against its own, which must itself have been accepted because of the consistency of the entry before it, and so on. Because of this inductive property, the replicated log is guaranteed to match the leader's log up to the last accepted entry.
Raft assumes that the leader’s log is always correct. The cleanup of inconsistent replicated state happens during normal operation and the protocol is designed in such a way that normal operation (at least half of the servers alive) converges all the logs.
The combination of leader election and log replication commitment rules provide Raft’s safety guarantees: Correctness and availability of the system remains guaranteed as long as a majority of the servers remain up.
Safety can be informally described as nothing bad happens and liveness as something good eventually happens.
Safety: For each term, every server gives out one and only one vote and, consequently, two different candidates can never accumulate majorities within the same term. This vote needs to be reliably persisted on disk, in order to account for the possibility of servers failing and recovering within the same term.
Liveness: Theoretically, competing candidates could cause repeated split votes. Raft mitigates this by having each participating server individually choose a new random timeout within a given interval. This usually leads to a situation where only one server wakes up first and can win the election while every other server is still asleep. This works best if the lower bound of the chosen interval is considerably larger than the broadcast time.
Picking the Best Leader
Safety: In leader elections, candidates include the lastTerm and lastIndex of their last log entry and, based on this information, other servers deny their vote if their own log is more complete. This is the case if the voting server has seen a higher term, or the same term but with a higher lastIndex (and therefore a longer log).
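A sketch of that comparison, assuming the vote is decided purely on (lastTerm, lastIndex); the function names are mine:

```go
package main

import "fmt"

// moreComplete reports whether log A (described by the term and index
// of its last entry) is more complete than log B: a higher last term
// wins; with equal terms, the longer log wins.
func moreComplete(aLastTerm, aLastIndex, bLastTerm, bLastIndex int) bool {
	if aLastTerm != bLastTerm {
		return aLastTerm > bLastTerm
	}
	return aLastIndex > bLastIndex
}

// grantVote is the voter-side check: deny the vote if our own log is
// more complete than the candidate's.
func grantVote(myLastTerm, myLastIndex, candLastTerm, candLastIndex int) bool {
	return !moreComplete(myLastTerm, myLastIndex, candLastTerm, candLastIndex)
}

func main() {
	// Voter's last entry: term 2, index 5. Candidate's: term 2, index 4.
	fmt.Println(grantVote(2, 5, 2, 4)) // false: candidate's log is behind
	// Candidate has seen a higher term in its log: vote granted.
	fmt.Println(grantVote(2, 5, 3, 3)) // true
}
```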
Making Logs Consistent
Safety: If an AppendEntries RPC to a follower fails because there are missing entries, then for every failed call the nextIndex of that follower will be decremented and the call tried again, until eventually the point where the two logs match is reached and the follower's log is filled up to mirror the leader's log.
In case an AppendEntries RPC to a follower returns that there is already an entry at that point in the follower's log, this entry will be considered extraneous – as the leader's log is considered authoritative – and be overwritten. This also invalidates all subsequent entries on the follower, as all entries following an extraneous entry are also extraneous.
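A sketch of the leader-side backoff, under the simplifying assumption that the leader can test the follower's log locally (in a real system each probe is a separate AppendEntries RPC):

```go
package main

import "fmt"

type entry struct{ Term, Index int }

// accepts reports whether the follower's log contains the entry that
// immediately precedes position nextIndex in the leader's log.
func accepts(followerLog []entry, prev entry) bool {
	if prev.Index == 0 {
		return true // nothing precedes the very first entry
	}
	return len(followerLog) >= prev.Index &&
		followerLog[prev.Index-1].Term == prev.Term
}

func main() {
	leaderLog := []entry{{1, 1}, {1, 2}, {2, 3}, {3, 4}}
	followerLog := []entry{{1, 1}, {2, 2}} // diverges at index 2 (extraneous entry)

	// The leader starts with nextIndex just past its own log and walks
	// backwards on every rejection until the consistency check passes.
	nextIndex := len(leaderLog) + 1
	for {
		prev := entry{Index: nextIndex - 1}
		if prev.Index > 0 {
			prev = leaderLog[prev.Index-1]
		}
		if accepts(followerLog, prev) {
			break
		}
		nextIndex--
	}

	// Everything from nextIndex onwards is overwritten with the leader's
	// entries; the extraneous follower entry at index 2 is discarded.
	followerLog = append(followerLog[:nextIndex-1], leaderLog[nextIndex-1:]...)
	fmt.Println("caught up at nextIndex", nextIndex, "follower log:", followerLog)
}
```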
Neutralizing Old Leaders
Safety: A leader might be separated from a majority of other servers by a partition and during this time, these would then elect a new leader. With a new leader in place, the protocol will continue on that side of the partition, but once the deposed leader becomes reconnected to the rest of the servers, the protocol needs to take care of the situation where two servers think they are leader. The stale leader would behave according to its perceived role and try to replicate logs, talk to the client, record logs, or commit log entries. This would be unsafe and needs to be dealt with by the protocol.
Terms are the mechanism by which the protocol takes care of stale leaders. All RPCs always include the term of the sender, and comparing the sender's term with the receiver's leads to the more out-of-date of the two being updated. Additionally, if the sender's term is older, the RPC gets rejected and the sender reverts to follower; if the receiver's term is older, the receiver reverts to follower and then processes the RPC normally.
The key to understanding safety in regard to any situation with stale leaders or candidates is to become aware that any election necessarily updates the terms within a majority of servers. Therefore, once a leader is stale, it is impossible within the mechanics of the protocol for it to commit any new log entries.
Client Relations
Clients of the Raft protocol interact with it through the leader. Only once a command has been logged, committed, and executed on the state machine of the leader will it respond.
Availability: If the leader is unknown or the request times out (e.g. leader crash), the client may contact any of the servers, which will then redirect to the (new) leader, where the request will then be retried. This guarantees that the command will eventually be executed if a majority of the servers are still alive.
Safety: Yet, in order to make sure that commands are not duplicated, there is an additional property: Every command is paired with a unique command id, which assures that the leader is aware of the situation and that execution of the command as well as the response happen only once for every unique command.
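A sketch of such de-duplication, assuming a bare command id and an in-memory table of results (real systems typically also track client sessions and expire old entries; names here are made up):

```go
package main

import "fmt"

// dedupStateMachine remembers the outcome of every command id that has
// already been executed, so a retried command returns the cached result
// instead of running twice.
type dedupStateMachine struct {
	counter int
	applied map[string]int // command id -> result of the first execution
}

// Apply executes "increment" commands exactly once per unique id.
func (s *dedupStateMachine) Apply(commandID string) int {
	if result, seen := s.applied[commandID]; seen {
		return result // duplicate: the client retried after a lost response
	}
	s.counter++
	s.applied[commandID] = s.counter
	return s.counter
}

func main() {
	sm := &dedupStateMachine{applied: map[string]int{}}
	fmt.Println(sm.Apply("req-1")) // 1
	fmt.Println(sm.Apply("req-1")) // 1 again: the retry is not re-executed
	fmt.Println(sm.Apply("req-2")) // 2
}
```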
In theoretical computer science, there comes great benefit from reducing problems to their abstract core.
Understanding Paxos is made unnecessarily hard by a state space that is larger than needed to provide its guarantees. Specifically, when used as Multi-Paxos within a replicated state machine, functional segregation of the intertwined parts of the protocol is not given with the same rigor and formal specification as with Raft. Implementation of Paxos is hard, as the protocol is specified in a way that is detached from real-world implementation issues and use cases: the original guarantees and proofs given in theory may not hold anymore in practice. This leads to Paxos implementations needing extensive proofs and verification of their own, detaching them further from the original theoretical results.
In my opinion, the Raft paper is exceptional in that it combines insights into the nature of real-world use-cases and the power of abstract reasoning and combines the two in a way that is not only understandable and tractable while giving hard theoretical guarantees, but also easily reproducible in real-world systems.
It's currently used in several large scale systems, like Consul, etcd and InfluxDB, so it's pretty mature and battle-tested.
http://highscalability.com/blog/2013/8/7/raft-in-search-of-an-understandable-consensus-algorithm.html
- There was a discussion of a lack of clarity in using AppendEntries to mean both heartbeat and AppendEntries. I agree this is unclear in practice. While non-heartbeat messages can indicate a heartbeat, what do you do when there are no messages being sent? Sending an AppendEntries when nothing is being appended is trickiness without a real gain.
- RAFT made simplifying assumptions in having a leader, having a log, and how leader election occurs. These simplifications are fine most of the time during operation, but by simplifying they can write down a concrete explanation that people can understand.
- They built the system. Explaining without building leads to a lot of hand waving.
Raft 为什么是更易理解的分布式一致性算法 (Why Raft Is an Easier-to-Understand Distributed Consensus Algorithm)
Understandability from Different Perspectives
Two players take turns laying black and white Go stones flat on a round table, one stone per move; stones may not overlap, and whoever first runs out of space to place a stone loses. How should you play in order to win?
This question has two layers: first, is there a way of playing that guarantees a win? Second, if there is, how do you prove it?
The figure in the original post answers the question: the first player always wins. Three different ways of thinking are used here.
- Suppose the table were only as big as a single Go stone.
- Suppose the table were infinitely large: the first player occupies the center, and since a circle is symmetric, as long as the opponent can still find a spot to play, you can always find a spot on the mirror-image side.
- Within a circle one can draw an odd number of smaller circles of equal diameter that are mutually tangent.
The three ways of thinking get progressively harder to understand. The first is extreme simplification, but it is not mathematically rigorous. The second is limit thinking; combined with the first it amounts to mathematical induction, which is rigorous. The third is visual, geometric thinking, which is hard to follow for anyone without a basic background in geometry.
In a cluster organized by the Raft protocol there are three kinds of roles:
- Leader
- Follower
- Candidate
The Leader Election Process
In the spirit of extreme simplification, a minimal Raft cluster needs three participants (say A, B and C) so that a majority vote is possible. In the initial state A, B and C are all Followers; then an election is started, and there are three possible outcomes. In the first two a Leader is elected; the third means this round of voting was fruitless (split votes): every side voted for itself and nobody obtained a majority. Each participant then rests for a random period (the election timeout) and starts a new vote, until one side wins a majority. The key is the random timeout: the first node to recover from its timeout requests votes from the other two, which are still waiting out their own timeouts, so they can only vote for it, and agreement is reached quickly.
Once a Leader is elected, it maintains its authority by periodically sending heartbeat messages to all Followers. If a Follower does not receive a heartbeat from the Leader for some time, it assumes the Leader may have died and starts a new election.
The Impact of the Leader Node on Consistency
The Raft protocol relies heavily on the availability of the Leader node to keep the cluster's data consistent. Data can only flow from the Leader node to the Follower nodes. When a Client submits data to the cluster's Leader, the data the Leader receives is first in the uncommitted state. The Leader then replicates the data to all Follower nodes concurrently and waits for their responses, making sure that more than half of the nodes in the cluster have received the data before it acknowledges the write to the Client. Once the Ack has been sent to the Client, the data enters the committed state, and the Leader then notifies the Followers that the entry has been committed.
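A minimal sketch of the majority rule described here (pure counting, no RPCs; the function name is made up):

```go
package main

import "fmt"

// commitWhenMajorityAcked mimics the leader's bookkeeping: an entry is
// acknowledged to the client only once it is stored on more than half
// of the cluster (the leader itself counts as one copy).
func commitWhenMajorityAcked(clusterSize int, followerAcks []bool) bool {
	replicas := 1 // the leader's own log
	for _, acked := range followerAcks {
		if acked {
			replicas++
		}
	}
	return replicas > clusterSize/2
}

func main() {
	// 5-node cluster: leader + 4 followers, 2 followers have answered.
	fmt.Println(commitWhenMajorityAcked(5, []bool{true, true, false, false})) // true (3 of 5)
	// Only 1 follower answered: 2 of 5 copies is not a majority yet.
	fmt.Println(commitWhenMajorityAcked(5, []bool{true, false, false, false})) // false
}
```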
During this process the leader may crash at any stage; let's look at how the Raft protocol guarantees data consistency at each stage.
1. Before the data reaches the Leader node
If the Leader crashes at this stage, consistency is not affected; nothing more to say.
2. The data reaches the Leader node but has not been replicated to the Follower nodes
If the Leader crashes at this stage, the data is uncommitted. The Client receives no Ack, treats the request as a timeout failure and can safely retry. The Follower nodes do not have this data, so after a new leader is elected the Client's retried submission can succeed. When the original Leader recovers, it rejoins the cluster as a Follower and re-synchronizes its data from the new Leader of the current term, forcibly keeping it consistent with the Leader.
3. The data reaches the Leader node and is successfully replicated to all Follower nodes, but they have not yet acknowledged it to the Leader
If the Leader crashes at this stage, the data on the Follower nodes is uncommitted but consistent. After a new Leader is elected, the commit can be completed. The Client, not knowing whether the submission succeeded, may retry; for this case Raft requires the RPC requests to be idempotent, i.e. an internal de-duplication mechanism must be implemented.
4. The data reaches the Leader node and is successfully replicated to some of the Follower nodes, but they have not yet acknowledged it to the Leader
If the Leader crashes at this stage, the data on the Follower nodes is uncommitted and inconsistent. The Raft protocol requires that votes only be granted to the nodes holding the most up-to-date data, so a node with the latest data will be elected Leader and will then force its data onto the Followers; the data is not lost and eventually becomes consistent.
5. The data reaches the Leader node and is successfully replicated to all or a majority of the Follower nodes; the data is committed on the Leader but uncommitted on the Followers
If the Leader crashes at this stage, the handling after a new Leader is elected is the same as in stage 3.
6. The data reaches the Leader node and is successfully replicated to all or a majority of the Follower nodes; the data is committed on every node, but the Client has not yet been answered
If the Leader crashes at this stage, the data inside the cluster is in fact already consistent; the Client's retries, handled by the idempotency strategy, have no effect on consistency.
7. A network partition causes split brain and two Leaders appear
The network partition separates the original Leader from some of the Follower nodes; the Followers that no longer receive the Leader's heartbeats start an election and produce a new Leader, so there are now two Leaders. The original Leader is alone in its partition: data submitted to it can never be replicated to a majority of nodes, so it can never be committed. Data submitted to the new Leader can be committed successfully. Once the network recovers, the old Leader sees that the cluster has a new Leader with a newer term, automatically steps down to Follower, and synchronizes data from the new Leader, restoring consistency across the cluster.
Summing up, this exhaustively analyzes all the situations a minimal (3-node) cluster can face; the Raft protocol handles every one of these consistency problems well, and it is easy to understand.
A cluster typically has 5 machines. At any time each server is in one of 3 states: leader, follower, or candidate. Under normal conditions there is only one leader.
Leader Election
The leader periodically sends heartbeats to all followers. As long as a follower keeps receiving messages from the leader, it stays a follower. If a follower stops receiving messages for a while, it starts a new election.
It first increments its current term, then votes for itself and sends vote requests to the other nodes, until one of three things happens:
- it wins the election;
- another server becomes leader;
- a period of time passes with no winner.
During the election, a candidate may receive a write request from another node claiming to be leader. If that leader's term is not smaller than the candidate's current term, the candidate recognizes it as a legitimate leader and returns to the follower state; otherwise it rejects the request.
If two candidates receive the same number of votes, neither can obtain more than half of the votes. This situation lasts until a timeout, and then a new round of election begins.
Randomized election timeouts are used, so this situation is unlikely to occur (see the sketch below).
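A sketch of a single voting round under these rules (the log up-to-date check is left out here; it is discussed under Safety below; the types and names are mine):

```go
package main

import "fmt"

type voteRequest struct {
	Term      int
	Candidate string
}

type voter struct {
	currentTerm int
	votedFor    string // whom we voted for in currentTerm ("" = nobody yet)
}

// handleVote grants at most one vote per term.
func (v *voter) handleVote(req voteRequest) bool {
	if req.Term > v.currentTerm {
		v.currentTerm, v.votedFor = req.Term, ""
	}
	if req.Term < v.currentTerm || (v.votedFor != "" && v.votedFor != req.Candidate) {
		return false
	}
	v.votedFor = req.Candidate
	return true
}

func main() {
	peers := []*voter{{currentTerm: 1}, {currentTerm: 1}, {currentTerm: 1}, {currentTerm: 1}}

	// Candidate: bump the term, vote for itself, then ask the 4 peers.
	term, votes := 2, 1
	for _, p := range peers {
		if p.handleVote(voteRequest{Term: term, Candidate: "S1"}) {
			votes++
		}
	}
	fmt.Printf("term %d: %d of 5 votes -> leader? %v\n", term, votes, votes > 5/2)
}
```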
Log Replication
After receiving a write request from a client, the leader first writes its own log, then sends the entry to all servers; once it has confirmed that the record has been safely replicated, it responds to the client.
Each log record stores the command along with a term number; the term number is used to detect inconsistencies between logs.
Every committed record is durable and eventually consistent. A record is committed when the log entry has been successfully replicated to a majority of the servers.
Conflict resolution: the leader handles inconsistencies by forcing the followers to replicate its own log.
Safety
For example, a follower might be unavailable for a while, during which the leader commits several log entries; if that follower were then elected leader, it would overwrite committed records.
So which servers may be elected leader must be restricted. The voting process is used to prevent a candidate from winning an election unless its log contains all committed records.
Since a candidate must obtain more than half of the votes, its log must be at least as up to date as those of a majority of the servers, which means it holds all committed records. The vote request carries this restriction: it includes information about the candidate's log, and a voter whose own log is newer rejects the request.
If a follower or candidate crashes, requests sent to it fail and Raft retries them indefinitely; once it recovers, it will receive the outstanding requests. If a server completes a request but crashes before replying, it will receive the same request again after it restarts.
http://www.zenlife.tk/raft-fault-tolerance.md
What is the biggest difference between building a distributed system and building a single-machine one? Handling failures! Without failures there would be nothing hard about it, but in a distributed environment failure is the norm. The result of an operation can be in one of three states: success, failure, or timeout, and on a timeout you cannot tell whether the operation actually succeeded or failed. Below, Raft is used to walk through the various failure scenarios.
Suppose we have one master and several slaves. After a write reaches the master, the master sends the operation to every slave and only replies to the client once all slaves have written successfully; this guarantees strong consistency. What if a slave dies? We could stop serving and let the system go down, but that obviously hurts availability – the conflict between consistency and availability in CAP. For highly available writes, we relax the constraint a bit: a write succeeds once it reaches W replicas out of N replicas total. A read must then read R replicas with R + W > N, so the replicas read are guaranteed to include the latest version, and the client picks among the versions.
Suppose instead the master first tells the client the write succeeded and only then asynchronously syncs to the slaves. Looking at reads: if reads are only allowed from the master, the slaves are just a disaster-recovery measure and the scheme is strongly consistent. If we relax that so that a given user always reads from one fixed replica, we get monotonic consistency: a user never reads a version older than the one it read last. Relax it further so that a user only reads from a fixed replica within a given session, and we get session consistency.
If we relax the constraint even further, a read may go to any replica, e.g. the master this time and a slave next time. This can return data older than what was read before. For example, A's value is updated from 5 to 6; the master has applied the update but has not yet synced it asynchronously to the slaves. The client reads 6 from the master; on its next read it picks a slave node and reads 5. This level is eventual consistency.
Read-your-writes: if a system guarantees that the client that wrote a piece of data will, on an immediate read, see its own latest write, while other clients may not see the latest data, that is yet another level of consistency.
Write-write conflicts: if multiple masters are allowed, write-write conflicts can occur and are messy to handle. Here we only discuss strong consistency, which is what Raft guarantees.
Raft has leader and follower roles, and both reads and writes can only go through the leader, which greatly simplifies the protocol. The first step is to elect a leader; the system then enters the log replication phase. To be elected leader, a node needs more than half of the votes.
What if a node dies? Raft does not need to write all replicas successfully, only more than half. The leader replies to the client only after it has recorded the log entry, committed it, and applied it to the state machine. The full flow is: the leader sends the log entry to all followers, commits once it receives responses from a majority, and returns to the client after committing. Now let's look at some failures.
The leader gets responses from a majority but has not yet committed, and then it crashes. The followers now hold dirty data. If a follower is then elected leader, how should that data be handled?
Dirty data comes in several flavors: entries may be missing, extra, or wrong. Take missing first: a follower is missing some log entries; if it were allowed to become leader, previously committed data would effectively be lost, which cannot be tolerated.
To avoid this problem, Raft adds a restriction to the election: a node only has a chance to become leader if its log is at least as up to date as the logs of a majority of the other nodes. This condition is in fact a sufficient (though not necessary) condition for "only a node that holds all committed log entries can be elected leader"; that is, the elected leader is guaranteed to contain the complete committed log. This ensures committed data is never lost.
This restriction is implemented as follows: when it receives a RequestVote message, a follower will not grant its vote if it finds that its own log is more complete than the sender's. So for the candidate that does become leader: since it collected a majority of votes, its log must be more complete than – or at least as new as – that of a majority of the nodes, and therefore it must contain all committed log entries.
As for the nodes that did not become leader, their dirty log entries are deleted and their missing entries are filled in. The leader maintains a nextIndex for each follower; if the consistency check fails, it decrements nextIndex and tries again.
The leader gets responses from a majority, commits, and replies to the client, but crashes before telling the followers that the commit succeeded. Did this commit succeed?
The commit did succeed, but this is realized through an indirect rule: if a record is committed, then all records before it must also be committed.
By the election rule above, the newly elected leader is guaranteed to contain the complete committed log, and its term number is necessarily greater than the previous term. So once a new log entry is committed, the earlier entries are indirectly committed along with it.
Why is it designed this way? We have to understand the following question: what is the relationship between "a committed log entry" and "an entry that exists on a majority of machines"? A committed entry necessarily exists on a majority of machines, because the leader only commits after it has successfully replicated the entry to a majority of followers. But does an entry existing on a majority of machines mean it is committed? No! The leader may have replicated it to a majority of followers and then crashed without ever committing it.
This raises a question: after a successor becomes leader, should it commit the entries that exist on more than half of the machines? Would that be safe? In fact it is not safe, so Raft has a restriction here: a new leader never directly commits records from an older term.
Let's see why it is unsafe: the leader cannot determine the commit status of an old entry from its term number alone. Consider this sequence of events:
In (a), S1 is the leader and partially replicates the log entry at index 2. In (b), S1 crashes; S5 wins the election for term 3 with votes from S3, S4 and itself, and accepts a different entry from a client at index 2. In (c), S5 crashes; S1 restarts, wins an election, and continues replicating. At this point the entry from term 2 has been replicated on a majority of the machines, but it is not committed. If S1 then crashes as in (d), S5 can be elected again (with votes from S2, S3 and S4) and overwrite the entry at index 2 on the other servers. However, if S1 replicates an entry from its own current term to a majority of the machines before crashing, as in (e), then that entry is committed (and S5 can no longer win an election); at that point all the earlier log entries are committed normally as well.
Split brain (partition): will a leader ever voluntarily give up leadership? No. Suppose the leader is partitioned away from all the other nodes but still believes it is the leader and keeps handling requests. Since it cannot get acknowledgements from a majority, it cannot commit, but it does accumulate dirty data. Could this node, having accumulated dirty data, later be elected leader and overwrite correct, committed data?
To address this, Raft's condition for counting an entry as committed is strengthened further: the entry must be stored on a majority of the nodes, and at least one new record from the current leader's term must also be stored on a majority of the nodes (i.e. the term number has spread to a majority of the nodes).
That poor node (the old leader stuck behind the partition) cannot be elected leader: because of the partition it could not spread its term to a majority of the nodes, and by the time the partition heals, if a new leader was elected in the meantime, the term number has already moved on.
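A sketch of this strengthened commit rule (modelled on the paper's matchIndex bookkeeping; the helper name is mine): the leader only advances its commit index via an entry from its own current term, which drags older entries along indirectly.

```go
package main

import "fmt"

type logEntry struct{ Term int }

// advanceCommitIndex finds the highest index N > commitIndex that is
// stored on a majority of servers AND belongs to the leader's current
// term. Entries from older terms are never committed by counting
// replicas directly; they become committed indirectly once such an N
// is found (everything before a committed entry is committed too).
func advanceCommitIndex(log []logEntry, matchIndex []int, commitIndex, currentTerm int) int {
	for n := len(log); n > commitIndex; n-- {
		if log[n-1].Term != currentTerm {
			continue // old-term entry: only indirect commitment is allowed
		}
		replicas := 1 // the leader itself
		for _, m := range matchIndex {
			if m >= n {
				replicas++
			}
		}
		if replicas > (len(matchIndex)+1)/2 {
			return n
		}
	}
	return commitIndex
}

func main() {
	// Leader of term 4. Index 2 (term 2) already sits on a majority, but it
	// must not be committed on its own: the scenario above shows how it
	// could still be overwritten.
	log := []logEntry{{1}, {2}}
	matchIndex := []int{2, 2, 0, 0} // 4 followers in a 5-node cluster
	fmt.Println(advanceCommitIndex(log, matchIndex, 0, 4)) // 0: nothing newly committed

	// Once a term-4 entry reaches a majority, it and everything before it
	// are committed together.
	log = append(log, logEntry{4})
	matchIndex = []int{3, 3, 0, 0}
	fmt.Println(advanceCommitIndex(log, matchIndex, 0, 4)) // 3
}
```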
A transient network outage: the leader loses connectivity and another node is elected leader. When the old leader tries to commit records, its term is smaller than the others', so it fails and becomes a follower.
A node first votes for A and then receives a vote request from B with a higher term; what should it do? Vote! A higher term means the next round of elections has already started, so the previous round must be over: either no leader was elected, or a leader was elected but this node never received any request from it. Either way, this is a new round.
A command is executed, but the server crashes before returning to the client. After the timeout the client retries the request.
Linearizability requires that this request not be executed twice. The approach is for the client to attach a unique id to every command; if the server sees a duplicate id, it does not execute the command again but directly returns the result of the previous execution.
Membership changes: consistency during the migration. In a conflict, part of the cluster works with the new configuration and part with the old one, and each part believes it has a majority. I have not looked at this part in detail yet.
Correctness proof. Many failure scenarios are listed above and the Raft protocol handles all of them correctly, but examples alone do not show that there is no problem; this has to be proved theoretically, and Raft is indeed a protocol whose correctness can be proved. Here is only a rough sketch of the proof.
One part is proving Leader Completeness: once a log entry has been committed, then no matter what leader elections or other events follow, the entry is never lost and earlier entries are never rewritten or overwritten. With this guaranteed, the replicated state machine model guarantees that execution results are consistent across machines.
The proof is by contradiction: assume the Leader Completeness property does not hold and derive a contradiction. Suppose the leader of term T (leader T) commits a log entry during its term, but that entry is not stored in the log of the leader of some future term; let U be the smallest term greater than T whose leader (leader U) does not have this entry. The detailed proof is omitted here.
The other part is proving the Log Matching property, namely that if a record is committed, then all records before it are committed as well; this condition is needed in the Leader Completeness proof above. The proof uses mathematical induction: the initial state satisfies the condition, and assuming some state satisfies it, our operations and constraints ensure the next state satisfies it too. The details are likewise omitted.
The points that are key to understanding Raft are the ones below; once these are understood, it should be clear how Raft tolerates faults:
- electing a leader requires more than half of the votes;
- a node only has a chance to become leader if its log is at least as up to date as the logs of a majority of the other nodes;
- if a record is committed, then all records before it are committed as well.
Every change to the data is first recorded in the log; while the log entry is uncommitted, a node's value cannot be changed. The leader then replicates the data to the other follower nodes and waits for a majority of the nodes to write the log successfully before committing the data.
Each node waits a random 150 to 300 ms; when the time is up it starts sending out vote requests. A node that waited a shorter time sends its vote requests first and thus gets elected. But if two follower nodes receive the same number of votes, they go into overtime: each again waits a random 150 to 300 ms before requesting votes, until one of them collects the most votes and wins.
Each node records who the leader is and maintains a heartbeat timeout with the leader; if it receives no reply from the leader, the follower starts a new round to elect a candidate.
When the cluster recovers, the original leader finds that it is no longer the node with the majority of votes, becomes a follower and rolls back its own log; finally the new leader synchronizes its log to the followers, keeping the leader's and followers' data consistent.