Master: A Chubby cell consists of multiple replicas, one of which is elected as the master using a distributed consensus protocol such as Paxos. The replicas also grant the master a lease, during which they do not elect a new master.
Once elected, the master is responsible for writing any persistent state it needs to the database, which is then replicated to the other replicas. A write must be replicated to a majority of replicas before being acknowledged back to the client. The master can serve a read on its own as long as its lease hasn't expired, since an unexpired lease guarantees that there is no other master around.
If the master fails, the consensus protocol is run again to elect a new master.
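To make the write and read rules concrete, here is a minimal Python sketch. This is not Chubby's actual code; the class names, the lease handling, and the `replicate` call are assumptions for illustration:

```python
import time

class Peer:
    """Stand-in for a non-master replica (hypothetical)."""
    def __init__(self):
        self.db = {}

    def replicate(self, key, value):
        self.db[key] = value
        return True  # acknowledge the write

class Master:
    """Sketch of the master's write/read rules under a replica lease."""
    def __init__(self, peers, lease_duration=12.0):
        self.peers = peers
        self.lease_expiry = time.time() + lease_duration
        self.db = {}

    def write(self, key, value):
        # A write is acknowledged only once a majority of the cell
        # (this replica plus enough peers) has accepted it.
        acks = 1  # the master itself
        for peer in self.peers:
            if peer.replicate(key, value):
                acks += 1
        majority = (len(self.peers) + 1) // 2 + 1
        if acks < majority:
            raise RuntimeError("write not replicated to a majority")
        self.db[key] = value

    def read(self, key):
        # Safe to answer locally: an unexpired lease means the other
        # replicas have promised not to elect another master yet.
        if time.time() >= self.lease_expiry:
            raise RuntimeError("lease expired; cannot serve reads")
        return self.db.get(key)
```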
Client: A Chubby cell serves thousands of clients, all of which connect to the master for their coordination needs. Clients use DNS to find the replicas, and replicas answer such lookups by redirecting the client to the current master. Once a client has found the master, it sends all requests there. The Chubby client library runs the locking protocol on the application's behalf and notifies the application of certain events, such as a master fail-over.
File-based interface
Chubby exports a UNIX-like file system API. Files and directories are called nodes, and links are not allowed. Nodes can be permanent or ephemeral; an ephemeral node is deleted once no client has it open. A file can be opened in read (shared) or write (exclusive) mode, and clients get a handle to the node. The following metadata is also kept per node:
Instance number — always increasing for the same name
Content generation number — Increased anytime content is overwritten
Lock generation number — Increases when lock transitions from free to held
There are also ACLs on nodes, as in a traditional file system, for controlling access; an ACL generation number increases on ACL changes.
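A minimal sketch of this per-node metadata; the field names are assumptions, not Chubby's actual schema:

```python
from dataclasses import dataclass

@dataclass
class NodeMetadata:
    # Always increases when a node with the same name is re-created,
    # letting handles detect that they refer to an older instance.
    instance_number: int = 0
    # Increases every time the file contents are overwritten.
    content_generation: int = 0
    # Increases when the node's lock transitions from free to held.
    lock_generation: int = 0
    # Increases whenever the node's ACLs change.
    acl_generation: int = 0
    # Ephemeral nodes are deleted once no client has them open.
    ephemeral: bool = False
```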
Locks, Lock-Delays and Sequencers
A client can acquire a lock on a node/file in write (exclusive) or read (shared) mode. All locks are advisory, i.e. participating entities must follow the locking protocol to access the distributed critical section; holding a lock on a file does not prevent others from accessing the file.
One of the issues with locking in distributed systems is that applications holding locks can die. Consider the following example, where client R1 ends up accessing data inconsistently: R1 acquires the lock on node N and issues an update, but the update is delayed in the network and R1 dies before it arrives. The master then grants the lock on N to another client, R2. When R1's delayed update finally lands on the master, R1 no longer holds a valid lock, and the stale update can corrupt data.
One way to handle this is a lock-delay. When an application holding a lock dies without releasing it, no one else can acquire the locks held by the now-defunct application for some configurable time. This is a simple and effective (but not perfect) solution: the client can specify an upper bound on how long a faulty application can keep a lock unavailable.
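A toy sketch of a lock-delay, assuming a single lock-server process; the class and method names are made up for illustration:

```python
import time

class AdvisoryLock:
    def __init__(self, lock_delay=60.0):
        self.lock_delay = lock_delay   # configurable threshold
        self.holder = None
        self.blocked_until = 0.0       # no acquisitions before this time

    def acquire(self, client):
        if self.holder is None and time.time() >= self.blocked_until:
            self.holder = client
            return True
        return False

    def release(self, client):
        if self.holder == client:
            self.holder = None

    def holder_died(self):
        # The holder vanished without releasing the lock: keep the lock
        # unavailable for lock_delay seconds so the dead client's
        # in-flight, stale requests likely drain before a new holder acts.
        self.holder = None
        self.blocked_until = time.time() + self.lock_delay
```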
Another solution Chubby provides is sequencer-based checking. When a client acquires a lock, it can request a sequencer from the Chubby master. This is a string consisting of the lock name, the lock generation number (which changes on every transition from free to held), and the mode of acquisition. The client passes this string to the modules performing the protected transactions, and those modules can check that the lock is still valid by validating the sequencer against the Chubby master or against the module's own Chubby cache.
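The following sketch shows the idea behind sequencers; the encoding and the `is_valid` check are assumptions, not the real Chubby API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sequencer:
    lock_name: str
    lock_generation: int
    mode: str  # "exclusive" or "shared"

class LockService:
    def __init__(self):
        self.generations = {}  # lock name -> current generation number

    def acquire(self, name, mode):
        # Every free->held transition bumps the lock generation number.
        self.generations[name] = self.generations.get(name, 0) + 1
        return Sequencer(name, self.generations[name], mode)

    def is_valid(self, seq):
        # A module receiving a request validates the sequencer against
        # the lock service (or its own Chubby cache) before acting.
        return self.generations.get(seq.lock_name) == seq.lock_generation

svc = LockService()
seq = svc.acquire("/ls/cell/my-lock", "exclusive")
assert svc.is_valid(seq)                       # holder's requests pass
svc.acquire("/ls/cell/my-lock", "exclusive")   # lock changes hands
assert not svc.is_valid(seq)                   # stale sequencer rejected
```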
Detection of changes using events
Chubby also provides a limited publish/subscribe mechanism. Files in Chubby can store a small amount of data, which makes them useful for more than just indicating whether a lock is held. As discussed earlier, clients want to know when a new master has been elected or when the contents of a file they are using have changed. This is accomplished using events and callbacks that clients register when opening a file. The following events are used:
File contents have changed: used, for example, to monitor the location of a service advertised via the file
Child node added to a directory: used to detect the addition of a new replica
Chubby master fail-over: tells the client to go into recovery mode
Invalid handle: indicates some communication issue
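The sketch below shows the shape of such registrations in a hypothetical client library; Chubby's real API differs:

```python
from collections import defaultdict

class ChubbyHandle:
    """Hypothetical handle that delivers registered events."""
    def __init__(self):
        self.callbacks = defaultdict(list)

    def on(self, event, callback):
        # In Chubby, callbacks are registered when the file is opened.
        self.callbacks[event].append(callback)

    def deliver(self, event, payload=None):
        for cb in self.callbacks[event]:
            cb(payload)

handle = ChubbyHandle()
handle.on("contents-changed", lambda _: print("re-read service location"))
handle.on("master-failover", lambda _: print("enter recovery mode"))
handle.deliver("contents-changed")  # prints "re-read service location"
```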
Electing a primary using Chubby
Using the mechanisms described so far, clients can now elect a primary. It is fairly straightforward (a sketch follows the list):
All the entities that want to become the primary try to open a file in write mode.
Only one of them gets write access; the others fail.
The one with write access then writes its identity to the file.
All the others get the file-modification event and now know who the current primary is.
The primary uses either a sequencer or a lock-delay to ensure that out-of-order messages don't cause inconsistent access; with sequencers, services can confirm whether the current primary's sequencer is still valid.
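Here is how that recipe might look against a hypothetical Chubby-like client API; `open`, `try_lock`, `set_contents`, and `get_sequencer` are assumed names, not the real interface:

```python
def try_become_primary(chubby, path, my_identity):
    handle = chubby.open(path)
    if handle.try_lock(mode="exclusive"):    # steps 1-2: only one wins
        handle.set_contents(my_identity)     # step 3: advertise identity
        return True, handle.get_sequencer()  # guards the primary's requests
    # Losers watch the file; the contents-changed event tells them
    # who the current primary is (step 4).
    handle.on("contents-changed",
              lambda _: print("primary is now", handle.get_contents()))
    return False, None
```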
Caching and KeepAlive calls
Clients keep a cache that can be used for reads and is always consistent. A write is propagated to the master and does not complete until the master acknowledges it. The master maintains state about all of its clients and can therefore invalidate a client's cache when someone else writes to the same file. The writing client is blocked until all invalidations have been sent to the other clients and acknowledged by them.
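A sketch of that invalidation rule, assuming clients expose a blocking `invalidate` call (all names here are invented):

```python
class CachingMaster:
    def __init__(self, clients):
        self.clients = clients  # objects with an invalidate(path) method
        self.data = {}

    def write(self, writer, path, value):
        # The writer stays blocked until every other client has been
        # told to drop its cached copy and has acknowledged doing so.
        for client in self.clients:
            if client is not writer:
                client.invalidate(path)  # assumed to return only on ack
        self.data[path] = value          # only now does the write complete
```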
Clients make KeepAlive calls to the master. At any point, a well-behaved client has exactly one outstanding KeepAlive call at the master; the client acknowledges the master's response by issuing the next KeepAlive. The master can piggyback information on the response to this call at a later time, e.g. a cache invalidation can be sent as the response to a prior KeepAlive. The client sees the response, invalidates its own cache, and then opens another KeepAlive call for future communication from the master. Another advantage of this mechanism is that no additional holes need to be punched in firewalls: outbound calls from clients are generally allowed, so clients don't need to open and listen on ports for the master to initiate connections to them.
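The client side of this loop might look like the following sketch; `keep_alive` is an assumed RPC stub, not Chubby's actual signature:

```python
def keepalive_loop(master, session_id, cache):
    while True:
        # The call blocks at the master until it has something to say:
        # a lease extension, cache invalidations, or a subscribed event.
        response = master.keep_alive(session_id)  # hypothetical RPC
        for path in response.invalidations:
            cache.pop(path, None)  # drop stale entries
        # Looping back immediately both acknowledges this response and
        # hands the master a fresh blocked call to answer later.
```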
Sessions
We discussed KeepAlive RPCs in the last section; these establish a client-master Chubby session. When a client makes a KeepAlive call, the master blocks the call and assigns the client a lease. This lease guarantees that the master won't unilaterally terminate the session. When the lease is about to expire, or when an event the client has subscribed to fires, the master uses the blocked call to respond: in the former case it extends the lease, and in the latter it sends information such as which files have changed.
A client cannot be sure that the master is alive and that its lease is still valid, so it keeps a slightly shorter local lease timeout. If this timeout expires without a response from the master, the client doesn't know whether the master is still around or whether its local lease is valid. At this point the client considers its session in jeopardy, starts a grace period (45s), and disables its cache. If the client hears back from the master during the grace period, it can enable the cache once more; if not, the master is assumed inaccessible and the client returns errors to the application. Applications are informed about both jeopardy and expired events by the Chubby client library.
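A sketch of the client-side lease state machine; the timeouts and names are assumptions, except the 45s grace period, which comes from the paper:

```python
import time

class ClientSession:
    GRACE_PERIOD = 45.0  # seconds, per the Chubby paper

    def __init__(self, local_lease=10.0):
        self.local_lease = local_lease  # shorter than the master's lease
        self.lease_deadline = time.time() + local_lease
        self.cache_enabled = True
        self.state = "valid"

    def on_keepalive_response(self):
        # The master extended the lease: the session is healthy again.
        self.lease_deadline = time.time() + self.local_lease
        self.cache_enabled = True
        self.state = "valid"

    def tick(self):
        now = time.time()
        if self.state == "valid" and now >= self.lease_deadline:
            # Local lease ran out: session in jeopardy, cache disabled.
            self.state = "jeopardy"
            self.cache_enabled = False
            self.grace_deadline = now + self.GRACE_PERIOD
        elif self.state == "jeopardy" and now >= self.grace_deadline:
            # Grace period exhausted: report the session as expired.
            self.state = "expired"
        return self.state
```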
Why a lock service rather than a Paxos library?
a. It eases porting of existing systems. A system whose initial design did not account for distributed consensus is hard to retrofit on top of a Paxos library, requiring large changes; building on a lock service is much easier.
b. Chubby is not only used to elect a master, it also provides a mechanism for advertising the result, and its consistent client caching (rather than time-based caching) keeps all client caches coherent. This is also why Chubby is so successful as a name server.
c. A lock-based interface is more familiar to programmers.
d. Distributed coordination algorithms use quorums to make decisions, so a Paxos library would require users to stand up a cluster first; with a lock service, even a single client can successfully acquire a lock.
Why advisory rather than mandatory locks?
a. Chubby locks often protect resources implemented by other services, not just the file associated with the lock, so mandatory locking on the file alone would not help.
b. It avoids forcing users to shut down applications when they need to access locked files for debugging or administrative purposes.
c. Developers perform error checking in the conventional way, with assertions such as "lock X is held", so they benefit little from mandatory checks; in practice they check that the lock is held first rather than relying on enforcement at file-access time.