Monday, October 26, 2015

The Chubby lock service for loosely-coupled distributed systems



https://medium.com/coinmonks/chubby-a-centralized-lock-service-for-distributed-applications-390571273052
Master: A Chubby cell consists of multiple replicas, one of which is elected as the master using a distributed consensus protocol such as Paxos. The replicas also grant the master a lease, during which they do not elect a new master.
Once the master is elected, it is responsible for writing any persistent state it needs to the database, which is then replicated at the other replicas. A write must be replicated at a majority of replicas before being acknowledged back to the client. A read can be served by the master alone as long as its lease hasn't expired, since the lease guarantees that no other master exists in the meantime.
If the master fails, the consensus protocol is run again to elect a new master.

Client: A Chubby cell serves thousands of clients, all of which connect to the master for their coordination needs. Clients use DNS to find the master: replicas respond to clients issuing DNS queries by redirecting them to the current master. Once a client has found the master, it sends all requests there. The Chubby client library runs the locking protocol on the application's behalf and notifies the application of certain events, such as a master fail-over.

File based interface

Chubby exports UNIX-file-system-like APIs. Files and directories are called nodes. No links are allowed in the system. Nodes can be permanent or ephemeral; an ephemeral file goes away once no client has it open, and an ephemeral directory once it is empty. A file can be opened in read or write mode, the latter indicating exclusivity. Clients get a handle to the given node. The following metadata is also kept per node:
  1. Instance number — always increasing for the same name
  2. Content generation number — Increased anytime content is overwritten
  3. Lock generation number — Increases when lock transitions from free to held
  4. ACL generation number — nodes also carry ACLs, as in a traditional file system, for controlling access, and the ACL generation number increases on ACL changes.
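The per-node counters above are monotonically increasing, so a client can detect any change by comparing old and new metadata. A minimal sketch (the `NodeMeta` type and `changed` helper are hypothetical illustrations, not Chubby's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeMeta:
    instance: int        # increases each time the same name is re-created
    content_gen: int     # increases on every content write
    lock_gen: int        # increases on each free -> held lock transition
    acl_gen: int         # increases on ACL changes

def changed(old: NodeMeta, new: NodeMeta) -> bool:
    # Any strictly larger counter means the node was touched since `old`.
    return (new.instance > old.instance
            or new.content_gen > old.content_gen
            or new.lock_gen > old.lock_gen
            or new.acl_gen > old.acl_gen)

a = NodeMeta(instance=1, content_gen=4, lock_gen=2, acl_gen=1)
assert not changed(a, a)
assert changed(a, NodeMeta(1, 5, 2, 1))   # contents were overwritten
```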

Locks, Lock-Delays and Sequencers

A client can create a node/file in write (exclusive) or read (shared) mode. All locks are advisory, i.e. participating entities must follow the locking protocol when accessing the distributed critical section. Holding a lock on a file does not prevent unfettered access to the file by others.
One of the issues with locking in distributed systems is that applications holding locks can die. Consider the following example, where R1 ends up accessing data inconsistently. In the last step (after 6), an update from R1 lands on the master and can corrupt data. R1 does not hold a valid lock at that time, because it died in step 4 and the master has since granted the lock on N to client R2.
The update from step 3 by R1 arrives at the master somewhat late; by that time the master has already granted the lock on N to another client, R2.
One way this is handled is with a lock-delay. When an application dies without releasing its locks, no one else can acquire those locks for some configurable time. This makes for a simple and effective (but not perfect) solution, where the client can specify the threshold up to which a faulty application can hold a lock.
Another solution Chubby provides is sequencer-based checking. When a client acquires a lock, it can request a sequencer from the Chubby master. This is a string consisting of the lock name, the lock generation number (which changes on every transition from free to held), and the mode of acquisition. The client can pass this string to the modules that need the lock for protected transactions. Those modules can check the validity of the lock by validating the sequencer against the Chubby master or against the module's own Chubby cache.
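The sequencer check can be sketched as follows. This is a toy illustration (names and structure are hypothetical; real Chubby sequencers are opaque strings validated via the master or a cache): a downstream server remembers the highest lock generation it has seen and rejects requests carrying a stale one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sequencer:
    lock_name: str
    generation: int   # lock generation number: bumps on each free -> held
    mode: str         # "exclusive" or "shared"

class FileServer:
    """A Chubby-protected service that validates sequencers."""
    def __init__(self):
        self.latest = {}          # lock_name -> highest generation seen

    def handle_request(self, seq: Sequencer, payload: str) -> bool:
        seen = self.latest.get(seq.lock_name, -1)
        if seq.generation < seen:
            return False          # stale sequencer: lock has changed hands
        self.latest[seq.lock_name] = seq.generation
        return True               # apply payload under a current lock

server = FileServer()
r1 = Sequencer("N", 1, "exclusive")   # held by client R1, which then dies
r2 = Sequencer("N", 2, "exclusive")   # master re-grants the lock to R2
assert server.handle_request(r2, "S")        # R2's write is accepted
assert not server.handle_request(r1, "R")    # R1's delayed write is rejected
```

This is exactly the scenario from the example above: R1's delayed update carries generation 1, but the server has already seen generation 2, so the stale request is refused.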

Detection of changes using events

Chubby also provides a limited publish/subscribe mechanism. Files in Chubby can store a small amount of data, which makes them more useful than a pure indicator of whether a lock is taken. As discussed earlier, clients want to know when a new master has been elected or when the contents of the file they are using have changed. This is accomplished with events and callbacks that are registered when files are opened. The following events are used:
  1. File contents have changed: Used to describe the new locations for the given service
  2. Child node added to a directory: Used to describe the addition of a new replica
  3. Chubby master fail-over: Used by clients to go into recovery mode
  4. Invalid handle: Indicates some communication issue

Electing a primary using Chubby

Using the mechanisms described so far, clients can now elect a primary. It is fairly straightforward:
  1. All the entities that want to become the primary try to open a file in write mode.
  2. Only one of them gets write-mode access; the others fail.
  3. The one with write access then writes its identity to the file.
  4. All the others get the file-modification event and now know who the current primary is.
  5. The primary uses either a sequencer or a lock-delay based mechanism to ensure that out-of-order messages don't cause inconsistent access; services can confirm whether the sequencer from the current primary is still valid.
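The five steps above can be made concrete with a minimal in-memory stand-in for a Chubby cell (all names here are hypothetical; this is not Chubby's real API):

```python
class FakeChubby:
    """Toy election file: exclusive open + contents + modification events."""
    def __init__(self):
        self.lock_holder = None   # exclusive-mode holder of the election file
        self.contents = b""
        self.watchers = []

    def open_write(self, who):
        # Steps 1-2: only the first opener gets write (exclusive) mode.
        if self.lock_holder is None:
            self.lock_holder = who
            return True
        return False

    def set_contents(self, who, data):
        # Step 3: the winner writes its identity.
        assert self.lock_holder == who
        self.contents = data
        for cb in self.watchers:  # step 4: watchers get the modification event
            cb(data)

    def watch(self, cb):
        self.watchers.append(cb)

cell = FakeChubby()
seen = []
wins = [c for c in ("A", "B", "C") if cell.open_write(c)]  # only "A" wins
cell.watch(lambda d: seen.append(d))
cell.set_contents(wins[0], b"A")   # winner advertises itself
# losers learn the current primary from the file-modification event
```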

Caching and KeepAlive calls

Clients keep a cache that can be used for reads and is always consistent. Writes are propagated to the master and do not complete until the master acknowledges them. The master maintains state information about all clients, and can therefore invalidate a client's cache when someone else writes to the same file. The client that issued the write is blocked until all invalidations have been sent to the other clients and acknowledged by them.
Clients make KeepAlive calls to the master. At any point, a well-behaved client has exactly one outstanding KeepAlive call at the master. Essentially, a client acknowledges the master's response by issuing the next KeepAlive call. The server can send information back as the response to this call at a later time; for example, an invalidation can be sent to the client as the response to a prior KeepAlive. The client sees the response, invalidates its own cache, and then opens another KeepAlive call at the master for future communication from the master. Another advantage of this mechanism is that no additional holes need to be punched in firewalls: outbound calls from clients are generally allowed, and clients don't need to listen on ports for the master to initiate connections to them.
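The "one outstanding KeepAlive" pattern can be sketched like this (hypothetical names; real Chubby uses RPCs, and the master normally replies near lease expiry rather than only on demand):

```python
class Master:
    def __init__(self):
        self.parked = None                    # the one outstanding KeepAlive
    def keep_alive(self, reply_cb):
        assert self.parked is None, "client keeps exactly one call open"
        self.parked = reply_cb                # park the call until needed
    def push(self, msg):
        # An event or invalidation is pending: answer the parked call early.
        cb, self.parked = self.parked, None
        cb(msg)

class Client:
    def __init__(self, master):
        self.master = master
        self.inbox = []
        self.master.keep_alive(self._on_reply)    # open the first call
    def _on_reply(self, msg):
        self.inbox.append(msg)                    # process piggybacked payload
        self.master.keep_alive(self._on_reply)    # immediately re-arm

m = Master()
c = Client(m)
m.push({"invalidate": "/ls/cell/foo"})
m.push({"lease_timeout_s": 12})
# c.inbox now holds both messages, and a call was outstanding the whole time
```

Because the client always re-arms immediately, the master always has a channel to initiate communication without the client listening on any port.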

Sessions

We discussed KeepAlive RPCs in the last section; these establish a client-master Chubby session. When a client makes a KeepAlive call, the master blocks the call and assigns a lease to the client. This master lease guarantees that the master won't unilaterally terminate the session. When the lease is about to expire, or when there is an event the client has subscribed to, the master uses this blocked call to send information back: in the former case the master extends the lease, in the latter it sends information such as which files have changed.

Clients cannot be sure that the master is alive and that their lease is still valid, so each client keeps a slightly smaller local lease timeout. If this local timeout fires and the master hasn't responded, the client isn't sure whether the master is still around and whether its local lease is valid. At this point the client considers its session to be in jeopardy, starts the grace period, and disables its cache. If the client hears back from the master during the grace period (45s), it can enable the cache once more. If it does not, the master is assumed inaccessible, and the client returns errors back to the application. Applications are informed of both jeopardy and expired events by the Chubby client library.
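The client-side states can be sketched as a small decision function (timings from the text: local lease slightly shorter than the master's, 45-second grace period; the function itself is a hypothetical illustration):

```python
def session_state(now, local_lease_deadline, grace_period_s=45.0,
                  heard_from_master=False):
    if now < local_lease_deadline:
        return "safe"
    if heard_from_master:
        return "safe"           # KeepAlive landed during the grace period
    if now < local_lease_deadline + grace_period_s:
        return "jeopardy"       # cache disabled, still hoping for the master
    return "expired"            # return errors to the application

assert session_state(10.0, 12.0) == "safe"
assert session_state(13.0, 12.0) == "jeopardy"
assert session_state(60.0, 12.0) == "expired"
assert session_state(60.0, 12.0, heard_from_master=True) == "safe"
```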


http://www.slideshare.net/romain_jacotin/the-google-chubby-lock-service-for-looselycoupled-distributed-systems
The Chubby lock service is intended for use within a loosely-coupled distributed system consisting of a large number of machines (~10,000) connected by a high-speed network
– Provides coarse-grained locking
– And reliable (low-volume) storage
• Chubby provides an interface much like a distributed file system with advisory locks
– Whole-file read and write operations (no seek)
– Advisory locks
– Notification of various events such as file modification
• Design emphasis
– Availability
– Reliability
– But not high performance / high throughput
• Chubby uses asynchronous consensus: PAXOS with lease timers to ensure liveness

Chubby exports a file system interface simpler than Unix
– Tree of files and directories with name components separated by slashes
– Each directory contains a list of child files and directories (collectively called nodes)
– Each file contains a sequence of un-interpreted bytes
– No symbolic or hard links
– No directory modified times, no last-access times (to make it easier to cache file meta-data)
– No path-dependent permission semantics: a file is controlled by the permissions on the file itself

Files & directories : Nodes
• Nodes (= files or directories) may be either permanent or ephemeral
• Ephemeral files are used as temporary files, and act as indicators to others that a client is alive
• Any nodes may be deleted explicitly
– Ephemeral files are also deleted if no client has them open
– Ephemeral directories are also deleted if they are empty
• Any node can act as an advisory reader/writer lock

http://static.googleusercontent.com/media/research.google.com/zh-CN//archive/chubby-osdi06.pdf
http://muratbuffalo.blogspot.com/2010/10/chubby-lock-service-for-loosely-coupled.html

http://systemdesigns.blogspot.com/2016/01/chubby-lock-service_10.html?view=sidebar
Chubby is a distributed lock service, used to solve distributed consensus problems. Judging from Chubby's two best-known applications, GFS and Bigtable, it is mainly used to elect a master and for highly-available configuration management, such as system metadata.

In more detail, Chubby helps developers handle coarse-grained synchronization in their systems, in particular electing a leader from a set of otherwise equivalent servers. For example, GFS uses a Chubby lock to choose the GFS master server, and Bigtable uses Chubby in several ways: to elect a leader, to let the master discover the servers it controls, and to let client applications find the master. In addition, both GFS and Bigtable use Chubby as a well-known, accessible location to store small amounts of metadata; in effect, they use Chubby as the root of their distributed data structures. Some services also use locks to partition work among several servers (at a coarse granularity).

The essence of the problem Chubby solves is distributed consensus in an asynchronous communication environment, and at the core of every effective protocol for asynchronous consensus is the Paxos algorithm. One might therefore argue for building a Paxos library rather than a service that accesses a centralized lock server.

So the first design decision is to implement a lock service rather than a standard Paxos library. Using the Chubby lock service instead of a raw Paxos library has several advantages:

a. It eases porting of existing systems. Many systems are not designed with distributed consensus in mind at the start; retrofitting them onto a Paxos library later is hard, whereas adapting them to a lock service is much easier.
b. Chubby is used not only to elect a master but also provides a mechanism for advertising the results, and it keeps all client caches consistent through consistent client caching (rather than time-based caching). This cache consistency is also why Chubby is so successful as a name server.
c. A lock-based interface is more familiar to programmers.
d. Distributed consensus algorithms use quorums to make decisions; a Paxos library would require users to stand up a cluster first, whereas with a lock service even a single client can successfully acquire a lock.

The second design decision:

provide small files, not just a lock service.

Chubby is first and foremost a lock service; file storage is provided for convenience and is not its main purpose, so it requires neither high throughput nor large files.
As point b above notes, needs such as advertising election results or storing configuration parameters are then met directly, without depending on yet another service.

The third design decision:

coarse-grained locks.

Coarse-grained vs. fine-grained refers to how long a lock is held over data access: fine-grained locks are typically held for only seconds, while coarse-grained locks may be held for hours or even days.

Chubby supports only coarse-grained locks. Its use cases, such as electing a master or storing metadata, rarely change once decided, so there is no need to support fine-grained locks.
Benefits of coarse granularity:
a. Light load: because locks are coarse-grained, clients do not need to contact Chubby frequently. For a master-based service facing a large number of clients, this matters greatly.
b. Temporary unavailability of the lock server has less impact on clients.

Two questions follow. Because locks are coarse-grained, clients tend to cache the results they obtain to avoid frequently contacting the master; how is cache consistency guaranteed? And on a master fail-over, how does the lock service avoid losing state and remain effective?

These two problems are solved, respectively, by sending invalidation notices to caches and by state reconstruction on the new master after fail-over; details are covered later.

Some other design considerations:
·      Chubby must support a large number of clients (thousands); coarse-grained locking partly solves this, and proxies and partitioning can be used if that is not enough.
·      To avoid clients polling repeatedly, an event-notification mechanism is needed.
·      Caching is provided at clients, with a consistent-caching mechanism.
·      Security mechanisms, including access control, are provided.

2.2 Chubby's system structure

Chubby's architecture is quite simple: a client library and the Chubby servers.
Chubby cell: a cluster of Chubby servers (usually 5).
Replica: any one server in the cluster.
Master: elected from among the replicas.
   a. The master holds a master lease, which is not permanent, so that communication problems cannot block the cell forever.
   b. The master lease must be periodically renewed.
   c. Only the master may initiate reads and writes of the database; replicas only synchronize with the master via the consensus protocol.
   d. Clients locate the master by sending master-location requests to any replica.
If a master fails, the remaining replicas run an election when their copy of the master lease expires; usually a new master is elected within a few seconds.

If a replica fails and does not recover within a few hours, a simple replacement system selects a fresh machine and starts the server binary on it, then updates the DNS tables, replacing the failed replica's IP address with that of the newly started one. The current master polls DNS periodically and eventually notices the change; it then updates the cell-membership list in the cell's database, and this list is kept consistent across members via the normal replication protocol. Meanwhile, the new replica obtains a recent copy of the database from a combination of backups stored on file servers and updates from active replicas. Once the new replica has processed a request that the current master is waiting to commit, it is permitted to vote in new-master elections.

2.3 Chubby's file system (files, directories, handles)

A typical Chubby name looks like this: /ls/foo/wombat/pouch
At first glance this resembles a UNIX file system. The first component, ls, stands for "lock service"; the second names the Chubby cell; the remaining components are files inside that cell and are resolved by the cell itself.
By setting a different chubby cell name in the second component, a directory tree can easily be placed on a different Chubby cluster.
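A toy parser makes the naming convention concrete (a hypothetical helper, not part of Chubby's API):

```python
def parse_chubby_name(name: str):
    """Split a Chubby name like /ls/foo/wombat/pouch into (cell, node path)."""
    parts = name.strip("/").split("/")
    assert parts[0] == "ls", "Chubby names start with /ls (lock service)"
    cell, node_path = parts[1], "/" + "/".join(parts[2:])
    return cell, node_path      # node_path is resolved inside the cell

assert parse_chubby_name("/ls/foo/wombat/pouch") == ("foo", "/wombat/pouch")
```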

Compared with UNIX, then, Chubby's file system is similar but simplified. The differences from the UNIX file system are:
a.    Files in different directories may be served by different Chubby masters, so operations that move files from one directory to another are not exposed.
b.    Directory modified times are not maintained.
c.    Path-dependent permission semantics are avoided (a file's access is controlled by the permissions on the file itself, not by those on the directories above it).
d.    To make caching of file metadata easier, the system does not expose last-access times.


A file or directory is uniformly called a node in Chubby. Nodes may be permanent or ephemeral, and any node can act as an advisory reader/writer lock.

A node's metadata includes:
a.    The names of three access control lists (ACLs), used respectively to control reading, writing, and changing the ACL names.
b.    Four monotonically increasing 64-bit numbers that let clients detect changes easily:
an instance number: greater than the instance number of any previous node with the same name;
a content generation number (files only): increases whenever the file's contents are written;
a lock generation number: increases when the node's lock transitions from free to held;
an ACL generation number: increases when the node's ACL names are written.
c.    A 64-bit checksum of the file contents, so clients can tell whether the contents have changed.

On opening a node, a client obtains a handle, analogous to a UNIX file descriptor.

2.4 Chubby's locks and sequencers

Each Chubby file and directory can act as a reader/writer lock: either one client handle holds the lock in exclusive (writer) mode, or any number of client handles hold it in shared (reader) mode.

Note that Chubby uses advisory locks, not mandatory locks.
Locks are generally associated with files; "advisory" means that only the acquisition of the lock is mutually exclusive — holding the lock does not control whether the file itself can be accessed.

The reasons for using advisory locks are:
a. Chubby locks often protect resources implemented by other services, not just the file associated with the lock.
b. Mandatory locks would force users to shut down applications whenever they access those locked files for debugging or administration.
c. Developers perform error checking in the conventional way, with assertions such as "lock X is held", so they benefit little from mandatory checks. In practice, code checks "lock X is held" first rather than accessing the file directly and relying on a mandatory lock check.

Let us look at an example of failure when messages are delayed:
The example requires mutual exclusion between request R and operation S (e.g. both modify the same file).
Process A acquires lock L and sends request R, then fails. Process B tries to acquire lock L; if it succeeds and has not seen request R, it performs operation S.

Normally, A keeps holding L, so B cannot acquire L and S cannot run. Even if A fails, corruption is avoided as long as R arrives before B acquires L (between A's failure and B's acquisition of L there is necessarily some interval T). But if R is delayed (B acquires L without R having arrived), R may arrive after operation S has executed; mutual exclusion is lost and the data is written dirty.

The essence of the problem is that operations or messages belonging to different generations of the lock get mixed together. Knowing only whether the lock is held, without distinguishing lock generations, lets commands from different holders interleave.

Chubby's solution is to attach a sequencer to each command, distinguishing operations from different lock generations. Under this mechanism, A generates sequencer L1 while holding L, passes L1 to the file server together with request R, then fails. Because A failed, B acquires L, generates sequencer L2, passes L2 to the file server, and performs operation S to update the file. When the delayed R finally arrives, the file server checks R's sequencer L1, finds that that lock generation has already passed, and rejects the request, thereby preserving mutual exclusion.

For servers that do not support sequencers, Chubby offers a simpler scheme: a lock-delay, which lengthens the interval T from the example above (the gap between lock generations) and so reduces the risk from delayed or reordered messages. It is imperfect in that the protection it gives depends entirely on how long the lock-delay is set.
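The lock-delay idea above can be sketched as a small state machine (hypothetical names; the delay value is configurable, 60 s here as an assumption):

```python
class LockWithDelay:
    """After a holder dies without releasing, keep the lock blocked for a delay."""
    def __init__(self, lock_delay_s=60.0):
        self.lock_delay_s = lock_delay_s
        self.holder = None
        self.blocked_until = 0.0

    def acquire(self, who, now):
        if self.holder is None and now >= self.blocked_until:
            self.holder = who
            return True
        return False

    def holder_died(self, now):
        # Session expired without a clean release: start the lock-delay,
        # widening the window T between the two lock generations.
        self.holder = None
        self.blocked_until = now + self.lock_delay_s

lock = LockWithDelay(lock_delay_s=60.0)
assert lock.acquire("A", now=0.0)
lock.holder_died(now=10.0)
assert not lock.acquire("B", now=30.0)   # still inside the lock-delay
assert lock.acquire("B", now=75.0)       # delay elapsed; B may proceed
```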

2.5 Events

Next we discuss Chubby's events.
To avoid clients polling repeatedly, Chubby provides an event mechanism. A client can subscribe to a series of events when it creates a handle; events are delivered to the client asynchronously via up-calls. Chubby guarantees an event is sent after the corresponding change has taken place, so a client that receives an event is certain to see the latest state.

The event types include:
·         file contents modified — often used to monitor the location of a service advertised through the file;
·         child node added, removed, or modified — used to implement mirroring (besides allowing new files to be discovered, returning events on child nodes makes it possible to monitor ephemeral files without affecting their reference counts);
·         Chubby master failed over — warns clients that other events may have been lost, so data must be rescanned;
·         a handle (and its lock) has become invalid — often suggests a communication problem;
·         lock acquired — can be used to determine when a primary has been elected;
·         conflicting lock request from another client — allows the caching of locks.

2.6 API

A quick look at Chubby's API:
Open() opens a node and returns a handle, analogous to a UNIX file descriptor; only Open itself takes a node name, and all other operations act directly on handles.
Close() closes an open handle.
Poison() causes outstanding and subsequent operations on the handle to fail without closing it; this allows one client thread to cancel Chubby calls made by other threads without fear of deallocating the memory being accessed by them.
GetContentsAndStat() returns a file's contents and metadata.
SetContents() writes contents to a file.
Delete() deletes a node if it has no children.
Acquire(), TryAcquire(), Release() acquire and release locks.
GetSequencer() returns a sequencer describing the lock held by the handle.
SetSequencer() associates a sequencer with a handle.
CheckSequencer() checks whether a sequencer is valid.
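The typical call sequence composes like this. The stub below is a toy in-memory stand-in (Chubby's real API is RPC-based; these Python classes only mirror the method names in the list above):

```python
class Handle:
    def __init__(self, cell, name):
        self.cell, self.name, self.is_open = cell, name, True
    def Acquire(self):
        self.cell.locks[self.name] = self
    def GetSequencer(self):
        return (self.name, self.cell.gen[self.name])
    def SetContents(self, data):
        self.cell.files[self.name] = data
    def GetContentsAndStat(self):
        return self.cell.files.get(self.name, b""), {"gen": self.cell.gen[self.name]}
    def Release(self):
        del self.cell.locks[self.name]
    def Close(self):
        self.is_open = False

class Cell:
    def __init__(self):
        self.files, self.locks, self.gen = {}, {}, {}
    def Open(self, name):
        self.gen.setdefault(name, 1)      # only Open takes a node name
        return Handle(self, name)
    def CheckSequencer(self, seq):
        name, gen = seq
        return self.gen[name] == gen      # valid while the generation matches

cell = Cell()
h = cell.Open("/ls/cell/config")
h.Acquire()
seq = h.GetSequencer()
h.SetContents(b"primary=hostA")
assert cell.CheckSequencer(seq)
data, stat = h.GetContentsAndStat()
h.Release()
h.Close()
```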

2.7 Caching

Caching is an important Chubby feature.
Because locks are coarse-grained, to avoid clients repeatedly reading from the master, Chubby caches file data and node metadata at the clients.
The question is how to keep the caches of a large number of clients consistent.

The Chubby master maintains a list of every client that may be caching, and keeps all client caches consistent by sending invalidations (invalidation notices).
Concretely, before a change to data is applied, the master first sends a round of invalidations and waits until every client's cache is known to be invalid (either because the client acknowledged the invalidation or because its lease expired); only then does it apply the update. Besides file data and node metadata, clients can also cache handles and locks. Invalidations need to be sent only once: for clients that do not acknowledge, waiting until their lease expires is enough to treat their caches as invalidated.

During the invalidation round, most clients no longer have the data cached; how are reads handled during this operation?
The default is to do nothing special: the data is simply uncached, and reads go directly to the master (writes are far less frequent than reads).
The drawback is that this can overload the master, so an alternative is for clients that find the data uncached to block all reads of the node during invalidation.
The two approaches can be used in combination.
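The write-triggered invalidation protocol can be sketched as follows (hypothetical names; a real master would also track leases and retries rather than take them as parameters):

```python
class CachingMaster:
    def __init__(self, clients):
        self.clients = clients            # client -> set of cached node names
        self.files = {}

    def write(self, name, data, acked, lease_expired):
        """Apply a write only once every caching client acked or timed out."""
        waiting = {c for c, cached in self.clients.items()
                   if name in cached and c not in acked
                   and c not in lease_expired}
        if waiting:
            return False                  # write blocked: invalidations pending
        for cached in self.clients.values():
            cached.discard(name)          # all caches of this node are invalid
        self.files[name] = data
        return True

m = CachingMaster({"c1": {"/f"}, "c2": {"/f"}})
assert not m.write("/f", b"v2", acked={"c1"}, lease_expired=set())
assert m.write("/f", b"v2", acked={"c1"}, lease_expired={"c2"})
```

Note the second call succeeds without c2's ack: letting c2's lease expire is as good as an acknowledgement, which is why invalidations need to be sent only once.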

Besides data and meta-data, clients may also cache open handles, provided this does not affect the semantics observed by the client.
Clients can also cache locks, holding them longer than strictly necessary in case the same client needs the lock again; an event notifies the holder when another client needs to acquire a conflicting lock, and only then does the holder release it. This facility is rarely used.

2.8 Sessions and KeepAlives

A Chubby session is the association between a Chubby cell and a Chubby client. The client's handles, locks, and cached data all remain valid within the session.
Session lease: a session has a limited lifetime, called a lease, within which the master guarantees not to terminate the session unilaterally. Expiry of the lease is called a session lease timeout.
KeepAlives: to keep a session alive, the client must continually use KeepAlives to make the master advance (extend) the session lease.
The master may extend the lease at any time but never shortens it, and it extends the lease in three circumstances: on session creation, when a master fail-over occurs, and when it responds to a KeepAlive RPC from the client.

The KeepAlive flow in detail:
1. The master blocks the client's KeepAlive until the client's current lease is close to expiry, then replies to the KeepAlive with a new lease timeout. The value of the lease timeout is at the master's discretion (the default is 12s); a master that considers itself overloaded may use larger values to reduce the KeepAlive traffic it must handle.
2. On receiving the master's reply, the client immediately sends a new KeepAlive to the master. Thus the client ensures there is almost always one KeepAlive call blocked at the master.
3. KeepAlive replies are used not only for extending the lease but also for delivering events and cache invalidations, so when an event or cache invalidation is pending, the KeepAlive is replied to early.
The client's local lease timeout is only an approximation of (slightly smaller than) the master's lease timeout, to account for the time KeepAlive replies spend in flight and for the master's and client's clocks not being synchronized.

Handling local lease expiry at the client: when its local lease expires, the client cannot be sure whether the master has terminated the session. It empties its cache and tries, within a grace period, to complete a KeepAlive with the master. If it succeeds, it recovers; if the grace period ends without success, the session is assumed expired. (The master never terminates a session within the lease, so after a full grace period without word from the master the client may unilaterally terminate the session.) The client then returns errors to API calls, so that the API never blocks indefinitely. The Chubby library can inform the application through jeopardy, safe, and expired events: on a jeopardy event the application can quiesce itself and resume after a safe event, thereby avoiding a restart (especially valuable for applications with large startup overhead) when the master becomes temporarily unavailable.
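The conservatism of the client's local lease can be expressed as a small calculation (the skew constant is an assumption for illustration; the paper's client also accounts for the KeepAlive round-trip by measuring from send time, as the comment notes):

```python
def local_lease_deadline(sent_at, lease_timeout_s, max_clock_skew_s=1.0):
    # Measure the lease from the time the KeepAlive was *sent*, since the
    # master may have granted it at any point before the reply arrived,
    # and subtract an allowance for master/client clock drift.
    return sent_at + lease_timeout_s - max_clock_skew_s

# A 12 s lease granted for a KeepAlive sent at t=100 expires locally at 111.
assert local_lease_deadline(100.0, 12.0) == 111.0
```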
