http://www.raychase.net/2878
http://snarfed.org/transactions_across_datacenters_io.html
Transactions Across Datacenters
(and other weekend projects)
Of three properties of distributed data systems - consistency, availability, partition tolerance - choose two.
multihoming (n): operating from multiple datacenters simultaneously
Weak consistency
After a write, reads may or may not see it
Best effort only
"Message in a bottle"
App Engine: memcache
VoIP, live online video
Realtime multiplayer games
Eventual consistency
After a write, reads will eventually see it
App Engine: mail
Search engine indexing
DNS, SMTP, snail mail
Amazon S3, SimpleDB
Why transactions?
Correctness
Consistency
Enforce invariants
ACID
Why across datacenters?
Catastrophic failures
Expected failures
Routine maintenance
Geolocality
CDNs, edge caching
Why not across datacenters?
Within a datacenter
High bandwidth: 1-100Gbps interconnects
Low latency: < 1ms within rack, 1-5ms across
Little to no cost
Between datacenters
Low bandwidth: 10Mbps-1Gbps
High latency: 50-150ms
$$$ for fiber
Option 1: Don't
...instead, bunkerize.
Most common
Microsoft Azure (tables)
Amazon SimpleDB
Bad at catastrophic failure
Large scale data loss
Example: SVColo
Not great for serving
No geolocation
Option 2: Primary with hot failover(s)
Better, but not ideal
Mediocre at catastrophic failure
Window of lost data
Failover data may be inconsistent
Amazon Web Services
EC2, S3, SQS: choose US or EU
EC2: Availability Zones, Elastic Load Balancing
Banks, brokerages, etc.
Geolocated for reads, not for writes
Multi-master replication
Umbrella term for merging concurrent writes
Asynchronous, eventual consistency
Need serialization protocol
e.g. timestamp oracle: monotonically increasing timestamps
Either SPOF with master election...
...or distributed consensus protocol
No global transactions!
Datastore: no strong consistency
Two Phase Commit
Semi-distributed consensus protocol
deterministic coordinator
1: propose, 2: vote, 3: commit/abort
Heavyweight, synchronous, high latency
3PC buys async with extra round trip
Datastore: poor throughput
Paxos
Fully distributed consensus protocol
"Either Paxos, or Paxos with cruft, or broken"
Mike Burrows
Majority writes; survives minority failure
Protocol similar to 2PC/3PC
Lighter, but still high latency
http://highscalability.com/blog/2009/8/24/how-google-serves-data-from-multiple-datacenters.html
Transactions Across Datacenters (and Other Weekend Projects) at the Google I/O 2009 conference.
The different multi-homing options are: Backups, Master-Slave, Multi-Master, 2PC, and Paxos. You'll also learn how they each fare on support for consistency, transactions, latency, throughput, data loss, and failover.
Google App Engine uses master/slave replication between datacenters. They chose this approach in order to provide:
- lowish latency writes
- datacenter failure survival
- strong consistency guarantees.
No solution is all win, so a compromise must be made depending on what you think is important. A major Google App Engine goal was to provide a strong consistency model for programmers. They also wanted to be able to survive datacenter failures. And they wanted write performance that wasn't too far behind a typical relational database. These priorities guided their architectural choices.
In the future they hope to offer optional models so you can select Paxos, 2PC, etc for your particular problem requirements (Yahoo's PNUTS does something like this).
Single Master. Pick a master datacenter that writes go to, with the other sites replicating from it. The replica sites offer read-only service.
- Better, but not great.
- Data are usually replicated asynchronously so there's a window of vulnerability for loss.
- Data in your other datacenters may not be consistent on failure.
- Popular with financial institutions.
- You get geolocation to serve reads. Consistency depends on the technique. Writes are still limited to one datacenter.
Multi-Master. True multihoming. The Holy Grail. All datacenters are serving reads and writes. All data is consistent. Transactions just work. This is really hard.
- Some choose to do it with just two datacenters. NASDAQ has two datacenters close together (low latency) and performs a two-phase commit on every transaction, but it has very strict latency requirements.
- Using more than two datacenters is fundamentally harder. You pay for it with queuing delays, routing delays, and the speed of light. You have to talk between datacenters. It is just fundamentally slower, with a smaller pipe. You may pay for it with capacity and throughput, but you'll definitely pay in latency.
Master/Slave Replication - Writes to a master are also written to one or more slaves.
- Replication is asynchronous so good for latency and throughput.
- Weak/eventual consistency unless you are very careful.
- You have multiple copies in the datacenters, so you'll lose a little data on failure, but not much. Failover can go read-only until the master has been moved to another datacenter.
- Datastore currently uses this mechanism. Truly multihoming adds latency because you have to add the extra hop between datacenters. App Engine is already slow on writes so this extra hit would be painful. M/S gives you most of the benefits of better forms while still offering lower latency writes.
Multi-Master Replication - support writes from multiple datacenters simultaneously.
- You figure out how to merge all the writes later when there's a conflict. It's like asynchronous replication, but you are serving writes from multiple locations.
- The best you can do is eventual consistency. Writes don't immediately go everywhere. This is a paradigm shift: backups and M/S assume a strongly consistent system and don't change that model, they are just techniques to help us multihome. Here it literally changes how the system runs, because the multiple writes must be merged.
- To do the merging you must find a way to serialize, to impose an ordering on all your writes. There is no global clock. Things happen in parallel. You can't ever know what happens first. So you make it up, using timestamps, local timestamps + skew, local version numbers, or a distributed consensus protocol. This is the magic, and there are a number of ways to do it (see the last-write-wins sketch after this list).
- There's no way to do a global transaction. With multiple simultaneous writes you can't guarantee transactions. So you have to figure out what to do afterward.
- AppEngine wants strong consistency to make building applications easier, so they didn't consider this option.
- Failover is easy because each datacenter can handle writes.
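To make the write-serialization idea concrete, here is a minimal last-write-wins sketch using (timestamp, replica id) pairs as version tags. This is a generic illustration of one of the orderings mentioned above, not the Datastore's actual mechanism, and the names are made up.

```python
# Minimal last-write-wins merge sketch for multi-master replication.
# Assumes each replica tags writes with (timestamp, replica_id); ties on
# timestamp are broken by replica_id so all replicas converge on one winner.
from dataclasses import dataclass

@dataclass(frozen=True)
class Write:
    key: str
    value: str
    timestamp: float   # local clock, subject to skew
    replica_id: str    # tie-breaker

def merge(local: dict, incoming: list[Write]) -> dict:
    """Apply writes received from another datacenter onto the local store."""
    for w in incoming:
        current = local.get(w.key)
        if current is None or (w.timestamp, w.replica_id) > (current.timestamp, current.replica_id):
            local[w.key] = w
    return local

store = {}
merge(store, [Write("profile:1", "alice-v1", 100.0, "us-east")])
merge(store, [Write("profile:1", "alice-v2", 99.5, "eu-west")])   # older write loses
print(store["profile:1"].value)  # alice-v1
```

The tuple comparison makes every replica pick the same winner, which is what makes the merge converge; the price is that the losing concurrent write is silently discarded.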
Two Phase Commit (2PC) - protocol for setting up transactions between distributed systems.
- Semi-distributed because there's always a master coordinator for a given 2PC transaction. Because there are so few datacenters you tend to go through the same set of master coordinators.
- It's synchronous. All transactions are serialized through that master which kills your throughput and increases latency.
- They never seriously considered this option because write throughput is very important to them. No single point of failure or serialization point would work for them. Latency is high because of the extra coordination. Writes can be in the 200 msec area (a toy coordinator sketch follows this list).
- This option does work though. You write to all datacenters or nothing. You get strong consistency and transactions.
- Need N+1 datacenters. If you take one down then you still have N to handle your load.
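A toy coordinator makes it clear why 2PC serializes writes and costs extra round trips. This is a sketch under simplifying assumptions (in-memory state, no timeouts or crash recovery), not anyone's production protocol.

```python
# Toy two-phase commit coordinator: phase 1 collects votes, phase 2
# broadcasts the decision. Participants here are plain objects; a real
# implementation would persist decisions and handle timeouts and recovery.
class Participant:
    def __init__(self, name):
        self.name = name
        self.staged = None

    def prepare(self, txn) -> bool:      # phase 1: vote
        self.staged = txn
        return True                      # vote "yes" (could validate and vote "no")

    def commit(self):                    # phase 2: apply
        print(f"{self.name} committed {self.staged}")

    def abort(self):
        self.staged = None

def two_phase_commit(participants, txn) -> bool:
    votes = [p.prepare(txn) for p in participants]   # first round trip
    if all(votes):
        for p in participants:                        # second round trip
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

two_phase_commit([Participant("dc-us"), Participant("dc-eu")], {"key": "value"})
```

Every write blocks on the slowest participant in both phases, which is exactly the write-latency cost described above.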
Paxos - A consensus protocol where a group of independent nodes reach a majority consensus on a decision.
- Protocol: there's a propose step and then an agree step. You only need a majority of nodes to agree to say something is persisted for it to be considered persisted.
- Unlike 2PC it is fully distributed. There's no single master coordinator.
- Multiple transactions can be run in parallel. There's less serialization.
- Writes are high latency because of the two extra coordination round trips required by the protocol.
- They wanted to do this, but they didn't want to pay the 150 msec latency hit on writes, especially when competing against 5 msec writes for RDBMSes.
- They tried using physically close datacenters, but the built-in multi-datacenter overhead (routers, etc.) was too high. Even within the same datacenter it was too slow.
- Paxos is still used a ton within Google. Especially for lock servers. For coordinating anything they do across datacenters. Especially when state is moved between datacenters. If your app is serving data in one datacenter and it should be moved to another that coordination is done through Paxos. It's used also in managing memcache and offline processing.
http://systemdesigns.blogspot.com/2015/11/0.html
The theme of phase 1 is laying the foundation. The goal for this week is to understand all of the most basic concepts related to system design:
• Horizontal vs. Vertical Scaling
• Load Balancer
• Database Denormalization & NoSQL
• Database Partitioning (Sharding)
• Caching
• Asynchronous Processing & Queue
• Networking Metrics
• MapReduce
### Scalability for Dummies: http://www.lecloud.net/tagged/scalability
It contains four short articles covering the Clones, Database, Cache, and Asynchronism topics. The content is easy to follow and ties the individual knowledge points together; use it to get into gear in phase 1 and go over all the basic concepts. Each article takes about 5 minutes to read.
http://www.hiredintech.com/system-design/
This overlaps with items 1 and 2; it is recommended to finish the remaining parts as well to build a solid foundation systematically.
1. Scalability
- Concept/Trade off: https://en.wikipedia.org/wiki/Scalability#Horizontal_and_vertical_scaling
- System scalability: http://article.yeeyan.org/view/21984/185038
- Horizontal VS Vertical Scaling Compare: http://www.vtagion.com/scalability-scale-up-scale-out-care/
- Trade off: http://yunjiechao-163-com.iteye.com/blog/2124300
2. Load Balancer
- Concept: https://en.wikipedia.org/wiki/Load_balancing_(computing)
- Learn Load Balancing with me (1): http://blog.chinaunix.net/uid-23629988-id-3320256.html
- Learn Load Balancing with me (2): http://blog.chinaunix.net/uid-23629988-id-3324539.html
4. Database Denormalization & NoSQL
- NOSQL: http://www.jdon.com/nosql.html
- Introduction to NoSQL: http://tsaikoga.github.io/blog/2013/10/15/nosqljian-jie/
- NoSQL vs SQL: http://dataconomy.com/sql-vs-nosql-need-know/
5. Database Partitioning (Sharding)
http://highscalability.com/blog/2009/8/6/an-unorthodox-approach-to-database-design-the-coming-of-the.html
6. Cache
- Cache methods: http://www.lecloud.net/post/9246290032/scalability-for-dummies-part-3-cache
- Cache Strategy: http://www.coderanch.com/how-to/java/CachingStrategies
7. Asynchronous Processing & Queues
http://www.cs.unc.edu/~dewan/242/s04/notes/ipc/node11.html
8. Networking Metrics
9. MapReduce
# Reading notes
https://trello.com/c/NwiVecc9 (by 陈驰)
#QUESTIONS
- Load balancer dispatch strategies
- Caching approaches
- How denormalization is implemented in NoSQL
http://systemdesigns.blogspot.com/2015/11/system-design1.html
The CAP Theorem (henceforth 'CAP') says that it is impossible to build an implementation of read-write storage in an asynchronous network that satisfies all of the following three properties: Consistency, Availability, Partition Tolerance.
SQL Model -> ACID (Atomicity, Consistency, Isolation, and Durability)
No-SQL Model -> BASE (Basically Available, Soft State, Eventual consistency)
How to architect a system to work across multiple datacenters: serving something from multiple datacenters is what we call multihoming, and it is one of the most challenging problems in distributed computing today.
The first article is about how Google serves data to users from multiple datacenters around the world, taken from a Google I/O 2009 talk. It proposes and compares five candidate approaches to the multihoming problem:
- Technique #1: Backups
- Technique #2: Master/slave replication
- Technique #3: Multi-master replication
- Technique #4: Two phase commit
- Technique #5: Paxos
In the field of distributed algorithms there is a very important algorithm called Paxos. How important is it? Google's Chubby paper notes that "all working protocols for asynchronous consensus we have so far encountered have Paxos at their core." Paxos is used to solve the consistency problem in distributed systems. It uses majority voting to provide 2F+1 fault tolerance: a system of 2F+1 nodes can tolerate at most F nodes failing at the same time. Put another way, for N servers to agree on a value, as long as more than half of the servers are alive and can communicate effectively, the value can be determined by the Paxos algorithm, and that value is unique.
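A quick, hypothetical illustration of that 2F+1 arithmetic (my own sketch, not from the article): a value counts as decided once a majority of replicas acknowledge it, and since any two majorities intersect, up to F failures cannot lose a decided value.

```python
# Majority-quorum arithmetic behind Paxos-style replication:
# with n = 2F + 1 replicas, any two majorities intersect, so a value
# acknowledged by a majority survives up to F simultaneous failures.
def quorum(n: int) -> int:
    return n // 2 + 1

def max_tolerated_failures(n: int) -> int:
    return (n - 1) // 2          # F in n = 2F + 1

def write(value: str, replicas: list[bool]) -> bool:
    """replicas[i] is True if replica i is alive and acknowledges the write."""
    acks = sum(replicas)
    return acks >= quorum(len(replicas))

print(quorum(5), max_tolerated_failures(5))              # 3 2
print(write("x=1", [True, True, True, False, False]))    # True: 3 of 5 acked
print(write("x=1", [True, True, False, False, False]))   # False: no majority
```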
https://gist.github.com/Shanni/a29a1aa741fdeb73e5fa#file-systemdesign
http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
First, communicate repeatedly to clarify the system requirements. Only when the requirements are clear can you start thinking and putting things down on paper. But requirement gathering should be continuous and incremental; it is hard to think through the whole problem from the start. The most important items include:
- Use cases: usually only 2-3 use cases need to be considered; the other parts can be addressed later, or not at all. This simplifies the problem.
- The number of users (users can be downstream systems or people) and the volume of data: clarifying these facts is undoubtedly very important and has major implications for design decisions.
- The request pattern, e.g. reads far outnumber writes, or 60% of requests hit the top 20 records.
Second, try to abstract a simple model. Start from the simple model, think through different scenarios and constraints, and refine it step by step. When it comes down to code, the core parts include:
- The definition of the model.
- The code interfaces, i.e. the API.
- How data is stored, e.g. table schemas and file layouts.
On that basis, consider the most basic components and the architectural decomposition: how many layers the system has, which components there are, what each one does, and what the likely bottlenecks are. Also decide where the API and the model fit, while repeatedly checking whether the use cases from step one are all satisfied.
Third, refine the layers and components, for example:
- Storage layer. Is persistent storage needed? Files, a relational database, or a NoSQL database?
- If a relational database is chosen, is sharding or partitioning needed? See Quora: What's the difference between sharding and partition?
- If a NoSQL database is chosen, which of C, A, and P in CAP is sacrificed and which is prioritized? See: Visual Guide to NoSQL Systems. What is the strategy for scaling out (e.g. consistent hashing)?
- Does storage need to be tiered (e.g. a cold tier of files, a warm tier in a relational database, and a hot tier in cache: storage media with different costs to handle different data access frequencies)?
- Clustering. Are all nodes peers, or is there a central primary with standbys? What is the strategy for sharing request load? How are nodes added and removed? How is node health determined? Are sessions needed, and how are they synchronized? For the difference between scale up and scale out, see Wikipedia: Scalability.
- Messaging model. What are the producer and consumer rates? Is a buffering queue needed when consumers cannot keep up? How is message flow controlled, and at what granularity is the rate controlled?
- Caching. How is the cache tiered? A distributed deployment or a centralized cache service? Which cache eviction algorithm (e.g. LRU; see the short sketch after this list)? See: In-Process Caching vs. Distributed Caching.
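As promised in the caching bullet, a minimal in-process LRU cache can be sketched in a few lines (a generic illustration, not tied to any product in the reading list):

```python
# Minimal LRU cache: an OrderedDict keeps keys in access order, so the
# least recently used entry is always at the front and is evicted first.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)           # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict least recently used

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                               # "a" is now most recent
cache.put("c", 3)                            # evicts "b"
print(cache.get("b"))                        # None
```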
Identifying system bottlenecks and scaling are two closely related topics. Optimize based on actual needs: applying a cache, for example, should rest on a bottleneck that has been identified (or at least assumed to be identified), otherwise you are shooting in the dark. Consider scaling only after the bottleneck is resolved.
Finally, keep discussing and refining; every iteration of discussion should reach a concrete conclusion, instead of lingering at too high a level of abstraction. Many aspects can be involved here, including scalability, data volume, throughput, maintainability, real-time behavior, data consistency, and so on.
So, summarized into four parts: first clarify the requirements from outside the system, then design the core model and API, then do the basic layering and component decomposition, and only at the end refine the design of each layer or component. Work from the outside in, peeling back layer by layer.
These points are easier said than done, but by repeatedly reading about and thinking through some common system design scenarios, we can still distill a number of patterns from them.
Below are hints for several very common and typical system design problems:
1. How to design a Weibo/Twitter-like system (a news feed system)
- Think about the read/write model; in terms of latency, the demands on reads are clearly higher than on writes.
- Reposts and replies: copy the original post's text, or store the repost/reply tree? Analyze the trade-offs. Product design is also involved here; see: Twitter Vs. Weibo: 8 Things Twitter Can Learn From The Latter.
- Distinguish the two typical ways of triggering message propagation: push on change and pull on demand; the pros and cons of each are clear (a small sketch of both follows this list). See: Why Are Facebook, Digg, And Twitter So Hard To Scale?
- Tiered storage. Here A in CAP matters most; C can often be sacrificed in favor of eventual consistency.
- Cache design: how does data flow between the tiers? How are hot items identified?
- Design of the post-deletion feature.
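The sketch below contrasts the two propagation styles from the bullets above: push on change (fan-out on write into follower inboxes) versus pull on demand (merge followees' posts at read time). It is an illustrative toy with in-memory dicts, not a real feed design.

```python
# News feed fan-out sketch: push-on-change writes into every follower's
# inbox at post time; pull-on-demand merges followees' recent posts at
# read time. Push makes reads cheap, pull makes writes cheap.
from collections import defaultdict

posts_by_user = defaultdict(list)     # user -> [post, ...]
inbox = defaultdict(list)             # user -> precomputed timeline (push model)
followers = {"alice": ["bob", "carol"]}
following = {"bob": ["alice"], "carol": ["alice"]}

def post_push(user, text):
    posts_by_user[user].append(text)
    for f in followers.get(user, []):        # O(followers) work at write time
        inbox[f].append(text)

def read_push(user):
    return inbox[user][-10:]                 # cheap read from the inbox

def post_pull(user, text):
    posts_by_user[user].append(text)         # cheap write

def read_pull(user):
    merged = []
    for followee in following.get(user, []): # O(following) work at read time
        merged.extend(posts_by_user[followee][-10:])
    return merged[-10:]

post_push("alice", "hello")
print(read_push("bob"), read_pull("carol"))  # ['hello'] ['hello']
```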
2. How to design a short URL mapping system (a tiny URL system)
- Think about the read/write model: reads clearly take priority over writes, and entries usually never need to be modified. Separate the read and write services so that reads keep running even if the write service crashes.
- For the hashing and mapping used to shorten links, how is the algorithm chosen? Which characters are acceptable in a short link? Here you can estimate how many distinct links a short link of length n can represent under a given rule (see the base62 sketch after this list).
- If a unified sequence allocator is used for short links, how do you make that allocation logic distributed? It must not become a bottleneck. For example, a common approach is to let a relational database's auto-increment index hand out unique ids; but if no relational database is used, how do you generate unique sequence numbers in a distributed system while avoiding conflicts? See: how to generate globally unique ids in a highly concurrent distributed system.
- Should the last part of the original link be preserved? E.g. compress http://abc.def.com/p/124/article/12306.html into http://short.url/asdfasdf/12306.html to improve readability.
- Since the set of protocols is predictable and small (say only http and https are supported), can the protocol part be dropped and stored as an enum alongside the short-link body?
- Since this is a web service, consider validating URLs, throttling request traffic, and handling malicious access.
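The base62 sketch referenced above makes the capacity estimate concrete: with 62 allowed characters, a code of length n covers 62^n links (62^7 is roughly 3.5 trillion). The encoding below is a common illustrative scheme, not a recommendation from the original notes.

```python
# Base62 sketch for a URL shortener: encode a unique numeric id into a
# short code. With 62 characters, n positions give 62**n possible codes.
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 chars

def encode(num: int) -> str:
    if num == 0:
        return ALPHABET[0]
    out = []
    while num > 0:
        num, rem = divmod(num, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

def capacity(length: int) -> int:
    return 62 ** length

print(encode(125))          # "21"
print(capacity(7))          # 3521614606208 (~3.5 trillion short links)
```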
3. How to design a real-time chat system? It could be client-server (CS), like MSN or QQ, or browser-server (BS), like Facebook Chat or Weibo direct messages.
- On login the friend status list must be fetched; this is usually also the priority-0 use case.
- What makes this problem distinctive is that client and server are naturally separated, so one of the core issues is sorting out the interaction between them. For a BS architecture, how do you make server-to-client message push possible? Ajax + Comet is one option: hold the connection open and push messages over it. Another option is client polling, but that clearly cannot meet the real-time requirement.
- With Comet, the server will hold a large number of idle connections; see a summary of the differences between select, poll, and epoll. For message handling, introduce coroutines to reduce thread scheduling overhead; see: Difference between a "coroutine" and a "thread"?
- In a CS setup, messages and presence can also travel peer to peer; in a BS setup, a server-side messaging system must be implemented; see: the real-time messaging protocol XMPP.
- For storage, how do you weigh CAP? Which one can be sacrificed?
- If message history is required, that is another story entirely. Other extensions worth discussing include system-wide broadcast messages, friend status updates, message search, and so on.
There are some other typical and classic system design problems; the ones that come to mind are listed below and will be written up later when there is time:
- Search engine design (including a web crawler)
- Email system design (e.g. GMail)
In any case, thinking through these problems repeatedly is a lot of fun. There seems to be little material that can be studied directly for this, and unlike algorithms there is no cut-and-dried answer; the evaluation criteria are quite fuzzy.
https://everythingisdata.wordpress.com/2009/10/17/numbers-everyone-should-know/
| Operation | Time (nsec) |
| --- | --- |
| L1 cache reference | 0.5 |
| Branch mispredict | 5 |
| L2 cache reference | 7 |
| Mutex lock/unlock | 25 |
| Main memory reference | 100 |
| Compress 1 KB with Zippy | 3,000 |
| Send 2 KB over 1 Gbps network | 20,000 |
| Read 1 MB sequentially from memory | 250,000 |
| Round trip within same datacenter | 500,000 |
| Disk seek | 10,000,000 |
| Read 1 MB sequentially from disk | 20,000,000 |
| Send packet CA -> Netherlands -> CA | 150,000,000 |
Some useful figures that aren't in Dean's data can be found in this article comparing NetBSD 2.0 and FreeBSD 5.3 from 2005. Approximating those figures, we get:
| Operation | Time (nsec) |
| --- | --- |
| System call overhead | 400 |
| Context switch between processes | 3,000 |
| fork() (statically-linked binary) | 70,000 |
| fork() (dynamically-linked binary) | 160,000 |
http://katemats.com/distributed-systems-basics-handling-failure-fault-tolerance-and-monitoring/
In small, self-contained systems it is much easier to simulate the conditions required to replicate and debug issues, with most of these issues classified as being a Bohrbug, that is a bug “that manifests itself consistently under a well-defined (but possibly unknown) set of conditions” [3]. However, in more complex systems or production environments having many servers, it can be extremely difficult to find and diagnose more unusual bugs; like the Heisenbug “that disappears or alters its characteristics when an attempt is made to study it” [3].
When designing distributed systems, it is said that the following (perhaps normal) assumptions should be considered false (and these are so well known that they are commonly referred to as the Fallacies of Distributed Computing):
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn’t change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous. [4]
Fault Tolerance:
Another important part of service-based architectures is to set up each service to be fault tolerant, so that in the event one of its dependencies is unavailable or returns an error, it is able to handle those cases and degrade gracefully. There are many methods for achieving fault tolerance in a distributed system, for example: redundancy (as described above), standbys, feature flags, and asynchrony (a small sketch combining two of these follows below).
Standbys – a standby is exactly that, a redundant set of functionality or data waiting on standby that may be swapped to replace another failing instance. Replication can be utilized to maintain real time copies of the master database so that data may be replaced without loss or disruption.
Feature flags – a feature flag is used to enable or disable functionality in a production system. In the event of a failure for a particular system, features that depend on that system can be turned off and made unavailable until that system comes back online.
Asynchrony – this is probably one of the more important design considerations in any distributed application. It essentially means that each service, or functional piece of the system, communicates with each of its external dependencies asynchronously, so that slow or unavailable services do not directly impact the primary functioning of the application. This also typically implies that operations aren't tightly coupled, requiring the success of one operation for another to succeed, like a transaction, and don't require services to be available to handle requests. For example, writing an image could return success to the client, even if all the copies of the file haven't been created and written to the file store. Some of the pieces of the "file write" are asynchronous – there will be a file upload, converting the file to the desired format/size, writing the converted file to disk, and then replicating it to its redundant copy. While all of those are required for the file write to be considered a success, each of those only has to occur eventually, meaning that the client can receive "success" once the file is uploaded and the rest of the operations can happen asynchronously afterward. This means that if one of the downstream services was unavailable or congested, the other parts of the write could continue operating as expected (in this case, accepting files from clients).
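As a small, hypothetical illustration of combining feature flags with graceful degradation (not from the article): guard the call to a dependency with a flag and a fallback, so an outage degrades the feature instead of failing the request.

```python
# Graceful degradation sketch: a dependency call is guarded by a feature
# flag and a try/except, so an outage degrades the feature instead of
# failing the whole request.
FLAGS = {"image_resizing": True}    # toggled off when the resizer is down

def resize_image(data: bytes) -> bytes:
    raise ConnectionError("resizer unavailable")     # simulate an outage

def get_thumbnail(data: bytes) -> bytes:
    if not FLAGS.get("image_resizing", False):
        return data                       # feature off: serve the original
    try:
        return resize_image(data)
    except ConnectionError:
        FLAGS["image_resizing"] = False   # trip the flag until it recovers
        return data

print(len(get_thumbnail(b"raw-bytes")))   # falls back to the original bytes
```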
Monitor each service independently of one another. Since each service has its own focus of control, making sure that each one is monitored is a key part of recognizing and diagnosing problems within that service. The monitoring for each service may not be the same, but often there is some overlap in monitored metrics (like standard system metrics, i.e. CPU usage, memory usage, disk and network I/O, etc.). In the image hosting example, some of the service-specific monitoring would be the speed of reads and how many concurrent reads were happening for the Image Retrieval Service, whereas the Image Write Service would watch the write queue and the number of connections.
http://horicky.blogspot.com/2008/03/web-site-scalability.html
Dynamic Content
- Most of the displayed content is dynamic. Some application logic is executed at the web server, which generates the HTML for the client browser. The efficiency of the application logic has a huge impact on the overall site's scalability. This is our main topic here.
- Sometimes it is possible to pre-generate dynamic content and store it as static content. When the real request comes in, instead of re-running the application logic to generate the page, we just need to look up the pre-generated page, which can be much faster.
There are two layers of dispatching for a client making an HTTP request to reach the application server:
DNS Resolution based on user proximity
- Depending on the location of the client (derived from the IP address), the DNS server can return an ordered list of sites according to a proximity measurement, so the client request is routed to the datacenter closest to them.
- After that, the client browser will cache the server IP.
Load balancer
- A load balancer (hardware-based or software-based) sits in front of a pool of homogeneous servers providing the same application services. The load balancer's job is to decide which member of the pool should handle the request.
- The decision can be based on various strategies: simple ones include round robin or random; more sophisticated ones track the workload of each member (e.g. by measuring response time) and dispatch the request to the least busy one.
- A member of the pool can also monitor its own workload and mark itself down (by not responding to the load balancer's ping).
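A sketch of the two dispatch strategies just mentioned, round robin and least-busy. It is illustrative only; real load balancers also run health checks and track response times.

```python
# Load balancer dispatch sketch: round robin cycles through the pool,
# least-busy picks the member with the fewest in-flight requests.
import itertools

class Pool:
    def __init__(self, servers):
        self.servers = list(servers)
        self.in_flight = {s: 0 for s in servers}
        self._rr = itertools.cycle(self.servers)

    def pick_round_robin(self):
        return next(self._rr)

    def pick_least_busy(self):
        return min(self.servers, key=lambda s: self.in_flight[s])

    def dispatch(self, strategy):
        server = strategy()
        self.in_flight[server] += 1      # decremented when the response returns
        return server

pool = Pool(["app-1", "app-2", "app-3"])
print(pool.dispatch(pool.pick_round_robin))   # app-1
print(pool.dispatch(pool.pick_least_busy))    # app-2 (app-1 already has one in flight)
```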
Client communication
This is concerned with designing an effective mechanism to communicate with the client, which is typically the browser making some HTTP call (maybe AJAX as well).
Designing the granularity of service call
- Reduce the number of round trips by using a coarse-grained API model so your client makes one call rather than many small calls.
- Don't send back more data than your client needs.
- Consider using an incremental processing model. Just send back sufficient results for the first page. Use a cursor model to compute more results for subsequent pages in case the client needs them. It is good, though, to return an estimate of the total number of matching results to the client.
- If you have control of the client side (e.g. you provide the JavaScript library which is making the request), then you can choose a more compact encoding scheme and not worry about compatibility.
- If not, you have to use a standard encoding mechanism such as XML. You also need to publish the XML schema of the message (the contract is the message format).
- If the message size is big, we can apply a compression technique (e.g. gzip) to the message before sending it.
- You are trading CPU for bandwidth savings; better to measure whether this is a gain first.
- AJAX fits very well here. The user can proceed to do other things while the server is working on the request.
- Consider not sending the result at all. Rather than sending the final order status to the client who sent an order placement request, consider sending an email acknowledgment.
A typical web transaction involves multiple steps, and session state needs to be maintained across multiple interactions.
Memory-based session state with Load balancer affinity
- One way is to store the state in the App Server's local memory. But we need to make sure subsequent requests land on the same App Server instance, otherwise it cannot access the previously stored session state.
- Load balancer affinity needs to be turned on. Typically requests with the same cookie will be routed to the same app server.
- Another way is to have the App Servers share a global session state by replicating changes to each other.
- Double-check the latency of replication so we can make sure there is enough time for the replication to complete before the subsequent request arrives.
- Store the session state in a DB which can be accessed by any App Server in the pool.
- Under this model, the cookie is used to store the IP address of the last app server that processed the client request.
- When the next request comes in, the dispatcher is free to forward it to any member of the pool. The app server that receives this request will examine the IP address of the last server and pull over the session state from there.
- If the session state is small, you don't need to store it on the server side at all. You can just embed all the information inside a cookie and send it back to the client.
- You need to digitally sign the cookie so that it cannot be modified.
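A sketch of the signed-cookie approach from the last two bullets: the session state travels in the cookie together with an HMAC, so the server can detect any client-side modification. Frameworks ship hardened versions of this; the code below is only an illustration.

```python
# Signed cookie sketch: session state is serialized into the cookie along
# with an HMAC, so the server can detect any client-side modification.
import hmac, hashlib, json, base64

SECRET = b"server-side-secret"          # never sent to the client

def make_cookie(session: dict) -> str:
    payload = base64.urlsafe_b64encode(json.dumps(session).encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig

def read_cookie(cookie: str):
    payload, sig = cookie.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None                     # tampered: reject the session
    return json.loads(base64.urlsafe_b64decode(payload))

cookie = make_cookie({"user_id": 42, "cart": ["book"]})
print(read_cookie(cookie))              # {'user_id': 42, 'cart': ['book']}
print(read_cookie("x" + cookie))        # None (payload was tampered with)
```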
Caching
Remembering previous results and reusing them for future requests can drastically reduce the workload of the system. But don't cache requests that modify backend state.
http://horicky.blogspot.com/2008/02/scalable-system-design.html
- Response time, Throughput
- Rank the importance of traffic so you know what to sacrifice in case you cannot handle all of it
- Scale the system horizontally (adding more cheap machines), not vertically (upgrading to a more powerful machine)
- The ability to swap out old code and replace it with new code without worrying about breaking other parts of the system allows you to experiment with different ways of optimization quickly
- Never sacrifice code modularity for any (including performance-related) reasons
- Bottlenecks are slow code that is frequently executed. Don't optimize slow code that is rarely executed
- Write performance unit tests so you can collect fine-grained performance data at the component level
- Setup a performance lab so you can conduct end-to-end performance improvement measurement easily
- Do regular capacity planning. Collect usage statistics, predict the growth rate
Common Techniques
Server Farm (real time access)
- If there is a large number of independent (potentially concurrent) requests, you can use a server farm, which is basically a set of identically configured machines fronted by a load balancer.
- The application itself needs to be stateless so requests can be dispatched purely based on load conditions and not on other factors.
- Incoming requests are dispatched by the load balancer to different machines, so the workload is spread and shared across the servers in the farm.
- The architecture allows horizontal growth: when the workload increases, you can just add more server instances to the farm.
- This strategy is even more effective when combined with cloud computing, as adding more VM instances to the farm is just an API call.
Data Partitioning
- Spread your data across multiple DBs so that the data access workload can be distributed across multiple servers.
- By nature, data is stateful, so there must be a deterministic mechanism to dispatch data requests to the server that hosts the data.
- The data partitioning mechanism also needs to take the data access pattern into consideration. Data that needs to be accessed together should stay on the same server. A more sophisticated approach can migrate data continuously as the access pattern shifts.
- Most distributed key/value stores do this.
Map/Reduce
- The algorithm itself needs to be parallelizable. This usually means the steps of execution should be relatively independent of each other.
- Google's Map/Reduce is a good framework for this model. There is also the open-source Java framework Hadoop.
Content Delivery Network
- This is common for static media content. The idea is to create many copies of content distributed geographically across servers.
- User requests are routed to the server replica in closest proximity.
Cache Engine
- This is a time vs. space tradeoff. Some executions may use the same set of input parameters over and over again. Therefore, instead of redoing the same execution for the same input parameters, we can remember the previous execution's result.
- This is typically implemented as a lookup cache.
- Memcached and EHCache are some of the popular caching packages.
Resource Pooling
- DB sessions and TCP connections are expensive to create, so reuse them across multiple requests.
Calculate an Approximate Result
- Instead of calculating an exact answer, see if you can trade off some accuracy for speed.
- In real life, some degree of inaccuracy is usually tolerable.
Filtering at the Source
- Try to do more processing upstream (where the data gets generated) than downstream, because it reduces the amount of data being propagated.
Asynchronous Processing
- You make a call which returns a result, but you don't need to use the result until a much later stage of your process. Therefore, you don't need to wait immediately after making the call; instead you can proceed to do other things until you reach the point where you need the result.
- In addition, a waiting thread is idle but consumes system resources. For a high transaction volume, the number of idle threads is (arrival_rate * processing_time), which can be a very big number if the arrival rate is high. The system is running in a very inefficient mode.
- The service call in this example is better handled using an asynchronous processing model. This is typically done in two ways: callback and polling (a small sketch of both follows).
- In callback mode, the caller needs to provide a response handler when making the call. The call itself returns immediately, before the actual work is done on the server side. When the work is done later, the response comes back on a separate thread, which executes the previously registered response handler. Some coordination may be required between the calling thread and the callback thread.
- In polling mode, the call itself returns a "future" handle immediately. The caller can go off and do other things, and later poll the "future" handle to see if the response is ready. In this model, no extra thread is created, so no extra thread coordination is needed.
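Both asynchronous styles can be sketched with a thread-pool future: the callback style registers a handler, while the polling style holds the future and asks for the result later. This is a generic Python illustration, not tied to any specific framework.

```python
# Asynchronous call sketch: the same future supports the polling style
# (ask for the result later) and the callback style (add_done_callback).
from concurrent.futures import ThreadPoolExecutor
import time

def slow_service_call(order_id):
    time.sleep(0.1)                       # stand-in for remote work
    return f"order {order_id} accepted"

with ThreadPoolExecutor(max_workers=2) as pool:
    # Callback style: register a handler, keep doing other work.
    fut1 = pool.submit(slow_service_call, 1)
    fut1.add_done_callback(lambda f: print("callback:", f.result()))

    # Polling style: hold the future and ask for the result when needed.
    fut2 = pool.submit(slow_service_call, 2)
    do_other_work = sum(range(1000))      # caller is not blocked meanwhile
    print("polling:", fut2.result())      # blocks only if not finished yet
```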
Implementation Best Practices
- Use efficient algorithms and data structures. Analyze the time (CPU) and space (memory) complexity of logic that is executed frequently (i.e. hot spots). For example, carefully decide whether a hash table or a binary tree should be used for lookup.
- Analyze your concurrent access scenarios when multiple threads access shared data. Carefully analyze the synchronization and make sure the locking is fine-grained enough. Also watch for any possibility of deadlock and how you would detect or prevent it. A wrong concurrent access model can have a huge impact on your system's scalability. Also consider using lock-free data structures (e.g. Java's concurrent package has a couple of them).
- Analyze the memory usage patterns in your logic. Determine where new objects are created and where they become eligible for garbage collection. Be aware of creating a lot of short-lived temporary objects, as they put a high load on the garbage collector.
- However, never trade off code readability for performance. (e.g. Don't try to bundle too much logic into a single method). Let the VM handle this execution for you.
Scalable System Design Patterns
Load Balancer
Scatter and Gather
In this model, the dispatcher multicasts the request to all workers in the pool. Each worker computes a local result and sends it back to the dispatcher, who consolidates them into a single response and sends it back to the client.
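A scatter-and-gather sketch using a thread pool as the worker pool; the shard names and data are made up for illustration.

```python
# Scatter and gather: the dispatcher sends the query to every worker in
# parallel, then consolidates the partial results into one response.
from concurrent.futures import ThreadPoolExecutor

WORKERS = {
    "shard-1": {"apple": 3, "pear": 1},
    "shard-2": {"apple": 2},
    "shard-3": {"pear": 5},
}

def local_search(shard, term):
    return WORKERS[shard].get(term, 0)       # each worker computes a local result

def scatter_gather(term):
    with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
        futures = {shard: pool.submit(local_search, shard, term) for shard in WORKERS}
    # gather: consolidate the partial counts into a single response
    return sum(f.result() for f in futures.values())

print(scatter_gather("apple"))   # 5
```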
Result Cache
In this model, the dispatcher first looks up whether the request has been made before and tries to return the previous result, in order to save the actual execution.
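A bare-bones result cache sketch: the dispatcher keys stored results by the request, so identical requests skip the expensive execution (real deployments would add expiry and use something like memcached):

```python
# Result cache sketch: identical requests skip the expensive execution.
results = {}

def expensive_query(term):
    print("executing query for", term)       # visible only on a cache miss
    return term.upper()

def dispatch(term):
    if term not in results:
        results[term] = expensive_query(term)
    return results[term]

dispatch("foo")    # executes the query
dispatch("foo")    # served from the result cache
```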
Pipe and Filter
This model is also known as "Data Flow Programming": all workers are connected by pipes through which data flows.
This pattern is a very common EAI pattern.
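Pipe and filter maps naturally onto chained generators: each filter consumes the upstream pipe and yields into the downstream one. A minimal sketch with made-up stages:

```python
# Pipe-and-filter sketch: each stage is a generator (filter) connected to
# the next by iteration (the pipe); data flows through lazily.
def parse(lines):
    for line in lines:
        yield line.strip().split(",")

def valid(records):
    for rec in records:
        if len(rec) == 2:
            yield rec

def enrich(records):
    for name, qty in records:
        yield {"name": name, "qty": int(qty)}

raw = ["apple,3", "broken-line", "pear,5"]
pipeline = enrich(valid(parse(raw)))
print(list(pipeline))   # [{'name': 'apple', 'qty': 3}, {'name': 'pear', 'qty': 5}]
```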
http://lethain.com/introduction-to-architecting-systems-for-scale/
With that caveat in mind, what is a smart client? It is a client which takes a pool of service hosts and balances load across them, detects downed hosts and avoids sending requests their way (they also have to detect recovered hosts, deal with adding new hosts, etc, making them fun to get working decently and a terror to get working correctly).
CDNs take the burden of serving static media off of your application servers (which are typically optimized for serving dynamic pages rather than static media), and provide geographic distribution. Overall, your static assets will load more quickly and with less strain on your servers (but a new strain of business expense).
In a typical CDN setup, a request will first ask your CDN for a piece of static media; the CDN will serve that content if it has it locally available (HTTP headers are used for configuring how the CDN caches a given piece of content). If it isn't available, the CDN will query your servers for the file, then cache it locally and serve it to the requesting user (in this configuration they are acting as a read-through cache).
If your site isn't yet large enough to merit its own CDN, you can ease a future transition by serving your static media off a separate subdomain (e.g. static.example.com) using a lightweight HTTP server like Nginx, and cut over the DNS from your servers to a CDN at a later date.
Most applications start out with a web application communicating directly with a database. This approach tends to be sufficient for most applications, but there are some compelling reasons for adding a platform layer, such that your web applications communicate with a platform layer which in turn communicates with your databases.
First, separating the platform and web application allows you to scale the pieces independently. If you add a new API, you can add platform servers without adding unnecessary capacity for your web application tier. (Generally, specializing your servers' roles opens up an additional level of configuration optimization which isn't available for general-purpose machines; your database machine will usually have a high I/O load and will benefit from a solid-state drive, but your well-configured application server probably isn't reading from disk at all during normal operation, and might benefit from more CPU.)
Second, adding a platform layer can be a way to reuse your infrastructure for multiple products or interfaces (a web application, an API, an iPhone app, etc) without writing too much redundant boilerplate code for dealing with caches, databases, etc.
http://www.nomachetejuggling.com/2010/04/06/avoiding-the-big-design-interview-question/
A good indicator that the question is taking you and the candidate off-track is if the boxes he or she draws on the whiteboard lack method names. If there are no methods, there are no behaviors, which means the candidate is designing first, regardless of requirements. Steer the candidate back by asking them to write method names.
For example, ask the candidate to design the object model for a simple bookshelf. A bookshelf is so simple that there's virtually no way to overcomplicate it, so you can say that you simply need to be able to add a book to the bookshelf, nothing else. Once they have done this, give the candidate another "test", by adding a requirement. Now say that you want the bookshelf to be a "smart" bookshelf, where you can look up a book by title and get a reference to the book. The candidate will make some changes to their design. Continue adding requirements to the thing until you're satisfied with the evolution of the candidate's design.
Modify the design to allow me to search the bookshelf by Title, Author, or ISBN
Allow me to use the bookshelf as an ad-hoc library, so I can "check out" a book, removing it from the bookshelf. Make it possible to look up books that have been "borrowed".
Require a user provide their name when checking out a book. Give them a return date. Make it so the bookshelf can generate a list of overdue books and who has them. Where does that method belong? Should it be on its own object that Bookshelf uses?
Make the bookshelf more concrete, so that there is only a certain number of books that can fit on any given shelf in the overall bookshelf. Make it possible to ask which shelf a book is on. Make it possible to add a new shelf.
You can obviously pick your own model (Car, Elevator, Fridge, etc) and drive the direction the design should go using requirements. Have the candidate write method names (but not implementations) where appropriate. Keep asking if the method is on the right object, or it belongs on another one.
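One hypothetical shape the design might take after the search and check-out requirements, with the loan bookkeeping pulled out into its own object (my sketch of a possible answer, not the article's reference solution):

```python
# Hypothetical Bookshelf design after the lookup and check-out requirements:
# the shelf owns storage and search; a separate Ledger owns loan records.
from datetime import date, timedelta

class Book:
    def __init__(self, title, author, isbn):
        self.title, self.author, self.isbn = title, author, isbn

    def __repr__(self):
        return f"Book({self.title})"

class Bookshelf:
    def __init__(self):
        self.books = []

    def add_book(self, book):
        self.books.append(book)

    def find(self, **criteria):          # search by title, author, or isbn
        return [b for b in self.books
                if all(getattr(b, k) == v for k, v in criteria.items())]

    def remove(self, book):
        self.books.remove(book)

class Ledger:                            # check-out bookkeeping lives off the shelf
    def __init__(self, shelf):
        self.shelf, self.loans = shelf, {}

    def check_out(self, book, borrower, days=14):
        self.shelf.remove(book)
        self.loans[book] = (borrower, date.today() + timedelta(days=days))

    def overdue(self, today=None):
        today = today or date.today()
        return [(b, who) for b, (who, due) in self.loans.items() if due < today]

shelf = Bookshelf()
shelf.add_book(Book("SICP", "Abelson", "978-0262510875"))
ledger = Ledger(shelf)
ledger.check_out(shelf.find(title="SICP")[0], "alice")
print(ledger.overdue(date.today() + timedelta(days=30)))  # [(Book(SICP), 'alice')]
```

Putting the overdue-list method on a Ledger rather than on Bookshelf keeps the shelf focused on storage and search, which is one reasonable answer to the "where does that method belong?" question above.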
If you're in an interview you may get the "design an object model for X" question. If you find yourself in this situation, you essentially want to ask questions of your interviewer to pull requirements out of them. The reality is, they probably DO have a particular kind of system in mind when they ask the question, so you need to find out what that is.
If you're supposed to design the Car class, ask if there are other types of vehicles in the system. If not, don't make a Vehicle parent class, and explain that you see no need unless the system required additional vehicles. Before you draw a box named Tire, ask if the car needs to be able to support snow tires or some other kind of interchangeable tire. If not, why would you make a class or interface for them? Start as simply as you can and only draw a new box when you've extracted enough information from the interviewer to deem it warranted. If you jump up to the whiteboard and draw two or three boxes before asking any questions, you're placing your interview at risk because the interviewer may be picturing a completely different usage of this system than you are.
https://blog.eood.cn/scalable_system_patterns
Key points of scalable system design
1. Fail fast and return errors quickly
2. Break large, complex requests into multiple smaller requests
3. Use timeouts
4. Use caching
5. Use queues as buffers
6. Measure every step precisely and keep detailed logs
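A sketch combining points 1, 3, and 6 above: wrap a downstream call with a timeout so it fails fast, and log how long the step took. Illustrative only; a real system would lean on its RPC framework's deadlines and metrics.

```python
# Fast-fail + timeout + per-step measurement sketch for a downstream call.
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time, logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

def call_with_timeout(fn, timeout_s, *args):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=timeout_s)   # fail fast on slow calls
        except TimeoutError:
            future.cancel()
            raise RuntimeError(f"{fn.__name__} timed out after {timeout_s}s")
        finally:
            logging.info("%s took %.3fs", fn.__name__, time.monotonic() - start)

def slow_dependency(x):
    time.sleep(0.2)
    return x * 2

print(call_with_timeout(slow_dependency, 1.0, 21))   # 42, with a timing log line
```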