Thursday, July 23, 2015

Achieving Rapid Response Times in Large Online Services - Jeff Dean



http://conferences.oreilly.com/velocity/velocity2014/public/schedule/detail/34266
https://www.youtube.com/watch?v=1-3Ahy7Fxsc
http://static.googleusercontent.com/media/research.google.com/en//people/jeff/Berkeley-Latency-Mar2012.pdf
http://joepaperreview.blogspot.com/2013/12/dean-jeff-achieving-rapid-response.html
Achieving Rapid Response Times in Large Online Services
Large Fanout Services

Why Does Fanout Make Things Harder?
Overall latency ≥ latency of slowest component
–small blips on individual machines cause delays
–touching more machines increases likelihood of delays
• Server with 1 ms avg. but 1 sec 99%ile latency
–touch 1 of these: 1% of requests take ≥1 sec
–touch 100 of these: 63% of requests take ≥1 sec
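The arithmetic behind these numbers can be checked directly; a quick sketch in plain Python, using the 1% single-server tail probability from the slide above:

```python
# Probability that a fanout request hits at least one slow server,
# given each server independently has a 1% chance of a >= 1 s response.
def p_slow(n, p_single=0.01):
    return 1 - (1 - p_single) ** n

print(round(p_slow(1), 2))    # 0.01 -> 1% of requests take >= 1 s
print(round(p_slow(100), 2))  # 0.63 -> 63% of requests take >= 1 s
```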

One Approach: Squash All Variability
Careful engineering of all components of the system
• Possible at small scale
–dedicated resources
–complete control over whole system
–careful understanding of all background activities
–less likely to have hardware fail in bizarre ways
• System changes are difficult
–software or hardware changes affect delicate balance
Not tenable at large scale: need to share resources

Shared Environment
Huge benefit: greatly increased utilization
• ... but hard-to-predict effects increase variability
–network congestion
–background activities
–bursts of foreground activity
–not just your jobs, but everyone else’s jobs, too
• Exacerbated by large fanout systems

Basic Latency Reduction Techniques
Differentiated service classes
–prioritized request queues in servers
–prioritized network traffic
• Reduce head-of-line blocking
–break large requests into sequence of small requests
Manage expensive background activities
–e.g. log compaction in distributed storage systems
–rate limit activity
–defer expensive activity until load is lower
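The rate-limiting idea can be sketched with a token bucket; the class below is a minimal illustration, not from the talk, and the rates and capacities are made up:

```python
import time

class TokenBucket:
    """Rate-limit expensive background work (e.g. log compaction)."""
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # burst limit
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost=1.0):
        """Spend `cost` tokens if available; otherwise the background
        task should back off and retry later."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A compaction loop would call `try_acquire()` before each chunk of work, so bursts of compaction cannot starve foreground requests.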

Synchronized Disruption
Large systems often have background daemons
–various monitoring and system maintenance tasks
• Initial intuition: randomize when each machine performs these tasks
–actually a very bad idea for high fanout services
• at any given moment, at least one or a few machines are slow
• Better to actually synchronize the disruptions
–run every five minutes “on the dot”
–one synchronized blip better than unsynchronized
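Running "on the dot" just means every machine computes the same next wall-clock boundary instead of a random jitter; a minimal sketch:

```python
import time

def seconds_until_next_slot(now=None, period=300):
    """How long to sleep so the next maintenance run lands exactly on a
    shared wall-clock boundary (every five minutes 'on the dot').
    Every machine computes the same boundary, so the disruption is one
    synchronized blip rather than a constant trickle of slow machines."""
    now = time.time() if now is None else now
    return period - (now % period)
```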

Tolerating Faults vs. Tolerating Variability
Tolerating faults:
– rely on extra resources
• RAIDed disks, ECC memory, dist. system components, etc.
– make a reliable whole out of unreliable parts
• Tolerating variability:
– use these same extra resources
– make a predictable whole out of unpredictable parts

Time scales are very different:
– variability: 1000s of disruptions/sec, scale of milliseconds
– faults: 10s of failures per day, scale of tens of seconds

Latency Tolerating Techniques
Cross request adaptation
–examine recent behavior
–take action to improve latency of future requests
–typically relate to balancing load across set of servers
–time scale: 10s of seconds to minutes
Within request adaptation
–cope with slow subsystems in context of higher level request
–time scale: right now, while user is waiting

Fine-Grained Dynamic Partitioning
Partition large datasets/computations
– more than 1 partition per machine (often 10-100/machine)
– e.g. BigTable, query serving systems, GFS, ...

• Can shed load in few percent increments
–prioritize shifting load when imbalance is more severe
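One way to sketch shedding load in few-percent increments (machine names and load units are illustrative, not from the talk):

```python
def rebalance_step(load_by_machine):
    """One step of shedding load in small increments: move the lightest
    partition from the most loaded machine to the least loaded one.
    With 10-100 partitions per machine, each move shifts only a few
    percent of the hot machine's load."""
    totals = {m: sum(parts) for m, parts in load_by_machine.items()}
    hot = max(totals, key=totals.get)
    cool = min(totals, key=totals.get)
    if hot == cool:
        return None  # nothing to shift
    # pick the lightest partition on the hot machine and move it
    idx = min(range(len(load_by_machine[hot])),
              key=load_by_machine[hot].__getitem__)
    load_by_machine[cool].append(load_by_machine[hot].pop(idx))
    return hot, cool
```

Repeating the step until the totals are close enough gives the coarse-grained balancer; prioritizing the most imbalanced pair first matches the bullet above.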

Speeds Failure Recovery
Many machines each recover one or a few partitions
–e.g. BigTable tablets, GFS chunks, query serving shards

Selective Replication
Find heavily used items and make more replicas
–can be static or dynamic
• Example: Query serving system
–static: more replicas of important docs
–dynamic: more replicas of Chinese documents as Chinese query load increases
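A dynamic policy can be as simple as scaling replica count with observed load; the baseline count and load-per-replica constant below are illustrative, not from the talk:

```python
def replica_counts(load_by_item, base_replicas=2, load_per_extra=100.0):
    """Dynamic selective replication sketch: every item keeps a baseline
    number of replicas, plus one extra replica per `load_per_extra`
    units of recent query load."""
    return {item: base_replicas + int(load // load_per_extra)
            for item, load in load_by_item.items()}
```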

Latency-Induced Probation
Servers sometimes become slow to respond
–could be data dependent, but...
–often due to interference effects
• e.g. CPU or network spike for other jobs running on shared server
• Non-intuitive: remove capacity under load to improve latency (?!)
• Initiate corrective action
–e.g. make copies of partitions on other servers
–continue sending shadow stream of requests to server
• keep measuring latency
• return to service when latency back down for long enough
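A sketch of the probation logic, with made-up thresholds: a server that turns slow is removed from live traffic but keeps receiving shadow measurements until its latency has been back down for long enough.

```python
import collections

class ProbationTracker:
    """Latency-induced probation: a slow server stops receiving live
    traffic but still gets a shadow stream of measured requests; it
    returns to service after enough consecutive healthy samples.
    Thresholds here are illustrative, not from the talk."""
    def __init__(self, slow_ms=100.0, window=20, healthy_needed=10):
        self.slow_ms = slow_ms
        self.healthy_needed = healthy_needed
        self.samples = collections.defaultdict(
            lambda: collections.deque(maxlen=window))
        self.on_probation = set()

    def record(self, server, latency_ms):
        s = self.samples[server]
        s.append(latency_ms)
        if latency_ms > self.slow_ms:
            self.on_probation.add(server)
        elif server in self.on_probation:
            recent = list(s)[-self.healthy_needed:]
            if (len(recent) >= self.healthy_needed
                    and all(x <= self.slow_ms for x in recent)):
                # latency back down for long enough: return to service
                self.on_probation.discard(server)

    def live_servers(self, all_servers):
        return [s for s in all_servers if s not in self.on_probation]
```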

Handling Within-Request Variability
Take action within single high-level request
• Goals:
–reduce overall latency
–don’t increase resource use too much
–keep serving systems safe

Data Independent Failures
Canary Requests
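The slides leave this as a heading; the summary later in this post explains the idea: send the request to one server first, and fan out to the rest only after the canary responds safely, so a "query of death" cannot take down every replica at once. A minimal sketch, where the `handle` callable stands in for the per-server RPC:

```python
def canary_fanout(servers, request, handle):
    """Canary requests: try the request on one server first; only if
    the canary survives do we fan the request out to the rest."""
    canary, rest = servers[0], servers[1:]
    try:
        first = handle(canary, request)
    except Exception:
        raise RuntimeError(
            "canary failed; not fanning out a possible query of death")
    return [first] + [handle(s, request) for s in rest]
```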
Backup Requests
Backup Requests Effects
In-memory BigTable lookups
–data replicated in two in-memory tables
–issue requests for 1000 keys spread across 100 tablets
–measure elapsed time until data for last key arrives

                    Avg    Std Dev   95%ile  99%ile  99.9%ile
No backups          33 ms  1524 ms   24 ms   52 ms   994 ms
Backup after 10 ms  14 ms  4 ms      20 ms   23 ms   50 ms
Backup after 50 ms  16 ms  12 ms     57 ms   63 ms   68 ms
Modest increase in request load:
– 10 ms delay: <5% extra requests; 50 ms delay: <1%
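The mechanism can be sketched with thread-pool futures; the replica callables are a hypothetical stand-in for RPC stubs, and only the delay-then-hedge logic is the point:

```python
import concurrent.futures as cf
import time

def fetch_with_backup(replicas, key, backup_delay=0.010):
    """Backup requests: send to one replica; if no answer within
    backup_delay seconds, send the same request to a second replica
    and use whichever response arrives first."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(replicas[0], key)]
        done, _ = cf.wait(futures, timeout=backup_delay)
        if not done:  # primary is slow -> fire the backup request
            futures.append(pool.submit(replicas[1], key))
            done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()

# Demo: a slow primary plus a fast secondary replica.
slow = lambda k: time.sleep(0.2) or "slow:" + k
fast = lambda k: "fast:" + k
answer = fetch_with_backup([slow, fast], "row42", backup_delay=0.01)
# -> "fast:row42" (the backup wins while the primary is still busy)
```

When the primary answers within the delay, no backup is sent at all, which is why the extra request load stays in the low single-digit percentages.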

Backup Requests w/ Cross-Server Cancellation
Each request identifies other server(s) to which request might be sent
Read operations in distributed file system client
– send request to first replica
– wait 2 ms, and send to second replica
– servers cancel request on other replica when starting read
• Time for BigTable monitoring ops that touch disk
Cluster state    Policy             50%ile  90%ile  99%ile  99.9%ile
Mostly idle      No backups         19 ms   38 ms   67 ms   98 ms
                 Backup after 2 ms  16 ms   28 ms   38 ms   51 ms
With heavy sort  No backups         24 ms   56 ms   108 ms  159 ms
                 Backup after 2 ms  19 ms   35 ms   67 ms   108 ms
Backups cause only about 1% extra disk reads
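Cross-server cancellation can be simulated locally with a shared flag; in a real system the flag is a cancel message between the two servers, and everything here except the delay-then-backup shape is illustrative:

```python
import threading
import time

def read_with_cancellation(replicas, block_id, backup_delay=0.002):
    """Backup requests with cross-server cancellation, simulated
    locally: each replica checks a shared 'claimed' flag before
    starting the expensive read, so normally only one replica does
    the disk work."""
    claimed = threading.Event()
    result = {}

    def serve(replica):
        if claimed.is_set():
            return              # peer already started: drop the request
        claimed.set()           # claim the work, cancelling the peer
        result.setdefault("data", replica(block_id))

    primary = threading.Thread(target=serve, args=(replicas[0],))
    primary.start()
    time.sleep(backup_delay)    # backup fires only after a short wait
    backup = threading.Thread(target=serve, args=(replicas[1],))
    backup.start()
    primary.join(); backup.join()
    return result["data"]
```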

Backup Request Variants
• Many variants possible:
• Send to third replica after longer delay
–sending to two gives almost all the benefit, however.
• Keep requests in other queues, but reduce priority
• Can handle Reed-Solomon reconstruction similarly

Tainted Partial Results
Many systems can tolerate inexact results
–information retrieval systems
• search 99.9% of docs in 200ms better than 100% in 1000ms
–complex web pages with many sub-components
• e.g. okay to skip spelling correction service if it is slow
Design to proactively abandon slow subsystems
–set cutoffs dynamically based on recent measurements
• can tradeoff completeness vs. responsiveness
–important to mark such results as tainted in caches
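A sketch of proactively abandoning slow subsystems: fan out, wait until a deadline, and mark the result tainted if anything was dropped. Subsystem names and the deadline are made up:

```python
import concurrent.futures as cf
import time

def gather_with_deadline(subsystems, query, deadline_s):
    """Fan the query out to sub-services, return whatever answered by
    the deadline, and flag the result as tainted if anything was
    dropped, so a cache will not treat it as complete."""
    results, tainted = {}, False
    with cf.ThreadPoolExecutor(max_workers=len(subsystems)) as pool:
        futures = {pool.submit(fn, query): name
                   for name, fn in subsystems.items()}
        done, not_done = cf.wait(futures, timeout=deadline_s)
        for f in done:
            results[futures[f]] = f.result()
        if not_done:
            tainted = True      # incomplete: must not be cached as full
            for f in not_done:
                f.cancel()      # best effort; a running worker finishes anyway
    return results, tainted

# Demo: the spelling service is slow, so we answer without it.
subsystems = {
    "web": lambda q: "web results for " + q,
    "spelling": lambda q: time.sleep(0.5) or "suggestions for " + q,
}
results, tainted = gather_with_deadline(subsystems, "jaguar", deadline_s=0.1)
```

A production version would set `deadline_s` dynamically from recent latency measurements, as the bullet above suggests.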

Hardware Trends
Some good:
–lower-latency networks make things like backup request cancellations work better
• Some not so good:
–plethora of CPU and device sleep modes save power,
but add latency variability
–higher number of “wimpy” cores => higher fanout => more variability
• Software techniques can reduce variability despite increasing variability in underlying hardware

Conclusions
• Tolerating variability
– important for large-scale online services
– large fanout magnifies importance
– makes services more responsive
– saves significant computing resources
• Collection of techniques
–general good engineering practices
• prioritized server queues, careful management of background activities
–cross-request adaptation
• load balancing, micro-partitioning
–within-request adaptation
• backup requests, backup requests w/ cancellation, tainted results
Jeff Dean on Achieving Rapid Response Times in Large Online Services
http://www.infoq.com/cn/news/2014/07/jeff-dean-large-online-service
Backup Requests

Jeff opened with Google's search service as an example of a large fanout service: a single search request involves many subsystems (web, news, images, video, blogs, and so on) in order to return richer results. At Google, machines are essentially never dedicated to a particular service; instead, services are all deployed into a shared pool of machines, known as a shared environment. Google's shared environment comprises roughly the following: Linux, the cluster scheduling system, file-system chunkservers, various other system services, Bigtable tablet servers, random MapReduce jobs, CPU-intensive jobs, and random applications. The benefit is greatly increased utilization, but it also brings many unpredictable problems, such as network congestion. The long tail of response times is especially pronounced: a server might average 1 ms per request yet have a 99%ile latency above 1 second, and in a large fanout service that must call 100 such servers to produce the final result, 63% of requests will take longer than 1 second.
(Note: 99%ile means the 99th percentile, i.e. the latency below which 99% of requests fall.)
Against the latency problem there are some basic latency-reduction techniques:
  • Define differentiated service classes, with prioritized request queues on servers and prioritized network traffic.
  • Reduce head-of-line blocking by breaking large requests into a series of small ones. For example, if one read needs 64 MB of data and a 100 KB read must wait for it to finish, splitting the large request into many small requests lets the 100 KB read be handled promptly.
  • Manage expensive background activities well; log compaction in a distributed storage system is one such activity, and it can be scheduled for periods of low load.
Jeff's point is that the job is to build a reliable whole out of a pile of unreliable resources, and a predictable whole out of a pile of unpredictable resources. For latency, he divides the techniques into two groups:
  • Cross-request adaptation: observe recent behavior and take measures that optimize the handling of subsequent requests, usually tied to load balancing; these take effect over roughly tens of seconds to a few minutes.
  • Within-request adaptation: within the current request, take measures against slow subsystems to improve the overall response time of this request; these usually take effect immediately.
He then walked through each group in detail.

1. Cross-request adaptation
With fine-grained dynamic partitioning, a large dataset or computation is split across many servers, typically 10 to 100 partitions per machine; BigTable and GFS work this way.
The benefits are obvious. When a machine is overloaded, part of its content can be moved to another machine; failure recovery is faster because multiple servers each recover different partitions; and selective replication becomes possible, generating extra replicas of heavily used content either dynamically or statically.
When a server's responses slow down, the cause may be the data being served, but more often it is interference, such as CPU or network spikes from other jobs on the shared server. One option is to move part of the load to other servers to improve latency; another is to create more replicas of its partitions on other servers while continuing to send requests to the slow server (live traffic goes to the other servers; the requests sent to this one are only a mirror), keep observing its latency, and let it resume normal service once latency has been back to normal for long enough.
2. Within-request adaptation
The goal of within-request adaptation is to reduce overall latency without adding much resource consumption, while keeping the system running safely.
Some failures are related to the request data: "strange data" not encountered during testing can crash the system, and if a problematic request is sent to all nodes at once, every node goes down. Google uses what are called canary requests: the request is first sent to a single node, and once a response comes back, essentially proving the query is safe, it is sent on to the other nodes.
(Blogger's note: we really did once have anomalous queries crash every backend leaf node; canary requests solve exactly this problem.)
Jeff then spent some time on another technique they use, backup requests. The rough idea: send the request to one replica, where it enters the queue; then send the same request to another replica. If the second replica processes it quickly and returns the result, the client sends the first replica a cancellation.
(Blogger's note: when working on retrieval optimization we never had a good solution for slow leaves, which drag out the latency of a whole request; backup requests solve this nicely without adding much extra load.)
Take in-memory BigTable lookups as an example: the data is stored in two replicas, and 1000 key lookups spread across 100 tablets are issued, measuring the total time until the last key returns. With no backup requests the average is 33 ms, the 99%ile 52 ms, and the 99.9%ile as high as 994 ms. Sending a backup request after 10 ms drops the average to 14 ms, and the 99%ile and 99.9%ile to 23 ms and 50 ms respectively. Sending the backup after 50 ms was also tested: still better than no backups, though somewhat worse than the 10 ms variant. The 10 ms backup adds less than 5% extra request load, and the 50 ms backup less than 1%. Backup requests thus greatly relieve the long-tail latency effect without adding much overhead.
Backup requests can be optimized further: include in the request the identity of the other server that may handle it, and as soon as one server starts executing, it directly notifies the other to cancel. This optimization is called cross-server cancellation. Jeff again gave an example: read requests in a distributed file system client, with the backup request sent after a 2 ms wait.
  • Cluster in an idle state
    • No backups: 50%ile 19 ms, 90%ile 38 ms, 99%ile 67 ms, 99.9%ile 98 ms
    • With backups: 50%ile 16 ms, 90%ile 28 ms, 99%ile 38 ms, 99.9%ile 51 ms
  • Cluster running a heavy sort workload
    • No backups: 50%ile 24 ms, 90%ile 56 ms, 99%ile 108 ms, 99.9%ile 159 ms
    • With backups: 50%ile 19 ms, 90%ile 35 ms, 99%ile 67 ms, 99.9%ile 108 ms
In both cases backup requests improve latency significantly, cutting the 99%ile by 43% and 38% respectively, and in the second case they introduce only about 1% extra disk reads. Without backup requests the cluster would have to be kept at low load; with them, it can run at relatively high load while still maintaining relatively good response latency.
The worst case for backup requests is when the two machines receive the request almost simultaneously and both process it, which wastes some resources. A third request could of course be introduced, but sending to two servers is normally already enough. During the Office Hours after the talk, Jeff noted that backup requests are not a panacea: for requests that cannot safely execute twice, such as online transactions, backup requests must be avoided, or data inconsistencies could result.


Read the full article: Jeff Dean on Achieving Rapid Response Times in Large Online Services
