Sunday, June 28, 2015

Designs, Lessons and Advice from Building Large Distributed Systems

Reliability & Availability
• Things will crash. Deal with it!
– Assume you could start with super-reliable servers (MTBF of 30 years)
– Build a computing system with 10,000 of those
– Watch roughly one fail per day (see the sketch after this list)
• Fault-tolerant software is inevitable
• Typical yearly flakiness metrics
– 1-5% of your disk drives will die
– Servers will crash at least twice (2-4% failure rate)
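
The arithmetic behind the "one failure per day" claim is worth making explicit. A minimal Python sketch (illustrative only; the talk itself contains no code):

```python
# Why fault-tolerant software is inevitable: even with a 30-year MTBF per
# server, a fleet of 10,000 servers loses about one machine per day.
SERVERS = 10_000
MTBF_YEARS = 30
DAYS_PER_YEAR = 365

mtbf_days = MTBF_YEARS * DAYS_PER_YEAR      # ~10,950 days between failures, per server
failures_per_day = SERVERS / mtbf_days      # expected failures per day across the fleet

print(f"Expected failures per day: {failures_per_day:.2f}")  # ~0.91, i.e. roughly one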

• Cluster is 1000s of machines, typically one or a handful of configurations
• File system (GFS) + cluster scheduling system are core services
• Typically 100s to 1000s of active jobs (some w/1 task, some w/1000s)
• Mix of batch and low-latency, user-facing production jobs

Distributed systems are a must:
– data, request volume, or both are too large for a single machine
• careful design about how to partition problems (a sketch follows this list)
• need high-capacity systems even within a single datacenter
– multiple datacenters, all around the world
• almost all products deployed in multiple locations
– services used heavily even internally
• a web search touches 50+ separate services, 1000s of machines
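
To make the partitioning point above a bit more concrete, here is a minimal, hypothetical sketch (not from the talk) of routing keys to a fixed number of shards with a stable hash; `NUM_SHARDS` and `shard_for` are made-up names, and real systems add replication, rebalancing, and consistent hashing on top of this.

```python
import hashlib

# Hypothetical example: spread data and request volume across shards so no
# single machine has to hold everything or serve every request.
NUM_SHARDS = 64

def shard_for(key: str) -> int:
    """Map a key to a shard with a stable hash so that every frontend
    routes the same key to the same backend."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

print(shard_for("some-document-id"))  # same shard every time for this key
```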

Protocol Buffers
• Good protocol description language is vital
• Desired attributes:
– self-describing, multiple language support
– efficient to encode/decode (200+ MB/s), compact serialized form
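
As a rough illustration of why the serialized form is compact, here is a small Python sketch (not part of the slides) of the tag-plus-varint scheme that protocol buffers use on the wire for integer fields; the function names are made up for the example.

```python
# Protocol-buffer-style wire encoding of one integer field: a tag byte
# (field number and wire type), then the value as a base-128 varint
# (7 bits per byte; the high bit means "more bytes follow").

def encode_varint(value: int) -> bytes:
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_int_field(field_number: int, value: int) -> bytes:
    tag = (field_number << 3) | 0    # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)

# Field #1 set to 300 serializes to just three bytes: 08 ac 02.
print(encode_int_field(1, 300).hex())
```

Small integers cost one or two bytes instead of a fixed four or eight, which is where much of the compactness comes from.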

Designing Efficient Systems
Given a basic problem definition, how do you choose the "best" solution?
• Best could be simplest, highest performance, easiest to extend, etc.
Important skill: ability to estimate performance of a system design
– without actually having to build it!

Back of the Envelope Calculations
How long to generate image results page (30 thumbnails)?
Design 1: Read serially, thumbnail 256K images on the fly
30 seeks * 10 ms/seek + 30 * 256K / 30 MB/s = 560 ms
Design 2: Issue reads in parallel:
10 ms/seek + 256K read / 30 MB/s = 18 ms
(Ignores variance, so really more like 30-60 ms, probably)
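
The same two estimates, written out as runnable arithmetic (a sketch; the constants are the ones quoted above):

```python
# Back-of-the-envelope numbers from above: 10 ms per disk seek,
# 30 MB/s sequential read, 30 thumbnails of 256 KB each.
SEEK_MS = 10
READ_BYTES_PER_S = 30e6
IMAGE_BYTES = 256 * 1024
IMAGES = 30

read_ms = IMAGE_BYTES / READ_BYTES_PER_S * 1000   # ~8.7 ms to read one image

serial_ms = IMAGES * (SEEK_MS + read_ms)          # Design 1: one image after another
parallel_ms = SEEK_MS + read_ms                   # Design 2: all reads in parallel

print(f"Design 1 (serial reads):   {serial_ms:.0f} ms")   # ~560 ms
print(f"Design 2 (parallel reads): {parallel_ms:.0f} ms")  # ~19 ms, i.e. the ~18 ms above
```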

Lots of variations:
– caching (single images? whole sets of thumbnails?)
– pre-computing thumbnails
– …
Back of the envelope helps identify most promising…

Know Your Basic Building Blocks
Core language libraries, basic data structures,
protocol buffers, GFS, BigTable,
indexing systems, MySQL, MapReduce, …
Not just their interfaces, but understand their implementations (at least at a high level)
If you don’t know what’s going on, you can’t do decent back-of-the-envelope calculations!

Encoding Your Data
• CPUs are fast, memory/bandwidth are precious, ergo…
– Variable-length encodings
– Compression
– Compact in-memory representations
• Compression/encoding very important for many systems
– inverted index posting list formats
– storage systems for persistent data
• We have lots of core libraries in this area
– Many tradeoffs: space, encoding/decoding speed, etc. E.g.:
• Zippy: encode@300 MB/s, decode@600 MB/s, 2-4X compression
• gzip: encode@25 MB/s, decode@200 MB/s, 4-6X compression
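
A rough way to see these space/speed tradeoffs for yourself is to measure them. The sketch below (not from the talk) uses the standard-library zlib codec as a stand-in; Zippy itself was later released as the open-source Snappy library.

```python
import time
import zlib

def measure(name, data, compress, decompress):
    """Report compression ratio and rough encode/decode throughput."""
    start = time.perf_counter()
    packed = compress(data)
    encode_s = time.perf_counter() - start

    start = time.perf_counter()
    decompress(packed)
    decode_s = time.perf_counter() - start

    mb = len(data) / 1e6
    print(f"{name}: {len(data) / len(packed):.1f}X compression, "
          f"encode {mb / encode_s:.0f} MB/s, decode {mb / decode_s:.0f} MB/s")

# Repetitive sample data; real posting lists or logs will behave differently.
data = b"the quick brown fox jumps over the lazy dog " * 100_000

measure("zlib level 1 (fast)", data, lambda d: zlib.compress(d, 1), zlib.decompress)
measure("zlib level 9 (small)", data, lambda d: zlib.compress(d, 9), zlib.decompress)
```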

Designing & Building Infrastructure
Identify common problems, and build software systems to address them in a general way

Don't build infrastructure just for its own sake
• Identify common needs and address them
• Don't imagine unlikely potential needs that aren't really there
• Best approach: use your own infrastructure (especially at first!)
– (much more rapid feedback about what works, what doesn't)

Design for Growth
Try to anticipate how requirements will evolve
– keep likely features in mind as you design the base system
Ensure your design works if scale changes by 10X or 20X
– but the right solution for X is often not optimal for 100X

Interactive Apps: Design for Low Latency
• Aim for low avg. times (happy users!)
– 90%ile and 99%ile also very important
– Think about how much data you're shuffling around
• e.g. dozens of 1 MB RPCs per user request -> latency will be lousy
• Worry about variance!
– Redundancy or timeouts can help bring in the latency tail (see the sketch after this list)
• Judicious use of caching can help
• Use higher priorities for interactive requests
• Parallelism helps!
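
One common way to use redundancy and timeouts to bring in the latency tail is a hedged (backup) request: if the first replica has not answered within a small budget, issue the same request to a second replica and keep whichever answer arrives first. A minimal asyncio sketch, with made-up replica names and latencies:

```python
import asyncio
import random

async def query_backend(replica: str, payload: str) -> str:
    # Stand-in for an RPC whose latency has a long tail.
    await asyncio.sleep(random.choice([0.005, 0.005, 0.005, 0.200]))
    return f"{replica}: result for {payload!r}"

async def hedged_request(payload: str, replicas: list, hedge_after: float = 0.010) -> str:
    """Send to the first replica; if it hasn't answered within hedge_after
    seconds, also send to a second replica and take the first answer."""
    tasks = [asyncio.create_task(query_backend(replicas[0], payload))]
    done, pending = await asyncio.wait(tasks, timeout=hedge_after)
    if not done:
        tasks.append(asyncio.create_task(query_backend(replicas[1], payload)))
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()   # the slower copy of the request is no longer needed
    return done.pop().result()

print(asyncio.run(hedged_request("web search", ["replica-a", "replica-b"])))
```

The tradeoff is a small amount of extra load in exchange for a much better 99th percentile.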

Making Applications Robust Against Failures
Canary requests
http://highscalability.com/blog/2010/11/22/strategy-google-sends-canary-requests-into-the-data-mine.html


  • Test against logs. Google replays a month's worth of logs to see if any of those queries kill anything. That helps, but Queries of Death may still happen.
  • Send a canary request. A request is sent to one machine. If the request succeeds, it will probably succeed on all machines, so go ahead with the query. If the request fails, then only one machine is down, no big deal. Now try the request again on another machine to verify that it really is a query of death. If the request fails a certain number of times, the request is rejected and logged for further debugging.
The result is that only a few servers crash instead of 1000s. This is a pretty clever technique, especially given the combined trends of scale-out and continuous deployment. It could also be a useful strategy for others.
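
A sketch of the canary logic described above, with illustrative names and thresholds; `send` stands in for whatever RPC layer actually dispatches the query:

```python
# Canary requests: try a potentially dangerous query on a couple of machines
# before fanning it out to thousands of them.
MAX_CANARY_ATTEMPTS = 3   # illustrative threshold

def handle_query(query, backends, send):
    """send(backend, query) returns True on success, False if the backend
    failed or crashed; both are stand-ins for a real RPC layer."""
    for canary in backends[:MAX_CANARY_ATTEMPTS]:
        if send(canary, query):
            # The canary survived, so the query is probably safe everywhere.
            return [send(b, query) for b in backends]
    # Every canary failed: likely a query of death. Reject and log it.
    print(f"rejecting probable query of death: {query!r}")
    return None
```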

Failover to other replicas/datacenters
Bad backend detection:
stop using it for live requests until its behavior gets better (see the sketch after this list)
More aggressive load balancing when imbalance is more severe
Make your apps do something reasonable even if not all is right
– Better to give users limited functionality than an error page
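
A minimal sketch of bad-backend detection in the spirit described above: stop sending live traffic to a replica whose recent error rate spikes, then re-admit it after a cool-off. The window size, threshold, and cool-off period are illustrative, not from the talk.

```python
import collections
import time

class BackendHealth:
    """Track recent results for one backend and quarantine it when its
    error rate over a sliding window gets too high."""

    def __init__(self, window=100, max_error_rate=0.5, cooloff_s=30.0):
        self.results = collections.deque(maxlen=window)  # recent True/False outcomes
        self.max_error_rate = max_error_rate
        self.cooloff_s = cooloff_s
        self.quarantined_until = 0.0

    def record(self, ok: bool) -> None:
        self.results.append(ok)
        if len(self.results) == self.results.maxlen:
            error_rate = self.results.count(False) / len(self.results)
            if error_rate > self.max_error_rate:
                self.quarantined_until = time.monotonic() + self.cooloff_s
                self.results.clear()

    def usable_for_live_requests(self) -> bool:
        return time.monotonic() >= self.quarantined_until
```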

Add Sufficient Monitoring/Status/Debugging Hooks
All our servers:
• Export HTML-based status pages for easy diagnosis
• Export a collection of key-value pairs via a standard interface (a minimal sketch follows this list)
– monitoring systems periodically collect this from running servers
• RPC subsystem collects sample of all requests, all error requests, all requests >0.0s, >0.05s, >0.1s, >0.5s, >1s, etc.
• Support low-overhead online profiling
– cpu profiling
– memory profiling
– lock contention profiling
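
A toy version of the key-value status export (in the spirit of the status pages above, not Google's actual interface); the counter names and port are made up:

```python
import http.server
import json
import threading

# Counters a real server would update as it handles traffic.
counters = {"requests": 0, "errors": 0}
counters_lock = threading.Lock()

class StatusHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # A monitoring system can scrape this page periodically from every server.
        with counters_lock:
            body = json.dumps(counters).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), StatusHandler).serve_forever()
```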

http://perspectives.mvdirona.com/2009/10/jeff-dean-design-lessons-and-advice-from-building-large-scale-distributed-systems/
http://glinden.blogspot.com/2009/10/advice-from-google-on-large-distributed.html
Video:
Building Software Systems At Google and Lessons Learned at Stanford
Building Large Systems at Google GoogleTalksArchive
Designs, Lessons and Advice from Building Large Distributed Systems 
