Designs, Lessons and Advice from Building Large Distributed Systems
Make your apps do something reasonable even if not all is right
– Better to give users limited functionality than an error page
• Use higher priorities for interactive requests
Don't build infrastructure just for its own sake
• Identify common needs and address them
• Don't imagine unlikely potential needs that aren't really there
http://massivetechinterview.blogspot.com/2015/12/system-design-snake-netflix-now-now-now.html
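Kilobit: data design, storage models for different kinds of data
- For example, the user service can use MySQL, which handles query-heavy logic well
- Movie files are stored as plain files, not in a database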
1.1 Step 1. Clarify Requirements and Specs
1.2 Step 2. Sketch Out High Level Design
1.3.1 Load Balancer
1.3.2 Reverse Proxy
Stateless
MicroServices
The single responsibility principle advocates small and autonomous services that work together, so that each service can do one thing well and not block others
Service Discovery - ZooKeeper
How do those services find each other? ZooKeeper is a popular, centralized choice. Each service instance registers its name, address, port, etc. under that service's path in ZooKeeper. If one service does not know where to find another, it can query ZooKeeper for the location and cache it until that location becomes unavailable.
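The lookup-and-cache behaviour described above can be sketched as follows. This is only an illustration: an in-memory dictionary stands in for a real ZooKeeper ensemble, and the `Registry` and `DiscoveryClient` classes and the `/services/<name>` path layout are hypothetical names, not part of any ZooKeeper client API.

```python
import random


class Registry:
    """In-memory stand-in for ZooKeeper: maps a service path to live instances."""

    def __init__(self):
        self._paths = {}

    def register(self, service, host, port):
        self._paths.setdefault(f"/services/{service}", []).append((host, port))

    def deregister(self, service, host, port):
        self._paths[f"/services/{service}"].remove((host, port))

    def lookup(self, service):
        return list(self._paths.get(f"/services/{service}", []))


class DiscoveryClient:
    """Caches a resolved location until the caller reports it unavailable."""

    def __init__(self, registry):
        self._registry = registry
        self._cache = {}
        self.lookups = 0  # number of round trips made to the registry

    def resolve(self, service):
        if service not in self._cache:
            self.lookups += 1
            instances = self._registry.lookup(service)
            if not instances:
                raise LookupError(f"no instances registered for {service}")
            self._cache[service] = random.choice(instances)
        return self._cache[service]

    def mark_unavailable(self, service):
        # Drop the cached location so the next resolve() asks the registry again.
        self._cache.pop(service, None)


registry = Registry()
registry.register("feed", "10.0.0.1", 8080)
client = DiscoveryClient(registry)
first = client.resolve("feed")
second = client.resolve("feed")  # answered from the local cache
```

A real deployment would more likely rely on ephemeral nodes and watches so that dead instances disappear from the registry on their own; the explicit `mark_unavailable` call is a simplification.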
Restful Web Services vs RPC
RPC is used internally by many tech companies for performance reasons, but it is rather hard to debug and inflexible. For public APIs, we therefore tend to use HTTP APIs, usually following the RESTful style.
REST (Representational state transfer of resources)
Best practice for HTTP APIs that interact with resources.
The URL only decides the location. Headers (Accept, Content-Type, etc.) decide the representation. HTTP methods (GET/POST/PUT/DELETE) decide the state transfer.
Minimizes the coupling between client and server (a huge amount of HTTP infrastructure on various clients, data marshalling).
Stateless, so it scales out.
Makes service partitioning feasible.
Used for public APIs.
Data Tiers
Do not save a blob, like a photo, into a relational database; choose the right database for the right service. For example, read performance is important for the follower service, so it makes sense to use a key-value cache. Feeds are generated as time passes, so the timestamp index of HBase / Cassandra is a great fit for this use case. Users have relationships with other users or objects, so a relational database is our default choice for the user profile service.
When we design a distributed system, trading off among CAP (consistency, availability, and partition tolerance) is almost the first thing we want to consider.
- Consistency: all nodes see the same data at the same time
- Availability: a guarantee that every request receives a response about whether it succeeded or failed
- Partition tolerance: system continues to operate despite arbitrary message loss or failure of part of the system
Frameworks
Google Protobuf, Thrift, Apache Avro
Cassandra:
SSTable + LSM tree
http://massivetechinterview.blogspot.com/2015/12/sstable-lsm-tree-cassandra-leveldb.html
http://massivetechinterview.blogspot.com/2015/10/cassandra.html
http://wiki.apache.org/cassandra/
Kafka
http://massivetechinterview.blogspot.com/2016/01/apache-kafka-file-system.html
http://massivetechinterview.blogspot.com/2015/08/kafka-internal.html
Consistent Hashing
http://massivetechinterview.blogspot.com/2015/06/system-design-for-big-data-consistent.html
Redis
http://massivetechinterview.blogspot.com/2015/06/pragmatic-programming-techniques.html
Data partition
The data partitioning mechanism also needs to take the data access pattern into consideration. Data that needs to be accessed together should stay on the same server. A more sophisticated approach can migrate data continuously as the access pattern shifts.
http://massivetechinterview.blogspot.com/2015/10/system-design-misc.html
- High Availability - Have some redundant nodes running in active/active mode.
- Use a load balancer if one server is not sufficient
- Use queuing systems for asynchronous consumption of offline loads.
If servers cannot be made stateless, then the load-balancer must be made smarter.
It should route stateful requests to the proper server.
This is done by making the load balancer inspect the session-ID of each request and
matching that with the appropriate server.
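A minimal sketch of such session-aware routing, assuming the session ID has already been extracted from the request (for example from a cookie). The `StickyLoadBalancer` class and the server names are hypothetical, and hashing the session ID is only one possible way to choose the initial server.

```python
import hashlib


class StickyLoadBalancer:
    """Routes every request that carries the same session ID to the same server."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._sessions = {}  # session_id -> server that holds the session state

    def route(self, session_id):
        server = self._sessions.get(session_id)
        if server is None or server not in self._servers:
            # First request of a session (or its server is gone): pick a server
            # from a hash of the session ID and remember the assignment.
            digest = hashlib.sha256(session_id.encode("utf-8")).digest()
            index = int.from_bytes(digest[:8], "big") % len(self._servers)
            server = self._servers[index]
            self._sessions[session_id] = server
        return server

    def remove_server(self, server):
        self._servers.remove(server)


lb = StickyLoadBalancer(["app-1", "app-2", "app-3"])
assigned = lb.route("session-abc")
```

When a server is removed, its sessions are reassigned, but whatever state lived on that server is lost, which is one reason stateless servers are preferred where possible.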
Load balancer (often deployed as a "Reverse Proxy")
Redundancy and tolerance to machine failures
Elastic load balancers can shut down some of the servers during non-peak hours.
http://massivetechinterview.blogspot.com/2015/06/flickrarchitecture-high-scalability.html
Test in production.
Be sensitive to the usage patterns for your type of application.
Be sensitive to the demands of exponential growth.
http://massivetechinterview.blogspot.com/2015/06/akfs-most-commonly-adopted.html
Design for Rollback
Design to Be Disabled
Design to Be Monitored
Isolate Faults
http://massivetechinterview.blogspot.com/2016/02/scalable-web-architectures-common.html
Three goals of application architecture:
Scale
HA
Performance
http://massivetechinterview.blogspot.com/2015/12/pinterest-building-scalableavailablesam.html
http://massivetechinterview.blogspot.com/2015/09/scalability-rules-50-principles-for.html
http://massivetechinterview.blogspot.com/2015/08/database-sharding-vs-partitioning.html
Compound primary key, e.g. (user_id, update_timestamp)
Only the first part of that key is hashed to determine the partition; the other columns are used as a concatenated index for sorting the data in Cassandra’s SSTables.
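To make the compound key concrete, here is a small sketch in which only user_id is hashed to choose a partition, while rows inside a partition stay sorted by (user_id, update_timestamp). The `Table` class, `NUM_PARTITIONS` and the use of MD5 are assumptions for illustration; MD5 stands in for Cassandra's real partitioner, and sorted Python lists stand in for SSTables.

```python
import bisect
import hashlib

NUM_PARTITIONS = 4  # assumed cluster size for the sketch


def partition_for(user_id):
    """Hash only the first key column to choose a partition."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


class Table:
    def __init__(self):
        # Each partition keeps its rows sorted by (user_id, update_timestamp).
        self._partitions = [[] for _ in range(NUM_PARTITIONS)]

    def insert(self, user_id, update_timestamp, value):
        rows = self._partitions[partition_for(user_id)]
        bisect.insort(rows, (user_id, update_timestamp, value))

    def range_query(self, user_id, start_ts, end_ts):
        """Return (timestamp, value) pairs with start_ts <= timestamp < end_ts."""
        rows = self._partitions[partition_for(user_id)]
        lo = bisect.bisect_left(rows, (user_id, start_ts))
        hi = bisect.bisect_left(rows, (user_id, end_ts))
        return [(ts, value) for _, ts, value in rows[lo:hi]]


table = Table()
table.insert("alice", 30, "c")
table.insert("alice", 10, "a")
table.insert("bob", 15, "x")
table.insert("alice", 20, "b")
recent = table.range_query("alice", 10, 30)
```

Because all of one user's updates land in a single partition and are already sorted by timestamp, a time-range query for that user touches one partition and reads a contiguous slice.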
PARTITIONING SECONDARY INDEXES BY DOCUMENT
Each partition maintains its own secondary indexes, covering only the documents in that partition. It doesn’t care what data is stored in other partitions. A document-partitioned index is also known as a local index.
Reading from such an index requires sending the query to all partitions and combining the results. This approach to querying a partitioned database is sometimes known as scatter/gather.
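A rough sketch of a document-partitioned (local) index with a scatter/gather read. The class names, the integer document IDs and the modulo placement are assumptions made for the example, and overwriting an existing document is not handled.

```python
from collections import defaultdict


class Partition:
    """Holds documents plus a local secondary index over those documents only."""

    def __init__(self):
        self.docs = {}
        self.local_index = defaultdict(set)  # (field, value) -> doc ids

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc
        for field, value in doc.items():
            self.local_index[(field, value)].add(doc_id)

    def find(self, field, value):
        return self.local_index.get((field, value), set())


class DocumentPartitionedStore:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def put(self, doc_id, doc):
        # Integer ids are assumed, so a plain modulo picks the partition.
        self.partitions[doc_id % len(self.partitions)].put(doc_id, doc)

    def query(self, field, value):
        # Scatter: ask every partition. Gather: merge the partial results.
        result = set()
        for partition in self.partitions:
            result |= partition.find(field, value)
        return sorted(result)


store = DocumentPartitionedStore(num_partitions=3)
store.put(191, {"color": "red", "make": "Honda"})
store.put(306, {"color": "red", "make": "Ford"})
store.put(515, {"color": "silver", "make": "Audi"})
red_cars = store.query("color", "red")
```

Writes stay cheap because each one touches a single partition, but every read by secondary key has to visit all partitions.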
PARTITIONING SECONDARY INDEXES BY TERM
A global index must also be partitioned, but it can be partitioned differently from the primary key index.
The advantage of a global (term-partitioned) index over a document-partitioned index is that it can make reads more efficient: rather than doing scatter/gather over all partitions, a client only needs to make a request to the partition containing the term that it wants.
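A comparable sketch of a term-partitioned (global) index, where the index is partitioned by term rather than by document. `TermPartitionedIndex`, `stable_hash` and the `field:value` term format are hypothetical names chosen for the illustration.

```python
import hashlib
from collections import defaultdict


def stable_hash(text):
    """Deterministic hash (Python's built-in hash() of str is salted per process)."""
    return int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")


class TermPartitionedIndex:
    def __init__(self, num_partitions):
        # Each index partition owns a subset of the terms, independent of
        # where the documents themselves are stored.
        self.partitions = [defaultdict(set) for _ in range(num_partitions)]
        self.partitions_read = 0

    def _partition_for(self, term):
        return self.partitions[stable_hash(term) % len(self.partitions)]

    def add(self, doc_id, doc):
        # A single document write may touch several index partitions.
        for field, value in doc.items():
            term = f"{field}:{value}"
            self._partition_for(term)[term].add(doc_id)

    def query(self, field, value):
        term = f"{field}:{value}"
        self.partitions_read += 1  # exactly one partition per term lookup
        return sorted(self._partition_for(term).get(term, set()))


index = TermPartitionedIndex(num_partitions=4)
index.add(191, {"color": "red", "make": "Honda"})
index.add(306, {"color": "red", "make": "Ford"})
index.add(515, {"color": "silver", "make": "Audi"})
red_cars = index.query("color", "red")
```

The read goes to one partition only, at the cost of writes that may have to update several index partitions for a single document.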
Practice
http://massivetechinterview.blogspot.com/2015/08/design-news-feed-system-medium.html
Get Second/Third Friends
http://massivetechinterview.blogspot.com/2015/12/linkedin-get-secondthird-friends.html
http://massivetechinterview.blogspot.com/2015/09/the-uber-software-architecture.html
http://highscalability.com/blog/2012/6/18/google-on-latency-tolerant-systems-making-a-predictable-whol.html
Backup requests with cross-server cancellation
A good solution is to have backup requests with cross-server cancellation. This is baked into TChannel as a first-class feature. A request is sent to Service B(1) along with the information that the request is also being sent to Service B(2). After some delay, the request is sent to Service B(2). When B(1) completes the request, it cancels the request on B(2). Because of the delay, in the common case B(2) performs no work. But if B(1) does fail, B(2) will process the request and return a reply with lower latency than if B(1) were tried first, a timeout occurred, and only then B(2) were tried.
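The timing of a backup request can be sketched with asyncio. Note the simplification: here the caller cancels the slower request, whereas in the scheme described above the replicas cancel each other. `call_replica` and `backup_request` are hypothetical helpers, not TChannel APIs, and failure of the primary is not handled.

```python
import asyncio


async def call_replica(name, latency, log):
    """Hypothetical replica call: sleeps for `latency` seconds, then replies."""
    log.append(f"{name} started")
    await asyncio.sleep(latency)
    return f"reply from {name}"


async def backup_request(primary, backup, backup_delay):
    """Run `primary`; start `backup` only if no reply arrives within `backup_delay`.

    `primary` and `backup` are zero-argument callables returning coroutines.
    Whichever finishes first wins and the other task is cancelled.
    """
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=backup_delay)
    if done:
        return first.result()  # common case: the backup never starts

    second = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # cancel the request that is still in flight
    return next(iter(done)).result()


async def demo():
    log = []
    fast = await backup_request(
        lambda: call_replica("B1", 0.01, log),
        lambda: call_replica("B2", 0.01, log),
        backup_delay=0.2,
    )
    slow = await backup_request(
        lambda: call_replica("B1", 1.0, log),
        lambda: call_replica("B2", 0.01, log),
        backup_delay=0.05,
    )
    return fast, slow, log


fast, slow, log = asyncio.run(demo())
```

In the first call B(1) answers before the delay expires, so B(2) is never contacted; in the second call B(1) is slow, B(2) is started after the delay and its reply is used.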
Separate failure detection from membership updates
• Do not rely on a single peer for failure detection
No data is replicated across data centers, as that puts a lot of constraints on availability and consistency. Instead, Uber uses the drivers' phones to distribute the data. The drivers' phones post location updates to the server every four seconds, and the server periodically replies with an encrypted state digest. If a data center fails, the driver's phone will contact a new data center to post a location update. The new data center doesn't know anything about this particular driver, so it asks for the state digest and picks up from there.
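The failover flow can be illustrated with a toy model. Everything here is an assumption made for the sketch: the `Datacenter` class, the reply format, the shared key, and above all the use of an HMAC-signed JSON blob, which is tamper-evident but not encrypted, in place of the encrypted digest described above.

```python
import base64
import hashlib
import hmac
import json

SHARED_KEY = b"demo-key-shared-by-all-datacenters"  # assumption for the sketch


def make_digest(state):
    """Serialize driver state and attach an HMAC so the phone cannot alter it."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    tag = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    return base64.b64encode(tag + payload).decode("ascii")


def restore_state(digest):
    """Verify the tag and rebuild the state in a data center that never saw it."""
    raw = base64.b64decode(digest)
    tag, payload = raw[:32], raw[32:]
    expected = hmac.new(SHARED_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("state digest failed verification")
    return json.loads(payload)


class Datacenter:
    def __init__(self):
        self.drivers = {}

    def location_update(self, driver_id, location, digest_from_phone=None):
        if driver_id not in self.drivers:
            if digest_from_phone is None:
                return {"need_digest": True}  # unknown driver: ask the phone
            self.drivers[driver_id] = restore_state(digest_from_phone)
        self.drivers[driver_id]["location"] = location
        return {"digest": make_digest(self.drivers[driver_id])}


dc_a = Datacenter()
dc_a.drivers["driver-7"] = {"trip": "t-42", "location": [0, 0]}
phone_digest = dc_a.location_update("driver-7", [37.77, -122.42])["digest"]

dc_b = Datacenter()  # dc_a has failed; dc_b knows nothing about driver-7
first_reply = dc_b.location_update("driver-7", [37.78, -122.41])
dc_b.location_update("driver-7", [37.78, -122.41], digest_from_phone=phone_digest)
```

The phone only stores and returns an opaque blob, so the second data center can resume the driver's state without any cross-data-center replication.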
http://massivetechinterview.blogspot.com/2015/10/logstash.html