Sunday, December 13, 2015

Instagration Pt. 2: Scaling our infrastructure to multiple data centers



http://engineering.instagram.com/posts/548723638608102/instagration-pt-2-scaling-our-infrastructure-to-multiple-data-centers/

The key to expanding to multiple data centers is to distinguish global data from local data. Global data needs to be replicated across data centers, while local data can differ per region (for example, the async jobs created by a web server only need to be viewed in that region).
The next consideration is hardware resources. These can be roughly divided into three types: storage, computing and caching.
Storage
Instagram mainly uses two backend database systems: PostgreSQL and Cassandra. Both PostgreSQL and Cassandra have mature replication frameworks that work well as a globally consistent data store.
Global data maps neatly to the data stored in these servers. The goal is eventual consistency of this data across data centers, with some replication delay. Because reads vastly outnumber writes, having a read replica in each region avoids cross-data-center reads from web servers.
Writes to PostgreSQL, however, still go across data centers, because they always go to the primary.
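As a rough sketch of this read/write split, a Django database router can send ORM reads to a region-local replica and all writes to the primary. The database aliases, module path, and router below are illustrative assumptions, not Instagram's actual configuration.

# Hypothetical Django database router: reads go to a PostgreSQL replica
# in the local region, writes always cross data centers to the primary.
# "primary" and "replica_local" are assumed aliases in settings.DATABASES.
class RegionalRouter:
    def db_for_read(self, model, **hints):
        # Serve reads from the replica in the same data center.
        return "replica_local"

    def db_for_write(self, model, **hints):
        # Writes must reach the primary, wherever it lives.
        return "primary"

    def allow_relation(self, obj1, obj2, **hints):
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Only run schema migrations against the primary.
        return db == "primary"

With DATABASE_ROUTERS = ["routers.RegionalRouter"] in settings, ORM reads stay inside the region while writes travel to the primary.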
CPU Processing
Web servers and async servers are stateless, easily distributed computing resources that only need to access data locally. Web servers create async jobs that are queued by async message brokers and then consumed by async servers, all within the same region.
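A minimal sketch of keeping that loop regional, assuming a Celery-style task queue whose broker lives in the same data center as the web server (the broker URL, region lookup, and task are assumptions for illustration):

# Web servers enqueue async jobs to a broker in their own region;
# async workers in the same region consume them, so the work never
# needs to cross data centers. All names here are illustrative.
import os
from celery import Celery

REGION = os.environ.get("REGION", "us-east")

app = Celery(
    "jobs",
    broker="amqp://rabbitmq.%s.internal:5672//" % REGION,  # region-local broker
)

@app.task
def notify_followers(photo_id):
    # Runs on an async server in the same region as the enqueueing web server.
    pass

# On the web server:
#   notify_followers.delay(photo_id=12345)   # queued and consumed locally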

Caching

The cache layer is the web servers' most frequently accessed tier, and it needs to be co-located within each data center to keep user request latency low. This means that updates to the cache in one data center are not reflected in another data center, which creates a challenge for moving to multiple data centers.
Imagine a user commented on your newly posted photo. In the one-data-center case, the web server that served the request can simply update the cache with the new comment, and a follower will see the new comment from that same cache.
In the multi-data-center scenario, however, if the commenter and the follower are served in different regions, the follower's regional cache will not be updated and the follower will not see the comment.
Our solution is to use PgQ, enhanced to insert cache invalidation events into the databases that are being modified (a sketch of this flow follows the lists below).
On the primary side:
  • Web server inserts a comment into the PostgreSQL DB;
  • Web server inserts a cache invalidation entry into the same DB.
On the replica side:
  • The primary DB is replicated, including both the newly inserted comment and the cache invalidation entry
  • The cache invalidation process reads the cache invalidation entry and invalidates the regional cache
  • Django servers read the newly inserted comment from the DB and refill the cache
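A minimal sketch of this flow, using plain SQL tables and a standard memcached client rather than Instagram's actual PgQ integration (the table names, key format, and clients are assumptions):

# Primary side: the comment and the cache invalidation entry are written
# in the same transaction, so they replicate to other regions together.
# Replica side: an invalidation process reads the replicated entries and
# drops the stale keys from the regional memcached.
# primary_conn / replica_conn are assumed to be psycopg2 connections.
import memcache

regional_cache = memcache.Client(["127.0.0.1:11211"])

def add_comment(primary_conn, photo_id, user_id, text):
    with primary_conn:  # one transaction: commits on success
        with primary_conn.cursor() as cur:
            cur.execute(
                "INSERT INTO comments (photo_id, user_id, text) VALUES (%s, %s, %s)",
                (photo_id, user_id, text),
            )
            cur.execute(
                "INSERT INTO cache_invalidation_events (cache_key) VALUES (%s)",
                ("comments:%d" % photo_id,),
            )

def process_invalidation_events(replica_conn, last_seen_id):
    with replica_conn.cursor() as cur:
        cur.execute(
            "SELECT id, cache_key FROM cache_invalidation_events "
            "WHERE id > %s ORDER BY id",
            (last_seen_id,),
        )
        for event_id, cache_key in cur.fetchall():
            regional_cache.delete(cache_key)  # next read misses and refills from DB
            last_seen_id = event_id
    return last_seen_id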
This solves the cache consistency issue. On the other hand, compared to the one-region case, where Django servers update the cache directly without re-reading from the DB, it increases the read load on the databases. To mitigate this, we took two approaches: 1) reduce the computational resources needed for each read by denormalizing counters; 2) reduce the number of reads by using cache leases.
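The real lease mechanism is a token handed out by memcached itself on a miss, so that only one client refills a key while the others wait; the sketch below only approximates that idea with a standard client's add() call. The key names, timeouts, and retry counts are assumptions.

# Simplified lease-like refill: on a miss, only the caller that wins the
# add() of the lease marker reads the database and refills the cache;
# everyone else backs off briefly instead of piling onto the DB.
import time
import memcache

cache = memcache.Client(["127.0.0.1:11211"])

def get_with_lease(key, load_from_db, lease_seconds=5, retries=3):
    for _ in range(retries):
        value = cache.get(key)
        if value is not None:
            return value
        # add() succeeds for exactly one caller: that caller holds the "lease".
        if cache.add(key + ":lease", 1, time=lease_seconds):
            value = load_from_db()   # a single DB read per invalidation
            cache.set(key, value)
            cache.delete(key + ":lease")
            return value
        time.sleep(0.05)             # another process is refilling; wait
    return load_from_db()            # give up waiting and hit the DB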

De-normalizing Counters

The most commonly cached keys are counters. For example, we would use a counter to track the number of people who liked a specific post from Justin Bieber. When there was just one region, we would update the memcache counters by incrementing them from the web servers, thereby avoiding a “select count(*)” call to the database, which would take hundreds of milliseconds.
But with two regions and PgQ invalidation, each new like creates a cache invalidation event for the counter, which leads to a lot of “select count(*)” queries, especially on hot objects.
To reduce the resources needed for each of these operations, we denormalized the counter for likes on the post. Whenever a new like comes in, the count is incremented in the database, so each read of the count is just a simple “select”, which is a lot more efficient.
There is an added benefit to denormalizing the counter in the same database where the likes themselves are stored: both updates can be included in one transaction, making them atomic and always consistent. Before the change, the counter in the cache could become inconsistent with what was stored in the database due to timeouts, retries, etc.
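A minimal sketch of that write path and the cheap read it enables, assuming a likes table and a separate counter table (the table and column names are assumptions):

# The like row and the counter update share one transaction, so they are
# atomic and always consistent; reads become a single-row SELECT instead
# of "select count(*)". conn is assumed to be a psycopg2 connection.
def add_like(conn, photo_id, user_id):
    with conn:  # one atomic transaction
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO likes (photo_id, user_id) VALUES (%s, %s)",
                (photo_id, user_id),
            )
            cur.execute(
                "UPDATE media_counters SET like_count = like_count + 1 "
                "WHERE photo_id = %s",
                (photo_id,),
            )

def like_count(conn, photo_id):
    with conn.cursor() as cur:
        # Cheap single-row read instead of counting every like row.
        cur.execute(
            "SELECT like_count FROM media_counters WHERE photo_id = %s",
            (photo_id,),
        )
        row = cur.fetchone()
        return row[0] if row else 0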
PostgreSQL Read Replicas Can't Catch up
As a Postgres primary takes in writes, it generates delta logs. The faster the writes come in, the more frequent these logs are generated. The primaries themselves store the most recent log files for occasional needs from the replicas, but they archive all the logs to storage to make sure that they are saved and accessible by any replicas that need older data than what the primary has retained. This way, the primary does not run out of disk space.
When we build a new read replica, it starts by reading a snapshot of the database from the primary. Once that is done, it needs to apply the logs generated since the snapshot was taken. When all the logs are applied, the replica is up to date and can stream from the primary and serve reads for web servers.
However, when a large database has a high write rate and there is significant network latency between the replica and the archive storage, it is possible for logs to be read more slowly than they are created. The replica falls further and further behind and never catches up!
Our fix was to start a second streamer on the new read replica as soon as it begins transferring the base snapshot from the primary. The streamer pulls logs from the primary and stores them on local disk. When the snapshot finishes transferring, the read replica can replay the logs locally, making recovery much faster.
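Roughly, the recovery setup looks like the sketch below: start streaming logs to local disk before the snapshot copy begins, then replay them locally once the snapshot is in place. The host names, paths, and tool invocations (pg_receivexlog / pg_basebackup and their options) are assumptions, not Instagram's actual tooling.

# Start a WAL streamer to local disk first, then copy the base snapshot;
# recovery can then replay the locally stored logs instead of fetching
# each one from remote archive storage.
import subprocess

PRIMARY = "pg-primary.example.internal"
DATA_DIR = "/data/postgres"
WAL_DIR = "/data/wal_stream"

# 1. Second streamer: pull logs from the primary to local disk right away.
wal_streamer = subprocess.Popen(["pg_receivexlog", "-h", PRIMARY, "-D", WAL_DIR])

# 2. Meanwhile, transfer the base snapshot from the primary.
subprocess.check_call(["pg_basebackup", "-h", PRIMARY, "-D", DATA_DIR])

# 3. Once the snapshot is done, point recovery at WAL_DIR (e.g. via a
#    restore_command that copies from it) so the logs replay locally.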
This not only solved our database replication issues across the US, but also cut down the time it took to build a new replica by half. Now, even if the primary and replica are in the same region, operational efficiency is drastically increased.
http://dockone.io/article/841
