Sunday, June 28, 2015

Instagram Architecture - Create a Photo Sharing App


http://www.ghacks.net/2015/05/29/a-close-look-at-google-photos-unlimited-storage-offer/
http://blog.instagram.com/post/122260662827/150623-search-and-explore

http://blog.gainlo.co/index.php/2016/03/01/system-design-interview-question-create-a-photo-sharing-app/
More specifically, the system allows people to follow each other and share/comment/like pictures, and maybe offers other features like explore, advertisement, and so forth.
It’s recommended to start with a high-level solution and then dig into all sorts of details later.
The advantage of this approach is that you will have a clear idea of what you are trying to solve, and interviewers are less likely to be confused.
To design a picture sharing system, it’s quite straightforward to identify two major objects – user object and picture object.
Personally, I’d like to use a relational database to explain, as it’s usually easier to understand. In this case, we will have a user table for sure, which contains information like name, email, registration date, and so on. The same goes for the picture table.
In addition, we also need to store two relations – the user follow relation and the user-picture relation. This comes very naturally, and it’s worth noting that the user follow relation is not bi-directional.
Such a data model allows users to follow each other. To build a user’s feed, we can fetch all pictures from the people they follow.
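To make the data model concrete, here is a minimal sketch in Python with SQLite; the table and column names are illustrative assumptions, not a prescribed schema:

```python
# Minimal sketch of the relational model: user table, picture table,
# and a one-directional follow relation. All names are illustrative.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE user    (id INTEGER PRIMARY KEY, name TEXT, email TEXT, registered_at TEXT);
CREATE TABLE picture (id INTEGER PRIMARY KEY, owner_id INTEGER, url TEXT, created_at TEXT);
-- user follow relation: one row per (follower, followee); NOT bi-directional
CREATE TABLE follow  (follower_id INTEGER, followee_id INTEGER,
                      PRIMARY KEY (follower_id, followee_id));
""")

db.executemany("INSERT INTO user (id, name) VALUES (?, ?)",
               [(1, "alice"), (2, "bob"), (3, "carol")])
db.executemany("INSERT INTO follow VALUES (?, ?)", [(1, 2), (1, 3)])
db.executemany("INSERT INTO picture (owner_id, url, created_at) VALUES (?, ?, ?)",
               [(2, "p1.jpg", "2015-06-01"), (3, "p2.jpg", "2015-06-02")])

# A user's feed: all pictures from everyone they follow, newest first.
feed = db.execute("""
    SELECT p.url FROM picture p
    JOIN follow f ON f.followee_id = p.owner_id
    WHERE f.follower_id = ?
    ORDER BY p.created_at DESC
""", (1,)).fetchall()
print([url for (url,) in feed])  # → ['p2.jpg', 'p1.jpg']
```

Note that the feed query above is exactly the naive "fetch everything from everyone I follow" approach whose scaling limits are discussed next.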

Potential scale issues

The above solution should work well at a small scale. As an interviewer, I always like to ask what can go wrong when we have millions of users and how to solve it.
This question is a great way to test whether a candidate can foresee potential scale issues, and it’s better than just asking how to solve problem XYZ.
Of course, there are no standard answers, but I’d like to list a few ideas as inspiration.
1. Response time
When the user base reaches a certain size, it’s quite common to see slow response time become the bottleneck.
For instance, one costly operation is rendering a user’s feed. The server has to go over everyone the user follows, fetch all the pictures from them, and rank them with a particular algorithm. When a user follows many people, each with a large number of pictures, the operation can be slow.
Various approaches can be applied here. We can upgrade the ranking algorithm if it’s the bottleneck; e.g., if we are ranking by date, we can read just the top N most recent pictures from each person and serve them with an infinite-scroll feature. Or we can use offline pipelines to precompute signals that speed up the ranking.
The point is that it’s unlikely for someone to follow hundreds of users, but it’s likely for someone to have thousands of pictures. Therefore, accelerating the picture fetching and ranking is the core.
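The "top N most recent from each followee" idea can be sketched as follows; the data layout (per-followee lists already sorted newest-first) is an assumption for illustration:

```python
# Instead of ranking every picture, pull only the N newest per followed
# user and merge them by timestamp for one page of infinite scroll.
import heapq

# per-followee picture lists, each sorted newest-first: (timestamp, picture_id)
followee_pictures = {
    "bob":   [(105, "b3"), (101, "b2"), (90, "b1")],
    "carol": [(110, "c2"), (99, "c1")],
}

def feed_page(followee_pictures, per_user=2, page_size=3):
    """Take the newest `per_user` pictures from each followee, merge all
    of them newest-first, and keep one page worth of results."""
    tops = [pics[:per_user] for pics in followee_pictures.values()]
    merged = heapq.merge(*tops, reverse=True)  # inputs are newest-first
    return [pid for _, pid in merged][:page_size]

print(feed_page(followee_pictures))  # → ['c2', 'b3', 'b2']
```

The work per request now grows with the number of followees times N, not with the total picture count of each followee.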
2. Scale architecture
When there are only tens of users and pictures, we may store and serve everything from a single server.
However, with millions of users, a single server is far from enough due to storage, memory, and CPU constraints. That’s why it’s pretty common to see servers crash under a large number of requests.
To scale the architecture, the rule of thumb is that a service-oriented architecture beats a monolithic application.
Instead of having everything together, it’s better to divide the whole system into small components by service and separate each component. For example, we can have the database separate from the web apps (on different servers), with load balancers in front.
3. Scale database
Even if we put the database on a separate server, it will not be able to store an unlimited amount of data.
At a certain point, we need to scale the database. For this specific problem, we can either do vertical splitting (partitioning) by splitting the database into sub-databases like a user database, a comment database, etc., or horizontal splitting (sharding) by splitting based on attributes, e.g., US users vs. European users.
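A minimal sketch of the horizontal-splitting idea: route each user’s rows to one of several databases by hashing the user id. The shard count and names are assumptions for illustration:

```python
# Sketch of sharding: all data for a given user lives on one shard, so
# single-user queries never have to fan out across databases.
NUM_SHARDS = 4
shards = {i: "user_db_%d" % i for i in range(NUM_SHARDS)}

def shard_for(user_id):
    """Pick the shard that owns this user's data."""
    return shards[user_id % NUM_SHARDS]

print(shard_for(42))    # → 'user_db_2'
print(shard_for(1001))  # → 'user_db_1'
```

A real system would shard on a stable hash rather than a raw modulus (so shards can be rebalanced), but the routing principle is the same.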

Feed ranking

It’s also interesting to discuss how to rank feeds (pictures) in users’ timelines.
Although it’s quite straightforward to rank everything in chronological order, is that the best approach? Such open-ended questions are very common in system design interviews.
Actually, there can be quite a few alternatives. For example, an algorithm that combines time and how likely the user will like this picture is definitely promising.
A common strategy is to come up with a scoring mechanism that takes various features as signals and computes a final score for each picture.
Intuitively, features that matter a lot include like/comment counts, whether the user has liked many photos of the owner, and so on. A linear combination of these signals can be used as a starting point because of its simplicity.
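A sketch of the linear-combination scoring idea; the signal names and weights below are illustrative assumptions, and in practice they would be tuned or learned offline:

```python
# Each picture gets a score from a weighted sum of its signals; the feed
# is then sorted by score. Signals and weights are made up for illustration.
def score(picture, weights):
    return sum(weights[name] * picture.get(name, 0.0) for name in weights)

weights = {"recency": 2.0, "likes": 0.5, "comments": 1.0, "affinity": 3.0}

pictures = [
    {"id": "a", "recency": 0.9, "likes": 4, "comments": 1, "affinity": 0.1},
    {"id": "b", "recency": 0.4, "likes": 2, "comments": 0, "affinity": 0.9},
]
ranked = sorted(pictures, key=lambda p: score(p, weights), reverse=True)
print([p["id"] for p in ranked])  # → ['a', 'b']
```

The "affinity" signal stands in for "whether the user has liked many photos of the owner"; raising its weight shifts the feed from pure popularity toward personal relevance.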

Image optimization

Since a picture sharing system is full of images, I’d like to ask what can be optimized related to images.
First of all, it’s usually recommended to store pictures separately in production; Amazon S3 is one of the most popular storage systems for this. However, you aren’t expected to come up with a specific product on your own.
The point is that images are usually of large size and seldom get updated. So a separate system for image storage has a lot of advantages. For instance, cache and replication can be much simpler when files are static.
Secondly, to save space, images should be compressed. One common approach is to store/serve only the compressed version of images. Google Photos actually uses this approach with its unlimited free storage.
Also for reference, you can check Instagram infrastructure and Flickr architecture.


Instagram uses Amazon CloudFront as its CDN to deliver images efficiently
http://www.planetcassandra.org/blog/interview/facebooks-instagram-making-the-switch-to-cassandra-from-redis-a-75-insta-savings/
Initially our deployment was for storing auditing information related to security and site integrity purposes. To break down that concept, it means fighting spam, finding abusive users, and other things like that. It was really a sweet spot for the Cassandra offering.

Originally, these features were conducted in Redis; the data size was growing too rapidly, and keeping it in memory was not a productive way to go. It was a really high write rate and really low read rate, a spot where Cassandra really pops and shines, so the switch ended up being a no-brainer for us to adopt Cassandra in that area.
For the first use case mentioned above for our backend, we moved off of a Redis master/slave replication setup; it was just too costly to have that.
We moved from having everything in memory, with very large instances, to just putting everything on disks; when you really don’t need to read that often, it works fine having it on disks.

Implementing Cassandra cut our costs to the point where we were paying around a quarter of what we were paying before. Not only that, but it also freed us to just throw data at the cluster because it was much more scalable and we could add nodes whenever needed.  When you’re going from an un-sharded setup to a sharded setup, it can be a pain; you basically get that for free with Cassandra, where you don’t have to go through the painful process of sharding your data.
Recently, we decided to port another use case that is much more critical. We spent time getting everyone on the team up-to-date with Cassandra, reading documentation, learning how to operate it effectively. We chose to use Cassandra for what we call the “inboxes” or the newsfeed part of our app.

Basically, it’s a feed of all the activity that would be associated with a given user’s account; you can see if people like your photos, follow you, if your friends have joined Instagram, received comments, etc. The reason we decided to move that to Cassandra was that it was previously in Redis and we were experiencing the same memory limitations.  
For this “inbox” use case, the feed was already sharded; it was a 32-node cluster with 16 masters and 16 fail-over replicas, and, of course, we had to go through all the sharding of things. We noticed that we were running out of space on these machines, and they weren’t really consuming a lot of CPU (Redis can be incredibly efficient with CPU), but obviously when you run out of memory… you run out of memory.
Instagram Architecture: 14 Million users, Terabytes of Photos, 100s of Instances, Dozens of Technologies - High Scalability -




  • Lessons learned: 1) Keep it very simple 2) Don't re-invent the wheel 3) Go with proven and solid technologies when you can.

  • Amazon’s Elastic Load Balancer routes requests, and 3 nginx instances sit behind the ELB.
    • SSL terminates at the ELB, which lessens the CPU load on nginx.
    • Amazon’s Route53 is used for DNS.
    • Gunicorn as their WSGI server. Apache was harder to configure and more CPU intensive.
    • Fabric is used to execute commands in parallel on all machines. A deploy takes only seconds.
    • PostgreSQL (users, photo metadata, tags, etc) runs on 12 Quadruple Extra-Large memory instances.
    • Twelve PostgreSQL replicas run in a different availability zone.
    • PostgreSQL instances run in a master-replica setup using Streaming Replication. EBS is used for snapshotting, to take frequent backups. 
    • EBS is deployed in a software RAID configuration. Uses mdadm to get decent IO.
• All of their working set is stored in memory. EBS doesn’t support enough disk seeks per second.
    • Vmtouch (portable file system cache diagnostics) is used to manage what data is in memory, especially when failing over from one machine to another, where there is no active memory profile already.
    • XFS as the file system. Used to get consistent snapshots by freezing and unfreezing the RAID arrays when snapshotting.
• Pgbouncer is used to pool connections to PostgreSQL.
    • Several terabytes of photos are stored on Amazon S3.
    • Amazon CloudFront as the CDN.
    • Redis powers their main feed, activity feed, sessions system, and other services.
    • Redis runs on several Quadruple Extra-Large Memory instances. Occasionally shard across instances.
• Redis runs in a master-replica setup. Replicas constantly save to disk. EBS snapshots back up the DB dumps. Dumping the DB on the master was too taxing.
    • 6 memcached instances for caching. Connect using pylibmc & libmemcached. Amazon Elastic Cache service isn't any cheaper.
• Gearman is used to: asynchronously share photos to Twitter, Facebook, etc.; notify real-time subscribers of a new photo posted; fan out feeds.
    • 200 Python workers consume tasks off the Gearman task queue.
    • Pyapns (Apple Push Notification Service) handles over a billion push notifications. Rock solid.
• Munin graphs metrics across the system and alerts on problems. They write many custom plugins using Python-Munin to graph signups per minute, photos posted per second, etc.
    • Pingdom for external monitoring of the service.
    • PagerDuty for handling notifications and incidents.
    http://instagram-engineering.tumblr.com/post/13649370142/what-powers-instagram-hundreds-of-instances
    We use Amazon CloudFront as our CDN, which helps with image load times from users around the world

    Sharding & IDs at Instagram
    http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
    Generated IDs should be sortable by time (so a list of photo IDs, for example, could be sorted without fetching more information about the photos)
    IDs should ideally be 64 bits (for smaller indexes, and better storage in systems like Redis)
    The system should introduce as few new ‘moving parts’ as possible—a large part of how we’ve been able to scale Instagram with very few engineers is by choosing simple, easy-to-understand solutions that we trust.

    Generate IDs in web application
    Pros:
    Each application thread generates IDs independently, minimizing points of failure and contention for ID generation
    If you use a timestamp as the first component of the ID, the IDs remain time-sortable
    Cons:

    Generally requires more storage space (96 bits or higher) to make reasonable uniqueness guarantees

    Some UUID types are completely random and have no natural sort
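The linked Instagram post describes composing IDs from 41 bits of millisecond timestamp, 13 bits of shard id, and 10 bits of per-shard sequence (done inside PostgreSQL in their case). A rough Python sketch of that bit layout, with an assumed custom epoch:

```python
# 64-bit, time-sortable ID: 41-bit ms timestamp | 13-bit shard | 10-bit seq.
# EPOCH_MS and the in-process sequence counter are simplifying assumptions;
# Instagram generates these per shard inside the database.
import time

EPOCH_MS = 1293840000000  # custom epoch: 2011-01-01 00:00:00 UTC
_sequence = 0

def next_id(shard_id, now_ms=None):
    global _sequence
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    _sequence = (_sequence + 1) % 1024           # 10-bit sequence
    ts = (now_ms - EPOCH_MS) & ((1 << 41) - 1)   # 41-bit timestamp
    return (ts << 23) | ((shard_id & 0x1FFF) << 10) | _sequence

# IDs generated later sort after IDs generated earlier:
a = next_id(shard_id=5, now_ms=EPOCH_MS + 1000)
b = next_id(shard_id=5, now_ms=EPOCH_MS + 2000)
print(a < b)  # → True
```

Because the timestamp occupies the high bits, sorting IDs numerically sorts photos by creation time without fetching anything else, which is exactly the property listed above.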

    Generate IDs through dedicated service
    DB Ticket Servers

    Uses the database’s auto-incrementing abilities to enforce uniqueness. Flickr uses this approach, but with two ticket DBs (one on odd numbers, the other on even) to avoid a single point of failure.

    Pros:
    DBs are well understood and have pretty predictable scaling factors

    Cons:
    Can eventually become a write bottleneck (though Flickr reports that, even at huge scale, it’s not an issue).
    An additional couple of machines (or EC2 instances) to admin
    If using a single DB, becomes single point of failure. If using multiple DBs, can no longer guarantee that they are sortable over time.
    Ticket Servers: Distributed Unique Primary Keys on the Cheap
    http://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/

    GUIDs?

Why not use GUIDs? Mostly because GUIDs are big, and they index badly in MySQL. One of the ways we keep MySQL fast is that we index everything we want to query on, and we only query on indexes.
    So index size is a key consideration. If you can’t keep your indexes in memory, you can’t keep your database fast. 
    Additionally ticket servers give us sequentiality which has some really nice properties including making reporting and debugging more straightforward, and enabling some caching hacks.

    Centralizing Auto-Increments

    If we can’t make MySQL auto-increments work across multiple databases, what if we just used one database? If we inserted a new row into this one database every time someone uploaded a photo we could then just use the auto-incrementing ID from that table as the primary key for all of our databases.
    REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.
    CREATE TABLE `Tickets64` (
      `id` bigint(20) unsigned NOT NULL auto_increment,
      `stub` char(1) NOT NULL default '',
      PRIMARY KEY  (`id`),
      UNIQUE KEY `stub` (`stub`)
    ) ENGINE=MyISAM
    REPLACE INTO Tickets64 (stub) VALUES ('a');
    SELECT LAST_INSERT_ID();
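The REPLACE/LAST_INSERT_ID mechanics can be simulated with SQLite, whose INSERT OR REPLACE behaves similarly: the conflicting row is deleted and a fresh one inserted, burning a new auto-increment id each time. This is only a stand-in for the MySQL/MyISAM setup shown above:

```python
# One-row ticket table: every call replaces the single 'a' row, and the
# auto-increment counter hands back the next unique id.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE Tickets64 (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    stub TEXT NOT NULL UNIQUE
)""")

def next_ticket():
    cur = db.execute("INSERT OR REPLACE INTO Tickets64 (stub) VALUES ('a')")
    return cur.lastrowid  # analogous to SELECT LAST_INSERT_ID()

print(next_ticket(), next_ticket(), next_ticket())  # → 1 2 3
```

The table always contains exactly one row, so it never grows, yet it keeps dispensing strictly increasing ids.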
    SPOFs
You really, really don’t want provisioning your IDs to be a single point of failure. We achieve “high availability” by running two ticket servers. At this write/update volume, replicating between the boxes would be problematic, and locking would kill the performance of the site. We divide responsibility between the two boxes by dividing the ID space down the middle, evens and odds, using:

    TicketServer1:
    auto-increment-increment = 2
    auto-increment-offset = 1

    TicketServer2:
    auto-increment-increment = 2
    auto-increment-offset = 2
    We round-robin between the two servers to load balance and deal with downtime. The sides do drift a bit out of sync.
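The odd/even scheme can be modeled with two counters using the auto-increment settings shown above (step 2, offsets 1 and 2) and a round-robin client; the TicketServer class is a stand-in for a real MySQL box:

```python
# Two "servers", each handing out every other integer, so their id
# spaces never collide; the client alternates between them.
import itertools

class TicketServer:
    def __init__(self, offset, increment=2):
        self.next_value = offset       # auto-increment-offset
        self.increment = increment     # auto-increment-increment

    def next_id(self):
        value = self.next_value
        self.next_value += self.increment
        return value

servers = itertools.cycle([TicketServer(offset=1), TicketServer(offset=2)])
ids = [next(servers).next_id() for _ in range(6)]
print(ids)  # → [1, 2, 3, 4, 5, 6]
```

If one server goes down, the other keeps issuing its half of the id space, which is how the scheme avoids a single point of failure at the cost of the two sides drifting slightly out of sync.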

    Fabric is a Python (2.5 or higher) library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
    Once a task is defined, it may be run on one or more servers, like so:
    $ fab -H localhost,linuxbox host_type
    Read full article from Instagram Architecture: 14 Million users, Terabytes of Photos, 100s of Instances, Dozens of Technologies - High Scalability -
