Sunday, June 28, 2015

Instagram Architecture - Create a Photo Sharing App
More specifically, the system allows people to follow each other, share/comment/like pictures, and maybe some other features like exploreadvertisement and so on so forth.
it’s recommended to start with a high-level solution and then you can dig into all sorts of details later.
The advantage of this approach is that you’re gonna have a clear idea of what you are trying to solve and interviewers are less likely to be confused.
To design a picture sharing system, it’s quite straightforward to identify two major objects – user object and picture object.
Personally, I’d like to use relational database to explain as it’s usually easier to understand. In this case, we will have a user table for sure, which contains information like name, email, registration date and so on. The same goes for picture table.
In addition, we also need to store two relations – user follow relation and user-picture relation. This comes very naturally and it’s worth to note that user follow relation is not bi-directional.
Therefore, having such data model allows users to follow each other. To check a user’s feed, we can fetch all pictures from people he follows.

Potential scale issues

The above solution should definitely work well. As an interviewer, I always like to ask what can go wrong when we have millions of users and how to solve it.
This question is a great way to test if a candidate can foresee potential scale issues and it’s better than just asking how can you solve problem XYZ.
Of course, there’re no standard answers and I would like to list few ideas as inspirations.
1. Response time
When users get to a certain number, it’s quite common to see slow response time becomes the bottleneck.
For instance, one costly operation is to render users feed. The server has to go over everyone the user follows, fetch all the pictures from them, and rank them based on particular algorithms. When a user has followed many people with a large number of pictures, the operation can be slow.
Various approaches can be applied here. We can upgrade the ranking algorithm if it’s the bottleneck, e.g. if we are ranking by date, we can just read the top N most recent pictures from each person with infinite scroll feature. Or we can use offline pipelines to precompute some signals that can speed up the ranking.
The point is that it’s unlikely to have someone following hundreds of users, but it’s likely to have someone with thousands of pictures. Therefore, accelerating the picture fetching and ranking is the core.
2. Scale architecture
When there are only tens of users and pictures, we may store and serve everything from a single server.
However, with millions of users, a single server is far from enough due to storage, memory, CPU bound issues etc.. That’s why it’s pretty common to see server crashes when there are a large number of requests.
To scale architecture, the rule of thumb is that service-oriented architecture beats monolithic application.
Instead of having everything together, it’s better to divide the whole system into small components by service and separate each component. For example, we can have database separate from web apps (in different servers) with load balancers.
3. Scale database
Even if we put the database in a separate server, it will not be able to store an infinite number of data.
At a certain point, we need to scale the database. For this specific problem, we can either do the vertical splitting (partitioning) by splitting the database into sub-databases like user database, comment database etc. or horizontal splitting (sharding) by splitting based on attributes like US users, European users.

Feed ranking

It’s also interesting to discuss how to rank feeds (pictures) in users timelines.
Although it’s quite straightforward to rank everything in chronological order, is it the best approach? Such open-ended question is very common in system design interviews.
Actually, there can be quite a few alternatives. For example, an algorithm that combines time and how likely the user will like this picture is definitely promising.
a common strategy is to come up with a scoring mechanism that takes various features as signals and computes a final score for each picture.
Intuitively, features that matter a lot include like/comment numbers, whether the user has liked many photos of the owner and so on. A linear combination can be used as a starting point due to simplicity.
Image optimization
Since a picture sharing system is full of images, I would like to ask what can be optimized related to images?
First of all, it’s usually recommended to store all pictures separately in production. Amazon S3 is one of the most popular storage systems. However, you don’t need to be able to come up with this.
The point is that images are usually of large size and seldom get updated. So a separate system for image storage has a lot of advantages. For instance, cache and replication can be much simpler when files are static.
Secondly, to save space, images should be compressed. One common approach is to only store/serve the compressed version of images. Google photos actually is using this approach with unlimited free storage.
Also for reference, you can check Instagram infrastructure and Flickr architecture.

Instagram uses Amazon CDN to deliver images efficiently
Initially our deployment was for storing auditing information related to security and site integrity purposes. To break down that concept, it means fighting spam, finding abusive users, and other things like that. It was really a sweet spot for the Cassandra offering.

Originally, these features were conducted in Redis; the data size was  growing too rapidly, and keeping it in memory was not a productive way to go. It was a really high write rate and really low read rate, a spot where Cassandra really pops and shines so the switch ended up being a no-brainer for us to adopt Cassandra in that area.  
For the first use case mentioned above for our backend, we moved off of a Redis master/slave replication setup; it was just too costly to have that.
We moved from having everything in memory, with very large instances, to just putting everything on disks; when you really don’t need to read that often, it works fine having it on disks.

Implementing Cassandra cut our costs to the point where we were paying around a quarter of what we were paying before. Not only that, but it also freed us to just throw data at the cluster because it was much more scalable and we could add nodes whenever needed.  When you’re going from an un-sharded setup to a sharded setup, it can be a pain; you basically get that for free with Cassandra, where you don’t have to go through the painful process of sharding your data.
Recently, we decided to port another use case that is much more critical. We spent time getting everyone on the team up-to-date with Cassandra, reading documentation, learning how to operate it effectively. We chose to use Cassandra for what we call the “inboxes” or the newsfeed part of our app.

Basically, it’s a feed of all the activity that would be associated with a given user’s account; you can see if people like your photos, follow you, if your friends have joined Instagram, received comments, etc. The reason we decided to move that to Cassandra was that it was previously in Redis and we were experiencing the same memory limitations.  
For this “inbox” use case, the feed was already sharded; it was a 32 node cluster with 16 masters and 16 replicas that were fail-over replicas and, of course, we had to go through all the sharing of things. We noticed that we were running out of space on these machines and they weren’t really consuming a lot of CPU (Redis can be incredibly efficient with CPU) but obviously when you run out of memory… you run out of memory.  
Instagram Architecture: 14 Million users, Terabytes of Photos, 100s of Instances, Dozens of Technologies - High Scalability -

  • Lessons learned: 1) Keep it very simple 2) Don't re-invent the wheel 3) Go with proven and solid technologies when you can.

  • Amazon’s Elastic Load Balancer routes requests and 3 nginx instances sit behind the ELB.
    SSL terminates at the ELB, which lessens the CPU load on nginx.
    Amazon’s Route53 for the DNS.
    • Gunicorn as their WSGI server. Apache harder to configure and more CPU intensive.
    • Fabric is used to execute commands in parallel on all machines. A deploy takes only seconds.
    • PostgreSQL (users, photo metadata, tags, etc) runs on 12 Quadruple Extra-Large memory instances.
    • Twelve PostgreSQL replicas run in a different availability zone.
    • PostgreSQL instances run in a master-replica setup using Streaming Replication. EBS is used for snapshotting, to take frequent backups. 
    • EBS is deployed in a software RAID configuration. Uses mdadm to get decent IO.
    • All of their working set is stored memory. EBS doesn’t support enough disk seeks per second.
    • Vmtouch (portable file system cache diagnostics) is used to manage what data is in memory, especially when failing over from one machine to another, where there is no active memory profile already.
    • XFS as the file system. Used to get consistent snapshots by freezing and unfreezing the RAID arrays when snapshotting.
    • Pgbouncer is used pool connections to PostgreSQL.
    • Several terabytes of photos are stored on Amazon S3.
    • Amazon CloudFront as the CDN.
    • Redis powers their main feed, activity feed, sessions system, and other services.
    • Redis runs on several Quadruple Extra-Large Memory instances. Occasionally shard across instances.
    • Redis runs in a master-replica setup. Replicas constantly save to disk. EBS snapshots backup the DB dumps. Dumping on the DB on the master was too taxing.
    • 6 memcached instances for caching. Connect using pylibmc & libmemcached. Amazon Elastic Cache service isn't any cheaper.
    • Gearman is used to: asynchronously share photos to Twitter, Facebook, etc; notifying real-time subscribers of a new photo posted; feed fan-out.
    • 200 Python workers consume tasks off the Gearman task queue.
    • Pyapns (Apple Push Notification Service) handles over a billion push notifications. Rock solid.
    • Munin to graph metrics across the system and alert on problems. Write many custom plugins using Python-Munin to graph, signups per minute, photos posted per second, etc.
    • Pingdom for external monitoring of the service.
    • PagerDuty for handling notifications and incidents.
    We use Amazon CloudFront as our CDN, which helps with image load times from users around the world

    Sharding & IDs at Instagram
    Generated IDs should be sortable by time (so a list of photo IDs, for example, could be sorted without fetching more information about the photos)
    IDs should ideally be 64 bits (for smaller indexes, and better storage in systems like Redis)
    The system should introduce as few new ‘moving parts’ as possible—a large part of how we’ve been able to scale Instagram with very few engineers is by choosing simple, easy-to-understand solutions that we trust.

    Generate IDs in web application
    Each application thread generates IDs independently, minimizing points of failure and contention for ID generation
    If you use a timestamp as the first component of the ID, the IDs remain time-sortable

    Generally requires more storage space (96 bits or higher) to make reasonable uniqueness guarantees

    Some UUID types are completely random and have no natural sort

    Generate IDs through dedicated service
    DB Ticket Servers

    Uses the database’s auto-incrementing abilities to enforce uniqueness. Flickr uses this approach, but with two ticket DBs (one on odd numbers, the other on even) to avoid a single point of failure.

    DBs are well understood and have pretty predictable scaling factors

    Can eventually become a write bottleneck (though Flickr reports that, even at huge scale, it’s not an issue).
    An additional couple of machines (or EC2 instances) to admin
    If using a single DB, becomes single point of failure. If using multiple DBs, can no longer guarantee that they are sortable over time.
    Ticket Servers: Distributed Unique Primary Keys on the Cheap


    why not use GUIDs? Mostly because GUIDs are big, and they index badly in MySQL. One of the ways we keep MySQL fast is we index everything we want to query on, and we only query on indexes
    So index size is a key consideration. If you can’t keep your indexes in memory, you can’t keep your database fast. 
    Additionally ticket servers give us sequentiality which has some really nice properties including making reporting and debugging more straightforward, and enabling some caching hacks.

    Centralizing Auto-Increments

    If we can’t make MySQL auto-increments work across multiple databases, what if we just used one database? If we inserted a new row into this one database every time someone uploaded a photo we could then just use the auto-incrementing ID from that table as the primary key for all of our databases.
    REPLACE works exactly like INSERT, except that if an old row in the table has the same value as a new row for a PRIMARY KEY or a UNIQUE index, the old row is deleted before the new row is inserted.
    CREATE TABLE `Tickets64` (
      `id` bigint(20) unsigned NOT NULL auto_increment,
      `stub` char(1) NOT NULL default '',
      PRIMARY KEY  (`id`),
      UNIQUE KEY `stub` (`stub`)
    REPLACE INTO Tickets64 (stub) VALUES ('a');
    You really really don’t know want provisioning your IDs to be a single point of failure. We achieve “high availability” by running two ticket servers. At this write/update volume replicating between the boxes would be problematic, and locking would kill the performance of the site. We divide responsibility between the two boxes by dividing the ID space down the middle, evens and odds, using:

    auto-increment-increment = 2
    auto-increment-offset = 1

    auto-increment-increment = 2
    auto-increment-offset = 2
    We round robin between the two servers to load balance and deal with down time. The sides do drift a bit out of sync

    Fabric is a Python (2.5 or higher) library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
    Once a task is defined, it may be run on one or more servers, like so:
    $ fab -H localhost,linuxbox host_type
    Read full article from Instagram Architecture: 14 Million users, Terabytes of Photos, 100s of Instances, Dozens of Technologies - High Scalability -

    No comments:

    Post a Comment


    Review (561) System Design (304) System Design - Review (196) Java (179) Coding (75) Interview-System Design (65) Interview (60) Book Notes (59) Coding - Review (59) to-do (45) Linux (40) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (29) Product Architecture (28) Big Data (27) Soft Skills (27) Concurrency (26) MultiThread (26) Miscs (25) Cracking Code Interview (24) Distributed (24) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) OOD Design (20) System Design - Practice (19) How to Ace Interview (16) Security (16) Algorithm (15) Brain Teaser (14) Google (14) Redis (14) Linux - Shell (13) Spark (13) Spring (13) Code Quality (12) How to (12) Interview-Database (12) Interview-Operating System (12) Tools (12) Architecture Principles (11) Company - LinkedIn (11) Solr (11) Testing (11) Resource (10) Search (10) Amazon (9) Cache (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Company - Uber (8) Interview - MultiThread (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Scalability (8) Trouble Shooting (8) Cassandra (7) Company - Facebook (7) Design (7) Git (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Machine Learning (7) NoSQL (7) C++ (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) API Design (4) Be Architect (4) Big Fata (4) C (4) Company Product Architecture (4) Data structures (4) Design Principles (4) Facebook (4) GeeksforGeeks (4) Generics (4) Google Interview (4) Hardware (4) JDK8 (4) Optimization (4) Product + Framework (4) Puzzles (4) Python (4) Shopping System (4) Source Code (4) Web Service (4) node.js (4) Back-of-Envelope (3) Chrome (3) Company - Pinterest (3) Company - Twiiter (3) Company - Twitter (3) Consistent Hash (3) Elasticsearch (3) GOF (3) Game Design (3) GeoHash (3) Growth (3) Guava (3) Html (3) Interview-Big Data (3) Interview-Linux (3) Interview-Network (3) Java EE Patterns (3) Javarevisited (3) Map Reduce (3) Math - Probabilities (3) Performance (3) RateLimiter (3) Resource-System Desgin (3) Scala (3) UML (3) ZooKeeper (3) geeksquiz (3) AI (2) Advanced data structures (2) AngularJS (2) Behavior Question (2) Bugs (2) Coding Interview (2) Company - Netflix (2) Crawler (2) Cross Data Center (2) Data Structure Design (2) Database-Shard (2) Debugging (2) Docker (2) Garbage Collection (2) Go (2) Hadoop (2) Interview - Soft Skills (2) Interview-Miscs (2) Interview-Web (2) JDK (2) Logging (2) POI (2) Papers (2) Programming (2) Project Practice (2) Random (2) Software Desgin (2) System Design - Feed (2) Thread Synchronization (2) Video (2) reddit (2) Ads (1) Algorithm - Review (1) Android (1) Approximate Algorithms (1) Base X (1) Bash (1) Books (1) C# (1) CSS (1) Client-Side (1) Cloud (1) CodingHorror (1) Company - Yelp (1) Counter (1) DSL (1) Dead Lock (1) Difficult Puzzles (1) Distributed ALgorithm (1) Eclipse (1) Facebook Interview (1) Function Design (1) Functional (1) GoLang (1) How to Solve Problems (1) ID Generation (1) IO (1) Important (1) Internals (1) Interview - Dropbox (1) Interview - Project Experience (1) Interview Stories (1) Interview Tips (1) Interview-Brain Teaser (1) Interview-How (1) Interview-Mics (1) Interview-Process (1) Java Review (1) Jeff Dean (1) Joda (1) LeetCode - Review (1) Library (1) LinkedIn (1) LintCode (1) Mac (1) Micro-Services (1) Mini System (1) MySQL (1) Nigix (1) NonBlock (1) Process (1) Productivity (1) Program Output (1) Programcreek (1) Quora (1) RPC (1) Raft (1) Reactive (1) Reading (1) Reading Code (1) Refactoring (1) Resource-Java (1) Resource-System Design (1) Resume (1) SQL (1) Sampling (1) Shuffle (1) Slide Window (1) Spotify (1) Stability (1) Storm (1) Summary (1) System Design - TODO (1) Tic Tac Toe (1) Time Management (1) Web Tools (1) algolist (1) corejavainterviewquestions (1) martin fowler (1) mitbbs (1)

    Popular Posts