Thursday, December 25, 2014

Design the Facebook timeline function
At a high level we needed to scan, aggregate, and rank posts, shares, photos and check-ins to surface the most significant events over years of Facebook activity.

After a few discussions we decided to build on four of our core technologies: MySQL/InnoDB for storage and replication, Multifeed (the technology that powers News Feed) for ranking, Thrift for communications, and memcached for caching.

Denormalizing the data
Before we began Timeline, our existing data was highly normalized, which required many round trips to the databases. Because of this, we relied on caching to keep everything fast. When data wasn’t found in cache, it was unlikely to be clustered together on disk, which led to lots of potentially slow, random disk IO. To support our ranking model for Timeline, we would have had to keep the entire data set in cache, including low-value data that wasn’t displayed.

Migrating the existing data presented three challenges:
1. Legacy data spread across many formats. We defined a custom data conversion language to transform it into the new denormalized layout.
2. Non-recent activity data had been moved to slow network storage. We hacked a read-only build of MySQL and deployed hundreds of servers to exert maximum IO pressure and copy this data out in weeks instead of months.
3. Massive join queries that did tons of random IO. We consolidated join tables into a tier of flash-only databases. Traditionally PHP can perform database queries on only one server at a time, so we wrote a parallelizing query proxy that allowed us to query the entire join tier in parallel.

Caching is an important part of any Facebook project. One of the nice properties of Timeline is that the results of big queries, such as ranking all your activity in 2010, are small and can be cached for a long period without cache invalidations. A query result cache is of huge benefit and memcached is an excellent solution.

Recent Activity changes frequently, so its query cache is frequently invalidated, but regenerating the summary of Recent Activity is quite fast. Here a row cache helps further boost query performance. We rely on the InnoDB buffer pool in RAM and our own Flashcache kernel driver to expand the OS cache onto a flash device.

Ranking is handled by Multifeed, a custom distributed system that takes the tens of thousands of updates from friends and picks the most relevant.

Denormalize. Format data in the way you need to use it.
Denormalization, creating special purpose objects instead of distributed rows that must be joined, minimizes random IO by reducing the number of trips to the database.

Caching can often get around the need for denormalization, but given the amount of Timeline data and how much of it is cold (that is, rarely viewed), caching everything isn't a good design.

Timeline is like a datamart in a data warehouse. Data must be slurped up from dozens of different systems, cleaned, merged, and reformatted into a new canonical format. Facebook of course did this in a Facebook-like way. They created a custom data conversion language, they deployed hundreds of MySQL servers to extract the data out of "legacy" systems as fast as possible, they deployed flash storage to speed up joins, they created a parallelizing query proxy, and they standardized on the Multifeed data format for future flexibility.
  • Keep different types of caches
    • Short term cache.  A timeline of recent activity is frequently invalidated because it changes all the time as you perform actions through your life. This cache is an in-RAM row cache inside InnoDB that uses the Flashcache kernel driver to expand the OS cache onto a flash device.
    • Long term cache. A query cache is kept in memcached. The results of large queries, like the ranking of all your activities in 2010, can be efficiently cached since they will rarely be invalidated.
  • Run operations locally. The Timeline Aggregator (geographically clustering nearby check-ins, ranking status updates, etc) runs on each database so it can max out the disks. Only data that needs to be displayed is sent over the network.
    • Parallelize development
    News Feed is the constantly updating list of stories in the middle of your home page. News Feed includes status updates, photos, videos, links, app activity and likes from people, Pages and groups that you follow on Facebook.
    Your Timeline, which is sometimes referred to as your profile, is your collection of the photos, stories and experiences that tell your story.

    Each user has two queues: a send queue (the feed) and a receive queue (the timeline). A user's send queue stores the tweets they have written; the receive queue holds tweets from the people they follow. The system also has an important data structure, the social graph: each user has a profile that records who they follow and who follows them.

      Figure 1: user’s feed

    Figure 2: user’s timeline
    Take Twitter as an example: Twitter is an application where reads far outnumber writes. A read is a user refreshing their timeline to get updates from the people they follow (QPS = 300K); a write is a user posting a tweet (QPS = 5K). Concurrent online users reach 150 million, and about 400 million tweets are posted per day. The requirements also include a notify (push notification) operation, which is not discussed separately; it can be folded into the push operation below.

    1. Push model

    When a user posts a tweet, the social graph is used to find their followers, and the tweet is then written into each of those followers' timelines. With N followers, this operation is O(N). For another user, refreshing their own timeline requires no reads of followees' feeds, only of their own timeline, so reads are O(1).

    This algorithm causes a problem that Jiuzhang calls the "storm of Lady Gaga": Gaga has 50M followers, so a single tweet from her is written into 50M users' timelines, putting enormous load on the servers. One optimization is to write first to the timelines of online users and defer the writes for the remaining followers, where an online user can be defined as one who has logged in within the last week.

    Figure 3: push model
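The push model can be sketched in a few lines; the store names (`followers`, `timelines`) and the timeline cap are illustrative assumptions, not any real system's design:

```python
from collections import defaultdict, deque

followers = defaultdict(set)    # user -> users who follow them
timelines = defaultdict(deque)  # user -> receive queue, newest first

def post_tweet(author, tweet, cap=500):
    """Fan-out on write: O(N) in the author's follower count."""
    for f in followers[author]:
        timelines[f].appendleft((author, tweet))
        while len(timelines[f]) > cap:
            timelines[f].pop()  # drop the oldest entry

def read_timeline(user, limit=50):
    """O(1) read: the timeline is already materialized."""
    return list(timelines[user])[:limit]
```

With 50M followers, a single `post_tweet` call becomes 50M writes, which is exactly the storm-of-Lady-Gaga problem; the optimization above amounts to running the loop immediately only for online followers and deferring the rest.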

         2. Pull model

    When a user posts a tweet, it is written only to their own send queue, i.e. the feed. When another user wants to read their timeline, the social graph is used to find the feeds of everyone they follow; the most recent 50 tweets are read from each and merged. Writes are O(1) and reads are O(N). The problem: a super user who follows every other user would have to read everyone's feed to build a timeline. This is an extreme case and can be handled by capping the number of accounts a user may follow.

     Figure 4: pull model
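A minimal sketch of the pull model, assuming a global sequence number for ordering (all names are illustrative):

```python
import heapq
from collections import defaultdict

following = defaultdict(set)  # user -> users they follow
feeds = defaultdict(list)     # user -> send queue of (seq, tweet), oldest first

_seq = 0

def post_tweet(author, tweet):
    """O(1) write: append only to the author's own feed."""
    global _seq
    _seq += 1
    feeds[author].append((_seq, tweet))

def read_timeline(user, per_feed=50, limit=100):
    """O(N) read: merge the recent tweets of every followee, newest first."""
    recent = []
    for followee in following[user]:
        recent.extend(feeds[followee][-per_feed:])
    return [tweet for _, tweet in heapq.nlargest(limit, recent)]
```

The merge is the expensive part: the read cost grows with the number of followees, which is why a follow cap bounds the worst case.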

         3. Push + pull model

    When a user reads their timeline, their followees fall into two groups. The first group is followees with fewer than 100K followers: when these users post, their tweets are written directly into the reader's timeline. The second group is followees with more than 100K followers: the reader pulls recent tweets from these users' feeds and merges them with the timeline they already have. From the sender's perspective: when posting a tweet, a user with more than 100K followers writes it only to their own feed, while a user with fewer pushes it directly to all followers. A fixed 100K threshold causes thrashing, however: if a user's follower count oscillates around 100K, their tweets alternate between being pushed to every follower and being written only to their own feed, so followers may see inconsistent results. The fix is to use a band instead of a single threshold: push when the follower count is below 100K, and pull when it is above 90K. Users with between 90K and 100K followers are both pushed and pulled; as long as a follower count does not suddenly drop from above 100K to below 90K, there is no problem.
    Although the two models can be combined, in practice Facebook adopted pull and Instagram adopted push. Under highly demanding read loads, pull fares somewhat better than push, since it is less likely to create bursts of high concurrency. At peak times such as Chinese New Year or the World Cup, when huge numbers of people are posting, push runs into exactly that high-concurrency problem.

    Figure 5: push + pull model
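The hysteresis band can be sketched as a small decision function; the 90K/100K thresholds come from the discussion above, everything else is illustrative:

```python
PUSH_MAX = 100_000  # push while follower count is below 100K
PULL_MIN = 90_000   # pull once follower count is above 90K

def delivery_modes(follower_count):
    """Users in the 90K-100K band are both pushed and pulled, so drifting
    across a single cutoff cannot make tweets disappear from timelines."""
    modes = set()
    if follower_count < PUSH_MAX:
        modes.add("push")
    if follower_count > PULL_MIN:
        modes.add("pull")
    return modes
```

A celebrity at 200K followers is pull-only, a typical user is push-only, and anyone inside the band is covered by both paths.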

    4. Data + Evolution

    The key optimization is to put all the data structures in memory: the social graph, the timeline lists, and the feed lists. Let's estimate the memory required.

    • 1 billion users
    • Average feed size: 50 tweets
    • Average timeline size: 1000 tweets
    • Tweet size: 200B
    • Average followers = 30
    Memory needed
    • Size of timeline lists = 1 billion × 1000 × 200B = 200T
    • Size of feed lists = 1 billion × 50 × 200B = 10T
    • Size of social graph = 1 billion × 30 × 2 × 8B = 480G

    200T is not realistic. First optimization: store only active users, and only their most recent tweets.

    • Weekly active users: 100 million
    • Average feed size: 80 tweets
    • Average timeline size: 500 tweets
    • Tweet size: 200B
    • Average followers = 20
    Memory needed
    • Size of timeline lists = 100 million × 500 × 200B = 10T
    • Size of feed lists = 100 million × 80 × 200B = 1.6T
    • Size of social graph = 100 million × 20 × 2 × 8B = 32G
    In total, roughly 12T of memory is needed.
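The arithmetic above can be checked in a few lines (numbers taken straight from the estimates; B = bytes):

```python
TB, GB = 10**12, 10**9
active = 100_000_000  # weekly active users

timeline_lists = active * 500 * 200  # 500 tweets x 200B each -> 10 TB
feed_lists = active * 80 * 200       # 80 tweets x 200B each  -> 1.6 TB
social_graph = active * 20 * 2 * 8   # 20 edges, both directions, 8B IDs -> 32 GB

total = timeline_lists + feed_lists + social_graph
print(total / TB)  # 11.632, i.e. roughly 12 TB
```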

    Further optimization: normalization

    Tweet metadata size:
    20B = userID (8B) + tweetID (8B) + indicators (4B)

     A tweet with metadata is 220B, but the metadata alone is only 20B, so keeping only metadata in the lists cuts them to a tenth of their size: about 1.2T of memory. On top of that we need extra space to store the actual tweets: 100 million × 80 × 20B ≈ 150G.
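The normalized version can be checked the same way, keeping only 20-byte metadata entries in the lists:

```python
META = 8 + 8 + 4  # userID + tweetID + indicators = 20 bytes
active = 100_000_000

lists = active * 500 * META + active * 80 * META  # timeline + feed metadata
tweets = active * 80 * 20  # the text's figure for the tweet store (~160 GB, quoted as ~150G)
print(lists / 10**12)  # 1.16, i.e. about a tenth of the 11.6 TB before
```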

    Figure 6: in-memory layout


    The timeline builder reads tweets according to a new user's profile (who they follow), writes them into the user's timeline, and then adds the new user to the in-memory data structures. A new user's timeline is therefore produced entirely by pull: push happens only when someone posts a tweet, the people a new user follows are not necessarily posting at that exact moment, and push only delivers the single tweet currently being sent.

    Figure 7: adding a new user


    To delete a tweet, the indicator in its metadata is marked as deleted; when a timeline is generated, deleted tweets are simply skipped.
    1. Layered - components are independent and isolated. 
    2. Service/API Driven - each layer is connected via well defined interface that is the sole entry point for accessing that service. This prevents nasty complicated interdependencies. Clients hide behind an application API. Applications use a data access layer.
    3. Distributed - layers are transparently distributed across clusters of machines
    4. Separate Application Logic - Application logic is encapsulated in application servers that provide an API endpoint. Application logic is implemented in terms of other services. The application server tier also hides a write-through cache as this is the only place user data is written or retrieved, it is the perfect spot for a cache.
    5. Stateless - State is kept in a database tier, caching tier, or other services, not in applications or stuffed in local pockets.
    6. Scalable Component Services - Other services that themselves are scalable and highly available are used to implement this service. Messages also uses: Memcache, User index service, and HayStack.
    7. Full Stack Ops - Tools were developed to manage, monitor, identify performance issues and upgrade the entire service, up, down and across the stack.
    8. Celled - Messages has as the basic building block of their system a cluster of machines and services called a cell. If you need more power captain, then cells are added in an incremental fashion. A cell consists of ZooKeeper controllers, an application server cluster, and a metadata store. Cells provide isolation as the storage and application horsepower to process requests is independent of other cells. Cells can fail, be upgraded, and distributed across datacenters independent of other cells.

    Design the Facebook news feed function
    Determining Importance

    You'll likely want to rank edges by importance rather than simply the most recent updates, meaning that you need to calculate some sort of score.
    Facebook's EdgeRank was described by the formula ∑e ue we de: an item's rank is the sum over its edges e of ue·we·de, where ue is the affinity score with the user who created the edge, we is the weight for the content type, and de is a time-decay factor.

    Calculating a friend's affinity score can be done something like this:
    ∑i li ni wi: the affinity is the sum over interactions i of li·ni·wi, where li is a recency factor based on the time since your last interaction (weighted so that 1 day counts for more than 30 days), ni is the number of interactions, and wi is the weight of those interactions. 
    This method allows you to rank friends in a separate database and then perhaps only show ten updates from the ten closest friends, which isn't a bad idea considering few of us are likely to have more close friends than this.
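A sketch of this kind of scoring; the content-type weights and the half-life below are made-up illustrations, since the real EdgeRank coefficients were never published:

```python
# Hypothetical weights per content type (w_e); not Facebook's actual values.
TYPE_WEIGHT = {"comment": 4.0, "share": 3.0, "like": 1.0}

def decay(age_days, half_life_days=3.0):
    """Time-decay factor d_e: an edge loses half its value every half-life."""
    return 0.5 ** (age_days / half_life_days)

def edge_rank(edges):
    """rank = sum over edges e of u_e * w_e * d_e.
    Each edge is (affinity u_e, content type, age in days)."""
    return sum(u * TYPE_WEIGHT[etype] * decay(age)
               for u, etype, age in edges)
```

For instance, a fresh like from a close friend (u = 1.0) contributes 1.0, while a three-day-old comment from the same friend contributes 4.0 × 0.5 = 2.0.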

    What to Store
    Activity(id, user_id, source_id, activity_type, edge_rank, parent_id, parent_type, data, time)

    Assuming you're using MySQL as your database store, you can index on (user_id, time) and then perform your basic queries.

    In MySQL, your tables would be heavily denormalized since performing joins will hurt performance.

    1. Optimize for the read.
    2. Users don't know what they don't know. If eventually consistent, attempt to provide read consistency for the publisher.
    3. If reading from 3rd party sources, leverage 'push' delivery options wherever possible. Polling is expensive and inefficient.
    4. Queue and prompt on new activity. Stream in (preferably in real-time) comments and likes.
    5. All content is not created equal. While reverse chronological delivery is popular, in high volume situations identify signals to leverage in ranking and ordering content in a more intelligent fashion.
    6. If supporting API content delivery, provide attribution to developer applications.
    7. Leverage open standards. Standardizing on the data format you consume and emit is better for everyone involved in the long run.

    I found this paper interesting -
    * Combination of push (events are pushed to materialized per consumer feeds) and pull (events are pulled from per producer event store) approaches. Purely push or pull model is less versatile.
    * The push/pull decision is made locally on per consumer/producer basis. One size doesn't fit all.
    * Global and per producer coherency - With global coherency, events are displayed in the global order of timestamp. With per-producer coherency, time sequence is maintained per producer basis.
    * Feed diversity - A frequent producer of events may overshadow events from less frequent producers. Feed diversity addresses diversity in favor of absolute sequencing by timestamp alone.

    One last essential aspect: how frequently people connect might need some specific study, to weight the ‘li’ mentioned by Josh; people with shared connections and historical common comment threads might deserve a closer relationship, i.e. a bigger weight ‘wi’. Usual time patterns should simplify the first, and PageRank gives you enough insight about how to deal with the latter.

    Many startups (and major media companies) use Echo StreamServer to power their real-time news feeds quickly and cheaply.
    We did the typical things, like using read slaves and memcache to increase read throughput and sharding our database to improve write throughput.

    In particular, making schema changes or adding indexes to a database with more than 10 - 20 million rows completely locks the database for hours at a time. Removing old indexes takes just as much time, and not removing them hurts performance because the database will continue to read and write to those unused blocks on every INSERT, pushing important blocks out of memory.
    The schema for the log is activity_log(uid INT(11), activity ENUM, activity_id INT(11), title TEXT, date TIMESTAMP)
    ...and the schema for the feed is newsfeed(uid INT(11), poster_uid INT(11), activity ENUM, activity_id INT(11), title TEXT, date TIMESTAMP).
    Any time a user does something relevant to the news feed, for example asking a question, it will get logged to the activity log immediately.
    Then every X minutes (5 minutes at the moment; this will change to 15-30 minutes later), I run a cron job that executes the script below. This script loops through all of the users in the database, finds all the activities for all of that user's friends, and then writes those activities to the news feed.
    At the moment, the SQL that culls the activity (called in ActivityLog::getUsersActivity()) has a LIMIT 100 imposed for performance reasons. 
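The cron-based fan-out could look roughly like this sketch (SQLite is used for brevity; the `friends` table and the function name are hypothetical, while the `activity_log` and `newsfeed` columns follow the schemas given above):

```python
import sqlite3

def crunch_logs(db, since):
    """One cron pass: copy each new activity into every friend's news feed."""
    new = db.execute(
        "SELECT uid, activity, activity_id, title, date FROM activity_log "
        "WHERE date > ? ORDER BY date DESC LIMIT 100", (since,)).fetchall()
    for uid, activity, activity_id, title, date in new:
        # Fan the activity out to each friend's news feed row.
        friends = db.execute(
            "SELECT friend_uid FROM friends WHERE uid = ?", (uid,)).fetchall()
        db.executemany(
            "INSERT INTO newsfeed (uid, poster_uid, activity, activity_id, title, date) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            [(f, uid, activity, activity_id, title, date) for (f,) in friends])
    db.commit()
```

Advancing `since` on every run would address flaw #3 below (only grab activities newer than the last crunch); without it, the same rows are re-copied each pass.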
    Here are the flaws I see in the current implementation:
    1. You are processing all of the friends for all users, but you will end up processing the same users many times due to the fact that the same groups of people have similar friends.
    2. If one of my friends posts something, it won't show up on my news feed for at most 5 minutes. Whereas it should show up immediately, right?
    3. We are reading the entire news feed for a user. Don't we just need to grab the new activities since the last time we crunched the logs?
    4. This doesn't scale that well.
    Publishing Methods

    "Push" Model, or Fan-out-on-write
    This method involves denormalizing the user's activity data and pushing the metadata to all the user's friends at the time it occurs.
    You store only one copy of the data as in the schema above, then push pointers to friends with the metadata. The problem with this method is that if you have a large fan-out (a large number of followers), you run the risk of this breaking while your feed accumulates a backlog.

    If you go with this strategy, you also risk a large number of disk seeks and random writes. You'll want some sort of write-optimized data store such as Cassandra, HBase, or BigTable.

    "Pull" Model, or Fan-out-on-load
    This method involves keeping all recent activity data in memory and pulling in (or fanning out) that data at the time a user loads their home page.
    Data doesn't need to be pushed out to all subscribers as soon as it happens, so no back-log and no disk seeks. The problem with this method is that you may fail to generate a user's news feed altogether. 
    To mitigate this risk, you should have a fallback mechanism in place that approximates the user's feed or serves as a good alternative.

    relevance vs recency
    There are two fundamental building blocks for feeds: connections and activities.
    Activities form a log of what some entity on the site has done, or had done to it.
    Connections express relationships between entities.

    Connections are a superset of Circles, Favorites, Orders, and other relationships between entities on the site.

    Connections are implemented as a directed graph.
    Currently, the nodes can be people or shops. (In principle they can be other objects.)

    We also assign each edge a weight, known as affinity.
    activity := (subject, verb, object)

    Activities are a description of an event on Etsy boiled down to a subject (“who did it”), a verb (“what they did”), and an object (“what they did it to”).

    The problem (the main one we’re trying to solve with activity feeds) is how to notify all of them about it. In order to achieve that goal, as usual, we copy the data all over the place.

    activity := [owner,(subject, verb, object)]
    So what we do is duplicate the S,V,O combinations with different owners.
    Steve will have his record that he connected to Kyle, and Kyle will be given his own record that Steve connected to him.

    Getting to the end result (the activity feed page) has two distinct phases: aggregation and display.

    Aggregation turns activities (in the database) into a Newsfeed (in memcache).
    Aggregation typically occurs offline, with Gearman.

    The first step in aggregation is to turn the list of people you are connected to into the list of people we’re actually going to go seek out activities for.

    In theory, the way we would do this is rank the connections by affinity and then treat the affinity as the probability that we’ll pick it.

    $choose_connection = (mt_rand() / mt_getrandmax()) < $affinity; // normalize to [0, 1]
    So then we’d be more likely to pick the close connections, but leaving the possibility that we will pick the distant ones.
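In other words, each affinity in [0, 1] is treated as the probability of visiting that connection. A sketch (the function name and data shape are ours):

```python
import random

def choose_connections(connections, rng=random.random):
    """connections: list of (connection, affinity in [0, 1]).
    Each affinity is the probability that the connection is selected."""
    return [c for c, affinity in connections if rng() < affinity]
```

Close connections (affinity near 1) are almost always picked, while distant ones still have a chance of appearing; passing a deterministic `rng` makes the behavior testable.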

    Aggregation, Step 2: Making Activity Sets
    Once the connections are chosen, we then select historical activity for them and convert it into in-memory structures called activity sets.
    These are just the activities grouped by connection, with a score and flags field for each.
    The next few phases of aggregation operate on these.

    The next thing that happens is that we iterate through all of the activities in all of the sets and classify them.

    Aggregation, Step 3: Classification
    Aggregation, Step 4: Scoring
    At this point the score is just a simple time decay function (older activities always score lower than new ones).

    Aggregation, Step 5: Pruning
    It’s possible to wind up seeing the same event as two or more activities.
    The next stage of aggregation detects these situations.

    Aggregation, Step 6: Sort & Merge
    Then we take the final set of activities and merge it on to the owner’s existing newsfeed.
    (Or we create a new newsfeed if they don’t have one.)

    Aggregation: Cleaning Up
    We trim off the end of the newsfeed so that it doesn’t become arbitrarily large.
    And then finally we store the feed into memcached.
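Steps 4 through 6 of the aggregation pipeline can be sketched as follows (classification is omitted; the field names, the half-life, and the cap are all illustrative):

```python
def time_decay(age_seconds, half_life=86400.0):
    """Scoring: older activities always score lower than newer ones."""
    return 0.5 ** (age_seconds / half_life)

def prune(activities):
    """Pruning: drop later copies of the same (subject, verb, object) event."""
    seen, kept = set(), []
    for a in activities:
        key = (a["subject"], a["verb"], a["object"])
        if key not in seen:
            seen.add(key)
            kept.append(a)
    return kept

def aggregate(activity_sets, existing_feed, now, cap=200):
    """Score, prune, then sort & merge onto the owner's feed, trimming the end."""
    activities = [a for s in activity_sets for a in s]
    for a in activities:
        a["score"] = time_decay(now - a["ts"])
    feed = existing_feed + prune(activities)
    feed.sort(key=lambda a: a["score"], reverse=True)
    return feed[:cap]
```

The result is exactly what gets written to memcached as the owner's newsfeed.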

    Display: Enter Rollups
    To solve this problem we combine similar stories into rollups.

    Rather than producing just a few very large rollups over the entire feed, the feed is first split into sections. Then the similar stories are grouped together within the sections.

    Display: Filling in Stories
    There are multiple layers of caching here. Things that are global (like the shop associated with a favorited listing) are cached separately from things that are unique to the person looking at the feed (like the exact way the story is phrased).

    Hack #1: Cache Warming
    The first thing we do to speed things up is run almost the entire pipeline proactively using Gearman.
    So after aggregation we trigger a display run, even though nobody is there to look at the HTML.
    The end result is that almost every pageview is against a hot cache.

    Hack #2: TTL Caching
    The second thing we do is add bits of TTL caching where few people will notice.
    Straightforward but not done in many places on the site.

    Hack #3: Judicious Associations
    We also profiled the pages and meticulously simplified ORM usage.

    Hack #4: Lazy Below the Fold
    We don’t load much at the outset. You get more as you scroll down (infinite scrolling).

    Neil Fraser: Writing: Differential Synchronization

    Differential synchronization is a symmetrical algorithm employing an unending cycle of background difference (diff) and patch operations. There is no requirement that "the chickens stop moving so we can count them" which plagues server-side three-way merges.

    Dual Shadow Method
    The diff operation is critical: one such algorithm for plain-text is described in Diff Strategies, along with a set of optimizations to make diff significantly faster.
    The patch operation is just as critical to the operation of the system. This system requires that patches be applied in and around text which may not exactly match the expected text. Furthermore, patches must be applied 'delicately', taking care not to overwrite changes which do not need to be overwritten. One such algorithm for plain-text is described in Fuzzy Patch, along with a set of optimizations to make patch significantly faster.

    Guaranteed Delivery Method
    In a nutshell: normal operation works identically to the Dual Shadow Method described above. However, in the case of packet loss, the edits are queued up in a stack and are retransmitted to the remote party on every sync until the remote party returns an acknowledgment of receipt. The server keeps two copies of the shadow: "Server Shadow" is the most up-to-date copy, and "Backup Shadow" is the previous version, for use in the event that the previous transmission was not received.

    Normal operation: Client Text is changed by the user. A Diff is computed between Client Text and Client Shadow to obtain a set of edits which were made by the user. These edits are tagged with a client version number ('n') relating to the version of Client Shadow they were created from. Client Shadow is updated to reflect the current value of Client Text, and the client version number is incremented. The edits are sent to the server along with the client's acknowledgment of the current server version number ('m') from the previous connection. The server's Server Shadow should match both the provided client version number and the provided server version number. The server patches the edits onto Server Shadow, increments the client version number of Server Shadow and takes a backup of Server Shadow into Backup Shadow. Finally the server then patches the edits onto Server Text. The process then repeats symmetrically from the server to the client, with the exception that the client doesn't take a backup shadow. During the return communication the server will inform the client that it received the edits for version 'n', whereupon the client will delete edits 'n' from the stack of edits to send.
    Duplicate packet: The client appears to send Edits 'n' to the server twice. The first communication is processed normally and the response sent. Server Shadow's 'n' is incremented. The second communication contains an 'n' smaller than the 'n' recorded on Server Shadow. The server has no interest in edits it has already processed, so does nothing and sends back a normal response.
    Lost outbound packet: The client sends Edits 'n' to the server. The server never receives it. The server never acknowledges receipt of the edit. The client leaves the edits in the outbound stack. After the connection times out, the client takes another diff, updates the 'n' again, and sends both sets of edits to the server. The stack of edits transmitted keeps increasing until the server eventually responds with acknowledgment that it got a certain version.
    Lost return packet: The client sends Edits 'n' to the server. The server receives it, but the response is lost. The client leaves the edits in the outbound stack. After the connection times out, the client takes another diff, updates the 'n' again, and sends both sets of edits to the server. The server observes that the server version number 'm' which the client is sending does not match the server version number on Server Shadow. But both server and client version numbers do match the Backup Shadow. This indicates that the previous response must have been lost. Therefore the server deletes its edit stack and copies the Backup Shadow into Shadow Text (step 4). The server throws away the first edit because it already processed (same as a duplicate packet). The normal workflow is restored: the server applies the second edit, then computes and transmits a fresh diff to the client.
    Out of order packet: The server appears to lose a packet, and one (or both) of the lost packet scenarios is played out. Then the lost packet arrives, and the duplicate packet scenario is played out.
    Data corruption in memory or network: There are too many potential failure points to list, however if the shadow checksums become out of sync, or one side's version number skips into the future, the system will reinitialize itself. This will result in data loss for one side, but it will never result in an infinite loop of polling.
    Adaptive Timing
    An adaptive system can continuously modify the network synchronization frequency for each client based on current activity. Hard-coded upper and lower limits would be defined to keep the cycle within a reasonable range (e.g. 1 second and 10 seconds respectively). User activity and remote activity would both decrease the time between updates (e.g. halving the period). Sending and receiving an empty update would increase the time between updates (e.g. increasing the period by one second). This adaptive timing automatically tunes the update frequency so that each client gradually backs off when activity is low, and quickly reengages when activity is high.
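A sketch of this adaptive timing rule, using the example values from the text (1s/10s limits, halving on activity, +1s on an empty round trip):

```python
MIN_PERIOD, MAX_PERIOD = 1.0, 10.0  # hard-coded limits, in seconds

def next_period(period, local_activity, remote_activity):
    """Halve the sync interval when either side is active; back off by one
    second after an empty update, clamped to the hard limits."""
    if local_activity or remote_activity:
        period /= 2
    else:
        period += 1
    return min(MAX_PERIOD, max(MIN_PERIOD, period))
```

Repeated empty cycles walk the period up to 10 seconds; a single keystroke on either side quickly drives it back toward 1 second.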

    Try the demonstration of MobWrite, a web-based multi-user editor which uses this differential synchronization.
    One limitation of differential synchronization as described here is that only one synchronization packet may be in flight at any given time. This would be a problem if there were very significant latency in the connection.
    Read full article from Neil Fraser: Writing: Differential Synchronization

    Wednesday, December 24, 2014

    Tom White: Consistent Hashing

    The basic idea behind the consistent hashing algorithm is to hash both objects and caches using the same hash function. The reason to do this is to map the cache to an interval, which will contain a number of object hashes. If the cache is removed then its interval is taken over by a cache with an adjacent interval. All the other caches remain unchanged.

    Here's a picture of the circle with a number of objects (1, 2, 3, 4) and caches (A, B, C) marked at the points that they hash to:

    public class ConsistentHash<T> {

      private final HashFunction hashFunction;
      private final int numberOfReplicas;
      private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

      public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
          Collection<T> nodes) {
        this.hashFunction = hashFunction;
        this.numberOfReplicas = numberOfReplicas;
        for (T node : nodes) {
          add(node);
        }
      }

      public void add(T node) {
        // Each node appears numberOfReplicas times on the circle.
        for (int i = 0; i < numberOfReplicas; i++) {
          circle.put(hashFunction.hash(node.toString() + i), node);
        }
      }

      public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
          circle.remove(hashFunction.hash(node.toString() + i));
        }
      }

      public T get(Object key) {
        if (circle.isEmpty()) {
          return null;
        }
        int hash = hashFunction.hash(key);
        if (!circle.containsKey(hash)) {
          // Walk clockwise to the first node at or after the hash, wrapping around.
          SortedMap<Integer, T> tailMap = circle.tailMap(hash);
          hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
      }
    }
    The circle is represented as a sorted map of integers, which represent the hash values, to caches (of type T here).
    When a ConsistentHash object is created each node is added to the circle map a number of times (controlled by numberOfReplicas). The location of each replica is chosen by hashing the node's name along with a numerical suffix, and the node is stored at each of these points in the map.

    To find a node for an object (the get method), the hash value of the object is used to look in the map. Most of the time there will not be a node stored at this hash value (since the hash value space is typically much larger than the number of nodes, even with replicas), so the next node is found by looking for the first key in the tail map. If the tail map is empty then we wrap around the circle by getting the first key in the circle.
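The same lookup can be sketched in Python, with `bisect` playing the role of `TreeMap.tailMap`; the hash function and replica count are arbitrary choices for illustration:

```python
import bisect
import hashlib

def h(s):
    """Hash a string to a large integer (MD5 chosen arbitrarily)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=100):
        # Each node is placed on the circle `replicas` times, keyed by
        # hashing the node's name with a numerical suffix.
        self.ring = sorted((h(f"{n}:{i}"), n) for n in nodes for i in range(replicas))
        self.keys = [k for k, _ in self.ring]

    def get(self, key):
        if not self.ring:
            return None
        i = bisect.bisect(self.keys, h(key))
        if i == len(self.keys):  # past the largest point: wrap around the circle
            i = 0
        return self.ring[i][1]
```

As in the Java version, most lookups fall between points, so the first point clockwise from the key's hash (or the circle's first point, after wrapping) owns the key.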
    Read full article from Tom White: Consistent Hashing

    How to Ace a Systems Design Interview
    That’s also the reason I never worry about whether the interviewee has seen the question before. Let’s take the question “Design a web crawler” as an example. As an interviewer, I can focus the interview on the overall crawler infrastructure, I can discuss how to dedup URLs in detail, and I can also ask how to detect whether a page has been updated.
    First and foremost, I’ll evaluate whether the design actually works. Although there’s no implementation to verify that, based on work experience and some common sense, I would ask myself if I would try the proposed approach if given this problem. More often than not, it’s quite obvious whether the design is problematic, and I’ll just use some examples to challenge the candidate. For example, if I ask him to check whether a URL has been crawled before, I’ll see if the solution handles short URLs or URLs with UTM params. This is the bare minimum requirement. If the candidate can’t make it work, I won’t go deeper, or I may switch to a separate question.
    Secondly, I would check feasibility. Some candidates come up with solutions that only work in theory: they may require infinite memory, or the system may be unnecessarily complicated. In either case, I will ask the candidate to fix it. A good way to verify this is to ask yourself how much time and how many engineers you would need to implement the design. Personally, I prefer designs with ease and simplicity. If you can’t build a prototype within one or two weeks, I might ask you to simplify it.
    Thirdly, I expect the candidate to be clear about what he’s talking about. More specifically, I want to make sure that he’s aware of why the system should be designed in a specific way, what the constraints are, and whether there are any other solutions. Usually, design questions are vaguely described. Good candidates are able to tell you what assumptions are made and how their design compares to others. To make it even clearer, ask yourself what the alternative solutions are and why you built the system this way instead of another.

    The reason I think this is important is that you won’t know whether your design works without actually working on it. With some hands-on experience, you’ll soon realize that a lot of things that seem reasonable at first glance are really hard to implement. For example, if you want to check whether a page’s content has been updated since the last time you crawled it and you rely on whether the HTML content remains the same, you’ll notice that many pages have the same content while things like comments and sidebars have changed. This is a design I don’t think works, although it may sound reasonable.


    It’s important to be generally curious about everything. One great practice is to pick a product you use every day, like YouTube, and think about how you would design the system from scratch.
    Sometimes the product can be really complicated; in that case you can design just one of its features, like Facebook friend recommendations. If you have time, writing some code to implement a prototype would be a plus. But the point is that you should try to get down to the details.
    Although system design questions don’t have any standard answers, you can still search for how these products/features are implemented. Compare it with your own designs and understand the difference. High Scalability is highly recommended, but don’t spend too much time on the particular tools (see the point “What’s Not Important”).
    NOTE: One trick is that a lot of interviewers like to ask design questions that are related to the company. For instance, you are more likely to design a Google product/feature in Google interviews. It’s not always the case, but it doesn’t hurt to pay a little more attention to products of this company or similar products.
    System Design Interview Questions have a very detailed analysis of common questions

    What’s not important

    One common mistake is paying too much attention to particular tools. For instance, people spend a lot of time on how to use AWS, how to configure Google Cloud Platform, or how to use a specific web framework.
    I’m not saying these aren’t useful; in fact, they are definitely good things to learn. However, from the system design interview’s perspective, interviewers care more about understanding than about any particular tool.
    For example, when discussing processing large data, what I as an interviewer would like to discuss is how to distribute the data across multiple machines, how to aggregate it later, and how to spread the load evenly. If someone just tells me they’ll use Hadoop on AWS, I’ll ask for more details, and they’ll still end up answering all the questions above.
    The rule of thumb is to focus more on how each tool is designed than on which tool to use.

    Simple, High-level solution

    Systems that are simple, straightforward and efficient really win.
    Some people want to show off that they can design something complicated, and start with an overly complicated system. This is completely the wrong mindset.
    What the interviewer cares about is not what cool technology you’re using, but whether you can design a system that works. I’ve seen so many candidates who kept saying all kinds of buzzwords without thinking about the question for even a few seconds. Honestly, those buzzwords are just bullshit; forget about them in interviews.
    The recommended approach is to start with something as simple as possible and design a high-level solution. If you are asked to design Google’s autocomplete system, you can start by suggesting the most common queries with the given prefix, so all you need is a log processor and a suggest server.
    Of course, this solution won’t work in many cases: sometimes we need personalization, and the data may exceed the memory limit. We can then prioritize these problems and address them one by one. Never shoot yourself in the foot by over-complicating the problem.
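The autocomplete baseline described above can be sketched in a few lines. This is a toy illustration, not the article's actual design; the function names and the in-memory index are assumptions for the sake of the example.

```python
from collections import Counter

# Offline "log processor": count query frequencies from a search log.
def build_index(query_log):
    return Counter(query_log)

# Online "suggest server": return the k most common queries with the prefix.
def suggest(index, prefix, k=3):
    matches = [(q, c) for q, c in index.items() if q.startswith(prefix)]
    matches.sort(key=lambda qc: -qc[1])
    return [q for q, _ in matches[:k]]

index = build_index(["how to", "how to cook", "how to cook", "houston"])
print(suggest(index, "how"))  # most frequent "how..." queries first
```

A real system would precompute a trie or sorted prefix table instead of scanning the whole index per request, which is exactly the kind of follow-up problem to prioritize next.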

    Don’t Rush

    A common pitfall in coding interviews is to start coding without much consideration and discussion. The same problem happens in system design interviews as well.
    Remember that no one expects you to come up with a design within seconds. Treat the whole interview as a discussion rather than an exam. In a discussion, you are encouraged to think out loud. You don’t need to reach a solution quickly, but you can talk about how you are thinking about the problem, what you are trying to solve at the moment, and where you are stuck.
    As an interviewer, the candidates I would give a strong hire make the whole interview feel comfortable. It’s a discussion from start to finish, as if we were working on the problem together. They don’t pretend to know everything; instead, they tell me where they are stuck and how they are approaching the problem.


    Remember that your solution depends heavily on the restrictions, which can be both explicit and implicit. Explicit restrictions are set by the interviewer, e.g., you have only one machine. Most people, however, don’t pay attention to implicit restrictions.
    A lot of the time we make assumptions without realizing it. Uncovering those hidden assumptions helps you better understand your solution. For example, the time/space trade-off is a common theme in design problems. At some point it’s okay to be a little slower if you can save a lot of memory, and if you have a clear reason why speed doesn’t matter in the particular circumstances, your design is reasonable.
    A good practice is to think about what the alternative approaches are and why the approach you picked is better. Usually it’s better because of some constraints and assumptions, so it’s important to validate those assumptions. Changing your mind during the interview is completely okay; in fact, it’s a good sign that you are considering all the scenarios.


    In other words, sometimes the interviewer won’t tell you whether to scale or not. You can get the answer by making reasonable assumptions and doing the calculation.
    This is a very important point, because in real projects good engineers make a lot of decisions this way. And it certainly takes some practice.

    “There are some libraries out there”

    One common pitfall is that many candidates tell me “there are some libraries to do this out there” as an excuse not to design a function in detail.
    Do interviewers know there are existing libraries? Of course. But there are many reasons why it still makes sense to ask candidates to design particular functions:
    • Existing libraries might not do a good job. For example, a lot of libraries can extract date information from web pages, but none of them does it perfectly. In fact, just this feature can be a whole team’s job in many companies.
    • There might be better solutions given the particular problem and its constraints.
    • In the end, there are existing systems for every design question, but it still makes sense to discuss the problem.
    However, I’m not suggesting you should never use existing tools in system design interviews. It depends on what the design question focuses on. If it’s a common tool or something trivial compared to the overall question, it’s completely fine to use an existing one.
    1. How to approach a problem in a systematic way to maximize your chances of success?
    System design interviews are less about getting lucky and more about actually doing the hard work of attaining knowledge. In the end, your performance in these interviews depends on the following 2 factors.
    1. Your knowledge — gained either through studying or practical experience.
    2. Your ability to articulate your thoughts.
    When companies ask design questions, they want to evaluate your design skills and your experience in designing large-scale distributed systems.
    7 steps to approach a System Design Interview

    Step 1: Requirement Gathering:

    Many candidates think that system design interviews are all about “scale”, forgetting to put required emphasis on the “system” part of the interview.
    You need to have a working “system” before you can scale it.
    As the first step in your interview, you should ask questions to find the exact scope of the problem. Design questions are mostly open-ended and don’t have ONE correct answer, which is why clarifying ambiguities early in the interview is critical. Candidates who spend time clearly defining the end goals of the system always have a better chance of success.
    Here are some questions for designing Twitter that should be answered before moving on to the next steps:
    1. Who can post a tweet? (answer: any user)
    2. Who can read the tweet? (answer: any user — as all tweets are public)
    3. Will a tweet contain photos or videos (answer: for now, just photos)
    4. Can a user follow another user? (answer: yes).
    5. Can a user ‘like’ a tweet? (answer: yes).
    6. What gets included in the user feed (answer: tweets from everyone whom you are following).
    7. Is feed a list of tweets in chronological order? (answer: for now, yes).
    8. Can a user search for tweets (answer: yes).
    9. Are we designing the client/server interaction or backend architecture or both (answer: we want to understand the interaction between client/server but we will focus on how to scale the backend).
    10. How many total users are there (answer: we expect to reach 200 Million users in the first year).
    11. How many daily active users are there? (answer: 100 million users sign in every day).
    It’s a hypothetical problem geared towards evaluating your approach. You are just asking these questions to scope the problem that you are going to solve today. e.g. you now don’t have to worry about handling videos or generating a timeline using algorithms etc.

    Step 2: System interface definition

    If you have gathered the requirements and can identify the APIs exposed by the system, you are 50% done.
    Define what APIs are expected from the system. This would not only establish the exact contract expected from the system but would also ensure if you haven’t gotten any requirements wrong. Some examples for our Twitter-like service would be:
    postTweet(user_id, tweet_text, image_url, user_location, timestamp, …)
    generateTimeline(user_id, current_time)
    recordUserTweetLike(user_id, tweet_id, timestamp, …)
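As a quick sanity check of these API contracts, they can be stubbed out in Python. The in-memory stores and the naive pull-based timeline below are illustrative assumptions, not part of the article's design.

```python
import time
import uuid

TWEETS = {}   # tweet_id -> tweet record (toy in-memory store)
LIKES = []    # (user_id, tweet_id, timestamp)

def post_tweet(user_id, tweet_text, image_url=None, user_location=None):
    tweet_id = str(uuid.uuid4())
    TWEETS[tweet_id] = {
        "user_id": user_id, "text": tweet_text,
        "image_url": image_url, "location": user_location,
        "timestamp": time.time(),
    }
    return tweet_id

def record_user_tweet_like(user_id, tweet_id):
    LIKES.append((user_id, tweet_id, time.time()))

def generate_timeline(user_id, current_time, following):
    # Naive pull model: gather tweets from followed users, newest first.
    feed = [t for t in TWEETS.values()
            if t["user_id"] in following and t["timestamp"] <= current_time]
    return sorted(feed, key=lambda t: -t["timestamp"])
```

Writing stubs like this early catches contract mistakes (e.g., whether the timeline needs the follow graph as an input) before any scaling discussion.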

    Step 3: Back-of-the-envelope capacity estimation

    It’s always a good idea to estimate the scale of the system you’re going to design. This would also help later when you’ll be focusing on scaling, partitioning, load balancing and caching.
    1. What scale is expected from the system (e.g., number of new tweets, number of tweet views, how many timeline generations per second, etc.)?
    2. How much storage will we need? This depends on whether users can upload photos and videos in their tweets.
    3. What network bandwidth usage are we expecting? This will be crucial in deciding how we manage traffic and balance load between servers.
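A back-of-the-envelope pass over the Step 1 numbers (100 million daily actives) might look like the following; the tweets-per-user and bytes-per-tweet rates are assumptions marked in the comments, not figures from the text.

```python
# Back-of-the-envelope for the Twitter-like service.
daily_active_users = 100_000_000     # from Step 1
tweets_per_user_per_day = 2          # assumption
avg_tweet_bytes = 300                # assumption: text + metadata

tweets_per_day = daily_active_users * tweets_per_user_per_day
write_qps = tweets_per_day / 86_400                   # seconds per day
storage_per_day_gb = tweets_per_day * avg_tweet_bytes / 1e9

print(f"{write_qps:,.0f} tweet writes/sec")           # roughly 2,315/sec
print(f"{storage_per_day_gb:,.0f} GB of new tweet text per day")  # 60 GB
```

Even rough numbers like these immediately tell you whether a single database can absorb the write load and how fast storage grows, which shapes the partitioning discussion later.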


    Step 4: Defining the data model

    Defining the data model early will clarify how data will flow among different components of the system. Later, it will guide you towards better data partitioning and management. Candidate should be able to identify various entities of the system, how they will interact with each other and different aspect of data management like storage, transfer, encryption, etc. Here are some entities for our Twitter-like service:
    User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc.
    Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc.
    UserFollows: UserID1, UserID2
    FavoriteTweets: UserID, TweetID, TimeStamp
    Which database system should we use? Would a NoSQL store like Cassandra best fit our needs, or should we use a MySQL-like solution? What kind of blob storage should we use for photos and videos?
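The entities listed in this step can be written down as lightweight types to pin the fields before choosing a database. These are toy dataclasses mirroring the field names above, not a storage schema.

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: int
    name: str
    email: str

@dataclass
class Tweet:
    tweet_id: int
    user_id: int
    content: str
    number_of_likes: int = 0

@dataclass
class UserFollows:
    # One row per follow edge in the relationship table.
    follower_id: int
    followee_id: int
```

Sketching the entities first makes questions like "do likes live on the Tweet row or in their own table?" concrete before any SQL-vs-NoSQL debate.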

    Step 5: High-level design

    Draw a block diagram with 5–6 boxes representing core components of your system. You should identify enough components that are needed to solve the actual problem from end-to-end.
    For Twitter, at a high level, we would need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution. If we assume that we’ll have a lot more read traffic than write traffic, we can decide to have separate servers for handling reads vs. writes. On the backend, we need an efficient database that can store all the tweets and support a huge number of reads. We would also need a distributed file storage system for storing photos (and videos), plus a search index and infrastructure to enable searching of tweets.

    Step 6: Detailed design for selected components

    Dig deeper into 2–3 components; the interviewer’s feedback should guide you towards which parts of the system they want you to explain further. You should be able to present different approaches with their pros and cons, and explain why you would choose one. Remember there is no single answer; the important thing is to consider the tradeoffs between different options while keeping the system constraints in mind. e.g.
    1. Since we’ll be storing a huge amount of data, how should we partition our data to distribute it to multiple databases? Should we try to store all the data of a user on the same database? What issues can it cause?
    2. How would we handle high-traffic users e.g. celebrities who have millions of followers?
    3. Since a user’s timeline will contain the most recent (and relevant) tweets, should we store our data in a way that is optimized for scanning the latest tweets?
    4. How much and at which layer should we introduce cache to speed things up?
    5. What components need better load balancing?
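For question 1 above, one common answer is hash-based partitioning by user. A minimal sketch, with the shard count and hash choice as assumptions for illustration:

```python
import hashlib

NUM_SHARDS = 8   # assumption for illustration

def shard_for_user(user_id: int) -> int:
    # Hash-based partitioning: all of a user's data lands on one shard,
    # which makes per-user queries cheap but can create hot shards for
    # celebrity accounts (question 2 above).
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every lookup for the same user is routed to the same shard:
assert shard_for_user(42) == shard_for_user(42)
```

Note that a bare modulo reshuffles almost every key when NUM_SHARDS changes; consistent hashing is the usual follow-up to minimize that migration.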

    Step 7: Identifying and resolving bottlenecks

    Try to discuss as many bottlenecks as possible, and different approaches to mitigate them.
    1. Is there any single point of failure in our system? What are we doing to mitigate it?
    2. Do we have enough replicas of the data so that we can still serve our users if we lose a few servers?
    3. Similarly, do we have enough copies of different services running so that a few failures will not cause a total system shutdown?
    4. How are we monitoring the performance of our service? Do we get alerts whenever critical components fail or their performance degrades?
    The key here is to understand what your interviewer is looking for. He wants you to give him a 50,000 ft overview, identify high-level components and describe the interactions between components as succinctly as possible. Here are 3 phases of such a discussion.
    1. Draw a big box that represents the system.
    2. Zoom-in and break that big box into 5–6 components.
    3. Briefly discuss the role of each component e.g. compute, storage, front-end, back-end, caching, queueing, networking, load-balancing, etc.
    Your interviewer would want you to discuss 1–2 components in more depth and he is going to specify which one. You are rarely expected to write any code during these discussions.
    Only use buzzwords and in-fashion technologies e.g. “GraphQL” if you understand them well and can justify and defend your approach.

    1. He has probably asked this question a thousand times and is well versed in the possible solutions. He’ll quickly find out how much you actually understand.
    Rule 2: Never pretend to be an expert. The person interviewing you almost always understands the domain better than you and can even be an industry expert.

    This is really my domain. I’ll be done in 15 minutes

    Good for you but slow down. Instead of jumping to the solution that you already know, do the following:
    1. Gather requirements.
    2. Ask Questions. Your interviewer is interested in understanding your thought processes.
    3. Evaluate multiple solutions, discuss pros and cons and see where the discussion takes you.
    In reality, it is a good idea to do this whether you know about the domain or not.
    Rule 3: Don’t rush to a solution. Gather requirements, suggest multiple solutions and evaluate them. It is meant to be an open-ended discussion.

    How to Ace a Systems Design Interview | Palantir

    Nominally, this interview appears to require knowledge of systems and a knack for design—and it does. What makes it interesting, though, and sets it apart from a coding or an algorithms interview, is that whatever solution you come up with during the interview is just a side effect. What we actually care about is the process.
    In other words, the systems design interview is all about communication.
    Do you know the constraints? What kind of inputs does your system need to handle? You have to get a sense for the scope of the problem before you start exploring the space of possible solutions. And remember, there is no single right answer to a real-world problem. Everything is a tradeoff.

    • Concurrency. Do you understand threads, deadlock, and starvation? Do you know how to parallelize algorithms? Do you understand consistency and coherence?
    • Networking. Do you roughly understand IPC and TCP/IP? Do you know the difference between throughput and latency, and when each is the relevant factor?
    • Abstraction. You should understand the systems you’re building upon. Do you know roughly how an OS, file system, and database work? Do you know about the various levels of caching in a modern OS?
    • Real-World Performance. You should be familiar with the speed of everything your computer can do, including the relative performance of RAM, disk, SSD and your network.
    • Estimation. Estimation, especially in the form of a back-of-the-envelope calculation, is important because it helps you narrow down the list of possible solutions to only the ones that are feasible. Then you have only a few prototypes or micro-benchmarks to write.
    • Availability and Reliability. Are you thinking about how things can fail, especially in a distributed environment? Do you know how to design a system to cope with network failures? Do you understand durability?

    How do you get better at something? If your answer isn’t along the lines of “practice” or “hard work,” then I have a bridge to sell you. Just like you have to write a lot of code to get better at coding and do a lot of drills to get really good at basketball, you’ll need practice to get better at design. Here are some activities that can help:
    • Do mock design sessions. Grab an empty room and a fellow engineer, and ask her to give you a design problem, preferably related to something she’s worked on. Don’t think of it as an interview—just try to come up with the best solution you can. Design interviews are similar to actual design sessions, so getting better at one will make you better at the other.
    • Work on an actual system. Contribute to OSS, or build something with a friend. Treat your class projects as more than just academic exercises—actually focus on the architecture and the tradeoffs behind each decision. As with most things, the best way to learn is by doing.
    • Do back-of-the-envelope calculations for something you’re building, and then write micro-benchmarks to verify them. If your micro-benchmarks don’t match your back-of-the-envelope numbers, some part of your mental model will have to give, and you’ll learn something in the process.
    • Dig into the performance characteristics of an open source system. For example, take a look at LevelDB. It’s new and clean and small and well-documented. Read about the implementation to understand how it stores its data on disk and how it compacts the data into levels. Ask yourself questions about tradeoffs: which kinds of data and sizes are optimal, and which degrade read/write performance? (Hint: think about random vs. sequential writes.)
    • Learn how databases and operating systems work under the hood. These technologies are not only tools in your belt, but also a great source of design inspiration. If you can think like a DB or an OS and understand how each solves the problems it was designed to solve, you’ll be able to apply that mindset to other systems.
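The calculate-then-benchmark loop suggested above can be tried in a few lines. The workload here (summing a list in order vs. in shuffled order, to surface cache effects) is just an illustrative stand-in; absolute timings depend on your machine.

```python
import random
import time

# Predict first, then measure: does access order matter for this workload?
data = list(range(1_000_000))
indices = list(range(len(data)))
random.shuffle(indices)

t0 = time.perf_counter()
seq_sum = sum(data[i] for i in range(len(data)))
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
rnd_sum = sum(data[i] for i in indices)
t_rnd = time.perf_counter() - t0

assert seq_sum == rnd_sum  # same work, different access pattern
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rnd:.3f}s")
```

If the measured gap doesn't match what your mental model of CPU caches predicted, that is exactly the learning moment the advice is pointing at.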

    The systems design interview can be difficult, but it’s also a place to be creative and to take joy in the imagining of systems unbuilt. If you listen carefully, make sure you fully understand the problem, and then take a clear, straightforward approach to communicating your ideas, you should do fine.
    Jeff Dean makes similar points in his LADIS 2009 keynote (which I unfortunately wasn’t able to attend). In particular, he gives a useful table of “Numbers Everyone Should Know” — that is, the cost of some fundamental operations:
    Operation                              Time (nsec)
    L1 cache reference                             0.5
    Branch mispredict                                5
    L2 cache reference                               7
    Mutex lock/unlock                               25
    Main memory reference                          100
    Compress 1KB bytes with Zippy                3,000
    Send 2K bytes over 1 Gbps network           20,000
    Read 1MB sequentially from memory          250,000
    Roundtrip within same datacenter           500,000
    Disk seek                               10,000,000
    Read 1MB sequentially from disk         20,000,000
    Send packet CA -> Netherlands -> CA    150,000,000
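These numbers are meant to be computed with, not memorized for their own sake. For example, using the per-MB figures from the table to estimate reading 1 GB sequentially:

```python
# Per-MB costs from the "Numbers Everyone Should Know" table, in nanoseconds.
READ_1MB_FROM_MEMORY_NS = 250_000
READ_1MB_FROM_DISK_NS = 20_000_000

mb_per_gb = 1024
mem_ms = mb_per_gb * READ_1MB_FROM_MEMORY_NS / 1e6   # ns -> ms
disk_ms = mb_per_gb * READ_1MB_FROM_DISK_NS / 1e6

print(f"1 GB from memory: ~{mem_ms:.0f} ms, from disk: ~{disk_ms:.0f} ms")
# ~256 ms from memory vs. ~20,480 ms from disk: roughly 80x slower even
# sequentially, and far worse if each read also pays a 10 ms seek.
```

One such multiplication is often enough to rule out a whole class of designs (e.g., anything that re-reads a cold dataset from disk per request).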
    Some useful figures that aren’t in Dean’s data can be found in this article comparing NetBSD 2.0 and FreeBSD 5.3 from 2005. Approximating those figures, we get:
    Operation                              Time (nsec)
    System call overhead                           400
    Context switch between processes             3,000
    fork() (statically-linked binary)           70,000
    fork() (dynamically-linked binary)         160,000
    First, the HR pointed out that Facebook’s System Design interview is 45 minutes long and mainly tests whether the candidate can handle large-scale problems. The interviewer will ask you to design a feature for Facebook. By having you design a complex system, the interviewer mainly wants to see how you make tradeoffs between consistency, availability, and partition tolerance, and thereby evaluate your thinking and hands-on ability. The Facebook HR stressed that candidates should study the following material:
    【1】Dropbox’s videos on large-scale systems:
    【2】Facebook’s paper on Scaling Memcache:
    【1】Concurrency (threads, deadlock, starvation, consistency, coherence)
    【2】Abstraction (understanding how the OS, file system, and database work)
    【3】Real-world performance (relative performance of RAM, disk, network, SSD)
    【4】Availability and reliability (durability, understanding how things can fail)
    【5】Data storage (RAM vs. durable storage, compression, byte sizes)





    If you ask the interviewer, they may give you numbers directly, e.g., “we need to handle 400 requests per second”; or they may give you something vaguer to estimate from, e.g., “this site isn’t top 3, but it is top 10.”






        1. Cache: the universal remedy; wherever performance is lacking, consider caching first
        2. Queue: message queues, commonly LinkedIn’s Kafka
        3. Asynchronization: batching + asynchronous processing to reduce the system’s IO bottlenecks
        4. Load balancing: consistent hashing can keep data migration to a minimum
        5. Parallelization: parallel computation, e.g., MapReduce
        6. Replication: improves reliability, e.g., HDFS’s location-aware block replication
        7. Partitioning: database sharding, e.g., by hashing keys and taking the modulo
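Since caching tops this list, it is worth knowing what sits behind the buzzword. A minimal LRU cache sketch (an illustrative implementation, not tied to any specific product in the text):

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None
        self.store.move_to_end(key)         # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        if key in self.store:
            self.store.move_to_end(key)
        self.store[key] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                 # "a" is now most recently used
cache.put("c", 3)              # evicts "b"
print(cache.get("b"), cache.get("a"))  # None 1
```

Being able to explain eviction, hit rate, and invalidation like this is what separates "use a cache" from an actual design answer.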

        1. Get familiar with common web application architecture patterns. For example, if you browse Zhihu every day, do you know how its timeline is designed? You can search the question bank for it.
        2. If you get a question you are familiar with, treat it as a showcase, because system design interviews draw heavily on project experience and many companies evaluate your projects in the same session. For what a good showcase looks like, watch some QCon and PyCon talks.
        3. If you get a question you are not familiar with but remember some pieces of, treat the interviewer as a colleague and the interview as solving the problem together. Many companies assess communication skills here, and this approach also earns you hints from the interviewer and makes the interview more pleasant.
        4. As others have mentioned: consider the single-machine case first, then scale out. The benefit is that the single-machine version of a system design question is usually an object-oriented design question, so at the very least you can show the interviewer that your single-machine solution is sound.
        Think about the big picture first, then divide the roles (database, cache, message bus, etc.), and only then discuss concrete technologies (e.g., Redis). Many people jump straight to a specific technology and claim it solves the problem without explaining how it works, which is a real problem.

        Read full article from How to Ace a Systems Design Interview | Palantir


