Thursday, December 25, 2014

Design the Facebook timeline function



https://www.facebook.com/note.php?note_id=10150468255628920
At a high level we needed to scan, aggregate, and rank posts, shares, photos and check-ins to surface the most significant events over years of Facebook activity.

After a few discussions we decided to build on four of our core technologies: MySQL/InnoDB for storage and replication, Multifeed (the technology that powers News Feed) for ranking, Thrift for communications, and memcached for caching.


Denormalizing the data
Before we began Timeline, our existing data was highly normalized, which required many round trips to the databases. Because of this, we relied on caching to keep everything fast. When data wasn’t found in cache, it was unlikely to be clustered together on disk, which led to lots of potentially slow, random disk IO. To support our ranking model for Timeline, we would have had to keep the entire data set in cache, including low-value data that wasn’t displayed.

2. Non-recent activity data had been moved to slow network storage. We hacked a read-only build of MySQL and deployed hundreds of servers to exert maximum IO pressure and copy this data out in weeks instead of months.
3. Massive join queries that did tons of random IO. We consolidated join tables into a tier of flash-only databases. Traditionally PHP can perform database queries on only one server at a time, so we wrote a parallelizing query proxy that allowed us to query the entire join tier in parallel.

Caching is an important part of any Facebook project. One of the nice properties of Timeline is that the results of big queries, such as ranking all your activity in 2010, are small and can be cached for a long period without cache invalidations. A query result cache is of huge benefit and memcached is an excellent solution.

Recent Activity changes frequently so a query cache is frequently invalidated, but regenerating the summary of Recent Activity is quite fast. 
Here a row cache helps further boost query performance. We rely on the InnoDB buffer pool in RAM and our own Flashcache kernel driver to expand the OS cache onto a flash device.
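As a rough illustration of the long-term query cache, here is a minimal sketch assuming a memcached-style get/set interface; FakeMemcache, rank_year_activity, and every other name below are made up for the example, not Facebook's actual code.

```python
import hashlib
import json
import time

class FakeMemcache:
    """In-process stand-in for a memcached client (e.g. pymemcache in real use)."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires_at = self._store.get(key, (None, 0))
        return value if value is not None and time.time() < expires_at else None

    def set(self, key, value, ttl):
        self._store[key] = (value, time.time() + ttl)

cache = FakeMemcache()

def rank_year_activity(user_id, year):
    """Placeholder for the expensive ranking query against the MySQL tier."""
    return [{"post_id": 1, "score": 0.9}, {"post_id": 2, "score": 0.4}]

def get_ranked_year(user_id, year):
    # Big historical queries (e.g. "rank all of 2010") rarely change,
    # so their results can be cached with a long TTL and few invalidations.
    key = hashlib.md5(f"timeline:{user_id}:{year}".encode()).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = rank_year_activity(user_id, year)
    cache.set(key, json.dumps(result), ttl=7 * 24 * 3600)
    return result
```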

http://highscalability.com/blog/2012/1/23/facebook-timeline-brought-to-you-by-the-power-of-denormaliza.html
Multifeed (a custom distributed system which takes the tens of thousands of updates from friends and picks the most relevant)

Denormalize. Format data in the way you need to use it.
Denormalization, creating special purpose objects instead of distributed rows that must be joined, minimizes random IO by reducing the number of trips to the database.
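A toy contrast of the two layouts (hypothetical data, not Facebook's real schema): the normalized form needs several lookups per story, while the denormalized form keeps one special-purpose record per timeline entry.

```python
# Normalized: rendering one story requires several round trips / random reads.
posts  = {101: {"author_id": 7, "text": "Hello"}}
users  = {7: {"name": "Alice"}}
photos = {101: ["p1.jpg"]}

def render_normalized(post_id):
    post = posts[post_id]              # lookup 1
    author = users[post["author_id"]]  # lookup 2
    media = photos.get(post_id, [])    # lookup 3
    return {"author": author["name"], "text": post["text"], "photos": media}

# Denormalized: one self-contained object per timeline entry, one sequential read.
timeline_entries = {
    101: {"author": "Alice", "text": "Hello", "photos": ["p1.jpg"]},
}

def render_denormalized(post_id):
    return timeline_entries[post_id]   # single lookup
```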

Caching can often get around the need for denormalization, but given the amount of timeline data and how much of it is cold (that is, rarely viewed), caching everything isn't a good design.

Timeline is like a datamart in a data warehouse. Data must be slurped up from dozens of different systems, cleaned, merged, and reformatted into a new canonical format. Facebook of course did this in a Facebook-like way. They created a custom data conversion language, they deployed hundreds of MySQL servers to extract the data out of "legacy" systems as fast as possible, they deployed flash storage to speed up joins, they created a parallelizing query proxy, and they standardized on the Multifeed data format for future flexibility.
  • Keep different types of caches
    • Short term cache. A timeline of recent activity is frequently invalidated because it changes all the time as you perform actions. This cache is an in-RAM row cache inside InnoDB that uses the Flashcache kernel driver to expand the OS cache onto a flash device.
    • Long term cache. A query cache is kept in memcached. The results of large queries, like the ranking of all your activities in 2010, can be efficiently cached since they will rarely be invalidated.
  • Run operations locally. The Timeline Aggregator (geographically clustering nearby check-ins, ranking status updates, etc) runs on each database so it can max out the disks. Only data that needs to be displayed is sent over the network.
  • Parallelize development
    News Feed is the constantly updating list of stories in the middle of your home page. News Feed includes status updates, photos, videos, links, app activity and likes from people, Pages and groups that you follow on Facebook.
    https://www.facebook.com/help/community/question/?id=10153813355475433
    Your Timeline, which is sometimes referred to as your profile, is your collection of the photos, stories and experiences that tell your story.

    http://systemdesigns.blogspot.com/2015/12/facebook-news-feed-and-timeline.html
    Each user has two queues: a send queue (the feed) and a receive queue (the timeline). A user's send queue stores the tweets that user wrote, and the receive queue stores the tweets posted by the people the user follows. Another important data structure in the system is the social graph: each user has a profile that records whom they follow and who follows them.


    Figure 1: user's feed
    Figure 2: user's timeline
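    A toy in-memory version of these structures (all names are assumptions made for this sketch):

```python
from collections import defaultdict

feeds = defaultdict(list)       # send queue: user_id -> tweets the user wrote
timelines = defaultdict(list)   # receive queue: user_id -> tweets from followees
following = defaultdict(set)    # social graph: user_id -> ids the user follows
followers = defaultdict(set)    # social graph: user_id -> ids following the user

def follow(user_id, followee_id):
    following[user_id].add(followee_id)
    followers[followee_id].add(user_id)

follow(1, 2)   # user 1 follows user 2
```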
    Take Twitter as an example: Twitter is an application where reads far outnumber writes. A read is a user refreshing their timeline to get updates from the people they follow, at about 300K QPS; a write is a user posting a tweet, at about 5K QPS. Concurrent online users reach 150 million, and there are about 400 million tweets per day. The requirements also include a notify (push notification) operation, which is not discussed separately here; it can be folded into the push operation described below.

    1. push model

    When a user posts a tweet, the system looks up their followers through the social graph and writes the tweet into each of those followers' timelines. With N followers, this operation is O(N). For a reader, refreshing their own timeline does not require reading anyone's feed; they only read their own timeline, which is O(1).

    This algorithm has one problem, which the Jiuzhang course calls the "storm of Lady Gaga": Lady Gaga has about 50M followers, so a single tweet from her has to be written into 50M timelines, which puts a huge load on the servers. One optimization is to write to online followers' timelines first and defer the writes to everyone else; an "online" user can be defined as one who has logged in within the past week.

    Figure 3: push model
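    A minimal, self-contained sketch of the push model (fan-out on write) with the online-first optimization; the structures and names are assumptions for the example.

```python
from collections import defaultdict

followers = defaultdict(set)    # user_id -> follower ids
timelines = defaultdict(list)   # user_id -> materialized receive queue
online_users = set()            # e.g. users who logged in within the last week

def post_tweet_push(author_id, tweet):
    """O(N) fan-out on write: copy the tweet into every follower's timeline."""
    deferred = []
    for follower_id in followers[author_id]:
        if follower_id in online_users:
            timelines[follower_id].append(tweet)
        else:
            deferred.append(follower_id)   # written later by a background job
    return deferred

def read_timeline_push(user_id, limit=50):
    """O(1)-ish read: the timeline is already materialized."""
    return timelines[user_id][-limit:]

followers[7] = {1, 2}
online_users.add(1)
print(post_tweet_push(7, "hello"))   # [2]  (offline follower, deferred)
print(read_timeline_push(1))         # ['hello']
```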

    2. pull model

    When a user posts a tweet, it is written only into their own send queue, i.e. their feed. When another user reads their timeline, the system uses the social graph to find the feeds of everyone they follow, reads the most recent 50 tweets from each feed, and merges them. Writes are O(1) and reads are O(N). The problem: if a super-user follows every other user, reading their timeline means reading every user's feed. That is an extreme case, and it can be handled by capping how many accounts a user may follow.

    Figure 4: pull model
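    A matching sketch of the pull model (fan-out on read), merging the most recent tweets from every followee's feed by timestamp; all names are assumptions.

```python
import heapq
from collections import defaultdict

following = defaultdict(set)   # user_id -> followee ids
feeds = defaultdict(list)      # user_id -> [(timestamp, tweet), ...], oldest first

def post_tweet_pull(author_id, timestamp, tweet):
    """O(1) write: append only to the author's own feed."""
    feeds[author_id].append((timestamp, tweet))

def read_timeline_pull(user_id, per_user=50, limit=100):
    """O(N) read: take the last `per_user` tweets of each followee and k-way merge."""
    chunks = [feeds[f][-per_user:] for f in following[user_id]]
    merged = heapq.merge(*(reversed(c) for c in chunks),
                         key=lambda item: item[0], reverse=True)
    return [tweet for _, tweet in list(merged)[:limit]]

following[1] = {2, 3}
post_tweet_pull(2, 100, "hello from 2")
post_tweet_pull(3, 101, "hello from 3")
print(read_timeline_pull(1))   # ['hello from 3', 'hello from 2']
```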

    3. push + pull model

    When a user reads their timeline, the people they follow are split into two groups. The first group is followees with fewer than 100K followers: when these users post, their tweets are pushed directly into the reader's timeline. The second group is followees with more than 100K followers: the reader pulls the recent tweets from these users' feeds and merges them with the timeline they already have. From the sender's perspective: when a user posts a tweet, if they have more than 100K followers they write it only to their own feed; if they have fewer, they push it to all of their followers. A single 100K threshold causes flapping, however: if a user's follower count oscillates around 100K, their tweets alternate between being pushed to all followers and being written only to their own feed, so followers may see inconsistent timelines. The fix is to use a band instead of a single cutoff: push for users with fewer than 100K followers, pull for users with more than 90K. Users with between 90K and 100K followers are both pushed and pulled, and as long as a user's follower count does not suddenly drop from above 100K to below 90K, there is no problem.
    Although the two models can be combined, in practice Facebook uses pull and Instagram uses push. Under heavy read demand, pull tends to work better than push, because pull is less likely to create a burst of concurrent writes. During special periods, such as Chinese New Year or the World Cup, huge numbers of users post at the same time and push runs into high-concurrency problems.


    Figure 5: push + pull model
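    A sketch of the hybrid model with the 90K/100K band described above; the threshold constants and all names are assumptions for the example, and deduplication of the overlap band is left out.

```python
from collections import defaultdict

PUSH_THRESHOLD = 100_000   # authors below this are pushed to all followers
PULL_THRESHOLD = 90_000    # followees above this are pulled at read time

followers = defaultdict(set)    # user_id -> follower ids
following = defaultdict(set)    # user_id -> followee ids
feeds = defaultdict(list)       # every author's own send queue
timelines = defaultdict(list)   # the pre-pushed part of each receive queue

def post_tweet_hybrid(author_id, tweet):
    feeds[author_id].append(tweet)                  # always write the own feed
    if len(followers[author_id]) < PUSH_THRESHOLD:  # small audience: fan out on write
        for follower_id in followers[author_id]:
            timelines[follower_id].append(tweet)
    # authors in the 90K-100K band are covered by both the push and the pull path

def read_timeline_hybrid(user_id, per_user=50):
    merged = list(timelines[user_id])               # pushed part
    for followee in following[user_id]:
        if len(followers[followee]) > PULL_THRESHOLD:
            merged.extend(feeds[followee][-per_user:])   # celebrity feeds pulled on read
    # deduplication of the overlap band and re-sorting by time are omitted for brevity
    return merged
```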

    4. Data + Evolution

    The key optimization is to keep all the data structures in memory: the social graph, the timeline lists, and the feed lists. Let's estimate the memory required.

    Assumption
    • 1 billion users
    • Average feed size: 50 tweets
    • Average timeline size: 1,000 tweets
    • Tweet size: 200B
    • Average followers = 30
    Memory needed
    • Size of timeline lists = 1 billion * 1,000 * 200B = 200TB
    • Size of feed lists = 1 billion * 50 * 200B = 10TB
    • Size of social graph = 1 billion * 30 * 2 * 8B = 480GB

    200TB is not realistic. The first optimization: only keep active users, and only each active user's 80 most recent tweets in the feed.
    The resulting numbers are:

    Assumption
    • Weekly active users: 100 million
    • Average feed size: 80 tweets
    • Average timeline size: 500 tweets
    • Tweet size: 200B
    • Average followers = 20
    Memory needed
    • Size of timeline lists = 100 million * 500 * 200B = 10TB
    • Size of feed lists = 100 million * 80 * 200B = 1.6TB
    • Size of social graph = 100 million * 20 * 2 * 8B = 32GB
    In total, roughly 12TB of memory is needed.
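    The arithmetic behind these estimates, spelled out as a quick sanity check:

```python
GB, TB = 10**9, 10**12

users = 1_000_000_000                    # initial estimate: 1 billion users
print(users * 1_000 * 200 / TB)          # timeline lists: 200.0 TB
print(users * 50 * 200 / TB)             # feed lists:      10.0 TB
print(users * 30 * 2 * 8 / GB)           # social graph:   480.0 GB

active = 100_000_000                     # revised: 100 million weekly active users
print(active * 500 * 200 / TB)           # timeline lists:  10.0 TB
print(active * 80 * 200 / TB)            # feed lists:       1.6 TB
print(active * 20 * 2 * 8 / GB)          # social graph:    32.0 GB
# total of the revised estimate: roughly 12 TB
```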

    Further optimization: normalization

    Store a tweet's content and its metadata separately. There is no need to read or write the full tweet on every write or read of a feed or timeline; reading and writing just the tweet's metadata is enough.
    tweet metadata size:
    20B = userID (8B) + tweetID (8B) + indicators (4B)

    A full tweet is about 220B, while its metadata is only 20B, about one tenth of the original, which brings the in-memory footprint down to roughly 1.2TB. On top of that, extra space is needed to store the actual tweets: 100 million * 80 * 20B ≈ 160GB.




    Figure 6: in-memory data layout
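    A sketch of the 20-byte metadata record with Python's struct module, following the userID(8) + tweetID(8) + indicators(4) layout above; the format string and example values are assumptions.

```python
import struct

METADATA_FORMAT = ">QQI"   # userID (8B) + tweetID (8B) + indicators (4B) = 20B
assert struct.calcsize(METADATA_FORMAT) == 20

def pack_metadata(user_id, tweet_id, indicators=0):
    return struct.pack(METADATA_FORMAT, user_id, tweet_id, indicators)

def unpack_metadata(blob):
    return struct.unpack(METADATA_FORMAT, blob)

# Feeds and timelines hold only these 20B records; the ~200B tweet bodies live in a
# separate content store keyed by tweet_id and are fetched only when rendering.
record = pack_metadata(user_id=7, tweet_id=123)
print(len(record))              # 20
print(unpack_metadata(record))  # (7, 123, 0)
```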

    New user registration:

    The timeline builder reads tweets according to the new user's profile (the people they follow), writes them into the new user's timeline, and then adds the new user to the in-memory data structures. A new user's timeline is therefore produced entirely by pull: push only happens when someone posts a tweet, the people a new user follows will not necessarily be posting at the moment of the follow, and push only delivers the single tweet being sent.

    Figure 7: adding a new user

    A user deletes a tweet

    The indicator in the tweet's metadata is marked as deleted, so when a timeline is generated, deleted tweets are no longer included.
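    A minimal sketch of this tombstone approach, using dicts in place of the packed 20B records (names are assumptions):

```python
DELETED = 0x1   # a bit in the metadata "indicators" field

def mark_deleted(meta):
    meta["indicators"] |= DELETED   # tombstone; the stored content is untouched here

def build_timeline(metadata_entries):
    # Skip tombstoned entries when materializing a timeline.
    return [m for m in metadata_entries if not (m["indicators"] & DELETED)]

entries = [
    {"user_id": 7, "tweet_id": 1, "indicators": 0},
    {"user_id": 7, "tweet_id": 2, "indicators": 0},
]
mark_deleted(entries[1])
print([m["tweet_id"] for m in build_timeline(entries)])   # [1]
```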


    http://highscalability.com/blog/2011/5/17/facebook-an-example-canonical-architecture-for-scaling-billi.html
    1. Layered - components are independent and isolated. 
    2. Service/API Driven - each layer is connected via well defined interface that is the sole entry point for accessing that service. This prevents nasty complicated interdependencies. Clients hide behind an application API. Applications use a data access layer.
    3. Distributed - layers are transparently distributed across clusters of machines
    4. Separate Application Logic - Application logic is encapsulated in application servers that provide an API endpoint. Application logic is implemented in terms of other services. The application server tier also hides a write-through cache; since this is the only place user data is written or retrieved, it is the perfect spot for a cache.
    5. Stateless - State is kept in a database tier, caching tier, or other services, not in applications or stuffed in local pockets.
    6. Scalable Component Services - Other services that themselves are scalable and highly available are used to implement this service. Messages also uses: Memcache, User index service, and HayStack.
    7. Full Stack Ops - Tools were developed to manage, monitor, identify performance issues and upgrade the entire service, up, down and across the stack.
    8. Celled - Messages has as the basic building block of their system a cluster of machines and services called a cell. If you need more power, captain, then cells are added in an incremental fashion. A cell consists of ZooKeeper controllers, an application server cluster, and a metadata store. Cells provide isolation as the storage and application horsepower to process requests is independent of other cells. Cells can fail, be upgraded, and distributed across datacenters independent of other cells.
