Thursday, December 3, 2015

Facebook Architecture



https://www.facebook.com/atscaleevents/videos/1732398473699916/
http://highscalability.com/blog/2016/6/27/how-facebook-live-streams-to-800000-simultaneous-viewers.html
This began a year of product improvement and protocol iteration.
  • They started with HLS, HTTP Live Streaming. It’s supported by the iPhone and allowed them to use their existing CDN architecture.
  • Simultaneously began investigating RTMP (Real-Time Messaging Protocol), a TCP-based protocol. There's a stream of video and a stream of audio that is sent from the phone to the Live Stream servers.
    • Advantage: RTMP has lower end-to-end latency between the broadcaster and viewers. This really makes a difference in an interactive broadcast where people are interacting with each other; shaving a few seconds of delay makes all the difference in the experience.
    • Disadvantage: requires a whole new architecture because it's not HTTP based. A new RTMP proxy needed to be developed to make it scale.
  • Also investigating MPEG-DASH (Dynamic Adaptive Streaming over HTTP).
    • Advantage: compared to HLS it is 15% more space efficient.
    • Advantage: it allows adaptive bit rates. The encoding quality can be varied based on the network throughput (a minimal sketch of this idea appears after this list).
  • Live video is different from normal video: it causes spiky traffic patterns.
    • Live videos are more engaging so tend to get watched 3x more than normal videos.
    • Live videos appear at the top of the news feed so have a higher probability of being watched.
    • Notifications are sent to all the fans of each page so that’s another group of people who might watch the video.
  • Spiky traffic causes problems in the caching system and the load balancing system.
  • Caching Problems
    • A lot of people may want to watch a live video at the same time. This is your classic Thundering Herd problem.
    • The spiky traffic pattern puts pressure on the caching system.
    • Video is segmented into one-second files. Servers that cache these segments may overload when traffic spikes.
  • Global Load Balancing Problem
    • Facebook has points of presence (PoPs) distributed around the world. Facebook traffic is globally distributed.
    • The challenge is preventing a spike from overloading a PoP.
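As noted in the MPEG-DASH bullet above, adaptive bit rates mean the player picks a rendition that fits the measured network throughput. A minimal sketch of that selection logic; the encoding ladder and safety factor are illustrative assumptions, not Facebook's numbers:

```python
# Minimal sketch of client-side adaptive bitrate selection (the idea behind
# MPEG-DASH's multiple bit rates). Ladder and safety factor are assumptions.
LADDER_KBPS = [200, 400, 800, 1500, 3000]   # hypothetical encoding ladder

def pick_rendition(measured_throughput_kbps, safety_factor=0.8):
    """Pick the highest rendition that fits within a fraction of measured throughput."""
    budget = measured_throughput_kbps * safety_factor
    fitting = [r for r in LADDER_KBPS if r <= budget]
    return fitting[-1] if fitting else LADDER_KBPS[0]   # fall back to the lowest rung

print(pick_rendition(1000))   # -> 800
print(pick_rendition(100))    # -> 200 (lowest rendition)
```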

This is how a live stream goes from one broadcaster to millions of viewers.
  • A broadcaster starts a live video on their phone.
  • The phone sends an RTMP stream to a Live Stream server.
  • The Live Stream server decodes the video and transcodes to multiple bit rates.
  • For each bit rate a set of one-second MPEG-DASH segments is continuously produced.
  • Segments are stored in a datacenter cache.
  • From the datacenter cache segments are sent to caches located in the points of presence (a PoP cache).
  • On the viewing side, the viewer receives a Live Story.
  • The player on their device starts fetching segments from a PoP cache at a rate of one per second.
  • There is one point of multiplication between the datacenter cache and the many PoP caches. Users access PoP caches, not the datacenter, and there are many PoP caches distributed around the world.
  • Another multiplication factor is within each PoP.
    • Within the PoP there are two layers: a layer of HTTP proxies and a layer of cache.
    • Viewers request the segment from an HTTP proxy. The proxy checks whether the segment is in cache. If it is, the segment is returned; if not, a request for the segment is sent to the datacenter.
    • Different segments are stored in different caches, which helps with load balancing across the caching hosts (a sketch of this segment-to-host mapping follows below).
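The load-balancing effect of spreading segments across caches can be illustrated with a simple hash-based mapping. This is a sketch under assumed host names and an assumed hashing scheme; the talk does not describe the actual routing:

```python
import hashlib

CACHE_HOSTS = ["cache-1", "cache-2", "cache-3", "cache-4"]   # hypothetical hosts

def cache_host_for(stream_id, segment_no):
    """Map a one-second segment to a cache host by hashing its name."""
    key = f"{stream_id}/{segment_no}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return CACHE_HOSTS[digest % len(CACHE_HOSTS)]

# Consecutive segments of the same stream spread across hosts, so one hot
# stream does not pin all of its load on a single caching host.
for n in range(4):
    print(n, cache_host_for("live-123", n))
```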

Protecting The Datacenter From The Thundering Herd

  • What happens when all the viewers are requesting the same segment at the same time?
  • If the segment is not in cache one request will be sent to the datacenter for each viewer.
  • Request Coalescing. The number of requests is reduced by adding request coalescing to the PoP cache. Only the first request is sent to the datacenter. The other requests are held until the first response arrives, and the data is then sent to all the viewers (see the coalescing sketch after this list).
  • A new caching layer is added to the proxy to avoid the Hot Server problem.
    • With coalescing alone, all the viewers are sent to one cache host to wait for the segment, which could overload that host.
    • The proxy adds a caching layer. Only the first request to the proxy actually makes a request to the cache. All the following requests are served directly from the proxy.
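A minimal sketch of request coalescing as described above, assuming a thread-per-request proxy; fetch_from_origin is a hypothetical placeholder for the datacenter fetch:

```python
import threading
from concurrent.futures import Future

cache = {}        # segment -> bytes
in_flight = {}    # segment -> Future shared by all waiters for that segment
lock = threading.Lock()

def get_segment(segment, fetch_from_origin):
    with lock:
        if segment in cache:
            return cache[segment]
        fut = in_flight.get(segment)
        if fut is None:                 # first requester becomes the leader
            fut = Future()
            in_flight[segment] = fut
            leader = True
        else:
            leader = False
    if leader:
        data = fetch_from_origin(segment)   # only one origin request per segment
        with lock:
            cache[segment] = data
            del in_flight[segment]
        fut.set_result(data)                # wake every waiting request
        return data
    return fut.result()                     # followers block until the leader is done
```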

PoPs Are Still At Risk - Global Load Balancing To The Rescue

  • So the datacenter is protected from the Thundering Herd problem, but the PoPs are still at risk. The problem with Live is the spikes are so huge that a PoP could be overloaded before the load measurements for that PoP reach the load balancer.
  • Each PoP has a limited number of servers and connectivity. How can a spike be prevented from overloading a PoP?
  • A system called Cartographer maps Internet subnetworks to PoPs. It measures the delay between each subnet and each PoP. This is the latency measurement.
  • The load for each PoP is measured and each user is sent to the closest PoP that has enough capacity. There are counters in the proxies that measure how much load they are receiving. Those counters are aggregated so we know the load for each PoP.
  • Now there’s an optimization problem that respects capacity constraints and minimizes latency.  
  • With control systems there’s a delay to measure and a delay to react.
  • They changed the load measurement window from 1.5 minutes to 3 seconds, but there’s still that 3 second window.
  • The solution is to predict the load before it actually happens.
  • A capacity estimator was implemented that extrapolates the previous load and the current load of each PoP to the future load.
    • How can a predictor predict the load will decrease if the load is currently increasing?
    • Cubic splines are used for the interpolation function.
    • The first and second derivatives are taken. If the speed (first derivative) is positive, the load is increasing. If the acceleration (second derivative) is negative, the speed is decreasing, so the load will eventually level off and start to fall.
    • Cubic splines predict more complex traffic patterns than linear interpolation.
    • Avoiding oscillations. This interpolation function also solves the oscillation problem.
    • The delay to measure and react means decisions are made on stale data. The interpolation reduces error, predicts more accurately, and reduces oscillations, so the load can run closer to the capacity target.
    • Currently prediction is based on the last three intervals, where each interval is 30 seconds, so it works from almost instantaneous load (a spline-extrapolation sketch follows this list).
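A minimal sketch of the spline-based prediction, assuming SciPy's CubicSpline and made-up load samples; the real capacity estimator's inputs and tuning are not public:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Recent 30-second load samples for one PoP (illustrative values, % of capacity).
t = np.array([0.0, 30.0, 60.0, 90.0])
load = np.array([52.0, 60.0, 65.0, 67.0])

spline = CubicSpline(t, load)       # extrapolates beyond the last sample
t_next = 120.0
predicted = spline(t_next)          # predicted future load
speed = spline(t_next, 1)           # first derivative: is the load rising?
accel = spline(t_next, 2)           # second derivative: is the rise slowing down?

print(f"predicted={predicted:.1f}%  speed={speed:.2f}  accel={accel:.2f}")
# Speed > 0 with accel < 0 means the load is still rising but decelerating,
# so the estimator can anticipate the peak instead of overshooting capacity.
```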

Testing

  • You need to be able to overload a PoP.
  • A load testing service was built that is globally distributed across the PoPs that simulates live traffic.
  • Able to simulate 10x production load.
  • Can simulate a viewer that is requesting one segment at a time.
  • This system helped reveal and fix problems in the capacity estimator, tune parameters, and verify that the caching layer solves the Thundering Herd problem (a toy viewer-simulation sketch follows).
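A toy sketch of such a viewer simulator: each simulated client requests one segment per second, like a real player. The fetch here is a stub; the real service issues HTTP requests from the PoPs themselves and can drive roughly 10x production load:

```python
import asyncio
import random

async def fetch_segment(stream, segment):
    """Stand-in for an HTTP GET of one segment from a PoP; swap in a real client to drive load."""
    await asyncio.sleep(random.uniform(0.01, 0.1))   # simulated network time

async def viewer(stream, seconds):
    # A simulated viewer fetches one one-second segment per second.
    for segment in range(seconds):
        await fetch_segment(stream, segment)
        await asyncio.sleep(1)

async def main(viewers=1000):
    await asyncio.gather(*(viewer("live-123", 10) for _ in range(viewers)))

if __name__ == "__main__":
    asyncio.run(main())
```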

Upload Reliability

  • Uploading a video in real-time is challenging.
  • Take, for example, an upload that has between 100 and 300 Kbps of available bandwidth.
  • Audio requires 64 Kbps of throughput.
  • Standard definition video requires 500 Kbps of throughput.
  • Adaptive encoding on the phone is used to adjust for the throughput deficit of video + audio. The encoding bit-rate of the video is adjusted based on the available network bandwidth.
  • The upload bitrate decision is made on the phone by measuring uploaded bytes on the RTMP connection and taking a weighted average over the last few intervals (see the sketch below).
  • Investigating a push mechanism rather than the request-pull mechanism, leveraging HTTP/2 to push to the PoPs before segments have been requested.
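A minimal sketch of the phone-side decision described above: weight recent RTMP throughput samples toward the newest, subtract the 64 Kbps audio budget, and clamp the video encoder bitrate. The weights and bounds here are assumptions:

```python
AUDIO_KBPS = 64   # audio budget from the text

def estimate_uplink_kbps(recent_kbps, weights=(0.5, 0.3, 0.2)):
    """Weighted average of recent throughput samples, newest first (weights are assumed)."""
    samples = recent_kbps[:len(weights)]
    used = weights[:len(samples)]
    return sum(s * w for s, w in zip(samples, used)) / sum(used)

def pick_video_bitrate_kbps(recent_kbps, floor=100, ceiling=500):
    """Give the video encoder whatever uplink is left after audio, within bounds."""
    video = estimate_uplink_kbps(recent_kbps) - AUDIO_KBPS
    return max(floor, min(ceiling, video))

print(pick_video_bitrate_kbps([300, 280, 250]))   # healthy uplink -> 220.0
print(pick_video_bitrate_kbps([120, 110, 100]))   # constrained uplink -> 100 (floor)
```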
http://www.chepoo.com/facebook-platform-architecture-design-1.html
Point 2 in particular broadens the reach of distribution. In Facebook's view, point 2 matters even more than point 1, and that is where the significance of the Facebook Platform lies. As the quote goes, "we build the platform optimize for build apps for social graph": the point of the open platform is to let third-party apps carry the social graph further.
"Before Facebook opened its platform, social networks were closed platforms, but today that era is over."
The three major goals of the Facebook Platform
1. Deep integration of apps into Facebook (Deep Integration Into Facebook Website)
Apps can be integrated into the user's profile.
An app has its own home page (the canvas page), which is entirely controlled by the app and can hold any content, including ads.
Apps can publish feed stories on behalf of the user.
Apps can send messages, invitations, notifications, and so on.
Once a user authorizes an app to access their profile, the app can call the API to retrieve user/friend/application/privacy information. The Facebook API itself is worth studying for any platform designer, for example the user API.
2. Viral spread (Mass Distribution through the Social Graph)
The core of distribution is the feed system (from a 2009 vantage point, Twitter's feed arguably did this even better, posed a real threat, and helped trigger Facebook's recent homepage redesign). Apps can publish three different types of feed stories:
Application story: essentially an "app added" notice, e.g. user A added some app.
Simple story: shown in the feed as a single line of text.
Full story: detailed, with previews of images, video, and so on.
For more on feeds, see the separate technical analysis article on the design of Facebook's feed format.
Apps can send notifications and requests (invitations). Facebook also provides platform-level tools, such as the friend selector, for apps to use. Embedding apps into profiles gives them exposure to more users; for example, a user who finds an app interesting on a friend's profile will often add it immediately.
Through these channels, the platform promotes the spread of apps, the spread of information, and social interaction between people.
3. New business opportunities (New Business Opportunity)
The canvas page can carry any advertising and can also host e-commerce sales, and the app keeps all of the revenue. Facebook supports both models.
http://www.chepoo.com/facebook-platform-architecture-design-2.html
1. Facebook Connect
Even after Facebook opened its platform, the walled-garden problem remained: every user and every piece of content lived inside the facebook website. Facebook Connect deeply integrates Facebook users, friends, and feeds with third-party websites, extending the social graph across the entire Web. Facebook Connect adoption is already very broad; for example, the June 27 Facebook Developer Garage Shanghai showcased many sites built on Facebook Connect, such as citymoments, a nice site for foreigners to share photos of events in Shanghai.
2. Facebook's new designs
Mark introduced many new Facebook designs. For example, an app is no longer confined to the profile box; it can be a standalone profile tab, essentially its own page, which gives app developers much more room to work with.
Facebook also opened up its translation tool, which lets users around the world help translate third-party apps into local languages and vote on the best translation for each entry. Originally built for internationalizing Facebook itself, it has now been opened to third-party developers.

1. Meaningful
a. Social (graph), e.g. Green Patch
b. Useful, e.g. Carpool
c. Expressive, e.g. Graffiti, drawing on a friend's profile
d. Engaging, e.g. in May 2008 users spent 900 million minutes on Playfish.
2. Trustworthy
Safe, trusted
Secure: the more privacy controls the platform provides, the more content users will create.
Respectful
Transparent
3. Well designed
Clean: the Facebook platform itself is genuinely clean, which is to its credit, and it asks the same of apps.
Fast: the faster the site responds, the more users use it.
Robust
The principles boil down to one sentence, "keep the ecosystem safe for users, fair for developers": the goal of platform design is to be safe for users and fair for developers.
http://colobu.com/2015/04/20/What-is-Facebook-s-architecture/
  • The web front end is written in PHP. Facebook's HipHop compiler [1] transforms it into C++ and compiles it with g++, providing a high-performance templating and web logic execution layer.
  • Because of the limits of relying entirely on static compilation, Facebook has begun building a HipHop interpreter [2] as well as the HipHop virtual machine, which translates PHP code into HipHop bytecode [3].
  • Business logic is exposed as services using Thrift [4]. Services may be implemented in PHP, C++, or Java, and occasionally other languages, depending on what the service needs.
  • Services written in Java do not use an enterprise application server; they use Facebook's own custom application server. This may look like reinventing the wheel, but since these services are usually exposed and consumed via Thrift, Tomcat or even Jetty would be too heavyweight.
  • Persistence uses MySQL, Memcached [5], and Hadoop's HBase [6]. Memcached serves as a general look-aside cache in front of MySQL (a sketch of this pattern follows this list).
  • Offline processing uses Hadoop and Hive.
  • Logging, click, and feed data is transported with Scribe [7] and aggregated in HDFS (via Scribe-HDFS [8]), so it can be analyzed at scale with MapReduce.
  • BigPipe [9], a homegrown technology, is used to speed up page rendering with a pipelining model.
  • Varnish Cache [10] is used for HTTP proxying, preferred for its high performance and efficiency.
  • The billions of photos uploaded by users are handled by Haystack, an ad-hoc storage solution developed by Facebook that provides low-level optimizations and append-only writes [12].
  • Facebook Messages runs on its own architecture, notably built on sharding and dynamic cluster management. Business logic and persistence are encapsulated in so-called 'Cells'; each Cell handles a subset of users, and new Cells can be added as the user base grows [13]. Persistence uses HBase [14].
  • The Facebook Messages search engine is built on an inverted index stored in HBase [15].
  • Typeahead search uses a custom storage and retrieval algorithm [16].
  • Chat is based on an epoll server developed in Erlang and accessed via Thrift [17].
  • They have also built an automated system that responds to monitoring alerts by kicking off the appropriate repair workflow, or escalating to humans when the problem cannot be fixed automatically [18].
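A minimal sketch of the look-aside (cache-aside) pattern mentioned in the Memcached bullet above, with a plain dict and a stub query standing in for the memcache and MySQL clients so the sketch stays self-contained:

```python
memcache = {}   # stand-in for a Memcached client

def query_mysql(user_id):
    # Stand-in for a real MySQL query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    row = memcache.get(key)
    if row is not None:          # cache hit: skip the database entirely
        return row
    row = query_mysql(user_id)   # cache miss: read from MySQL...
    memcache[key] = row          # ...and populate the cache for later readers
    return row

print(get_user(42))   # miss -> MySQL -> cached
print(get_user(42))   # hit  -> served from the cache
```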
