Sunday, December 13, 2015

How FarmVille Scales To Harvest 75 Million Players A Month



http://highscalability.com/blog/2010/2/8/how-farmville-scales-to-harvest-75-million-players-a-month.html
Making FarmVille scale as a game means accommodating the workload a game generates. A user's state contains a large amount of data with subtle and complex relationships. For example, objects on a farm cannot collide with each other, so if a user places a house on their farm, the backend needs to check that no other object in that user's farm occupies an overlapping space.

Unlike most major sites like Google or Facebook, which are read heavy, FarmVille has an extremely heavy write workload. The ratio of data reads to writes is 3:1, which is an incredibly high write rate. A majority of the requests hitting the FarmVille backend modify the state of the user playing the game in some way. To make this scalable, we have worked to make the application interact primarily with cache components. Additionally, the release of new content and features tends to cause usage spikes, since we are effectively extending the game. The load spikes can be as large as 50% on the day of a new feature's release. We have to be able to accommodate this spiky traffic.
The other piece is making FarmVille scale as the largest application on a web platform, one that is as large as some of the largest websites in the world. Since the game runs inside the Facebook platform, we are very sensitive to the latency and performance variance of the platform.
As a result, we've done a lot of work to mitigate that latency variance: we heavily cache Facebook data and gracefully ratchet back usage of the platform when we see performance degrade. FarmVille has deployed an entire cluster of caching servers for the Facebook platform. The amount of traffic between FarmVille and the Facebook platform is enormous: at peak, roughly 3 Gigabits/sec of traffic go between FarmVille and Facebook while our caching cluster serves another 1.5 Gigabits/sec to the application. Additionally, since performance can be variable, the application has the ability to dynamically turn off any calls back to the platform. We have a dial that we can tweak that turns off incrementally more calls back to the platform. We have additionally worked to make all calls back to the platform avoid blocking the loading of the application itself. The idea here is that, if all else fails, players can continue to at least play the game.
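As a rough illustration of the "dial" idea (not FarmVille's actual code, which runs on a PHP/LAMP stack; the names, priorities, and stub client below are assumptions), each platform call can be tagged with a priority, and the dial value decides how many of them are shed:

    # Hypothetical sketch of a degradation dial: raising the dial disables
    # progressively more calls back to the platform, lowest priority first.
    PLATFORM_CALL_PRIORITY = {
        "publish_feed_story": 1,   # nice-to-have, shed first
        "fetch_friend_list": 2,
        "fetch_profile_info": 3,   # most important, shed last
    }

    # 0 = everything on; 3 = all platform calls off. In practice this value
    # would live in shared config so operators can change it while live.
    degradation_dial = 0

    def platform_call_allowed(name: str) -> bool:
        """True if this call should still go out to the platform."""
        return PLATFORM_CALL_PRIORITY.get(name, 1) > degradation_dial

    class FakePlatformClient:
        """Stand-in for the real platform API client (an assumption)."""
        def publish(self, user_id, story):
            return {"user": user_id, "story": story}

    facebook_api = FakePlatformClient()

    def publish_feed_story(user_id, story):
        # Skipping the call never blocks the game itself; players keep playing.
        if not platform_call_allowed("publish_feed_story"):
            return None
        return facebook_api.publish(user_id, story)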
For any web application, high latency kills your app and highly variable latency eventually kills your app. To address the high latency, FarmVille has worked to put a lot of caching in front of high latency components. Highly variable latency is another challenge as it requires a rethinking of how the application relies on pieces of its architecture which normally have an acceptable latency. Just about every component is susceptible to this variable latency, some more than others. Because of FarmVille's nature, where the workload is very write and transaction heavy, variability in latency has a magnified effect on user experience compared with a traditional web application. The way FarmVille has handled these scenarios is through thinking about every single component as a degradable service.

Memcache, the database, REST APIs, etc. are all treated as degradable services. Services are degraded by rate-limiting errors to that service and by throttling its usage. The key ideas are to keep troubled, highly latent services from causing latency and performance issues elsewhere through error and timeout throttling and, if needed, to disable functionality in the application using on/off switches and feature-level throttles.
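To make the degradable-service idea concrete, here is a minimal sketch (in Python rather than the PHP the game actually uses; the thresholds, window size, and names are illustrative assumptions) of wrapping a dependency so that errors and slow calls throttle further use of it, with a manual on/off switch on top:

    import time

    class DegradableService:
        """Wrap calls to a dependency (memcache, DB, REST API) so a slow or
        failing dependency gets throttled instead of dragging the app down."""

        def __init__(self, name, timeout=0.2, max_errors=50, window=10.0):
            self.name = name
            self.timeout = timeout          # per-call latency budget, seconds
            self.max_errors = max_errors    # failures tolerated per window
            self.window = window            # sliding window, seconds
            self.errors = []                # timestamps of recent failures
            self.enabled = True             # manual on/off switch

        def _too_many_errors(self):
            now = time.time()
            self.errors = [t for t in self.errors if now - t < self.window]
            return len(self.errors) >= self.max_errors

        def call(self, fn, *args, fallback=None):
            # Skip the dependency entirely if switched off or error-throttled.
            if not self.enabled or self._too_many_errors():
                return fallback
            start = time.time()
            try:
                result = fn(*args)
                if time.time() - start > self.timeout:
                    self.errors.append(time.time())   # count slow calls as errors
                return result
            except Exception:
                self.errors.append(time.time())
                return fallback

Each dependency (memcache, the database, each REST API) would get its own instance, so one slow service being throttled never touches the others.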
Lessons Learned
  1. Interactive games are write heavy. Typical web apps read more than they write so many common architectures may not be sufficient. Read heavy apps can often get by with a caching layer in front of a single database. Write heavy apps will need to partition so writes are spread out and/or use an in-memory architecture.
  2. Design every component as a degradable service. Isolate components so increased latencies in one area won't ruin another. Throttle usage to help alleviate problems. Turn off features when necessary.
  3. Cache Facebook data. When you are deeply dependent on an external component consider caching that component's data to improve latency.
  4. Plan ahead for release-related usage spikes.
  5. Sample. When analyzing large streams of data, for example when looking for problems, not every piece of data needs to be processed. Sampling the data can yield the same results for much less work (a small sketch follows this list).
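As a small illustration of lesson 5 (not from the original article; the 1% rate and the event format are assumptions), a stream of events can be sampled and the counts scaled back up:

    import random
    from collections import Counter

    SAMPLE_RATE = 0.01   # examine roughly 1% of events (illustrative)

    def sampled_error_counts(events):
        """Estimate per-type error counts from a large event stream.

        `events` yields (event_type, is_error) pairs; only a random ~1% of the
        errors are counted, then the totals are scaled back up by the rate."""
        counts = Counter()
        for event_type, is_error in events:
            if is_error and random.random() < SAMPLE_RATE:
                counts[event_type] += 1
        return {etype: int(c / SAMPLE_RATE) for etype, c in counts.items()}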

Every component is a degradable service

For any web application, high latency kills your app and highly variable latency eventually kills your app.
Because large web applications depend on all kinds of underlying and internal services, high latency in service calls is the biggest problem for any application, and even more so in the highly competitive SNS app space. The way to solve this is to design every component as a degradable service, including Memcache, the database, REST APIs, and so on, and to isolate any service that may develop high latency. This can be done by controlling call timeouts, and in addition by switches in the application that turn off certain features to limit the impact of a degraded service.
I learned this lesson the hard way myself: I once ran into service instability caused by blocking in modules the service depended on.
1. A socket server used a ThreadPool to handle all of its core business logic.
2. Much of that logic called a remote User Service (RPC) on the internal network to fetch user information.
3. The User Service in turn queried a database.
4. The database sometimes slowed down, and some large queries took more than 10 seconds to complete.
As a result, (4) made many of the calls in (3) take a long time to finish, (3) blocked the RPC calls in (2), and (2) clogged the ThreadPool in (1): new tasks kept arriving while old ones never completed. To end users this looked like many requests simply getting no response; some of them assumed it was a network problem and resubmitted requests by hand, which made the situation even worse. The root cause was failing to realize that a remote service can time out or fail, and treating the remote RPC call as if it were a local call.
Fix #1: add a timeout to the RPC call.
Fix #2: make the RPC call asynchronous.
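A minimal sketch of both fixes (the pool size, timeout, and fetch_user_from_rpc below are hypothetical; the real system was a socket server, not this Python): run the remote call on its own executor and bound how long a worker waits for it, so a slow database behind the RPC can no longer back up the main thread pool:

    from concurrent.futures import ThreadPoolExecutor, TimeoutError

    # A dedicated pool isolates RPC calls from the main request-handling pool.
    rpc_pool = ThreadPoolExecutor(max_workers=20)

    def fetch_user_from_rpc(user_id):
        """Hypothetical stand-in for the real User Service RPC client."""
        return {"id": user_id, "name": "example"}

    def get_user_info(user_id, timeout=0.5):
        """Fetch user info with a hard timeout; degrade instead of blocking."""
        future = rpc_pool.submit(fetch_user_from_rpc, user_id)
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            return {"id": user_id, "name": "unknown"}   # degraded response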
Another distributed-systems heavyweight, James Hamilton, points out that the approach in (2) above is exactly the graceful degradation mode described in his paper Designing and Deploying Internet-Scale Services.

Other FarmVille facts

  • FarmVille is built on a LAMP stack and runs on EC2.
  • The read-to-write ratio is 3:1.
  • Open-source tools are used for operations: Nagios for alerting, Munin for monitoring, and Puppet for configuration management. Many internal tools were also written to monitor the Facebook platform, the databases, Memcache, and so on.
  • Peak traffic to the Facebook APIs reaches 3 Gb/s, while the internal cache serves another 1.5 Gb/s.
  • Traffic can be shifted dynamically between Facebook and the cache: when the Facebook APIs slow down, cached data can be returned directly. The ultimate goal is that, no matter which component fails, players can keep playing the game (a sketch of this fallback follows).
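A sketch of that fallback behaviour (the threshold, the client objects, and the friend-list example are assumptions, not FarmVille internals): track recent platform latency and answer from the cache cluster when the platform looks slow:

    import time

    SLOW_THRESHOLD = 0.3        # seconds; illustrative value
    recent_latencies = []       # rolling record of recent platform call times

    def platform_is_slow():
        sample = recent_latencies[-100:]
        return bool(sample) and sum(sample) / len(sample) > SLOW_THRESHOLD

    def get_friend_list(user_id, cache, platform):
        """Prefer fresh platform data, but serve cached data when it is slow."""
        cached = cache.get(("friends", user_id))
        if platform_is_slow() and cached is not None:
            return cached                        # degraded, but the game goes on
        start = time.time()
        friends = platform.get_friends(user_id)  # hypothetical client call
        recent_latencies.append(time.time() - start)
        cache.set(("friends", user_id), friends)
        return friends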

  1. Degradation capability: under normal or abnormal traffic surges, the business can be degraded within a defined scope; provide manual degradation switches first and automate the degradation later;
  2. Replaceable third-party services: money can buy a solution, but it usually cannot truly solve the problem, because what you buy may be a pit you still have to fill yourself. When using third-party services, keep interchangeable backups: for SMS, for example, integrate two providers, split traffic between them (evenly or by business line) in normal operation, and switch everything to the healthy one when the other fails (see the sketch after this list);
  3. Centralized logging: logs are an essential tool for locating problems. Once the backend runs on more than one machine you can no longer grep box by box; the logs need a central store, and standing up an ELK stack may well cover most of the need;
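A sketch of the interchangeable-provider idea from point 2 (the provider interface, failure threshold, and even split are assumptions): spread traffic across providers and stop routing to one after repeated failures:

    import random

    class SmsSender:
        """Split SMS traffic across interchangeable providers and fail over."""

        def __init__(self, providers, max_failures=5):
            self.providers = list(providers)           # e.g. two provider clients
            self.failures = {p: 0 for p in self.providers}
            self.max_failures = max_failures

        def _healthy(self):
            ok = [p for p in self.providers
                  if self.failures[p] < self.max_failures]
            return ok or self.providers                # if all look bad, try anyway

        def send(self, phone, text):
            provider = random.choice(self._healthy())  # even split while healthy
            try:
                provider.send(phone, text)             # provider API is assumed
                self.failures[provider] = 0            # success resets the count
                return True
            except Exception:
                self.failures[provider] += 1
                return False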

1. Principles and conventions

  • Keep decoupling, layering, static/dynamic separation, and heavy/light separation in mind;
  • Establish development conventions: conventions for code and branch management, and a release process;
  • During development, abstract common operations into components (the single-responsibility principle), e.g. wrap cache operations, database operations, and so on into components as you go (see the sketch after this list);
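As a small illustration of wrapping a common operation into a single-responsibility component (the client interface, key prefix, and TTL here are assumptions), all cache access can go through one small class so key naming and default expiry live in one place:

    class CacheComponent:
        """Single place for cache access: key prefixes and a default TTL."""

        def __init__(self, client, prefix="app", ttl=300):
            self.client = client      # e.g. a memcache/redis client (assumed)
            self.prefix = prefix
            self.ttl = ttl

        def _key(self, key):
            return f"{self.prefix}:{key}"

        def get(self, key):
            return self.client.get(self._key(key))

        def set(self, key, value, ttl=None):
            self.client.set(self._key(key), value, ttl or self.ttl)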

2. Preserve the ability to scale horizontally

  • Keep business servers stateless and manage sessions through memcache or a similar store;
  • Design the database for the capacity expected over a given horizon and do the necessary database and table sharding, e.g. plan for one to two years of growth (a sketch of shard routing follows this list);
  • Cache hot data so that the bulk of requests hits the cache rather than the database;
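A sketch of planning shards up front (the shard counts and naming scheme are made-up examples): hash a stable key such as the user id to pick a database and a table, so writes spread out and capacity can be sized for the planned horizon:

    import zlib

    NUM_DB_SHARDS = 8         # illustrative; sized for the planned 1-2 years
    NUM_TABLE_SHARDS = 16

    def shard_for(user_id: str):
        """Map a user id to a (database, table) pair deterministically."""
        h = zlib.crc32(user_id.encode("utf-8"))
        db = h % NUM_DB_SHARDS
        table = (h // NUM_DB_SHARDS) % NUM_TABLE_SHARDS
        return f"user_db_{db}", f"user_state_{table}"

    # Usage: db_name, table_name = shard_for("some-user-id")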

3. Business isolation

  • Isolate critical business from non-critical business;
  • Isolate the main business system from peripheral systems such as side-channel reporting and log reporting; for HTTP services, guarantee the isolation at least at the domain level;
  • Isolate traffic from different clients: for example, the PC site and the H5 pages can share one codebase while using different domains and different entry points, with the same backend machines behind them;

4. Make good use of open-source wheels

  • As long as current business needs are met and you can genuinely operate them, prefer mature open-source components that have been proven at many companies when doing technology selection, such as nginx, redis, ELK, and so on.

5. Necessary security measures

  • Security is unavoidable for an internet application; filtering for common problems such as XSS, CSRF, and SQL injection should be built into the framework or base components (a small sketch follows this list);
  • Put whatever static content you can onto a CDN: users are served from a nearby edge, which improves access speed, and the backend sees less load;
  • Retain the ability to switch quickly to a cloud anti-DDoS service;
  • Implement rules at the business layer and, together with the web container, provide some degree of protection against CC (HTTP flood) attacks;
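Two of the listed problems lend themselves to a small sketch (sqlite3 stands in for whatever database is actually used; the schema is made up): keep user input out of the SQL text with placeholders, and escape user-supplied text before rendering it as HTML:

    import html
    import sqlite3

    def find_user(conn: sqlite3.Connection, name: str):
        """Parameterized query: user input never becomes part of the SQL text."""
        return conn.execute(
            "SELECT id, name FROM users WHERE name = ?", (name,)
        ).fetchall()

    def render_comment(text: str) -> str:
        """Escape <, >, &, and quotes so injected markup is shown, not executed."""
        return f"<p>{html.escape(text)}</p>"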

6. Backups, backups, backups

  • Machines crash, data centers in different cities catch fire at the same time, fiber gets cut, data gets corrupted: all kinds of improbable things do happen, and that is when backups prove their worth. Back up not only the business databases but also the code, the deployment scripts, and so on;
  • When every misfortune strikes at once and everything we have is gone, we should be able to restore the application quickly to the last known-good backup; in other words, have a disaster-recovery plan, ideally one that has been rehearsed in advance;

