https://medium.com/netflix-techblog/netflixs-viewing-data-how-we-know-where-you-are-in-house-of-cards-608dd61077da
Each time a member starts to watch a movie or TV episode, a “view” is created in our data systems and a collection of events describing that view is gathered. Given that viewing is what members spend most of their time doing on Netflix, having a robust and scalable architecture to manage and process this data is critical to the success of our business.
What titles have I watched?
Our system needs to know each member’s entire viewing history for as long as they are subscribed. This data feeds the recommendation algorithms so that a member can find a title for whatever mood they’re in. It also feeds the “recent titles you’ve watched” row in the UI.
What gets watched provides key metrics for the business to measure member engagement and make informed product and content decisions.
Where did I leave off in a given title?
For each movie or TV episode that a member views, Netflix records how much was watched and where the viewer left off. This enables members to continue watching any movie or TV show on the same or another device.
What else is being watched on my account right now?
Sharing an account with other family members usually means everyone gets to enjoy what they like when they’d like. It also means a member may have to have that hard conversation about who has to stop watching if they’ve hit their account’s concurrent screens limit. To support this use case, Netflix’s viewing data system gathers periodic signals throughout each view to determine whether a member is or isn’t still watching.
Current Architecture
Our current architecture evolved from an earlier monolithic database-backed application (see this QCon talk or slideshare for the detailed history).
Current Architecture Diagram
The current architecture’s primary interface is the viewing service, which is segmented into a stateful and stateless tier. The stateful tier has the latest data for all active views stored in memory. Data is partitioned into N stateful nodes by a simple mod N of the member’s account id. When stateful nodes come online they go through a slot selection process to determine which data partition will belong to them. Cassandra is the primary data store for all persistent data. Memcached is layered on top of Cassandra as a guaranteed low latency read path for materialized, but possibly stale, views of the data.
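As a rough illustration of the mod-N partitioning and slot selection described above, a minimal sketch might look like the following; the class and function names are made up for the sketch and are not Netflix's actual code.

```python
# Minimal sketch of routing a member's viewing data to a stateful node by
# account_id mod N. Names and the node count are illustrative assumptions.

NUM_STATEFUL_NODES = 8  # "N" in the text; the real value is a deployment choice


def partition_for(account_id: int, num_nodes: int = NUM_STATEFUL_NODES) -> int:
    """Pick the data partition (slot) that owns this member's active views."""
    return account_id % num_nodes


class StatefulNode:
    """Each stateful node claims one slot at startup ("slot selection") and then
    keeps the latest data for all active views in that partition in memory."""

    def __init__(self, slot: int):
        self.slot = slot
        self.active_views = {}  # account_id -> in-memory view state

    def owns(self, account_id: int) -> bool:
        return partition_for(account_id) == self.slot
```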
We started with a stateful architecture design that favored consistency over availability in the face of network partitions (for background, see the CAP theorem). At that time, we thought that accurate data was better than stale or no data. Also, we were pioneering running Cassandra and memcached in the cloud so starting with a stateful solution allowed us to mitigate risk of failure for those components. The biggest downside of this approach was that failure of a single stateful node would prevent 1/nth of the member base from writing to or reading from their viewing history.
After experiencing outages due to this design, we reworked parts of the system to gracefully degrade and provide limited availability when failures happened. The stateless tier was added later as a pass-through to external data stores. This improved system availability by providing stale data as a fallback mechanism when a stateful node was unreachable.
Breaking Points
Our stateful tier uses a simple sharding technique (account id mod N) that is subject to hot spots, as Netflix viewing usage is not evenly distributed across all current members. Our Cassandra layer is not subject to these hot spots, as it uses consistent hashing with virtual nodes to partition the data. Additionally, when we moved from a single AWS region to running in multiple AWS regions, we had to build a custom mechanism to communicate the state between stateful tiers in different regions. This added significant, undesirable complexity to our overall system.
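For contrast, here is a toy consistent-hash ring with virtual nodes, the scheme the Cassandra layer uses to avoid the hot spots that plain account-id-mod-N sharding suffers from. The class, node names, and virtual-node count are illustrative only.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes_per_node=64):
        # Each physical node gets many points ("virtual nodes") on the ring,
        # which evens out key distribution and means that adding or removing
        # a node only remaps a small slice of the keys.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, account_id: int) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash,
        # wrapping around the ring at the end.
        idx = bisect.bisect(self._points, _hash(str(account_id))) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for(12345))  # e.g. "node-b"
```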
We created the viewing service to encapsulate the domain of viewing data collection, processing, and providing. As that system evolved to include more functionality and various read/write/update use cases, we identified multiple distinct components that were combined into this single unified service. These components would be easier to develop, test, debug, deploy, and operate if they were extracted into their own services.
Memcached offers superb throughput and latency characteristics, but isn’t well suited for our use case. To update the data in memcached, we read the latest data, append a new view entry (if none exists for that movie) or modify an existing entry (moving it to the front of the time-ordered list), and then write the updated data back to memcached. We use an eventually consistent approach to handling multiple writers, accepting that an inconsistent write may happen but will get corrected soon after due to a short cache entry TTL and a periodic cache refresh. For the caching layer, using a technology that natively supports first class data types and operations like append would better meet our needs.
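A minimal sketch of that read-modify-write cycle, assuming a JSON-serialized list and the pymemcache client; both are assumptions, since the post does not say how the cache entries are actually encoded.

```python
import json
import time

from pymemcache.client.base import Client  # client library choice is an assumption

client = Client(("localhost", 11211))
CACHE_TTL_SECONDS = 60  # short TTL so a lost write race self-corrects, as described


def record_view(account_id: int, video_id: int) -> None:
    """Read-modify-write of the time-ordered view list kept in memcached."""
    key = f"views:{account_id}"
    raw = client.get(key)
    views = json.loads(raw) if raw else []

    # Drop any existing entry for this title, then put it at the front
    # (most recently watched first).
    views = [v for v in views if v["video_id"] != video_id]
    views.insert(0, {"video_id": video_id, "updated_at": time.time()})

    # Last writer wins; an occasional inconsistent write is tolerated because
    # the short TTL plus a periodic cache refresh corrects it soon after.
    client.set(key, json.dumps(views), expire=CACHE_TTL_SECONDS)
```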
We created the stateful tier because we wanted the benefit of memory speed for our highest volume read/write use cases. Cassandra was in its pre-1.0 versions and wasn’t running on SSDs in AWS. We thought we could design a simple but robust distributed stateful system exactly suited to our needs, but ended up with a complex solution that was less robust than mature open source technologies. Rather than solve the hard distributed systems problems ourselves, we’d rather build on top of proven solutions like Cassandra, allowing us to focus our attention on solving the problems in our viewing data domain.
Next Generation Architecture
In order to scale to the next order of magnitude, we’re rethinking the fundamentals of our architecture. The principles guiding this redesign are:
Availability over consistency — our primary use cases can tolerate eventually consistent data, so design from the start favoring availability rather than strong consistency in the face of failures.
Microservices — Components that were combined together in the stateful architecture should be separated out into services (components as services).
Components are defined according to their primary purpose — either collection, processing, or data providing.
Delegate responsibility for state management to the persistence tiers, keeping the application tiers stateless.
Decouple communication between components by using signals sent through an event queue.
Polyglot persistence — Use multiple persistence technologies to leverage the strengths of each solution.
Achieve flexibility + performance at the cost of increased complexity.
Use Cassandra for very high volume, low latency writes. A tailored data model and tuned configuration enables low latency for medium volume reads.
Use Redis for very high volume, low latency reads. Redis’s first-class data types should handle our writes better than the read-modify-write cycle we used with memcached (see the sketch after this list).
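As a sketch of why Redis's native data types help here, a sorted set keyed by account lets a single command both insert a new title and “move it to the front” by updating its score, replacing the whole read-modify-write cycle shown earlier. The key layout below is hypothetical, not Netflix's actual data model.

```python
import time

import redis  # redis-py; using it this way is an assumption, not Netflix's design

r = redis.Redis(host="localhost", port=6379)


def record_view(account_id: int, video_id: int) -> None:
    # ZADD inserts the title or updates its timestamp score in one operation,
    # so there is no read-modify-write and no multi-writer race to repair.
    r.zadd(f"views:{account_id}", {str(video_id): time.time()})


def recent_views(account_id: int, limit: int = 20):
    # Most recently watched first, served straight from Redis.
    return r.zrevrange(f"views:{account_id}", 0, limit - 1)
```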
Our next generation architecture will be made up of separate building blocks for collecting, processing, and providing viewing data, following the principles above.
Re-architecting a critical system to scale to the next order of magnitude is a hard problem, requiring many months of development, testing, proving out at scale, and migrating off of the previous architecture.
Architecturally, it is basically a data layer, a service layer, and a front end. Since Netflix is a heavy AWS user, take AWS as the example: store data in S3, together with a relational database and/or a NoSQL database; the service layer handles user authentication, session management, data streaming, and other business logic; the front end is mainly the user interface. Optimizations include how to cache and how to use a CDN to replicate data close to users. Because most of Netflix's data is static and rarely updated, movie and TV content can be replicated many times onto edge servers across the Internet.
There were many follow-ups, for example: how to stop the same account from watching the same title from multiple places at once, how to adjust video quality based on the user's network speed, how to generate trending recommendations, and so on.
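For the concurrent-streams follow-up, one toy approach (not what Netflix actually does) is to track active sessions per account using the periodic playback signals mentioned earlier, expiring any session whose signals stop. Everything below, including the Redis key layout and the limits, is hypothetical.

```python
import time

import redis  # reusing Redis here is purely illustrative

r = redis.Redis()
MAX_CONCURRENT_STREAMS = 2   # per-plan limit; the value is hypothetical
HEARTBEAT_TTL_SECONDS = 60   # a session expires if its periodic signal stops


def heartbeat(account_id: int, session_id: str) -> bool:
    """Called on each periodic playback signal; returns False if over the limit."""
    key = f"streams:{account_id}"
    now = time.time()

    # Keep only sessions that sent a signal within the TTL window.
    r.zremrangebyscore(key, 0, now - HEARTBEAT_TTL_SECONDS)
    active = {member.decode() for member in r.zrange(key, 0, -1)}

    if session_id not in active and len(active) >= MAX_CONCURRENT_STREAMS:
        return False  # ask this device to stop playback

    r.zadd(key, {session_id: now})
    return True
```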
http://baozitraining.org/blog/2014-star-startup-interview-uber/
Honestly, I didn't know this topic well either; I just worked out my own solution step by step from material I had read before and then optimized it, roughly as outlined above.
A CDN edge server is usually a caching proxy server located near the users accessing the data; it improves bandwidth and latency for far-away users while lessening the load on central servers.
In 2006 I interviewed at Microsoft, and the interviewer asked whether I had ever implemented a B-tree. I told him I knew what a B-tree was and that it is used in databases, but I couldn't remember anything beyond that. He then changed the subject. I later learned that my interviewer was James Hamilton, one of the most important experts in databases and distributed systems. A few years afterward I had to implement a B+ tree for Microsoft's Azure (a large B+ tree storing terabytes of data), so now I know a little about B+ trees. Even today, I wouldn't dare tell James Hamilton that I know what a B+ tree is.