The Architecture of Open Source Applications (Volume 2): Scalable Web Architecture and Distributed Systems
Availability
Performance
Reliability: A system needs to be reliable, such that a request for data will consistently return the same data. In the event the data changes or is updated, then that same request should return the new data. Users need to know that if something is written to the system, or stored, it will persist and can be relied on to be in place for future retrieval.
Scalability
Manageability: Designing a system that is easy to operate is another important consideration. The manageability of the system equates to the scalability of operations: maintenance and updates. Things to consider for manageability are the ease of diagnosing and understanding problems when they occur, ease of making updates or modifications, and how simple the system is to operate. (I.e., does it routinely operate without failure or exceptions?)
Cost
services, redundancy, partitions, and handling failure
Services
When considering scalable system design, it helps to decouple functionality and think about each part of the system as its own service with a clearly defined interface.
For these types of systems, each service has its own distinct functional context, and interaction with anything outside of that context takes place through an abstract interface, typically the public-facing API of another service.
Deconstructing a system into a set of complementary services decouples the operation of those pieces from one another.
This abstraction helps establish clear relationships between the service, its underlying environment, and the consumers of that service.
Creating these clear delineations can help isolate problems, but also allows each piece to scale independently of one another.
This sort of service-oriented design for systems is very similar to object-oriented design for programming.
Such a scenario makes it easy to see how longer writes will impact the time it takes to read the images (since the two functions will be competing for shared resources).
Depending on the architecture this effect can be substantial. Even if the upload and download speeds are the same (which is not true of most IP networks, since most are designed for at least a 3:1 download-speed:upload-speed ratio), read files will typically be read from cache, and writes will have to go to disk eventually (and perhaps be written several times in eventually consistent situations). Even if everything is in memory or read from disks (like SSDs), database writes will almost always be slower than reads.
Another potential problem with this design is that a web server like Apache or lighttpd typically has an upper limit on the number of simultaneous connections it can maintain (defaults are around 500, but this can go much higher), and under high traffic writes can quickly consume all of them. Since reads can be asynchronous, or take advantage of other performance optimizations like gzip compression or chunked transfer encoding, the web server can serve reads quickly and switch between clients, serving many more requests per second than the maximum number of connections (with Apache and max connections set to 500, it is not uncommon to serve several thousand read requests per second). Writes, on the other hand, tend to maintain an open connection for the duration of the upload; uploading a 1 MB file could take more than a second on most home networks, so that web server could only handle 500 such simultaneous writes.
Figure 1.2: Splitting out reads and writes
Planning for this sort of bottleneck makes a good case to split out reads and writes of images into their own services, shown in Figure 1.2. This allows us to scale each of them independently (since it is likely we will always do more reading than writing), but also helps clarify what is going on at each point. Finally, this separates future concerns, which would make it easier to troubleshoot and scale a problem like slow reads.
The advantage of this approach is that we are able to solve problems independently of one another—we don't have to worry about writing and retrieving new images in the same context. Both of these services still leverage the global corpus of images, but they are free to optimize their own performance with service-appropriate methods (for example, queuing up requests, or caching popular images—more on this below). And from a maintenance and cost perspective each service can scale independently as needed, which is great because if they were combined and intermingled, one could inadvertently impact the performance of the other as in the scenario discussed above.
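To make the split concrete, here is a minimal Python sketch (names like ImageStore, ImageReadService, and ImageWriteService are hypothetical, not from the chapter) showing reads and writes handled by separate services over the same global corpus of images, so each side can add its own optimizations and scale independently.

```python
# Minimal sketch of splitting image reads and writes into separate services.
# All names here are illustrative placeholders.

class ImageStore:
    """Stand-in for the shared global corpus of images."""

    def __init__(self):
        self._blobs = {}

    def get(self, image_id):
        return self._blobs.get(image_id)

    def put(self, image_id, data):
        self._blobs[image_id] = data


class ImageReadService:
    """Read path: free to cache popular images, batch requests, etc."""

    def __init__(self, store):
        self.store = store
        self._cache = {}

    def read(self, image_id):
        if image_id in self._cache:
            return self._cache[image_id]
        data = self.store.get(image_id)
        if data is not None:
            self._cache[image_id] = data
        return data


class ImageWriteService:
    """Write path: free to queue uploads, write replicas, etc."""

    def __init__(self, store):
        self.store = store

    def write(self, image_id, data):
        self.store.put(image_id, data)


store = ImageStore()
ImageWriteService(store).write("cat.jpg", b"<jpeg bytes>")
print(ImageReadService(store).read("cat.jpg"))
```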
determine the system needs (heavy reads or writes or both, level of concurrency, queries across the data set, ranges, sorts, etc.), benchmark different alternatives, understand how the system will fail, and have a solid plan for when failure happens.
Redundancy
Ensuring that multiple copies or versions are running simultaneously can secure against the failure of a single node.
Creating redundancy in a system can remove single points of failure and provide a backup or spare functionality if needed in a crisis.
For example, in our image server application, all images would have redundant copies on another piece of hardware somewhere (ideally in a different geographic location in the event of a catastrophe like an earthquake or fire in the data center), and the services to access the images would be redundant, all potentially servicing requests. (See Figure 1.3.) (Load balancers are a great way to make this possible, but there is more on that below).
Figure 1.3: Image hosting application with redundancy
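As a rough illustration of the idea (all names below are made up for the example), the following sketch writes every image to more than one store, ideally in different locations, and falls back to a surviving copy on reads.

```python
# Sketch of redundant image storage: writes land on multiple stores,
# reads fall back to a replica if the primary copy is lost.

class RedundantImageStore:
    def __init__(self, stores):
        self.stores = stores          # e.g. [primary, replica]

    def put(self, image_id, data):
        for store in self.stores:     # a real system might replicate asynchronously
            store[image_id] = data

    def get(self, image_id):
        for store in self.stores:     # try each copy; one failed node is not fatal
            if image_id in store:
                return store[image_id]
        return None


primary, replica = {}, {}
images = RedundantImageStore([primary, replica])
images.put("cat.jpg", b"<jpeg bytes>")
del primary["cat.jpg"]                # simulate losing the primary copy
assert images.get("cat.jpg") == b"<jpeg bytes>"
```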
Partitions
When it comes to horizontal scaling, one of the more common techniques is to break up your services into partitions, or shards. The partitions can be distributed such that each logical set of functionality is separate; this could be done by geographic boundaries, or by other criteria like non-paying versus paying users. The advantage of these schemes is that they provide a service or data store with added capacity.
In our image server example, it is possible that the single file server used to store images could be replaced by multiple file servers, each containing its own unique set of images.
Of course there are challenges distributing data or functionality across multiple servers. One of the key issues is data locality; in distributed systems the closer the data to the operation or point of computation, the better the performance of the system. Therefore it is potentially problematic to have data spread across multiple servers, as any time it is needed it may not be local, forcing the servers to perform a costly fetch of the required information across the network.
Another potential issue comes in the form of inconsistency. When there are different services reading and writing from a shared resource, potentially another service or data store, there is the chance for race conditions—where some data is supposed to be updated, but the read happens prior to the update—and in those cases the data is inconsistent.
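A toy illustration of that race, not taken from the chapter: two writers read the same counter before either one updates it, so the later write is based on stale data and one update is silently lost.

```python
# Toy illustration of a lost update caused by a read-before-update race.

store = {"likes": 10}

read_by_a = store["likes"]       # service A reads 10
read_by_b = store["likes"]       # service B also reads 10 (stale once A writes)

store["likes"] = read_by_a + 1   # A writes 11
store["likes"] = read_by_b + 1   # B also writes 11 -- one like is lost

print(store["likes"])            # 11, not the expected 12
```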
scaling access to the data
As they grow, there are two main challenges: scaling access to the app server and to the database. In a highly scalable application design, the app (or web) server is typically minimized and often embodies a shared-nothing architecture. This makes the app server layer of the system horizontally scalable. As a result of this design, the heavy lifting is pushed down the stack to the database server and supporting services; it's at this layer where the real scaling and performance challenges come into play.
Caches
Caches can exist at all levels in architecture, but are often found at the level nearest to the front end, where they are implemented to return data quickly without taxing downstream levels. Common client-side examples include browser caches, cookies, and URL rewriting.
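A minimal sketch of such a front-end cache, assuming a hypothetical fetch_from_origin function standing in for the slower downstream tier: repeated requests are answered from memory, and only misses tax the levels below.

```python
# Minimal front-end cache sketch (LRU eviction); names are illustrative.

from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the least recently used entry


def handle_request(cache, key, fetch_from_origin):
    value = cache.get(key)
    if value is not None:
        return value                  # fast path: served from memory
    value = fetch_from_origin(key)    # slow path: taxes the downstream level
    cache.put(key, value)
    return value
```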
Global Cache
Distributed Cache
Facebook then uses a global cache that is distributed across many servers (see "Scaling memcached at Facebook"), such that one function call accessing the cache can make many requests in parallel for data stored on different Memcached servers. This allows them to get much higher performance and throughput for their user profile data, and to have one central place to update data (which is important, since cache invalidation and maintaining consistency can be challenging when you are running thousands of servers).
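The sketch below is a loose illustration of the idea, not Facebook's or memcached's actual client code: keys are hashed across a set of placeholder cache nodes, and a multi-get is grouped by the node that owns each key. A real client would use consistent hashing and issue the per-node requests in parallel.

```python
# Loose illustration of spreading cache keys across many cache servers.

import hashlib

CACHE_NODES = ["cache-01", "cache-02", "cache-03", "cache-04"]


def node_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return CACHE_NODES[int(digest, 16) % len(CACHE_NODES)]


def multi_get(keys):
    requests_by_node = {}
    for key in keys:
        requests_by_node.setdefault(node_for(key), []).append(key)
    return requests_by_node           # one batched request per cache server


print(multi_get(["user:1:profile", "user:2:profile", "user:3:profile"]))
```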
Typically, proxies are used to filter requests, log requests, or sometimes transform requests (by adding/removing headers, encrypting/decrypting, or compression).
Proxies are also immensely helpful when coordinating requests from multiple servers, providing opportunities to optimize request traffic from a system-wide perspective. One way to use a proxy to speed up data access is to collapse the same (or similar) requests together into one request, and then return the single result to the requesting clients. This is known as collapsed forwarding.
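A minimal sketch of collapsed forwarding, with illustrative names: the first request for a key (the "leader") goes to the origin, while concurrent requests for the same key wait for the leader's result instead of triggering duplicate fetches.

```python
# Minimal collapsed-forwarding sketch; names are placeholders.

import threading


class CollapsingProxy:
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._in_flight = {}                    # key -> (event, result holder)

    def get(self, key):
        with self._lock:
            entry = self._in_flight.get(key)
            is_leader = entry is None
            if is_leader:
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
        done, result = entry
        if is_leader:
            try:
                result["value"] = self._fetch(key)   # the single origin request
            finally:
                with self._lock:
                    del self._in_flight[key]
                done.set()                           # wake all waiting requests
        else:
            done.wait()                              # reuse the leader's result
        return result.get("value")
```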
Another great way to use the proxy is to not just collapse requests for the same data, but also to collapse requests for data that is spatially close together in the origin store (consecutively on disk).
It is worth noting that you can use proxies and caches together, but generally it is best to put the cache in front of the proxy, for the same reason that it is best to let the faster runners start first in a crowded marathon race. Because the cache serves data from memory, it is very fast, and it doesn't mind multiple requests for the same result. But if the cache were located on the other side of the proxy server, then there would be additional latency with every request before the cache, and this could hinder performance.
Indexes
An index makes the trade-offs of increased storage overhead and slower writes (since you must both write the data and update the index) for the benefit of faster reads. Indexes can also be used to create several different views of the same data.
Examples: database indexes; Solr, which uses an inverted index for text search.
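A small sketch of the trade-off, using a toy inverted index in the spirit of the one Solr builds for text search (the functions here are illustrative): writes do extra work maintaining the index so that term lookups become cheap reads.

```python
# Toy inverted index: index maintenance on writes, cheap term lookups on reads.

from collections import defaultdict

documents = {}                       # doc_id -> text
inverted_index = defaultdict(set)    # term -> ids of documents containing it


def add_document(doc_id, text):
    documents[doc_id] = text
    for term in text.lower().split():        # extra work on the write path
        inverted_index[term].add(doc_id)


def search(term):
    return inverted_index.get(term.lower(), set())   # fast read path


add_document(1, "scalable web architecture")
add_document(2, "distributed systems and web services")
print(search("web"))    # {1, 2}
```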
Load Balancers
There are many different algorithms that can be used to service requests, including picking a random node, round robin, or even selecting the node based on certain criteria, such as memory or CPU utilization. Load balancers can be implemented as software or hardware appliances. One open source software load balancer that has received wide adoption is HAProxy.
One of the challenges with load balancers is managing user-session-specific data. One way around this can be to make sessions sticky so that the user is always routed to the same node, but then it is very hard to take advantage of some reliability features like automatic failover. If a system only has a couple of nodes, systems like round robin DNS may make more sense, since load balancers can be expensive and add an unneeded layer of complexity. Load balancers also provide the critical function of being able to test the health of a node, such that if a node is unresponsive or overloaded it can be removed from the pool handling requests, taking advantage of the redundancy of different nodes in your system.
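A minimal sketch of the node-selection part of a software load balancer, assuming placeholder node names and health-check logic (this is not how HAProxy is implemented internally): round robin over the nodes that currently pass a health check.

```python
# Round-robin node selection that skips nodes failing a health check.

import itertools


class LoadBalancer:
    def __init__(self, nodes, is_healthy):
        self.nodes = nodes
        self.is_healthy = is_healthy
        self._ring = itertools.cycle(nodes)

    def pick_node(self):
        for _ in range(len(self.nodes)):      # try each node at most once
            node = next(self._ring)
            if self.is_healthy(node):
                return node
        raise RuntimeError("no healthy nodes available")


down = {"app-02"}                             # pretend this node failed its check
lb = LoadBalancer(["app-01", "app-02", "app-03"],
                  is_healthy=lambda node: node not in down)
print([lb.pick_node() for _ in range(4)])     # app-02 is skipped
```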
Queues
Another important part of scaling the data layer is effective management of writes; achieving performance and availability requires building asynchrony into the system, and a common way to do that is with queues. When clients wait synchronously for each write to complete, performance can degrade severely; the client is forced to wait, effectively performing zero work, until its request can be answered. Queues enable clients to work in an asynchronous manner, providing a strategic abstraction of a client's request and its response.
In an asynchronous system the client requests a task, the service responds with a message acknowledging the task was received, and then the client can periodically check the status of the task, only requesting the result once it has completed. While the client is waiting for an asynchronous request to be completed, it is free to perform other work, even making asynchronous requests of other services.
Queues also provide some protection from service outages and failures. For instance, it is quite easy to create a highly robust queue that can retry service requests that have failed due to transient server failures. It is preferable to use a queue to enforce quality-of-service guarantees rather than to expose clients directly to intermittent service outages, requiring complicated and often inconsistent client-side error handling.
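A minimal sketch of the pattern, with illustrative names: the client enqueues a task and gets an acknowledgement immediately, while a worker drains the queue and re-enqueues tasks that fail transiently.

```python
# Queue-backed asynchronous writes with simple retry handling.

import queue

tasks = queue.Queue()
status = {}                                   # task_id -> pending/done/failed


def submit(task_id, payload):
    status[task_id] = "pending"
    tasks.put((task_id, payload, 0))
    return "accepted"                         # client is now free to do other work


def run_worker(handle, max_retries=3):
    while not tasks.empty():
        task_id, payload, attempts = tasks.get()
        try:
            handle(payload)
            status[task_id] = "done"
        except IOError:
            if attempts + 1 < max_retries:
                tasks.put((task_id, payload, attempts + 1))   # retry later
            else:
                status[task_id] = "failed"


attempt_count = {"n": 0}

def flaky_write(payload):
    attempt_count["n"] += 1
    if attempt_count["n"] == 1:
        raise IOError("transient failure")    # first attempt fails, retry succeeds


submit("upload-42", b"<image bytes>")
run_worker(flaky_write)
print(status["upload-42"])                    # "done" after one retry
```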
Taking our image server as an example, it is possible for the images once stored on a single file server to be redistributed across multiple file servers, each containing its own unique set of images. (See Figure 1.4.) Such an architecture allows the system to write images to whichever file server has room, adding servers as each one fills up, much like adding hard drives. The design does require a naming scheme that ties an image's filename to the server holding it. An image's name could be formed from a consistent hashing scheme that maps across the full set of servers; alternatively, each image could be assigned an incremental ID, so that when a client requests an image, the image retrieval service only needs to keep track of the range of IDs mapped to each server (like an index).
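The two naming schemes described above can be sketched as follows (server names and ID ranges are placeholders): one function maps an image name to a server by hashing, and the other looks up an incremental image ID in the per-server ID ranges kept by the retrieval service.

```python
# Two ways to map an image to its file server: hash of the name, or ID ranges.

import bisect
import hashlib

FILE_SERVERS = ["img-server-1", "img-server-2", "img-server-3"]


def server_by_hash(image_name):
    digest = hashlib.sha1(image_name.encode()).hexdigest()
    return FILE_SERVERS[int(digest, 16) % len(FILE_SERVERS)]


# Upper bound of the image-ID range stored on each server, aligned with
# FILE_SERVERS -- the small "index" the retrieval service would keep.
RANGE_UPPER_BOUNDS = [100_000, 200_000, 300_000]


def server_by_id(image_id):
    return FILE_SERVERS[bisect.bisect_left(RANGE_UPPER_BOUNDS, image_id)]


print(server_by_hash("cat.jpg"))
print(server_by_id(150_000))      # falls in the second range -> img-server-2
```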
Interestingness is what Flickr calls the criteria used for selecting which photos are shown in Explore. All photos are given an Interestingness "score" that can also be used to sort any image search on Flickr. The top 500 photos ranked by Interestingness are shown in Explore. Interestingness rankings are calculated automatically by a secret computer algorithm. The algorithm is often referred to by name as the Interestingness algorithm. Although the algorithm is secret, Flickr has stated that many factors go into calculating Interestingness including: a photo's tags, how many groups the photo is in, views, favorites, where click-throughs are coming from, who comments on a photo and when, and more. The velocity of any of those components is a key factor. For example, getting 20 comments in an hour counts much higher than getting 20 comments in a week.
1. Shards are set up in Active Master-Master Ring Replication
   a) Done by sticking a user to a server in a shard
   b) Shard assignments come from a random number for new accounts (see the sketch after this list)
2. Global Ring (Lookup Ring)
   For data that can't be tied to a single shard
3. PHP logic to connect to the shards and keep the data consistent
4. Search Farm
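Based only on the notes above (not Flickr's actual code), a rough sketch of the idea: a new account is assigned a random shard, and a global lookup ring records the assignment so later requests can find the user's shard.

```python
# Rough sketch of random shard assignment plus a global lookup ring.

import random

SHARDS = ["shard-1", "shard-2", "shard-3", "shard-4"]
lookup_ring = {}                          # user_id -> shard


def assign_new_account(user_id):
    shard = random.choice(SHARDS)         # random shard for new accounts
    lookup_ring[user_id] = shard          # user now "sticks" to this shard
    return shard


def shard_for(user_id):
    return lookup_ring[user_id]           # every later request is routed here


assign_new_account("user-1001")
print(shard_for("user-1001"))
```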
Question: the typical read/write ratio is around 80/20 or 90/10. Focusing on the read side, what are the common optimization techniques?
(Yang) CDN
(Shirley) cache, replica
Question: are replica and duplicate the same thing?
Database duplication generally refers to restoring a physical backup of a database to a different server (preferably using RMAN). That is normally done periodically to refresh lower environments from production.
Database replication generally refers to the process of copying a subset of data from one database to another on an ongoing basis. Replication generally implies that the data is being copied from one production database to another production database (or a test database to a test database, etc.)
Explanation: A federated database system is a type of meta-database management system (DBMS), which transparently maps multiple autonomous database systems into a single federated database.