Tuesday, June 30, 2015

Scalability Rules: 50 Principles for Scaling Web Sites

Relax Temporal Constraints
Alleviate temporal constraints in your system whenever possible.
Apply this rule whenever you are considering adding a constraint that an item or object must maintain a certain state between a user's actions.

Why: The difficulty in scaling systems with temporal constraints is significant because of the ACID properties of most RDBMSs.

Key takeaways: Carefully consider the need for constraints such as items being available from the time a user views them until the user purchases them. Some possible edge cases where users are disappointed are much easier to compensate for than not being able to scale.

9. Design for Fault Tolerance and Graceful Failure
Most systems are not designed to fail, yet they do; design for fault tolerance and graceful failure from the start.

Rule 36—Design Using Fault Isolative “Swimlanes” ==> todo
Shard: split data

Rule 37—Never Trust Single Points of Failure
Identify single instances on architectural diagrams. Strive for active/active configurations.
Strive for active/active rather than active/passive solutions. 
Use load balancers to balance traffic across instances of a service. 
Use control services with active/passive instances for patterns that require singletons.

Rule 38—Avoid Putting Systems in Series
Components in series have a multiplicative effect of failure.
When you must put components in series, add multiple instances of each component so that if one fails, others are available to take its place.
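As a quick illustration of the multiplicative effect (a minimal sketch; the 99.9% figure is an assumed per-component availability, not from the book):

    # Three components in series, each assumed to be 99.9% available.
    availability = 0.999 ** 3
    print(round(availability, 4))  # ~0.997, i.e. the chain is only about 99.7% available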


Rule 7—Design to Clone Things (X Axis)
Simply clone services and implement a load balancer.
For databases, ensure the accessing code understands the difference between a read and a write.

Just because the data is out of sync with the primary transactional database by 3 or 30 or 90 seconds doesn’t mean that it isn’t correct, just that there is a chance that it isn’t correct.

MySQL implements replication through the master-slave concept: the master is the primary transactional database that gets written to, and the slaves are read-only copies of the master.
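A minimal sketch of what "the accessing code understands the difference between a read and a write" can look like; the master and replicas objects are hypothetical stand-ins for real database connections:

    import random

    master = "master-connection"                  # stand-in for the writable primary
    replicas = ["replica-1", "replica-2"]         # stand-ins for read-only copies

    def connection_for(sql):
        # Writes must go to the master; reads may go to any replica,
        # accepting that a replica can lag the master by seconds.
        if sql.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return master
        return random.choice(replicas)

    print(connection_for("SELECT id, name FROM products"))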
Rule 8—Design to Split Different Things (Y Axis)
Scale through the separation of distinct and different functions and data, split by nouns, by verbs, or by a combination of both.
Separate sets of data so they can be scaled differently.
Separate code bases so they can be scaled independently.

Rule 9—Design to Split Similar Things (Z Axis): sharding and podding
Take one data set or service and partition it into several pieces.


Rule 15—Firewalls, Firewalls Everywhere!
Use firewalls only when they significantly reduce risk and recognize that they cause issues with scalability and availability.
Firewalls can lower availability and cause unnecessary scalability chokepoints.

Rule 16—Actively Use Log Files
Use tools such as Splunk, Cricket, or Cacti.
Summarize logs over time, then archive and purge them as their value decreases.
Log in an asynchronous fashion.

5. Don’t Duplicate Your Work
Rule 18—Stop Redirecting Traffic
A slightly better way to redirect with HTML is to use the meta tag “refresh” and automatically send the user’s browser to the new page.
<meta http-equiv="Refresh" content="0;url=http://www.akfpartners.com/techblog" />

Alternatively, we can have the server perform the redirect for us with mod_alias or mod_rewrite:
Alias /image /www/html/image
Redirect /service http://foo2.akfpartners.com/service

6. Use Caching Aggressively
Rule 20—Leverage Content Delivery Networks

Rule 21—Use Expires Headers
proxy caches don’t inspect the HTML, they do not abide by these tags at all.

Keep-alives, or HTTP persistent connections, allow for the reuse of TCP connections for multiple HTTP requests.

Rule 22—Cache Ajax Calls
Rule 23—Leverage Page Caches
A page cache (reverse proxy server) is a caching server you install in front of your Web servers to offload requests for both static and dynamic objects from those servers.

ETags are unique identifiers issued by the server for an object at the time of first request by a browser. If the resource on the server side is changed, a new ETag is assigned to it. Assuming appropriate support by the browser (client), the object and its ETag are cached by the browser and subsequent If-None-Match requests by the browser to the Web server will include the tag. If the tag matches, the server may respond with an HTTP 304 Not Modified response. If the tag is inconsistent with that on the server, the server will issue the updated object and its associated ETag.
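A minimal sketch of that conditional-request flow, assuming the third-party requests library and an illustrative URL:

    import requests

    url = "https://example.com/logo.png"
    first = requests.get(url)
    etag = first.headers.get("ETag")              # server issues the tag on first fetch

    # Later revalidation: send the tag back; 304 means the cached copy is still good.
    headers = {"If-None-Match": etag} if etag else {}
    again = requests.get(url, headers=headers)
    if again.status_code == 304:
        print("Not Modified: reuse the cached body")
    else:
        print("Changed: store the new body and its new ETag")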

Rule 24—Utilize Application Caches
Rule 25—Make Use of Object Caches
Rule 26—Put Object Caches on Their Own “Tier”

Chapter 7. Learn from Your Mistakes
Rule 27—Learn Aggressively
Doing something without measuring the results or having an incident without learning from it are wasted opportunities that your competitors are taking advantage of.

Rule 28—Don’t Rely on QA to Find Mistakes
Rule 29—Failing to Design for Rollback Is Designing for Failure ==> to-do
Rule 30—Discuss and Learn from Failures
http://www.cnblogs.com/xing901022/p/4425124.html
This book offers 50 rules for high scalability from several angles; a highly scalable site can grow its architecture freely as the business and its user base expand, easily keeping up with rapid growth. The specifics:

Reduce the Equation

  1 Don't over-engineer the solution
  Over-engineering adds complexity and maintenance cost to a system, while the extra design contributes little in normal use. It is usually functionality the designer considers important or a nice finishing touch, but of little practical value.
  2 Design scale into the solution
  Follow the D-I-D principle: design for 20x capacity, implement for 3x, and deploy for 1.5x, so that you avoid the pain of ad hoc scaling when the project grows.
  3 Simplify the solution three times over
  Follow the Pareto principle: 20% of the design does 80% of the work, so 80% of your time should go into that 20%.
  A product's main functionality is concentrated in a few points; design those points well and everything else is an add-on. Keep the core business simple and easy to use.
  4 Reduce DNS lookups
  Files served from different domains each require a DNS lookup. For example, cnblogs.com and i.cnblogs.com are different domains, so loading from both triggers two lookups. At high traffic volume this has a measurable impact.
  5 Reduce objects where possible
  The browser must load every object, so reduce the number of requested files (which interacts with the browser's concurrent-connection limit) and merge objects where you reasonably can, for example combining icons into one large sprite image. A sensible file count speeds up page loading.
  6 Use homogeneous networks
  An HTTP request may traverse many physical devices, such as load balancers, switches, and routers. Using equipment from the same vendor as much as possible avoids unexpected interoperability problems.

Distribute Your Work

  7 X axis: horizontal duplication
  The simplest way to expand capacity, via cloning or replication, for example running your application on multiple servers. Common forms include clusters, load balancing, and database read/write splitting.
  8 Y axis: split different things
  In a large system, split distinct functions apart, for example registration, purchase, search, cloud storage, and so on.
  9 Z axis: split similar things
  For example, split by user tier or by the user's geographic location.

Design to Scale Out Horizontally

  10 Design a scale-out solution
  Scaling can be horizontal or vertical. Horizontal scaling replicates and clones the application across a cluster of small machines; vertical scaling upgrades a server's hardware and network gear.
  Case after case shows that vertical scaling by simply upgrading hardware relieves only a little of the real pressure, whereas horizontal scaling through clusters can grow and shrink freely.
  11 Use commodity systems
  In the same spirit as the previous rule, buying expensive servers does not guarantee good performance later. Use clusters of ordinary commodity machines instead.
  12 Scale out your data centers
  There are many data center designs, for example:
  Hot/cold site: the hot site serves traffic, and when it fails the cold site takes over.
  The recommended approach is multiple live sites: cheaper and dynamically routable, at the cost of more operational complexity.
  13 Design to leverage the cloud
  The advantage of cloud computing is virtualization: you can elastically add capacity at business peaks and hand it back for day-to-day load.
  The drawback is tighter coupling between the application and the virtual environment. As discussed later about isolating workloads on physical devices, virtualized cloud environments can make fault isolation and troubleshooting harder.

Use the Right Tools

  14 Use databases appropriately
  There are many database options today: traditional relational databases such as Oracle and MySQL, newer NoSQL stores such as MongoDB, in-memory databases such as FastDB, and SSD-oriented stores such as Aerospike.
  When it comes to selection, though, base the decision on your own business needs: whether your database must optimize for speed, for safety, and so on.
  15 Firewalls, firewalls everywhere
  Firewalls can intercept and filter invalid access. Typically CSS, static files, images, and JS are not put behind a firewall, while key flows that touch personal information are. Firewall placement also affects the site's performance, so design it deliberately.
  16 Actively use log files
  Use logs and tooling to monitor the business in real time, not just server memory and CPU but business-level data as well, for example Splunk (log collection, storage, search, and visualization).

Don't Duplicate Your Work

  17 Don't check work you have just done
  For example, don't read data back immediately after writing it. Some customers do need a guarantee that data is complete and never lost, but that can be handled with logs and similar records; the write-then-immediately-read pattern is still not recommended.
  18 Stop redirecting traffic
  Redirects cost latency and compute resources; avoid them whenever possible.
  19 Relax temporal constraints
  Most relational databases emphasize the ACID properties, which gets in the way of scaling. Relaxing temporal constraints where the business allows can therefore improve the site's scalability.
  For example, one travel site has the hotel confirm a booking after the user submits it, and one marketplace confirms a withdrawal within a window of time. Both widen the temporal constraint and thereby improve scalability and transaction safety.

Use Caching Aggressively

  20 Leverage CDNs
  A CDN can hold your customers' data and content. Roughly, when a user visits the site, the request is directed to the CDN; the CDN performs the DNS lookup and spreads user requests across its servers. Many CDN vendors provide this service.
  21 Use Expires headers
  Use expiration headers tuned to each object type to reduce repeated object requests. Common HTTP cache-control values include public, no-cache, and max-age.
  22 Cache Ajax calls
  Set the Last-Modified, Cache-Control, and Expires HTTP headers correctly.
  23 Leverage page caches
  Cache responses to previous dynamic requests to lower the load on the web servers.
  24 Utilize application caches
  For example, cache the request data of certain special (heavy) users.
  25 Make use of object caches
  Suitable for data objects that are queried over and over, for example a shopping site caching its best-selling product data.
  26 Put object caches on their own tier
  A separate caching layer is easier to scale and maintain.

Learn from Your Mistakes

  27 Learn aggressively
  Only a company with a learning culture produces better and better products. The learning covers both the customers' business domain and the technology and operations side.
  28 Don't rely on QA to find mistakes
  The main reason to hire testing or quality-assurance staff is to verify the product's correctness. QA lowers cost and speeds up developers, because developers do not have to watch code correctness at every moment and can hand testing to QA.
  But QA only finds problems; avoiding them is still up to the developers.
  29 Failing to design for rollback is designing for failure
  Rollback here means rolling back a product release. If a release ships with a serious bug, you may need to deliver the previously working version; with no rollback, you cannot deliver at all.
  Studying continuous integration and delivery is recommended here.
  30 Discuss and learn from failures
  You should never fail twice on the same problem, so a retrospective after every failure is indispensable.

Database Rules

  The ACID properties of a relational database:
  Atomicity: a transaction either executes completely or not at all.
  Consistency: all data must be in a consistent state when a transaction starts and when it ends.
  Isolation: a transaction behaves as if it were the only operation the database is performing.
  Durability: once a transaction completes, its effects cannot be undone.
  31 Be aware of costly relationships
  Table structures should be fully worked out in the design phase; adding columns after development has started can be very expensive.
  32 Use the right type of database lock
  Databases have many lock concepts: implicit locks, explicit locks, row locks, page locks, range locks, table locks, database locks, and so on.
  Using locks inappropriately hurts the site's throughput.
  33 Don't use multiphase commits
  For example two-phase commit: vote first, then commit. It reduces scalability because nothing else can proceed until the transaction finishes committing.
  34 Try not to use SELECT FOR UPDATE
  The FOR UPDATE clause locks the selected rows and slows down transaction processing.
  35 Don't select everything
  For example select * from xxx;
  First, it doesn't accommodate data growth: if there are four columns and the business code is written against exactly those, adding a column breaks it. Second, it fetches data you don't need.
  Likewise insert into xxx values(xxxx);
  also breaks when the column list no longer matches.

Design for Fault Tolerance and Failure Control

  36 Design using fault-isolative "swim lanes"
  Services and data can be partitioned in many ways: containers, clusters, pools, shards, swim lanes. A swim lane means each business area has its own domain and calls never cross lanes.
  37 Never trust single points of failure
  Many systems are designed around a single-instance module; when the whole system depends on that module, a single failure there brings the whole system down.
  38 Avoid putting systems in series
  For example, if a system is composed of components that are each 99.9% available, chaining three of them in series drops the overall availability to roughly 99.7% (0.999^3).
  39 Ensure that you can enable and disable features
  Shared libraries and third-party services should come with a switch to turn them on or off.

Avoid or Distribute State

  40 Strive for statelessness
  Keeping state limits scalability and increases cost.
  41 Maintain sessions in the browser when possible
  This both reduces pressure on the servers and lets any request be sent to any server.
  42 Use a distributed cache for state
  Use a separate caching layer, which is easy to scale. There are many distributed caching options, such as memcached.

Asynchronous Communication and Message Buses

  43 Communicate asynchronously as much as possible
  Asynchronous communication keeps each service and tier independent, which improves the system's scalability and reduces coupling.
  44 Ensure that your message bus can scale
  Prefer Y-axis or Z-axis scaling, that is, splitting by business need and function. Pure replication or cloning only multiplies the number of subscriptions every consumer must listen to, whereas isolating by business separates the load.
  45 Avoid overcrowding your message bus
  Weigh the value of each message against its cost.

Other Rules

  46 Be wary of scaling through third parties
  When a company runs into trouble, a third party can put out the immediate fire, but it is not a long-term answer: a vendor has many customers, and your crisis is not their crisis, so you cannot expect them to give their all at the critical moment. The company should keep real control in its own hands.
  47 Purge, archive, and cost-justify storage
  Data with no value should be deleted on a regular schedule. Data of marginal value should be archived periodically or simply deleted. Truly valuable data should be backed up and kept quickly accessible.
  48 Remove business intelligence from transaction processing
  Separate the product (transaction) systems from the business-analysis systems to keep the product scalable,
  and to keep the system architecture from constraining the business as it grows.
  49 Design your application to be monitored
  Design a global monitoring strategy that can answer:
  "Is there a problem?"
  "Where is the problem?"
  "What is the problem?"
  "Will there be a problem?"
  "Can it fix itself automatically?"
  50 Be competent
  Bring the best architecture you can to every design rather than depending entirely on third-party solutions.
  A simple, excellent architecture is small and lean; relying purely on open-source pieces to assemble one may solve the immediate problem but leaves the application bloated.
Scalability Rules: 50 Principles for Scaling Web Sites
50 Scalability Rules in Brief
Rule 1—Don’t Overengineer the Solution
Rule 2—Design Scale into the Solution (D-I-D Process)
Rule 3—Simplify the Solution Three Times Over
Rule 4—Reduce DNS Lookups
Rule 5—Reduce Objects Where Possible
Rule 6—Use Homogeneous Networks
Rule 7—Design to Clone or Replicate Things (X Axis)
Rule 8—Design to Split Different Things (Y Axis)
Rule 9—Design to Split Similar Things (Z Axis)
Rule 10—Design Your Solution to Scale Out, Not Just Up
Rule 11—Use Commodity Systems (Goldfish Not Thoroughbreds)
Rule 12—Scale Out Your Hosting Solution
Rule 13—Design to Leverage the Cloud
Rule 14—Use Databases Appropriately
Rule 15—Firewalls, Firewalls Everywhere!
Rule 16—Actively Use Log Files
Rule 17—Don’t Check Your Work
Rule 18—Stop Redirecting Traffic
Rule 19—Relax Temporal Constraints
Rule 20—Leverage Content Delivery Networks
Rule 21—Use Expires Headers
Rule 22—Cache Ajax Calls
Rule 23—Leverage Page Caches
Rule 24—Utilize Application Caches
Rule 25—Make Use of Object Caches
Rule 26—Put Object Caches on Their Own “Tier”
Rule 27—Learn Aggressively
Rule 28—Don’t Rely on QA to Find Mistakes
Rule 29—Failing to Design for Rollback Is Designing for Failure
Rule 30—Remove Business Intelligence from Transaction Processing
Rule 31—Be Aware of Costly Relationships
Rule 32—Use the Right Type of Database Lock
Rule 33—Pass on Using Multiphase Commits
Rule 34—Try Not to Use Select for Update
Rule 35—Don’t Select Everything
Rule 36—Design Using Fault-Isolative “Swim Lanes”
Rule 37—Never Trust Single Points of Failure
Rule 38—Avoid Putting Systems in Series
Rule 39—Ensure That You Can Wire On and Off Features
Rule 40—Strive for Statelessness
Rule 41—Maintain Sessions in the Browser When Possible
Rule 42—Make Use of a Distributed Cache for States
Rule 43—Communicate Asynchronously as Much as Possible
Rule 44—Ensure That Your Message Bus Can Scale
Rule 45—Avoid Overcrowding Your Message Bus
Rule 46—Be Wary of Scaling through Third Parties
Rule 47—Purge, Archive, and Cost-Justify Storage
Rule 48—Partition Inductive, Deductive, Batch, and User Interaction (OLTP) Workloads
Rule 49—Design Your Application to Be Monitored
Rule 50—Be Competent

Very High—1
Rule 19 Relax Temporal Constraints
Rule 25 Make Use of Object Caches
Rule 29 Failing to Design for Rollback Is Designing for Failure
Rule 32 Use the Right Type of Database Lock
Rule 35 Don’t Select Everything
Rule 46 Be Wary of Scaling through Third Parties
Rule 50 Be Competent

High—2
Rule 1 Don’t Overengineer the Solution
Rule 7 Design to Clone or Replicate Things (X Axis)
Rule 10 Design Your Solution to Scale Out, Not Just Up
Rule 11 Use Commodity Systems (Goldfish Not Thoroughbreds)
Rule 14 Use Databases Appropriately
Rule 15 Firewalls, Firewalls Everywhere!
Rule 22 Cache Ajax Calls
Rule 26 Put Object Caches on Their Own “Tier”
Rule 27 Learn Aggressively
Rule 28 Don’t Rely on QA to Find Mistakes
Rule 30 Remove Business Intelligence from Transaction Processing
Rule 33 Pass on Using Multiphase Commits
Rule 34 Try Not to Use Select for Update
Rule 37 Never Trust Single Points of Failure
Rule 41 Maintain Sessions in the Browser When Possible
Rule 42 Make Use of a Distributed Cache for States
Rule 43 Communicate Asynchronously as Much as Possible
Rule 44 Ensure That Your Message Bus Can Scale
Rule 45 Avoid Overcrowding Your Message Bus
Rule 48 Partition Inductive, Deductive, Batch, and User Interaction (OLTP) Workloads
Rule 49 Design Your Application to Be Monitored

Medium—3
Rule 2 Design Scale into the Solution (D-I-D Process)
Rule 3 Simplify the Solution Three Times Over
Rule 4 Reduce DNS Lookups
Rule 5 Reduce Objects Where Possible
Rule 6 Use Homogeneous Networks
Rule 8 Design to Split Different Things (Y Axis)
Rule 9 Design to Split Similar Things (Z Axis)
Rule 12 Scale Out Your Hosting Solution
Rule 16 Actively Use Log Files
Rule 18 Stop Redirecting Traffic
Rule 20 Leverage Content Delivery Networks
Rule 21 Use Expires Headers
Rule 23 Leverage Page Caches
Rule 24 Utilize Application Caches
Rule 31 Be Aware of Costly Relationships
Rule 36 Design Using Fault-Isolative “Swim Lanes”
Rule 38 Avoid Putting Systems in Series
Rule 40 Strive for Statelessness
Rule 47 Purge, Archive, and Cost-Justify Storage

Low—4
Rule 13 Design to Leverage the Cloud
Rule 17 Don’t Check Your Work
Rule 39 Ensure That You Can Wire On and Off Features
Read full book from Scalability Rules: 50 Principles for Scaling Web Sites

AKF’s Most Commonly Adopted Architectural Principles



https://www.puncsky.com/notes/41-how-to-scale-a-web-service
AKF Scale Cube
  1. Horizontal Duplication and Cloning (X-Axis ). Having a farm of identical and preferably stateless instances behind a load balancer or reverse proxy. Therefore, every request can be served by any of those hosts and there will be no single point of failure.
  2. Functional Decomposition and Segmentation - Microservices (Y-Axis). e.g. auth service, user profile service, photo service, etc
  3. Horizontal Data Partitioning - Shards (Z-Axis). Replicate the whole stack to different “pods”. Each pod can target a specific large group of users. For example, Uber had China and US data centers. Each datacenter might have different “pods” for different regions.

https://akfpartners.com/growth-blog/akf-scale-cube-ze-case-for-z-axis
The AKF Scale Cube has three axes by which a website can be scaled.  X axis, or horizontal duplication, is a common first choice.  Duplicating web and app servers into load balanced pools is a best practice that provides the ability to conduct maintenance or roll code without taking the site down as well as scalability if the pools are N+1 or even N+2.

The Y axis is a service split - decomposing a monolithic code base into smaller services that can run independently.  The Y axis also allows you to scale your technology team - teams focus on the services for which they are responsible and no longer need complete domain expertise.

The third axis is Z, a lookup oriented split.  A common choice is geographic location or customer identity.  A Z axis split takes a monolithic stack and slices it into N smaller stacks, using fewer or smaller servers and working with 1/Nth of the data.

https://akfpartners.com/growth-blog/scale-cube
Scaling Solutions with the Z Axis of the Scale Cube

Whereas the Y axis addresses the splitting of dissimilar things (often along noun or verb boundaries), the Z-axis addresses segmentation of “similar” things.  Examples may include splitting customers along an unbiased modulus of customer_id, or along a somewhat biased (but beneficial for response time) geographic boundary.  Product catalogs may be split by SKU, and content may be split by content_id.  Z-axis scaling, like all of the axes, improves the solution’s transactional scalability and, if fault isolated, its availability. Because the software deployed to servers is essentially the same in each Z axis shard (but the data is distinct) there is no increase in organizational scalability.  Cache hit rates often go up with smaller data sets, and operational costs generally go down as commodity servers or smaller IaaS instances can be used.

AKF’s Most Commonly Adopted Architectural Principles
N + 1 Design
Have at least one additional instance of each system so that service continues in the event of a failure.

Design for Rollback
Design to Be Disabled
This will give you additional time to “fix forward” or ensure that your system doesn’t go down as a result of a bug that introduces strange, out-of-bounds demand characteristics to the system.

Design to Be Monitored
Identify services that are alive or dead, examine or poll log files, collect system-related data over time, and evaluate end-user response times.
Applications and systems should be designed from the ground up to be, if not self-healing, then at least self-diagnosing.

Significant standard deviations from the mean could be “alerted” for future or immediate action depending on the value. This approach leverages a control chart from statistical process control.
Rates of errors, the response time from other services

Design for Multiple Live Sites
Ensuring that the business environment in which you operate, including contracts, partners, and facilities, is also scalable.

Use Mature Technologies

Asynchronous Design

Stateless Systems
consider storing state with the end user rather than within your system. If that is not possible, consider a centralized state caching mechanism that keeps state data off of the application servers and allows for its distribution across multiple servers.
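A minimal sketch of keeping state with the end user: sign the session data so any application server can verify it without shared server-side state (the secret and payload here are illustrative):

    import base64, hashlib, hmac, json

    SECRET = b"rotate-me"                             # assumed secret shared by all app servers

    def encode_session(data):
        payload = base64.urlsafe_b64encode(json.dumps(data).encode())
        sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
        return payload.decode() + "." + sig           # store this string in a browser cookie

    def decode_session(cookie):
        payload, sig = cookie.rsplit(".", 1)
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return None                               # tampered cookie: reject
        return json.loads(base64.urlsafe_b64decode(payload))

    cookie = encode_session({"user_id": 42, "cart": [7, 9]})
    print(decode_session(cookie))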

Scale Out, Not Up
Design for at Least Two Axes of Scale

Buy When Non-Core
Use Commodity Hardware
Build Small, Release Small, Fail Fast
Isolate Faults
Segment your product such that the failure of a service or subservice does not affect multiple other services. Alternatively, segment your customers such that failures for some customers do not affect failures for your customer base in its entirety.

Automation over People

SMART guidelines: Specific, Measurable, Achievable, Realistic, Testable

Pragmatic Programming Techniques: Scalable System Design

"Scalability" is not equivalent to "Raw Performance"
Understand the environmental workload conditions that the system is designed for
Dimension of growth and growth rate: e.g. Number of users, Transaction volume, Data volume
Measurement and their target: e.g. Response time, Throughput

Understand who your priority customers are
Rank the importance of traffic so you know what to sacrifice in case you cannot handle all of it

Scale out and Not scale up
Keep your code modular and simple

Don't guess the bottleneck, Measure it
Bottlenecks are slow code paths that are frequently executed. Don't optimize slow code that is rarely executed
Write performance unit tests so you can collect fine-grained performance data at the component level
Setup a performance lab so you can conduct end-to-end performance improvement measurement easily
Plan for growth
Do regular capacity planning. Collect usage statistics, predict the growth rate

Common Techniques
Server Farm (real time access)
Incoming requests will be dispatched by the load balancer to different machines and hence the workload is spread and shared across the servers in the farm.

Data Partitioning
By nature, data is stateful, so there must be a deterministic mechanism to dispatch a data request to the server that hosts that data
The data partitioning mechanism also needs to take the data access pattern into consideration. Data that needs to be accessed together should stay on the same server. A more sophisticated approach can migrate data continuously as the data access pattern shifts.
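A minimal sketch of deterministic dispatch by partition key, assuming a fixed set of shard names:

    import hashlib

    SHARDS = ["db-0", "db-1", "db-2", "db-3"]         # assumed shard names

    def shard_for(partition_key):
        # The same key always maps to the same shard, so data that is accessed
        # together stays together when the key reflects the access pattern.
        digest = hashlib.md5(partition_key.encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    print(shard_for("user:42"))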

Map / Reduce (Batch Parallel Processing)

Content Delivery Network (Static Cache)
This is common for static media content. The idea is to create many copies of contents that are distributed geographically across servers.
User requests will be routed to the server replica in closest proximity

Cache Engine (Dynamic Cache)
This is a time vs. space tradeoff. Some executions may use the same set of input parameters over and over again. Therefore, instead of redoing the same execution for the same input parameters, we can remember the previous execution's result.
==> Sometimes the inputs differ but share many common parameters; we can query the data source using the common parameters, cache that result, and do the remaining filtering in code ==> if we can.
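A minimal memoization sketch of that time-for-space trade, using functools.lru_cache from the standard library (the query itself is simulated):

    from functools import lru_cache

    @lru_cache(maxsize=1024)
    def query_by_common_params(common_params):
        # Pretend this is a slow query keyed only by the shared parameters;
        # request-specific filtering then happens in code on the cached result.
        return tuple(x for x in range(1000) if x % 7 == 0)

    rows = query_by_common_params("category=books")   # first call does the work
    rows = query_by_common_params("category=books")   # repeat call is served from cache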

Resources Pool
DB sessions and TCP connections are expensive to create, so reuse them across multiple requests
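A minimal resource-pool sketch using the standard library's queue; the factory argument stands in for whatever creates a real DB session or TCP connection:

    import queue

    class ResourcePool:
        def __init__(self, factory, size):
            self._items = queue.Queue(maxsize=size)
            for _ in range(size):
                self._items.put(factory())            # pay the creation cost once, up front

        def acquire(self):
            return self._items.get()                  # blocks until an item is free

        def release(self, item):
            self._items.put(item)

    pool = ResourcePool(factory=object, size=10)      # object() stands in for a connection
    conn = pool.acquire()
    pool.release(conn)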
Calculate an approximate result
Instead of calculating an exact answer, see if you can trade off some accuracy for speed.
In real life, some degree of inaccuracy is usually tolerable
Filtering at the source
Try to do more processing upstream (where the data gets generated) than downstream, because it reduces the amount of data being propagated

Asynchronous Processing
In callback mode, the caller needs to provide a response handler when making the call. The call itself returns immediately, before the actual work is done on the server side. When the work is done later, the response comes back on a separate thread, which executes the previously registered response handler. Some coordination may be required between the calling thread and the callback thread.

In polling mode, the call itself returns a "future" handle immediately. The caller can go off doing other things and later poll the "future" handle to see if the response is ready. In this model, no extra thread is created, so no extra thread coordination is needed.
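A minimal sketch of both styles using the standard library's concurrent.futures; slow_call stands in for the remote request:

    from concurrent.futures import ThreadPoolExecutor

    def slow_call(x):
        return x * 2                                  # stand-in for work done on the server side

    with ThreadPoolExecutor(max_workers=4) as pool:
        future = pool.submit(slow_call, 21)           # returns a "future" handle immediately
        future.add_done_callback(                     # callback mode: handler runs when done
            lambda f: print("callback saw", f.result()))
        # ... the caller goes off and does other things ...
        print("polled, done yet?", future.done())     # polling mode: check the handle
        print("result:", future.result())             # result() blocks until the work finishes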

Implementation design considerations
Use efficient algorithms and data structure.

Analyze your concurrent access scenarios when multiple threads access shared data. Carefully analyze the synchronization scenario and make sure the locking is fine-grained enough. Also watch for any possibility of deadlock and how you detect or prevent it. A wrong concurrent access model can have a huge impact on your system's scalability. Also consider using lock-free data structures (e.g. Java's concurrent package has a couple of them).

Analyze the memory usage patterns in your logic. Determine where new objects are created and where they are eligible for garbage collection. Be aware of the creation of a lot of short-lived temporary objects as they will put a high load on the Garbage Collector.
However, never trade off code readability for performance. (e.g. Don't try to bundle too much logic into a single method). Let the VM handle this execution for you.

http://highscalability.com/blog/2008/2/13/whats-your-scalability-plan.html

  • Move MySQL to a separate server. This frees up resources (CPU, disk, memory). What you want to run on this server depends on its capabilities. Maybe run a memcached server on it.
  • Move to a distributed memory cache using memcached.
  • Add a MySQL master/slave configuration.
  • If more web servers are needed, use LVS on the front end as a load balancer.
  • Read full article from Pragmatic Programming Techniques: Scalable System Design

    Monday, June 29, 2015

    Design a web crawler



    https://blog.scrapinghub.com/2015/08/05/distributed-frontera-web-crawling-at-large-scale
    Hubs are web pages that contain a large number of outgoing links to authority sites. For example, Reddit, the DMOZ Directory and Hacker News are hubs because they point to a lot of pages with authoritative content. Authorities are sites that obtain a high authority score, determined by the relevant information on the page and its authoritative sources

    Storing the spider log in Kafka allowed us to replay the log out of the box, which can be useful when changing crawling strategy on the fly. We were able to set up partitions by domain name, which made it easier to ensure that each domain was downloaded at most by one spider.
    We used the consumer offset feature to track the position of the spider and provide new messages slightly ahead of time in order to prevent topic overload by deciding if the spider was ready to consume more.



    The main advantage of this design is real-time operation. The crawling strategy can be changed without having to stop the crawl. It’s worth mentioning that the crawling strategy can be implemented as a separate module. This module would contain the logic for ordering URLs and stopping crawls, as well as the scoring model.
    Distributed Frontera is polite to web hosts by design because each host is downloaded by only one spider process. This is achieved by Kafka topic partitioning. All distributed Frontera components are written in Python. Python is much simpler to write than C++ or Java, the two most common languages for large-scale web crawlers.
    The crawler dequeues the URL JSON from the queue and checks whether the URL is already cached in the Couchbase URL bucket or has to be re-extracted. If it is not available, the page is fetched via the proxy servers.
    Once a page is available, the raw page is stored in AWS S3 and an entry is made in the URL bucket. The extractor then extracts two things: a set of documents and the next set of URLs. The data is stored in the Couchbase data bucket, and each of the next URLs is checked to see whether it has to be crawled. The next URLs can also carry required headers and a data map for the next level. They are then added to the same queue.


    Read full article from Algo Ramblings: Design a web crawler
    TODO:http://www.ijcaonline.org/volume15/number7/pxc3872629.pdf
    http://www.slideshare.net/gnap/design-and-implementation-of-a-high-performance-distributed-web-crawler
    https://benbernardblog.com/the-tale-of-creating-a-distributed-web-crawler/
    Given a seed URL, the crawler needed to auto-discover the value of the missing fields for a particular record. So if a web page didn't contain the information that I was looking for, the crawler needed to follow outbound links, until the information was found.
    It needed to be some kind of crawler-scraper hybrid, because it had to simultaneously follow outbound links and extract specific information from web pages.

    The scraped data needed to be stored somewhere, most likely in a database.
    The main components were:
    1. A crawler dispatcher, responsible for dispatching URLs to be crawled to the m crawler supervisors, and for collecting results (fields) from them.
    2. m crawler supervisors, responsible for supervising n child processes. Those child processes would perform the actual crawling. I'll refer to them as crawlers for convenience.
    3. A database server, responsible for storing the initial seed URLs as well as the extracted fields.

    http://www.1point3acres.com/bbs/thread-148865-1-1.html
    4. Pirate. Design Wikipedia crawler.
                      followup 1: No global status.
                      followup 2: deal with machine failure
                      followup 3: make the wiki site unaware of this crawler.
    1. distributed bfs
    2. consistent hashing
    3. assign task with a random delay

    Algo Ramblings: Design a web crawler
    Crawler basic algorithm
    1. Remove a URL from the unvisited URL list
    2. Determine the IP Address of its host name
    3. Download the corresponding document
    4. Extract any links contained in it.
    5. If the URL is new, add it to the list of unvisited URLs
    6. Process the downloaded document
    7. Back to step 1
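    A minimal single-threaded sketch of that loop, assuming the third-party requests and beautifulsoup4 packages; politeness, robots.txt, and error handling are mostly omitted:

        import urllib.parse
        import requests
        from bs4 import BeautifulSoup

        def crawl(seed, limit=50):
            unvisited, seen = [seed], {seed}
            while unvisited and len(seen) <= limit:
                url = unvisited.pop(0)                     # 1. remove a URL from the list
                try:
                    page = requests.get(url, timeout=5)    # 2-3. resolve the host and download
                except requests.RequestException:
                    continue
                soup = BeautifulSoup(page.text, "html.parser")
                for a in soup.find_all("a", href=True):    # 4. extract links
                    link = urllib.parse.urljoin(url, a["href"])
                    if link not in seen:                   # 5. add new URLs to the list
                        seen.add(link)
                        unvisited.append(link)
                print(url, len(page.text))                 # 6. "process" the document
                                                           # 7. loop back to step 1

        crawl("https://example.com")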
    Issues on crawler

    General architecture
    What pages should the crawler download ?
    How should the crawler refresh pages ?
    How should the load on the visited web sites be minimized ?
    How should the crawling process be parallelized ?

    http://baozitraining.org/blog/design-a-basic-web-crawler/
    the internet could be abstracted as a directed graph, with each page as a node and hyperlinks as an edge.

    How to crawl?
    BFS is normally used.

    Decide what you want to crawl?
    Scope: crawl whole internet or just company web site.
    Having a few websites in your crawler’s most frequent list to crawl, such as some authoritative news website, etc

    You should have lots of fetchers living on many host classes. Use machine learning to predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.

    How to track what your fetchers have crawled?
    You don’t want your fetchers to crawl some websites over and over again while other websites don’t get crawled at all. There are many ways to achieve this. For example, your scheduler should generate non-duplicate jobs for fetchers. Or,
    your fetchers could keep track of the last visited time for each URL. Note, due to the scale, a consistent hash map is needed.

    Respect the standard?
    If a crawler sees a robots.txt on a site, it should skip crawling it.
    You need to have some ways to execute those javascript and get the URL.

    How to rank:
    Page rank algorithm

    Forget about the fact that you’re dealing with billions of pages. How would you design this system if it were just a small number of pages? You should have an understanding of how you would solve the simple, small case in order to understand how you would solve the bigger case.

    2. Now, think about the issues that occur with billions of pages. Most likely you can’t fit
    the data on one machine. How will you divide up the data?
    http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
    Data Structure
    We use priority-search-based graph traversal for our web crawling. It's basically breadth-first search, but we use a priority queue instead of a plain queue to determine which URL to crawl next.
    ==> BFS + score, so we crawl important pages/sites in a more timely way.

    We assign an importance score to each URL which we discover and then crawl them accordingly. 
    We use Redis sorted sets to store the priority associated with each URL and hashes to store the visited status of the discovered URLs. This, of course, comes with a large memory footprint.

    A possible alternative would be to use Bloom filters.
    ==> A false positive here means the Bloom filter says you have already visited a URL when you actually haven't. The consequence is that you will miss crawling some pages rather than crawling the same page twice.
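    A minimal sketch of those Redis structures, assuming the redis Python client and a local Redis: a sorted set as the priority frontier, a hash for visited status:

        import redis

        r = redis.Redis()                                  # assumes Redis on localhost:6379

        def add_url(url, importance):
            if not r.hexists("visited", url):              # hash tracks visited status
                r.zadd("frontier", {url: importance})      # sorted set orders by importance

        def next_url():
            popped = r.zpopmax("frontier")                 # highest-priority URL first
            if not popped:
                return None
            url = popped[0][0].decode()
            r.hset("visited", url, 1)
            return url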

    Recrawling
    Since one of our tasks is building price histories of products, recrawling of pages needs to be done intelligently.
    We want to recrawl products that change their prices frequently (duh) more often than pages whose price doesn't really change.
    Hence a brute-force complete recrawl doesn't really make sense.

    We use a power law distribution (some pages are crawled more frequently than other pages) for recrawling products, with pages ranked based on their importance (using signals ranging from previous price history changes to how many reviews they have).

    Another challenge is in page discovery (eg: if a new product has been launched, how quickly can we have it in our system).
    ===> If the crawler is built for company sites, we can provide an API so clients can add, or manually trigger crawls of, newly added pages.

    Check resource from https://www.linkedin.com/pulse/20140831103409-27984489-designing-a-search-engine-design-patterns-for-crawlers

    http://www.jiuzhang.com/problem/44/
    If you were asked to design a basic web crawler, how would you design it? What factors would you need to consider?
    There is no standard answer; the goal is to cover as many considerations as possible.

    From the interviewer's perspective:

    This is a common design question in interviews. If you have never done related design work, it is actually not easy to give an answer that satisfies the interviewer. The question is not limited to interviews at search-engine companies. Here we answer it at both the junior level and the senior level.

    1. How to abstract the internet

    Junior: Abstract it as a graph: web pages are nodes and the hyperlinks inside pages are directed edges.

    Senior: Same as above.

    2. Crawling algorithm

    Junior: Use BFS: maintain a queue; after fetching a page, parse out its links and push them onto the queue.

    Senior: Use priority-queue scheduling instead of plain BFS: give every page a crawl weight and fetch higher-weight pages first. Factors for setting the weight include: 1. whether the page belongs to a popular site; 2. the link length; 3. the weight of the pages linking to it; 4. how many times the page is linked to; and so on.

    Going further, popular sites cannot be crawled without limit, so two-level scheduling is needed: first schedule which site to crawl, and then, once a site is chosen, schedule which of its pages to crawl. The benefit is that the crawl of any single site is politely bounded while pages on other sites also get a chance to be crawled.

    3. Network model

    Junior: Multi-threaded fetching.

    Senior: Consider single-machine and distributed crawling separately. On a single Windows machine, IOCP (I/O completion ports) can be used for asynchronous fetching, which makes the most of otherwise idle resources. Network access is mostly waiting; simply opening many threads makes the machine spend heavily on thread switching and leaves little time for processing the fetched results. IOCP lets a few threads achieve what dozens of synchronously fetching threads would. For multi-machine crawling, consider where the machines are placed, for example using machines located in Asia to crawl Asian sites.
    4. Freshness
    Junior: No answer required.
    Senior: News pages are generally crawled by a dedicated crawler whose weighting is set differently from the ordinary crawler's. First the news sources must be selected, in one of two ways: manually designating sources such as the Sina homepage, or using machine learning. A news source can be defined as a page with a very large number of links whose content changes frequently. Starting from the news-source pages, crawl downward to a limited depth, then use the timestamp information in the pages to decide what to add as news pages.
    5. Page refresh
    Junior: No answer required.
    Senior: After a page has been crawled, some pages keep changing and others do not, so the crawler needs to attach a kind of time-to-live to each page. When a new URL is discovered, its timestamp is its discovery time, meaning it should be crawled right away. After a page is crawled, its timestamp can be set to x minutes in the future, and once x minutes have passed the timestamp tells the crawler the page is due to be fetched again. After a page is crawled a second time, compare its content with the previous copy:
    if the content is identical, lengthen the interval before the next crawl, e.g. to 2x minutes, up to a cap such as three or six months (the number depends on your crawler's capacity); if the page was updated, shorten the interval, e.g. crawl again after x/2 minutes.
    http://blog.baozitraining.org/2014/08/how-to-design-basic-web-crawler.htm
    Keep in mind a production web crawler could be very sophisticated and normally takes a few teams weeks or months to develop. The interviewer will not expect you to cover all the detail, but you should be able to mention some key design perspectives.

    How to abstract the internet?
    You should quickly realize the internet could be abstracted as a directed graph, with each page as a node and hyperlinks as an edge.

    How to crawl?

    BFS is normally used. However, DFS is also used in some situation, such as if your crawler has already established a connection with the website, it might just DFS all the URLs within this website to save some handshaking overhead.

    Decide what you want to crawl?

    The internet is huge, thus your graph is huge. It is almost impossible to crawl the entire internet since it keeps growing every second. Google probably has 60 trillion pages while Bing has 30 trillion, roughly speaking.

    Some strategy used in practice includes

    a) Having a few websites in your crawler’s most frequent list to crawl, such as some authoritative news website, etc

    b) You should have lots of fetchers living on many host classes. Use machine learning to predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.

    How to track what your fetchers have crawled?

    You don’t want your fetchers to crawl some websites over and over again while other websites don’t get crawled at all. There are many ways to achieve this. For example, your scheduler should generate non-duplicate jobs for fetchers. Or, your fetchers could keep track of the last visited time for each URL. Note, due to the scale, a consistent hash map is needed.

    Respect the standard?
    If a crawler sees a robots.txt on a site, it should skip crawling it. However, if this happens a lot for a fetcher, it is doing less work on the run. Some optimization could be let the fetcher pick some other sites to crawl to reduce the overhead of spawning a fetcher.

    Sometime, URL on a website is not easy to extract. For example, some URLs are generated by javascript. You need to have some ways to execute those javascript and get the URL.


    [Reference]

    Coding
    https://github.com/jefferyyuan/500lines/tree/master/crawler

    http://itsumomono.blogspot.com/2015/08/crawling.html
    Robustness: avoid  fetching an infinite number of pages in a particular domain.
    Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them.

     Distributed:
     Scalable:
     Performance and efficiency: CPU, storage and network bandwidth.
     Quality:be biased towards fetching “useful” pages first.
     Freshness: continuous mode: it should obtain fresh copies of previously fetched pages.
     Extensible:  e.g. new data formats, new fetch protocols.

    begin with a seed set

    Check already fetched -- The simplest implementation for this would
    use a simple fingerprint such as a checksum. https://en.wikipedia.org/wiki/Checksum

    A more sophisticated test would use shingles
    URL filter is used to determine whether the extracted URL should be excluded from the frontier based on one of several tests.

    Many hosts on the Web place certain portions of their websites off-limits to crawling, under a standard known as the Robots Exclusion Protocol. This is done by placing a file with the name robots.txt at the root of the URL hierarchy at the site. Maintain a cache of robots.txt files.

    URL should be normalized in the following sense: often the HTML encoding of a link from a web page p indicates the target of that link relative to the page p.

    Finally, the URL is checked for duplicate elimination. When the URL is added to the frontier, it is assigned a priority based on which it is eventually removed from the frontier for fetching.

    Certain housekeeping tasks are typically performed by a dedicated thread. 
    This thread is generally quiescent except that it wakes up once every few seconds to log crawl progress statistics (URLs crawled, frontier size, etc.), decide whether to terminate the crawl, or (once every few hours of crawling) checkpoint the crawl. 

    In checkpointing, a snapshot of the crawler’s state (say the URL frontier) is committed to disk. In the event of a catastrophic crawler failure, the crawl is restarted from the most recent checkpoint.

    Documents change over time and so, in the context of continuous crawling, we must be able to delete their outdated fingerprints/shingles from the content-seen set(s). In order to do so, it is necessary to save the fingerprint/shingle of the document in the URL frontier, along with the URL itself.

    DNS resolution is a well-known bottleneck in web crawling. It may entail multiple requests and round-trips across the internet, requiring seconds and sometimes even longer.

    Standard DNS caches and lookups are generally synchronous:
    once a request is made to the Domain Name Service, other crawler threads at that node are blocked until the first request is completed. To circumvent this, most web crawlers implement their own DNS resolver as a component of the crawler.
    The thread executing the resolver code sends a message to the DNS server and then performs a timed wait: it resumes either when signaled by another thread or when a set time quantum expires. A single, separate DNS thread listens on the standard DNS port (port 53) for incoming response packets from the name service.

    Upon receiving a response, it signals the appropriate crawler thread (in this case, i) and hands it the response packet if i has not yet resumed because its time quantum has expired. A crawler thread that resumes because its wait time quantum has expired retries for a fixed number of attempts, sending out a new message to the DNS server and performing a timed wait each time; the designers of Mercator recommend of the order of five attempts. The time quantum of the wait increases exponentially with each of these attempts; Mercator started with one second and ended with roughly 90 seconds, in consideration of the fact that there are host names that take tens of seconds to resolve.

    The URL frontier
    High-quality pages that change frequently should be prioritized for frequent crawling. Thus, the priority of a page should be a function of both its change rate and its quality (using some reasonable quality estimate). The combination is necessary because a large number of spam pages change completely on every fetch.

    The second consideration is politeness: we must avoid repeated fetch requests to a host within a short time span. A common heuristic is to insert a gap between successive fetch requests to a host that is an order of magnitude larger than the time taken for the most recent fetch from that host.
    Its goals are to ensure that (i) only one connection is open at a time to any host; (ii) a waiting time of a few seconds occurs between successive requests to a host; and (iii) high-priority pages are crawled preferentially.
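    A minimal sketch of that politeness heuristic: track the earliest time each host may be fetched again, spacing requests by roughly ten times the previous fetch duration (the fetch function is whatever downloader you already have):

        import time
        import urllib.parse

        next_allowed = {}                                  # host -> earliest next fetch time

        def polite_fetch(url, fetch):
            host = urllib.parse.urlparse(url).netloc
            wait = next_allowed.get(host, 0.0) - time.time()
            if wait > 0:
                time.sleep(wait)                           # keep a gap between requests to a host
            start = time.time()
            body = fetch(url)                              # the real download
            elapsed = time.time() - start
            next_allowed[host] = time.time() + 10 * elapsed
            return body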


    Two ways to partition a distributed index are partitioning by terms, also known as global index organization, and partitioning by documents.
    In the former, the dictionary of index terms is partitioned into subsets, each subset residing at a node. Along with the terms at a node, we keep the postings for those terms. A query is routed to the nodes corresponding to its query terms. In principle, this allows greater concurrency, since a stream of queries with different query terms would hit different sets of machines.
    In practice, partitioning indexes by vocabulary terms turns out to be nontrivial. Multi-word queries require the sending of long postings lists between sets of nodes for merging, and the cost of this can outweigh the greater concurrency. Load balancing the partition is governed not by an a priori analysis of relative term frequencies, but rather by the distribution of query terms and their co-occurrences, which can drift with time or exhibit sudden bursts. Achieving good partitions is a function of the co-occurrences of query terms and entails the clustering of terms to optimize objectives that are not easy to quantify. Finally, this strategy makes implementation of dynamic indexing more difficult.
    A more common implementation is to partition by documents: each node contains the index for a subset of all documents. Each query is distributed to all nodes, with the results from various nodes being merged before presentation to the user. This strategy trades more local disk seeks for less inter-node communication. One difficulty in this approach is that global statistics used in scoring, such as idf, must be computed across the entire document collection even though the index at any single node only contains a subset of the documents. These are computed by distributed "background" processes that periodically refresh the node indexes with fresh global statistics.
    http://systemdesigns.blogspot.com/2015/12/web-crawling-system.html
    The slaves' Bloom filter is kept in the master's memory, and the visited URLs are kept in Redis running on the master, so that every operation is O(1).


    I think of this as centrally coordinated task assignment:
    Master: reading and writing the task list should be handled by one master machine. That keeps the data consistent and is easier to manage, so this machine needs fast, stable I/O.
    Slave: the slaves just crawl and send what they fetch back to the master. One more, one fewer, or even a broken slave barely affects the crawl as a whole, so these machines do not need much compute power.

    The lecture video divides pages into two kinds:
    -       list pages, the kind that aggregate many news hyperlinks
    -       ordinary pages (page)

    2. Multiple Page Website Crawler
    The crawler first has to find the list page and start collecting pages from there. That yields the most basic architecture for crawling a multi-page site:


    The figure above is a very basic crawler structure. Note that each site has only one list crawler but many news crawlers, and the news crawlers can be reused.
    The scheduler acts as the master controlling the whole crawl. But there is a problem with reuse: the list crawlers go to waste. For example, if Sina has little news, its list crawler sits idle, while 163 has lots of news and never has enough crawlers. That wastes resources.

    So the lecture notes offer a third version of the structure.

    3. Master-Slave Crawler

    This structure no longer distinguishes list crawlers from page crawlers. One master is responsible for reads and writes, and the other crawlers just crawl, which neatly solves the resource-waste problem.
    The figure also sketches the task table; several of its columns are worth thinking about.
    - Priority
    Priority ranks the crawl tasks; for example, raise the priority of list crawls and lower the priority of page crawls.
    - State
    State covers done, working, and new. It is useful for management and also for verification: if a crawler's state says done but the expected data (the content behind the link) never arrived, that can be used to check whether the crawler is broken.
    - Timestamp
    Finally there are two timestamps, which make scheduled (periodic) crawling possible, partly so the crawler stays friendly to the target site.

    Scheduler

    The scheduler is a fair interview topic in its own right: multithreading and locks. The first way to coordinate is sleeping.

    1) Sleep
    Crawl when there is a task and sleep when there isn't. The weakness is deciding how long to sleep; the course uses one second, but if a task arrives during that second the crawler still cannot start right away. Even so, sleeping is a very common multithreading technique. The diagram is clear; per the video, understand it and commit it to memory.
    2) Conditional Variable

    The second approach is more involved: condition variables. The video describes it with an analogy of a customer going to a restaurant: ask whether there is a free table; if not, wait in the waiting area; once a table frees up, the restaurant notifies you.
    1.    Check for yourself
    2.    You can wait in line
    3.    Someone notifies you

    cond_wait means: if there is no task, wait, releasing the lock internally.
    cond_signal wakes up one waiting thread.
    That is the concrete implementation of the two condition-variable operations.

    The video says to understand and memorize this as well.

    3) Semaphore
    The course notes then mention a third classic way to handle multithreading: semaphores.





    The video has a good analogy for this too. Say I can handle 10 new tasks, so I hold 10 tokens; no token means no new task, and the crawler waits. When a crawler finishes a page, it puts the token back.
    - Advantage
    The clever part is that this collapses many control variables: a single integer controls multiple crawlers. The semaphore is just an integer; a positive value means there are new tasks, otherwise there are none. Each crawler just checks the number: take a task and it is decremented; discover a new task and it is incremented. (See the sketch below.)
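    A minimal sketch of the token idea with Python's threading.Semaphore: a single integer gates how many tasks the crawler threads may take (the URLs are illustrative):

        import queue
        import threading

        tasks = queue.Queue()
        tokens = threading.Semaphore(0)        # counts available tasks; 0 means "wait"

        def producer(urls):
            for u in urls:
                tasks.put(u)
                tokens.release()               # new task discovered: token count +1

        def crawler():
            while True:
                tokens.acquire()               # no token means no task: block here
                url = tasks.get()
                print("crawling", url)         # stand-in for the real fetch
                tasks.task_done()

        threading.Thread(target=crawler, daemon=True).start()
        producer(["https://example.com/a", "https://example.com/b"])
        tasks.join()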


    Producer-Consumer Model
    The video then uses the crawler to walk through the producer-consumer model in more detail, a classic treatment of handling large volumes of data. At big-data scale, one machine certainly cannot hold all the crawlers, so they have to be deployed across multiple machines. Each crawler fetches pages for its tasks and sends the results back to a connector, and the connector reads and writes the task table. But giving every crawler its own connector is wasteful.


    So, to reduce the number of connectors, the crawlers' work is split by what they do, fetching tasks versus sending pages back, into two roles: senders and receivers. A group of crawlers then needs only two large control structures.

    http://www.cnblogs.com/javanerd/p/5121472.html


    The single-machine crawler above works well when the data volume is small and freshness requirements are modest, but when the number of pages to crawl gets large, or sites impose anti-crawling limits, that code no longer holds up.
    For example, a general-purpose search crawler that must fetch a very large number of pages needs many crawlers working together, and those crawlers inevitably have to share the two data structures above.
    Second, many sites now restrict crawlers; crawl too frequently and your IP gets banned. The usual countermeasure is to sleep for a while, which in turn wastes CPU.
    Finally, once you need different crawl strategies, or unified management of crawler job lifecycles, you need a Master to coordinate the Slaves.

    3. Design and Implementation
    3.1 Master:
    The framework's master node is called WebCrawlerMaster. For each distinct crawl job, WebCrawlerMaster creates a separate WebCrawlerManager, whose job is to manage the UrlsToDo and UrlsDone data structures. The Master's main responsibilities are managing the WebCrawlerManager instances and routing each request to the right WebCrawlerManager.
    The Master's most important component is a member called MetaDataHolder, which manages metadata. For robustness this information must be persisted; the store can be Redis, a relational database, or even plain files. If MySQL is used for persistence, an application-level cache (usually a HashMap) is needed on top of it.
    3.2 Data structures
    A CrawlManager manages two data structures, UrlToDo and UrlDone. The former can be abstracted as a list, a stack, or a priority queue; the latter presents itself as a Set and handles deduplication. Once the ADT (abstract data type) is defined, you can decide how to implement it. This design approach is the same as implementing an ordinary data structure, except that instead of in-memory arrays and lists, the building blocks are the primitives the storage systems give us, such as Redis Lists and Sets, relational tables, and so on.

    4. Afterword
    Judging from the initial pseudocode, this crawler framework looked simple, but once a distributed environment and extensibility enter the picture, extra concerns appear: concurrent reads and writes of shared data structures, high availability, and so on. What really satisfies me about the project is that abstracting the data structures by their behavior made the crawler highly extensible. For example, UrlToDo defaults to a FIFO queue, so the crawler effectively crawls BFS; configure UrlToDo as a LIFO stack and it crawls DFS instead, and that change is nothing more than a different parameter when requesting a new WebCrawlerManager; the crawler's business code needs no modification at all. (See the sketch below.)
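    A minimal sketch of that behavioral abstraction: the crawl loop only sees push/pop, so swapping a FIFO frontier for a LIFO one flips BFS into DFS without touching the crawl code (class and function names here are hypothetical, not from the original framework):

        from collections import deque

        class FifoUrlToDo:                     # queue behavior -> BFS
            def __init__(self): self._d = deque()
            def push(self, url): self._d.append(url)
            def pop(self): return self._d.popleft()
            def __bool__(self): return bool(self._d)

        class LifoUrlToDo:                     # stack behavior -> DFS
            def __init__(self): self._d = []
            def push(self, url): self._d.append(url)
            def pop(self): return self._d.pop()
            def __bool__(self): return bool(self._d)

        def crawl(frontier, seed, expand):
            frontier.push(seed)
            done = set()                       # plays the role of UrlDone
            while frontier:
                url = frontier.pop()
                if url in done:
                    continue
                done.add(url)
                for link in expand(url):       # expand() returns the outgoing links of a page
                    frontier.push(link)
            return done

    The same crawl() walks breadth-first with crawl(FifoUrlToDo(), seed, expand) and depth-first with crawl(LifoUrlToDo(), seed, expand).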
    TODO:
    http://nlp.stanford.edu/IR-book/html/htmledition/web-crawling-and-indexes-1.html
    Facebook interview thread: web crawler discussion
    https://www.1point3acres.com/bbs/forum.php?mod=viewthread&tid=436948
    You are given 10K machines and 1B URLs, and the machines cannot communicate with one another. How do you split the work evenly across the machines?

    http://www.1point3acres.com/bbs/thread-268942-1-1.html
    In reply #3 the original poster described the direction the interviewer steered toward:
    "The idea is that every machine starts from the same seed, hashes each URL it encounters, and is pre-assigned the set of hash values it is responsible for. If a URL's hash doesn't match the current machine's assignment, skip it; only if it matches does the machine continue crawling it."

    But if all 10K machines start from the seed, the seed hashes to only one machine, so under that scheme every other machine has nothing to crawl and stays idle forever.
    My take is that this is really just consistent hashing. But before you can assign the work you have to discover the 1B links and hash them, and that part cannot be done in a distributed way.

    This is the classic distributed hash table problem. Start from the root URL, download the page, extract the embedded URL links, and distribute each link to the machine responsible for it. In general, each machine receives URL links from its peers, checks whether each has already been visited, and if not puts it into its work queue; a dedicated thread takes URLs off the queue, downloads the pages, extracts URLs, and distributes them to the other machines.

    If all 10K machines have to walk over all the data together, no database can withstand that load. Two ways to solve it:
    1 Split the database by hash range, with, say, every hundred machines mapped to one database instance.
    2 Add a layer in front of the crawlers, call it dispatchers, say 100 of them; each dispatcher owns a range and assigns tasks to the crawlers beneath it.

    Of course you then have to keep going and consider what happens when a dispatcher goes down.

    For example, suppose 1 billion URLs need to be crawled periodically. The simple approach: give each URL a hash in the range 1 to 1 billion, split the hash space into 10K groups, each group a range. Each range is a partition, also called a shard, and each machine is responsible for one shard. (See the sketch below.)

    A better approach is consistent hashing, or simply storing and retrieving the URLs with Cassandra, which scales to hundreds or thousands of nodes; supporting 10K machines is no problem.
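    A minimal sketch of that simple range-sharding scheme, mapping each URL's hash onto one of 10K machines (the machine count and hash function are illustrative):

        import hashlib

        NUM_MACHINES = 10_000

        def machine_for(url):
            # Hash the URL into [0, 2^160), then map that space onto 10K equal ranges;
            # each contiguous range is one shard owned by one machine.
            h = int(hashlib.sha1(url.encode()).hexdigest(), 16)
            return h * NUM_MACHINES // (1 << 160)

        # A machine with id my_id crawls a URL only if machine_for(url) == my_id.
        print(machine_for("https://en.wikipedia.org/wiki/Web_crawler"))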

