Tuesday, June 30, 2015

Scalability Rules: 50 Principles for Scaling Web Sites

Relax Temporal Constraints
Alleviate temporal constraints in your system whenever possible.
When to use: anytime you are considering adding a constraint that an item or object must maintain a certain state between a user’s actions.

Why: The difficulty in scaling systems with temporal constraints is significant because of the ACID properties of most RDBMSs.

Key takeaways: Carefully consider the need for constraints such as items being available from the time a user views them until the user purchases them. Some possible edge cases where users are disappointed are much easier to compensate for than not being able to scale.

9. Design for Fault Tolerance and Graceful Failure
Most systems are not designed to fail; assume failures will happen and design to tolerate them gracefully.

Rule 36—Design Using Fault Isolative “Swimlanes” ==> todo
Shard: split data

Rule 37—Never Trust Single Points of Failure
Identify single instances on architectural diagrams. Strive for active/active configurations.
Strive for active/active rather than active/passive solutions. 
Use load balancers to balance traffic across instances of a service. 
Use control services with active/passive instances for patterns that require singletons.

Rule 38—Avoid Putting Systems in Series
Components in series have a multiplicative effect of failure.
When it is necessary to put components in series, add multiple instances of each component so that if one fails, others are available to take its place.
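The multiplicative effect is easy to see with a little arithmetic. This sketch (function names are illustrative, not from the book) compares components in series against a duplicated component:

```python
def series_availability(avails):
    # Availability of components in series is the product of their
    # individual availabilities, so each added hop lowers the total.
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel_availability(a, copies):
    # With identical copies in parallel, only all-copies-down is an outage.
    return 1 - (1 - a) ** copies

# Two 99.9% components in series are worse than either alone:
print(round(series_availability([0.999, 0.999]), 6))  # 0.998001
# Doubling a 99.9% component nearly eliminates its downtime:
print(round(parallel_availability(0.999, 2), 6))      # 0.999999
```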

Rule 7—Design to Clone Things (X Axis)
Simply clone services and implement a load balancer.
For databases, ensure the accessing code understands the difference between a read and a write.

Just because the data is out of sync with the primary transactional database by 3 or 30 or 90 seconds doesn’t mean that it isn’t correct, just that there is a chance that it isn’t correct.

MySQL implements replication through the master-slave concept: the master database is the primary transactional database that gets written to, and the slave databases are read-only copies of the master.
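An X-axis read/write split can be sketched as a small router in the accessing code; the class and hostnames here are illustrative, not from the book:

```python
import random

class ReadWriteRouter:
    """Route writes to the primary and reads to read-only replicas.
    Hostnames are made-up examples."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def connection_for(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            # Replica data may lag the primary by seconds; only route
            # reads here when a slightly stale answer is acceptable.
            return random.choice(self.replicas)
        return self.primary

router = ReadWriteRouter("db-primary", ["db-replica-1", "db-replica-2"])
assert router.connection_for("SELECT * FROM users") in ("db-replica-1", "db-replica-2")
assert router.connection_for("UPDATE users SET name='x'") == "db-primary"
```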
Rule 8—Design to Split Different Things (Y Axis)
Scaling through the separation of distinct and different functions and data, split by nouns, by verbs, or by a combination of both.
Separate sets of data, so we can scale them differently.
Scale code bases.

Rule 9—Design to Split Similar Things (Z Axis): sharding and podding
Take one data set or service and partition it into several pieces.
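A minimal Z-axis split might be an unbiased modulus on customer_id, as described later in the Z-axis notes; the shard count here is an arbitrary example:

```python
def shard_for(customer_id, num_shards=4):
    # Unbiased modulus split: similar things (customers) are spread
    # evenly across shards/pods. num_shards is an arbitrary example.
    return customer_id % num_shards

# All data and traffic for one customer land on one smaller stack:
assert shard_for(12345) == 1
assert shard_for(12346) == 2
```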

Rule 15—Firewalls, Firewalls Everywhere!
Use firewalls only when they significantly reduce risk and recognize that they cause issues with scalability and availability.
Firewalls can lower availability and cause unnecessary scalability chokepoints.

Rule 16—Actively Use Log Files
Splunk, Cricket or Cacti
summarizing logs over time and archiving and purging them as their value decreases.
logging in an asynchronous fashion.

5. Don’t Duplicate Your Work
Rule 18—Stop Redirecting Traffic
A slightly better way to redirect with HTML is to use the meta tag “refresh” and automatically send the user’s browser to the new page.
<meta http-equiv="Refresh" content="0;url=http://www.akfpartners.com/techblog" />

We can also have the server redirect for us, e.g. with Apache’s mod_alias or mod_rewrite:
Alias /image /www/html/image
Redirect /service http://foo2.akfpartners.com/service

6. Use Caching Aggressively
Rule 20—Leverage Content Delivery Networks

Rule 21—Use Expires Headers
Because proxy caches don’t inspect the HTML, they do not abide by meta tags at all; set caching policy with HTTP headers instead.

Keep-alives, or HTTP persistent connections, allow for the reuse of TCP connections for multiple HTTP requests.

Rule 22—Cache Ajax Calls
Rule 23—Leverage Page Caches
A page cache (reverse proxy server) is a caching server you install in front of your Web servers to offload requests for both static and dynamic objects from those servers.

ETags are unique identifiers issued by the server for an object at the time of first request by a browser. If the resource on the server side is changed, a new ETag is assigned to it. Assuming appropriate support by the browser (client), the object and its ETag are cached by the browser and subsequent If-None-Match requests by the browser to the Web server will include the tag. If the tag matches, the server may respond with an HTTP 304 Not Modified response. If the tag is inconsistent with that on the server, the server will issue the updated object and its associated ETag.
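The ETag handshake described above can be sketched in a few lines; deriving the tag from a hash of the content is one common scheme, assumed here for illustration:

```python
import hashlib

def etag_for(body):
    # One common scheme (an assumption here): hash the response body.
    return '"%s"' % hashlib.md5(body).hexdigest()

def respond(body, if_none_match):
    tag = etag_for(body)
    if if_none_match == tag:
        return 304, tag, b""   # Not Modified: no body resent
    return 200, tag, body      # full response plus fresh ETag

status, tag, _ = respond(b"<html>v1</html>", None)    # first request
status2, _, body2 = respond(b"<html>v1</html>", tag)  # If-None-Match hit
assert (status, status2, body2) == (200, 304, b"")
```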

Rule 24—Utilize Application Caches
Rule 25—Make Use of Object Caches
Rule 26—Put Object Caches on Their Own “Tier”

Chapter 7. Learn from Your Mistakes
Rule 27—Learn Aggressively
Doing something without measuring the results or having an incident without learning from it are wasted opportunities that your competitors are taking advantage of.

Rule 28—Don’t Rely on QA to Find Mistakes
Rule 29—Failing to Design for Rollback Is Designing for Failure ==> to-do
Rule 30—Discuss and Learn from Failures


  1 Don't over-engineer the solution
  2 Design scalability into the solution
  3 Simplify the solution again and again
  4 Reduce DNS lookups
  5 Reduce objects where possible
  6 Use network gear from a single vendor


  7 X axis: clone things horizontally
  8 Y axis: split different things
  9 Z axis: split similar things


  10 Design solutions that scale out
  11 Use commodity systems
  12 Scale out your data centers
  13 Design to leverage the cloud


  14 Use databases appropriately
  15 Firewalls, firewalls everywhere
  16 Actively use log files


  17 Don't immediately check work you just did
  18 Stop redirecting traffic
  19 Relax temporal constraints


  20 Leverage CDNs
  21 Use Expires headers
  Set Expires headers per object type to reduce object requests. Common HTTP header values include public, no-cache, max-age, and so on.
  22 Cache Ajax calls
  Set the Last-Modified, Cache-Control, and Expires HTTP headers correctly.
  23 Leverage page caches
  24 Utilize application caches
  25 Make use of object caches
  26 Put object caches on their own tier


  27 Learn aggressively
  28 Don't rely on QA to find mistakes
  29 A design without rollback is a failed design
  30 Discuss failures and learn from them


  31 Be aware of costly relationships
  32 Use the right type of database lock
  33 Don't use multiphase commits
  34 Don't use SELECT FOR UPDATE
  The FOR UPDATE clause locks rows and slows down transaction processing.
  35 Don't select all the data
  e.g. select * from xxx; or insert into xxx values(xxxx);


  36 Use fault-isolating "swim lanes"
  37 Never trust single points of failure
  38 Avoid putting systems in series
  39 Make sure you can enable and disable features


  40 Strive for statelessness
  41 Maintain sessions in the browser when possible
  42 Use a distributed cache for state


  43 Communicate asynchronously as much as possible
  44 Ensure the message bus can scale
  45 Avoid overcrowding the message bus


  46 Be wary of scaling through third-party solutions
  47 Purge, archive, and cost-justify storage
  48 Remove business intelligence from transaction processing
  49 Design applications that can be monitored
  "Did a problem occur?"
  50 Be competent
Scalability Rules: 50 Principles for Scaling Web Sites
50 Scalability Rules in Brief
Rule 1—Don’t Overengineer the Solution
Rule 2—Design Scale into the Solution (D-I-D Process)
Rule 3—Simplify the Solution Three Times Over
Rule 4—Reduce DNS Lookups
Rule 5—Reduce Objects Where Possible
Rule 6—Use Homogeneous Networks
Rule 7—Design to Clone or Replicate Things (X Axis)
Rule 8—Design to Split Different Things (Y Axis)
Rule 9—Design to Split Similar Things (Z Axis)
Rule 10—Design Your Solution to Scale Out, Not Just Up
Rule 11—Use Commodity Systems (Goldfish Not Thoroughbreds)
Rule 12—Scale Out Your Hosting Solution
Rule 13—Design to Leverage the Cloud
Rule 14—Use Databases Appropriately
Rule 15—Firewalls, Firewalls Everywhere!
Rule 16—Actively Use Log Files
Rule 17—Don’t Check Your Work
Rule 18—Stop Redirecting Traffic
Rule 19—Relax Temporal Constraints
Rule 20—Leverage Content Delivery Networks
Rule 21—Use Expires Headers
Rule 22—Cache Ajax Calls
Rule 23—Leverage Page Caches
Rule 24—Utilize Application Caches
Rule 25—Make Use of Object Caches
Rule 26—Put Object Caches on Their Own “Tier”
Rule 27—Learn Aggressively
Rule 28—Don’t Rely on QA to Find Mistakes
Rule 29—Failing to Design for Rollback Is Designing for Failure
Rule 30—Remove Business Intelligence from Transaction Processing
Rule 31—Be Aware of Costly Relationships
Rule 32—Use the Right Type of Database Lock
Rule 33—Pass on Using Multiphase Commits
Rule 34—Try Not to Use Select for Update
Rule 35—Don’t Select Everything
Rule 36—Design Using Fault-Isolative “Swim Lanes”
Rule 37—Never Trust Single Points of Failure
Rule 38—Avoid Putting Systems in Series
Rule 39—Ensure That You Can Wire On and Off Features
Rule 40—Strive for Statelessness
Rule 41—Maintain Sessions in the Browser When Possible
Rule 42—Make Use of a Distributed Cache for States
Rule 43—Communicate Asynchronously as Much as Possible
Rule 44—Ensure That Your Message Bus Can Scale
Rule 45—Avoid Overcrowding Your Message Bus
Rule 46—Be Wary of Scaling through Third Parties
Rule 47—Purge, Archive, and Cost-Justify Storage
Rule 48—Partition Inductive, Deductive, Batch, and User Interaction (OLTP) Workloads
Rule 49—Design Your Application to Be Monitored
Rule 50—Be Competent

Very High—1
Rule 19 Relax Temporal Constraints
Rule 25 Make Use of Object Caches
Rule 29 Failing to Design for Rollback Is Designing for Failure
Rule 32 Use the Right Type of Database Lock
Rule 35 Don’t Select Everything
Rule 46 Be Wary of Scaling through Third Parties
Rule 50 Be Competent

Rule 1 Don’t Overengineer the Solution
Rule 7 Design to Clone or Replicate Things (X Axis)
Rule 10 Design Your Solution to Scale Out, Not Just Up
Rule 11 Use Commodity Systems (Goldfish Not Thoroughbreds)
Rule 14 Use Databases Appropriately
Rule 15 Firewalls, Firewalls Everywhere!
Rule 22 Cache Ajax Calls
Rule 26 Put Object Caches on Their Own “Tier”
Rule 27 Learn Aggressively
Rule 28 Don’t Rely on QA to Find Mistakes
Rule 30 Remove Business Intelligence from Transaction Processing
Rule 33 Pass on Using Multiphase Commits
Rule 34 Try Not to Use Select for Update
Rule 37 Never Trust Single Points of Failure
Rule 41 Maintain Sessions in the Browser When Possible
Rule 42 Make Use of a Distributed Cache for States
Rule 43 Communicate Asynchronously as Much as Possible
Rule 44 Ensure That Your Message Bus Can Scale
Rule 45 Avoid Overcrowding Your Message Bus
Rule 48 Partition Inductive, Deductive, Batch, and User Interaction (OLTP) Workloads
Rule 49 Design Your Application to Be Monitored

Rule 2 Design Scale into the Solution (D-I-D Process)
Rule 3 Simplify the Solution Three Times Over
Rule 4 Reduce DNS Lookups
Rule 5 Reduce Objects Where Possible
Rule 6 Use Homogeneous Networks
Rule 8 Design to Split Different Things (Y Axis)
Rule 9 Design to Split Similar Things (Z Axis)
Rule 12 Scale Out Your Hosting Solution
Rule 16 Actively Use Log Files
Rule 18 Stop Redirecting Traffic
Rule 20 Leverage Content Delivery Networks
Rule 21 Use Expires Headers
Rule 23 Leverage Page Caches
Rule 24 Utilize Application Caches
Rule 31 Be Aware of Costly Relationships
Rule 36 Design Using Fault-Isolative “Swim Lanes”
Rule 38 Avoid Putting Systems in Series
Rule 40 Strive for Statelessness
Rule 47 Purge, Archive, and Cost-Justify Storage

Rule 13 Design to Leverage the Cloud
Rule 17 Don’t Check Your Work
Rule 39 Ensure That You Can Wire On and Off Features
Read full book from Scalability Rules: 50 Principles for Scaling Web Sites

AKF’s Most Commonly Adopted Architectural Principles

AKF Scale Cube
  1. Horizontal Duplication and Cloning (X-Axis ). Having a farm of identical and preferably stateless instances behind a load balancer or reverse proxy. Therefore, every request can be served by any of those hosts and there will be no single point of failure.
  2. Functional Decomposition and Segmentation - Microservices (Y-Axis). e.g. auth service, user profile service, photo service, etc
  3. Horizontal Data Partitioning - Shards (Z-Axis). Replicate the whole stack to different “pods”. Each pod can target a specific large group of users. For example, Uber had China and US data centers. Each datacenter might have different “pods” for different regions.

The AKF Scale Cube has three axes by which a website can be scaled.  X axis, or horizontal duplication, is a common first choice.  Duplicating web and app servers into load balanced pools is a best practice that provides the ability to conduct maintenance or roll code without taking the site down as well as scalability if the pools are N+1 or even N+2.

The Y axis is a service split - decomposing a monolithic code base into smaller services that can run independently.  The Y axis also allows you to scale your technology team - teams focus on the services for which they are responsible and no longer need complete domain expertise.

The third axis is Z, a lookup oriented split.  A common choice is geographic location or customer identity.  A Z axis split takes a monolithic stack and slices it into N smaller stacks, using fewer or smaller servers and working with 1/Nth of the data.

Scaling Solutions with the Z Axis of the Scale Cube

Whereas the Y axis addresses the splitting of dissimilar things (often along noun or verb boundaries), the Z axis addresses segmentation of “similar” things.  Examples may include splitting customers along an unbiased modulus of customer_id, or along a somewhat biased (but beneficial for response time) geographic boundary.  Product catalogs may be split by SKU, and content may be split by content_id.  Z-axis scaling, like all of the axes, improves the solution’s transactional scalability and, if fault isolated, its availability. Because the software deployed to servers is essentially the same in each Z-axis shard (but the data is distinct), there is no increase in organizational scalability.  Cache hit rates often go up with smaller data sets, and operational costs generally go down as commodity servers or smaller IaaS instances can be used.

AKF’s Most Commonly Adopted Architectural Principles
N + 1 Design
Provision at least one additional instance of each system so that a spare is available in the event of failure.

Design for Rollback
Design to Be Disabled
This will give you additional time to “fix forward” or ensure that your system doesn’t go down as a result of a bug that introduces strange, out-of-bounds demand characteristics to the system.

Design to Be Monitored
identifying services that are alive or dead, examining or polling log files, collecting system-related data over time, and evaluating end-user response times.
applications and systems are designed from the ground up to be if not self-healing, then at least self-diagnosing.

Significant standard deviations from the mean could be “alerted” for future or immediate action depending on the value. This approach leverages a control chart from statistical process control.
Examples of monitored values: error rates, response times from other services.
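A minimal control-chart-style check over collected measurements might look like this; the three-sigma threshold and sample data are illustrative:

```python
import statistics

def out_of_control(samples, value, sigmas=3):
    """Flag a measurement that deviates from the running mean by more
    than `sigmas` standard deviations (a simple control-chart test)."""
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    return abs(value - mean) > sigmas * stdev

response_times_ms = [100, 102, 98, 101, 99, 100, 103, 97]
assert not out_of_control(response_times_ms, 104)  # within 3 sigma
assert out_of_control(response_times_ms, 250)      # alert-worthy spike
```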

Design for Multiple Live Sites
Ensuring that the business environment in which you operate, including contracts, partners, and facilities, is also scalable.

Use Mature Technologies

Asynchronous Design

Stateless Systems
consider storing state with the end user rather than within your system. If that is not possible, consider a centralized state caching mechanism that keeps state data off of the application servers and allows for its distribution across multiple servers.
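A toy sketch of such a centralized session cache, with in-memory dicts standing in for real cache servers such as memcached (names and layout are assumptions for illustration):

```python
import hashlib

class SessionCache:
    """Toy centralized session store: state lives on a cache tier,
    not on any one app server."""
    def __init__(self, server_names):
        self.names = server_names
        self._stores = [dict() for _ in server_names]  # one dict per "server"

    def _pick(self, session_id):
        # Hash the session id to choose a cache server deterministically.
        digest = hashlib.md5(session_id.encode()).hexdigest()
        return int(digest, 16) % len(self._stores)

    def put(self, session_id, state):
        self._stores[self._pick(session_id)][session_id] = state

    def get(self, session_id):
        return self._stores[self._pick(session_id)].get(session_id)

cache = SessionCache(["cache-1", "cache-2", "cache-3"])
cache.put("sess-abc", {"user": 42, "cart": ["sku-1"]})
# Any app server can recover the session from the cache tier:
assert cache.get("sess-abc") == {"user": 42, "cart": ["sku-1"]}
```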

Scale Out, Not Up
Design for at Least Two Axes of Scale

Buy When Non-Core
Use Commodity Hardware
Build Small, Release Small, Fail Fast
Isolate Faults
Segment your product such that the failure of a service or subservice does not affect multiple other services. Alternatively, segment your customers such that failures for some customers do not affect failures for your customer base in its entirety.

Automation over People

SMART guidelines: Specific, Measurable, Achievable, Realistic, Testable

Pragmatic Programming Techniques: Scalable System Design

"Scalability" is not equivalent to "Raw Performance"
Understand the environmental workload conditions that the system is designed for
Dimension of growth and growth rate: e.g. Number of users, Transaction volume, Data volume
Measurement and their target: e.g. Response time, Throughput

Understand who is your priority customers
Rank the importance of traffic so you know what to sacrifice in case you cannot handle all of them

Scale out and Not scale up
Keep your code modular and simple

Don't guess the bottleneck; measure it
Bottlenecks are slow code paths that are frequently executed. Don't optimize slow code that is rarely executed
Write performance unit tests so you can collect fine-grained performance data at the component level
Setup a performance lab so you can conduct end-to-end performance improvement measurement easily
Plan for growth
Do regular capacity planning. Collect usage statistics, predict the growth rate

Common Techniques
Server Farm (real time access)
Incoming requests will be dispatched by the load balancer to different machines and hence the workload is spread and shared across the servers in the farm.

Data Partitioning
By nature, data is stateful, so there must be a deterministic mechanism to dispatch each data request to the server that hosts the data.
The data partitioning mechanism also needs to take the data access pattern into consideration: data that needs to be accessed together should stay on the same server. A more sophisticated approach can migrate data continuously as the access pattern shifts.

Map / Reduce (Batch Parallel Processing)

Content Delivery Network (Static Cache)
This is common for static media content. The idea is to create many copies of the content, distributed geographically across servers.
User requests are routed to the server replica in closest proximity.

Cache Engine (Dynamic Cache)
This is a time vs. space tradeoff. Some executions use the same set of input parameters over and over again; instead of redoing the same execution for the same input parameters, we can remember the previous execution's result.
==> Sometimes the inputs differ but share many common parameters; when possible, query the data source using the common parameters, cache the result, and do the remaining filtering in code.
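A sketch of this idea using Python's functools.lru_cache; the parameter names and the stand-in query are hypothetical:

```python
import functools

@functools.lru_cache(maxsize=1024)
def expensive_query(common_params):
    # Stand-in for a slow data-source call keyed on the shared parameters.
    return tuple(row for row in range(100) if row % 7 == 0)

def filtered(common_params, extra_filter):
    # Query (and cache) on the common parameters; apply the varying
    # filter cheaply in code, as the note above suggests.
    return [r for r in expensive_query(common_params) if extra_filter(r)]

a = filtered("category=books", lambda r: r > 50)
b = filtered("category=books", lambda r: r < 50)  # cache hit, new filter
assert expensive_query.cache_info().hits == 1
```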

Resources Pool
DB sessions and TCP connections are expensive to create, so reuse them across multiple requests.
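A minimal pool might wrap a thread-safe queue; the factory and size here are illustrative:

```python
import queue

class ResourcePool:
    """Reuse expensive resources (DB sessions, TCP connections)
    across requests instead of recreating them for each request."""
    def __init__(self, factory, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=5):
        return self._pool.get(timeout=timeout)  # blocks when exhausted

    def release(self, resource):
        self._pool.put(resource)

connections = ResourcePool(lambda: object(), size=1)
conn = connections.acquire()
connections.release(conn)
reused = connections.acquire()
assert reused is conn  # the same object was handed out again
connections.release(reused)
```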
Calculate an approximate result
Instead of calculating an accurate answer, see if you can trade off some accuracy for speed.
In real life, some degree of inaccuracy is usually tolerable
Filtering at the source
Try to do more processing upstream (where the data gets generated) than downstream, because it reduces the amount of data being propagated

Asynchronous Processing
In callback mode, the caller needs to provide a response handler when making the call. The call itself returns immediately, before the actual work is done on the server side. When the work is done later, the response comes back on a separate thread, which executes the previously registered response handler. Some coordination may be required between the calling thread and the callback thread.

In polling mode, the call itself returns a "future" handle immediately. The caller can go off and do other things, then later poll the "future" handle to see if the response is ready. In this model no extra thread is created, so no extra thread coordination is needed.
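Both modes can be sketched with Python's concurrent.futures; slow_work is a stand-in for a remote call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_work(x):
    time.sleep(0.05)  # simulate remote latency
    return x * x

results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    # Callback mode: register a handler; it runs on a worker thread.
    f1 = pool.submit(slow_work, 6)
    f1.add_done_callback(lambda f: results.append(f.result()))

    # Polling mode: keep the "future" handle and check it later.
    f2 = pool.submit(slow_work, 7)
    while not f2.done():
        time.sleep(0.01)  # the caller is free to do other work here

assert f2.result() == 49
assert results == [36]  # the callback fired when the work completed
```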

Implementation design considerations
Use efficient algorithms and data structure.

Analyze your concurrent access scenarios when multiple threads access shared data. Carefully analyze the synchronization and make sure the locking is fine-grained enough. Also watch for any possibility of deadlock and how you would detect or prevent it. A wrong concurrent access model can have a huge impact on your system's scalability. Also consider using lock-free data structures (e.g., Java's concurrent package has a couple of them).

Analyze the memory usage patterns in your logic. Determine where new objects are created and where they are eligible for garbage collection. Be aware of the creation of a lot of short-lived temporary objects as they will put a high load on the Garbage Collector.
However, never trade off code readability for performance. (e.g. Don't try to bundle too much logic into a single method). Let the VM handle this execution for you.


  • Move MySQL to a separate server. This frees up resources (CPU, disk, memory). What you run on this server depends on its capabilities. Maybe run a memcached server on it.
  • Move to a distributed memory cache using memcached.
  • Add a MySQL master/slave configuration.
  • If more webservers are needed, use LVS on the front end as a load balancer.
  • Read full article from Pragmatic Programming Techniques: Scalable System Design

    Monday, June 29, 2015

    Design a web crawler

    Hubs are web pages that contain a large number of outgoing links to authority sites. For example, Reddit, the DMOZ Directory and Hacker News are hubs because they point to a lot of pages with authoritative content. Authorities are sites that obtain a high authority score, determined by the relevant information on the page and its authoritative sources.

    Storing the spider log in Kafka allowed us to replay the log out of the box, which can be useful when changing crawling strategy on the fly. We were able to set up partitions by domain name, which made it easier to ensure that each domain was downloaded at most by one spider.
    We used the consumer offset feature to track the position of the spider and provide new messages slightly ahead of time in order to prevent topic overload by deciding if the spider was ready to consume more.

    The main advantage of this design is real-time operation. The crawling strategy can be changed without having to stop the crawl. It’s worth mentioning that the crawling strategy can be implemented as a separate module. This module would contain the logic for ordering URLs and stopping crawls, as well as the scoring model.
    Distributed Frontera is polite to web hosts by design because each host is downloaded by only one spider process. This is achieved by Kafka topic partitioning. All distributed Frontera components are written in Python. Python is much simpler to write than C++ or Java, the two most common languages for large-scale web crawlers.
    The crawler dequeues the URL JSON from the queue and checks whether the URL is already cached in the Couchbase URL bucket or has to be re-extracted. If it is not available, the page is fetched via the proxy servers.
    Once a page is available, the raw page is stored in AWS S3 and an entry is made in the URL bucket. Then the extractor extracts two things: a set of documents and the next set of URLs. The data is stored in the Couchbase data bucket, and the next set of URLs is checked to see whether it has to be crawled. The next set of URLs can also carry required headers and a data map to the next level. Then they are added to the same queue.

    Read full article from Algo Ramblings: Design a web crawler
    Given a seed URL, the crawler needed to auto-discover the value of the missing fields for a particular record. So if a web page didn't contain the information that I was looking for, the crawler needed to follow outbound links, until the information was found.
    It needed to be some kind of crawler-scraper hybrid, because it had to simultaneously follow outbound links and extract specific information from web pages.

    The scraped data needed to be stored somewhere, most likely in a database.
    The main components were:
    1. A crawler dispatcher, responsible for dispatching URLs to be crawled to the m crawler supervisors, and for collecting results (fields) from them.
    2. m crawler supervisors, responsible for supervising n child processes. Those child processes would perform the actual crawling. I'll refer to them as crawlers for convenience.
    3. A database server, responsible for storing the initial seed URLs as well as the extracted fields.

    4. Pirate. Design Wikipedia crawler.
                      followup 1: No global status.
                      followup 2: deal with machine failure
                      followup 3: make the wiki site unaware of this crawler.
    1. distributed bfs
    2. consistent hashing
    3. assign task with a random delay

    Algo Ramblings: Design a web crawler
    Crawler basic algorithm
    1. Remove a URL from the unvisited URL list
    2. Determine the IP Address of its host name
    3. Download the corresponding document
    4. Extract any links contained in it.
    5. If the URL is new, add it to the list of unvisited URLs
    6. Process the downloaded document
    7. Back to step 1
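The numbered steps above can be sketched as a BFS loop; the in-memory WEB dict stands in for DNS resolution and page downloads:

```python
from collections import deque

# A toy web: page -> (document, outgoing links). This stands in for
# the DNS lookup and download steps in the basic algorithm above.
WEB = {
    "a.com": ("doc-a", ["b.com", "c.com"]),
    "b.com": ("doc-b", ["c.com"]),
    "c.com": ("doc-c", []),
}

def crawl(seed):
    unvisited, seen, processed = deque([seed]), {seed}, []
    while unvisited:
        url = unvisited.popleft()   # step 1: remove a URL from the list
        doc, links = WEB[url]       # steps 2-4: resolve, download, extract
        for link in links:          # step 5: enqueue only new URLs
            if link not in seen:
                seen.add(link)
                unvisited.append(link)
        processed.append(doc)       # step 6: process the document
    return processed                # step 7: loop until the list is empty

assert crawl("a.com") == ["doc-a", "doc-b", "doc-c"]
```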
    Issues on crawler

    General architecture
    What pages should the crawler download ?
    How should the crawler refresh pages ?
    How should the load on the visited web sites be minimized ?
    How should the crawling process be parallelized ?

    the internet could be abstracted as a directed graph, with each page as a node and hyperlinks as an edge.

    How to crawl?
    BFS is normally used.

    Decide what you want to crawl?
    Scope: crawl whole internet or just company web site.
    Having a few websites in your crawler’s most frequent list to crawl, such as some authoritative news website, etc

    You should have lots of fetchers living on many host classes. Use machine learning to predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.

    How to track what your fetchers have crawled?
    You don’t want your fetchers to crawl some websites over and over again while other websites don’t get crawled at all. There are many ways to achieve this. For example, your scheduler should generate non-duplicate jobs for fetchers. Or, your fetchers could keep track of the last visited time for each URL. Note that, due to scalability, consistent hashing is needed.
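A toy consistent-hash ring for assigning URLs to fetchers; the node names and virtual-node count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map URLs to fetcher nodes so that adding or removing a node
    remaps only a small fraction of URLs."""
    def __init__(self, nodes, vnodes=50):
        points = []
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth the spread
                points.append((self._hash("%s#%d" % (node, i)), node))
        points.sort()
        self._keys = [k for k, _ in points]
        self._nodes = [n for _, n in points]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, url):
        # Walk clockwise on the ring to the first node at or after the hash.
        i = bisect.bisect(self._keys, self._hash(url)) % len(self._keys)
        return self._nodes[i]

ring = ConsistentHashRing(["fetcher-1", "fetcher-2", "fetcher-3"])
# The same URL always maps to the same fetcher's visited-set:
assert ring.node_for("http://example.com/p1") == ring.node_for("http://example.com/p1")
```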

    Respect the standard?
    If a crawler sees a robots.txt on a site, it should skip crawling it.
    You need to have some ways to execute those javascript and get the URL.

    How to rank:
    Page rank algorithm

    Forget about the fact that you’re dealing with billions of pages. How would you design this system if it were just a small number of pages? You should have an understanding of how you would solve the simple, small case in order to understand how you would solve the bigger case.

    2. Now, think about the issues that occur with billions of pages. Most likely you can’t fit the data on one machine. How will you divide the work across multiple machines?
    Data Structure
    We use priority-search-based graph traversal for our web crawling. It’s basically breadth-first search, but we use a priority queue instead of a queue to determine which URL to crawl next.
    ==> BFS + score, so we will crawl important pages/sites in a more timely fashion.
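A priority-queue frontier can be sketched with heapq; the URLs and scores are made up:

```python
import heapq

class Frontier:
    """Priority-based frontier: heapq is a min-heap, so push the
    negated importance score to pop the highest-scored URL first."""
    def __init__(self):
        self._heap = []

    def add(self, url, score):
        heapq.heappush(self._heap, (-score, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1]

f = Frontier()
f.add("http://example.com/minor", 0.2)
f.add("http://example.com/home", 0.9)
f.add("http://example.com/news", 0.7)
assert f.next_url() == "http://example.com/home"  # highest score first
```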

    We assign an importance score to each URL which we discover and then crawl them accordingly. 
    We use Redis sorted sets to store the priority associated with each URL and hashes to store the visited status of the discovered URLs. This, of course, comes with a large memory footprint.

    A possible alternative would be to use Bloom filters.
    ==> A false positive means the Bloom filter says you have already visited a URL when you actually haven’t. The consequence is that you will miss crawling some web pages rather than crawl the same page twice.
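A minimal Bloom filter over md5-derived bit positions (the sizes below are arbitrary), illustrating the tradeoff described above:

```python
import hashlib

class BloomFilter:
    """Compact visited-set. A false positive means a URL may be
    skipped even though it was never crawled; there are no false
    negatives."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size, self.num_hashes = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive num_hashes positions by salting the key.
        for i in range(self.num_hashes):
            digest = hashlib.md5(("%d:%s" % (i, url)).encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))

visited = BloomFilter()
visited.add("http://example.com/a")
assert "http://example.com/a" in visited  # added URLs are always reported
```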

    Since one of the tasks we do is in building price histories of products, recrawling of pages needs to be done intelligently.
    We want to recrawl products that change prices frequently (duh) more often than pages that don’t really change their price. 
    Hence a brute-force complete recrawl doesn’t really make sense.

    We use a power law distribution (some pages are crawled more frequently than other pages) for recrawling products, with pages ranked based on their importance (using signals ranging from previous price history changes to how many reviews they have).

    Another challenge is in page discovery (eg: if a new product has been launched, how quickly can we have it in our system).
    ===> If the crawler is built for a company’s own sites, we can provide an API so clients can add newly added pages or trigger a manual crawl of them.

    Check resource from https://www.linkedin.com/pulse/20140831103409-27984489-designing-a-search-engine-design-patterns-for-crawlers

    If you were asked to design a basic Web crawler, how would you design it? What factors would you need to consider?


    This is a common design question in interviews. If you have never done this kind of design, it is actually not easy to give an answer that satisfies the interviewer. The question is not limited to interviews at search engine companies. Here, we answer it from two angles: Junior level and Senior level.

    1. How to abstract the internet

    Junior: abstract it as a graph, with web pages as nodes and the links within pages as directed edges.

    Senior: same as above.

    2. Crawling algorithm

    Junior: use BFS. Maintain a queue; after fetching a page, parse the page's links and push them onto the queue.

    Senior: use priority-queue scheduling. Unlike plain BFS, assign each page a crawl weight and fetch higher-weighted pages first. Factors for setting the weight include: 1. whether the page belongs to a popular site; 2. the link length; 3. the weight of the pages linking to this page; 4. how many times the page is linked to; and so on.


    3. Network model

    Junior: multithreaded fetching.

    Senior: consider single-machine and distributed crawling separately. On a single Windows machine, IOCP (I/O completion ports) can be used for asynchronous fetching, which makes the most of otherwise idle resources. Network access involves waiting: if you simply open many threads at once, the machine spends a great deal of time switching between threads and very little time processing the fetched results. IOCP can match the throughput of dozens of synchronously fetching threads using only a few threads. For multi-machine crawling, consider where the machines are located; for example, use machines in Asia to crawl Asian sites.
    4. Real-time (news) crawling
    Junior: no answer needed.
    Senior: news pages are generally fetched by a dedicated crawler whose weight settings differ from an ordinary crawler's. First, news sources need to be selected. There are two approaches: manually configure news sources, such as the Sina homepage, or use machine learning. A news source can be defined as a page with a very large number of links whose content changes frequently. Starting from the news-source pages, crawl downward within a given depth limit, then use the timestamp information in the pages to decide which ones to add as news pages.
    5. Page updates
    Junior: no answer needed.
    Senior: after a page has been fetched, some pages keep changing and some do not, so the crawler needs to track some "liveness" information per page. When a new link is discovered, its liveness timestamp should be the discovery time, meaning it needs to be fetched immediately. After a page is fetched, its liveness timestamp can be set to x minutes later; once x minutes pass, this timestamp tells us the page should be fetched again right away. After a page is fetched a second time, its content needs to be compared with the previous version.
    Keep in mind a production web crawler can be very sophisticated and normally takes several teams weeks or months to develop. The interviewer will not expect you to cover every detail, but you should be able to mention some key design perspectives.

    How to abstract the internet?
    You should quickly realize the internet could be abstracted as a directed graph, with each page as a node and hyperlinks as an edge.

    How to crawl?

    BFS is normally used. However, DFS is also used in some situation, such as if your crawler has already established a connection with the website, it might just DFS all the URLs within this website to save some handshaking overhead.

    Decide what you want to crawl?

    The internet is huge, thus your graph is huge. It is almost impossible to crawl the entire internet, since it keeps growing every second. Google has indexed roughly 60 trillion pages while Bing has around 30 trillion, roughly speaking.

    Some strategy used in practice includes

    a) Having a few websites in your crawler’s most frequent list to crawl, such as some authoritative news website, etc

    b) You should have lots of fetchers living on many host classes. Use machine learning to predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.

    How to track what your fetchers have crawled?

    You don’t want your fetchers to crawl some websites over and over again while other websites don’t get crawled at all. There are many ways to achieve this. For example, your scheduler should generate non-duplicate jobs for fetchers. Or, your fetchers could keep track of the last visited time for each URL. Note that, due to scalability, consistent hashing is needed.

    Respect the standard?
    If a crawler sees a robots.txt on a site, it should skip crawling it. However, if this happens a lot for a fetcher, it is doing less useful work on that run. One optimization is to let the fetcher pick some other sites to crawl, to reduce the overhead of spawning a fetcher.

    Sometime, URL on a website is not easy to extract. For example, some URLs are generated by javascript. You need to have some ways to execute those javascript and get the URL.



    Robustness: avoid fetching an infinite number of pages in a particular domain.
    Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them.

    Performance and efficiency: CPU, storage, and network bandwidth.
    Quality: be biased towards fetching “useful” pages first.
    Freshness: in continuous mode, the crawler should obtain fresh copies of previously fetched pages.
    Extensible: e.g. new data formats, new fetch protocols.

    begin with a seed set

    Check already fetched -- the simplest implementation for this would use a simple fingerprint such as a checksum. https://en.wikipedia.org/wiki/Checksum
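A checksum-based content-seen test might look like this; sha1 is an arbitrary choice of hash:

```python
import hashlib

def fingerprint(page_bytes):
    # Cheap exact-duplicate fingerprint; shingles (below) are more
    # robust for near-duplicates, but a hash catches exact copies.
    return hashlib.sha1(page_bytes).hexdigest()

seen = set()

def already_fetched(page_bytes):
    fp = fingerprint(page_bytes)
    if fp in seen:
        return True
    seen.add(fp)
    return False

assert not already_fetched(b"<html>hello</html>")
assert already_fetched(b"<html>hello</html>")  # same content, other URL
```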

    A more sophisticated test would use shingles
    URL filter is used to determine whether the extracted URL should be excluded from the frontier based on one of several tests.

    Many hosts on the Web place certain portions of their websites off-limits to crawling, under a standard known as the Robots Exclusion Protocol. This is done by placing a file with the name robots.txt at the root of the URL hierarchy at the site. Crawlers therefore maintain a cache of robots.txt files.
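Python's standard library can parse and apply these rules; the robots.txt body below is a made-up example (in production you would fetch one per host and cache the parsed result):

```python
from urllib import robotparser

# Parse a robots.txt body directly instead of fetching it over HTTP.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

assert rp.can_fetch("mybot", "http://example.com/public/page.html")
assert not rp.can_fetch("mybot", "http://example.com/private/page.html")
```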

    URLs should be normalized in the following sense: often the HTML encoding of a link on a web page p gives the target of that link relative to page p, so relative links must be resolved to absolute URLs.
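    Resolving relative links and stripping fragments can be sketched with the standard library (the function name is hypothetical):

```python
from urllib.parse import urljoin, urldefrag

def normalize(base_url, link):
    """Resolve a possibly-relative link against the page it appears on,
    and drop the #fragment so equivalent URLs compare equal."""
    absolute = urljoin(base_url, link)
    url, _fragment = urldefrag(absolute)
    return url
```

    Fuller normalization would also lowercase the host, remove default ports, and collapse `.`/`..` segments; `urljoin` already handles the last of these.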

    Finally, the URL is checked for duplicate elimination. When the URL is added to the frontier, it is assigned a priority based on which it is eventually removed from the frontier for fetching.

    Certain housekeeping tasks are typically performed by a dedicated thread. 
    This thread is generally quiescent except that it wakes up once every few seconds to log crawl progress statistics (URLs crawled, frontier size, etc.), decide whether to terminate the crawl, or (once every few hours of crawling) checkpoint the crawl. 

    In checkpointing, a snapshot of the crawler’s state (say, the URL frontier) is committed to disk. In the event of a catastrophic crawler failure, the crawl is restarted from the most recent checkpoint.
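    A minimal checkpointing sketch (hypothetical names), writing the snapshot atomically so that a crash mid-write never corrupts the previous checkpoint:

```python
import os
import pickle
import tempfile

def checkpoint(frontier, path):
    """Snapshot the URL frontier: write to a temp file first, then
    atomically rename over the old checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(frontier, f)
    os.replace(tmp, path)   # atomic on POSIX: old checkpoint survives a crash

def restore(path):
    """Reload the most recent checkpoint after a crawler failure."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

    A production crawler would checkpoint more state than the frontier (e.g. the content-seen set) and likely use a database rather than pickle, but the write-then-rename pattern is the same.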

    Documents change over time and so, in the context of continuous crawling, we must be able to delete their outdated fingerprints/shingles from the content-seen set(s). In order to do so, it is necessary to save the fingerprint/shingle of the document in the URL frontier, along with the URL itself.

    DNS resolution is a well-known bottleneck in web crawling. It may entail multiple requests and round-trips across the internet, requiring seconds and sometimes even longer.

    Standard DNS lookups, even with a DNS cache, are generally synchronous: once a request is made to the Domain Name Service, other crawler threads at that node are blocked until the first request completes. To circumvent this, most web crawlers implement their own DNS resolver as a component of the crawler. The thread executing the resolver code sends a message to the DNS server and then performs a timed wait: it resumes either when signaled by another thread or when a set time quantum expires. A single, separate DNS thread listens on the standard DNS port (port 53) for incoming response packets from the name service.

    Upon receiving a response, the DNS thread signals the appropriate crawler thread and hands it the response packet, if that thread has not already resumed because its time quantum expired. A crawler thread that resumes because its wait quantum expired retries for a fixed number of attempts, sending a new message to the DNS server and performing a timed wait each time; the designers of Mercator recommend on the order of five attempts. The wait’s time quantum increases exponentially with each attempt; Mercator started with one second and ended with roughly 90 seconds, in consideration of the fact that some host names take tens of seconds to resolve.
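    The timed wait with exponential backoff described above can be sketched with a condition variable (a simplified, hypothetical sketch; in the real design the `send` callback fires a UDP query and the separate DNS listener thread on port 53 is what calls `deliver`):

```python
import threading

class DnsRequest:
    """Mercator-style timed wait: the crawler thread sleeps on a condition
    variable; the DNS listener thread signals it when the response arrives.
    On timeout, the thread retries with an exponentially growing wait."""

    def __init__(self):
        self.cond = threading.Condition()
        self.response = None

    def deliver(self, response):
        # Called by the DNS listener thread when a packet arrives.
        with self.cond:
            self.response = response
            self.cond.notify()

    def resolve(self, send, attempts=5, first_timeout=1.0):
        timeout = first_timeout
        with self.cond:
            for _ in range(attempts):
                send()                    # fire off (or re-send) the query
                if self.cond.wait_for(lambda: self.response is not None,
                                      timeout=timeout):
                    return self.response
                timeout *= 2              # exponential backoff, as in Mercator
        return None                       # gave up after all attempts
```

    With five attempts and a doubling quantum starting at one second, the final wait is 16 seconds, in the same spirit as Mercator’s 1-to-90-second schedule.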

     The URL frontier
     High-quality pages that change frequently should be prioritized for frequent crawling. Thus, the priority of a page should be a function of both its change rate and its quality (using some reasonable quality estimate). The combination is necessary because a large number of spam pages change completely on every fetch.

    The second consideration is politeness: we must avoid repeated fetch requests to a host within a short time span. A common heuristic is to insert a gap between successive fetch requests to a host that is an order of magnitude larger than the time taken for the most recent fetch from that host. The frontier’s goals are to ensure that (i) only one connection is open at a time to any host; (ii) a waiting time of a few seconds occurs between successive requests to a host; and (iii) high-priority pages are crawled preferentially.

    There are two ways of partitioning a distributed index across nodes: partitioning by terms, also known as global index organization, and partitioning by documents. In the former, the dictionary of index terms is partitioned into subsets, each subset residing at a node. Along with the terms at a node, we keep the postings for those terms. A query is routed to the nodes corresponding to its query terms. In principle, this allows greater concurrency, since a stream of queries with different query terms would hit different sets of machines. In practice, partitioning indexes by vocabulary terms turns out to be nontrivial. Multi-word queries require sending long postings lists between sets of nodes for merging, and the cost of this can outweigh the greater concurrency. Load balancing the partition is governed not by an a priori analysis of relative term frequencies, but by the distribution of query terms and their co-occurrences, which can drift with time or exhibit sudden bursts. Achieving good partitions is a function of the co-occurrences of query terms and entails clustering terms to optimize objectives that are not easy to quantify. Finally, this strategy makes dynamic indexing more difficult.
    A more common implementation is to partition by documents: each node contains the index for a subset of all documents. Each query is distributed to all nodes, with the results from various nodes merged before presentation to the user. This strategy trades more local disk seeks for less inter-node communication. One difficulty is that global statistics used in scoring — such as idf — must be computed across the entire document collection, even though the index at any single node contains only a subset of the documents. These are computed by distributed “background” processes that periodically refresh the node indexes with fresh global statistics.
    Keep the Bloom filter in the master’s memory, and store the visited URLs in Redis running on the master, so that all of these operations are O(1).

    Master: reading and writing the task list should be handled by a single master machine. This keeps the data consistent and makes management easier, so this machine needs fast, stable reads and writes.
    Slave: the slaves do the actual crawling and send what they fetch back to the master. One more, one fewer, or even a broken slave barely affects the overall crawl, so these machines do not need especially high computing power.

    -       One kind is a list page: a page that aggregates many news hyperlinks.
    -       The other is an ordinary page, i.e. a single article.

    2. Multiple Page Website Crawler
    The crawler first has to find the list page, and from there starts collecting pages. This yields the most basic architecture for crawling a multi-page website:

    The figure above shows a very basic crawler structure. Note that for each website there is only one list crawler, but there are many news crawlers, and the news crawlers can be reused.
    The scheduler acts as a master, controlling the entire crawl. But there is a problem with reuse: the list crawlers can be wasted. For example, if Sina has little news, its list crawler sits idle, while 163 has lots of news but not enough crawlers; this wastes resources.


    3. Master-Slave Crawler

    This structure no longer distinguishes list crawlers from page crawlers: one master handles all reads and writes, and the other crawlers fetch. This nicely solves the resource-waste problem. The figure also shows the schema of the task table, several of whose columns are worth thinking about:
    - Priority
    - State: includes the states done, working, and new. Besides management, this column can be used for verification: if a task’s state is done, the page should have been crawled, so if the expected data (the link’s content) is missing, we can use this column to check whether that crawler is broken.
    - Timestamp


    Now for the scheduler — a reasonable interview topic: locking with multiple threads. The first locking approach is sleeping.

    1) Sleep
    Crawl when there are tasks; sleep when there are none. The drawback of this approach is deciding how long to sleep. The course material uses one second, but if a task arrives during that second, the crawler still cannot start immediately. Even so, sleeping is a very common multithreading technique. The diagram is clear; per the video, understand it and memorize it.
    2) Conditional Variable

    1.    A thread checks the condition itself
    2.    Threads can queue up waiting
    3.    Another thread notifies the waiters

    cond_wait means: if there is no task, wait, releasing the lock internally.
    cond_signal wakes up one waiting thread.
    This is the concrete implementation using these two condition-variable operations.
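    A Python sketch of the cond_wait / cond_signal pattern just described (`Condition.wait` releases the lock internally, and `notify` wakes one waiter, playing exactly those roles):

```python
import threading
from collections import deque

task_queue = deque()
cond = threading.Condition()

def put_task(task):
    """Scheduler side: enqueue a task and wake one waiting crawler
    (this is the cond_signal)."""
    with cond:
        task_queue.append(task)
        cond.notify()

def get_task():
    """Crawler side: if there is no task, wait. wait() releases the
    lock internally, exactly as cond_wait does."""
    with cond:
        while not task_queue:   # re-check after every wakeup
            cond.wait()
        return task_queue.popleft()
```

    The `while` (rather than `if`) around `wait()` matters: a woken thread must re-check the condition, since another crawler may have taken the task first.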


    3) Semaphore

    The video offers a good analogy for this: if I can handle 10 new tasks, I hold 10 tokens. No token means no capacity for a new task, so the crawler waits. When a crawler finishes a page, it puts its token back.
    - Advantage
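    The token analogy maps directly onto a counting semaphore; a minimal sketch (names are hypothetical):

```python
import threading

MAX_TASKS = 10
tokens = threading.Semaphore(MAX_TASKS)   # 10 tokens = room for 10 in-flight tasks

def crawl(url, fetch):
    """Take a token before crawling; if none is available, block until
    another crawler finishes and returns its token."""
    tokens.acquire()                      # take a token (blocks at 0)
    try:
        return fetch(url)
    finally:
        tokens.release()                  # put the token back
```

    Compared with the condition-variable version, the semaphore bundles the counter and the waiting into one primitive, which is why it fits this capacity-limiting pattern so naturally.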

    The video then uses the crawler to walk through the producer-consumer model in detail — a classic treatment of processing large volumes of data. At scale, one machine certainly cannot hold all the crawlers, so we deploy them across multiple machines. Each crawler fetches pages for its tasks and sends the results back to a connector, which handles the reads and writes to the table. But giving every crawler its own connector is wasteful.

    So, to reduce the number of connectors, split them according to the crawlers’ two kinds of work — taking tasks and sending back pages — into two classes: sender and receiver. A group of crawlers then needs only these two large control structures.



    3. Design and Implementation
    3.1 Master
    3.2 Data Structures
    A CrawlManager mainly manages two data structures: UrlToDo and UrlDone. The former can be abstracted as a linked list, a stack, or a priority queue; the latter behaves externally as a Set and handles deduplication. Once these ADTs (abstract data types) are defined, we can consider how to implement them. This design method is the same as implementing an ordinary data structure, except that instead of operating on in-memory arrays and lists, we operate on the facilities each store provides — such as Redis Lists and Sets, or tables in a relational database.
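    An in-memory sketch of these two ADTs (hypothetical names; the comments note the Redis operations each step would map onto in a Redis-backed implementation):

```python
from collections import deque

class CrawlManager:
    """UrlToDo as a FIFO queue, UrlDone as a set for deduplication.
    With Redis: UrlToDo -> a List (LPUSH / RPOP),
                UrlDone -> a Set  (SADD / SISMEMBER)."""

    def __init__(self):
        self.url_todo = deque()   # Redis List
        self.url_done = set()     # Redis Set

    def add(self, url):
        # Skip URLs already crawled or already queued.
        # (The linear scan of the deque is fine for a sketch;
        # Redis would use SISMEMBER on a companion set instead.)
        if url not in self.url_done and url not in self.url_todo:
            self.url_todo.append(url)     # LPUSH

    def next_url(self):
        if not self.url_todo:
            return None
        url = self.url_todo.popleft()     # RPOP
        self.url_done.add(url)            # SADD
        return url
```

    Swapping the deque for a heap turns UrlToDo into the priority queue variant mentioned above without touching the rest of the interface.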

    4. Postscript
    A Facebook interview question: discuss a web crawler.

    “The idea is that every machine starts from the seed, hashes each URL it obtains, and handles only jobs whose hash values were assigned to it in advance. If a URL’s hash does not match the current machine’s assigned values, skip it; otherwise, crawl it.”


    This is the classic distributed hash table problem. Starting from the root URL, download the web page, extract the embedded URL links, and distribute each link to the machine responsible for it. In general, each machine receives URL links from its peers, checks whether each has already been visited, and if not, puts it into a work queue; a dedicated thread takes URLs from the queue, downloads pages, extracts URLs, and distributes them to other machines.
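    The hash-based assignment can be sketched as follows (hypothetical names; plain modulo hashing is shown for simplicity):

```python
import hashlib

NUM_MACHINES = 4

def machine_for(url, num_machines=NUM_MACHINES):
    """Each machine owns a fixed slice of the hash space; a URL extracted
    anywhere is forwarded to the machine that owns its hash."""
    h = int(hashlib.md5(url.encode()).hexdigest(), 16)
    return h % num_machines

def should_crawl(url, my_id, num_machines=NUM_MACHINES):
    # A machine skips any URL whose hash is not assigned to it.
    return machine_for(url, num_machines) == my_id
```

    The weakness of plain modulo hashing is that changing the machine count remaps almost every URL, which is exactly the problem consistent hashing addresses.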

    1. Split the database by hash range, with, say, one database instance per hundred machines.
    2. Add a layer in front of the crawlers, called dispatchers — say 100 of them; each dispatcher is responsible for one range and assigns tasks to the crawlers below it.

    Of course, we still need to consider what happens when a dispatcher goes down.

    For example, suppose 1 billion URLs need to be crawled periodically. A simple approach: give each URL a hash in the range 1 to 1 billion, and divide the hash values into 10K groups, each covering a range. Each range is a partition, or shard, and each machine is responsible for one shard.

    A better approach is consistent hashing, or simply storing the URLs in Cassandra, which scales to hundreds or thousands of nodes; supporting 10K machines is no problem.

