Monday, June 29, 2015

Design a web crawler


TODO:http://www.ijcaonline.org/volume15/number7/pxc3872629.pdf
http://www.slideshare.net/gnap/design-and-implementation-of-a-high-performance-distributed-web-crawler

http://www.1point3acres.com/bbs/thread-148865-1-1.html
4. Pirate. Design a Wikipedia crawler.
                  Follow-up 1: no global status.
                  Follow-up 2: deal with machine failure.
                  Follow-up 3: make the wiki site unaware of this crawler.
1. distributed BFS
2. consistent hashing
3. assign tasks with a random delay

Algo Ramblings: Design a web crawler
Crawler basic algorithm
  1. Remove a URL from the unvisited URL list
  2. Determine the IP Address of its host name
  3. Download the corresponding document
  4. Extract any links contained in it.
  5. If the URL is new, add it to the list of unvisited URLs
  6. Process the downloaded document
  7. Back to step 1
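
A minimal sketch of this loop in Python, assuming the caller supplies fetch, extract_links, and process helpers (a real crawler adds politeness, robots.txt handling, retries, and error handling):

```python
# Minimal sketch of the crawl loop above; fetch(), extract_links() and
# process() are assumed helpers, not part of any particular library.
import socket
from collections import deque
from urllib.parse import urlparse

def crawl(seed_urls, fetch, extract_links, process):
    frontier = deque(seed_urls)   # the unvisited URL list
    seen = set(seed_urls)         # URLs that have already been queued
    while frontier:
        url = frontier.popleft()                            # 1. remove a URL
        ip = socket.gethostbyname(urlparse(url).hostname)   # 2. resolve the host name
        doc = fetch(url, ip)                                # 3. download the document
        for link in extract_links(doc, base_url=url):       # 4. extract links
            if link not in seen:                            # 5. queue new URLs only
                seen.add(link)
                frontier.append(link)
        process(doc)                                        # 6. process the document
        # 7. loop back to step 1
```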
Issues in crawler design

General architecture
What pages should the crawler download ?
How should the crawler refresh pages ?
How should the load on the visited web sites be minimized ?
How should the crawling process be parallelized ?

http://baozitraining.org/blog/design-a-basic-web-crawler/
The internet can be abstracted as a directed graph, with each page as a node and each hyperlink as an edge.

How to crawl?
BFS is normally used.

Decide what you want to crawl?
Scope: crawl the whole internet, or just the company's own websites?
Keep a few websites, such as authoritative news sites, in your crawler's most-frequent crawl list.

You should have lots of fetchers living on many host classes. Use machine learning to predict which websites are most likely to have frequent updates, and put those into the fetchers' priority queue.

How to track what your fetchers have crawled?
You don't want your fetchers to crawl some websites over and over again while other websites never get crawled at all. There are many ways to achieve this. For example, your scheduler can generate non-duplicate jobs for the fetchers, or your fetchers can keep track of the last visited time for each URL. Note that for scalability, consistent hashing is needed to partition this state across machines.

Respect the standard?
A crawler should honor a site's robots.txt and skip the pages it disallows.
Some URLs are generated by JavaScript, so you need a way to execute that JavaScript to extract them.

How to rank:
The PageRank algorithm.

1. Forget about the fact that you're dealing with billions of pages. How would you design this system if it were just a small number of pages? You should have an understanding of how you would solve the simple, small case in order to understand how you would solve the bigger case.

2. Now, think about the issues that occur with billions of pages. Most likely you can't fit the data on one machine. How will you divide the data and the crawl work across machines?
http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/
Data Structure
We use priority-search-based graph traversal for our web crawling. It's basically breadth-first search, but we use a priority queue instead of a plain queue to determine which URL to crawl next.
==> BFS + scoring, so important pages/sites get crawled sooner.

We assign an importance score to each URL we discover and then crawl accordingly.
We use Redis sorted sets to store the priority associated with each URL and hashes to store the visited status of the discovered URLs. This, of course, comes with a large memory footprint.
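
A rough sketch of that bookkeeping with redis-py (this assumes redis-py >= 3.0 and Redis >= 5.0 for ZPOPMAX; the key names are made up for illustration):

```python
# Sketch: a sorted set holds the frontier keyed by priority score,
# and a hash records which URLs have already been visited.
import redis

r = redis.Redis()

def add_url(url, score):
    # Only queue URLs we have not visited yet.
    if not r.hexists("visited", url):
        r.zadd("frontier", {url: score})

def next_url():
    # Pop the highest-priority URL and mark it visited.
    popped = r.zpopmax("frontier")
    if not popped:
        return None
    url, _score = popped[0]
    r.hset("visited", url, 1)
    return url.decode()
```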

A possible alternative would be to use Bloom filters.
==> A false positive means the Bloom filter tells you that you have already visited a URL when you actually haven't. The consequence is that you will miss crawling some pages, rather than crawl the same page twice.
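
For illustration, a minimal Bloom filter sketch; the bit-array size and the number of hash functions are arbitrary here, while a production crawler would size them from the expected number of URLs and the target false-positive rate:

```python
# Compact "seen URL" test: no false negatives, but occasional false
# positives, which means some pages may never be crawled.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 24, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, url):
        for p in self._positions(url):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, url):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(url))
```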

Recrawling
Since one of the tasks we do is in building price histories of products, recrawling of pages needs to be done intelligently.
We want to recrawl products whose prices change frequently (duh) more often than pages whose prices rarely change.
Hence a brute-force complete recrawl doesn’t really make sense.

We use a power law distribution (some pages are crawled more frequently than other pages) for recrawling products, with pages ranked based on their importance (using signals ranging from previous price history changes to how many reviews they have).

Another challenge is in page discovery (eg: if a new product has been launched, how quickly can we have it in our system).
===> If the crawler is built for a company's own sites, we can expose an API so clients can add, or manually trigger crawls of, newly added pages.

Check resource from https://www.linkedin.com/pulse/20140831103409-27984489-designing-a-search-engine-design-patterns-for-crawlers

http://www.jiuzhang.com/problem/44/
If you were asked to design a basic web crawler, how would you design it? What factors need to be considered?
There is no standard answer; try to cover as many of the considerations as possible.

From the interviewer's perspective:

This is a common design question in interviews. If you have never done related design work, it is actually not easy to give an answer that satisfies the interviewer. The question is not limited to interviews at search engine companies. Here we answer it from both the Junior level and the Senior level.

1. How to abstract the entire internet

Junior: Abstract it as a directed graph, with web pages as nodes and the hyperlinks inside them as directed edges.

Senior: Same as above.

2. Crawling algorithm
Junior: Use BFS: maintain a queue; after fetching a page, parse its links and push them onto the queue.
Senior: Use priority-queue scheduling. Unlike plain BFS, assign each page a crawl weight and fetch higher-weight pages first. Factors that go into the weight include: 1. whether the page belongs to a popular website; 2. the length of the link; 3. the weights of the pages linking to this page; 4. how many times the page is linked to; and so on.

Going further, popular websites cannot be crawled without limit, so two-level scheduling is needed: first schedule which website to crawl, and then, within the chosen website, schedule which pages to crawl. The benefit is that crawling of any single site is politely rate-limited, which also gives pages on other sites a chance to be crawled.
3. Network model
Junior: Multi-threaded crawling.
Senior: Consider single-machine and distributed crawling separately. On a single Windows machine, IOCP (I/O completion ports) can be used for asynchronous fetching, which makes the most of otherwise idle resources. Network access involves waiting; if you simply open many threads at once, the machine spends a lot of time switching between threads and very little time actually processing the fetched results. With IOCP, a handful of threads can achieve the throughput of dozens of synchronously fetching threads. For multi-machine crawling, consider where the machines are located, e.g. use machines within Asia to crawl Asian sites.
4. Freshness
Junior: No answer required.
Senior: News pages are generally fetched by a dedicated crawler whose weight settings differ from an ordinary crawler's. First the news sources must be selected. There are two ways: manually configure the sources (e.g. the Sina homepage), or select them with machine learning. A news source can be defined as a page that has a very large number of links and whose content changes frequently. Starting from the news-source pages, crawl downward to a given depth limit, then use the timestamp information in each page to decide whether it should be added as a news page.
5. Page updates
Junior: No answer required.
Senior: After a page has been fetched, some pages keep changing while others do not, so each page needs some freshness metadata. When a new page link is discovered, its timestamp is the discovery time, meaning it should be fetched right away. After a page is fetched, its timestamp can be set to x minutes in the future; once those x minutes have passed, the timestamp tells us the page needs to be fetched again. When a page is fetched for the second time, compare it with the previous content: if the content is unchanged, extend the next fetch interval, e.g. to 2x minutes, up to some cap such as three or six months (the exact number depends on your crawler's capacity); if the page has changed, shorten the interval, e.g. fetch again after x/2 minutes.
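
A small Python sketch of this adaptive re-crawl policy; the initial interval, the one-minute floor, the three-month cap, and the MD5 fingerprint are illustrative assumptions rather than values from the original answer:

```python
import hashlib
import time

INITIAL = 10 * 60             # x: first re-crawl after 10 minutes (assumption)
MIN_INTERVAL = 60             # never re-crawl more often than once a minute
CAP = 90 * 24 * 3600          # cap the interval at roughly three months

def fingerprint(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()

def schedule_next(page: dict, new_content: bytes) -> dict:
    """page carries 'fingerprint' and 'interval' from the previous crawl, if any."""
    new_fp = fingerprint(new_content)
    if "fingerprint" not in page:
        interval = INITIAL                                   # newly discovered page
    elif new_fp == page["fingerprint"]:
        interval = min(page["interval"] * 2, CAP)            # unchanged: wait 2x
    else:
        interval = max(page["interval"] // 2, MIN_INTERVAL)  # changed: wait x/2
    page.update(fingerprint=new_fp, interval=interval,
                next_crawl=time.time() + interval)
    return page
```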
http://blog.baozitraining.org/2014/08/how-to-design-basic-web-crawler.htm
Keep in mind that a production web crawler can be very sophisticated and normally takes a few teams weeks or months to develop. The interviewer will not expect you to cover all the details, but you should be able to mention some key design perspectives.

How to abstract the internet?
You should quickly realize that the internet can be abstracted as a directed graph, with each page as a node and each hyperlink as an edge.

How to crawl?

BFS is normally used. However, DFS is also used in some situations: for example, if your crawler has already established a connection with a website, it might DFS all the URLs within that site to save handshaking overhead.

Decide what you want to crawl?

The internet is huge, so your graph is huge. It is almost impossible to crawl the entire internet since it keeps growing every second. Roughly speaking, Google has indexed about 60 trillion pages and Bing about 30 trillion.
Some strategies used in practice include:

a) Keeping a few websites, such as authoritative news sites, in your crawler's most-frequent crawl list.

b) Having lots of fetchers living on many host classes. Use machine learning to predict which websites are most likely to have frequent updates, and put those into the fetchers' priority queue.

How to track what your fetchers have crawled?
You don't want your fetchers to crawl some websites over and over again while other websites never get crawled at all. There are many ways to achieve this. For example, your scheduler can generate non-duplicate jobs for the fetchers, or your fetchers can keep track of the last visited time for each URL. Note that for scalability, consistent hashing is needed to partition this state across machines.
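
One way to realize the consistent-hashing idea is a hash ring that maps each host to a fetcher, so that adding or removing a fetcher only remaps a small fraction of hosts. A sketch (the fetcher names are hypothetical):

```python
# Consistent hashing: each fetcher gets many virtual points on a ring,
# and a host is owned by the first point clockwise from its hash.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, replicas=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, host):
        idx = bisect.bisect(self.keys, self._hash(host)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["fetcher-1", "fetcher-2", "fetcher-3"])
print(ring.node_for("en.wikipedia.org"))   # the same host always maps to the same fetcher
```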

Respect the standard?
A crawler should honor a site's robots.txt and skip the pages it disallows. However, if this happens a lot for a particular fetcher, the fetcher does less useful work per run; one optimization is to let it pick some other sites to crawl, to amortize the overhead of spawning a fetcher.

Sometimes the URLs on a website are not easy to extract. For example, some URLs are generated by JavaScript, so you need a way to execute that JavaScript and get the URLs.
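
A sketch of the robots.txt side using Python's standard library, with a per-host cache so robots.txt is not re-fetched for every URL (the user-agent string is a placeholder):

```python
from urllib import robotparser
from urllib.parse import urlparse

_robots_cache = {}

def allowed(url, user_agent="MyCrawler"):
    host = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = _robots_cache.get(host)
    if rp is None:
        rp = robotparser.RobotFileParser(host + "/robots.txt")
        rp.read()                      # fetch and parse the site's robots.txt once
        _robots_cache[host] = rp
    return rp.can_fetch(user_agent, url)
```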


[Reference]

Coding
https://github.com/jefferyyuan/500lines/tree/master/crawler

http://itsumomono.blogspot.com/2015/08/crawling.html
Robustness: avoid being lured (e.g. by spider traps) into fetching an infinite number of pages in a particular domain.
Politeness: Web servers have both implicit and explicit policies regulating the rate at which a crawler can visit them.

 Distributed:
 Scalable:
 Performance and efficiency: CPU, storage and network bandwidth.
 Quality:be biased towards fetching “useful” pages first.
 Freshness: continuous mode: it should obtain fresh copies of previously fetched pages.
 Extensible:  e.g. new data formats, new fetch protocols.

Begin with a seed set of URLs.

Check whether the content has already been fetched -- the simplest implementation would use a simple fingerprint such as a checksum (https://en.wikipedia.org/wiki/Checksum).

A more sophisticated test would use shingles.
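
A sketch of both content-seen tests: an exact checksum, and a shingle-based near-duplicate check using Jaccard similarity over word k-grams (the k of 4 and the 0.9 threshold are illustrative choices):

```python
import hashlib

def checksum(content: bytes) -> str:
    # Exact test: two pages are duplicates only if their bytes hash identically.
    return hashlib.sha1(content).hexdigest()

def shingles(text: str, k: int = 4) -> set:
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def near_duplicate(text_a: str, text_b: str, threshold: float = 0.9) -> bool:
    # Jaccard similarity of the two shingle sets.
    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold
```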
URL filter is used to determine whether the extracted URL should be excluded from the frontier based on one of several tests.

Many hosts on the Web place certain portions of their websites off-limits to crawling, under a standard known as the Robots Exclusion Protocol. This is done by placing a file named robots.txt at the root of the URL hierarchy at the site. Crawlers therefore maintain a cache of robots.txt files to avoid re-fetching them for every URL.

URLs should be normalized: often the HTML encoding of a link on a web page p gives the target of that link relative to p, so it must be resolved into an absolute, canonical URL.
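
A sketch of such normalization with Python's urllib: resolve the link against the page it appears on, lower-case the scheme and host, and drop the fragment:

```python
from urllib.parse import urljoin, urlparse, urlunparse

def normalize(link, base_url):
    absolute = urljoin(base_url, link)   # resolve the link relative to page p
    parts = urlparse(absolute)
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.params,
        parts.query,
        "",                              # drop the #fragment
    ))

print(normalize("../a/b.html", "http://Example.COM/x/y/z.html"))
# -> http://example.com/x/a/b.html
```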

Finally, the URL is checked for duplicate elimination. When the URL is added to the frontier, it is assigned a priority based on which it is eventually removed from the frontier for fetching.

Certain housekeeping tasks are typically performed by a dedicated thread. 
This thread is generally quiescent except that it wakes up once every few seconds to log crawl progress statistics (URLs crawled, frontier size, etc.), decide whether to terminate the crawl, or (once every few hours of crawling) checkpoint the crawl. 

In checkpointing, a snapshot of the crawler's state (say, the URL frontier) is committed to disk. In the event of a catastrophic crawler failure, the crawl is restarted from the most recent checkpoint.

Documents change over time and so, in the context of continuous crawling, we must be able to delete their outdated fingerprints/shingles from the content-seen set(s). In order to do so, it is necessary to save the fingerprint/shingle of the document in the URL frontier, along with the URL itself.

DNS resolution is a well-known bottleneck in web crawling. DNS resolution may entail multiple requests and round-trips across the internet, requiring seconds and sometimes even longer.

Standard DNS lookup implementations, even with a DNS cache, are generally synchronous. This means that once a request is made to the Domain Name Service, other crawler threads at that node are blocked until the first request is completed. To circumvent this, most web crawlers implement their own DNS resolver as a component of the crawler. The thread executing the resolver code sends a message to the DNS server and then performs a timed wait: it resumes either when it is signaled by another thread or when a set time quantum expires. A single, separate DNS thread listens on the standard DNS port (port 53) for incoming response packets from the name service.

Upon receiving a response, it signals the appropriate crawler thread (in this case, i) and hands it the response packet, if i has not yet resumed because its time quantum expired. A crawler thread that resumes because its wait time quantum has expired retries for a fixed number of attempts, sending out a new message to the DNS server and performing a timed wait each time; the designers of Mercator recommend on the order of five attempts. The time quantum of the wait increases exponentially with each attempt; Mercator started with one second and ended with roughly 90 seconds, in consideration of the fact that there are host names that take tens of seconds to resolve.

The URL frontier
High-quality pages that change frequently should be prioritized for frequent crawling. Thus, the priority of a page should be a function of both its change rate and its quality (using some reasonable quality estimate). The combination is necessary because a large number of spam pages change completely on every fetch.

The second consideration is politeness: we must avoid repeated fetch requests to a host within a short time span. A common heuristic is to insert a gap between successive fetch requests to a host that is an order of magnitude larger than the time taken for the most recent fetch from that host. The frontier's goals are to ensure that (i) only one connection is open at a time to any host; (ii) a waiting time of a few seconds occurs between successive requests to a host; and (iii) high-priority pages are crawled preferentially.
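
A sketch of a politeness-aware frontier that keeps one queue per host and enforces the gap heuristic, with a factor of 10 standing in for "an order of magnitude":

```python
import time
from collections import defaultdict, deque

class PoliteFrontier:
    def __init__(self, gap_factor=10):
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.ready_at = {}                 # host -> earliest allowed next fetch
        self.gap_factor = gap_factor

    def add(self, host, url):
        self.queues[host].append(url)

    def next_url(self):
        now = time.time()
        for host, queue in self.queues.items():
            if queue and self.ready_at.get(host, 0) <= now:
                return host, queue.popleft()
        return None                        # nothing is polite to fetch right now

    def record_fetch(self, host, fetch_seconds):
        # Wait roughly an order of magnitude longer than the last fetch took.
        self.ready_at[host] = time.time() + self.gap_factor * fetch_seconds
```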


There are two ways to distribute the index across machines: partitioning by terms, also known as global index organization, and partitioning by documents. In the former, the dictionary of index terms is partitioned into subsets, each subset residing at a node. Along with the terms at a node, we keep the postings for those terms. A query is routed to the nodes corresponding to its query terms. In principle, this allows greater concurrency, since a stream of queries with different query terms would hit different sets of machines. In practice, partitioning indexes by vocabulary terms turns out to be nontrivial. Multi-word queries require sending long postings lists between sets of nodes for merging, and the cost of this can outweigh the greater concurrency. Load balancing the partition is governed not by an a priori analysis of relative term frequencies, but by the distribution of query terms and their co-occurrences, which can drift with time or exhibit sudden bursts. Achieving good partitions is a function of the co-occurrences of query terms and entails clustering terms to optimize objectives that are not easy to quantify. Finally, this strategy makes implementation of dynamic indexing more difficult.

A more common implementation is to partition by documents: each node contains the index for a subset of all documents. Each query is distributed to all nodes, with the results from the various nodes being merged before presentation to the user. This strategy trades more local disk seeks for less inter-node communication. One difficulty in this approach is that global statistics used in scoring, such as idf, must be computed across the entire document collection even though the index at any single node only contains a subset of the documents. These are computed by distributed "background" processes that periodically refresh the node indexes with fresh global statistics.
http://systemdesigns.blogspot.com/2015/12/web-crawling-system.html
The Bloom filter is kept in the master's memory, and the set of visited URLs is stored in Redis running on the master, which keeps all of these operations O(1).


This is essentially centralized task allocation:
Master: reading and writing the task list should be handled by a single master machine. This keeps the data consistent and is easier to manage, so this machine needs fast, stable reads and writes.
Slave: the slaves do the crawling and send whatever they fetch back to the master. One more slave, one fewer, or even a broken one barely affects the overall crawl, so these machines do not need especially high computing power.

In the lecture video, pages are divided into two kinds:
-       a "list" page, which aggregates many news hyperlinks
-       a "page", i.e. an ordinary content page

2. Multiple Page Website Crawler
The crawler first has to find the list page and start collecting pages from there. This yields the most basic architecture for crawling a multi-page website.


The figure above shows a very basic crawler structure. Note that for each website there is only one list crawler but many news crawlers, and the news crawlers can be reused. The scheduler acts as a master that controls the whole crawl. However, there is a problem with reuse: the list crawlers can be wasted. For example, if Sina has little news, its list crawler sits idle, while 163 has lots of news but not enough crawlers, which wastes resources.

So the handout offers a third version of the architecture.

3. Master-Slave Crawler

This structure no longer distinguishes list crawlers from page crawlers. One master is responsible for reads and writes, and the other crawlers just crawl. This nicely solves the resource-waste problem. The figure also sketches the task table; several of its columns are worth thinking about.
- Priority
Priority is used to rank crawl jobs; for example, jobs that crawl list pages can be given a higher priority and jobs that crawl ordinary pages a lower one.
- State
State takes the values done, working, and new. It helps with management and can also be used for verification: if a crawler's job is marked done but the expected data (the content behind the link) is missing, that crawler may be broken.
- Timestamp
Finally there are two timestamps. They make scheduled, periodic crawling possible, which also helps the crawler stay friendly to a website.
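
For concreteness, one row of such a task table might look like the sketch below; the field names are illustrative rather than taken from the original slides:

```python
from dataclasses import dataclass

@dataclass
class CrawlTask:
    url: str
    priority: int          # e.g. higher for "list" pages, lower for ordinary pages
    state: str             # "new" | "working" | "done"
    created_at: float      # when the task was added
    available_at: float    # earliest time it may be crawled (scheduling / politeness)
```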

Scheduler

The scheduler is a reasonable interview topic on its own: it tests multithreaded locking. The first way to coordinate is sleeping.

1) Sleep
Crawl when there is a task, sleep when there is not. The drawback is choosing how long to sleep. The slides use one second, but if a task arrives during that second the crawler still cannot start right away. Even so, sleeping is a very common multithreading technique; the diagram is clear, and the video suggests understanding and memorizing it.

2) Conditional Variable
The second approach is more involved: condition variables. The video describes it with an analogy: a customer walks into a restaurant and asks whether there is a free table; if not, they wait in the waiting area, and as soon as a table frees up the restaurant notifies them.
1.    You can check for yourself.
2.    You can queue up.
3.    Someone notifies you.

Cond_wait means: if there is no task, wait, releasing the lock internally.
Cond_signal wakes up one waiting thread.
The original slides show a concrete implementation of these two condition-variable operations; the instructor in the video says this, too, should be understood and memorized.
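
A minimal sketch of that pattern with Python's threading.Condition, assuming a single shared task queue (this is not the original slide code):

```python
import threading
from collections import deque

tasks = deque()
cond = threading.Condition()

def add_task(task):
    with cond:
        tasks.append(task)
        cond.notify()           # Cond_signal: wake one waiting crawler thread

def get_task():
    with cond:
        while not tasks:        # re-check the condition after every wake-up
            cond.wait()         # Cond_wait: release the lock and sleep
        return tasks.popleft()
```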

3) Semaphore
The slides then mention a third classic way to handle multithreading: semaphores.

The video gives a good analogy for this as well: if I can handle 10 new tasks, I hold 10 tokens. No token means there is no new task, so the crawler waits; when a crawler finishes a page, the token is put back.
- Advantage
The clever part is that it simplifies the control variables: a single integer controls many crawlers. The semaphore is just an integer; a positive value means there are new tasks, a negative one means there are none. Each crawler only has to check this number: take a task, decrement it; discover a new task, increment it.
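
A minimal sketch with Python's threading.Semaphore, where the counter tracks the number of available tasks (one reading of the analogy above, not the slides' exact scheme):

```python
import threading
from collections import deque

tasks = deque()
task_count = threading.Semaphore(0)   # starts at 0: no tasks available yet
lock = threading.Lock()

def add_task(task):
    with lock:
        tasks.append(task)
    task_count.release()               # +1: a new task is available

def get_task():
    task_count.acquire()               # blocks while the counter is 0
    with lock:
        return tasks.popleft()
```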


Producer-consumer model
The video then walks through the producer-consumer model in more detail using the crawler, and it is a classic treatment of handling large volumes of data. With big data, one machine clearly cannot hold all the crawlers, so they must be deployed across multiple machines. Each crawler fetches tasks, crawls the pages, and sends the resulting pages back to a connector, which is responsible for reading from and writing to the table. But giving every crawler its own connector is somewhat wasteful.

So, to reduce the number of connectors, the crawlers' work is split by function -- fetching tasks and sending pages back -- into two kinds, sender and receiver, so that a group of crawlers only needs these two larger control structures.

http://www.cnblogs.com/javanerd/p/5121472.html


The single-machine crawler above works well when the data volume is small and pages do not need frequent refreshing, but it does not work well when there are too many pages to crawl or the site imposes anti-crawling limits.
For example, a general-purpose search crawler that must fetch a huge number of pages needs multiple crawlers working together, and those crawlers inevitably have to share the two data structures above.
Also, many websites now throttle crawlers: crawl too frequently and your IP gets banned. The usual countermeasure is to sleep for a while, which in turn wastes CPU.
Finally, when different crawl strategies must be supported, or crawler job lifecycles must be managed centrally, a Master is needed to coordinate the Slaves.

3. Design and implementation
3.1 Master:
The framework's master node is called WebCrawlerMaster. For each crawl job, WebCrawlerMaster creates a separate WebCrawlerManager, whose role is to manage the UrlsToDo and UrlsDone data structures. The Master's main responsibilities are managing the WebCrawlerManager instances and routing the different requests to the corresponding WebCrawlerManager.
The Master's most important component is a member called MetaDataHolder, which manages the metadata. To make the system robust, this information must be persisted; the persistence layer can be Redis, a relational database, or even plain files. If MySQL is used for persistence, an application-level cache (usually a HashMap) is needed.
3.2 Data structures
A CrawlManager manages two data structures, UrlToDo and UrlDone. The former can be abstracted as a list, a stack, or a priority queue; the latter behaves as a Set and handles de-duplication. Once the ADT (abstract data type) is defined, we can decide how to implement it. This design approach is the same as implementing an in-memory data structure, except that instead of operating on arrays and lists in memory, we operate on the features the various stores give us, such as Redis Lists and Sets, or tables in a relational database.

4. Postscript
From the initial pseudocode, this crawler framework looks simple, but once a distributed environment and system extensibility are involved, actually building it requires thinking about extra issues such as concurrent reads and writes to shared data structures and high availability. What really pleases me about this project is that, thanks to a sensible behavioral abstraction of the data structures, the crawler is highly extensible. For example, by default UrlToDo is a FIFO queue, so the crawler effectively crawls in BFS order; if UrlToDo is instead configured as a LIFO stack, the crawler crawls in DFS order, and this change only requires a different parameter when requesting a new WebCrawlerManager -- the crawler's business code does not need any modification.
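
A sketch of that UrlToDo abstraction, where the same underlying deque acts as a FIFO queue (BFS order) or a LIFO stack (DFS order) depending on a single flag; the class and flag names are hypothetical:

```python
from collections import deque

class UrlToDo:
    def __init__(self, mode="fifo"):
        self.mode = mode          # "fifo" -> BFS crawl order, "lifo" -> DFS crawl order
        self.urls = deque()

    def push(self, url):
        self.urls.append(url)

    def pop(self):
        return self.urls.popleft() if self.mode == "fifo" else self.urls.pop()

bfs_frontier = UrlToDo("fifo")    # crawl breadth-first
dfs_frontier = UrlToDo("lifo")    # crawl depth-first
```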
TODO:
http://nlp.stanford.edu/IR-book/html/htmledition/web-crawling-and-indexes-1.html
Read full article from Algo Ramblings: Design a web crawler
