Tuesday, July 21, 2015

Big Data Interview Questions - Part 1



Big Data - Fuzzy Search Url (Bloom Filter)
给你A,B两个文件,各存放50亿条URL,每条URL占用64字节,内存限制是4G,让你找出A,B文件共同的URL。如果是三个乃至n个文件呢?

Bloom Filter就被广泛用于拼写检查,数据库系统中。。。可以实现数据字典,进行数据的判重,或者集合求交集

基本原理及要点

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

很明显这个过程并不保证查找的结果是100%正确的。同时也不支持删除一个已经插入的关键字,因为该关键字对应的位会牵动到其他的关键字。

所以一个简单的改进就是 counting Bloom filter,用一个counter数组代替位数组,就可以支持删除了。

Bloom Filter可以用来实现数据字典,进行数据的判重,或者集合求交集.

Solution

Of course we can always use 【分治+trie树/hash+小顶堆】 standard solution, but for Fuzzy search, BF is the best.

4G = 232 大概是40亿 x 8大概是340亿bit,n = 50亿,如果按出错率0.01算需要的大概是480亿个bit。现在可用的是340亿,相差并不多,这样可能会使出错率上升些。

Find Common Elements in 2 Lists
Traditional solution: 2 pointers, Hash
Big Data: Bloom filter

Find Similar Library Books
link

有很大一个电子图书馆,里面每本书的每一页都是OCR转换出来的text,所有每页大 约有5%的error(转换错误,错误分割单词,跳脱。。。)。设计一个方法判定图书馆 里是否有完全一样的书(duplicate),或者将来有书进来时判定同样的书是否已存在。
Solution
Very large string matching, we mush use hashing or similar technique. Since books have error, we need to do fuzzy search/matching. Bloom filter is designed to do this!



Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts