Tuesday, December 8, 2015

Writing a full-text search engine using Bloom filters



http://www.stavros.io/posts/bloom-filter-search-engine/

Client-side search engine caveats

The problem with client-side search engines is that you (obviously) have to do all the searching on the client, so you have to transmit all available information there.

Bloom filters to the rescue

Bloom filter is a very interesting data structure that can store elements in a fixed number of bits and tell you whether it’s seen those elements before when you query it. 
A simple way to implement a full-text search engine that uses Bloom filters is the following:
  • Create one filter per document and add all the words in that document in the filter.
  • Serialize the (fixed-size) filter in some sort of string and send it to the client.
  • When the client needs to search, iterate through all the filters, looking for ones that match all the terms, and return the document names.
# Read all my posts.
posts = {post_name: open(POST_DIR + post_name).read() for post_name in os.listdir(POST_DIR)}
# Create a dictionary of {"post name": "lowercase word set"}.
split_posts = {name: set(re.split("\W+", contents.lower())) for name, contents in posts.items()}
filters = {}
for name, words in split_posts.items():
    filters[name] = BloomFilter(capacity=len(words), error_rate=0.1)
    for word in words:
        filters[name].add(word)
def search(search_string):
    search_terms = re.split("\W+", search_string)
    return [name for name, filter in filters.items() if all(term in filter for term in search_terms)]
All search() does is iterate through all the filters and return the ones that match every given term. 

An alternate approach

Another approach would be to use a single filter and concatenate post IDs with the search words (e.g “3:term”) and search for that. Unfortunately, that results in a filter that is exactly the same size as the sum of the small ones, and has more computational complexity because you have to hash N times (where N is the number of posts) rather than just once.
  • Since Bloom filters are probabilistic, this method may produce false positives, but it will not produce false negatives, which is what we want for a search engine (we don’t mind a few irrelevant pages in the results, but we do mind if relevant ones are missing).
Naturally, it also has weaknesses, which include:
  • You can’t weight pages by relevance, since you don’t know how many times a word appears in a page, all you know is whether it appears or not. You may or may not care about this.
Of course, the full-text index will still be large and probably not practical to load on every page view, even when using this approach, but this method may be suitable for using in a dedicated “search” page in your static site.

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts