Client-side search engine caveats
The problem with client-side search engines is that you (obviously) have to do all the searching on the client, so you have to transmit all the searchable information to it up front.
Bloom filters to the rescue
A Bloom filter is a very interesting probabilistic data structure: it can represent a set of elements in a fixed number of bits, and when you query it, it tells you whether it has seen a given element before.
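To make this concrete, here is a minimal sketch of the data structure itself (an illustration only, not the library used below; the class name and parameters are made up):

import hashlib

class TinyBloom:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(("%d:%s" % (salt, item)).encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def __contains__(self, item):
        # False is definitive; True only means "probably seen before".
        return all(self.bits[position] for position in self._positions(item))

bf = TinyBloom()
bf.add("hello")
print("hello" in bf)    # True
print("goodbye" in bf)  # almost certainly False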
A simple way to implement a full-text search engine that uses Bloom filters is the following:
- Create one filter per document and add all the words in that document to the filter.
- Serialize the (fixed-size) filter as some sort of string and send it to the client.
- When the client needs to search, iterate through all the filters, looking for ones that match all the terms, and return the document names.
import os
import re

from pybloom import BloomFilter  # pip install pybloom (or the pybloom-live fork)

# Read all my posts. POST_DIR is the directory that holds the post files.
posts = {post_name: open(POST_DIR + post_name).read()
         for post_name in os.listdir(POST_DIR)}

# Create a dictionary of {"post name": "lowercase word set"}.
split_posts = {name: set(re.split(r"\W+", contents.lower()))
               for name, contents in posts.items()}
filters = {}
for name, words in split_posts.items():
    # One filter per post, sized to that post's vocabulary.
    filters[name] = BloomFilter(capacity=len(words), error_rate=0.1)
    for word in words:
        filters[name].add(word)
def search(search_string):
    # Lowercase the query so terms match the lowercased word sets above,
    # and drop the empty strings re.split() produces at the edges.
    search_terms = [t for t in re.split(r"\W+", search_string.lower()) if t]
    return [name for name, bf in filters.items()
            if all(term in bf for term in search_terms)]
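For example, a two-term query returns every post whose filter (probably) contains both terms; the output here is hypothetical:

print(search("bloom filter"))
# ['bloom-filter-search-engine.md']  (plus, possibly, a false positive or two)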
All search() does is iterate through all the filters and return the ones that match every given term.
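The serialization step from the list above (turning each fixed-size filter into a string for the client) isn't shown in the code. A minimal sketch, building on the filters dictionary and the BloomFilter import above, and assuming pybloom's tofile()/fromfile() helpers with base64-over-JSON as the transport format (the function names are made up):

import base64
import io
import json

def serialize_filters(filters):
    # Encode each filter's raw bytes as base64 so the index is JSON-safe.
    out = {}
    for name, bf in filters.items():
        buf = io.BytesIO()
        bf.tofile(buf)
        out[name] = base64.b64encode(buf.getvalue()).decode("ascii")
    return json.dumps(out)

def deserialize_filters(payload):
    # The mirror image of serialize_filters(), shown in Python for brevity;
    # a real client would do this part in JavaScript.
    return {name: BloomFilter.fromfile(io.BytesIO(base64.b64decode(data)))
            for name, data in json.loads(payload).items()}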
An alternate approach

Another approach would be to use a single filter and concatenate post IDs with the search words (e.g. “3:term”), then search for that. Unfortunately, that results in a filter exactly as large as the sum of the small ones, and it is computationally more expensive, because each term has to be hashed N times (where N is the number of posts) rather than just once.
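For comparison, that variant might look like the following sketch, built on the split_posts dictionary from earlier (big_filter, post_ids, and search_single are made-up names):

big_filter = BloomFilter(
    capacity=sum(len(words) for words in split_posts.values()),
    error_rate=0.1,
)
post_ids = {name: i for i, name in enumerate(split_posts)}
for name, words in split_posts.items():
    for word in words:
        big_filter.add("%s:%s" % (post_ids[name], word))

def search_single(search_string):
    terms = [t for t in re.split(r"\W+", search_string.lower()) if t]
    # Every term now has to be checked (and hashed) once per post.
    return [name for name, pid in post_ids.items()
            if all("%s:%s" % (pid, term) in big_filter for term in terms)]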
Whichever variant you choose, the method has a significant advantage:

- Since Bloom filters are probabilistic, this method may produce false positives, but it will not produce false negatives, which is what we want for a search engine (we don’t mind a few irrelevant pages in the results, but we do mind if relevant ones are missing).
Naturally, it also has weaknesses, which include:
- You can’t weight pages by relevance, since you don’t know how many times a word appears in a page; all you know is whether it appears or not. You may or may not care about this.
Of course, the full-text index will still be large and probably not practical to load on every page view, even with this approach, but this method may be suitable for a dedicated “search” page on your static site.