Massive Technical Interviews Tips: Search Misc

Tuesday, December 8, 2015

Search Misc

Google Search
https://medium.com/how-i-learned-ruby-rails/why-googling-is-the-most-important-skill-a-developer-must-have-d69b89b22218
Your answer is out there. Always remind yourself that if you can’t find your answer, you didn’t search for it correctly, didn’t search enough or didn’t search at all. Many people have been there before you, facing the same problem or asking themselves the same questions.

Linux
https://unix.stackexchange.com/questions/45383/how-can-i-do-the-history-command-and-not-have-line-numbers-so-i-can-copy-multi

https://stackoverflow.com/questions/7110119/bash-history-without-line-numbers/

history | awk '{ $1=""; print }'

history | awk '{ $1=""; print $0 }'

Both of these solutions do the same thing. The output of history is being fed to awk. Awk then blanks out the first column, which corresponds to the numbers in the history command's output. Here awk is more convenient because you don't have to concern yourself with the number of characters in the number part of the output.

print $0 is equivalent to print, since awk prints the whole line by default. Typing print $0 is more explicit, but which one you choose is up to you. The behavior of print $0 and plain print is easier to see if you use awk to print a file (cat would be quicker to type than awk, but this is only to illustrate the point).
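One side effect of the awk one-liners above is worth seeing concretely: assigning to $1 makes awk rebuild the line from its fields, so the output keeps a leading space and any run of blanks inside the command collapses to a single space. The Python sketch below only mimics that behavior for a single line; the function name is made up for illustration, and it assumes awk's default whitespace field splitting.

```python
def strip_history_number(line: str) -> str:
    """Mimic awk '{ $1=""; print }' for one line of `history` output."""
    fields = line.split()        # like awk's default splitting on runs of blanks
    if not fields:
        return ""                # an empty line stays empty
    fields[0] = ""               # blank out the history number ($1="")
    return " ".join(fields)      # awk rejoins fields with a single space


if __name__ == "__main__":
    # The leading space is what remains of the blanked first field.
    print(repr(strip_history_number("  501  git   status")))
```

If the collapsed spacing matters (for example inside quoted strings in a command), a fixed-width or regex-based strip of the number would preserve the rest of the line exactly.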

https://exde601e.blogspot.com/2012/12/search-operators-for-Blogger-labels.html

While playing around on the blog I discovered an easier way to find posts that carry two labels at once: modify the normal label URL in Blogger by adding a + sign followed by the name of the second label: <blog-URL>/search/label/LABEL1+LABEL2. It’s easier to remember and to type, but it’s also case sensitive. It actually works with more than two labels; I only tested it with three, so I’m not sure if there is a limit to the number of labels you can add in the URL. I don’t use any labels with multiple words so I couldn’t test this case, but it’s probably safe to assume you need to escape spaces by replacing them with %20, just like Blogger does with the regular label pages.

But what about the other case, when you want to find posts with any of those labels (the OR operator instead of AND)? Unfortunately the label URL doesn’t seem to support OR, but the search query form does: take the AND query <blog-URL>/search/?q=label:LABEL1+label:LABEL2 and replace the + with a vertical bar, like this: <blog-URL>/search/?q=label:LABEL1|label:LABEL2. It’s fun to see this in action, especially with labels that have very little to do with one another.
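The two URL shapes described above can be generated with a small helper. This is only a sketch: the function names and the example blog address are hypothetical, and percent-encoding multi-word labels is the same untested assumption the text makes.

```python
from urllib.parse import quote


def label_and_url(blog_url: str, *labels: str) -> str:
    """Posts carrying ALL labels: <blog-URL>/search/label/LABEL1+LABEL2."""
    joined = "+".join(quote(label) for label in labels)  # spaces become %20
    return f"{blog_url.rstrip('/')}/search/label/{joined}"


def label_or_url(blog_url: str, *labels: str) -> str:
    """Posts carrying ANY label: <blog-URL>/search/?q=label:A|label:B."""
    terms = "|".join(f"label:{quote(label)}" for label in labels)
    return f"{blog_url.rstrip('/')}/search/?q={terms}"


if __name__ == "__main__":
    blog = "https://example.blogspot.com"  # placeholder address
    print(label_and_url(blog, "Linux", "Search"))
    print(label_or_url(blog, "Linux", "Search"))
```

Labels are case sensitive, so the helper passes them through unchanged rather than normalizing them.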

https://www.quora.com/Which-data-structure-does-Google-use-in-its-search-engine-or-does-it-have-its-own-data-structure
Another key realization is that the queries (they're called queries) are written asynchronously. They don't all end up in the same system at once.

When you do a query on Google, a log entry is written to the local hard drive of the machine serving the query. No complicated data structure there, just a very efficient binary packing of the data, probably in protocol buffer format.

Every few minutes, a log saver job running on that machine next to the search job grabs the log and sends it upstream to another set of servers. The data is batched, so instead of handling a store request for every query, the upstream servers store thousands (or more) of queries at a time.
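The batching idea can be illustrated with a toy buffer that forwards entries upstream in groups. This is not Google's log saver: the class and parameter names are invented, and the sketch flushes when a size threshold is reached (or on an explicit flush call) rather than on the few-minute timer the answer describes.

```python
from typing import Callable, List


class BatchingLogSaver:
    """Buffer log entries locally and ship them upstream in batches."""

    def __init__(self, send_upstream: Callable[[List[bytes]], None],
                 batch_size: int = 1000) -> None:
        self._send_upstream = send_upstream  # hypothetical upstream store call
        self._batch_size = batch_size
        self._pending: List[bytes] = []

    def record(self, entry: bytes) -> None:
        """Append one entry; ship the buffer once it reaches batch_size."""
        self._pending.append(entry)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single upstream request."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send_upstream(batch)


if __name__ == "__main__":
    received: List[List[bytes]] = []
    saver = BatchingLogSaver(received.append, batch_size=3)
    for i in range(7):
        saver.record(f"query {i}".encode("utf-8"))
    saver.flush()  # a timer would normally trigger this
    print([len(batch) for batch in received])
```

The upstream side then sees one store request per batch instead of one per query, which is the point the paragraph above makes.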

This data is then stored distributed across many, many, many machines. The format (I think) is called ColumnIO, and it's just another very efficient binary packing of data but is also sorted. This data is accessed by a tool called Dremel, which Google has published a paper about. Dremel looks a little like SQL and runs through this data at high speeds using, again, several machines at a time.

Google uses (or maybe used) BigTable for storing data, which was built on Google File System, Chubby and SSTables. Some data structures/algorithms used by these technologies are:
1. The Log-Structured Merge-Tree (LSM-Tree) - the pattern that BigTable's memtable and SSTable storage follows
2. Paxos Algorithm - used by Chubby
https://www.quora.com/What-is-the-data-structure-for-search-engine
The term Search Engine is still vast and covers at least these components:
1. Crawler: To collect new webpages from the net
2. Index: To ensure super-quick retrieval of required webpages
3. Query Processor: To provide an easy interface for user to query

Each of these can be seen as an individual college-level project, though there are tools and freely available frameworks to build these with just a click.

Crawlers typically use queues to hold the webpages they have yet to visit, and usually a Bloom filter to mark the pages that have already been read. An alternative is a hash set. However, to avoid situations like spider traps, some kind of priority queue is used. There is a lot of research on how to modify the rules so that the pages your system prefers are crawled.
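A minimal sketch of that queue-plus-Bloom-filter arrangement follows. It assumes a caller-supplied get_links function in place of real HTTP fetching, uses a plain FIFO queue rather than the priority queue mentioned above, and the filter sizes are arbitrary. Note that a Bloom filter can return false positives, so a crawler using one may occasionally skip a page it has never seen; it never reports a visited page as new.

```python
import hashlib
from collections import deque
from typing import Callable, Iterable, Iterator, List


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive independent-looking positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def crawl_order(seed_url: str,
                get_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Visit pages breadth-first, skipping URLs the filter has seen."""
    frontier = deque([seed_url])   # pages still to visit
    seen = BloomFilter()           # pages already queued or read
    seen.add(seed_url)
    order: List[str] = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


if __name__ == "__main__":
    # Toy link graph standing in for the web; it contains a cycle.
    graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a"]}
    print(crawl_order("a", lambda url: graph.get(url, [])))
```

Swapping the deque for heapq with a per-URL score would give the priority-queue variant; swapping BloomFilter for a set gives the exact hash set alternative at a higher memory cost.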

The query processor can be as simple as a string splitter or a regex library, or as complex as a complete Natural Language Understanding system.

The index is the core. The most common style of indexing (and the data structure used) is called an inverted index. It is a hashmap-like data structure that maps each word to the documents or web pages that contain it.
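As a rough illustration of the word-to-documents mapping, here is a toy inverted index. It assumes documents are plain strings keyed by an integer ID and that tokenizing means lowercasing and splitting on whitespace; real engines also store positions, apply ranking, and compress the posting lists. The function names are made up for this sketch.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each word to the set of document IDs that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return IDs of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))   # copy so the index is untouched
    for word in words[1:]:
        result &= index.get(word, set())       # intersect posting sets
    return result


if __name__ == "__main__":
    docs = {
        1: "Bloom filter for the crawler",
        2: "Inverted index for the crawler",
    }
    idx = build_inverted_index(docs)
    print(search(idx, "crawler"))
    print(search(idx, "inverted crawler"))
```

A lookup costs one hash access per query word plus a set intersection, which is why this structure makes retrieval fast compared with scanning every document.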

A detailed explanation is given in my two-year-old answer:
Information Retrieval: What is inverted index?

And if you want to read more,
How can one build a search engine for some specific search?
