Tuesday, December 8, 2015

Implementing Search Engines



http://www.ardendertat.com/2011/05/30/how-to-implement-a-search-engine-part-1-create-index/
Inverted Index
The inverted index contains mappings from terms (words) to the documents that those terms appear in. Each vocabulary term is a key in the index whose value is its postings list. A term’s postings list is the list of documents that the term appears in. 
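For example, a tiny inverted index (with made-up document IDs, and without positional information yet) could be kept in a Python dictionary like this:

# toy inverted index: term -> list of IDs of documents containing the term
index = {
    'web':       [1, 2, 3],
    'retrieval': [1],
    'search':    [1, 2],
}
print index['search']    # -> [1, 2]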

Query Types
1) One Word Queries: Queries that consist of a single word, such as movie or hotel.
2) Free Text Queries: Queries that contain a sequence of words separated by spaces, such as godfather movie, or hotels in San Francisco.
3) Phrase Queries: These are more advanced queries that again consist of a sequence of words separated by spaces, but they are typed inside double quotes and we want the matching documents to contain the terms in the query exactly in the specified order, such as “godfather movie”.
Parsing the Collection
1) Concatenate the title and the text of the page.
2) Lowercase all words.
3) Get all tokens, where a token is a string of alphanumeric characters terminated by a non-alphanumeric character. The alphanumeric characters are defined to be [a-z0-9]. So, the tokens for the word ‘apple+orange’ would be ‘apple’ and ‘orange’.
4) Filter out all the tokens that are in the stop words list, such as ‘a’, ‘an’, ‘the’.
5) Stem each token using Porter Stemmer to finally obtain the stream of terms. Porter Stemmer removes common endings from words.
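Putting the five steps above together, a minimal sketch of the term extraction pipeline might look like the following (it assumes the same porterStemmer module that the indexing code below uses, and a tiny illustrative stopword set):

# minimal sketch of the parsing pipeline described above
import re
from porterStemmer import PorterStemmer

porter = PorterStemmer()
stopwords = set(['a', 'an', 'the'])           # tiny illustrative stop list

def get_terms(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9 ]', ' ', text)   # non-alphanumerics become spaces
    tokens = text.split()
    tokens = [t for t in tokens if t not in stopwords]
    return [porter.stem(t, 0, len(t)-1) for t in tokens]

print get_terms('The Apple+Orange')           # -> e.g. ['appl', 'orang']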
Building the Inverted Index
We will use a hashtable (Python’s dictionary) to store the inverted index in memory.
In addition to the document IDs, we would like to keep one more piece of information in the postings list: the positions of term occurrences within the document.
For example, if three documents contain the term ‘web’, its postings list might be [ [1, [0, 2]], [2, [2]], [3, [1]] ]: document 1 contains ‘web’ at positions 0 and 2, document 2 at position 2, and document 3 at position 1.
But notice that since we are building an index for just the current page, the postings list of a term won’t be a list of lists. It will simply be a list where the first element is the document ID, and the second element is the list of positions the term appears in that document. Then we merge the index of the current page with the main inverted index, which is the index for the whole corpus. The merging is simple. For every term in the current page, we append its postings list to the postings list of that term in the main index (which is a list of lists as described above).
Then we extract the second document:
search engine web ranking
and build the index for that document:
{ ‘search’: [2, [0]], ‘engine’: [2, [1]], ‘web’: [2, [2]], ‘ranking’: [2, [3]] }
And we merge the current page index with the main index, which simply means appending the current postings of each term to the postings list of the corresponding term in the main index. After merging, our main index becomes:
{ ‘web’: [ [1, [0, 2]], [2, [2]] ], ‘retrieval’: [ [1, [1]] ], ‘search’: [ [1, [3]], [2, [0]] ], ‘information’: [ [1, [4]] ], ‘engine’: [ [2, [1]] ], ‘ranking’: [ [2, [3]] ] }
The index is stored as text in the following format:
term|docID1:pos1,pos2;docID2:pos3,pos4,pos5;…
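For the two-document example index above, the file would contain lines such as:

web|1:0,2;2:2
search|1:3;2:0
engine|2:1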

Stemming is a very useful and important operation, but it increases the execution time by a factor of 5. The reason is that stemming is performed on every term, and the stemmer performs many checks on each one.

#!/usr/bin/env python

'''
Author: Arden Dertat
Contact: ardendertat@gmail.com
License: MIT License
'''

import sys
import re
from porterStemmer import PorterStemmer
from collections import defaultdict
from array import array
import gc

porter=PorterStemmer()

class CreateIndex:

    def __init__(self):
        self.index=defaultdict(list)    #the inverted index

    
    def getStopwords(self):
        '''get stopwords from the stopwords file'''
        f=open(self.stopwordsFile, 'r')
        stopwords=[line.rstrip() for line in f]
        self.sw=dict.fromkeys(stopwords)
        f.close()
        

    def getTerms(self, line):
        '''given a stream of text, get the terms from the text'''
        line=line.lower()
        line=re.sub(r'[^a-z0-9 ]',' ',line) #put spaces instead of non-alphanumeric characters
        line=line.split()
        line=[x for x in line if x not in self.sw]  #eliminate the stopwords
        line=[ porter.stem(word, 0, len(word)-1) for word in line]
        return line


    def parseCollection(self):
        ''' returns the id, title and text of the next page in the collection '''
        doc=[]
        for line in self.collFile:
            if line=='</page>\n':
                break
            doc.append(line)
        
        curPage=''.join(doc)
        pageid=re.search('<id>(.*?)</id>', curPage, re.DOTALL)
        pagetitle=re.search('<title>(.*?)</title>', curPage, re.DOTALL)
        pagetext=re.search('<text>(.*?)</text>', curPage, re.DOTALL)
        
        if pageid is None or pagetitle is None or pagetext is None:
            return {}

        d={}
        d['id']=pageid.group(1)
        d['title']=pagetitle.group(1)
        d['text']=pagetext.group(1)

        return d


    def writeIndexToFile(self):
        '''write the inverted index to the file'''
        f=open(self.indexFile, 'w')
        for term in self.index.iterkeys():
            postinglist=[]
            for p in self.index[term]:
                docID=p[0]
                positions=p[1]
                postinglist.append(':'.join([str(docID) ,','.join(map(str,positions))]))
            print >> f, ''.join((term,'|',';'.join(postinglist)))
            
        f.close()
        

    def getParams(self):
        '''get the parameters stopwords file, collection file, and the output index file'''
        param=sys.argv
        self.stopwordsFile=param[1]
        self.collectionFile=param[2]
        self.indexFile=param[3]
        

    def createIndex(self):
        '''main of the program, creates the index'''
        self.getParams()
        self.collFile=open(self.collectionFile,'r')
        self.getStopwords()
                
        #bug in python garbage collector!
        #appending to list becomes O(N) instead of O(1) as the size grows if gc is enabled.
        gc.disable()
        
        pagedict={}
        pagedict=self.parseCollection()
        #main loop creating the index
        while pagedict != {}:                    
            lines='\n'.join((pagedict['title'],pagedict['text']))
            pageid=int(pagedict['id'])
            terms=self.getTerms(lines)
            
            #build the index for the current page
            termdictPage={}
            for position, term in enumerate(terms):
                try:
                    termdictPage[term][1].append(position)
                except KeyError:
                    termdictPage[term]=[pageid, array('I',[position])]
            
            #merge the current page index with the main index
            for termpage, postingpage in termdictPage.iteritems():
                self.index[termpage].append(postingpage)
            
            pagedict=self.parseCollection()
        gc.enable()
        self.writeIndexToFile()

if __name__=="__main__":
    c=CreateIndex()
    c.createIndex()
Our query index program will first read the index file from disk and construct the index back in memory, in the same format as in create index. (Note that the readIndex code below expects a slightly extended index file: the first line stores the total number of documents, and each term line also carries the term frequencies and the inverse document frequency, i.e. term|postings|tf|idf, whereas the create index code shown above writes only term|postings.)

Free Text Queries
The input in FTQ is a sequence of words, and the output is the list of documents that contain any of the query terms. So, we will get the list of documents for each query term and take the union of them.
Phrase Queries
First we need the documents that all query terms appear in. We will again get the list of documents for each query term as we did in FTQ, but now we will take the intersection instead of the union, because we want the documents that all query terms appear in, rather than the documents that any query term appears in. Then, we should check whether the query terms appear in the correct order.

Get the positions of the query terms in the current document, and put each into a separate list. So, if there are n query terms, there will be n lists, where each list contains the positions of the corresponding query term in the current document. Leave the position list of the 1st query term as it is, subtract 1 from each element of the 2nd position list, subtract 2 from each element of the 3rd position list, …, and subtract n-1 from each element of the nth position list. Then intersect all the position lists. If the result of the intersection is non-empty, then all query terms appear in the current document in the correct order, meaning this is a matching document. Perform these operations on all documents that every query term appears in; the matching documents are the ones with a non-empty intersection.

Why does this algorithm work? Because, for each query term, we check whether it appears right after the previous query terms in the document, and we do this in an elegant way. Additionally, there is an optimization while performing position list intersections. We can start intersecting from the smaller lists, because the size of the final intersection is always less than or equal to the size of the smallest list. Therefore, we can short-circuit whenever an intermediate intersection result becomes an empty list, obtaining an efficient algorithm.

An example will make everything clear. Let’s say we search for “computer science department”. We first get the document list of all query terms, as we did in FTQ: computer: [1, 2, 3], science: [1, 2, 3], and department: [1, 2]. Then we intersect all these lists to get the documents that contain all query terms, which is [1, 2]. Next, we should check whether the order is correct or not. First, we get the postings of the query terms for document 1, which are computer: [1, [2, 5]], science: [1, [3]], and department: [1, [4, 6]]. Then, we extract the positions of each query term and put them in separate lists, resulting in [ [2, 5], [3], [4, 6] ]. Each list corresponds to the positional information of a query term. We don’t touch the first list, but subtract i-1 from the elements of the ith list, resulting in [ [2, 5], [2], [2, 4] ]. Finally, we take the intersection of the lists, which is [2]. Since the intersection is not empty, we conclude that document 1 is a matching document. Next, we check document 2. Get the positions of the query terms and put them into separate lists as before: [ [1, 7], [2, 5], [0, 6] ]. Perform the subtractions: [ [1, 7], [1, 4], [-2, 4] ]. And take the intersection: []. The result of the intersection is empty, meaning the query terms don’t appear in the correct order, so this is not a matching document. There is no other document that contains all query terms. So, the result of the phrase query is document 1 only: [1].
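The position arithmetic for document 1 in this example can be reproduced in a few lines of Python (the position lists are copied from the example above):

# position lists of 'computer', 'science', 'department' in document 1
positions = [[2, 5], [3], [4, 6]]
# subtract i from the elements of the ith list (0-based)
shifted = [[p - i for p in plist] for i, plist in enumerate(positions)]
print shifted                                    # [[2, 5], [2], [2, 4]]
# intersect all the shifted position lists
common = reduce(lambda a, b: set(a) & set(b), shifted)
print list(common)                               # [2] -> document 1 matches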
http://www.ardendertat.com/2011/07/17/how-to-implement-a-search-engine-part-3-ranking-tf-idf/
Tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and inverse document frequency (idf).  The terms with higher weight scores are considered to be more important.

In the bag of words model, the document is represented as an unordered collection of words.
We can use a vector to represent the document in the bag of words model, since the ordering of terms is not important. There is an entry for each unique term in the document, with the value being its term frequency.
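Using the first example document from the indexing section (web retrieval web search information), a raw term-frequency vector can be built with a Counter:

from collections import Counter

terms = ['web', 'retrieval', 'web', 'search', 'information']
tf = Counter(terms)      # bag of words: order is ignored, only counts remain
print tf                 # Counter({'web': 2, ...})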

We can indeed represent every document in the corpus as a k-dimensional vector, where k is the number of unique terms in that document. Each dimension corresponds to a separate term in the document. Now every document lies in a common vector space. The dimensionality of the vector space is the total number of unique terms in the corpus. We will further analyze this model in the following sections. The representation of documents as vectors in a common vector space is known as the vector space model.

To remedy the effect of document length (longer documents naturally accumulate higher raw term counts), we length normalize term frequencies. So, the term frequency of a term t in document D now becomes:
tf_{t,d} = \dfrac{N_{t,d}}{||D||}
N_{t,d} is the number of occurrences of term t in document D, and ||D|| is known as the Euclidean norm, calculated by taking the square of each value in the document vector, summing them up, and taking the square root of the sum. After normalizing the document vector, the entries are the final term frequencies of the corresponding terms. The document vector is then also a unit vector, having a length of 1 in the vector space.
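Continuing with the raw counts from the example document above, a minimal sketch of this normalization:

from math import sqrt

raw_tf = {'web': 2, 'retrieval': 1, 'search': 1, 'information': 1}   # raw counts
norm = sqrt(sum(v*v for v in raw_tf.values()))                       # ||D|| = sqrt(7) ~ 2.65
tf = dict((t, v/norm) for t, v in raw_tf.items())                    # length-normalized tf
print tf['web']    # ~0.756; the tf vector now has length 1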
Inverse Document Frequency – idf
The document frequency of a term t is the number of documents containing the term:
df_t = N_t
Note that the occurrence counts of the term in the individual documents are not important. We are only interested in whether the term is present in a document or not, without taking the counts into consideration; it’s like binary 0/1 counting. If we were to consider the number of occurrences in the documents, the resulting statistic would be called collection frequency. But document frequency proves to be more accurate. Also note that term frequency is a document-wise statistic, while document frequency is collection-wise.

Term frequency is the occurrence count of a term in one particular document only, while document frequency is the number of different documents the term appears in, so it depends on the whole corpus.
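A toy illustration of the difference between document frequency and collection frequency (the three small documents below are made up):

docs = [['web', 'web', 'search'],        # document 1
        ['search', 'engine'],            # document 2
        ['ranking']]                     # document 3

term = 'web'
df = sum(1 for d in docs if term in d)   # document frequency: 1
cf = sum(d.count(term) for d in docs)    # collection frequency: 2
print df, cf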

This is a very useful statistic, but it also requires a slight modification. Consider a corpus with 1000 documents. One term appears in 10 documents and another appears in 100, so their document frequencies are 10 and 100 respectively. By definition, the inverse document frequencies are 100 (1000/10) and 10 (1000/100). As we can see, the term that appears in 10 times more documents is considered to be 10 times less important. It’s expected that the more frequent term is considered less important, but a factor of 10 seems too harsh. Therefore, we take the logarithm of the inverse document frequencies. With log base 2, the idf values become log2(100) ≈ 6.6 and log2(10) ≈ 3.3, so the rarer term is now only about twice as important rather than 10 times. So, the idf of a term t becomes:
idf_t = \log\dfrac{N}{df_t}
where N is the total number of documents in the corpus. The tf-idf weight of a term in a document is then the product of its term frequency and its inverse document frequency:
tf\mbox{-}idf_{t,d} = tf_{t,d} \cdot idf_t
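Using the numbers from the example above (N = 1000 documents, document frequencies of 10 and 100, log base 2), the computation looks like this (the tf value at the end is an arbitrary placeholder just to show the final weight):

from math import log

N = 1000.0
for df in (10, 100):
    print df, log(N/df, 2)       # idf: ~6.64 for df=10, ~3.32 for df=100

tf = 0.5                         # illustrative normalized term frequency
print tf * log(N/10, 2)          # tf-idf weight: ~3.32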
import sys
import re
import copy
from collections import defaultdict
from porterStemmer import PorterStemmer

porter=PorterStemmer()

class QueryIndex:

    def __init__(self):
        self.index={}
        self.titleIndex={}
        self.tf={}      #term frequencies
        self.idf={}    #inverse document frequencies


    def intersectLists(self,lists):
        if len(lists)==0:
            return []
        #start intersecting from the smaller list
        lists.sort(key=len)
        return list(reduce(lambda x,y: set(x)&set(y),lists))
     
 
    def getStopwords(self):
        f=open(self.stopwordsFile, 'r')
        stopwords=[line.rstrip() for line in f]
        self.sw=dict.fromkeys(stopwords)
        f.close()
     

    def getTerms(self, line):
        line=line.lower()
        line=re.sub(r'[^a-z0-9 ]',' ',line) #put spaces instead of non-alphanumeric characters
        line=line.split()
        line=[x for x in line if x not in self.sw]
        line=[ porter.stem(word, 0, len(word)-1) for word in line]
        return line
     
 
    def getPostings(self, terms):
        #all terms in the list are guaranteed to be in the index
        return [ self.index[term] for term in terms ]
 
 
    def getDocsFromPostings(self, postings):
        #no empty list in postings
        return [ [x[0] for x in p] for p in postings ]


    def readIndex(self):
        #read main index
        f=open(self.indexFile, 'r')
        #first read the number of documents
        self.numDocuments=int(f.readline().rstrip())
        for line in f:
            line=line.rstrip()
            term, postings, tf, idf = line.split('|')    #term='termID', postings='docID1:pos1,pos2;docID2:pos1,pos2'
            postings=postings.split(';')        #postings=['docId1:pos1,pos2','docID2:pos1,pos2']
            postings=[x.split(':') for x in postings] #postings=[['docId1', 'pos1,pos2'], ['docID2', 'pos1,pos2']]
            postings=[ [int(x[0]), map(int, x[1].split(','))] for x in postings ]   #final postings list
            self.index[term]=postings
            #read term frequencies
            tf=tf.split(',')
            self.tf[term]=map(float, tf)
            #read inverse document frequency
            self.idf[term]=float(idf)
        f.close()
     
        #read title index
        f=open(self.titleIndexFile, 'r')
        for line in f:
            pageid, title = line.rstrip().split(' ', 1)
            self.titleIndex[int(pageid)]=title
        f.close()
     
   
    def dotProduct(self, vec1, vec2):
        if len(vec1)!=len(vec2):
            return 0
        return sum([ x*y for x,y in zip(vec1,vec2) ])
         
     
    def rankDocuments(self, terms, docs):
        #term at a time evaluation
        docVectors=defaultdict(lambda: [0]*len(terms))
        queryVector=[0]*len(terms)
        for termIndex, term in enumerate(terms):
            if term not in self.index:
                continue
         
            queryVector[termIndex]=self.idf[term]
         
            for docIndex, (doc, postings) in enumerate(self.index[term]):
                if doc in docs:
                    docVectors[doc][termIndex]=self.tf[term][docIndex]
                 
        #calculate the score of each doc
        docScores=[ [self.dotProduct(curDocVec, queryVector), doc] for doc, curDocVec in docVectors.iteritems() ]
        docScores.sort(reverse=True)
        resultDocs=[x[1] for x in docScores][:10]
        #print document titles instead of document ids
        resultDocs=[ self.titleIndex[x] for x in resultDocs ]
        print '\n'.join(resultDocs), '\n'


    def queryType(self,q):
        if '"' in q:
            return 'PQ'
        elif len(q.split()) > 1:
            return 'FTQ'
        else:
            return 'OWQ'


    def owq(self,q):
        '''One Word Query'''
        originalQuery=q
        q=self.getTerms(q)
        if len(q)==0:
            print ''
            return
        elif len(q)>1:
            self.ftq(originalQuery)
            return
     
        #q contains only 1 term
        term=q[0]
        if term not in self.index:
            print ''
            return
        else:
            postings=self.index[term]
            docs=[x[0] for x in postings]
            self.rankDocuments(q, docs)
       

    def ftq(self,q):
        """Free Text Query"""
        q=self.getTerms(q)
        if len(q)==0:
            print ''
            return
     
        li=set()
        for term in q:
            try:
                postings=self.index[term]
                docs=[x[0] for x in postings]
                li=li|set(docs)
            except KeyError:
                #term not in index
                pass
     
        li=list(li)
        self.rankDocuments(q, li)


    def pq(self,q):
        '''Phrase Query'''
        originalQuery=q
        q=self.getTerms(q)
        if len(q)==0:
            print ''
            return
        elif len(q)==1:
            self.owq(originalQuery)
            return

        phraseDocs=self.pqDocs(q)
        self.rankDocuments(q, phraseDocs)
     
     
    def pqDocs(self, q):
        """ here q is not the query, it is the list of terms """
        phraseDocs=[]
        length=len(q)
        #first find matching docs
        for term in q:
            if term not in self.index:
                #if a term doesn't appear in the index
                #there can't be any document matching it
                return []
     
        postings=self.getPostings(q)    #all the terms in q are in the index
        docs=self.getDocsFromPostings(postings)
        #docs are the documents that contain every term in the query
        docs=self.intersectLists(docs)
        #postings are the postings list of the terms in the documents docs only
        for i in xrange(len(postings)):
            postings[i]=[x for x in postings[i] if x[0] in docs]
     
        #check whether the term ordering in the docs is like in the phrase query
     
        #subtract i from the ith terms location in the docs
        postings=copy.deepcopy(postings)    #this is important since we are going to modify the postings list
     
        for i in xrange(len(postings)):
            for j in xrange(len(postings[i])):
                postings[i][j][1]=[x-i for x in postings[i][j][1]]
     
        #intersect the locations
        result=[]
        for i in xrange(len(postings[0])):
            li=self.intersectLists( [x[i][1] for x in postings] )
            if li==[]:
                continue
            else:
                result.append(postings[0][i][0])    #append the docid to the result
     
        return result

     
    def getParams(self):
        param=sys.argv
        self.stopwordsFile=param[1]
        self.indexFile=param[2]
        self.titleIndexFile=param[3]


    def queryIndex(self):
        self.getParams()
        self.readIndex()
        self.getStopwords()

        while True:
            q=sys.stdin.readline()
            if q=='':
                break

            qt=self.queryType(q)
            if qt=='OWQ':
                self.owq(q)
            elif qt=='FTQ':
                self.ftq(q)
            elif qt=='PQ':
                self.pq(q)
     
     
if __name__=='__main__':
    q=QueryIndex()
    q.queryIndex()
