Monday, April 1, 2019

Facebook Unicorn Architecture - Graph Search



https://massivetechinterview.blogspot.com/2015/12/facebook-typeahead.html
https://research.fb.com/publications/unicorn-a-system-for-searching-the-social-graph/
https://davidamerland.com/seo-tips/1002-facebook-graph-search-takes-one-more-step-into-the-semantic-web.html
https://www.slideshare.net/nitishupreti/faceboko-tao-unicorn
• TAO : A read optimized graph data store to server Facebook’s “Social
Graph”.
• Unicorn : Online, In-Memory social graph-aware search and indexing

system.

In-Memory “social graph aware” indexing system serving billions of query a day.
• The idea is to promote social proximity.
• Serves as the backend infrastructure for graph search.
• Searching all basic structured information on the social graph and
perform complex set of operations on the results.


Core Technical Ideas
• Applying common information retrieval architectural concepts in the
domain of social graph search.
• How do you promote socially relevant search results ?
• Building rich operators ( apply & extract ) that allow rich semantic
graph queries that allow multiple round trip algorithms for serving
complicated queries.

There are billions of users in social graph. An average user is friend
has approximately 130 friends.

• Best way to implement social graph (sparse) : Adjacency Lists.
Hits = Results
Posting List = Adjacency Lists
Hit Data is extra meta data.
Sort key helps us find globally important ids.

• Client sends ‘Thrift’ requests to server. (Facebook’s own Protocol –
Buffer)
• Request is routed to closest Unicorn server.
• Several operators supported : Or, And, Difference.
• Meta-Data : ‘graduation year’ and ‘major’ for attended.
• In Distributed Systems: Never ever forget to
shard !
• All Posting lists are sharded by ‘result_id’.
• Index servers store adjacency lists and perform
set operations on those lists.
• Each index server is responsible for a particular shard.
• Rack Aggregator benefits from the fact that bandwidth to servers within a rack is higher.

• It all started with a Type Ahead Search.
• Users are shown a list of possible match for the query as they are typing.
• Index servers for Type Ahead contain posting lists for every name prefix up to a predefined character limit.
• These posting lists contain the ids of users whose first or last name matches the prefix.
• A simple Type ahead implementation would merely map input
prefixes to the posting lists for those prefixes and return the resultant ids.

• How do you make this socially relevant ?

• How do you ensure that search results are
socially relevant ?
• Can we “AND” the solution with the friend list
of user ?
( Ignores results for users who might be relevant but
are not friends with the user performing the search).
• We actually want a way to force some fraction
of the final results to possess a trait, while not
requiring this trait from all results.
• The answer is WeakAnd operator.
• The WeakAnd operator is a modification of
And that allows operands to be missing from
some fraction of the results within an index shard.
• Implementation : Allow only finite number of

hits to be non-friends.


Requires certain operands to be present
in some fraction of the matches.
• Enforces diversity in the set.
• Example : Fetching geographically
diversity in the result set.
• At least 20% from San Francisco.

• An optional weight parameter as well.

• We might want to prioritize results for individuals who are in close in age to
the user typing the query.
• This requires that we store the age (or birth date) of users with the index.
• For storing per-entity metadata, Unicorn provides a forward index, which is
simply a map of id to a blob that contains metadata for the id. The forward
index for an index shard only contains entries for the ids that reside on that
shard.
• Based on Thrift parameters included with the client’s request, the client
can select a scoring function in the index server that scores each result.

• Aggregators give priority to documents with higher score.

Imagine a scenario : We might want to know the pages liked by friends of
Bill who likes Trekking :
1. First execute the query (and friend:7 likers:42)
2. Collecting the results, and create a new query that produces the union of the
pages liked by any of these individuals.
• Inefficient due to multiple round trips involved between index servers and
top aggregator.
• The ability to use the results of previous executions as seeds for future
executions creates new applications for a search system, and was the
inspiration for Facebook’s Graph Search consumer product. The idea was to
build a general-purpose, online system for users to find entities in the

social graph that matched a set of user-defined constraints.

Apply Operator
A graph traversal operator that allows client to query a set of ids and then use the resultant ids to construct and execute a new query.
• Apply is a ‘syntactic sugar’ to allow a
system to perform expensive operations
lower in the hardware stack. However, by
allowing clients to show semantic intent,
optimizations are possible to preserve

search time.

Extract Operator
Say you want to look up people tagged in
photos of “Jon Jones”.
• Solution: ‘Apply’ operator to look up photos
of Jon Jones in the photos vertical and then
to query the users vertical for people
tagged in these photos.
• Now you need hundred of billions of new
terms in users vertical.
• Billions of “one to few” mapping.
• Better way: Store the ids of people tagged
in a photo in the forward index data for that
photo in the photos vertical. This is a case
where partially de-normalizing. We thus
store the result ids in the forward index of
the secondary vertical and do the lookup
inline.
• This is exactly what Extract operator accomplishes

• Privacy is crucial !
• Certain graph edges cannot be shown to all users but rather only to
users who are friends with or in the same network as a particular
person.
• Unicorn itself does not have privacy information incorporated into its
index : Strict consistency and durability guarantees are absent that
are needed for a full privacy solution.
• Facebook PHP frontend makes a proper privacy check on the result.
This design decision imposes a modest efficiency penalty on the
overall system.
• However it also keeps privacy logic separate from Unicorn with the

DRY (“Don’t Repeat Yourself”) principle of software development


To enable clients to make privacy decisions, a string of
metadata is attached to each search result to describe
its lineage.
Lineage is a structured representation of the edges

that were traversed in order to yield a result.

https://www.facebook.com/notes/facebook-engineering/under-the-hood-building-out-the-infrastructure-for-graph-search/10151347573598920
At first, the old search on Facebook (called PPS) was keyword based--the searcher entered keywords and the search engine produced a results page that was personalized and could be filtered to focus on specific kinds of entities such as people, pages, places, groups, etc.


In 2009, Facebook started work on a new search product (called Typeahead) that would deliver search results as the searcher typed, or “prefix matching."  This product required a complete reimplementation of the backend and frontend for prefix matching and high performance. We launched this overhaul in 2010.

Many algorithms went into the design of Typeahead, but in order to achieve its performance goals and deliver results in an acceptable amount of time, the index capacity remained limited. To maintain recall, Typeahead passed searchers to PPS when they asked to see more results.

In addition to PPS and Typeahead, there are other products that feature search, like Nearby, tagging within posts, and location tagging of posts and photos – some of which had their own backends. In order to make Graph Search work, and return high-quality results, we needed to create an index that would support all of these systems and allow for the richer queries of Graph Search.

A Crash-Course in Graph Structure
The Facebook graph is the collection of entities and their relationships on Facebook. The entities are the nodes and the relationships are the edges. One way to think of this is if the graph were represented by language, the nodes would be the nouns and the edges would be the verbs. Every user, page, place, photo, post, etc. are nodes in this graph. Edges between nodes represent friendships, check-ins, tags, relationships, ownership, attributes, etc. 


Both nodes and edges have metadata associated with them. Nodes in the graph are identified by a unique number called the fbid.


The Facebook graph contains social information, such as friendships and likes, in addition to information relevant for everybody--e.g., the relationship between.

Designing a System for Graph Search
PPS and Typeahead search Facebook entities based on their metadata--primarily using their name (title). The kinds of entities searched are users, pages, places, groups, applications, and events. The goal of Graph Search was to extend this capability to also search based on the relationship between entities--meaning we are also searching over the edges between the corresponding nodes. We chose to use natural language as the input for the queries, as natural language is able to precisely express the graph relationships being searched over. For example:

  • Restaurants liked by Facebook employees
  • People who went to Gunn High School and went to Stanford University
  • Restaurants in San Francisco liked by people who graduated from the Culinary Institute of America
Decision to use Unicorn
As we’ve mentioned in previous posts, we realized that Graph Search would require the building of a very large index. For example, we would have to index every single “check-in” (since queries can ask about this), whereas previously we could aggregate check-in information since it was only used as a ranking signal. So we needed a search infrastructure that would scale. We were also getting overwhelmed by supporting multiple search backends--so we saw this as an opportunity to move to a single search backend--in order to make the development and maintenance process more efficient.

There was an effort within Facebook to build an inverted-index system called Unicorn since 2009.  By late 2010, Unicorn had been used for many projects at Facebook as an in-memory “database” to lookup entities given combinations of attributes. In 2011, we decided to extend Unicorn to be a search engine and migrate all our existing search backends to Unicorn as the first step towards building Graph Search.

The main components of Unicorn are:
  • The index -- a many-to-many mapping from attributes (strings) to entities (fbids)
  • A framework to build the index from other persistent data and incremental updates
  • A framework to retrieve entities from the index based on various constraints on attributes
The main components of Unicorn are:
  • The index -- a many-to-many mapping from attributes (strings) to entities (fbids)
  • A framework to build the index from other persistent data and incremental updates
  • A framework to retrieve entities from the index based on various constraints on attributes

Suppose David’s friend has fbid 1234 and lives in New York and likes Downton Abbey. The index corresponding to my friend will include the mappings:

            friend:10003 → 1234
            lives-in:111 → 1234
            like:222 → 1234

Here, we assume David’s fbid is 10003, and the fbid’s of New York and Downtown Abbey are 111 and 222 respectively. In addition, friend:10003, lives-in:111, and like:222 may map to other users that share these attributes.

Unicorn makes it easy to find nodes that are connected to another node by searching for an edge-type combined with an input node.  E.g.:
  • David’s friends:  friend:10003
  • People who live in new york: lives-in:111
  • People who like downtown abbey: like:222

This is analogous to the way that keyword search systems work, except that here the keywords describe graph relationships instead of text contained within a document

Many queries like this require just one "hop" to get to the set of results, or to traverse one edge to find the nodes that form the search results. The output node is always one edge away from the node that seeded the query. But often there are interesting queries that require doing more than one consecutive hop. Unicorn is unique in that it supports these “multi-hop" queries for finding interesting results in the social graph, and thus makes it possible to connect with new nodes in the graph.

Suppose we wish to index the following five users: Mark Zuckerberg (fbid: 4), Randi Zuckerberg (fbid: 13755), Mark David Johnson (fbid: 1001), Randi Johnson (fbid: 5542), and David Johnson (fbid: 10003).  The mappings we use are:

mark → 4
zuck → 4
randi → 13755
zuck → 13755
mark → 1001
david → 1001
johnson → 1001
randi → 5542
johnson →5542
david → 10003
johnson → 10003


The following diagram shows this mapping in more detail:


The vertical lines in this diagram are called posting lists – a list of all fbids corresponding to a particular attribute.  So the posting list for “mark” contains 4 and 1001. The horizontal lines in this diagram describe entities – so the entity with fbid 4 is indexed by “mark” and “zuck” (fbid 4 has 2 attributes). Posting lists can have as few as one entry and as many as millions of entries, and entities may be indexed with as few as ten attributes and as many as thousands of attributes.

Unicorn supports a query language that allows retrieval of matching entities. Here are some examples of queries:
  • (term mark):
    Matches all entities in the posting list for “mark” – i.e., 4 and 1001.
  • (and david johnson):
    Matches all entities that are in both the posting list for “david” and the posting list for “johnson” – i.e., 1001 and 10003.
  • (or zuck randi):
    Matches all entities that are either in the posting list for “zuck” or in the posting list for “randi” – i.e., 4, 13755, and 5542.

To perform operations (such as “and”) efficiently, the order of entities in each posting list must be the same. For example, they can be ordered by fbid, time of creation of the entity, etc.  We prefer to order them by importance (from most important to least important). We use query-independent signals to come up with a numeric value for importance and then order entities by the value. This value is called the static rank of the entity.

Unicorn retrieval expressions may be matched by a large number of entities. Typically, we are not interested in all the entities, but only the most important ones. So along with the retrieval expression, we can also specify a limit on the number of entities to be retrieved. In combination with the static rank, this means we retrieve only the most important entities that match the expression.

Indexing (or the process of building the index) is performed as a combination of map-reduce scripts that collect data from Hive tables, process them, and convert them into the inverted index data structures. 

Facebook is a unique data set in that it is constantly changing. There are over 2.5 billion new pieces of content added every day, and more than 2.7 billion likes added every day. This means that in addition to indexing the relatively static data, we need to include the updates in our unicorn index. All changes being made constantly by Facebook users are streamed into the index via a separate live update pipeline to keep the index current to within seconds of these changes.

With the development of Unicorn and launch of Graph Search, we have a unified system supporting the index of the entire Graph. For more detail on how Unicorn works, check out the video of a recent talk by Soren Lassen and Mike Curtiss.
Extending Unicorn to be a Search Engine
Unicorn is an inverted index framework and includes capabilities to build indices and retrieve data from the index. Some examples of retrieval queries described in our earlier post are:

(term mark)
(and david johnson)
(or zuck randi)

A Unicorn instance (or vertical) may store an index that is too large to fit into the memory of a single machine. So we break up the index into multiple shards such that each shard can fit on a single machine. These machines are called index servers. Queries to Unicorn are sent to a Vertical Aggregator, which in turn broadcasts the queries to all index servers. Each index server retrieves entities from its shard and returns it to the Vertical Aggregator – which then combines all these entities and returns them to the caller.


Unicorn is fundamentally an in-memory “database” with a query language for retrieval. Before we could use Unicorn for Graph Search, we had to extend it with search ranking capabilities. Here are some of the things we did:

Keep different entity types in separate Unicorn verticals
Given that different entity types (users, pages, photos, etc.) have very different requirements for ranking, we decided to maintain each entity type in a separate Unicorn vertical and create a new entity--the top aggregator--that would coordinate activities across the verticals.

 

Extending the retrieval operations with weak “and” and strong “or”
There are many instances (especially for social and contextual retrieval) where it is useful to weaken the semantics of “and” to allow some misses, and to strengthen the semantics of “or” to require certain matches.

Query rewriting
The queries that are issued by our end users are (obviously) not in the format required for retrieval. So we need to rewrite the query to translate the user intent and their social context into a structured Unicorn retrieval query. This is where we try to understand the query intent, bias it socially, correct spelling, use language models for synonyms, segmentation, etc.

Scoring
By default, Unicorn returns results in their static rank order--which means the relevance of the result for the query is not considered. Scoring is the step that takes place immediately after retrieval when a numeric score is assigned to an entity using both the entity data and the query.

Forward index
Additional information about the entity (beyond what is in the inverted index) is typically required for scoring. We added the ability to store a blob of metadata for each entity in the index. The indexing pipeline facilitates the creation of forward index data along with the inverted index.

Result set scoring
Scoring works on a single entity at a time and the score assigned is independent of the scores assigned to other entities. This can cause a result set to become very one-dimensional and offer a poor search experience, (for example, “photos of Facebook employees” may return too many photos of Mark Zuckerberg). Result set scoring offers yet another layer of filtering that looks at a number of entities together and returns a subset of these entities that are most interesting as a set (and not necessarily the highest scoring set of results).

Blending
When results come in from different verticals to the top aggregator, they need to be combined into a single result set. To do this, the top aggregator needs to blend the results together in the right order. This requires normalizing the scores from different verticals so they can be compared against each other.

Nested queries
For many Graph Search queries, we have to perform multiple searches across many verticals and “join” the results. For example, “restaurants liked by Facebook employees” requires a search on the user vertical to obtain Facebook employees, and then a search with these users on the places vertical to obtain restaurants liked by them. Although Unicorn provides native support for nested queries through the “apply” operator, we had to extend that with a framework to inject different scorers (and result set scorers) for the different searches, and to allow for these scoring components to share data with each other.

Testing
We built a framework to run ranking experiments where the happiness metrics are compared between various experiments against their controls. The Unicorn stack, with the components described above, is shown in the diagram below. The Top Aggregator maintains a separate query rewriter and a separate result set scorer for each vertical.


The Life of a Graph Search Query
The interaction with Graph Search takes place in two distinct phases – the query suggestion phase, and the search phase. In the query suggestion phase the searcher enters text into the search box and obtains suggestions. The search phase starts from the selected suggestion and produces a results page.

The Query Suggestion Phase
When a searcher types a query into the search box, a Natural Language Processing (NLP) module attempts to parse it based on a grammar. It identifies parts of the query as potential entities and passes these parts down to Unicorn to search for them. The NLP module also provides hints as to what kind of entity that may be – for example, if the searcher types “people who live in Sri”, the NLP module sends the query “sri” to Unicorn suggesting a strong bias towards cities and places. This causes results like “Sri Lanka” and “Srinagar” to be returned instead of results like “Sriram Sankar.”

Generally the NLP module also sends the entire query to Unicorn and when these results rank high, Graph Search works like Typeahead. The following screenshots show these two cases of calling Unicorn with the query “sri."


Social Query Rewriting
The Unicorn queries shown so far only reflect the retrieval constraints imposed by the search query. In addition, the query has to be rewritten to bias it socially and contextually for the searcher. The weak “and” and the strong “or” retrieval operators are used to achieve this. Let me continue with the example from the earlier post to illustrate how different searchers looking for “mark” end up getting the most relevant Marks.
We rewrite the queries (to search for “mark”) to include the searcher and their friends as shown below:

Cristina searching for Mark:

    (and mark (or friend:11 friend:10 friend:12 friend:21))

While these queries do well in returning friends and friends of friends called “mark,” they do not return any other results – for example famous people called “mark”. This is solved using the weak “and” operator that permits its operands to be missing some of the time:

    (weak-and mark (or friend:11 friend:10 friend:12 friend:21)[0.7])

Here the second argument of the weak “and” only needs to be present 70% of the time. Now let us look at how we rewrite the Unicorn query that goes to the places vertical for “restaurants liked by Facebook employees”:

   (and place-kind:273819889375819 (or liked-by:fbid1 liked-by:fbid2 ... liked- by:fbidn))

To make this query social and contextual for me, we would rewrite it further as follows (assume my friends have fbids 11, 12, and 20, and assume that the Facebook Graph indicates that I am in the Palo Alto area):

   (weak-and
       (and place-kind:273819889375819
            (or liked-by:fbid1 liked-by:fbid2 ... liked-by:fbidn))
       (strong-or
            (or liked-by:11 liked-by:12 liked-by:20) [0.4]
            (or location:palo-alto) [0.7] ) [0.6]
    )

The above query requires 60% of the results to be either liked by me, or my friends, or be located in Palo Alto.

http://usefulstuff.io/2013/03/how-it-works-facebook-part-2/
Original Facebook search engine simply searches into cached users informations: friends list, like lists and so on. Typeahead search (the search box on the top of Facebook frontend) came on 2009.

First attempts are still on browser cache where are stored informations about user (friends, like, pages). If cache misses starts and AJAX request.
Many Leaf services search for results inside theirs indexes (stored into an inverted index called Unicorn). When results references are fetched from the indexes, they are merged and loaded from global datastore. An aggregator provide a single channel to send data to client. Obviously query are cached.
On 2012 Facebook starts from the core part of typeahead search to build a new search tool. Unicorn is the core part of the new Graph Search. Formally is an in-memory inverted index which maps Facebook contents as a graph database would do and you can query it as graph traversing tool. To be used for Graph SearchUnicorn was updated to be more than a traversing tool. Now supports nested queries, scoring and support for different kind of resources. Results are aggregated on different levels.

Query lifecycle is usually made by 2 steps: the Suggestion Phase and the Search Phase.
The Suggestion Phase works like “autocomplete” and Is powered by a Natural Language Processing (NLP) module attempts to parse text based on a grammar. It identifies parts of the query as potential entities and passes these parts down to Unicorn to search for them.
The Search Phase begins when the searcher has made a selection from the suggestions. The parse tree, along with the fbids of the matched entities, is sent back to the Top Aggregator. A user readable version of this query is displayed as part of the URL.
Currently Graph Search is still in beta.
Insights and Sources
Resources and insights
This post and the previous one are based and inspired by the answer of Michaël Figuière to the following question on Quora: What is Facebook’s architecture?
Additional stuff:
https://www.quora.com/What-exactly-is-Facebook-Unicorn-Does-anyone-know-what-kind-of-system-it-is-Is-it-a-graph-database-How-does-it-work


What Aditya Tayade said is correct; Unicorn processes queries that select a section of the social graph.  Unicorn is essentially built like a standard search engine that maps terms to sets of documents; unlike a text search engine, the terms aren't words, but connections in the social graph (for example, "friend:4" maps to the ids of all users who are friends with the user with id 4; we have similar terms for "photos taken by user id 4", "users who like page with id 25" etc).

On top of this, there are other layers (parsing natural language queries, ranking, privacy filtering, rendering the results, etc) but Unicorn sits at the core.

So let's consider that you want to find your friends of friends. Unicorn allows you to write a simple s-expression query to get a list of all fb-ids which are your FoFs. Prior to unicorn this was done at a much lower level by asking the system explicitly but Unicorn serves as a nice interface for querying. With S-expressions you can traverse the graph to multiple depths in one shot. For example you can query for something like -- a list of all the fans of all the pages that your FoFs have liked, and so on. This makes developers life simple, so to speak.

I was not aware of Graph Search at that time and Unicorn was just a graph traversing tool for me. Obviously now, with Graph Search out, Unicorn fits into the big picture

Unicorn is a standard inverted index system optimized for the needs of Facebook.  It has been extended to be a search engine - with the ability to plug in scorers, query rewriting, result set scoring, A/B testing, etc.  All of Facebook search functionality (including Graph Search) is now backed by Unicorn.

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts