Tuesday, July 5, 2016

Social Network



How would you design the data structures for a very large social network like Facebook or LinkedIn? Describe how you would design an algorithm to show the shortest path between two people (e.g., Me -> Bob -> Susan -> Jason -> You)

Solution

Construct a graph by treating each person as a node and letting an edge between two nodes indicate that the two users are friends.
BFS -> bidirectional breadth-first search
Suppose every person has k friends, and node S and node D have a friend C in common.
  • Traditional BFS from S to D: goes through roughly k + k*k nodes (S's k friends, then each of their k friends)
  • Bidirectional BFS: goes through about 2k nodes (S's k friends and D's k friends, which meet at C)
Generalizing this to a path of length q:
  • BFS: O(k^q)
  • Bidirectional BFS: O(k^(q/2) + k^(q/2))
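In other words (restating the list above, with k friends per person and path length q): plain BFS visits O(k^q) nodes, while bidirectional BFS visits O(k^(q/2) + k^(q/2)) = O(k^(q/2)). Halving the exponent is roughly a square-root reduction in the number of nodes visited, which is why the technique pays off so dramatically.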

Handle the Millions of Users

Store the data across multiple machines, and use a hash table or some other mechanism to index users so they can be looked up across machines.
Optimizations:
  • Reduce machine jumps
  • Smart division of people and machines (related data should be put together)
  • Instead of marking nodes VISITED, use a hash table to record the nodes we have already visited
https://github.com/filipegoncalves/interview-questions/blob/master/systems_design/SocialNetwork.md
Step 1: Scope the Problem
What features should we worry about? We will focus our efforts on designing the core features of a generic social network: a user has a profile where he can share some information, upload pictures, and be friends with other people. Each user has a timeline that he uses to post things and show the world whatever he wants.
So, the typical use cases are:
  • Add a new post to the timeline
  • Upload one or more pictures
  • Update / change profile information
  • Become friends with another user
  • Visit another user's timeline and photos

Step 2: Make Reasonable Assumptions

First, let's estimate the amount of storage needed to run the social network. People nowadays upload all sorts of stupid and weird pictures all the time, so we will assume that users upload a lot of photos every day. Photos in a social network are not necessarily high quality, so let's say each photo is at most 500 kB. This errs on the side of exaggeration; services like Facebook will usually resize photos down to a few hundred kilobytes.
Assuming a timespan of 5 years during which 30% of users upload 10 photos / day, 50% upload 2, and 20% upload 0.1 (1 photo every 10 days), this all adds up to 5,500 petabytes of storage. With a 70% capacity model, this means we are looking at a system capable of storing 8,000 petabytes of photos. This is quite a lot, but we also assumed a relatively active and large user base to begin with. Building a system capable of hosting 8,000 petabytes of photos implies an extraordinarily high investment cost, so it would be wise to start small: assume a user base of a few million users and go from there. With a smaller user base, we could use analytics tools to extrapolate the typical user's behavior and then plan for what would happen in the long run.
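As a sanity check on these numbers (the write-up does not state the assumed user base, but roughly 1.5 billion users makes the arithmetic work out): the average user uploads 0.3 x 10 + 0.5 x 2 + 0.2 x 0.1 ≈ 4 photos per day, which at 500 kB per photo is about 2 MB per user per day; 1.5 x 10^9 users x 2 MB x 365 x 5 days ≈ 5,500 petabytes, and 5,500 / 0.7 ≈ 8,000 petabytes at 70% capacity.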
What about requests / second? Somewhere, there is always someone checking on Facebook, so the service is expected to have a very high number of requests / second. If we assume that each user checks on Facebook once every two hours, we will be looking at something like 200K requests / second.
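Checking that figure under the same assumed user base of roughly 1.5 billion: one visit per user every two hours is 1.5 x 10^9 / 7,200 s ≈ 208,000, i.e. roughly 200K requests / second.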
In a social network like Facebook, it is also important to note that typically there is a high read-to-write ratio; people are often curious to see what their friends have written recently, their new photos, and all of that. This means that we can probably optimize access times a lot with caching.

Step 3: Design & Scale

We could have a huge database storing everything from user profile information to photos and timeline posts. Instead, perhaps a better alternative is to keep things logically separated: one component responsible for storing user profile information, another responsible for photos, and another for timeline posts. This has the advantage of making it easier to scale each component independently. For example, if we stop getting new users but existing users start uploading an unusual amount of photos, then we can easily focus our efforts on adding more horsepower to the photo storage component, while leaving profile information and timeline posts untouched. However, splitting the information makes it harder and more expensive to perform lookup operations such as finding all the photos of a user, or finding all of a user's timeline posts. We can try to mitigate this by duplicating some information across the storage system, but that would complicate the design, not to mention that it kind of misses the point: after all, if we are going to do that, then maybe we would be better off not splitting in the first place.
As mentioned before, caching can be of great help here. We can cache queries that are expensive to answer (like listing the photos of a user, or finding his posts in the system). This has the added benefit that very popular users who get lots of visits are very likely to be cached. In fact, Facebook is one of the greatest contributors to and users of memcached, and its caching infrastructure is known to keep several thousand terabytes in a distributed cache to speed up requests.
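To make the caching idea concrete, here is a minimal in-process read-through cache sketch; the class name, the cached query, and the loadPhotosFromDB() helper are assumptions for illustration (a real deployment would sit behind something like memcached rather than a local map):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PhotoListCache
{
    private final Map<Integer, List<String>> cache =
            new ConcurrentHashMap<Integer, List<String>>();

    public List<String> getPhotos(int userID)
    {
        // Serve repeated requests from the cache; popular users stay
        // cached simply because they are requested often.
        return cache.computeIfAbsent(userID, this::loadPhotosFromDB);
    }

    private List<String> loadPhotosFromDB(int userID)
    {
        // Placeholder for the expensive query against photo storage.
        return new ArrayList<String>();
    }
}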
As a social network, we will essentially be delivering content (photos) to people all over the world. The core architecture consists of a set of clusters of core servers holding the data and running behind hardware load balancers. Data can be sharded in several different ways: for example, we could have a set of servers responsible for a given region of the globe. This has several advantages. First, people are often friends with other people in the same region, so requests to view another user's profile or photos would be mostly local to the cluster, which is good. Second, each cluster is relatively independent, so there is not a whole lot of state and synchronization to manage between clusters. Third, sharding data by region allows us to place the cluster responsible for a region close to that region, thus minimizing latency. Of course, it's not a perfect solution: it doesn't seem entirely fair, since some regions are probably going to have many more users than others, which could lead to an unbalanced distribution among the clusters. On the other hand, since clusters are relatively independent, we can easily add capacity to a cluster serving a region with heavy traffic, so this doesn't seem to be a big problem.
Social networks like Facebook also make use of another concept: points of presence. PoPs in Facebook are used in regions where it is not feasible / necessary to run a full cluster of servers housing data. Instead, PoPs maintain a set of ready-to-use, opened TCP connections to the main cluster service. These PoPs serve nearby user requests and forward them to the main cluster service using the ready-to-use, established connections. Thus the user doesn't incur the overhead of establishing a connection with a server that is far away: since that connection is already established in the PoP, the user can just establish a connection with the PoP. This technique enabled Facebook to scale inexpensively while keeping the count of core clusters more or less constant. Whereas a transatlantic TCP connection handshake can take as long as 500 ms (and this is just the time to establish the connection before actually transmitting any data), establishing a TCP connection to a nearby PoP will usually take less than 100 ms. The performance boost is considerable, especially taking into account that PoPs are always ready to communicate with the cluster and as such can avoid the 500 ms delay of establishing a connection to a far away endpoint.
How can we design the algorithm to show the shortest path between two people?
Friend relationships can be seen as a graph, so finding the shortest path between two people boils down to running a BFS over the social network graph. Note that DFS is not really a great choice here: we could end up exploring lots and lots of users before ever branching down the right edge. But a traditional BFS might still be too expensive: for a path of length q, if each person has O(k) friends, BFS will go through O(k^q) nodes before finding a match.
We can narrow it down by using bidirectional BFS: start from the source and from the destination at the same time, and perform BFS in both directions until the searches collide. On average, this reduces the runtime to O(k^(q/2) + k^(q/2)). It might seem like it's not a big deal, but it's huge: if each person has 100 friends and our endpoints are 4 hops apart, then traditional BFS would look at O(100^4) nodes before finding a match, while bidirectional BFS would look at O(100^2 + 100^2). That's 100 million vs. 20,000. It makes a hell of a difference.

Step 1: Simplify the Problem (Forget About the Millions of Users)

If you imagine a path like A -> B -> C -> D -> E where each person has 100 friends, this is a big difference.
BFS will require looking at 100 million (100^4) nodes, while a bidirectional BFS will require looking at only about 20,000 nodes (2 x 100^2). A bidirectional BFS will generally be faster than the traditional BFS.


Step 2: Handle the Millions of Users
With millions of users, the data no longer fits on one machine, so a person's friend list cannot hold references to actual Person objects. Instead, we can replace the list of friends with a list of their IDs, and traverse as follows:
1. For each friend ID: int machine_index = getMachineIDForUser(personID);
2. Go to machine #machine_index
3. On that machine, do: Person friend = getPersonWithID(personID);

Optimization: Reduce machine jumps
Jumping from one machine to another is expensive. Instead of randomly jumping from machine to machine with each friend, try to batch these jumps: for example, if five of my friends live on one machine, look them all up at once. A sketch follows.
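A minimal sketch of this batching, assuming a hypothetical Machine type and getMachine() helper for talking to remote machines (only getMachineIDForUser() comes from the steps above; the rest is illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

interface Machine
{
    List<Person> getPeopleWithIDs(List<Integer> ids); // assumed remote call
}

List<Person> lookupFriendsBatched(List<Integer> friendIDs)
{
    // Group friend IDs by the machine that stores them.
    Map<Integer, List<Integer>> idsByMachine =
            new HashMap<Integer, List<Integer>>();
    for (int friendID : friendIDs)
    {
        int machineIndex = getMachineIDForUser(friendID);
        idsByMachine
            .computeIfAbsent(machineIndex, k -> new ArrayList<Integer>())
            .add(friendID);
    }

    // One round trip per machine instead of one per friend.
    List<Person> friends = new ArrayList<Person>();
    for (Map.Entry<Integer, List<Integer>> entry : idsByMachine.entrySet())
    {
        Machine machine = getMachine(entry.getKey()); // assumed helper
        friends.addAll(machine.getPeopleWithIDs(entry.getValue()));
    }
    return friends;
}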

Optimization: Smart division of people and machines
People are much more likely to be friends with people who live in the same country as they do. Rather than randomly dividing people across machines, try to divide them by country, city, state, and so on. This will reduce the number of jumps.
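One possible way to implement this placement, as a hedged sketch (the method name, the 64-group constant, and hashing by country code are all assumptions for illustration):

static final int NUM_MACHINE_GROUPS = 64;

/* Map everyone from the same country to the same machine group, so
   mostly-local friend lists need fewer cross-machine jumps. */
static int machineGroupForCountry(String countryCode)
{
    return Math.floorMod(countryCode.hashCode(), NUM_MACHINE_GROUPS);
}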
Question: Breadth-first search usually requires "marking" a node as visited. How do you do that in this case?
Usually, in BFS, we mark a node as visited by setting a visited flag in its node class. Here, we don't want to do that: there could be multiple searches going on at the same time, so it's a bad idea to just edit our data.
Instead, we can mimic the marking of nodes with a hash table that maps a node ID to whether it has been visited (this is exactly what the visited hash map in the BFSData class below does).
Other Follow-Up Questions:
In the real world, servers fail. How does this affect you?
How could you take advantage of caching?
Do you search until the end of the graph (infinite)? How do you decide when to give up?
In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where to start traversing?

https://www.geeksforgeeks.org/design-data-structures-for-a-very-large-social-network-like-facebook-or-linkedln/
Design data structures for a very large social network like Facebook or LinkedIn
Case 1: Simplify the Problem (Not considering millions of people)
import java.util.*;

/* Bidirectional BFS between source and destination. Person is assumed
   to expose getID() and getFriends() (a list of friend IDs); a minimal
   stub is given after the code. */
LinkedList<Person> findPathBiBFS(HashMap<Integer, Person> people,
                                 int source, int destination)
{
    BFSData sourceData = new BFSData(people.get(source));
    BFSData destData = new BFSData(people.get(destination));

    while (!sourceData.isFinished() && !destData.isFinished())
    {
        /* Search out from source. */
        Person collision = searchLevel(people, sourceData, destData);
        if (collision != null)
            return mergePaths(sourceData, destData, collision.getID());

        /* Search out from destination. */
        collision = searchLevel(people, destData, sourceData);
        if (collision != null)
            return mergePaths(sourceData, destData, collision.getID());
    }

    return null;
}
  
  
/* Search one level and return the collision, if any. */
Person searchLevel(HashMap<Integer, Person> people,
                   BFSData primary, BFSData secondary)
{
    /* We only want to search one level at a time. Count how many
       nodes are currently in the primary's level and only process
       that many nodes. We continue to add new nodes to the end. */
    int count = primary.toVisit.size();
    for (int i = 0; i < count; i++)
    {
        /* Pull out the first node. */
        PathNode pathNode = primary.toVisit.poll();
        int personID = pathNode.getPerson().getID();

        /* If the other search has already visited this person,
           the two searches have collided. */
        if (secondary.visited.containsKey(personID))
            return pathNode.getPerson();

        /* Add friends to the queue. */
        Person person = pathNode.getPerson();
        ArrayList<Integer> friends = person.getFriends();
        for (int friendID : friends)
        {
            if (!primary.visited.containsKey(friendID))
            {
                Person friend = people.get(friendID);
                PathNode next = new PathNode(friend, pathNode);
                primary.visited.put(friendID, next);
                primary.toVisit.add(next);
            }
        }
    }
    return null;
}
  
  
/* Merge the two half-paths where the searches met at the connection. */
LinkedList<Person> mergePaths(BFSData bfs1, BFSData bfs2,
                              int connection)
{
    // end1 -> source, end2 -> dest
    PathNode end1 = bfs1.visited.get(connection);
    PathNode end2 = bfs2.visited.get(connection);

    LinkedList<Person> pathOne = end1.collapse(false);
    LinkedList<Person> pathTwo = end2.collapse(true);

    pathTwo.removeFirst();   // remove connection (it appears in both halves)
    pathOne.addAll(pathTwo); // append the second half

    return pathOne;
}
  
class PathNode
{
    private Person person = null;
    private PathNode previousNode = null;

    public PathNode(Person p, PathNode previous)
    {
        person = p;
        previousNode = previous;
    }

    public Person getPerson()
    {
        return person;
    }

    /* Walk back through previousNode pointers to rebuild the path. */
    public LinkedList<Person> collapse(boolean startsWithRoot)
    {
        LinkedList<Person> path = new LinkedList<Person>();
        PathNode node = this;
        while (node != null)
        {
            if (startsWithRoot)
                path.addLast(node.person);
            else
                path.addFirst(node.person);
            node = node.previousNode;
        }

        return path;
    }
}
  
class BFSData
{
    public Queue<PathNode> toVisit = new LinkedList<PathNode>();
    public HashMap<Integer, PathNode> visited =
                                 new HashMap<Integer, PathNode>();

    public BFSData(Person root)
    {
        PathNode sourcePath = new PathNode(root, null);
        toVisit.add(sourcePath);
        visited.put(root.getID(), sourcePath);
    }

    public boolean isFinished()
    {
        return toVisit.isEmpty();
    }
}
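
The excerpt above never defines Person. Here is a minimal stub, an assumption that matches only the methods the code actually calls (getID() and getFriends()), together with a tiny usage example:

class Person
{
    private final int id;
    private final ArrayList<Integer> friends = new ArrayList<Integer>();

    public Person(int id) { this.id = id; }

    public int getID() { return id; }

    public ArrayList<Integer> getFriends() { return friends; }

    public void addFriend(int friendID) { friends.add(friendID); }
}

// Build a tiny graph 1 - 2 - 3 and find the path from 1 to 3.
HashMap<Integer, Person> people = new HashMap<Integer, Person>();
for (int id = 1; id <= 3; id++)
    people.put(id, new Person(id));
people.get(1).addFriend(2); people.get(2).addFriend(1);
people.get(2).addFriend(3); people.get(3).addFriend(2);

LinkedList<Person> path = findPathBiBFS(people, 1, 3); // 1 -> 2 -> 3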

