Massive Technical Interviews Tips: Design the data structure for large social network

Design data structures for a very large social network like Facebook or Linkedln - GeeksforGeeks
How would you design the data structures for a very large social network like Facebook or Linkedln? Describe how you would design an algorithm to show the shortest path between two people (e.g., Me-> Bob-> Susan-> Jason-> You).

A good way to approach this problem is to remove some of the constraints and solve it for that situation first.

Case 1: Simplify the Problem (Not considering millions of people)
We can construct a graph by treating each person as a node and letting an edge between two nodes indicate that the two users are friends. If we want to find the path between two people, we start with one person and do a simple breadth-first search. Alternatively, we can do bidirectional breadth first search. This means doing two breadth first searches, one from the source and one from the destination. When the searches collide, we know we’ve found a path.

Why not a depth-first search work well? First, the depth-first search would just find a path. It wouldn’t necessarily find the shortest path. Second, even if we just needed any path, it would be very inefficient. Two users might be only one degree of separation apart, but it could search millions of nodes in their”subtrees” before finding this relatively immediate connection.

In the implementation, we’ll use two classes to help us. BFSData holds the data needed for a breadth-first search, such as the isVisited hash table and the toVisit queue. PathNode represents the path as we’re searching, storing each Person and the previousNode we visited in this path.

Linkedlist<Person> findPathBiBFS(HashMap<Integer, Person> people,

                                    int source, int destination)

{

    BFSData sourceData = new BFSData(people.get(source));

    BFSData destData = new BFSData(people.get(destination));

    while (!sourceData.isFinished() && !destData.isFinished())

    {

        /* Search out from source. */

        Person collision = searchlevel(people, sourceData, destData);

        if (collision != null)

            return mergePaths(sourceData, destData, collision.getID());

        /* Search out from destination. */

        collision = searchlevel(people, destData, sourceData);

        if (collision != null)

            return mergePaths(sourceData, destData, collision.getID());

    }

    return null;

}

/* Search one level and return collision, if any.*/

Person searchLevel(HashMap<Integer, Person> people,

                BFSData primary, BFSData secondary)

{

    /* We only want to search one level at a time. Count

       how many nodes are currently

       in the primary's level and only do that many nodes.

       We continue to add nodes to the end. */

    int count = primary.toVisit.size();

    for (int i= 0; i < count; i++)

    {

        /* Pull out first node. */

        PathNode pathNode = primary.toVisit.poll();

        int personld = pathNode.getPerson().getID();

        /* Check if it's already been visited. */

        if (secondary.visited.containsKey(personid))

            return pathNode.getPerson();

        /* Add friends to queue. */

        Person person = pathNode. getPerson();

        Arraylist<Integer> friends = person.getFriends();

        for (int friendid : friends)

        {

            if (!primary.visited.containsKey(friendid))

            {

                Person friend= people.get(friendld);

                PathNode next = new PathNode(friend, pathNode);

                primary.visited.put(friendld, next);

                primary.toVisit.add(next);

            }

        }

    }

    return null;

}

/* Merge paths where searches met at the connection. */

Linkedlist<Person> mergePaths(BFSData bfsl, BFSData bfs2,

                                          int connection)

{

    // endl -> source, end2 -> dest

    PathNode endl = bfsl.visited.get(connection);

    PathNode end2 = bfs2.visited.get(connection);

    Linkedlist<Person> pathOne = endl.collapse(false);

    Linkedlist<Person> pathTwo = end2.collapse(true);

    pathTwo.removeFirst(); // remove connection

    pathOne.addAll(pathTwo); // add second path

    return pathOne;

}

class PathNode

{

    private Person person = null;

    private PathNode previousNode = null;

    public PathNode(Person p, PathNode previous)

    {

        person = p;

        previousNode = previous;

    }

    public Person getPerson()

    {

        return person;

    }

    public Linkedlist<Person> collapse(boolean startsWithRoot)

    {

        Linkedlist<Person> path= new Linkedlist<Person>();

        PathNode node = this;

        while (node != null)

        {

            if (startsWithRoot)

                path.addlast(node.person);

            else

                path.addFirst(node.person);

            node = node.previousNode;

        }

        return path;

    }

}

class BFSData

{

    public Queue<PathNode> toVisit = new Linkedlist<PathNode>();

    public HashMap<Integer, PathNode> visited =

                                 new HashMap<Integer, PathNode>();

    public BFSData(Person root)

    {

        PathNode sourcePath = new PathNode(root, null);

        toVisit.add(sourcePath);

        visited.put(root.getID(), sourcePath);

    }

    public boolean isFinished()

    {

        return toVisit.isEmpty();

    }

}

3. Generalizing this to a path of length q, we have this:
3.1 BFS: O(kq)
3.2 Bidirectional BFS: 0( kq/2 + kq/2), which is just 0( kq/2)

If we imagine a path like A->B->C->D->E where each person has 100 friends, this is a big difference. BFS will require looking at 100 million (1004) nodes. A bidirectional BFS will require looking at only 20,000 nodes (2 x 1002).

Case 2: Handle Millions of Users
For these many users, we cannot possibly keep all of our data on one machine. That means that our simple Person data structure from above doesn’t quite work-our friends may not live on the same machine as we do. Instead, we can replace our list of friends with a list of their IDs, and traverse as follows:

1: For each friend ID: int machine index = getMachineIDForUser(person_ID);

2: Go to machine #machine_index

3: On that machine, do: Person friend = getPersonWithID( person_ID);

The code below outlines this process. We’ve defined a class Server, which holds a list of all the machines, and a class Machine, which represents a single machine. Both classes have hash tables to efficiently lookup data.

Optimization: Reduce machine jumps
Jumping from one machine to another is expensive. Instead of randomly jumping from machine to machine with each friend, try to batch this jumps-e.g., if five of my friends live on one machine, I should look them up all at once.

Optimization: Smart division of people and machines
People are much more likely to be friends with people who live in the same country as they do. Rather than randomly dividing people across machines, try to divide them by country, city, state, and so on. This will reduce the number of jumps.

Question: Breadth-first search usually requires “marking” a node as visited. How do you do that in this case?
Usually, in BFS, we mark a node as visited by setting a visited flag in its node class. Here, we don’t want to do that. There could be multiple searches going on at the same time, so it’s a bad idea to just edit our data.

Instead, we could mimic the marking of nodes with a hash table to look up a node id and determine whether it’s been visited.

Other Follow-Up Questions:
1. In the real world, servers fail. How does this affect you?
2. How could you take advantage of caching?
3. Do you search until the end of the graph (infinite)? How do you decide when to give up?
4. In real life, some people have more friends of friends than others and are therefore more likely to make a path between you and someone else. How could you use this data to pick where to start traversing?

Design the data structure for large social network | Runhe Tian Coding Practice
Forget that we're dealing with millions of users at first. Design this for the simple case.
We can construct a graph by assuming every person is a node and if there is an edge between two nodes, then the two people are friends with each other.





1

2

3

4


class Person {

    Person[] friends;

    // Other info

}





If I want to find the connection between two people, I would start with one person

and do a simple breadth first search.
  But... oh no! Millions of users!

  When we deal with a service the size of Orkut or Facebook,

we cannot possibly keep all of our data on one machine.

That means that our simple Person data structure from above

doesn't quite work—our friends may not live on the same machine as us.

Instead, we can replace our list of friends with a list of their IDs,

and traverse as follows:
For each friend ID:  



1


int machine_index = lookupMachineForUserID(id);







Go to machine machine_index
  



1


Person friend = lookupFriend(machine_index);








There are more optimizations and follow up questions here than we could possibly discuss,

but here are just a few thoughts.


Optimization: Reduce Machine Jumps
  Jumping from one machine to another is expensive. Instead of randomly jumping 
from machine to machine with each friend, try to batch these jumps—e.g., 
if 5 of my friends live on one machine, I should look them up all at once.  
Optimization: Smart Division of People and Machines
  People are much more likely to be friends with people who live in the same 
country as them. Rather than randomly dividing people up across machines, 
try to divvy them up by country, city, state, etc. This will reduce the number 
of jumps.
Question: Breadth First Search usually requires "marking" a node as visited. 
How do you do that in this case?
  Usually, in BFS, we mark a node as visited by setting a flag visited in 
its node class. Here, we don't want to do that (there could be multiple 
searches going on at the same time, so it's bad to just edit our data). 
In this case, we could mimic the marking of nodes with a hash table to 
lookup a node id and whether or not it's been visited.
Other Follow-Up Questions:  
In the real world, servers fail. How does this affect you?
How could you take advantage of caching?
Do you search until the end of the graph (infinite)? 
How do you decide when to give up?
In real life, some people have more friends of friends than others, 
and are therefore more likely to make a path between you and someone else. 
How could you use this data to pick where you start traversing.

public class Server {

    ArrayList<Machine> machines = new ArrayList<Machine>();

}

public class Machine {

    public ArrayList<Person> persons = new ArrayList<Person>();

    public int machineID;

}

public class Person {

    private ArrayList<Integer> friends;

    private final int ID;

    private final int machineID;

    private String info;

    private final Server server = new Server();

    public String getInfo() { return info; }

    public void setInfo(String info) {

        this.info = info;

}

    public int[] getFriends() {

        int[] temp = new int[friends.size()];

        for (int i = 0; i < temp.length; i++) {

            temp[i] = friends.get(i);

}

        return temp;

}

    public int getID() { return ID; }

    public int getMachineID() { return machineID; }

    public void addFriend(int id) { friends.add(id); }

    // Look up a person given their ID and Machine ID

    public Person lookUpFriend(int machineID, int ID) {

        for (Machine m : server.machines) {

            if (m.machineID == machineID) {

                for (Person p : m.persons) {

                    if (p.ID == ID){

                        return p;

}

}

}

}

        return null;

}

    // Look up a machine given the machine ID

    public Machine lookUpMachine(int machineID) {

        for (Machine m:server.machines) {

            if (m.machineID == machineID)

                return m;

}

        return null;

}

    public Person(int iD, int machineID) {

        ID = iD;

        this.machineID = machineID;

}

}

http://xiaochongzhang.me/blog/?p=848

class Machine;

class Server

{

public:

    inline Machine getMachine(int id)

{

        for(int i = 0; i < machines.size(); ++i)

{

            if(id == machines[i].getMachineId())

                return machines[i];

}

        return machines[0];

}

private:

    vector<Machine> machines;

};

//machine class stores the list of person who in this machine and it has its own machine id

class Machine

{

public:

    inline int getMachineId()

{

        return this->MachineId;

}

    inline vector<Person> getPersons()

{

        return this->people;

}

private:

    int MachineId;

    vector<Person> people;

};

//this class stores the person id and how to look up person from the social clusters method

class Person

{

public:

    Person(int PId, int MId)

{

        this->PersonId = PId;

        this->MachineId = MId;

}

    inline string getInfo()

{

        return this->PersonalInfo;

}

    inline void setInfo(string info)

{

        this->PersonalInfo = info;

}

    inline int getId()

{

        return this->PersonId;

}

    inline void addFriend(Person ppl)

{

        this->friends.push_back(ppl);

}

    //look up a person by giving his/her machine id and person id

    inline Person lookUpFriends(int MId, int PId)

{

        Machine targetMachine = server.getMachine(MId); //find machine by machine's id

        vector<Person> targetPeople = targetMachine.getPersons();//get all people which stores in that machine

        for(int i = 0; i < targetPeople.size(); ++i)

{

            if(PId == targetPeople[i].getId())

                return targetPeople[i];

}

        return NULL;

}

private:

    int PersonId;

    int MachineId;

    vector<Person> friends;

    string PersonalInfo;

    Server server;

};

http://stackoverflow.com/questions/4129691/interview-question-data-structure-for-a-large-social-network

I would use a depth-first search with marking to find the friend. Marking Nodes which you've already traversed is essential since cycles of friends will exist. With a DFS the finding of the path is almost trivial because the stack you're using to execute the DFS is the path. So when you find the friend, you just pop the whole stack and you're done.

A breath first search doesn't have this nice property because the queue used to traverse the graph will have unexplored nodes, so you'll need to keep track of the path using some other structure. A breadth first search might be appropriate if we're expecting the function to be run against people in the same "neighbourhood" of friends and are really concerned about performance.

Another nice property of the DFS is that it can be parallelized. When encountering a new node one can create spawn new DFS processes/threads/whatever to process the nodes children. The new threads must be able to share the marking information through some sort of messaging system. This might be a bit of premature optimization now that I think about it some more.

https://github.com/zxqiu/leetcode-lintcode/blob/master/system%20design/Friendship_Service.java

Friendship Service
Support follow & unfollow, getFollowers, getFollowings.
follow(1, 3)
getFollowers(1) // return [3]
getFollowings(3) // return [1]
follow(2, 3)
getFollowings(3) // return [1,2]
unfollow(1, 3)
getFollowings(3) // return [2]
解：
单向好友关系。
使用两个HashMap分别保存follower和following，以提高查找效率。
在每个HashMap内，使用TreeSet来保存内容，利用TreeSet自排序特性，保证查找的结果有序。
同时，可以避免在插入时查重。

HashMap<Integer, Set<Integer>> followers, followings;

public FriendshipService() {
followers = new HashMap<Integer, Set<Integer>>();
followings = new HashMap<Integer, Set<Integer>>();
}

// @param user_id an integer
// return all followers and sort by user_id
public List<Integer> getFollowers(int user_id) {
if (!followers.containsKey(user_id)) {
return new ArrayList<Integer>();
}

return new ArrayList<Integer>(followers.get(user_id));
}

// @param user_id an integer
// return all followings and sort by user_id
public List<Integer> getFollowings(int user_id) {
if (!followings.containsKey(user_id)) {
return new ArrayList<Integer>();
}

return new ArrayList<Integer>(followings.get(user_id));
}

// @param from_user_id an integer
// @param to_user_id an integer
// from user_id follows to_user_id
public void follow(int from_user_id, int to_user_id) {
if (!followers.containsKey(from_user_id)) {
followers.put(from_user_id, new TreeSet<Integer>());
}

followers.get(from_user_id).add(to_user_id);

if (!followings.containsKey(to_user_id)) {
followings.put(to_user_id, new TreeSet<Integer>());
}

followings.get(to_user_id).add(from_user_id);
}

// @param from_user_id an integer
// @param to_user_id an integer
// from user_id unfollows to_user_id
public void unfollow(int from_user_id, int to_user_id) {
if (!followers.containsKey(from_user_id)) {
return;
}

followers.get(from_user_id).remove(to_user_id);

if (!followings.containsKey(to_user_id)) {
return;
}

followings.get(to_user_id).remove(from_user_id);
}
Read full article from Design the data structure for large social network | Runhe Tian Coding Practice

Wednesday, December 24, 2014

Design the data structure for large social network | Runhe Tian Coding Practice

Labels

Popular Posts