Design data structures for a very large social network like Facebook or Linkedln - GeeksforGeeks
How would you design the data structures for a very large social network like Facebook or Linkedln? Describe how you would design an algorithm to show the shortest path between two people (e.g., Me-> Bob-> Susan-> Jason-> You).
Design the data structure for large social network | Runhe Tian Coding Practice
Forget that we're dealing with millions of users at first. Design this for the simple case.
We can construct a graph by assuming every person is a node and if there is an edge between two nodes, then the two people are friends with each other.
http://xiaochongzhang.me/blog/?p=848
http://stackoverflow.com/questions/4129691/interview-question-data-structure-for-a-large-social-network
https://github.com/zxqiu/leetcode-lintcode/blob/master/system%20design/Friendship_Service.java
Friendship Service
Support follow & unfollow, getFollowers, getFollowings.
follow(1, 3)
getFollowers(1) // return [3]
getFollowings(3) // return [1]
follow(2, 3)
getFollowings(3) // return [1,2]
unfollow(1, 3)
getFollowings(3) // return [2]
解:
单向好友关系。
使用两个HashMap分别保存follower和following,以提高查找效率。
在每个HashMap内,使用TreeSet来保存内容,利用TreeSet自排序特性,保证查找的结果有序。
同时,可以避免在插入时查重。
HashMap<Integer, Set<Integer>> followers, followings;
public FriendshipService() {
followers = new HashMap<Integer, Set<Integer>>();
followings = new HashMap<Integer, Set<Integer>>();
}
// @param user_id an integer
// return all followers and sort by user_id
public List<Integer> getFollowers(int user_id) {
if (!followers.containsKey(user_id)) {
return new ArrayList<Integer>();
}
return new ArrayList<Integer>(followers.get(user_id));
}
// @param user_id an integer
// return all followings and sort by user_id
public List<Integer> getFollowings(int user_id) {
if (!followings.containsKey(user_id)) {
return new ArrayList<Integer>();
}
return new ArrayList<Integer>(followings.get(user_id));
}
// @param from_user_id an integer
// @param to_user_id an integer
// from user_id follows to_user_id
public void follow(int from_user_id, int to_user_id) {
if (!followers.containsKey(from_user_id)) {
followers.put(from_user_id, new TreeSet<Integer>());
}
followers.get(from_user_id).add(to_user_id);
if (!followings.containsKey(to_user_id)) {
followings.put(to_user_id, new TreeSet<Integer>());
}
followings.get(to_user_id).add(from_user_id);
}
// @param from_user_id an integer
// @param to_user_id an integer
// from user_id unfollows to_user_id
public void unfollow(int from_user_id, int to_user_id) {
if (!followers.containsKey(from_user_id)) {
return;
}
followers.get(from_user_id).remove(to_user_id);
if (!followings.containsKey(to_user_id)) {
return;
}
followings.get(to_user_id).remove(from_user_id);
}
Read full article from Design the data structure for large social network | Runhe Tian Coding Practice
How would you design the data structures for a very large social network like Facebook or Linkedln? Describe how you would design an algorithm to show the shortest path between two people (e.g., Me-> Bob-> Susan-> Jason-> You).
A good way to approach this problem is to remove some of the constraints and solve it for that situation first.
Case 1: Simplify the Problem (Not considering millions of people)
We can construct a graph by treating each person as a node and letting an edge between two nodes indicate that the two users are friends. If we want to find the path between two people, we start with one person and do a simple breadth-first search. Alternatively, we can do bidirectional breadth first search. This means doing two breadth first searches, one from the source and one from the destination. When the searches collide, we know we’ve found a path.
We can construct a graph by treating each person as a node and letting an edge between two nodes indicate that the two users are friends. If we want to find the path between two people, we start with one person and do a simple breadth-first search. Alternatively, we can do bidirectional breadth first search. This means doing two breadth first searches, one from the source and one from the destination. When the searches collide, we know we’ve found a path.
Why not a depth-first search work well? First, the depth-first search would just find a path. It wouldn’t necessarily find the shortest path. Second, even if we just needed any path, it would be very inefficient. Two users might be only one degree of separation apart, but it could search millions of nodes in their”subtrees” before finding this relatively immediate connection.
In the implementation, we’ll use two classes to help us. BFSData holds the data needed for a breadth-first search, such as the isVisited hash table and the toVisit queue. PathNode represents the path as we’re searching, storing each Person and the previousNode we visited in this path.
Linkedlist<Person> findPathBiBFS(HashMap<Integer, Person> people,
int
source,
int
destination)
{
BFSData sourceData =
new
BFSData(people.get(source));
BFSData destData =
new
BFSData(people.get(destination));
while
(!sourceData.isFinished() && !destData.isFinished())
{
/* Search out from source. */
Person collision = searchlevel(people, sourceData, destData);
if
(collision !=
null
)
return
mergePaths(sourceData, destData, collision.getID());
/* Search out from destination. */
collision = searchlevel(people, destData, sourceData);
if
(collision !=
null
)
return
mergePaths(sourceData, destData, collision.getID());
}
return
null
;
}
/* Search one level and return collision, if any.*/
Person searchLevel(HashMap<Integer, Person> people,
BFSData primary, BFSData secondary)
{
/* We only want to search one level at a time. Count
how many nodes are currently
in the primary's level and only do that many nodes.
We continue to add nodes to the end. */
int
count = primary.toVisit.size();
for
(
int
i=
0
; i < count; i++)
{
/* Pull out first node. */
PathNode pathNode = primary.toVisit.poll();
int
personld = pathNode.getPerson().getID();
/* Check if it's already been visited. */
if
(secondary.visited.containsKey(personid))
return
pathNode.getPerson();
/* Add friends to queue. */
Person person = pathNode. getPerson();
Arraylist<Integer> friends = person.getFriends();
for
(
int
friendid : friends)
{
if
(!primary.visited.containsKey(friendid))
{
Person friend= people.get(friendld);
PathNode next =
new
PathNode(friend, pathNode);
primary.visited.put(friendld, next);
primary.toVisit.add(next);
}
}
}
return
null
;
}
/* Merge paths where searches met at the connection. */
Linkedlist<Person> mergePaths(BFSData bfsl, BFSData bfs2,
int
connection)
{
// endl -> source, end2 -> dest
PathNode endl = bfsl.visited.get(connection);
PathNode end2 = bfs2.visited.get(connection);
Linkedlist<Person> pathOne = endl.collapse(
false
);
Linkedlist<Person> pathTwo = end2.collapse(
true
);
pathTwo.removeFirst();
// remove connection
pathOne.addAll(pathTwo);
// add second path
return
pathOne;
}
class
PathNode
{
private
Person person =
null
;
private
PathNode previousNode =
null
;
public
PathNode(Person p, PathNode previous)
{
person = p;
previousNode = previous;
}
public
Person getPerson()
{
return
person;
}
public
Linkedlist<Person> collapse(
boolean
startsWithRoot)
{
Linkedlist<Person> path=
new
Linkedlist<Person>();
PathNode node =
this
;
while
(node !=
null
)
{
if
(startsWithRoot)
path.addlast(node.person);
else
path.addFirst(node.person);
node = node.previousNode;
}
return
path;
}
}
class
BFSData
{
public
Queue<PathNode> toVisit =
new
Linkedlist<PathNode>();
public
HashMap<Integer, PathNode> visited =
new
HashMap<Integer, PathNode>();
public
BFSData(Person root)
{
PathNode sourcePath =
new
PathNode(root,
null
);
toVisit.add(sourcePath);
visited.put(root.getID(), sourcePath);
}
public
boolean
isFinished()
{
return
toVisit.isEmpty();
}
}
3. Generalizing this to a path of length q, we have this:
3.1 BFS: O(kq)
3.2 Bidirectional BFS: 0( kq/2 + kq/2), which is just 0( kq/2)
3.1 BFS: O(kq)
3.2 Bidirectional BFS: 0( kq/2 + kq/2), which is just 0( kq/2)
If we imagine a path like A->B->C->D->E where each person has 100 friends, this is a big difference. BFS will require looking at 100 million (1004) nodes. A bidirectional BFS will require looking at only 20,000 nodes (2 x 1002).
Case 2: Handle Millions of Users
For these many users, we cannot possibly keep all of our data on one machine. That means that our simple Person data structure from above doesn’t quite work-our friends may not live on the same machine as we do. Instead, we can replace our list of friends with a list of their IDs, and traverse as follows:
For these many users, we cannot possibly keep all of our data on one machine. That means that our simple Person data structure from above doesn’t quite work-our friends may not live on the same machine as we do. Instead, we can replace our list of friends with a list of their IDs, and traverse as follows:
1: For each friend ID: int machine index = getMachineIDForUser(person_ID);
2: Go to machine #machine_index
3: On that machine, do: Person friend = getPersonWithID( person_ID);
The code below outlines this process. We’ve defined a class Server, which holds a list of all the machines, and a class Machine, which represents a single machine. Both classes have hash tables to efficiently lookup data.
Optimization: Reduce machine jumps
Jumping from one machine to another is expensive. Instead of randomly jumping from machine to machine with each friend, try to batch this jumps-e.g., if five of my friends live on one machine, I should look them up all at once.
Jumping from one machine to another is expensive. Instead of randomly jumping from machine to machine with each friend, try to batch this jumps-e.g., if five of my friends live on one machine, I should look them up all at once.
Optimization: Smart division of people and machines
People are much more likely to be friends with people who live in the same country as they do. Rather than randomly dividing people across machines, try to divide them by country, city, state, and so on. This will reduce the number of jumps.
People are much more likely to be friends with people who live in the same country as they do. Rather than randomly dividing people across machines, try to divide them by country, city, state, and so on. This will reduce the number of jumps.
Question: Breadth-first search usually requires “marking” a node as visited. How do you do that in this case?
Usually, in BFS, we mark a node as visited by setting a visited flag in its node class. Here, we don’t want to do that. There could be multiple searches going on at the same time, so it’s a bad idea to just edit our data.
Usually, in BFS, we mark a node as visited by setting a visited flag in its node class. Here, we don’t want to do that. There could be multiple searches going on at the same time, so it’s a bad idea to just edit our data.
Instead, we could mimic the marking of nodes with a hash table to look up a node id and determine whether it’s been visited.
Other Follow-Up Questions:
1. In the real world, servers fail. How does this affect you?
2. How could you take advantage of caching?
3. Do you search until the end of the graph (infinite)? How do you decide when to give up?
4. In real life, some people have more friends of friends than others and are therefore more likely to make a path between you and someone else. How could you use this data to pick where to start traversing?
1. In the real world, servers fail. How does this affect you?
2. How could you take advantage of caching?
3. Do you search until the end of the graph (infinite)? How do you decide when to give up?
4. In real life, some people have more friends of friends than others and are therefore more likely to make a path between you and someone else. How could you use this data to pick where to start traversing?
Forget that we're dealing with millions of users at first. Design this for the simple case.
We can construct a graph by assuming every person is a node and if there is an edge between two nodes, then the two people are friends with each other.
If I want to find the connection between two people, I would start with one person
1234class
Person {
Person[] friends;
// Other info
}
and do a simple breadth first search. But... oh no! Millions of users!
When we deal with a service the size of Orkut or Facebook,
we cannot possibly keep all of our data on one machine.
That means that our simple Person data structure from above
doesn't quite work—our friends may not live on the same machine as us.
Instead, we can replace our list of friends with a list of their IDs,
and traverse as follows:There are more optimizations and follow up questions here than we could possibly discuss,
- For each friend ID:
1int
machine_index = lookupMachineForUserID(id);
- Go to machine machine_index
1Person friend = lookupFriend(machine_index);
but here are just a few thoughts.
- Optimization: Reduce Machine Jumps Jumping from one machine to another is expensive. Instead of randomly jumping
- from machine to machine with each friend, try to batch these jumps—e.g.,
- if 5 of my friends live on one machine, I should look them up all at once.
- Optimization: Smart Division of People and Machines People are much more likely to be friends with people who live in the same
- country as them. Rather than randomly dividing people up across machines,
- try to divvy them up by country, city, state, etc. This will reduce the number
- of jumps.
- Question: Breadth First Search usually requires "marking" a node as visited.
- How do you do that in this case? Usually, in BFS, we mark a node as visited by setting a flag visited in
- its node class. Here, we don't want to do that (there could be multiple
- searches going on at the same time, so it's bad to just edit our data).
- In this case, we could mimic the marking of nodes with a hash table to
- lookup a node id and whether or not it's been visited.
- Other Follow-Up Questions:
- In the real world, servers fail. How does this affect you?
- How could you take advantage of caching?
- Do you search until the end of the graph (infinite)?
- How do you decide when to give up?
- In real life, some people have more friends of friends than others,
- and are therefore more likely to make a path between you and someone else.
- How could you use this data to pick where you start traversing.
public
class
Server {
ArrayList<Machine> machines =
new
ArrayList<Machine>();
}
public
class
Machine {
public
ArrayList<Person> persons =
new
ArrayList<Person>();
public
int
machineID;
}
public
class
Person {
private
ArrayList<Integer> friends;
private
final
int
ID;
private
final
int
machineID;
private
String info;
private
final
Server server =
new
Server();
public
String getInfo() {
return
info; }
public
void
setInfo(String info) {
this
.info = info;
}
public
int
[] getFriends() {
int
[] temp =
new
int
[friends.size()];
for
(
int
i =
0
; i < temp.length; i++) {
temp[i] = friends.get(i);
}
return
temp;
}
public
int
getID() {
return
ID; }
public
int
getMachineID() {
return
machineID; }
public
void
addFriend(
int
id) { friends.add(id); }
// Look up a person given their ID and Machine ID
public
Person lookUpFriend(
int
machineID,
int
ID) {
for
(Machine m : server.machines) {
if
(m.machineID == machineID) {
for
(Person p : m.persons) {
if
(p.ID == ID){
return
p;
}
}
}
}
return
null
;
}
// Look up a machine given the machine ID
public
Machine lookUpMachine(
int
machineID) {
for
(Machine m:server.machines) {
if
(m.machineID == machineID)
return
m;
}
return
null
;
}
public
Person(
int
iD,
int
machineID) {
ID = iD;
this
.machineID = machineID;
}
}
http://xiaochongzhang.me/blog/?p=848
class Machine;
class Server
{
public:
inline Machine getMachine(int id)
{
for(int i = 0; i < machines.size(); ++i)
{
if(id == machines[i].getMachineId())
return machines[i];
}
return machines[0];
}
private:
vector<Machine> machines;
};
//machine class stores the list of person who in this machine and it has its own machine id
class Machine
{
public:
inline int getMachineId()
{
return this->MachineId;
}
inline vector<Person> getPersons()
{
return this->people;
}
private:
int MachineId;
vector<Person> people;
};
//this class stores the person id and how to look up person from the social clusters method
class Person
{
public:
Person(int PId, int MId)
{
this->PersonId = PId;
this->MachineId = MId;
}
inline string getInfo()
{
return this->PersonalInfo;
}
inline void setInfo(string info)
{
this->PersonalInfo = info;
}
inline int getId()
{
return this->PersonId;
}
inline void addFriend(Person ppl)
{
this->friends.push_back(ppl);
}
//look up a person by giving his/her machine id and person id
inline Person lookUpFriends(int MId, int PId)
{
Machine targetMachine = server.getMachine(MId); //find machine by machine's id
vector<Person> targetPeople = targetMachine.getPersons();//get all people which stores in that machine
for(int i = 0; i < targetPeople.size(); ++i)
{
if(PId == targetPeople[i].getId())
return targetPeople[i];
}
return NULL;
}
private:
int PersonId;
int MachineId;
vector<Person> friends;
string PersonalInfo;
Server server;
};
http://stackoverflow.com/questions/4129691/interview-question-data-structure-for-a-large-social-network
I would use a depth-first search with marking to find the friend. Marking Nodes which you've already traversed is essential since cycles of friends will exist. With a DFS the finding of the path is almost trivial because the stack you're using to execute the DFS is the path. So when you find the friend, you just pop the whole stack and you're done.
A breath first search doesn't have this nice property because the queue used to traverse the graph will have unexplored nodes, so you'll need to keep track of the path using some other structure. A breadth first search might be appropriate if we're expecting the function to be run against people in the same "neighbourhood" of friends and are really concerned about performance.
Another nice property of the DFS is that it can be parallelized. When encountering a new node one can create spawn new DFS processes/threads/whatever to process the nodes children. The new threads must be able to share the marking information through some sort of messaging system. This might be a bit of premature optimization now that I think about it some more.
https://github.com/zxqiu/leetcode-lintcode/blob/master/system%20design/Friendship_Service.java
Friendship Service
Support follow & unfollow, getFollowers, getFollowings.
follow(1, 3)
getFollowers(1) // return [3]
getFollowings(3) // return [1]
follow(2, 3)
getFollowings(3) // return [1,2]
unfollow(1, 3)
getFollowings(3) // return [2]
解:
单向好友关系。
使用两个HashMap分别保存follower和following,以提高查找效率。
在每个HashMap内,使用TreeSet来保存内容,利用TreeSet自排序特性,保证查找的结果有序。
同时,可以避免在插入时查重。
HashMap<Integer, Set<Integer>> followers, followings;
public FriendshipService() {
followers = new HashMap<Integer, Set<Integer>>();
followings = new HashMap<Integer, Set<Integer>>();
}
// @param user_id an integer
// return all followers and sort by user_id
public List<Integer> getFollowers(int user_id) {
if (!followers.containsKey(user_id)) {
return new ArrayList<Integer>();
}
return new ArrayList<Integer>(followers.get(user_id));
}
// @param user_id an integer
// return all followings and sort by user_id
public List<Integer> getFollowings(int user_id) {
if (!followings.containsKey(user_id)) {
return new ArrayList<Integer>();
}
return new ArrayList<Integer>(followings.get(user_id));
}
// @param from_user_id an integer
// @param to_user_id an integer
// from user_id follows to_user_id
public void follow(int from_user_id, int to_user_id) {
if (!followers.containsKey(from_user_id)) {
followers.put(from_user_id, new TreeSet<Integer>());
}
followers.get(from_user_id).add(to_user_id);
if (!followings.containsKey(to_user_id)) {
followings.put(to_user_id, new TreeSet<Integer>());
}
followings.get(to_user_id).add(from_user_id);
}
// @param from_user_id an integer
// @param to_user_id an integer
// from user_id unfollows to_user_id
public void unfollow(int from_user_id, int to_user_id) {
if (!followers.containsKey(from_user_id)) {
return;
}
followers.get(from_user_id).remove(to_user_id);
if (!followings.containsKey(to_user_id)) {
return;
}
followings.get(to_user_id).remove(from_user_id);
}
Read full article from Design the data structure for large social network | Runhe Tian Coding Practice