Saturday, July 25, 2015

Cracking the coding interview--Q12.2 Facebook Socail



Cracking the coding interview--Q12.2
How would you design the data structures for a very large social network (Facebook,LinkedIn, etc)? Describe how you would design an algorithm to show the connection, or path, between two people (e.g., Me -> Bob -> Susan -> Jason -> You).
译文:
你会怎样给一个非常大型的社交网站设计数据结构(比如Facebook,LinkedIn)? 设计一个算法来找到任意两个人之间的联系,比如:我 -> Bob -> Susan -> Jason -> 你

解答

方法:
首先,我们先不去考虑数据规模。先从简单的入手。
我们可以把每个人看作一个结点,如果两个人之间是朋友,则这两个结点间有一条边, 这样一来我们就可以构建出一个图。假设我们将"人"这个类设计如下:
class Person {     Person[] friends;     // Other info }
如果要找到两个人之间的联系,即两个人之间间隔着哪些人。我们就从其中的一个人开始, 做广度优先搜索(BFS)。(做双向的BFS会更快)
但是。。。数据规模太大了!
如果我们去处理Orkut或是Facebook上的数据,单台机器根本无法完成这个任务。 因此,我们考虑用多台机器来处理这个问题。这样一来,Person这个类就需要修改了。 在Person类中,我们存储朋友的ID,然后按照以下方式找到朋友:
对于每个朋友id:int machine_index = lookupMachineForUserID(id); 转到标号为machine_index的机器去 Person friend = lookupFriend(machine_index); 
对于这个问题,要考虑的优化和问题非常多,这里只列出一些。
优化:减少机器间的跳转次数
机器间的跳转是非常耗时的,因此我们不随机的跳转,而是进行批处理: 比如一个人,他其中的5个朋友在同一台机器上,那么跳转到那台机器后,一次性处理他们。
聪明地划分人与机器
由于同一国家的人更有可能是朋友,因此我们并不随机地把人分配到不同的机器上, 而是以国家,城市,州等进行划分。这样也可以减少机器间的跳转次数。
问题:广度优先搜索会标记已访问结点,对于这个问题,你怎么处理?
在这个问题中,有可能同时有许多人在搜索两人间的联系, 因此直接在原数据上做修改并不好。那怎么办呢?我们可以对每一个搜索, 使用一个哈希表来记录一个结点是否已经访问过。这样一来, 不同人的搜索之间就不会互相干扰。
其它问题
  1. 在真实的世界中,服务器有可能会出故障。你会怎么做?
  2. 你怎么利用缓存?
  3. 你是否一直搜索直到把图上的结点都遍历一次。如何决定什么时间停止搜索。
  4. 在真实世界中,你的朋友里,有一些人的朋友会更多。 那么通过他是否更有可能让你找到与特定某个人的联系。 你怎么利用这个数据来选择遍历的顺序。
Some functions don't belong to Person such as lookUpMachine etc.
public class Server {   
    ArrayList<Machine> machines = new ArrayList<Machine>();
}
public class Machine {
    public ArrayList<Person> persons = new ArrayList<Person>();
    public int machineID;
}

public class Person {    
    private ArrayList<Integer> friends;
    private int ID;
    private int machineID;
    private String info;
    private Server server = new Server();
    
    public String getInfo() { return info; }
    
    public void setInfo(String info) {
        this.info = info;
    }
    
    public int[] getFriends() {
        int[] temp = new int[friends.size()];
        for (int i = 0; i < temp.length; i++) {
            temp[i] = friends.get(i);
        }
        return temp;
    }
    
    
    
    public int getID() { return ID; }
    public int getMachineID() { return machineID; }
    public void addFriend(int id) { friends.add(id); }
    
    
    // Look up a person given their ID and Machine ID
    
    public Person lookUpFriend(int machineID, int ID) {
        for (Machine m : server.machines) {
            if (m.machineID == machineID) {
                for (Person p : m.persons) {
                    if (p.ID == ID){
                        return p;                      
                    }
                }
            }
        }
        return null;
    }
    
    
    // Look up a machine given the machine ID
    
    public Machine lookUpMachine(int machineID) {
        for (Machine m:server.machines) {
            if (m.machineID == machineID)
                return m;
        }
        return null;
    }
    
    
    public Person(int iD, int machineID) {
        ID = iD;
        this.machineID = machineID;
    }
}
全书题解目录:
Cracking the coding interview–问题与解答

https://hellosmallworld123.wordpress.com/2014/05/15/large-scale-system-design-and-memory-limis/
Reduce Machine Jumps: Jumping from one machine to another is expensive, so we retrive all the friends from one machine before we jump to another.
Smart Division of people and mahines: friends tends to live in the same contry, so try to store person info from the same country to the same (near by) machines.
There will be follow up questions such as
BFS usually requires marking a node as visited, how do you do that in the design? Because we can have multiple searches happen at the same time, it is not feasible to mark a flag in the object itself. So for every search, we need to have a hashset that stores all the visited id.
In the real world, servers fail, How does this affect you. As a user, server failure should be transparent. As for the engineer, failed server should have back up and data need to be redundant. When server went down, the original server should try to restart and restore the previous status, and upon failure, the server should report failure and back up server should step up and take place of the failed server.
How could you take advantage of caching?
Do you search until the end of the graph? How do you decide when to give up? Maybe we need to set some kind of limit so that the search can finish in limited time. This limit can based on the statistical data or some other algorithms.
In real life, some people have more friends of friends than others, and are therefore more likely to make a path between you and someone else. How could you use this data to pick where you start traversing? I think if we start of from one of these people, it is more likely we can find the target sooner.
http://www.cnblogs.com/grandyang/p/4853252.html
这道题让我们实现大型社交网站的数据结构,首先用户类Person需要包含好友和其他的一些信息,而且大型网站一般可能会有上百万的用户,我们一般不可能把所有的数据都存在一台机器上,所以我们在查找好友时,需要先查找好友所在的机器,再在机器上查询好友,每个好友或机器都有自己的编号,为了快速查找,均使用了哈希表来建立映射
class Person {
public:
    Person(int id): _personID(id) {}
    int getID() { return _personID; }
    void addFriend(int id) { _friendIDs.push_back(id); }
    
private:
    vector<int> _friendIDs;
    int _personID;
};

class Machine {
public:
    unordered_map<int, Person*> _persons;
    int _machineID;
    Person* getPersonWithID(int personID) {
        if (_persons.find(personID) == _persons.end()) {
            return nullptr;
        }
        return _persons[personID];
    }
};

class Server {
public:
    unordered_map<int, Machine*> _machines;
    unordered_map<int, int> _personToMachineMap;
    Machine* getMatchineWithId(int machineID) {
        if (_machines.find(machineID) == _machines.end()) {
            return nullptr;
        }
        return _machines[machineID];
    }
    int getMachineIDForUser(int personID) {
        if (_personToMachineMap.find(personID) == _personToMachineMap.end()) {
            return -1;
        }
        return _personToMachineMap[personID];
    }
    Person* getPersonWithID(int personID) {
        if (_personToMachineMap.find(personID) == _personToMachineMap.end()) {
            return nullptr;
        }
        int machineID = _personToMachineMap[personID];
        Machine *machine = getMatchineWithId(machineID);
        if (machine == nullptr) return nullptr;
        return machine->getPersonWithID(personID);
    }
};
Read full article from Cracking the coding interview--Q12.2

Labels

Review (572) System Design (334) System Design - Review (198) Java (189) Coding (75) Interview-System Design (65) Interview (63) Book Notes (59) Coding - Review (59) to-do (45) Linux (43) Knowledge (39) Interview-Java (35) Knowledge - Review (32) Database (31) Design Patterns (31) Big Data (29) Product Architecture (28) MultiThread (27) Soft Skills (27) Concurrency (26) Cracking Code Interview (26) Miscs (25) Distributed (24) OOD Design (24) Google (23) Career (22) Interview - Review (21) Java - Code (21) Operating System (21) Interview Q&A (20) System Design - Practice (20) Tips (19) Algorithm (17) Company - Facebook (17) Security (17) How to Ace Interview (16) Brain Teaser (14) Linux - Shell (14) Redis (14) Testing (14) Tools (14) Code Quality (13) Search (13) Spark (13) Spring (13) Company - LinkedIn (12) How to (12) Interview-Database (12) Interview-Operating System (12) Solr (12) Architecture Principles (11) Resource (10) Amazon (9) Cache (9) Git (9) Interview - MultiThread (9) Scalability (9) Trouble Shooting (9) Web Dev (9) Architecture Model (8) Better Programmer (8) Cassandra (8) Company - Uber (8) Java67 (8) Math (8) OO Design principles (8) SOLID (8) Design (7) Interview Corner (7) JVM (7) Java Basics (7) Kafka (7) Mac (7) Machine Learning (7) NoSQL (7) C++ (6) Chrome (6) File System (6) Highscalability (6) How to Better (6) Network (6) Restful (6) CareerCup (5) Code Review (5) Hash (5) How to Interview (5) JDK Source Code (5) JavaScript (5) Leetcode (5) Must Known (5) Python (5)

Popular Posts