https://engineering.linkedin.com/real-time-distributed-graph/using-set-cover-algorithm-optimize-query-latency-large-scale-distributed
As someone who was at LinkedIn as we scaled back most search results from including 3rd degree effects to only 1st and 2nd degree effects, I can attest to the computation intensive nature of this. Currently LinkedIn will show you if a given individual is in your 3rd degree or not (this is easy, because each user's 1st and 2nd degree connections are typically cached somewhere, so this gives you out to the degree of connection between those users if it is 4th or less), but will *never* do something like "grab all of my third degree connections".
I will guess that most likely due to the size of Facebook's graph, it is simply not the case that both the 1st and 2nd degree of most users is kept in memory. If only your friends are available, if my friends and yours don't overlap, checking whether we're 3rd, or 4th, requires some significant additional computation.
“已知一个函数,输入用户ID,可以返回该用户的所有友好(degree 1 friends),按好友ID从小到大排序。要求实现函数来输出返回一个用户的所有好友的好友(degree 2friends), 以及 degree 3 friends。
这题是想考什么呢?单是3级朋友,是不是BFS就可以解决了?是还要考distributedsystem怎样存取用户信息吗?
Overview of Query Routing
LinkedIn's distributed graph system consists of three major subcomponents:
- GraphDB: a partitioned and replicated graph database.
- Network Cache Service (NCS): a distributed cache that stores a member's network and serves queries requiring second degree knowledge.
- API layer: the access point for front-ends.
NCS, the caching layer, calculates and stores a member's second-degree set. With this cache, graph distance queries originating from a member can be converted to set intersections, avoiding further remote calls. For example, if we have member X's second degree calculated, to decide whether member Y is three degree apart from member X, we can simply fetch Y's connections and intersect X's second degree cache with Y's first degree connections.
NCS, the caching layer, calculates and stores a member's second-degree set. With this cache, graph distance queries originating from a member can be converted to set intersections, avoiding further remote calls. For example, if we have member X's second degree calculated, to decide whether member Y is three degree apart from member X, we can simply fetch Y's connections and intersect X's second degree cache with Y's first degree connections.
I will guess that most likely due to the size of Facebook's graph, it is simply not the case that both the 1st and 2nd degree of most users is kept in memory. If only your friends are available, if my friends and yours don't overlap, checking whether we're 3rd, or 4th, requires some significant additional computation.
“已知一个函数,输入用户ID,可以返回该用户的所有友好(degree 1 friends),按好友ID从小到大排序。要求实现函数来输出返回一个用户的所有好友的好友(degree 2friends), 以及 degree 3 friends。
这题是想考什么呢?单是3级朋友,是不是BFS就可以解决了?是还要考distributedsystem怎样存取用户信息吗?
论文的关键好像是说,当需要在db搜索(不是cache)的时候,如何很快找到所有的set,这个就是apply set coverage algorithm,尽量争取在一台机器上找到,而不是多台机器。但是它后面有些优化,好像很简单的办法就可以降低latency,这部分没看懂。【 在 kkuser (kkuser!) 的大作中提到: 】
: 学习了下这个链接还看了下相关的paper,似乎有了一些概念。
: 貌似是说建立一个network cache service,里面存有所有member的一层和二层的关系: ,这样的话,求一个member的三层关系,只需要one remote call for each 2nd
: connector。不知道这样理解对不对。
: 另外有个疑问,这个NCS,是个只用其memory的cluster吗?这个cluster可以装的下所: 有用户以及每个用户所有的二层关系吗?
: 一个machine没什么, 像FB那样的, gazillion machine来存你的ID-》 friend
list
: 信息, 你怎没办? machine 之间传来床去?
: bandwidth怎没办? 如果一个user, 像justin bieber那样, 有gazillion个
friends
: , 你怎没办?