Thursday, December 25, 2014

Design the Facebook news feed



http://www.quora.com/What-are-best-practices-for-building-something-like-a-News-Feed
Determining Importance

You'll likely want to rank edges by importance rather than simply the most recent updates, meaning that you need to calculate some sort of score.
Facebook's EdgeRank was described by the formula EdgeRank = ∑ u_e · w_e · d_e, summed over the edges e that make up a story, where u_e is the affinity score with the user who created the edge, w_e is the weight for the content type, and d_e is a time-decay factor.

Calculating a friend's affinity score can be done something like this:
affinity = ∑ l_i · n_i · w_i, summed over the interactions i with that friend, where l_i is a recency weight based on the time since your last interaction (weighted so that 1 day ago counts for more than 30 days ago), n_i is the number of interactions, and w_i is the weight of those interactions.
This method allows you to rank friends in a separate database and then perhaps only show ten updates from the ten closest friends, which isn't a bad idea considering few of us are likely to have more close friends than this.
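
A minimal Python sketch of both scores, assuming exponential time decay and made-up half-lives and weights (none of these constants or function names come from Facebook):

import time

def edge_rank(affinity, weight, created_at, now=None, half_life=86400.0):
    # One term of the EdgeRank sum: affinity * content-type weight * time decay.
    # The exponential half-life decay is an assumption; the real d_e was never published.
    now = now or time.time()
    decay = 0.5 ** ((now - created_at) / half_life)
    return affinity * weight * decay

def friend_affinity(interactions, now=None, half_life=7 * 86400.0):
    # Sum of l_i * n_i * w_i over interaction types with one friend.
    # Each entry is (last_interaction_ts, count, type_weight).
    now = now or time.time()
    total = 0.0
    for last_ts, count, type_weight in interactions:
        recency = 0.5 ** ((now - last_ts) / half_life)  # 1 day ago outweighs 30 days ago
        total += recency * count * type_weight
    return total

def closest_friends(friends, k=10):
    # friends: {friend_id: [(last_ts, count, type_weight), ...]}
    # Rank friends by affinity and keep only the k closest, as suggested above.
    return sorted(friends, key=lambda f: friend_affinity(friends[f]), reverse=True)[:k]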

What to Store
Activity(id, user_id, source_id, activity_type, edge_rank, parent_id, parent_type, data, time)

Assuming you're using MySQL as your database store, you can index on (user_id, time) and then perform your basic queries.

In MySQL, your tables would be heavily denormalized since performing joins will hurt performance.
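
As a concrete illustration, here is a small runnable sketch of that table and the basic read query, using SQLite as a stand-in for MySQL (the column types and the denormalized data blob are assumptions):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    source_id     INTEGER,
    activity_type TEXT,
    edge_rank     REAL,
    parent_id     INTEGER,
    parent_type   TEXT,
    data          TEXT,     -- denormalized payload (e.g. JSON), so reads need no joins
    time          INTEGER   -- unix timestamp
);
CREATE INDEX idx_activity_user_time ON activity (user_id, time);
""")

def recent_activity(conn, user_id, limit=20):
    # Basic read path, served entirely by the (user_id, time) index.
    return conn.execute(
        "SELECT activity_type, data, time FROM activity"
        " WHERE user_id = ? ORDER BY time DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()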

  1. Optimize for the read.
  2. Users don't know what they don't know. If the system is eventually consistent, at least try to give the publisher read-your-own-writes consistency, so they see their own post immediately.
  3. If reading from 3rd party sources, leverage 'push' delivery options wherever possible. Polling is expensive and inefficient.
  4. Queue and prompt on new activity. Stream in (preferably in real-time) comments and likes.
  5. Not all content is created equal. While reverse-chronological delivery is popular, in high-volume situations identify signals you can leverage to rank and order content more intelligently.
  6. If supporting API content delivery, provide attribution to developer applications.
  7. Leverage open standards such as http://activitystrea.ms/. Standardizing on the data format you consume and emit is better for everyone involved in the long run.

I found this paper interesting - http://research.yahoo.com/files/...
Summary:
* Combination of push (events are pushed to materialized per consumer feeds) and pull (events are pulled from per producer event store) approaches. Purely push or pull model is less versatile.
* The push/pull decision is made locally on a per consumer/producer basis. One size doesn't fit all (see the sketch after this list).
* Global and per producer coherency - With global coherency, events are displayed in the global order of timestamp. With per-producer coherency, time sequence is maintained per producer basis.
* Feed diversity - A frequent producer of events may overshadow events from less frequent producers. Feed diversity favors showing a mix of producers over absolute sequencing by timestamp alone.
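
A minimal sketch of the per-pair push/pull decision from the first two points, with placeholder cost constants (the real trade-off would be measured from actual storage and query costs):

def should_push(producer_event_rate, consumer_query_rate, push_cost=1.0, pull_cost=1.0):
    # Push materializes every event into the consumer's feed, so its cost grows
    # with the producer's event rate; pull re-reads the producer's event store on
    # every feed load, so its cost grows with the consumer's query rate.
    # Pick the cheaper side for this particular pair -- one size doesn't fit all.
    return producer_event_rate * push_cost <= consumer_query_rate * pull_cost

# A prolific producer read by a casual consumer -> pull; a quiet producer
# followed by a heavy reader -> push.
print(should_push(producer_event_rate=50, consumer_query_rate=2))    # False: pull
print(should_push(producer_event_rate=0.1, consumer_query_rate=20))  # True: push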

Last essential aspects: how frequently people connect may need dedicated study in order to weight the 'li' term mentioned by Josh; people with shared connections and a history of common comment threads likely deserve a closer relationship, i.e. a bigger weight 'wi'. Typical usage-time patterns should make the former straightforward, and PageRank gives you enough insight about how to handle the latter.

Many startups (and major media companies) use Echo StreamServer to power their real-time news feeds quickly and cheaply.

https://backchannel.org/blog/friendfeed-schemaless-mysql
We did the typical things, like using read slaves and memcache to increase read throughput and sharding our database to improve write throughput.

In particular, making schema changes or adding indexes to a database with more than 10 - 20 million rows completely locks the database for hours at a time. Removing old indexes takes just as much time, and not removing them hurts performance because the database will continue to read and write to those unused blocks on every INSERT, pushing important blocks out of memory.

http://stackoverflow.com/questions/4162020/how-can-i-improve-this-php-mysql-news-feed
The schema for the log is activity_log(uid INT(11), activity ENUM, activity_id INT(11), title TEXT, date TIMESTAMP)
...and the schema for the feed is newsfeed(uid INT(11), poster_uid INT(11), activity ENUM, activity_id INT(11), title TEXT, date TIMESTAMP).
Any time a user does something relevant to the news feed, for example asking a question, it will get logged to the activity log immediately.
Then every X minutes (5 minutes at the moment, will change to 15-30 minutes later), I run a cron job that executes the script below. This script loops through all of the users in the database, finds all the activities for all of that user's friends, and then writes those activities to the news feed.
At the moment, the SQL that culls the activity (called in ActivityLog::getUsersActivity()) has a LIMIT 100 imposed for performance reasons.
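
A minimal sketch of that cron pass, assuming the schemas above plus hypothetical users(id) and friends(uid, friend_uid) tables, and only pulling activity newer than the previous run:

import sqlite3

def crunch_logs(conn, since_ts, limit=100):
    # For every user, copy their friends' recent activity_log rows into that
    # user's newsfeed rows. Table and column names follow the question above.
    users = [row[0] for row in conn.execute("SELECT id FROM users")]
    for uid in users:
        rows = conn.execute(
            """SELECT l.uid, l.activity, l.activity_id, l.title, l.date
                 FROM activity_log AS l
                 JOIN friends AS f ON f.friend_uid = l.uid
                WHERE f.uid = ? AND l.date > ?
                ORDER BY l.date DESC LIMIT ?""",
            (uid, since_ts, limit),
        ).fetchall()
        conn.executemany(
            "INSERT INTO newsfeed (uid, poster_uid, activity, activity_id, title, date)"
            " VALUES (?, ?, ?, ?, ?, ?)",
            [(uid, poster_uid, activity, activity_id, title, date)
             for (poster_uid, activity, activity_id, title, date) in rows],
        )
    conn.commit()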
Here are the flaws I see with your current implementation:
  1. You are processing all of the friends for all users, but you will end up processing the same users many times due to the fact that the same groups of people have similar friends.
  2. If one of my friends posts something, it can take up to 5 minutes to show up on my news feed, whereas it should show up immediately, right?
  3. We are reading the entire news feed for a user. Don't we just need to grab the new activities since the last time we crunched the logs?
  4. This doesn't scale that well.
Publishing Methods

"Push" Model, or Fan-out-on-write
This method involves denormalizing the user's activity data and pushing the metadata to all the user's friends at the time it occurs.
You store only one copy of the data as in the schema above, then push pointers to friends with the metadata. The problem with this method is that if you have a large fan-out (a large number of followers), you run the risk of this breaking while your feed accumulates a backlog.

If you go with this strategy, you also risk a large number of disk seeks and random writes. You'll want some sort of write-optimized data store such as Cassandra, HBase, or BigTable.
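
A minimal fan-out-on-write sketch, with in-process dicts standing in for the social graph and the per-user feed store (in practice those would live in a write-optimized store such as Cassandra or HBase):

from collections import defaultdict, deque

FEED_LEN = 500                                        # cap per-user feeds so backlogs stay bounded
follower_index = {42: [1, 2, 3]}                      # hypothetical social graph: author -> followers
feeds = defaultdict(lambda: deque(maxlen=FEED_LEN))   # follower -> pointers to stored activities

def publish(activity_id, author_id):
    # Fan-out-on-write: the activity is stored once elsewhere; here we push a
    # pointer (its id) into every follower's materialized feed at post time.
    for follower in follower_index.get(author_id, []):
        feeds[follower].appendleft(activity_id)

publish(activity_id=900, author_id=42)
print(feeds[1])   # deque([900], maxlen=500)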

"Pull" Model, or Fan-out-on-load
This method involves keeping all recent activity data in memory and pulling in (or fanning out) that data at the time a user loads their home page.
Data doesn't need to be pushed out to all subscribers as soon as it happens, so no back-log and no disk seeks. The problem with this method is that you may fail to generate a user's news feed altogether. 
To mitigate this risk, you should have a fallback mechanism in place that approximates the user's feed or serves as a good alternative.
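
A minimal fan-out-on-load sketch, merging each followee's recent in-memory activity at page-load time, with a stale cached feed as the fallback (all names here are assumptions):

import heapq

def build_feed(followee_ids, recent_by_producer, limit=50):
    # recent_by_producer: producer_id -> [(timestamp, activity_id), ...] kept in memory.
    # Merge every followee's recent activity newest-first; nothing is
    # materialized per reader, so there is no write backlog.
    streams = (sorted(recent_by_producer.get(f, []), reverse=True) for f in followee_ids)
    merged = heapq.merge(*streams, reverse=True)
    return [activity_id for _, activity_id in list(merged)[:limit]]

def feed_with_fallback(user_id, followee_ids, recent_by_producer, stale_feed_cache):
    # Fallback: if the in-memory data is unavailable, serve an approximation,
    # e.g. the last successfully built feed from a cache.
    try:
        return build_feed(followee_ids, recent_by_producer)
    except Exception:
        return stale_feed_cache.get(user_id, [])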

relevance vs recency
http://www.slideshare.net/danmckinley/etsy-activity-feeds-architecture
There are two fundamental building blocks for feeds: connections and activities.
Activities form a log of what some entity on the site has done, or had done to it.
Connections express relationships between entities.

Connections are a superset of Circles, Favorites, Orders, and other relationships between entities on the site.

Direction
Connections are implemented as a directed graph.
Currently, the nodes can be people or shops. (In principle they can be other objects.)

We also assign each edge a weight, known as affinity.
activity := (subject, verb, object)

Activities are a description of an event on Etsy, boiled down to a subject ("who did it"), a verb ("what they did"), and an object ("what they did it to").

The problem (the main one we're trying to solve with activity feeds) is how to notify everyone connected to the subject about it. In order to achieve that goal, as usual, we copy the data all over the place.

activity := [owner,(subject, verb, object)]
So what we do is duplicate the S,V,O combinations with different owners.
Steve will have his record that he connected to Kyle, and Kyle will be given his own record that Steve connected to him.
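
A small sketch of that owner-stamped activity record and the duplication step (the dataclass shape is an assumption, not Etsy's actual schema):

from dataclasses import dataclass

@dataclass(frozen=True)
class Activity:
    owner: str      # whose copy of the story this is
    subject: str    # who did it
    verb: str       # what they did
    obj: str        # what they did it to

def fan_copy(subject, verb, obj, owners):
    # Duplicate one (subject, verb, object) event once per owner --
    # the "copy the data all over the place" step described above.
    return [Activity(owner, subject, verb, obj) for owner in owners]

# Steve connects to Kyle: Steve and Kyle each get their own record of it.
for record in fan_copy("Steve", "connected to", "Kyle", owners=["Steve", "Kyle"]):
    print(record)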

Getting to the end result (the activity feed page) has two distinct phases: aggregation and display.

Aggregation turns activities (in the database) into a Newsfeed (in memcache).
Aggregation typically occurs offline, with Gearman.

The first step in aggregation is to turn the list of people you are connected to into the list of people we're actually going to go seek out activities for.

In theory, the way we would do this is rank the connections by affinity and then treat the affinity as the probability that we’ll pick it.

$choose_connection = (mt_rand() / mt_getrandmax()) < $affinity;
So then we’d be more likely to pick the close connections, but leaving the possibility that we will pick the distant ones.
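
The same idea as a runnable Python sketch, assuming affinities are normalized to the 0..1 range:

import random

def choose_connections(connections):
    # Step 1 of aggregation: treat each connection's affinity as the probability
    # of selecting it, so close connections are usually picked but distant ones
    # still have a chance.
    return [c for c, affinity in connections.items() if random.random() < affinity]

connections = {"close_friend": 0.9, "favorite_shop": 0.6, "stranger": 0.05}
print(choose_connections(connections))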

Aggregation, Step 2: Making Activity Sets
Once the connections are chosen, we then select historical activity for them and convert them into in-memory structures called activity sets.
These are just the activities grouped by connection, with a score and flags field for each.
The next few phases of aggregation operate on these.

The next thing that happens is that we iterate through all of the activities in all of the sets and classify them.

Aggregation, Step 3: Classification
Aggregation, Step 4: Scoring
At this point the score is just a simple time decay function (older activities always score lower than new ones).
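
A sketch of such a score, assuming exponential decay with a made-up half-life:

import math, time

def score(activity_ts, now=None, half_life=2 * 86400.0):
    # Step 4: pure time decay, so older activities always score lower than newer ones.
    now = now or time.time()
    return math.exp(-math.log(2) * (now - activity_ts) / half_life)

# An activity from right now scores ~1.0; one from two days ago scores ~0.5.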

Aggregation, Step 5: Pruning
It's possible to wind up seeing the same event as two or more activities.
The next stage of aggregation detects these situations.
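
A sketch of that pruning pass, assuming each activity is a dict carrying a score and a (subject, verb, object) identity (the dedup key is an assumption):

def prune(activities):
    # Step 5: when the same underlying event appears as two or more activities,
    # keep only the highest-scored copy.
    seen, kept = set(), []
    for act in sorted(activities, key=lambda a: a["score"], reverse=True):
        key = (act["subject"], act["verb"], act["object"])
        if key not in seen:
            seen.add(key)
            kept.append(act)
    return kept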

Aggregation, Step 6: Sort & Merge
Then we take the final set of activities and merge it on to the owner’s existing newsfeed.
(Or we create a new newsfeed if they don’t have one.)

Aggregation: Cleaning Up
We trim off the end of the newsfeed, so that feeds don't become arbitrarily large.
And then finally we store the feed into memcached.
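
A sketch of the merge-trim-store step, where `cache` stands in for a memcached client exposing get/set (as python-memcached does) and the feed is stored as JSON (an assumption):

import json

FEED_MAX = 1000   # assumed cap so newsfeeds don't grow arbitrarily large

def merge_into_feed(cache, owner_id, new_activities):
    # Step 6 and cleanup: merge fresh activities into the owner's existing
    # newsfeed (or start one), re-sort by score, trim the tail, write it back.
    key = "newsfeed:%d" % owner_id
    existing = json.loads(cache.get(key) or "[]")
    merged = sorted(existing + new_activities, key=lambda a: a["score"], reverse=True)
    cache.set(key, json.dumps(merged[:FEED_MAX]))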

Display
Display: Enter Rollups
To solve this problem we combine similar stories into rollups.

The feed is first divided into sections, so we don't end up with the entire feed in just a few very large rollups.
Then the similar stories are grouped together within the sections.
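
A sketch of that two-level grouping, with `section_of` (e.g. the story's calendar day) and `story_type` (e.g. "favorited a listing") as assumed helper functions:

from itertools import groupby

def rollup(stories, section_of, story_type):
    # First bucket stories into sections, then group similar stories inside each
    # section -- so one busy producer can't collapse the whole feed into a few
    # enormous rollups.
    sections = {}
    for story in stories:
        sections.setdefault(section_of(story), []).append(story)
    rolled = {}
    for section, items in sections.items():
        items.sort(key=story_type)
        rolled[section] = {t: list(group) for t, group in groupby(items, key=story_type)}
    return rolled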

Display: Filling in Stories
There are multiple layers of caching here. Things that are global (like the shop associated with a favorited listing) are cached separately from things that are unique to the person looking at the feed (like the exact way the story is phrased).

Hack #1: Cache Warming
The first thing we do to speed things up is run almost the entire pipeline proactively using Gearman.
So after aggregation we trigger a display run, even though nobody is there to look at the HTML.
The end result is that almost every pageview is against a hot cache.

Hack #2: TTL Caching
The second thing we do is add bits of TTL caching where few people will notice.
Straightforward but not done in many places on the site.
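
For example, a tiny TTL memoizer of the kind that could wrap those slow-but-rarely-noticed lookups (a sketch, not Etsy's actual caching layer):

import time

def ttl_cache(seconds):
    # Cache a function's result per argument tuple and reuse it until it is
    # `seconds` old, then recompute.
    def decorator(fn):
        store = {}
        def wrapper(*args):
            hit = store.get(args)
            if hit is not None and time.time() - hit[1] < seconds:
                return hit[0]
            value = fn(*args)
            store[args] = (value, time.time())
            return value
        return wrapper
    return decorator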

Hack #3: Judicious Associations
We also profiled the pages and meticulously simplified ORM usage.

Hack #4: Lazy Below the Fold
We don't load much at the outset. You get more as you scroll down (finite scrolling).

https://medium.com/@bansal_ankur/design-a-news-feed-system-6bf42e9f03fb
TODO:
http://highscalability.com/blog/2013/7/8/the-architecture-twitter-uses-to-deal-with-150m-active-users.html

http://webcache.googleusercontent.com/search?q=cache:QNP3ZdQYhfcJ:https://backchannel.org/blog/friendfeed-schemaless-mysql+&cd=1&hl=en&ct=clnk&gl=us
