Monday, May 23, 2016

How to Build a Recommendation System



https://www.packtpub.com/books/content/building-recommendation-engine-spark
A recommendation engine tries to model the connections between users and some type of item.
Types of recommendation models
The main families are content-based filtering, collaborative filtering, and ranking models, plus hybrids that incorporate elements of several methods into one model or a combination of models.

Content-based filtering
Content-based methods try to use the content or attributes of an item, together with some notion of similarity between two pieces of content, to recommend items similar to a given item.
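For instance, a minimal sketch (the Item type, its tag attributes, and the helper names are all hypothetical, not from the source) that ranks catalog items against a target item by Jaccard similarity over attribute tags:

// Content-based similarity: items are represented by attribute tags and
// compared with Jaccard similarity over the tag sets.
case class Item(id: String, tags: Set[String])

def jaccard(a: Set[String], b: Set[String]): Double =
  if (a.isEmpty && b.isEmpty) 0.0
  else (a intersect b).size.toDouble / (a union b).size

def similarItems(target: Item, catalog: Seq[Item], k: Int): Seq[(Item, Double)] =
  catalog.filter(_.id != target.id)
    .map(item => (item, jaccard(target.tags, item.tags)))
    .sortBy(-_._2)   // highest similarity first
    .take(k)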

Collaborative filtering

To generate recommendations for unknown items for a given user, we can use the known preferences of other users who exhibit similar behavior. We do this by selecting a set of similar users and computing some form of combined score based on the items they have shown a preference for. The overall logic is that if other users have tastes similar to ours and have shown a preference for a set of items, those items tend to be good candidates for recommendation.
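A minimal sketch of that combined score, assuming we already have each neighbor's similarity to the target user and their item ratings (all names are illustrative, and similarities are assumed positive):

// Score each candidate item by the similarity-weighted average of the
// neighbors' ratings for it.
def scoreItems(neighbours: Seq[(Double, Map[String, Double])]): Map[String, Double] = {
  val weighted = for {
    (sim, ratings) <- neighbours   // (similarity to target user, that user's ratings)
    (item, rating) <- ratings
  } yield (item, sim * rating, sim)
  weighted.groupBy(_._1).map { case (item, rows) =>
    item -> rows.map(_._2).sum / rows.map(_._3).sum // normalize by total similarity
  }
}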

Matrix factorization
Matrix factorization (or matrix completion) attempts to directly model this user-item matrix by representing it as a product of two smaller matrices of lower dimension. Thus, it is a dimensionality-reduction technique.

If we want to find a lower-dimensional (low-rank) approximation to our user-item matrix of dimension k, we end up with two matrices: one for users of size U x k and one for items of size I x k. These are known as factor matrices. If we multiply these two factor matrices, we reconstruct an approximate version of the original ratings matrix. Note that while the original ratings matrix is typically very sparse, each factor matrix is dense.
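In symbols (notation mine, matching the description above), with U users, I items, and rank k:

\mathbf{R}_{U \times I} \approx \mathbf{X}_{U \times k} \mathbf{Y}_{I \times k}^T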

Implicit matrix factorization

There are many different approaches to deal with implicit data. MLlib implements a particular approach that treats the input rating matrix as two matrices: a binary preference matrix, P, and a matrix of confidence weights, C.
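A common construction (the formulation from Hu, Koren, and Volinsky's implicit-feedback paper, on which MLlib's implementation is based) derives both matrices from the raw interaction counts r_{ui}:

p_{ui} = 1 \text{ if } r_{ui} > 0 \text{, else } 0, \qquad c_{ui} = 1 + \alpha r_{ui}

so every observed interaction becomes a positive preference whose confidence grows with the interaction count.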

ALS works by iteratively solving a series of least squares regression problems. In each iteration, one of the user- or item-factor matrices is treated as fixed, while the other one is updated using the fixed factor and the rating data. Then, the factor matrix that was solved for is, in turn, treated as fixed, while the other one is updated. This process continues until the model has converged (or for a fixed number of iterations).

import org.apache.spark.mllib.recommendation.{ALS, Rating}
// rawRatings is assumed to be an RDD of Array(user, movie, rating) string
// fields parsed from the raw ratings file.
val ratings = rawRatings.map { case Array(user, movie, rating) => Rating(user.toInt, movie.toInt, rating.toDouble) }
  • rank: The number of factors in our ALS model, that is, the number of hidden features in our low-rank approximation matrices. Generally, the greater the number of factors, the better, but this has a direct impact on memory usage, both for computation and for storing models for serving, particularly with large numbers of users or items. Hence, this is often a trade-off in real-world use cases. A rank in the range of 10 to 200 is usually reasonable.
  • iterations: The number of iterations to run. While each iteration in ALS is guaranteed to decrease the reconstruction error of the ratings matrix, ALS models converge to a reasonably good solution after relatively few iterations. So, we don't need to run for too many iterations in most cases (around 10 is often a good default).
  • lambda: This parameter controls the regularization of our model and thus guards against overfitting. The higher the value of lambda, the more regularization is applied. What constitutes a sensible value is very dependent on the size, nature, and sparsity of the underlying data, and, as with almost all machine learning models, the regularization parameter should be tuned using out-of-sample test data and cross-validation.
We'll use a rank of 50, 10 iterations, and a lambda of 0.01 to illustrate how to train our model:

// rank = 50, iterations = 10, lambda = 0.01
val model = ALS.train(ratings, 50, 10, 0.01)
model.userFeatures            // RDD of (user ID, factor array) pairs
model.userFeatures.count
model.productFeatures.count   // one factor array per item

trainImplicit
For implicit-feedback data, the model is trained with ALS.trainImplicit instead of ALS.train. The alpha parameter controls the baseline level of confidence weighting applied: the higher the value of alpha, the more confident the model is that missing data means no preference for the relevant user-item pair.
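A hedged usage sketch (same hyperparameters as above; here the Rating values are assumed to be implicit interaction counts such as play counts, rather than explicit ratings):

// Train on implicit feedback; the final argument is alpha.
val implicitModel = ALS.trainImplicit(ratings, 50, 10, 0.01, 1.0)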

This usually takes the form of a top-K list: the K items that our model predicts have the highest probability of being liked by the user. This is done by computing a predicted score for each item and ranking the list by this score.

In user-based approaches, the ratings of similar users on candidate items are used to compute recommendations for a user, while in item-based approaches, the computation is based on the similarity between items the user has rated and the candidate items.

In matrix factorization, because we are modeling the ratings matrix directly, the predicted score can be computed as the vector dot product between a user-factor vector and an item-factor vector.
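For example, a sketch using the model trained above (user 789 and item 123 mirror the predict call below):

// The predicted score is the dot product of the user- and item-factor vectors.
val userVec = model.userFeatures.lookup(789).head
val itemVec = model.productFeatures.lookup(123).head
val score = userVec.zip(itemVec).map { case (u, i) => u * i }.sum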

val predictedRating = model.predict(789, 123)  // predicted rating for user 789, item 123
val userId = 789
val K = 10
val topKRecs = model.recommendProducts(userId, K)
// movies is assumed to be an RDD of lines from the movie metadata file,
// with "|"-separated fields (ID, title, ...).
val titles = movies.map(line => line.split("\\|").take(2)).map(array => (array(0).toInt, array(1))).collectAsMap()
val moviesForUser = ratings.keyBy(_.user).lookup(789)
// Print the user's own top-rated movies alongside the top-K recommendations:
moviesForUser.sortBy(-_.rating).take(10).map(rating => (titles(rating.product), rating.rating)).foreach(println)
topKRecs.map(rating => (titles(rating.product), rating.rating)).foreach(println)

Item recommendations
Similarity is computed by comparing the vector representations of two items using some similarity measure. Common measures include the Pearson correlation and cosine similarity for real-valued vectors, and the Jaccard similarity for binary vectors.
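For instance, cosine similarity between two item-factor vectors can be computed in plain Scala (a sketch, no linear-algebra library assumed):

// Cosine similarity: dot product divided by the product of the norms;
// ranges from -1 to 1 for real-valued vectors.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  dot / norms
}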

These techniques aim to fill in the missing entries of a user-item association matrix.

Users and products are described by a small set of latent factors that can be used to predict missing entries. In particular, MLlib implements the alternating least squares (ALS) algorithm to learn these latent factors.

http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/
A powerful approach to implementing recommenders is the so-called "latent factor models", a special case of collaborative filtering techniques, which exploit the similarity between user tastes and item characteristics: if user A and user B are similar, then items liked by user A make good recommendations for user B.
The central step is to compute a low-rank factorization of the sparse rating matrix into a user matrix and an item matrix:
(Figure: example of a low-rank factorization of a sparse matrix.)
The result of this computation is a set of factors for each user and item that express how high the user and item score in a certain dimension (sometimes these dimensions can be found to correlate with intuitive concepts, such as movie/music genres). If a user and an item score high in the same factors, the user probably likes the item.

http://tech.hulu.com/blog/2011/09/19/recommendation-system.html
A recommendation system provides a solution when a lot of useful content becomes too much of a good thing. A recommendation engine can help users discover information of interest by analyzing historical behaviors. More and more online companies — including Netflix, Google, Facebook, and many others — are integrating a recommendation system into their services to help users discover and select information that may be of particular interest to them.

So the first goal of Hulu's recommendation system is to help users find content which will be of interest to them.

In addition to users, Hulu's recommendation system should also help content owners promote their videos. Part of our mission is to deliver a service that users, advertisers, and content owners all unabashedly love. We have many different content partners, and we understand that these partners want more Hulu users to watch their videos, especially when new videos are released.

Data Characteristics
First, we want to explain some characteristics of our data.
Since a lot of our content is comprised of episodes or clips within a show, we have decided to recommend shows to users instead of individual videos. Shows are a good method of organization, and videos in the same show are usually very closely related.
Our content is mainly divided into two parts: on-air shows and library shows. On-air shows are highly important, since more than half of our streaming comes from them.
Although on-air shows make up a large part of our content, they are subject to a seasonal effect. During summer months, most on-air shows do not air, causing on-air streaming to decrease. Furthermore, fewer shows air during weekends, so streaming of library shows increases. Keeping this in mind, we can design the recommendation system to, for example, recommend more library shows to users during weekends or summer months.
The key data that drives most recommendation systems is user behavior data. There are two main types of user behavior data: implicit user feedback data and explicit user feedback data. Explicit user feedback data primarily includes user voting data. Implicit feedback data includes information on users watching, browsing, searching, etc. Explicit feedback data can show a user's preference for a show explicitly, but implicit feedback data cannot. For example, if a user gives a 5-star rating to a show, we know that this user likes the show very much. But if a user only watches a video from a show page or searches for a show, we don't know whether this user likes the show.
As the quantity of implicit data at Hulu far outweighs the amount of explicit feedback, our system should be designed primarily to work with implicit feedback data.

http://ijcai13.org/files/tutorial_slides/td3.pdf

http://blog.gainlo.co/index.php/2016/05/24/design-a-recommendation-system/

Heuristic solution

Although machine learning (ML) is commonly used in building recommendation systems, it isn't the only solution. There are many cases where we want simpler approaches; for example, we may have very little data, or we may want to build a minimal solution quickly.
In such cases, we can start with some heuristic solutions. In fact, there are lots of hacks we can use to build a simple recommendation system. For instance, based on videos a user has watched, we can simply suggest videos from the same authors. We can also suggest videos with similar titles or labels. If we use popularity (number of comments, shares) as another signal, the recommendation system can work pretty well as a baseline.
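A minimal sketch of such a baseline (the Video type, the signals, and the weights below are all made up for illustration):

// Heuristic baseline: score a candidate by same-author overlap with the
// watch history, title-word overlap, and raw popularity.
case class Video(id: String, author: String, titleWords: Set[String], comments: Long, shares: Long)

def heuristicScore(candidate: Video, watched: Seq[Video]): Double = {
  val sameAuthor = if (watched.exists(_.author == candidate.author)) 1.0 else 0.0
  val titleOverlap =
    if (watched.isEmpty) 0.0
    else watched.map(w => (w.titleWords intersect candidate.titleWords).size).max.toDouble
  val popularity = math.log1p((candidate.comments + candidate.shares).toDouble)
  2.0 * sameAuthor + 1.0 * titleOverlap + 0.5 * popularity
}

Tuning such weights against held-out engagement data would be the natural next step.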

Collaborative filtering

In a nutshell, to recommend videos for a user, I can provide videos liked by similar users. For instance, if users A and B have watched a bunch of the same videos, it's highly likely that user A will like videos liked by B. Of course, there are many ways to define what "similar" means here. It could mean two users have liked the same videos; it could also mean they share the same location.
The above algorithm is called user-based collaborative filtering. Another version is called item-based collaborative filtering, which recommends videos (items) similar to those a user has already watched.

Feature engineering

In fact, mentioning collaborative filtering in a system design interview is not impressive at all since the algorithm is so common. What most interviewers care about is how to build the system specific to the interview question. So for Youtube video recommendation, what features can be used to build the recommendation system?
Usually, there are two types of features – explicit and implicit. Explicit features can be ratings, favorites, etc.; on Youtube, they can be the like/share/subscribe actions. Implicit features are less obvious: if a user has watched a video for only a couple of seconds, that is probably a negative signal. Given a list of recommended videos, if a user clicks one over another, it can mean he prefers the one he clicked. We usually need to explore implicit features quite a bit.
Back to the Youtube problem, several features are quite obvious:
  • Like/share/subscribe – as mentioned above, these are strong signals of a user's preferences.
  • Watch time
  • Video title/labels/categories
  • Freshness
It’s worth to note that when building machine learning systems, you have to experiment a lot with different combination of features so that you won’t know which one is good unless you give it a try.

Infrastructure

Another reason a recommendation system is a great system design interview question is that it can also be used to discuss infrastructure. The system clearly contains multiple steps/components, so how would you design the whole thing in terms of infrastructure?
Given that comparing similar users/videos can be time-consuming at Youtube's scale, this part should be done in offline pipelines. Therefore, we can divide the whole system into online and offline parts.
For the offline part, all the user models and videos need to be stored in distributed systems (storage could fill a whole article; this post covers the topic briefly). Pipelines that calculate similar users/videos also run regularly to keep the data updated. In fact, for most machine learning systems, it's common to use offline pipelines to process big data, since you can't expect the processing to finish within a few seconds.
For the online part, based on the user profile and recent actions (like videos just watched), we should be able to serve a list of recommended videos from the offline data. Normally, the system fetches more videos than needed and then does filtering and ranking on the fly. We can filter out videos that are obviously irrelevant, like videos the user has already watched, and then rank the remaining suggestions. Factors to consider include video popularity (share/comment/like counts), freshness, quality, and so on.
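A rough sketch of that online step (the Candidate type and the blended weights are hypothetical):

// Fetch more candidates than needed, drop already-watched videos,
// then rank the rest by a blended score on the fly.
case class Candidate(id: String, popularity: Double, freshness: Double)

def rankOnline(candidates: Seq[Candidate], watched: Set[String], k: Int): Seq[Candidate] =
  candidates
    .filterNot(c => watched.contains(c.id))                  // filter irrelevant items
    .sortBy(c => -(0.7 * c.popularity + 0.3 * c.freshness))  // blended ranking score
    .take(k)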
In reality, there are many ways to improve the system that we haven't covered yet. I'd like to briefly mention a few techniques:
  • Freshness can be a very important factor. We should figure out how to recommend fresh content.
  • Evaluation is an essential component of a recommendation system; it lets us understand how well the system works.
  • To train the collaborative filtering system, we may also include video position signals. Usually, videos ranked near the top have a much higher chance of being clicked.
http://www.importnew.com/19834.html
System recommendation: a recommendation engine based on aggregate behavior that gives every user the same recommendations. These can be static, set manually by a system administrator, or the currently popular items computed from feedback statistics across all users of the system.
Personalized recommendation: gives different users more precise recommendations according to their tastes and preferences. Here the system needs to understand the content to be recommended and the user's traits, or, via a social network, find users with the same tastes as the current user and recommend through them.
2.1 Goals of system recommendation
Recommend to all users the currently popular products (required), promotional/discounted products (optional), or newly released products (optional), in order to drive sales.
PS: Whether to recommend promotional and newly released products depends on our application. (TODO1)
2.2 Implementation
Implementation covers automatic recommendation by the system and manually configured recommendation.
(1) Factors for automatic recommendation include: product release time, product category, remaining inventory, historical purchase count, historical add-to-cart count, historical view count, size of price reduction, and so on. To be refined once we confirm which data is actually available. (TODO2)
(2) Manual configuration: provide an operations page where staff can set up recommendations, including ranking position, start and end times, promotional copy, and so on.
3.1 Goals of personalized recommendation
Give different users more precise recommendations according to their tastes and preferences. The system needs to understand the content to be recommended and the user's traits, or, via a social network, find users with the same tastes as the current user and recommend through them, in order to drive sales.
3.2 The three recommendation modes
By data source, a recommendation engine can follow three modes: demographic-based, content-based, and collaborative-filtering-based recommendation.
(1) Demographic-based recommendation: recommend according to the user's gender, age range, income, education, major, and occupation.
(2) Content-based recommendation: as in the figure below, users' attitudes toward items are not considered; movie C is recommended to user A merely because movie A and movie C are similar. This is the biggest difference from the collaborative filtering approach described next.
(Figure: content-based recommendation.)
(3) Collaborative-filtering-based recommendation: as in the figure below, we do not know whether item A and item D are similar; recommendations are based purely on users' preferences for items.
(Figure: collaborative-filtering-based recommendation.)
Choice of mode: the three modes can be used alone or combined. Given our actual situation, collaborative filtering is the better fit; whether to combine it with the other two modes can be decided later. Collaborative filtering does, however, raise the "cold start" problem, whose solutions are discussed below.
(1) Signals for judging user preference: purchase history, cart history, search history, browsing history, etc., to be refined once we confirm which data is available.
(2) A user's degree of preference for a product: different behaviors carry different score weights, e.g., purchase (10), add-to-cart (8), search (5), view (6). Once the preference signals are fixed, the weight of each factor needs to be designed sensibly.
(3) The final preference value for a product: combine the statistics of the user's different behaviors toward the product into a single preference score, so the preference is ultimately expressed as one number (see the sketch below).
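A minimal sketch of that scoring scheme (the behavior weights mirror the examples above; everything else is illustrative):

// Map each behavior type to a weight and sum per (user, product)
// to obtain a single preference score.
val behaviorWeights = Map("purchase" -> 10.0, "cart" -> 8.0, "search" -> 5.0, "view" -> 6.0)

// events: (userId, productId, behavior)
def preferenceScores(events: Seq[(Int, Int, String)]): Map[(Int, Int), Double] =
  events.groupBy { case (u, p, _) => (u, p) }.map { case (key, evs) =>
    key -> evs.map { case (_, _, b) => behaviorWeights.getOrElse(b, 0.0) }.sum
  }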
"Cold start" refers to the initial stage of many recommendation engines: when a new user enters the system or a new item is added, there is not yet enough user data to compute a recommendation model, so the recommendation function fails.
Possible solutions include:
(1) Use registration information for an initial recommendation, mainly demographic information and the interests the user describes, to pre-set the user's preference profile.
(2) On the user's first visit, present some items and ask the user to rate them, then build initial personalized recommendations from that feedback.
(3) Invite domain experts to classify and annotate new users or new items.
(4) Random recommendation. In practice, the simplest and most direct approach to cold start is random recommendation, but it is rather risky.
(5) Mean-value method. Take each item's average rating as the predicted value for items the user has not rated, fill in the original rating matrix, then find the target user's nearest neighbors on the filled matrix and apply collaborative filtering to produce recommendations. Mean filling is only a passive stopgap, though: the probability that a new user's preference for an item exactly equals the average of other users' ratings is very small. (See the sketch below.)
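A small sketch of the mean-value filling step (illustrative names; nearest-neighbor search and collaborative filtering would then run on the filled matrix):

// Fill missing (user, item) ratings with the item's average rating.
def fillWithItemMeans(
    ratings: Map[(Int, Int), Double], // known (user, item) -> rating
    users: Seq[Int],
    items: Seq[Int]): Map[(Int, Int), Double] = {
  val itemMeans = ratings.groupBy { case ((_, item), _) => item }
    .map { case (item, rs) => item -> rs.values.sum / rs.size }
  (for { u <- users; i <- items }
    yield (u, i) -> ratings.getOrElse((u, i), itemMeans.getOrElse(i, 0.0))).toMap
}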
http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
Matrix factorization techniques are usually more effective because they allow us to discover the latent features underlying the interactions between users and items. Of course, matrix factorization is simply a mathematical tool for playing around with matrices, and is therefore applicable in many scenarios where one would like to find out something hidden in the data.

The idea of matrix factorization is to find two (or more) matrices such that when you multiply them, you get back the original matrix.

The task of predicting the missing ratings can be considered as filling in the blanks (the hyphens in the matrix) such that the values are consistent with the existing ratings in the matrix.

Firstly, we have a set U of users, and a set D of items. Let \mathbf{R} of size |U| \times |D| be the matrix that contains all the ratings that the users have assigned to the items. Also, we assume that we would like to discover K latent features. Our task, then, is to find two matrices \mathbf{P} (a |U| \times K matrix) and \mathbf{Q} (a |D| \times K matrix) such that their product approximates \mathbf{R}:

\mathbf{R} \approx \mathbf{P} \times \mathbf{Q}^T = \hat{\mathbf{R}}

In this way, each row of \mathbf{P} represents the strength of the associations between a user and the features. Similarly, each row of \mathbf{Q} represents the strength of the associations between an item and the features. To predict the rating of an item d_j by a user u_i, we can calculate the dot product of the two vectors corresponding to u_i and d_j:

\hat{r}_{ij} = p_i q_j^T = \sum_{k=1}^{K} p_{ik} q_{jk}

Now, we have to find a way to obtain \mathbf{P} and \mathbf{Q}. One way to approach this problem is to first initialize the two matrices with some values, calculate how 'different' their product is from \mathbf{R}, and then try to minimize this difference iteratively. Such a method is called gradient descent, aiming at finding a local minimum of the difference.
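Concretely, with squared error on each known rating and a learning rate \alpha, the update rules derived in the linked tutorial are:

e_{ij} = r_{ij} - \hat{r}_{ij} = r_{ij} - \sum_{k=1}^{K} p_{ik} q_{jk}

p_{ik} \leftarrow p_{ik} + 2 \alpha \, e_{ij} q_{jk}, \qquad q_{jk} \leftarrow q_{jk} + 2 \alpha \, e_{ij} p_{ik}

These updates are applied for every observed rating until the total error converges.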

https://en.wikipedia.org/wiki/Matrix_decomposition
Matrix decomposition or matrix factorization is a factorization of a matrix into a product of matrices.

