http://engineeringblog.yelp.com/2015/12/how-we-made-yelp-search-filters-data-driven.html
Yelp's Search Design
Building a model to recommend filters
We wanted to build a model that takes a set of informative features, such as the query, date and time, location, personal preferences, and a variety of other features, and suggests the most relevant filters to show to our users.
One of the most important features here is the query string: what the user is searching for will have a huge impact on the filters they plan to use. We can see this easily from our data. For example, searching for “birthday dinners” or “romantic dinners” often leads to the make a reservation filter. When people search for “coffee shops”, they often want to look for free-wifi or outdoor-seating.
Unfortunately, the query text is a super-sparse, long-tailed feature with high cardinality (we receive tens of millions of distinct queries a year across more than 30 countries), making it difficult to engineer and feed into a model. We wanted to build a function that maps the query text to a single number telling us how relevant the query is to a particular filter. Ideally this function should be continuous and smooth: “Birthday dinner” and “Best birthday dinners” should produce very similar scores. The simplest approach is to look at our click data, see which filters people use for which queries, and generate a signal from that. However, this is not very flexible, as click data can be very sparse and drops off quickly after the top queries. In addition, approximately 10–15% of queries are ones we have never seen before. This is where language models become useful.
A language model can be thought of as a function that takes as input a sequence of words and returns a probability (likelihood) estimate of that sequence. If we can build a language model for every search filter, we can use it to calculate the posterior probability of each filter being used given the query, and thus decide which filters are most relevant to the query. To calculate the posterior probability P(filter | query), we use a simple application of Bayes' rule (see diagram above). Intuitively, we ask how likely the query is to be generated by a particular filter's language model versus how common the query is overall in the language of Yelp queries.
In the diagram above, for the search “after work bars”, HappyHour and OutdoorSeating receive positive scores in our model and are deemed relevant, whereas GoodForBrunch and GoodForKids are deemed not relevant.
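The per-filter scoring idea described above can be sketched with toy unigram language models and add-alpha smoothing. The filter names match the example, but the training queries, smoothing scheme, and background model here are all invented for illustration; a production system would train on real click data:

```python
import math
from collections import Counter

# Hypothetical training data: queries observed alongside each filter click.
filter_queries = {
    "HappyHour": ["after work bars", "cheap drinks", "bars happy hour"],
    "GoodForKids": ["family restaurants", "kid friendly food"],
}

def train_unigram(queries):
    """Word counts and total word count for a bag of queries."""
    counts = Counter(w for q in queries for w in q.split())
    return counts, sum(counts.values())

models = {f: train_unigram(qs) for f, qs in filter_queries.items()}

# Background model over all queries: the P(query) denominator in Bayes' rule.
bg_counts, bg_total = train_unigram(
    [q for qs in filter_queries.values() for q in qs]
)

def log_score(query, flt, alpha=1.0):
    """log P(query | filter) - log P(query), with add-alpha smoothing.
    Positive => the query is more likely under this filter's language
    model than under the background model of all Yelp queries."""
    counts, total = models[flt]
    vocab = len(bg_counts) + 1  # +1 for unseen words
    score = 0.0
    for w in query.split():
        p_f = (counts[w] + alpha) / (total + alpha * vocab)
        p_bg = (bg_counts[w] + alpha) / (bg_total + alpha * vocab)
        score += math.log(p_f) - math.log(p_bg)
    return score
```

On this toy data, "after work bars" scores positive under HappyHour and negative under GoodForKids, mirroring the relevant/not-relevant split in the diagram.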
Data Analysis
We launched the new filter redesign as an experiment and waited for results to roll in. The data analysis here is rather tricky; we completely rebuilt the filter panel and needed to carefully define which metrics to measure.
- Filter Engagement: Increase user engagement with Yelp search filters because they help our users discover relevant content faster.
- Search Quality: Improve the search experience by providing more relevant content through recommending better filters.
Filter Engagement
The first metric appears simple: loosely speaking, we want to increase the total number of filter clicks normalized by the total number of searches in each cohort. But as with all data analysis, there are many caveats to consider.
But how do we measure “search experience”? The most frequently used success metrics are click-based (e.g., click-through rate (CTR), mean average precision (MAP), or reciprocal rank (RR)). While clicks are easy to measure, they can often be the most deceptive metrics: just because a user clicks on a result doesn’t make it relevant. Instead, we use a number of other metrics that couple more closely to how users interact with the site. For example, this includes engagement metrics on the business page following a search and the time it takes for a user to find relevant results.
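For reference, the click-based metrics named above are simple to compute from click logs. A minimal sketch of reciprocal rank and its mean over a set of searches, with invented session data:

```python
def reciprocal_rank(clicked_positions):
    """RR for one search: 1 / rank of the first clicked result
    (1-indexed); 0 if nothing was clicked."""
    return 1.0 / min(clicked_positions) if clicked_positions else 0.0

def mean_reciprocal_rank(sessions):
    """Average RR over a list of searches (each a list of clicked ranks)."""
    return sum(reciprocal_rank(s) for s in sessions) / len(sessions)

# Three searches: first click at rank 1, first click at rank 4, no click.
mrr = mean_reciprocal_rank([[1, 3], [4], []])  # (1 + 0.25 + 0) / 3
```

As the text notes, a high MRR only says users clicked near the top, not that the results were actually relevant.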
Yelp's Search Design
On Yelp, additions of locations and comments should be far fewer than searches, so this should be framed as a read-heavy distributed system.
From what I can find, the company's tech stack is LAMP plus ELK.
Adding and removing user and business records is naturally a job for a relational database.
MySQL can do distance search based on longitude and latitude, sharded by quad region. Of course, the quad regions should be distributed evenly (non-contiguously) across multiple DB clusters, but urban growth is hard to control, and the shard distribution can very easily become extremely uneven.
So I think a better approach may be to put the quad regions into a NoSQL store: Cassandra, MongoDB, or HBase. Use the region's geohash ID with the string reversed as the key: wx5ew9ue becomes eu9we5xw. Each region record holds all the businesses inside it (including their tags).
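The key reversal can be sketched as follows. Geohashes of nearby cells share a common prefix, so reversing the string before hashing keeps adjacent regions from piling onto one shard; the MD5-based shard assignment below is an illustrative assumption, not any particular store's scheme:

```python
import hashlib

def shard_key(geohash: str) -> str:
    """Reverse the geohash so that nearby cells (which share a prefix)
    no longer cluster in the keyspace; e.g. 'wx5ew9ue' -> 'eu9we5xw'."""
    return geohash[::-1]

def shard_for(geohash: str, num_shards: int) -> int:
    # Hash the reversed key onto a shard (hypothetical stable hashing).
    digest = hashlib.md5(shard_key(geohash).encode()).hexdigest()
    return int(digest, 16) % num_shards
```

Note that some stores (e.g. Cassandra with a random partitioner) already hash the partition key, in which case the reversal mainly matters for range scans or ordered partitioners.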
Quadtree cell sizes will also vary. Rural or wilderness cells can be very large, while cities are divided finely, and the busy districts of big cities finer still. In particular, a skyscraper packed with businesses makes it all the more necessary that regions are not uniformly sized.
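A capacity-based quadtree produces exactly this uneven cell structure: a cell subdivides only when it holds more than a threshold number of businesses, so dense districts end up with small cells and sparse areas stay coarse. A minimal sketch, with the capacity, depth limit, and coordinate system all illustrative:

```python
CAPACITY = 4    # max businesses per leaf before it splits (illustrative)
MAX_DEPTH = 12  # guard against unbounded subdivision

class Node:
    def __init__(self, x, y, w, h, depth=0):
        self.x, self.y, self.w, self.h, self.depth = x, y, w, h, depth
        self.points = []      # (lon, lat, business_id) in this leaf
        self.children = None  # four sub-quadrants once split

    def insert(self, px, py, biz):
        if self.children is not None:
            self._child_for(px, py).insert(px, py, biz)
            return
        self.points.append((px, py, biz))
        if len(self.points) > CAPACITY and self.depth < MAX_DEPTH:
            self._split()

    def _split(self):
        # Create four equal quadrants and push existing points down.
        hw, hh, d = self.w / 2, self.h / 2, self.depth + 1
        self.children = [
            Node(self.x,      self.y,      hw, hh, d),
            Node(self.x + hw, self.y,      hw, hh, d),
            Node(self.x,      self.y + hh, hw, hh, d),
            Node(self.x + hw, self.y + hh, hw, hh, d),
        ]
        for px, py, biz in self.points:
            self._child_for(px, py).insert(px, py, biz)
        self.points = []

    def _child_for(self, px, py):
        i = (1 if px >= self.x + self.w / 2 else 0) \
          + (2 if py >= self.y + self.h / 2 else 0)
        return self.children[i]
```

Inserting a cluster of businesses into one corner splits that corner repeatedly while the other quadrants stay whole, which is the uneven division described above.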
The quadtree may also need to record special cases, such as unusual edge relationships: a large river separating two banks, the top versus the bottom of a mountain, a national border, or an island versus the mainland. Yelp's recommendations may need to change accordingly.
A location search should use the quadtree to find the neighboring regions, gather all businesses in those regions that match the tags, merge them with other information from the cache in front of MySQL, and return the result to the user.
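That lookup flow might look like the sketch below. The region data, detail cache, and `neighbors` stub are all invented placeholders standing in for the quadtree neighbor walk and the cache in front of MySQL:

```python
# region_id -> list of (business_id, tags) stored with the region record
regions = {
    "wx5ew9ue": [("b1", {"coffee", "wifi"}), ("b2", {"bar"})],
    "wx5ew9ug": [("b3", {"coffee"})],
}
# Stand-in for the cache fronting MySQL: business_id -> detail row
detail_cache = {"b1": {"name": "Cafe A"}, "b2": {"name": "Bar B"},
                "b3": {"name": "Cafe C"}}

def neighbors(region_id):
    # Placeholder: a real system would walk the quadtree / geohash
    # neighbors of region_id; here we simply return every known region.
    return list(regions)

def search(region_id, tag):
    """Collect tag-matching businesses from nearby regions, then merge
    in the cached detail rows before returning them to the user."""
    results = []
    for rid in neighbors(region_id):
        for biz_id, tags in regions.get(rid, []):
            if tag in tags:
                results.append({**detail_cache[biz_id], "id": biz_id})
    return results
```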
Users' likes and comments can be written asynchronously into the relational database.
Meanwhile, titles, descriptions, and comments can be indexed in a search engine such as Elasticsearch, with some word-level processing done before text search.
Why should likes and comments go into a relational database rather than NoSQL?
To extend that with another question: for something like YouTube, where like and view counts update very frequently, is an RDBMS still appropriate?
Mainly I feel the volume isn't large, and there is also the moderation problem.
If each comment's who and when go into the relational database, and assuming comments aren't too numerous, writing to an RDBMS is still feasible.
Like counts can simply be folded into a single int.
Each comment needs to be written as its own record, since there is the deletion problem, e.g., users posting spam ads ("Wealthy Arab widow seeking an heir, please call ...").
If you have other approaches, I'd love to hear them in detail.