Massive Technical Interviews Tips: Pinterest - Building the interests platform

Sunday, December 13, 2015

Pinterest - Building the interests platform

https://engineering.pinterest.com/blog/building-interests-platform

In contrast with conventional methods of generating such data, which rely primarily on machine learning and data mining techniques, our system relies heavily on human curation. The ultimate goal is to build a system that’s both machine and human powered, creating a feedback mechanism by which human curated data helps drive improvements in our machine algorithms, and vice versa.

Raw input to the system includes existing data about Pins, boards, Pinners, and search queries, as well as explicit human curation signals about interests. With this data, we’re able to construct a continuously evolving interest dictionary, which provides the foundation to support other key components, such as interest feeds, interest recommendations, and related interests.

Generating the interest dictionary

We generated an initial collection of interests by extracting frequently occurring n-grams from Pin and board descriptions, as well as board titles, and filtering these n-grams using custom built grammars. While this approach provided a high coverage set of interests, we found many terms to be malformed phrases. For instance, we would extract phrases such as “lamborghini yellow” instead of “yellow lamborghini”. This proved problematic because we wanted interest terms to represent how Pinners would describe them, and so, we employed a variety of methods to eliminate malformed interests terms.

We first compared terms with repeated search queries performed by a group of Pinners over a few months. Intuitively, this criterion matches well with the notion that an interest should be an entity for which a group of Pinners are passionate.

Later we filtered the candidate set through public domain ontologies like Wikipedia titles. These ontologies were primarily used to validate proper nouns as opposed to common phrases, as all available ontologies represented only a subset of possible interests. This is especially true for Pinterest, where Pinners themselves curate special interests like “mid century modern style.”

Finally, we also maintain an internal blacklist to filter abusive words and x-rated terms as well as Pinterest specific stop words, like “love”. This filtering is especially important to interest terms which might be recommended to millions of users.

We arrived at a fair quality collection of interests following the above algorithmic approaches. In order to understand the quality of our efforts, we gave a 50,000 term subset of our collection to a third party vendor which used crowdsourcing to rate our data. To be rigorous, we composed a set of four criteria by which users would evaluate candidate Interests terms:

- Is it English?

- Is it a valid phrase in grammar?

- Is it a standalone concept?

- Is it a proper name?

This type of effort, however, is not easy to scale. The real solution is to allow Pinners to provide both implicit and explicit signals to help us determine the validity of an interest. Implicit signals behaviors like clicking and viewing, while explicit signals include asking Pinners to specifically provide information (which can be actions like a thumbs up/thumbs down, starring, or skipping recommendations).

To capture all the signals used for defining the collections of terms, we built a dictionary that stores all the data associated with each interest, including invalid interests and the reason why it’s invalid. This service plays a key role in human curation, by aggregating signals from different people. On top of this dictionary service, we can build different levels of reviewing system.

Identifying Pinner interests

With the Interests dictionary, we can associate Pins, boards, and Pinners with representative interests. One of the initial ways we experimented with this was launching a preview of a page where Pinners can explore their interests.

Calculating related interests

Related interests are an important way of enabling the ability to browse interests and discover new ones. To compute related interests, we simply combine the co-occurrence relationship for interests computed at Pin and board levels.

Powering interest feeds

Interests feeds provide Pinners with a continuous feed of Pins that are highly related. Our feeds are populated using a variety of sources, including search and through our annotation pipeline. A key property of the feed is flow. Only feeds with decent flow can attract Pinners to come back repeatedly, thereby maintaining high engagement. In order to optimize for our feeds, we’ve utilized a number of real-time indexing and retrieval systems, including real-time search, real-time annotating, and also human curation for some of the interests.

To ensure quality, we need to guarantee quality from all sources. For that purpose, we measure the engagement of Pins from each source and address quality issue accordingly.