http://engineering.ifttt.com/data/2015/10/14/data-infrastructure/
http://www.infoq.com/cn/news/2015/11/ifttt-data-infrastructure
Data Sources
There are three sources of data at IFTTT that are crucial for understanding the behavior of our users and the performance of our Channels.
First, there’s a MySQL cluster on AWS RDS that maintains the current state of our primary application entities like users, Channels, and Recipes, along with their relations. IFTTT.com and our mobile apps run on a Rails application backed by this cluster. This data gets exported to S3 and ingested into Redshift daily using AWS Data Pipeline.
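As a minimal sketch of the load step, assuming psycopg2 and hypothetical cluster, table, bucket, and IAM role names (the real orchestration is handled by AWS Data Pipeline), the daily ingest boils down to a Redshift COPY from S3:

```python
# Hypothetical daily load of a MySQL export from S3 into Redshift.
# All identifiers (cluster host, table, bucket, role) are placeholders.
import psycopg2

COPY_SQL = """
    COPY users_snapshot
    FROM 's3://example-export-bucket/mysql/users/2015/10/14/'
    CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/RedshiftCopy'
    FORMAT AS CSV GZIP;
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="example",
)
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)  # one bulk load per table per day
conn.close()
```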
Next, as users interact with IFTTT products, we feed event data from our Rails application into our Kafka cluster.
Lastly, in order to help monitor the behavior of the hundreds of partner APIs that IFTTT connects to, we collect information about the API requests that our workers make when running Recipes. This includes metrics such as response time and HTTP status codes, and it all gets funneled into our Kafka cluster.
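For a sense of what one of these API events might look like on its way into Kafka, here is a small producer sketch using the kafka-python client. The topic and field names are assumptions, and the real producers live in the Rails application; this Python version is purely illustrative:

```python
# Illustrative producer for API-request metrics; "api_events" and the
# event fields are hypothetical names, not IFTTT's actual schema.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

def record_api_call(channel, status_code, response_ms):
    """Emit one metric event for an API request made while running a Recipe."""
    producer.send("api_events", {
        "channel": channel,
        "status_code": status_code,
        "response_time_ms": response_ms,
        "timestamp": int(time.time()),
    })

record_api_call("instagram", 200, 142)
producer.flush()  # ensure the event reaches the brokers before exiting
```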
Kafka at IFTTT
We use Kafka as our data transport layer to achieve loose coupling between data producers and consumers. In this architecture, Kafka acts as an abstraction layer between the producers and consumers in the system: instead of pushing data directly to consumers, producers push data to Kafka, and consumers read it from there. This makes adding new data consumers trivial.
Because Kafka acts as a log-based event stream, consumers keep track of their own position in the event stream. This enables consumers to operate in two modes: real-time and batch. It also allows consumers to go back and re-read data they have already consumed, which is helpful when an error makes reprocessing necessary.
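A minimal consumer sketch, again assuming kafka-python and the hypothetical topic above: with auto-commit disabled, the consumer checkpoints its own position, and reprocessing is just a seek back to an earlier offset:

```python
# Consumer that tracks its own position in the log; the group, topic, and
# partition layout are assumptions for the sketch.
from kafka import KafkaConsumer, TopicPartition

def process(value):
    """Stand-in for real handling (indexing, archiving, ...)."""

consumer = KafkaConsumer(
    bootstrap_servers=["kafka1:9092"],
    group_id="example-consumer",
    enable_auto_commit=False,       # we checkpoint explicitly below
)
partition = TopicPartition("api_events", 0)
consumer.assign([partition])

# Real-time mode: tail the log from the last committed position.
for message in consumer:
    process(message.value)
    consumer.commit()               # checkpoint only after successful handling
    break                           # stop after one message so the sketch ends

# Batch/replay mode: rewind and reprocess everything from the beginning.
consumer.seek(partition, 0)
```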
Once the data is in Kafka, we can use it for all types of purposes. Batch consumers send a copy of this data to S3 in hourly batches using Secor. Real-time consumers push data to an Elasticsearch cluster using a library we hope to open source soon.
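Secor itself is a Java service, but the shape of the hourly batch step is easy to sketch, here with boto3 and a hypothetical archive bucket. Note the date-based key layout, which comes up again in the lessons below:

```python
# Hypothetical hourly flush of consumed events to S3 under YYYY/MM/DD keys.
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")

def flush_batch(topic, messages):
    """Write one hour's worth of newline-delimited JSON events."""
    now = datetime.now(timezone.utc)
    key = "{}/{:%Y/%m/%d}/{:%H}00.json".format(topic, now, now)
    s3.put_object(Bucket="example-event-archive", Key=key,
                  Body=b"\n".join(messages))

flush_batch("api_events", [b'{"status_code": 200}', b'{"status_code": 502}'])
```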
Real-time Monitoring and Alerting
API events are stored in Elasticsearch for real-time monitoring and alerting. We use Kibana to visualize the performance of our worker processes and the performance of partner APIs in real time.
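As a hedged example of the kind of question this setup answers, here is how a client could count one partner’s 5xx responses over the last hour with elasticsearch-py; the hourly index naming scheme and field names are assumptions:

```python
# Count recent server errors for one Channel; querying a single hourly
# index keeps the search cheap (see the lessons below).
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es1:9200"])

resp = es.search(
    index="api_events-2015.10.14.05",   # one hourly index, not a wildcard
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"channel": "instagram"}},
                    {"range": {"status_code": {"gte": 500}}},
                ]
            }
        },
        "size": 0,  # only the count matters here, not the documents
    },
)
print(resp["hits"]["total"])
```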
IFTTT partners have access to the Developer Channel, a special Channel that triggers when their API is having issues. They can create Recipes using the Developer Channel that notify them via the action Channel of their choice (SMS, Email, Slack, etc.).
Lessons Learned
- Separation between producers and consumers through a data transport layer like Kafka is pure bliss, and makes the data pipeline much more resilient. For example, a few slow consumers won’t impact the performance of the other consumers or producers.
- Use a date-based folder structure (YYYY/MM/DD) to store event data in permanent storage (S3 in our case). Event data stored this way is easy to process. For example, if you want to read a particular day’s data, you only need to read from one directory, as shown in the first sketch after this list.
- Similar to the above, create time-based indexes (e.g., hourly) in Elasticsearch. That way, if you query Elasticsearch for all API errors in the last hour, it can find the answer by looking at a single index, which makes the query much more efficient.
- Rather than pushing individual events to Elasticsearch, push events in batches (based on a time duration and/or number of events). This helps limit I/O; see the Elasticsearch sketch after this list.
- Depending on the type of data and the queries you run, it is important to tune the number of nodes, the number of shards, the maximum size of each shard, and the replication factor in Elasticsearch.
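The payoff of the date-based layout from the storage bullet above: reading one day back out of S3 is a single prefix listing. A sketch assuming boto3 and the hypothetical archive bucket from earlier:

```python
# List one day's hourly batch files by prefix; the bucket name is hypothetical.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket="example-event-archive",
                               Prefix="api_events/2015/10/14/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])   # each key is one hourly batch for that day
```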
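And a sketch tying the Elasticsearch bullets together, assuming elasticsearch-py: create an hourly index with explicit shard and replica settings, then index a buffer of events with one bulk request instead of one request per event. The shard and replica counts are placeholders, not recommendations:

```python
# Hourly index with explicit sizing, plus batched writes via the bulk helper.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(["http://es1:9200"])
index = "api_events-2015.10.14.06"

# Placeholder sizing; the right values depend on event volume and queries.
es.indices.create(index=index, body={
    "settings": {"number_of_shards": 4, "number_of_replicas": 1},
})

buffer = [
    {"channel": "instagram", "status_code": 200, "response_time_ms": 142},
    {"channel": "dropbox", "status_code": 502, "response_time_ms": 3014},
]
# One bulk request for the whole buffer keeps per-event I/O low.
bulk(es, ({"_index": index, "_source": event} for event in buffer))
```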