Sunday, March 31, 2019

ALT - Aggregator Leaf Tailer



https://rockset.com/blog/aggregator-leaf-tailer-an-architecture-for-live-analytics-on-event-streams/

ALT: Real-Time Analytics Without Pipelines

The ALT architecture addresses the shortcomings of Lambda architectures. The key component of ALT is a high-performance serving layer that serves complex queries, not just key-value lookups. The existence of this serving layer obviates the need for complex data pipelines.
[Figure: the ALT architecture]
The ALT architecture has three components:
  1. The Tailer pulls new incoming data from a static or streaming source into an indexing engine. Its job is to fetch from all data sources, be it a data lake, like S3, or a dynamic source, like Kafka or Kinesis.
  2. The Leaf is a powerful indexing engine. It indexes all data as and when it arrives via the Tailer. The indexing component builds multiple types of indexes—inverted, columnar, document, geo, and many others—on the fields of a data set. The goal of indexing is to make any query on any data field fast.
  3. The scalable Aggregator tier is designed to deliver low-latency aggregations, be it columnar aggregations, joins, relevance sorting, or grouping. The Aggregators leverage indexing so efficiently that complex logic typically executed by pipeline software in other architectures can be executed on the fly as part of the query.
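The three roles above can be sketched in a few lines of Python. This is only a toy, in-memory illustration, not Rockset's implementation: the class names `Leaf`, `Tailer`, and `Aggregator`, the `sum_by` method, and the sample events are all hypothetical, and plain dictionaries stand in for a real indexing engine.

```python
from collections import defaultdict


class Leaf:
    """Toy indexing engine: keeps documents plus an inverted and a columnar index."""

    def __init__(self):
        self.docs = []                      # document store
        self.inverted = defaultdict(set)    # (field, value) -> set of doc ids
        self.columns = defaultdict(dict)    # field -> {doc id: value}

    def index(self, doc):
        # Index every field of the document as soon as it arrives.
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field, value in doc.items():
            self.inverted[(field, value)].add(doc_id)
            self.columns[field][doc_id] = value
        return doc_id


class Tailer:
    """Pulls records from any iterable source and hands them to a Leaf."""

    def __init__(self, leaf):
        self.leaf = leaf

    def tail(self, source):
        count = 0
        for record in source:
            self.leaf.index(record)
            count += 1
        return count


class Aggregator:
    """Answers a filter + group-by + sum query using the Leaf's indexes."""

    def __init__(self, leaf):
        self.leaf = leaf

    def sum_by(self, filter_field, filter_value, group_field, sum_field):
        # The inverted index finds matching documents without a full scan;
        # the columnar index supplies the grouping and summed values.
        matching = self.leaf.inverted.get((filter_field, filter_value), set())
        groups = self.leaf.columns[group_field]
        values = self.leaf.columns[sum_field]
        totals = defaultdict(float)
        for doc_id in matching:
            if doc_id in groups and doc_id in values:
                totals[groups[doc_id]] += values[doc_id]
        return dict(totals)


# Hypothetical raw events; in practice the source would be S3, Kafka, Kinesis, etc.
events = [
    {"type": "purchase", "country": "US", "amount": 20.0},
    {"type": "view", "country": "US", "amount": 0.0},
    {"type": "purchase", "country": "DE", "amount": 5.0},
    {"type": "purchase", "country": "US", "amount": 7.5},
]
leaf = Leaf()
ingested = Tailer(leaf).tail(events)
totals = Aggregator(leaf).sum_by("type", "purchase", "country", "amount")
print(ingested, totals)
```

Note that no pipeline stage transforms the events before they are stored: the filter, grouping, and sum all happen when the query runs.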

The ALT architecture enables the app developer or data scientist to run low-latency queries on raw data sets without any prior transformation. A large portion of the data transformation process can occur as part of the query itself. How is this possible in the ALT architecture?
  1. Indexing is critical to making queries fast. The Leaves maintain a variety of indexes concurrently, so that data can be quickly accessed regardless of the type of query—aggregation, key-value, time series, or search. Every document and field is indexed, including both value and type of each field, resulting in fast query performance that allows significantly more complex data processing to be inserted into queries.
  2. Queries are distributed across a scalable Aggregator tier. The ability to scale the number of Aggregators, which provide compute and memory resources, allows compute power to be concentrated on any complex processing executed on the fly.
  3. The Tailer, Leaf, and Aggregator run as discrete microservices in disaggregated fashion. Each Tailer, Leaf, or Aggregator tier can be independently scaled up and down as needed. The system scales Tailers when there is more data to ingest, scales Leaves when data size grows, and scales Aggregators when the number or complexity of queries increases. This independent scalability allows the system to bring significant resources to bear on complex queries when needed, while making it cost-effective to do so.
The most significant difference between the two architectures is that the Lambda architecture performs data transformations up front so that results are pre-materialized, while the ALT architecture allows querying on demand with on-the-fly transformations.
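As a rough illustration of query-on-demand with on-the-fly transformation, the sketch below keeps raw events untransformed on several leaf shards and applies the transformation (bucketing timestamps by hour) only when a query runs. The aggregator scatters the query to every shard and merges the partial results. The names `LEAF_SHARDS`, `hour_bucket`, `query_leaf`, and `aggregate` are assumptions made for this example, and a thread pool stands in for a real distributed Aggregator tier.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Raw events are stored exactly as they arrived, spread over several leaf shards.
LEAF_SHARDS = [
    [{"ts": "2019-03-31T10:15:00", "status": "200"},
     {"ts": "2019-03-31T10:45:00", "status": "500"}],
    [{"ts": "2019-03-31T11:05:00", "status": "200"},
     {"ts": "2019-03-31T10:59:59", "status": "200"}],
]


def hour_bucket(event):
    """Transformation applied at query time: truncate the timestamp to the hour."""
    return event["ts"][:13]


def query_leaf(shard, transform, predicate):
    """Each leaf returns a partial aggregate computed over its own shard."""
    return Counter(transform(e) for e in shard if predicate(e))


def aggregate(shards, transform, predicate, workers=4):
    """Scatter the query to every leaf, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda s: query_leaf(s, transform, predicate), shards)
        merged = Counter()
        for partial in partials:
            merged.update(partial)
    return dict(merged)


# Count successful requests per hour, with no pre-materialized hourly table.
ok_per_hour = aggregate(LEAF_SHARDS, hour_bucket, lambda e: e["status"] == "200")
print(ok_per_hour)
```

In a Lambda-style design the hourly counts would be computed ahead of time by a pipeline; here, changing the bucket size or the filter only means passing a different function to `aggregate`.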

Why ALT Makes Sense Today
While not as widely known as the Lambda architecture, the ALT architecture has been in existence for almost a decade, employed mostly on high-volume systems.
  • Facebook’s Multifeed architecture has been using the ALT methodology since 2010, backed by the open-source RocksDB engine, which allows large data sets to be indexed efficiently.
  • LinkedIn’s FollowFeed was redesigned in 2016 to use the ALT architecture. Their previous architecture, like the Lambda architecture, used a pre-materialization approach, also called fan-out-on-write, where results were precomputed and made available for simple lookup queries. LinkedIn's new ALT architecture uses a query-on-demand, or fan-out-on-read, model built on RocksDB indexing instead of Lucene indexing. Much of the computation is done on the fly, which gives developers greater speed and flexibility.
  • Rockset uses RocksDB as a foundational data store and implements the ALT architecture (see white paper) in a cloud service.
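The difference between fan-out-on-write and fan-out-on-read can be shown with a small feed example. This is a simplified sketch, not how Multifeed or FollowFeed are actually built: `FOLLOWS`, `POSTS`, and both functions are hypothetical, and an in-memory dictionary stands in for a RocksDB-backed index.

```python
from collections import defaultdict
import heapq

FOLLOWS = {"alice": ["bob", "carol"], "dave": ["bob"]}   # follower -> followees
POSTS = [("bob", 1, "hello"), ("carol", 2, "hi"), ("bob", 3, "again")]  # (author, ts, text)


def fan_out_on_write(posts, follows):
    """Pre-materialize: copy every post into each follower's feed at write time."""
    followers = defaultdict(list)
    for follower, followees in follows.items():
        for followee in followees:
            followers[followee].append(follower)
    feeds = defaultdict(list)
    for author, ts, text in posts:
        for follower in followers[author]:
            feeds[follower].append((ts, author, text))
    return feeds


def fan_out_on_read(user, posts, follows, limit=10):
    """Query on demand: index posts by author once, assemble the feed at read time."""
    by_author = defaultdict(list)          # stand-in for a per-author index
    for author, ts, text in posts:
        by_author[author].append((ts, author, text))
    candidates = (p for followee in follows.get(user, []) for p in by_author[followee])
    return heapq.nlargest(limit, candidates)   # newest first


feeds = fan_out_on_write(POSTS, FOLLOWS)
alice_precomputed = sorted(feeds["alice"], reverse=True)
alice_on_read = fan_out_on_read("alice", POSTS, FOLLOWS)
print(alice_on_read)
```

Both approaches return the same feed. The write-time version stores one copy of each post per follower, while the read-time version stores each post once and does the merging when the feed is requested.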


The ALT architecture clearly has the performance, scale, and efficiency to handle real-time use cases at some of the largest online companies. Why was it not used more widely until recently? The short answer is that “indexing” software has traditionally been costly, and not commercially viable, when data size is large. That ruled out many smaller organizations from pursuing an ALT, query-on-demand approach in the past. But the current state of technology, namely the combination of powerful indexing software built on open-source RocksDB and favorable cloud economics, has made ALT not only commercially feasible today but also an elegant architecture for real-time data processing and analytics.

https://engineering.linkedin.com/blog/2016/03/followfeed--linkedin-s-feed-made-faster-and-smarter

