Wednesday, February 10, 2016

Linkedin Databus



https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system
LinkedIn has a diverse ecosystem of specialized data storage and serving systems. Primary OLTP data stores take user-facing writes and some reads. Other specialized systems serve complex queries or accelerate query results through caching. For example, search queries are served by a search index system which needs to continually index the data in the primary database.
This leads to a need for reliable, transactionally consistent change capture from primary data sources to derived data systems throughout the ecosystem.

The Databus transport layer provides end-to-end latencies in milliseconds and handles throughput of thousands of change events per second per server while supporting infinite lookback capabilities and rich subscription functionality.

In this architecture, systems such as the Search Index and Read Replicas act as Databus consumers using the client library. When a write occurs to a primary OLTP database, the relays connected to that database pull the change in. The Databus consumer embedded in the search index or cache pulls it from the relay (or bootstrap) and updates the index or cache accordingly. This keeps the index up to date with the state of the source database.
  • Source-independent: Databus supports change data capture from multiple sources including Oracle and MySQL. The Oracle adapter is included in our open-source release. We plan to open source the MySQL adapter soon.
  • Scalable and highly available: Databus scales to thousands of consumers and transactional data sources while being highly available.
  • Transactional in-order delivery: Databus preserves transactional guarantees of the source database and delivers change events grouped in transactions, in source commit order.
  • Low latency and rich subscription: Databus delivers events to consumers within milliseconds of the changes being available from the source. Consumers can also retrieve specific partitions of the stream using server-side filtering in Databus.
  • Infinite lookback: One of the most innovative components of Databus is the ability to support infinite lookback for consumers. When a consumer needs to generate a downstream copy of the entire data (for example a new search index), it can do so without putting any additional load on the primary OLTP database. This also helps consumers when they fall significantly behind the source database.

The Databus system comprises relays, a bootstrap service, and a client library. The Relay fetches committed changes from the source database and stores the events in a high-performance log store. The Bootstrap Service stores a moving snapshot of the source database by periodically applying the change stream from the Relay. Applications use the Databus Client Library to pull the change stream from the Relay or Bootstrap and process the change events in Consumers that implement a callback API defined by the library.
Fast-moving consumers retrieve events from the Databus relay. If a consumer falls behind such that the data it requests is no longer in the Relay's log, it is delivered a consolidated snapshot of the changes that have occurred since the last change it processed. If a new consumer with no prior copy of the dataset shows up, it is given a snapshot of the data (consistent as of some point in time) by the Bootstrap Service, after which it continues catching up from the Databus relay.
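This fallback logic can be sketched as a small decision function over sequence numbers; the class and method names below are illustrative, not the actual Databus client API.

```java
// Illustrative sketch of how a Databus-style client could decide where to
// pull from next; all names here are hypothetical, not the real API.
public class PullSourceChooser {
    public enum Source { RELAY, BOOTSTRAP_SNAPSHOT, BOOTSTRAP_CATCHUP }

    /**
     * @param consumerScn    last SCN the consumer processed (-1 = brand new)
     * @param relayOldestScn oldest SCN still retained in the relay's log
     */
    public static Source choose(long consumerScn, long relayOldestScn) {
        if (consumerScn < 0) {
            // New consumer with no prior copy: full snapshot first.
            return Source.BOOTSTRAP_SNAPSHOT;
        }
        if (consumerScn < relayOldestScn) {
            // Fell behind the relay's retention window: consolidated catch-up.
            return Source.BOOTSTRAP_CATCHUP;
        }
        return Source.RELAY; // fast consumer stays on the low-latency path
    }
}
```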

http://www.slideshare.net/amywtang/introduction-to-databus
The Consequence of Specialization in Data Systems
Data Flow is essential
Data consistency is critical!

Two Ways
1. Dual writes
  • Application code writes to both the database and the pub-sub system
  • Easy on the surface
  • Consistent? Not guaranteed: without a distributed transaction, the two writes can diverge on partial failure
2. Log mining
  • Extract changes from the database commit log

  • Logical clocks attached to the source
  • Physical offsets could be used for internal transport
  • Simplifies data portability
  • Pull model: restarts are simple
  • Derived State = f(Source state, Clock)
  • + Idempotence = Timeline Consistent!
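The last two bullets can be made concrete with a toy applier: derived state is a function of the source and a logical clock (SCN), and checking the clock before applying makes replayed events no-ops, which is what makes restarts simple. All names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "Derived State = f(Source state, Clock)": events carry an SCN,
// and skipping anything at or below the last applied SCN makes re-delivery
// after a restart harmless (idempotence => timeline consistency).
public class IdempotentApplier {
    private final Map<String, String> derivedState = new HashMap<>();
    private long lastAppliedScn = -1;

    /** Applies the change only if it is newer than what we have already seen. */
    public boolean apply(long scn, String key, String value) {
        if (scn <= lastAppliedScn) {
            return false; // duplicate from a replay: safe to skip
        }
        derivedState.put(key, value);
        lastAppliedScn = scn;
        return true;
    }

    public String get(String key) { return derivedState.get(key); }
    public long lastScn() { return lastAppliedScn; }
}
```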

  • Isolate fast consumers from slow consumers
  • Workload separation between online, catch-up, and bootstrap
  • Isolate sources from consumers
    • Schema changes
    • Physical layout changes
    • Speed mismatch
  • Schema-aware
    • Filtering, projections
    • Typically network-bound, so there is spare CPU to burn on filtering

Requirements
  • Timeline consistency
  • Guaranteed, at-least-once delivery
  • Low latency
  • Schema evolution
  • Source independence
  • Scalable consumers
  • Handle slow/new consumers without affecting healthy ones (look-back requirements)

Four Logical Components
  • Fetcher: fetch from the DB, a relay, …
  • Log Store: store a log snippet
  • Snapshot Store: store a moving snapshot of the data
  • Subscription Client: orchestrate the pull across these

The Relay
  • Change event buffering (~2-7 days)
  • Low latency (10-15 ms)
  • Filtering, projection
  • Hundreds of consumers per relay
  • Scale-out and high availability through redundancy
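The relay's log store can be sketched as a bounded, SCN-ordered buffer: appends evict the oldest events once the retention window (the 2-7 days above) is exceeded, and reads return everything after a consumer's checkpoint in commit order. A minimal in-memory sketch with illustrative names:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy relay-style log store: capacity stands in for time-based retention.
public class RelayLogStore {
    public record Event(long scn, String payload) {}

    private final Deque<Event> log = new ArrayDeque<>();
    private final int capacity;

    public RelayLogStore(int capacity) { this.capacity = capacity; }

    public void append(Event e) {
        if (log.size() == capacity) {
            log.removeFirst(); // evict oldest; lagging consumers must bootstrap
        }
        log.addLast(e);
    }

    /** Events strictly after the consumer's checkpoint, in commit order. */
    public List<Event> readAfter(long scn) {
        List<Event> out = new ArrayList<>();
        for (Event e : log) {
            if (e.scn() > scn) out.add(e);
        }
        return out;
    }

    public long oldestScn() { return log.isEmpty() ? -1 : log.peekFirst().scn(); }
}
```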

Option 1: Peered Deployment
Option 2: Clustered Deployment

The Bootstrap Service
  • Catch-all for slow / new consumers
  • Isolate the source OLTP instance from large scans
  • Log Store + Snapshot Store
  • Optimizations
    • Periodic merge
    • Predicate push-down
  • Catch-up versus full bootstrap
  • Guaranteed progress for consumers via chunking
  • Implementations
    • Database (MySQL)
    • Raw files
  • Bridges the continuum between stream and batch systems
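The "periodic merge" optimization can be illustrated as folding buffered log entries into the moving snapshot, advancing the snapshot's consistency point to the highest merged SCN. A toy sketch, not the actual MySQL or raw-file implementations:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy bootstrap store: a log of recent changes keyed by SCN plus a snapshot;
// merge() folds the log into the snapshot so the snapshot stays consistent
// "as of" the highest merged SCN. All names here are illustrative.
public class BootstrapStore {
    private final Map<String, String> snapshot = new HashMap<>();
    private final TreeMap<Long, String[]> log = new TreeMap<>(); // scn -> {key, value}
    private long snapshotScn = -1;

    public void appendToLog(long scn, String key, String value) {
        log.put(scn, new String[] { key, value });
    }

    /** Folds all buffered log entries into the snapshot in SCN order. */
    public void merge() {
        for (Map.Entry<Long, String[]> e : log.entrySet()) {
            snapshot.put(e.getValue()[0], e.getValue()[1]);
            snapshotScn = e.getKey();
        }
        log.clear();
    }

    public String snapshotGet(String key) { return snapshot.get(key); }
    public long snapshotScn() { return snapshotScn; }
}
```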

The Consumer Client Library
  • Glue between the Databus infrastructure and the business logic in the consumer
  • Isolates the consumer from changes in the Databus layer
  • Switches between relay and bootstrap as needed
  • API
    • Callbacks with transactions
    • Iterators over windows
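The callback shape can be shown with a stripped-down stand-in: the real library delivers a DbusEvent plus an Avro decoder and also exposes transaction-window callbacks, but the essential contract is a consumer interface the library invokes per event. Everything below is simplified placeholder code, not the actual Databus types:

```java
// Simplified stand-in for the consumer callback API; not the real interfaces.
public class CallbackDemo {
    public record DbusEvent(long scn, byte[] payload) {}

    public interface DatabusConsumer {
        void onDataEvent(DbusEvent event);
    }

    // A consumer that maintains derived state, as a search indexer would.
    public static class CountingConsumer implements DatabusConsumer {
        private int eventsSeen = 0;
        @Override public void onDataEvent(DbusEvent event) { eventsSeen++; }
        public int eventsSeen() { return eventsSeen; }
    }

    /** The library's pull loop, reduced to feeding n events to a consumer. */
    public static void feed(DatabusConsumer c, int n) {
        for (int i = 1; i <= n; i++) {
            c.onDataEvent(new DbusEvent(i, new byte[0]));
        }
    }
}
```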

Fetcher Implementations
  • Oracle: trigger-based
  • MySQL: custom-storage-engine based
  • In the labs
    • Alternative implementations for Oracle
    • OpenReplicator integration for MySQL

Meta-data Management
  • Event definition, serialization, and transport: Avro
  • Oracle, MySQL: Avro definition generated from the table schema
  • Schema evolution
    • Only backwards-compatible changes allowed
    • Isolation between upgrades on producer and consumer
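The "only backwards-compatible changes" rule can be illustrated with a toy field-level check: a newer reader schema may add fields only if they carry defaults, so events serialized with the older writer schema still decode. This is a deliberate simplification of Avro's actual schema-resolution rules:

```java
import java.util.Map;
import java.util.Set;

// Toy backwards-compatibility check, not Avro's real resolver.
public class SchemaCompatCheck {
    /**
     * @param writerFields   fields present in the producer's (older) schema
     * @param readerDefaults reader fields mapped to whether each has a default
     */
    public static boolean backwardsCompatible(Set<String> writerFields,
                                              Map<String, Boolean> readerDefaults) {
        for (Map.Entry<String, Boolean> f : readerDefaults.entrySet()) {
            boolean writtenByProducer = writerFields.contains(f.getKey());
            boolean hasDefault = f.getValue();
            if (!writtenByProducer && !hasDefault) {
                return false; // new field with no default: old events can't decode
            }
        }
        return true;
    }
}
```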

Scaling the consumers (Partitioning)
  • Server-side filtering
    • Range, mod, hash
    • Allows the client to control the partitioning function
  • Consumer groups
    • Distribute partitions evenly across a group
    • Move partitions to available consumers on failure
    • Minimize re-processing
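Mod-based server-side filtering can be sketched as hashing each event's key into a bucket and delivering the event only to the consumer subscribed to that bucket; the class below is illustrative, not the real Databus filter API:

```java
// Toy mod-partition filter: a relay evaluating this per subscription would
// ship each event to exactly one bucket's consumers. Names are hypothetical.
public class ModPartitionFilter {
    private final int numBuckets;
    private final int myBucket;

    public ModPartitionFilter(int numBuckets, int myBucket) {
        this.numBuckets = numBuckets;
        this.myBucket = myBucket;
    }

    /** True if this consumer's subscription should receive the event. */
    public boolean accepts(String key) {
        // floorMod keeps the bucket non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numBuckets) == myBucket;
    }
}
```

Every key lands in exactly one bucket, so a consumer group covering all buckets sees the whole stream with no overlap.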

Development with Databus – Client Library
onDataEvent(DbusEvent, Decoder)

register(consumers, sources, filter)
start()
shutdown()

// configure
DatabusHttpClientImpl.Config clientConfig =
    new DatabusHttpClientImpl.Config();
clientConfig.loadFromFile("mydbus", "mdbus.props");
DatabusHttpClientImpl client =
    new DatabusHttpClientImpl(clientConfig);
client.registerDatabusStreamListener(callback,
    null, "com.linkedin.events.member2.MemberProfile");
// start the client library
client.startAndBlock();
https://github.com/linkedin/databus/wiki
Databus is a low-latency change capture system which has become an integral part of LinkedIn's data processing pipeline. Databus addresses a fundamental requirement to reliably capture, transport, and process primary data changes. Databus provides the following features:
  1. Isolation between sources and consumers
  2. Guaranteed in-order, at-least-once delivery with high availability
  3. Consumption from an arbitrary time point in the change stream, including a full bootstrap of the entire dataset
  4. Partitioned consumption
  5. Source consistency preservation
