https://engineering.linkedin.com/data-replication/open-sourcing-databus-linkedins-low-latency-change-data-capture-system
http://www.slideshare.net/amywtang/introduction-to-databus
The Consequence of Specialization in Data Systems
Data Flow is essential
Data Consistency is critical !!!
Two Ways
1. Dual Writes
Application code dual writes to database and pub-sub system
Easy on the surface
Consistent?
2. Log mining
Extract changes from database commit log
Logical clocks attached to the source
Physical offsets could be used for internal transport
Simplifies data portability
Pull model
Restarts are simple
Derived State = f (Source state, Clock)
+ Idempotence = Timeline Consistent!
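To make the idempotence point concrete, here is a minimal sketch (not Databus code; the class and event shape are hypothetical) of a consumer that applies change events keyed by the source's logical clock and skips anything at or below its last checkpoint, so an at-least-once stream replayed after a restart leaves the derived state unchanged.

// Illustrative sketch (not the Databus API): applying change events idempotently,
// keyed by the logical clock (SCN) attached at the source.
import java.util.HashMap;
import java.util.Map;

public class IdempotentApplier {
    // Hypothetical change event: logical clock value plus the row's key and new value.
    public static final class ChangeEvent {
        final long scn;        // logical clock attached at the source
        final String key;
        final String value;
        ChangeEvent(long scn, String key, String value) {
            this.scn = scn; this.key = key; this.value = value;
        }
    }

    private final Map<String, String> derivedState = new HashMap<>();
    private long lastAppliedScn = -1;  // persisted as a checkpoint in a real system

    // Applying the same event twice leaves the state unchanged, so an at-least-once
    // stream replayed from the last checkpoint stays timeline consistent:
    // derived state = f(source state, clock).
    public void apply(ChangeEvent e) {
        if (e.scn <= lastAppliedScn) {
            return;  // already applied; replay after a restart is a no-op
        }
        derivedState.put(e.key, e.value);
        lastAppliedScn = e.scn;
    }
}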
Isolate fast consumers from slow consumers
Workload separation between online, catch-up, bootstrap
Isolate sources from consumers
Schema changes
Physical layout changes
Speed mismatch
Schema-aware
Filtering, Projections
Typically network-bound, so spare CPU can be spent on filtering and projections
Requirements
Timeline consistency
Guaranteed, at least once delivery
Low latency
Schema evolution
Source independence
Scalable consumers
Handle slow/new consumers without affecting healthy ones (look-back requirements)
Four Logical Components
Fetcher
Fetch from db, relay…
Log Store
Store log snippet
Snapshot Store
Store moving data snapshot
Subscription Client
Orchestrate pull across these
The Relay
Change event buffering (~ 2 – 7 days)
Low latency (10-15 ms)
Filtering, Projection
Hundreds of consumers per relay
Scale-out, High-availability through redundancy
Option 1: Peered Deployment
Option 2: Clustered Deployment
The Bootstrap Service
Catch-all for slow / new consumers
Isolate source OLTP instance from large scans
Log Store + Snapshot Store
Optimizations
Periodic merge
Predicate push-down
Catch-up versus full bootstrap
Guaranteed progress for consumers via chunking
Implementations
Database (MySQL)
Raw Files
Bridges the continuum between stream and batch systems
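To illustrate the chunking optimization listed above, here is a sketch of a key-ordered, chunked scan over a snapshot store; the table name and columns are assumptions, not the actual bootstrap store layout. Because each chunk is bounded and resumes from the last acknowledged key, a consumer always makes progress and can safely re-request a chunk.

// Illustrative sketch (assumed schema): serving a bootstrap snapshot in key-ordered
// chunks so a consumer can checkpoint after each chunk and resume where it left off.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ChunkedSnapshotReader {
    private static final int CHUNK_SIZE = 1000;

    // Reads one chunk of rows with key > lastKey from a hypothetical snapshot table
    // (MySQL-style LIMIT). Returns the largest key seen so the caller can checkpoint
    // and request the next chunk.
    public static long readChunk(Connection conn, long lastKey) throws SQLException {
        String sql = "SELECT row_key, payload FROM snapshot " +
                     "WHERE row_key > ? ORDER BY row_key LIMIT ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lastKey);
            ps.setInt(2, CHUNK_SIZE);
            try (ResultSet rs = ps.executeQuery()) {
                long maxKey = lastKey;
                while (rs.next()) {
                    maxKey = rs.getLong("row_key");
                    // hand rs.getBytes("payload") to the consumer callback here
                }
                return maxKey;  // checkpoint; re-reading the same chunk is harmless
            }
        }
    }
}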
The Consumer Client Library
Glue between Databus infra and business logic in the consumer
Isolates the consumer from changes in the databus layer.
Switches between relay and bootstrap as needed
API
Callback with transactions
Iterators over windows
Fetcher Implementations
Oracle
Trigger-based
MySQL
Custom-storage-engine based
In Labs
Alternative implementations for Oracle
OpenReplicator integration for MySQL
Meta-data Management
Event definition, serialization and transport
Avro
Oracle, MySQL
Avro definition generated from the table schema
Schema evolution
Only backwards-compatible changes allowed
Isolation between upgrades on producer and consumer
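As an example of the "only backwards-compatible changes allowed" rule, the sketch below uses Avro's compatibility checker to verify that a new reader schema (which adds a field with a default) can still decode events written with the old schema. The MemberProfile schema strings are made up for illustration; Databus generates its Avro definitions from the table schema.

// Illustrative sketch: checking backwards compatibility of an evolved Avro schema.
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class SchemaEvolutionCheck {
    public static void main(String[] args) {
        Schema v1 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"MemberProfile\",\"fields\":[" +
            "{\"name\":\"memberId\",\"type\":\"long\"}]}");
        // v2 adds a field with a default value: a backwards-compatible change.
        Schema v2 = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"MemberProfile\",\"fields\":[" +
            "{\"name\":\"memberId\",\"type\":\"long\"}," +
            "{\"name\":\"headline\",\"type\":\"string\",\"default\":\"\"}]}");

        SchemaCompatibility.SchemaPairCompatibility result =
            SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
        System.out.println(result.getType());  // COMPATIBLE: v2 readers can read v1 events
    }
}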
Scaling the consumers (Partitioning)
Server-side filtering
Range, mod, hash
Allows client to control partitioning function
Consumer groups
Distribute partitions evenly across a group
Move partitions to available consumers on failure
Minimize re-processing
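The mod-based variant of the server-side filters listed above can be sketched as follows; this is illustrative partitioning logic, not the Databus filter classes. Each consumer in a group supplies the total bucket count and its own bucket, and only events whose key falls in that bucket are delivered to it.

// Illustrative sketch (not the Databus filter API): mod-based partition assignment,
// the kind of predicate a relay can evaluate server-side for each consumer.
public class ModPartitionFilter {
    private final int totalBuckets;  // e.g. 8 partitions across a consumer group
    private final int myBucket;      // bucket assigned to this consumer

    public ModPartitionFilter(int totalBuckets, int myBucket) {
        this.totalBuckets = totalBuckets;
        this.myBucket = myBucket;
    }

    // True if this consumer should receive the event with the given key.
    public boolean accepts(long key) {
        return Math.floorMod(key, (long) totalBuckets) == myBucket;
    }

    public static void main(String[] args) {
        ModPartitionFilter filter = new ModPartitionFilter(8, 3);
        System.out.println(filter.accepts(11));  // 11 % 8 == 3 -> true
        System.out.println(filter.accepts(12));  // 12 % 8 == 4 -> false
    }
}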
Development with Databus – Client Library
onDataEvent(DbusEvent, Decoder)
…
…
register(consumers, sources, filter)
start()
shutdown()
// configure
DatabusHttpClientImpl.Config clientConfig =
    new DatabusHttpClientImpl.Config();
clientConfig.loadFromFile("mydbus", "mdbus.props");
DatabusHttpClientImpl client =
    new DatabusHttpClientImpl(clientConfig);
client.registerDatabusStreamListener(callback,
    null, "com.linkedin.events.member2.MemberProfile");

// start client library
client.startAndBlock();
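The callback registered above would be a consumer implementation along these lines. The class and method names follow the open-source Databus client library, but the exact package paths and signatures may vary by version, so treat this as a sketch rather than a drop-in.

// Sketch of a consumer callback that could serve as the `callback` registered above.
import com.linkedin.databus.client.consumer.AbstractDatabusCombinedConsumer;
import com.linkedin.databus.client.pub.ConsumerCallbackResult;
import com.linkedin.databus.client.pub.DbusEventDecoder;
import com.linkedin.databus.core.DbusEvent;

public class MemberProfileConsumer extends AbstractDatabusCombinedConsumer {
    @Override
    public ConsumerCallbackResult onDataEvent(DbusEvent event, DbusEventDecoder decoder) {
        // Decode the Avro payload and update the derived store (index, cache, ...).
        // Returning SUCCESS lets the client library advance its checkpoint past this event.
        return ConsumerCallbackResult.SUCCESS;
    }

    @Override
    public ConsumerCallbackResult onBootstrapEvent(DbusEvent event, DbusEventDecoder decoder) {
        // Events delivered during catch-up / full bootstrap are handled the same way.
        return onDataEvent(event, decoder);
    }
}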
https://github.com/linkedin/databus/wiki
LinkedIn has a diverse ecosystem of specialized data storage and serving systems. Primary OLTP data stores take user-facing writes and some reads. Other specialized systems serve complex queries or accelerate query results through caching. For example, search queries are served by a search index system that needs to continually index the data in the primary database.
This leads to a need for reliable, transactionally consistent change capture from primary data sources to derived data systems throughout the ecosystem.
The Databus transport layer provides end-to-end latencies in milliseconds and handles throughput of thousands of change events per second per server while supporting infinite lookback capabilities and rich subscription functionality.
As shown above, systems such as the Search Index and Read Replicas act as Databus consumers using the client library. When a write occurs to a primary OLTP database, the relays connected to that database pull in the change. The Databus consumer embedded in the search index or cache then pulls it from the relay (or bootstrap) and updates the index or cache accordingly. This keeps the index up to date with the state of the source database.
- Source-independent: Databus supports change data capture from multiple sources including Oracle and MySQL. The Oracle adapter is included in our open-source release. We plan to open source the MySQL adapter soon.
- Scalable and highly available: Databus scales to thousands of consumers and transactional data sources while being highly available.
- Transactional in-order delivery: Databus preserves transactional guarantees of the source database and delivers change events grouped in transactions, in source commit order.
- Low latency and rich subscription: Databus delivers events to consumers within milliseconds of the changes being available from the source. Consumers can also retrieve specific partitions of the stream using server-side filtering in Databus.
- Infinite lookback: One of the most innovative components of Databus is the ability to support infinite lookback for consumers. When a consumer needs to generate a downstream copy of the entire data (for example a new search index), it can do so without putting any additional load on the primary OLTP database. This also helps consumers when they fall significantly behind the source database.
As depicted, the Databus system comprises relays, a bootstrap service, and the client library. The Relay fetches committed changes from the source database and stores the events in a high-performance log store. The Bootstrap Service stores a moving snapshot of the source database by periodically applying the change stream from the Relay. Applications use the Databus Client Library to pull the change stream from the Relay or Bootstrap and process the change events in Consumers that implement a callback API defined by the library.
Fast-moving consumers retrieve events from the Databus relay. If a consumer falls behind such that the data it requests is no longer in the Relay's log, it is delivered a consolidated snapshot of the changes that have occurred since the last change it processed. If a new consumer with no prior copy of the dataset shows up, it is given a snapshot of the data (consistent as of some point in time) from the Bootstrap service, after which it continues catching up from the Databus relay.
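The switching behavior described above can be summarized as a small decision rule. The sketch below is conceptual (the real client library drives this off its checkpoint and the relay/bootstrap protocol), but it captures when each source is used.

// Conceptual sketch: choosing between relay, bootstrap catch-up, and full bootstrap
// based on the consumer's last processed SCN and the oldest SCN still in the relay.
public class PullSourceChooser {
    enum Source { RELAY, BOOTSTRAP_CATCHUP, BOOTSTRAP_SNAPSHOT }

    // lastProcessedScn < 0 means the consumer has no prior copy of the data.
    static Source choose(long lastProcessedScn, long oldestScnInRelay) {
        if (lastProcessedScn < 0) {
            return Source.BOOTSTRAP_SNAPSHOT;  // new consumer: full snapshot first
        }
        if (lastProcessedScn < oldestScnInRelay) {
            return Source.BOOTSTRAP_CATCHUP;   // fell out of the relay's log: consolidated delta
        }
        return Source.RELAY;                   // normal case: low-latency online stream
    }

    public static void main(String[] args) {
        System.out.println(choose(-1, 5000));    // BOOTSTRAP_SNAPSHOT
        System.out.println(choose(1200, 5000));  // BOOTSTRAP_CATCHUP
        System.out.println(choose(9000, 5000));  // RELAY
    }
}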
Databus is a low-latency change capture system that has become an integral part of LinkedIn's data processing pipeline. Databus addresses a fundamental requirement to reliably capture, transport, and process primary data changes. Databus provides the following features:
- Isolation between sources and consumers
- Guaranteed in-order, at-least-once delivery with high availability
- Consumption from an arbitrary point in time in the change stream, including full bootstrap of the entire dataset
- Partitioned consumption
- Source consistency preservation