Saturday, April 13, 2019

Stream and Batch Processing Frameworks



https://puncsky.com/hacking-the-software-engineer-interview

Why such frameworks? 

  • Process high-throughput data at low latency.
  • Provide fault tolerance in distributed systems.
  • Offer a generic abstraction to serve volatile business requirements.
  • Handle both bounded data sets (batch processing) and unbounded data sets (stream processing).

Brief history of batch/stream processing 

  1. Hadoop and MapReduce. Google made batch processing in a distributed system as simple as result = pairs.map(pair => morePairs).reduce(somePairs => lessPairs) (see the Spark sketch after this list for a concrete example).
  2. Apache Storm and DAG topologies. MapReduce does not express iterative algorithms efficiently, so Nathan Marz abstracted stream processing into a graph of spouts and bolts.
  3. Spark and in-memory computing. Reynold Xin said Spark sorted the same data 3X faster using 10X fewer machines than Hadoop did.
  4. Google Dataflow, based on MillWheel and FlumeJava. Google supports both batch and streaming computing with its windowing API.
  5. Flink. Flink became popular thanks to its fast adoption of the Google Dataflow/Beam programming model and its highly efficient implementation of Chandy-Lamport checkpointing.
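To make the map/reduce abstraction above concrete, here is a minimal word-count sketch using Spark's Scala RDD API; the file paths and the local master setting are placeholders for illustration, not part of the original article:

    // A minimal sketch of the map/reduce pattern with Spark's Scala RDD API.
    // "input.txt" and "output" are placeholder paths; local[*] runs Spark in-process.
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))
        sc.textFile("input.txt")                      // bounded data set
          .flatMap(line => line.split("\\s+"))        // map each line to words
          .map(word => (word, 1))                     // pairs
          .reduceByKey(_ + _)                         // reduce pairs to fewer pairs
          .saveAsTextFile("output")
        sc.stop()
      }
    }

The shape is the same map-pairs-then-reduce expression quoted in item 1; Spark's advantage is that intermediate results stay in memory instead of being written back to disk between stages.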

How? 

Architectural Choices 

To meet the requirements above with commodity machines, streaming frameworks use distributed systems built in one of these architectures:
  • master-slave (centralized): Apache Storm with ZooKeeper, Apache Samza with YARN (see the Storm sketch below).
  • P2P (decentralized): Apache S4.
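For the centralized case, here is a minimal sketch of what the client hands to the master: a DAG of spouts and bolts. In production, StormSubmitter would ship it to the Nimbus master, which schedules workers through ZooKeeper; LocalCluster below runs everything in one JVM. It assumes Storm 2.x, and the spout/bolt classes and sample sentences are illustrative:

    import org.apache.storm.{Config, LocalCluster}
    import org.apache.storm.spout.SpoutOutputCollector
    import org.apache.storm.task.TopologyContext
    import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
    import org.apache.storm.topology.base.{BaseBasicBolt, BaseRichSpout}
    import org.apache.storm.tuple.{Fields, Tuple, Values}
    import scala.util.Random

    // Source node of the DAG: emits random sentences forever.
    class SentenceSpout extends BaseRichSpout {
      private var out: SpoutOutputCollector = _
      private val sentences = Array("the cow jumped over the moon", "an apple a day")
      def open(conf: java.util.Map[String, AnyRef], ctx: TopologyContext, collector: SpoutOutputCollector): Unit =
        out = collector
      def nextTuple(): Unit = out.emit(new Values(sentences(Random.nextInt(sentences.length))))
      def declareOutputFields(d: OutputFieldsDeclarer): Unit = d.declare(new Fields("sentence"))
    }

    // Downstream node of the DAG: splits each sentence into words.
    // Further bolts (e.g. a counter) would be chained the same way.
    class SplitBolt extends BaseBasicBolt {
      def execute(t: Tuple, out: BasicOutputCollector): Unit =
        t.getStringByField("sentence").split("\\s+").foreach(w => out.emit(new Values(w)))
      def declareOutputFields(d: OutputFieldsDeclarer): Unit = d.declare(new Fields("word"))
    }

    object SentenceTopology {
      def main(args: Array[String]): Unit = {
        val builder = new TopologyBuilder
        builder.setSpout("sentences", new SentenceSpout)
        builder.setBolt("split", new SplitBolt).shuffleGrouping("sentences")
        // StormSubmitter.submitTopology(...) would send this DAG to the Nimbus master;
        // LocalCluster runs it locally for testing.
        val cluster = new LocalCluster
        cluster.submitTopology("sentence-topology", new Config, builder.createTopology())
        Thread.sleep(10000)   // let the topology run briefly
        cluster.shutdown()
      }
    }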

Features 

  1. DAG topologies for iterative processing, e.g. GraphX in Spark, topologies in Apache Storm, the DataStream API in Flink.
  2. Delivery guarantees. How reliably is data delivered between nodes: at-least-once, at-most-once, or exactly-once?
  3. Fault tolerance, using cold/warm/hot standby, checkpointing, or active-active replication.
  4. Windowing API for unbounded data sets, e.g. stream windows in Apache Flink, Spark window functions, and windowing in Apache Beam (see the Flink sketch after this list).
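To illustrate the windowing and checkpointing features together, here is a minimal sketch with Flink's Scala DataStream API (Flink 1.x); the socket source on port 9999 and the 5-second window / 10-second checkpoint interval are arbitrary choices for the example:

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    object WindowedWordCount {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // Chandy-Lamport-style snapshots every 10 s back Flink's exactly-once guarantee.
        env.enableCheckpointing(10000)

        env.socketTextStream("localhost", 9999)                        // unbounded data set
          .flatMap(_.toLowerCase.split("\\s+"))
          .map((_, 1))
          .keyBy(_._1)
          .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))   // 5-second tumbling windows
          .sum(1)
          .print()

        env.execute("windowed word count")
      }
    }

Feed it text with, for example, nc -lk 9999, and per-window counts are printed; Spark window functions and Beam windowing express the same idea in their own APIs.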

Comparison 

Framework                    Storm          Storm Trident   Spark          Flink
Model                        native         micro-batch     micro-batch    native
Guarantees                   at-least-once  exactly-once    exactly-once   exactly-once
Fault tolerance              record ACKs    record ACKs     checkpointing  checkpointing
Overhead of fault tolerance  high           medium          medium         low
Latency                      very low       high            high           low
Throughput                   low            medium          high           high

