Smaato was very happy to host the spring to summer edition of the Big Data and NoSQL Hamburg (BDNSHH) meetup with two great guests from Berlin: Aljoscha Krettek and Maximilian Michels from dataArtisans, the company behind Apache Flink.
Apache Flink is an open source platform for scalable batch and stream data processing. At its core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams. Interesting features are its custom dataflow optimizer, custom memory management, and its strategies to perform well when memory runs out.
We’ve interviewed our guests to dig deeper into Apache Flink:
Q: Why Flink? What makes it stand out against the competition (Spark, Samza, Storm, Hadoop)?
First, all of these are Apache projects. Apache projects do not compete for revenue. Rather, they evolve the available technology. Flink has a set of features that puts it one step ahead of other open source streaming systems, in particular, the ability to deliver exactly-once guarantees with high throughput and low latency. Moreover, it can maintain mutable-state, natural-flow control (to deal with backpressure effects), has an easy-to-use streaming API, and can do batch processing very well. The end result is a unique architecture that can do proper low-latency streaming and batch processing in one system.
Q: Map/reduce is not well suited for iterative algorithms. Spark tackles that by providing the RDD abstraction to decouple data from its physical location, allowing for fault-tolerance and re-useability of loaded datasets. What is the Flink approach in this regard?
When a user writes a Flink program, the system generates a Dataflow graph based on the operations in the program. This graph contains all the data-dependencies and can also contain loops (iterations). Flink optimizes this graph and finds the best execution strategies before pushing execution to the worker nodes. Since Flink understands all the intermediate data sets in the execution plan, it can provide fault tolerance and re-use intermediate data sets.
Q: When it comes to streaming, there are two general approaches: processing of single events and processing of micro-batches. Both have their merits, but that usually depends on the actual use-case. How is Flink handling data streaming?
Actually, this is a misconception that has long been solved by newer technologies such as Google Cloud Dataflow and Flink itself. Flink’s approach combines the best of both worlds by following the continuous operator model where operators are enduring and stateful, and there are no stages. Data is transferred over the network in a continuous and managed flow of buffers which gives high throughput. Because the buffers have a configurable timeout, you can control the maximal latency as well. With this approach, there is really no reason to go for either the "single event" or "mini batch" model. The system provides exactly-once guarantees without resorting to micro-batches based on distributed snapshots.
Q: The different streaming frameworks (Storm, Samza, Spark Streaming) each have their own strategy to handle state: externalizing to external datastores, local key-value stores, or using a custom dataset abstraction. How does Flink handle persistent state for global aggregations or stream joins?
In the last Flink release (0.9.0) the system allows stateful operators to checkpoint their state at specific intervals to a backend. This enables the user to use a key/value store in an operator and write it to a filesystem, like HDFS, for checkpointing. Upcoming in the next release is a key/value state interface-backed external key/value store.
Q: Twitter recently described Heron, a streaming system they built to address issues in Apache Storm, introducing methods to handle traffic spikes in order to avoid back pressure, tools to ease debugging, and improving the predictability of performance at scale. What has Flink to offer in these regards?
In the next release (0.10.0) Flink will introduce live monitoring of workers. This includes statistics about machine utilization (CPU, RAM), processed records and custom counters. This should help greatly with debugging. Flink has natural flow control to avoid backpressure.
Q: The Spark roadmap currently focuses on better resource utilization with custom memory management and improved CPU utilization. Flink handles this nicely as well. What are your secrets to maximum performance?
It's not really that big of a secret: Avoid object allocations, don't keep objects around very long, keep data in serialized form, have fast serialization/deserialization, and use well-known external memory algorithms for heavy operations.
Q: All the big data frameworks constantly evolve, so what are the most exciting features coming up on your roadmap?
The most exciting upcoming features are master high-availability, asynchronous and incremental backups of streaming state and life monitoring of custom counters, job statistics and machine utilization.
Additional insights were presented during the meetup. Check out the slides here:
You can also watch the presentation in its entirety here:
Along with presentation, Maximilian and Aljoscha did a great live demo showcasing Flink, which led to a lively discussion in our cafeteria. Here are a few more impressions from the evening:
Many thanks to our speakers and all participants for the great evening and we all are looking forward to the next edition of the Big Data and NoSQL Hamburg meetup.