Distributed streaming applications are like Rube Goldberg machines. Lots of levers and knobs. You feel like you have to observe them in order to figure out how they work. This article is not about the one magic Spark parameter making your app scream. It's about how to approach optimizing your freshly developed Spark Streaming app.
As an example I will use an application which reads ad request data from a Kafka stream, processes it with Spark and writes the aggregated results to Aerospike (a key value store).
Stabilizing your application
Before you can make your stream processing faster, it must first be stable. If you visualize your stream as a chain, the complete process can't be faster than the slowest link and each link can overpower subsequent links by producing too much data too fast.
If you'd look at our sample app, you'd see a huge amount of messages in Kafka and a pretty simple aggregation process in Spark. Since the app has to read/write to Aerospike and we haven't build out the Aerospike cluster yet, let's assume the database is our slowest link.
The goal now is to slow down Spark so it does not overpower the database with read/write operations.
There are several parameters which must work in concert to make this happen:
Streaming batch interval
Spark splits the stream into micro batches. The batch interval defines the size of the batch in seconds. For example, a batch interval of 5 seconds will cause Spark to collect 5 seconds worth of data to process.
This is the time it takes Spark to process one batch of data within the streaming batch interval. In our case, it’s the time we need to aggregate the records and write them to Aerospike.
Median streaming rate
This defines the number of records Spark pulls from the stream per second and stream receiver. For example, if the streaming batch interval is 5 seconds, we have 3 stream receivers and a median streaming rate of 4,000 records, Spark would pull 4,000 x 3 x 5 = 60,000 records per batch. So we would have to be able to process these 60,000 records within 5 seconds. Otherwise we run behind and our streaming application becomes unstable.
You can find all of these in the "Streaming" tab of your Spark application.
You'd want a situation where:
- There is no scheduling delay (C), or it's only increasing occasionally and recovers quickly.
- The processing time (C) is almost as long as the batch interval (A) but stays below it.
- The median streaming rate (B) is as high as possible without causing trouble for numbers 1 and 2.
You need to figure out how many records your app can process within the batch interval.
Finding the right batch window size
Start with a batch window size of 5-10 seconds and observe processing time. Try to alter the time interval and see which effect this has on processing time. Try to go as low as you can without negatively affecting it.
Limit the stream receiver
spark.streaming.receiver.maxRate option you can limit the number of messages pulled by the stream receiver. Once you’ve found a good batch window size, reduce the number of incoming messages per second to a point where the processing time for this window stays within the window and the scheduling delay stays at zero.
Once you're at this point, you have a stable streaming app. Now you can try to make it faster by:
- Tuning Spark's serialization, shuffle and memory parameters
- Increasing parallelism
- Stabilizing it again with the above process
The Spark docs do a pretty good job of explaining configuration parameters and their effects. Switching to Kryo will increase serialization throughput which helps quite a bit.
Increasing parallelism can be done in a number of ways:
- Increasing the amount of Spark workers
- Partitioning your Kafka messages and creating a Stream for each partition
- Repartitioning these Streams again
For our application, we achieved the best results by having as many Kafka partitions as we have Spark workers and creating a DStream for each worker/partition.
Once your stream processing becomes faster, you will find that you can process more and more records within the batch window. Now you can lift the median streaming rate to process the maximum number of records possible.
Whenever you change your processing algorithms, add Spark workers or Kafka partitions, you’ll want to repeat these optimizations. Everything you change within your streaming application, other connected systems or the underlying hardware will affect the batch processing time. Best case scenario, this means you’re not as fast as you could be. Worst case, your application will choke on the incoming data and you could, for example, run out of serialization disk space.
- You need to start out creating a stable stream processing application before focusing on throughput.
- You might have to make your app slower at first, then keep scaling by parallelizing processing.
- This is an iterative process which you will have to perform continuously.
I hope this approach will help you to get the most out of your Spark Streaming application.