
Tuning Spark Streaming Applications

Posted by Stephan Brosinski on April 20, 2015

Distributed streaming applications are like Rube Goldberg machines: lots of levers and knobs, and you feel like you have to watch them running to figure out how they work. This article is not about the one magic Spark parameter that makes your app scream. It's about how to approach optimizing your freshly developed Spark Streaming app.

As an example, I will use an application which reads ad request data from a Kafka stream, processes it with Spark, and writes the aggregated results to Aerospike (a key-value store).

[Figure: the sample pipeline, with ad request data flowing from Kafka through Spark into Aerospike]

Stabilizing your application

Before you can make your stream processing faster, it must first be stable. If you visualize your stream as a chain, the complete process can't be faster than the slowest link and each link can overpower subsequent links by producing too much data too fast.

If you looked at our sample app, you'd see a huge amount of messages in Kafka and a pretty simple aggregation process in Spark. Since the app has to read from and write to Aerospike, and we haven't built out the Aerospike cluster yet, let's assume the database is our slowest link.

The goal now is to slow down Spark so it does not overpower the database with read/write operations.

There are several parameters which must work in concert to make this happen:

Streaming batch interval

Spark splits the stream into micro batches. The batch interval defines the size of the batch in seconds. For example, a batch interval of 5 seconds will cause Spark to collect 5 seconds' worth of data to process.
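
To make this concrete, here is a minimal sketch of how the batch interval is fixed when the StreamingContext is created (Scala, Spark 1.x streaming API). The application name is just a placeholder:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ad-request-aggregator")  // placeholder name
// Every micro batch will contain roughly 5 seconds' worth of incoming records
val ssc = new StreamingContext(conf, Seconds(5))
```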

Processing time

This is the time it takes Spark to process one batch of data within the streaming batch interval. In our case, it’s the time we need to aggregate the records and write them to Aerospike.

Median streaming rate

This defines the number of records Spark pulls from the stream per second and stream receiver. For example, if the streaming batch interval is 5 seconds, we have 3 stream receivers, and the median streaming rate is 4,000 records per second, Spark would pull 4,000 x 3 x 5 = 60,000 records per batch. So we would have to be able to process these 60,000 records within 5 seconds. Otherwise we fall behind and our streaming application becomes unstable.

You can find all of these in the "Streaming" tab of your Spark application.

[Figure: the Streaming tab of the Spark UI, showing the batch interval (A), the median streaming rate (B), and the scheduling delay and processing time (C)]

You'd want a situation where:

  1. There is no scheduling delay (C), or it only increases occasionally and recovers quickly.
  2. The processing time (C) is almost as long as the batch interval (A) but stays below it.
  3. The median streaming rate (B) is as high as possible without causing trouble for numbers 1 and 2.

You need to figure out how many records your app can process within the batch interval.

Finding the right batch window size

Start with a batch window size of 5-10 seconds and observe the processing time. Alter the interval and see what effect this has on the processing time. Try to go as low as you can without negatively affecting it.

Limit the stream receiver

With Spark's spark.streaming.receiver.maxRate option you can cap the number of records each stream receiver pulls per second. Once you’ve found a good batch window size, reduce the number of incoming messages per second to a point where the processing time for this window stays within the window and the scheduling delay stays at zero.
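
As a sketch, assuming we settled on the 4,000 records per second per receiver from the example above (an illustration, not a recommendation), the cap can be set on the SparkConf:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ad-request-aggregator")              // placeholder name
  // Each receiver pulls at most 4,000 records per second
  .set("spark.streaming.receiver.maxRate", "4000")
```

The same setting can also be passed when submitting the job, via `--conf spark.streaming.receiver.maxRate=4000`.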

Parallelize

Once you're at this point, you have a stable streaming app. Now you can try to make it faster by:

  • Tuning Spark's serialization, shuffle and memory parameters
  • Increasing parallelism
  • Stabilizing it again with the above process

The Spark docs do a pretty good job of explaining configuration parameters and their effects. Switching to Kryo will increase serialization throughput, which helps quite a bit.
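
Here is a hedged sketch of the Kryo switch; the AdRequest case class is a made-up stand-in for our own record types:

```scala
import org.apache.spark.SparkConf

// Hypothetical record type standing in for the ad request data
case class AdRequest(publisherId: String, timestamp: Long)

val conf = new SparkConf()
  .setAppName("ad-request-aggregator")  // placeholder name
  // Kryo is faster and more compact than the default Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering the classes that flow through the stream avoids shipping
  // full class names with every record
  .registerKryoClasses(Array(classOf[AdRequest]))
```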

Increasing parallelism can be done in a number of ways:

  • Increasing the number of Spark workers
  • Partitioning your Kafka messages and creating a Stream for each partition
  • Repartitioning these Streams again

For our application, we achieved the best results by having as many Kafka partitions as we have Spark workers and creating a DStream for each worker/partition.
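
Here is a rough sketch of that layout using the receiver-based Kafka API of the time (spark-streaming-kafka, Spark 1.x). The ZooKeeper hosts, topic name, and partition count are placeholders, not our production values:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("ad-request-aggregator")  // placeholder name
val ssc = new StreamingContext(conf, Seconds(5))

val numPartitions = 3  // one Kafka partition per Spark worker in this sketch

// One receiver-based stream per Kafka partition
val kafkaStreams = (1 to numPartitions).map { _ =>
  KafkaUtils.createStream(
    ssc,
    "zk1:2181,zk2:2181,zk3:2181",  // ZooKeeper quorum (placeholder hosts)
    "ad-request-aggregator",       // consumer group id (placeholder)
    Map("ad-requests" -> 1)        // topic -> receiver threads
  )
}

// Union the per-partition streams and repartition before the heavy aggregation
val adRequests = ssc.union(kafkaStreams).repartition(numPartitions)
```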

Iterate

Once your stream processing becomes faster, you will find that you can process more and more records within the batch window. Now you can lift the median streaming rate to process the maximum number of records possible.

Whenever you change your processing algorithms, add Spark workers or Kafka partitions, you’ll want to repeat these optimizations. Everything you change within your streaming application, other connected systems or the underlying hardware will affect the batch processing time. Best case scenario, this means you’re not as fast as you could be. Worst case, your application will choke on the incoming data and you could, for example, run out of serialization disk space.

Conclusion

  • You need to start out creating a stable stream processing application before focusing on throughput.
  • You might have to make your app slower at first, then keep scaling by parallelizing processing.
  • This is an iterative process which you will have to perform continuously.

I hope this approach will help you to get the most out of your Spark Streaming application. Please leave a comment and tell us about your experience with Spark.

Written by Stephan Brosinski

Stephan is a Senior Developer in Smaato’s Data Engineering team. He previously worked for some of Europe's leading digital agencies on large e-commerce projects. He has over 15 years of experience in the software industry.
