Workloads
This benchmark allows studying different performance aspects of distributed stream processing systems. For each of these we developed separate workloads, which we describe here.
The first workload runs each framework once on each pipeline complexity:

For each of the frameworks [Spark Streaming, Flink, Kafka Streams and Structured Streaming]:
For each of the pipeline complexities [ingest, parse, join, tumbling window, sliding window]:
- Start the cluster of the framework, if it requires one, and wait a few minutes for startup to complete.
- Create a new Kafka output topic on which the results will be published and a topic on which the JMX metrics will be published (see the topic-creation sketch below).
- Start up the JMX exporter.
- Start up the processing job.
- Start the data stream generator.
- Wait for 40 minutes: 30 minutes to process the data and 10 minutes to catch up with possible lag.
- Stop the data stream generator, the JMX exporter and the streaming job.

Start a job to consume the output and metrics from Kafka and write them to S3.
Start a job to evaluate the output, JMX metrics and cAdvisor metrics.
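Each run publishes its results and JMX metrics to fresh Kafka topics. A minimal sketch of that step, written in Scala with the Kafka AdminClient, is shown below; the broker address, topic names, partition count and replication factor are illustrative assumptions, not the benchmark's actual configuration.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateRunTopics {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092") // assumed broker address
    val admin = AdminClient.create(props)

    // One topic for the pipeline output and one for the exported JMX metrics;
    // the names, partition count and replication factor are illustrative only.
    val topics = Seq(
      new NewTopic("benchmark-output-run-1", 1, 1.toShort),
      new NewTopic("benchmark-metrics-run-1", 1, 1.toShort)
    )
    admin.createTopics(topics.asJava).all().get()
    admin.close()
  }
}
```

Creating a new pair of topics per run keeps the output and metrics of different runs separated, so the evaluation job can consume exactly one run's data.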
The second workload repeats these runs at a range of throughput levels, with Spark Streaming run at both a 3-second and a 5-second micro-batch interval:

For each of the frameworks [Spark Streaming (3-second and 5-second micro-batch intervals), Flink, Kafka Streams and Structured Streaming]:
For each of the pipeline complexities [ingest, parse, join, tumbling window, sliding window]:
For each of a list of throughput levels:
- Start the cluster of the framework, if it requires one, and wait a few minutes for startup to complete.
- Create a new Kafka output topic on which the results will be published and a topic on which the JMX metrics will be published.
- Start up the JMX exporter.
- Start up the processing job.
- Start the input stream producer at the chosen throughput level (see the producer sketch below).
- Wait for 40 minutes: 30 minutes to process the data and 10 minutes to catch up with possible lag.
- Stop the input stream producer, the JMX exporter and the streaming job.

Start a job to consume the output and metrics from Kafka and write them to S3.
Start a job to evaluate the output, JMX metrics and cAdvisor metrics.
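The runs at different throughput levels require the input producer to hold a fixed message rate for the 30-minute measurement window. A minimal Scala sketch of such a rate-limited producer is shown below; the topic name, broker address, rate and message payloads are hypothetical and stand in for the benchmark's own data stream generator.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object RateLimitedProducer {
  def main(args: Array[String]): Unit = {
    // The target rate would be set per run from the list of throughput levels.
    val targetRate = 1000            // messages per second for this run (hypothetical)
    val runSeconds = 30 * 60         // 30-minute measurement window

    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    for (second <- 0 until runSeconds) {
      val tickStart = System.currentTimeMillis()
      // Send one second's worth of messages, then sleep for the rest of the second.
      for (i <- 0 until targetRate) {
        producer.send(new ProducerRecord("benchmark-input", s"key-$i", s"measurement-$second-$i"))
      }
      val elapsed = System.currentTimeMillis() - tickStart
      if (elapsed < 1000) Thread.sleep(1000 - elapsed)
    }
    producer.close()
  }
}
```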
The third workload starts the input stream five minutes before the processing job, so the job begins with a backlog to catch up on:

For each of the frameworks [Spark Streaming (3-second and 5-second micro-batch intervals), Flink, Kafka Streams and Structured Streaming]:
For each of the pipeline complexities [ingest, parse, join, tumbling window, sliding window]:
- Start the cluster of the framework, if it requires one, and wait a few minutes for startup to complete.
- Create a new Kafka output topic on which the results will be published and a topic on which the JMX metrics will be published.
- Start the input stream producer and let it publish for 5 minutes.
- Start up the JMX exporter.
- Start up the processing job.
- Wait for 10 minutes. The processing job catches up with the five-minute backlog (see the lag-check sketch below) and then continues processing the newly incoming data.
- Stop the input stream producer, the JMX exporter and the streaming job.

Start a job to consume the output and metrics from Kafka and write them to S3.
Start a job to evaluate the output, JMX metrics and cAdvisor metrics.
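Whether the job has worked away the five-minute backlog can be judged from its consumer group lag. The sketch below, in Scala with the Kafka AdminClient and a throwaway consumer, computes that lag once; the consumer group id and broker address are assumptions, and the actual group id depends on the framework under test.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import org.apache.kafka.clients.consumer.KafkaConsumer

object ConsumerLagCheck {
  def main(args: Array[String]): Unit = {
    val groupId = "benchmark-processing-job" // hypothetical; depends on the framework
    val adminProps = new Properties()
    adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092") // assumed broker address
    val admin = AdminClient.create(adminProps)

    // Offsets the processing job has committed so far.
    val committed = admin.listConsumerGroupOffsets(groupId)
      .partitionsToOffsetAndMetadata().get().asScala

    // Latest offsets on the same partitions, read with a throwaway consumer.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "kafka:9092")
    consumerProps.put("group.id", "lag-checker")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](consumerProps)
    val ends = consumer.endOffsets(committed.keys.toSeq.asJava).asScala

    // Total lag is the gap between the latest and the committed offset, summed over partitions.
    val totalLag = committed.map { case (tp, offset) => ends(tp) - offset.offset() }.sum
    println(s"Total lag for $groupId: $totalLag records")

    consumer.close()
    admin.close()
  }
}
```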
The fourth workload feeds the pipelines an input stream with periodic bursts:

For each of the frameworks [Spark Streaming, Flink, Kafka Streams and Structured Streaming]:
For each of the pipeline complexities [ingest, parse, join, tumbling window, sliding window]:
- Start the cluster of the framework, if it requires one, and wait a few minutes for startup to complete.
- Create a new Kafka output topic on which the results will be published and a topic on which the JMX metrics will be published.
- Start up the JMX exporter.
- Start up the processing job.
- Start the input stream producer with periodic bursts (see the burst-producer sketch below).
- Wait for 40 minutes: 30 minutes to process the data and 10 minutes to catch up with possible lag.
- Stop the input stream producer, the JMX exporter and the streaming job.

Start a job to consume the output and metrics from Kafka and write them to S3.
Start a job to evaluate the output, JMX metrics and cAdvisor metrics.
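A minimal Scala sketch of a producer with periodic bursts is shown below: it sends at a base rate and switches to a higher rate for a few seconds at a fixed interval. All rates, the burst period and the topic name are illustrative assumptions; the benchmark's own generator defines the actual burst pattern.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PeriodicBurstProducer {
  def main(args: Array[String]): Unit = {
    val baseRate = 500           // messages per second between bursts (hypothetical)
    val burstRate = 5000         // messages per second during a burst (hypothetical)
    val burstEverySeconds = 60   // start a burst once per minute (hypothetical)
    val burstLengthSeconds = 5   // each burst lasts five seconds (hypothetical)
    val runSeconds = 30 * 60     // 30-minute measurement window

    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    for (second <- 0 until runSeconds) {
      // Switch to the burst rate for the first few seconds of every period.
      val rate = if (second % burstEverySeconds < burstLengthSeconds) burstRate else baseRate
      val tickStart = System.currentTimeMillis()
      for (i <- 0 until rate) {
        producer.send(new ProducerRecord("benchmark-input", s"key-$i", s"measurement-$second-$i"))
      }
      val elapsed = System.currentTimeMillis() - tickStart
      if (elapsed < 1000) Thread.sleep(1000 - elapsed)
    }
    producer.close()
  }
}
```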
This work has been made possible by Klarrio.