Workloads
This benchmark allows studying different performance aspects of distributed stream processing systems. For each of these we developed separate workloads, which we describe here.
The first workload runs each framework once on each pipeline complexity:

For each of the frameworks [Spark Streaming, Flink, Kafka Streams and Structured Streaming]:
For each of the pipeline complexities [ingest, parse, join, tumbling window, sliding window]:
- Start the cluster of the framework, if it requires one, and wait a few minutes for startup to complete.
- Create a new Kafka output topic on which the results will be published and a topic on which the JMX metrics will be published (see the topic-creation sketch below).
- Start up the JMX exporter.
- Start up the processing job.
- Start the data stream generator.
- Wait for 40 minutes: 30 minutes to process the data and 10 minutes to catch up with possible lag.
- Stop the data stream generator, the JMX exporter and the streaming job.

Start a job to consume the output and metrics from Kafka and write them to S3.
Start a job to evaluate the output, JMX metrics and cAdvisor metrics.
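Each run publishes its results and JMX metrics to fresh Kafka topics. A minimal sketch of that step, written in Scala with the Kafka AdminClient, is shown below; the broker address, topic names, partition count and replication factor are illustrative assumptions, not the benchmark's actual configuration.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

object CreateRunTopics {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092") // assumed broker address
    val admin = AdminClient.create(props)

    // One topic for the pipeline output and one for the exported JMX metrics;
    // the names, partition count and replication factor are illustrative only.
    val topics = Seq(
      new NewTopic("benchmark-output-run-1", 1, 1.toShort),
      new NewTopic("benchmark-metrics-run-1", 1, 1.toShort)
    )
    admin.createTopics(topics.asJava).all().get()
    admin.close()
  }
}
```

Creating a new pair of topics per run keeps the output and metrics of different runs separated, so the evaluation job can consume exactly one run's data.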
The second workload repeats these runs at a range of throughput levels, with Spark Streaming run at both a 3-second and a 5-second micro-batch interval:

For each of the frameworks [Spark Streaming (3-second and 5-second micro-batch intervals), Flink, Kafka Streams and Structured Streaming]:
For each of the pipeline complexities [ingest, parse, join, tumbling window, sliding window]:
For each of a list of throughput levels:
- Start the cluster of the framework, if it requires one, and wait a few minutes for startup to complete.
- Create a new Kafka output topic on which the results will be published and a topic on which the JMX metrics will be published.
- Start up the JMX exporter.
- Start up the processing job.
- Start the input stream producer at the chosen throughput level (see the producer sketch below).
- Wait for 40 minutes: 30 minutes to process the data and 10 minutes to catch up with possible lag.
- Stop the input stream producer, the JMX exporter and the streaming job.

Start a job to consume the output and metrics from Kafka and write them to S3.
Start a job to evaluate the output, JMX metrics and cAdvisor metrics.
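The runs at different throughput levels require the input producer to hold a fixed message rate for the 30-minute measurement window. A minimal Scala sketch of such a rate-limited producer is shown below; the topic name, broker address, rate and message payloads are hypothetical and stand in for the benchmark's own data stream generator.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object RateLimitedProducer {
  def main(args: Array[String]): Unit = {
    // The target rate would be set per run from the list of throughput levels.
    val targetRate = 1000            // messages per second for this run (hypothetical)
    val runSeconds = 30 * 60         // 30-minute measurement window

    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    for (second <- 0 until runSeconds) {
      val tickStart = System.currentTimeMillis()
      // Send one second's worth of messages, then sleep for the rest of the second.
      for (i <- 0 until targetRate) {
        producer.send(new ProducerRecord("benchmark-input", s"key-$i", s"measurement-$second-$i"))
      }
      val elapsed = System.currentTimeMillis() - tickStart
      if (elapsed < 1000) Thread.sleep(1000 - elapsed)
    }
    producer.close()
  }
}
```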
The third workload starts the input stream five minutes before the processing job, so the job begins with a backlog to catch up on:

For each of the frameworks [Spark Streaming (3-second and 5-second micro-batch intervals), Flink, Kafka Streams and Structured Streaming]:
For each of the pipeline complexities [ingest, parse, join, tumbling window, sliding window]:
- Start the cluster of the framework, if it requires one, and wait a few minutes for startup to complete.
- Create a new Kafka output topic on which the results will be published and a topic on which the JMX metrics will be published.
- Start the input stream producer and let it publish for 5 minutes.
- Start up the JMX exporter.
- Start up the processing job.
- Wait for 10 minutes. The processing job catches up with the five-minute backlog (see the lag-check sketch below) and then continues processing the newly incoming data.
- Stop the input stream producer, the JMX exporter and the streaming job.

Start a job to consume the output and metrics from Kafka and write them to S3.
Start a job to evaluate the output, JMX metrics and cAdvisor metrics.
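Whether the job has worked away the five-minute backlog can be judged from its consumer group lag. The sketch below, in Scala with the Kafka AdminClient and a throwaway consumer, computes that lag once; the consumer group id and broker address are assumptions, and the actual group id depends on the framework under test.

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import org.apache.kafka.clients.consumer.KafkaConsumer

object ConsumerLagCheck {
  def main(args: Array[String]): Unit = {
    val groupId = "benchmark-processing-job" // hypothetical; depends on the framework
    val adminProps = new Properties()
    adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092") // assumed broker address
    val admin = AdminClient.create(adminProps)

    // Offsets the processing job has committed so far.
    val committed = admin.listConsumerGroupOffsets(groupId)
      .partitionsToOffsetAndMetadata().get().asScala

    // Latest offsets on the same partitions, read with a throwaway consumer.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "kafka:9092")
    consumerProps.put("group.id", "lag-checker")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](consumerProps)
    val ends = consumer.endOffsets(committed.keys.toSeq.asJava).asScala

    // Total lag is the gap between the latest and the committed offset, summed over partitions.
    val totalLag = committed.map { case (tp, offset) => ends(tp) - offset.offset() }.sum
    println(s"Total lag for $groupId: $totalLag records")

    consumer.close()
    admin.close()
  }
}
```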
The fourth workload feeds the pipelines an input stream with periodic bursts:

For each of the frameworks [Spark Streaming, Flink, Kafka Streams and Structured Streaming]:
For each of the pipeline complexities [ingest, parse, join, tumbling window, sliding window]:
- Start the cluster of the framework, if it requires one, and wait a few minutes for startup to complete.
- Create a new Kafka output topic on which the results will be published and a topic on which the JMX metrics will be published.
- Start up the JMX exporter.
- Start up the processing job.
- Start the input stream producer with periodic bursts (see the burst-producer sketch below).
- Wait for 40 minutes: 30 minutes to process the data and 10 minutes to catch up with possible lag.
- Stop the input stream producer, the JMX exporter and the streaming job.

Start a job to consume the output and metrics from Kafka and write them to S3.
Start a job to evaluate the output, JMX metrics and cAdvisor metrics.
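A minimal Scala sketch of a producer with periodic bursts is shown below: it sends at a base rate and switches to a higher rate for a few seconds at a fixed interval. All rates, the burst period and the topic name are illustrative assumptions; the benchmark's own generator defines the actual burst pattern.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PeriodicBurstProducer {
  def main(args: Array[String]): Unit = {
    val baseRate = 500           // messages per second between bursts (hypothetical)
    val burstRate = 5000         // messages per second during a burst (hypothetical)
    val burstEverySeconds = 60   // start a burst once per minute (hypothetical)
    val burstLengthSeconds = 5   // each burst lasts five seconds (hypothetical)
    val runSeconds = 30 * 60     // 30-minute measurement window

    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092") // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    for (second <- 0 until runSeconds) {
      // Switch to the burst rate for the first few seconds of every period.
      val rate = if (second % burstEverySeconds < burstLengthSeconds) burstRate else baseRate
      val tickStart = System.currentTimeMillis()
      for (i <- 0 until rate) {
        producer.send(new ProducerRecord("benchmark-input", s"key-$i", s"measurement-$second-$i"))
      }
      val elapsed = System.currentTimeMillis() - tickStart
      if (elapsed < 1000) Thread.sleep(1000 - elapsed)
    }
    producer.close()
  }
}
```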
This work has been made possible by Klarrio.