Benchmark
The data source is traffic data from NDW (Nationaal Dataportaal Wegverkeer). We use a dataset containing measurements from a set of road measurement locations in the Netherlands. At these locations, sensors count the number of cars that pass by on a given lane and the average speed of those cars within a time interval.
We use the Data Stream Generator to generate an input stream on two Kafka topics: ndwflow and ndwspeed. Flow measurements (ndwflow) describe the number of cars that passed a certain measurement point and lane. Speed measurements (ndwspeed) describe the average speed of the cars that passed a certain measurement point and lane during that time frame. The throughput and data characteristics of the stream can be regulated with environment variables. More information can be found in the Data Stream Generator documentation.
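As an illustration, the JSON payloads on the two topics might look roughly like the following sketch. The field names and values here are hypothetical; the actual schema is defined by the Data Stream Generator, so consult its documentation for the real message format.

```python
import json

# Hypothetical flow event on the ndwflow topic: cars counted on one lane
# of one measurement location. Field names are illustrative only.
flow_event = json.dumps({
    "measurementId": "RWS01_MONIBAS_0021hrl0414ra",
    "internalId": "lane1",                # lane of the road
    "timestamp": "2020-01-01T12:00:00Z",
    "flow": 180,                          # cars counted in the interval
})

# Hypothetical speed event on the ndwspeed topic for the same lane.
speed_event = json.dumps({
    "measurementId": "RWS01_MONIBAS_0021hrl0414ra",
    "internalId": "lane1",
    "timestamp": "2020-01-01T12:00:00Z",
    "speed": 92.5,                        # average speed in the interval
})

print(json.loads(flow_event)["flow"])   # → 180
```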
This benchmark suite aims to include the most important stream operations and to study the influence of different implementations: high-level vs. low-level DSL, tumbling vs. interval join, tumbling vs. sliding window, etc. Therefore, multiple pipelines have been included.
Each of these pipelines has different complexities and characteristics. There are three main groups of pipelines. The first group contains the stateless pipelines: ingesting and ETL. The second group contains the simple stateful pipelines with joins and windowing operations. Finally, there are the complex pipelines, which contain multiple stateful operations: end-to-end analytics pipelines.
Most of these pipelines are derived from a base pipeline, which we explain here to keep the following sections concise. The base pipeline looks as follows:

Base pipeline overview (adapted from [1])
- Ingest: Reading JSON event data from Kafka from two input streams: the speed stream and the flow stream.
- Parse: Parsing the data from the two streams and extracting the useful fields.
- Join: Joining the flow stream and the speed stream. The join is done on a composite key: measurement ID, internal ID and timestamp rounded down to the second. The internal ID describes the lane of the road. After the join, we know, for each lane and time period, how many cars passed and how fast they drove.
- Tumbling window: Aggregating the speed and the flow over all the lanes belonging to the same measurement ID. The data is grouped by measurement ID and timestamp (rounded down to the second). We compute the average speed and the accumulated flow over all the lanes.
- Sliding window: For the sliding window stage, we compute the relative change in flow and speed over two lookback periods: a short one and a long one. The lengths of the lookback periods are defined in the configuration file. The relative change is calculated as:
  (speed(t) - speed(t-1)) / speed(t-1)
- Output: Publishing the output back to Kafka.
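The stateful stages above can be sketched in plain Python, stripped of any framework. This is a minimal illustration of the join key, the tumbling aggregation and the relative-change formula; field names are hypothetical and the real pipelines use the frameworks' own DSLs.

```python
from collections import defaultdict

def join_key(event):
    # Composite join key: measurement ID, internal (lane) ID, and the
    # timestamp rounded down to the second.
    return (event["measurementId"], event["internalId"], int(event["timestamp"]))

def join(flow_events, speed_events):
    # Pair each flow event with the speed event that shares its key.
    speeds = {join_key(e): e["speed"] for e in speed_events}
    for e in flow_events:
        k = join_key(e)
        if k in speeds:
            yield {**e, "speed": speeds[k]}

def tumbling_window(joined):
    # Group by measurement ID and second; accumulate flow, average speed.
    groups = defaultdict(list)
    for e in joined:
        groups[(e["measurementId"], int(e["timestamp"]))].append(e)
    for (mid, ts), events in groups.items():
        yield {
            "measurementId": mid,
            "timestamp": ts,
            "flow": sum(e["flow"] for e in events),
            "speed": sum(e["speed"] for e in events) / len(events),
        }

def relative_change(current, previous):
    # (speed(t) - speed(t-1)) / speed(t-1), as in the sliding-window stage.
    return (current - previous) / previous

# Two lanes of the same measurement point within the same second.
flows = [
    {"measurementId": "m1", "internalId": "lane1", "timestamp": 10.2, "flow": 100},
    {"measurementId": "m1", "internalId": "lane2", "timestamp": 10.7, "flow": 60},
]
speeds = [
    {"measurementId": "m1", "internalId": "lane1", "timestamp": 10.4, "speed": 90.0},
    {"measurementId": "m1", "internalId": "lane2", "timestamp": 10.9, "speed": 110.0},
]
agg = list(tumbling_window(join(flows, speeds)))
print(agg)  # one record: accumulated flow 160, average speed 100.0
```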
The pipeline can also be executed only partially; the more stages you include, the more complex the pipeline becomes. This is controlled by the environment variable LAST_STAGE, as explained in the documentation of each of the components. The index of each last stage is shown next to the arrow of the output to Kafka.
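For example, based on the stage indices described below (0 = ingest, 1 = parse), selecting a last stage might look like this. The exact accepted values are documented per component, so treat this as a sketch:

```shell
# Run only the ingest stage (stage 0): read from Kafka, write straight back.
export LAST_STAGE=0

# Run through the parse stage (stage 1): the simple ETL pipeline.
export LAST_STAGE=1
```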
Stateless pipelines perform simple transformations for which no other events are required. This benchmarking suite includes two stateless pipelines.
This pipeline executes the base pipeline until stage 0. It reads data and directly writes it back to Kafka without applying any transformations. It is essentially an empty pipeline, which can be used to measure network and framework overhead.

This pipeline executes the base pipeline until stage 1 and mimics a simple ETL pipeline.

These pipelines allow benchmarking the performance of common stateful operations: joins, tumbling windows and sliding windows. Each of these pipelines contains only one stateful operation.

To test different join implementations, we included a pipeline which reads two streams and joins them together. This pipeline is the basis for the complex pipeline: it is the pipeline obtained when the complex pipeline is executed only until the join stage.
For Flink, both an interval join and a tumbling window join have been included to investigate their performance differences. For Spark Streaming, only a tumbling window join is possible. For Structured Streaming and Kafka Streams, we use interval joins.
The implementations of sliding and tumbling windows rely on the same code. Based on the configured window duration and slide duration, the code executes either a tumbling or a sliding window:
When the slide duration and window duration are equal, the window is tumbling.
When the window duration is a multiple of the slide duration, the window is sliding.
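One way to see why a single implementation covers both cases is to compute, for an event timestamp, the set of windows it falls into. With window duration equal to slide duration, every event lands in exactly one window (tumbling); with a window duration of k times the slide duration, it lands in k windows (sliding). A minimal sketch:

```python
def windows_for(ts, window, slide):
    # Return the start times of all windows of length `window`,
    # advancing by `slide`, that contain timestamp `ts`.
    start = (ts // slide) * slide      # last window start at or before ts
    starts = []
    while start > ts - window:
        starts.append(start)
        start -= slide
    return sorted(s for s in starts if s <= ts < s + window)

# Tumbling: window == slide, each event falls in exactly one window.
print(windows_for(ts=7, window=5, slide=5))   # → [5]
# Sliding: window = 2 * slide, each event falls in two windows.
print(windows_for(ts=7, window=10, slide=5))  # → [0, 5]
```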
Complex stateful pipelines contain multiple chained stateful operations.


[1] van Dongen, G., & Van den Poel, D. (2020). Evaluation of Stream Processing Frameworks. IEEE Transactions on Parallel and Distributed Systems, 31(8), 1845-1858.
This work has been made possible by Klarrio