Benchmark
The data source is traffic data from NDW (Nationaal Dataportaal Wegverkeer). We use a dataset which contains measurements from a set of road measurement locations in the Netherlands. At these locations, sensors count the number of cars that pass on a certain lane and measure the average speed of these cars within a time interval.
We use the Data Stream Generator to generate an input stream on two Kafka topics: ndwflow and ndwspeed. Flow measurements (ndwflow) describe the number of cars that passed a certain measurement point and lane. Speed measurements (ndwspeed) describe the average speed of the cars that passed a certain measurement point and lane during that time frame. The throughput and data characteristics of the stream can be regulated with environment variables. More information can be found in the Data Stream Generator documentation.
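To make the two topics concrete, the sketch below shows what a flow and a speed message might look like. The field names (measurementId, internalId, timestamp, flow, speed) are illustrative assumptions, not the benchmark's exact schema; consult the Data Stream Generator documentation for the real message format.

```python
import json

# Hypothetical example of one message on each topic; field names are
# assumptions for illustration only.
flow_msg = json.dumps({
    "measurementId": "RWS01_MONIBAS_0021hrl0414ra",  # measurement location
    "internalId": "lane1",                           # lane of the road
    "timestamp": "2020-01-01T12:00:00.500",
    "flow": 420,                                     # cars counted on this lane
})
speed_msg = json.dumps({
    "measurementId": "RWS01_MONIBAS_0021hrl0414ra",
    "internalId": "lane1",
    "timestamp": "2020-01-01T12:00:00.500",
    "speed": 98.5,                                   # average speed on this lane
})

# The Parse stage would deserialize these into typed records.
parsed_flow = json.loads(flow_msg)
parsed_speed = json.loads(speed_msg)
```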
This benchmark suite aims to include the most important stream operations and to study the influence of different implementations: high-level vs. low-level DSL, tumbling vs. interval join, tumbling vs. sliding window, etc. Therefore, multiple pipelines have been included.
Each of these pipelines has different complexities and characteristics. There are two main groups of pipelines. The first is the complex pipeline: an end-to-end flow of multiple steps that can be executed completely or only up to a specified stage.
The complete processing pipeline looks as follows:

adapted from [1]
The flow of operations done in the benchmark is as follows:
- Ingest: Reading JSON event data from Kafka from two input streams: the speed stream and the flow stream.
- Parse: Parsing the data from the two streams and extracting the useful fields.
- Join: Joining the flow stream and the speed stream. The join key is the combination of measurement ID, internal ID and timestamp rounded down to second level. The internal ID describes the lane of the road. After the join, we know, for each lane of the road and time period, how many cars passed and how fast they drove.
- Tumbling window: Aggregating the speed and the flow over all the lanes belonging to the same measurement ID. Here the data is grouped by measurement ID and timestamp (rounded down to second level), and we compute the average speed and the accumulated flow over all the lanes.
- Sliding window: Computing the relative change in flow and speed over two lookback periods: a short one and a long one. The length of the lookback periods is defined in the configuration file. The relative change is calculated by computing:
(speed(t) - speed(t-1)) / speed(t-1)
- Output: Publishing output back to Kafka.
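The join and window stages above can be sketched without any streaming framework. The following is a minimal, framework-free illustration; the field names and in-memory representation are assumptions for clarity, not the benchmark's actual code.

```python
from collections import defaultdict

# Events as (measurement_id, internal_id, timestamp_sec, value) tuples.
flow_events = [("m1", "lane1", 10, 100), ("m1", "lane2", 10, 60)]
speed_events = [("m1", "lane1", 10, 90.0), ("m1", "lane2", 10, 110.0)]

# Join stage: key = (measurement ID, internal ID, timestamp rounded to seconds).
speed_by_key = {(m, l, t): s for (m, l, t, s) in speed_events}
joined = [
    (m, l, t, f, speed_by_key[(m, l, t)])
    for (m, l, t, f) in flow_events
    if (m, l, t) in speed_by_key
]

# Tumbling-window stage: group by (measurement ID, timestamp) over all lanes,
# accumulating flow and averaging speed.
groups = defaultdict(list)
for (m, l, t, f, s) in joined:
    groups[(m, t)].append((f, s))
aggregated = {
    key: (sum(f for f, _ in vals),               # accumulated flow
          sum(s for _, s in vals) / len(vals))   # average speed
    for key, vals in groups.items()
}

# Sliding-window stage: relative change between the current value and the
# value one lookback period ago, (speed(t) - speed(t-1)) / speed(t-1).
def relative_change(curr, prev):
    return (curr - prev) / prev
```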
The pipeline can also be executed only partially: the more stages you add, the more complex the pipeline becomes. The last stage to execute is set with the environment variable LAST_STAGE, as explained in the documentation of each of the components.
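As an illustration, selecting a partial run is just a matter of setting the variable before starting a component. The value shown below is a placeholder; the valid values for LAST_STAGE are listed in each component's documentation.

```shell
# Run the pipeline only up to a chosen stage ("<stage>" is a placeholder;
# see the component documentation for the accepted values).
export LAST_STAGE="<stage>"
```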


These pipelines allow benchmarking the performance of common stateful operations: joins, tumbling windows and sliding windows.

To test different join implementations, we included a pipeline which reads two streams and joins them together. This pipeline is the basis for the complex pipeline: it is the pipeline obtained when the complex pipeline is executed only up to the join stage.
For Flink, both an interval join and a tumbling window join have been included to study the performance differences between them. For Spark Streaming, only a tumbling window join can be done. For Structured Streaming and Kafka Streams, we do interval joins.
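The difference between the two join styles can be sketched with plain match predicates: a tumbling-window join only matches events that fall into the same fixed, non-overlapping window, while an interval join matches events whose timestamps lie within a bound of each other, regardless of window boundaries. A minimal sketch (the window and bound sizes are arbitrary assumptions):

```python
# Tumbling-window join predicate: two events match only if they land in
# the same fixed window [k*window, (k+1)*window).
def same_tumbling_window(t1, t2, window=10):
    return t1 // window == t2 // window

# Interval join predicate: two events match if their timestamps are within
# `bound` of each other.
def within_interval(t1, t2, bound=5):
    return abs(t1 - t2) <= bound

# Timestamps 9 and 11 are only 2 apart: an interval join matches them,
# but a tumbling-window join does not (they fall in [0,10) and [10,20)).
```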


The implementations of sliding and tumbling windows rely on the same code. Based on the configured window duration and slide duration, the code executes either a tumbling or a sliding window.
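This works because a sliding window whose slide equals its duration degenerates into a tumbling window, so one window-assignment routine can serve both. A sketch of that idea, assuming integer timestamps (not the benchmark's actual code):

```python
def assign_windows(ts, duration, slide):
    """Return the start times of all windows containing timestamp `ts`.

    With slide == duration this yields exactly one window (tumbling);
    with slide < duration it yields overlapping windows (sliding).
    """
    last_start = ts - (ts % slide)   # latest window starting at or before ts
    starts = []
    start = last_start
    while start > ts - duration:     # walk back over all windows covering ts
        starts.append(start)
        start -= slide
    return starts
```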



adapted from [1]
[1] van Dongen, G., & Van den Poel, D. (2020). Evaluation of Stream Processing Frameworks. IEEE Transactions on Parallel and Distributed Systems, 31(8), 1845-1858.
This work has been made possible by Klarrio