sf-crime-statistics

This project provides statistical analyses of crime data in San Francisco using Apache Spark Structured Streaming. The real-world dataset was extracted from Kaggle.
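
A minimal sketch of the kind of streaming aggregation this project performs, assuming PySpark, the `sf.crimes` topic used below, and a hypothetical `crime_type` field; the actual schema and queries live in data_stream.py and may differ:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("sf-crime-statistics")
    .getOrCreate()
)

# Hypothetical schema; the real dataset carries many more fields.
schema = StructType([StructField("crime_type", StringType(), True)])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sf.crimes")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka values arrive as bytes: decode them and parse the JSON payload.
crimes = (
    raw.selectExpr("CAST(value AS STRING) AS value")
    .select(from_json(col("value"), schema).alias("data"))
    .select("data.*")
)

# Running count of reported crimes per type, printed to the console.
query = (
    crimes.groupBy("crime_type").count()
    .writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```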

Development Environment

  • Spark 2.4.3
  • Scala 2.11.x
  • Java 1.8.x
  • Kafka built with Scala 2.11.x
  • Python 3.6.x or 3.7.x

QuickStart

If you use conda for Python, install the requirements with `./start.sh`. If you use pip instead, run `pip install -r requirements.txt`.

Run the Application

  • Start Zookeeper and Kafka
`bin/zookeeper-server-start.sh config/zookeeper.properties`
`bin/kafka-server-start.sh config/server.properties`
  • Produce data into the `sf.crimes` topic (a minimal producer sketch follows this list)
`python kafka_server.py`
  • Submit the Spark streaming job:
`spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 --master local[*] data_stream.py`
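
A minimal sketch of what the producer step might look like, assuming the kafka-python library; the input file name and record layout are illustrative, and the actual kafka_server.py may differ:

```python
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize each record dict to a JSON-encoded byte string.
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# Hypothetical input file; substitute the dataset file from the repo.
with open("police-department-calls-for-service.json") as f:
    records = json.load(f)

for record in records:
    producer.send("sf.crimes", value=record)  # topic name from this README
    time.sleep(0.1)  # throttle to simulate a live stream of events

producer.flush()
```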

Results

Kafka Consumer Console Output

[Screenshot: Kafka consumer console output]

Progress Reporter

[Screenshot: progress report output]

Spark UI

[Screenshot: Spark UI]

Question 1

How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

After tuning the application with different parameter values, processedRowsPerSecond increased and the time needed to process each micro-batch decreased.
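
These figures come from Spark's built-in progress reporting. A sketch of how they can be polled, assuming a running StreamingQuery handle from data_stream.py (the variable name `query` is illustrative):

```python
import time

def report_progress(query, interval_s=10):
    """Periodically print throughput and batch duration for a running query."""
    while query.isActive:
        progress = query.lastProgress  # dict for the latest micro-batch, or None
        if progress:
            rate = progress["processedRowsPerSecond"]
            batch_ms = progress["durationMs"]["triggerExecution"]
            print(f"processedRowsPerSecond={rate}, batch duration={batch_ms} ms")
        time.sleep(interval_s)
```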

Question 2

What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

Based on testing various values, I found that changing maxRatePerPartition and spark.sql.shuffle.partitions had the clearest impact on throughput and latency.

The most efficient values were:

  • maxRatePerPartition = 200
  • spark.sql.shuffle.partitions = 2

Smaller values of maxRatePerPartition yielded a lower processedRowsPerSecond, while values larger than 200 did not increase processedRowsPerSecond significantly.

Changing spark.sql.shuffle.partitions had the biggest impact on performance. The default value of 200 was not optimal: data was shuffled heavily, and since shuffling is expensive, each micro-batch took 6 to 9 seconds to process. Reducing this value improved micro-batch processing time. The optimal value was 2, which brought micro-batch processing time down to 200 to 300 milliseconds. This value is optimal because the underlying RDD then has the same number of partitions as the Kafka topic (2), so no extra shuffling happened.
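
A sketch of where these two values plug in, assuming PySpark; maxRatePerPartition is passed under the name used in this write-up, and the rest is illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("sf-crime-statistics")
    # Two shuffle partitions to match the topic's two Kafka partitions.
    .config("spark.sql.shuffle.partitions", 2)
    .getOrCreate()
)

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sf.crimes")
    # Per-partition rate limit, using the parameter name from this write-up.
    .option("maxRatePerPartition", 200)
    .load()
)
```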
