This project runs statistical analyses on San Francisco crime data using Apache Spark Structured Streaming. The real-world dataset was extracted from Kaggle.
- Spark 2.4.3
- Scala 2.11.x
- Java 1.8.x
- Kafka built with Scala 2.11.x
- Python 3.6.x or 3.7.x
Install the requirements using `./start.sh` if you use conda for Python. If you use pip rather than conda, run `pip install -r requirements.txt` instead.
- Start Zookeeper and Kafka
`bin/zookeeper-server-start.sh config/zookeeper.properties`
`bin/kafka-server-start.sh config/server.properties`
- Produce data into sf.crimes topic
`python kafka_server.py`
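For context, a minimal sketch of what a producer like `kafka_server.py` might do, assuming the kafka-python library, a broker at `localhost:9092`, and the dataset saved as a local JSON file (the file name below is a placeholder):

```python
# Hypothetical sketch of a producer such as kafka_server.py (assumes kafka-python).
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize each record dict to a JSON-encoded byte string.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Placeholder file name; point this at the Kaggle dataset export.
with open("police-department-calls-for-service.json") as f:
    records = json.load(f)

for record in records:
    producer.send("sf.crimes", record)
    time.sleep(0.01)  # throttle so the stream arrives gradually

producer.flush()
```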
- Submit the Spark Streaming job using the command:
`spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 --master local[*] data_stream.py`
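As a rough sketch, the core of `data_stream.py` presumably builds a streaming DataFrame over the `sf.crimes` topic along these lines (broker address and option values are assumptions, not the file's actual contents):

```python
# Hypothetical sketch of how data_stream.py might subscribe to the topic (PySpark 2.4.x).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("SFCrimeStatistics")
    .getOrCreate()
)

# Read the Kafka topic as an unbounded streaming DataFrame.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sf.crimes")
    .option("startingOffsets", "earliest")
    .load()
)
```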
How did changing values on the SparkSession property parameters affect the throughput and latency of the data?
After tuning the application with different parameter values, both throughput and latency improved: processedRowsPerSecond increased, and the time needed to process each micro-batch decreased.
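These metrics come from the streaming query's progress reports; a minimal sketch of how to inspect them, assuming `query` is the StreamingQuery handle returned by `writeStream.start()`:

```python
import json

# `query` is assumed to be the StreamingQuery returned by writeStream.start().
progress = query.lastProgress  # most recent micro-batch progress, as a dict
if progress:
    print(json.dumps(progress, indent=2))
    print("processedRowsPerSecond:", progress["processedRowsPerSecond"])
    print("batch duration (ms):", progress["durationMs"]["triggerExecution"])
```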
What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?
Based on testing various values for these parameters, I found that maxRatePerPartition and spark.sql.shuffle.partitions had the biggest impact on throughput and latency.
The most efficient values were:
- maxRatePerPartition = 200
- spark.sql.shuffle.partitions = 2
Smaller values for maxRatePerPartition yielded a lower processedRowsPerSecond, while values larger than 200 did not increase processedRowsPerSecond significantly.
Changing spark.sql.shuffle.partitions had the biggest impact on performance. The default value of 200 was not optimal, because the data was shuffled into far more partitions than the workload needed. Since shuffling is expensive, the time needed to process each micro-batch was very high (6 to 9 seconds). Reducing this value improved micro-batch processing time, and the optimal value turned out to be 2, bringing micro-batch processing time down to 200 to 300 milliseconds. This value is optimal because it matches the number of partitions of the Kafka topic (2), so the shuffle introduces no extra partitions and its overhead stays minimal.
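For reference, a hedged sketch of how these values might be applied when building the SparkSession. The full property name for the rate cap is assumed to be `spark.streaming.kafka.maxRatePerPartition`; in Structured Streaming, the per-trigger intake is often capped with the `maxOffsetsPerTrigger` source option instead:

```python
# Hypothetical sketch of applying the tuned values (property names assumed).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("SFCrimeStatistics")
    # Assumed full property name for the maxRatePerPartition setting above.
    .config("spark.streaming.kafka.maxRatePerPartition", 200)
    # Match the number of partitions of the sf.crimes topic.
    .config("spark.sql.shuffle.partitions", 2)
    .getOrCreate()
)
```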