Before running the program you need to (1) install the gcloud CLI from Google Cloud, (2) initialize it, and (3) set up Application Default Credentials.
On Debian-based systems, first add the Google Cloud SDK distribution URI as a package source
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
Import the public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
Update the package list
sudo apt update
and install the SDK
sudo apt install google-cloud-sdk
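You can check that the installation succeeded by printing the installed version, for example:
gcloud --version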
To initialize it run
gcloud init
Follow the steps: you will be prompted to log in with your Google account from the browser and to select an existing project or create a new one.
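If you want to double-check the result of the initialization later, gcloud can print the active account and project:
gcloud config list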
Run the command
gcloud auth application-default login
and add GOOGLE_APPLICATION_CREDENTIALS=/home/{username}/.config/gcloud/application_default_credentials.json to your environment variables.
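How you set the variable depends on your shell; as a minimal sketch for bash, you could append an export line to your profile (replace {username} with your actual user):
echo 'export GOOGLE_APPLICATION_CREDENTIALS=/home/{username}/.config/gcloud/application_default_credentials.json' >> ~/.bashrc
source ~/.bashrc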
Create a virtual environment (recommended) with python -m venv .venv and activate it with source .venv/bin/activate (on Windows, .\.venv\Scripts\activate). Then install the requirements with pip install -r requirements.txt.
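As a quick sanity check of the environment (assuming PySpark is among the pinned requirements, which is not stated explicitly above), you can try importing it:
python -c "import pyspark; print(pyspark.__version__)"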
To get a list of available commands run
python -m spark_data_analysis --help
Each question can be run separately; for example, the following command will run the first two questions:
python -m spark_data_analysis questions -n 1 2
To run all questions, use the following command:
python -m spark_data_analysis questions -all
You can choose to run either the Spark Streaming application or the questions. To run the Spark Streaming demo, use the following command:
python -m spark_data_analysis -p 1 streaming
The -p 1 option tells the program to use only one part of the dataset.
To run the Spark Streaming application you need to have Kafka running. For this you can use the docker-compose.yml file provided in the docker directory. Starting from Kafka 3.3, Zookeeper is no longer a requirement for running a Kafka broker, so we omit it in our compose file; we also use the native (faster, even if less feature-rich) build of Kafka. You can start the services manually with the following command:
docker compose up -d
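To verify that the broker came up, you can list the compose services and inspect their logs (assuming the file is at docker/docker-compose.yml relative to the project root):
docker compose -f docker/docker-compose.yml ps
docker compose -f docker/docker-compose.yml logs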
By default, however, the demo will automatically start the services for you unless you specify the --no-docker flag:
python -m spark_data_analysis streaming --no-docker
To compile the report, use the following command:
python -m spark_data_analysis --report