
# Spark-Data-Analysis

## Preliminaries

Before being able to run the program you need to:

1. Install the gcloud CLI from Google Cloud.
2. Initialize it.
3. Set up a default application login.

### Installing the gcloud CLI

On Debian-based systems, first add the Google Cloud SDK distribution URI as a package source:

```bash
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
```

Import the public key:

```bash
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
```

Update the package list:

```bash
sudo apt update
```

and install the SDK:

```bash
sudo apt install google-cloud-sdk
```
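You can optionally verify that the CLI was installed and is on your `PATH`:

```bash
gcloud --version
```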

### Initializing the CLI

To initialize it, run:

```bash
gcloud init
```

Follow the steps: you will be prompted to log in with your Google account in the browser and to select an existing project or create a new one.
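If you want to double-check which account and project ended up active, you can print the current configuration:

```bash
gcloud config list
```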

### Setting up a default application login

Run the command:

```bash
gcloud auth application-default login
```

and add `GOOGLE_APPLICATION_CREDENTIALS=/home/{username}/.config/gcloud/application_default_credentials.json` to your environment variables.
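One way to do this, assuming a bash shell, is to append an `export` line to your `~/.bashrc` (replace `{username}` with your own user name):

```bash
# Substitute your own user name for {username}
echo 'export GOOGLE_APPLICATION_CREDENTIALS="/home/{username}/.config/gcloud/application_default_credentials.json"' >> ~/.bashrc
source ~/.bashrc
```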

## Creating a new virtual environment

Create a virtual environment (recommended) with `python -m venv .venv` and activate it with `source .venv/bin/activate` (on Windows, `.\.venv\Scripts\activate`). Then install the requirements with `pip install -r requirements.txt`. The same steps are shown as a single sequence below.
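On Linux/macOS the whole setup is:

```bash
python -m venv .venv               # create the virtual environment
source .venv/bin/activate          # activate it (.\.venv\Scripts\activate on Windows)
pip install -r requirements.txt    # install the project dependencies
```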

## Commands

To get a list of available commands, run:

```bash
python -m spark_data_analysis --help
```

## Questions

Each question can be run separately; for example, the following command will run the first two questions:

```bash
python -m spark_data_analysis questions -n 1 2
```

To run all questions, use the following command:

```bash
python -m spark_data_analysis questions -all
```

## Spark Streaming

You can choose to run either the Spark Streaming application or the questions. To run the Spark Streaming demo, use the following command:

```bash
python -m spark_data_analysis -p 1 streaming
```

`-p 1` tells the program to use only one part of the dataset.

## Kafka and Zookeeper

To run the Spark Streaming application you need Kafka running (traditionally together with Zookeeper). For this you can use the docker-compose.yml file provided in the docker directory. Starting from Kafka 3.3, Zookeeper is no longer required to run a Kafka broker, so we omit it from our compose file; we also use the native (faster, even if less feature-rich) version of Kafka. You can start the services manually with the following command:

```bash
docker compose up -d
```

By default, however, the demo will start the services for you automatically, unless you specify the `--no-docker` flag:

```bash
python -m spark_data_analysis streaming --no-docker
```
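Whether the containers were started by the demo or by hand, you can check that the broker service is up by listing the compose services (run from the directory containing the compose file):

```bash
# Run from the docker/ directory, where docker-compose.yml lives
docker compose ps
```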

## Compile the report

To compile the report, use the following command:

```bash
python -m spark_data_analysis --report
```