This project is designed to demonstrate the complete process of building a Data Warehouse, which involves several key stages:
- Data Collection and Settlement (OLTP - Online Transaction Processing): This involves gathering and storing raw data, typically in an operational database.
- Data Transformation and Integration (ETL - Extract, Transform, Load): The raw data is transformed and integrated into a unified format, making it suitable for analysis.
- Data Warehouse Design and Implementation (OLAP - Online Analytical Processing): The data warehouse is designed, usually before the ETL process, to support complex queries and reporting.
- Data Analysis and Visualisation (BI - Business Intelligence): The final stage aims to provide actionable insights, answer business questions, and support decision-making processes.
To illustrate these concepts, we use the fictitious Adventure Works Cycles
database, representing a multinational manufacturing company that sells bicycles and accessories.
To begin, clone the project repository:
git clone https://github.com/PascalSun/DW_2024.git
Familiarity with Git and GitHub is essential for software engineers. If you’re new to Git, here are some resources to get started:
Additionally, if you do not have a code editor, we recommend installing VS Code.
Ensure that Docker and Docker Compose are installed on your machine. These tools are necessary to set up the required databases and Jupyter Lab environment.
Start the services with the following command:
docker compose up -d
This will start the containers necessary for the project. You should have four running containers: pgdb
, pgadmin
, sqlserver
, and jupyterlab
. You can check this with:
docker ps
If any container is missing, check the logs for that container using:
docker logs -f <container_name>
To confirm that all four containers are running, use:
docker ps
You should see four containers: pgdb
, pgadmin
, sqlserver
, and jupyterlab
. If one is missing, check the logs for that container:
docker logs -f <container_name>
If you run the docker compose up
command in PowerShell or the terminal provided by Windows, you might encounter issues with the sqlserver
container due to line ending conflicts between Unix-based and Windows-based systems. To resolve this:
-
Open the project in VS Code.
-
Navigate to
OLTP/sqlserver/install.sh
andOLTP/sqlserver/startup.sh
. -
Change the line endings from
CRLF
toLF
(in the bottom-right corner of VS Code). -
Save the files and run:
docker compose up --build -d
For users with ARM architecture, the sqlserver
and jupyterlab
containers may not start properly due to the mcr.microsoft.com/mssql/server
image not supporting ARM architecture. To resolve this:
-
Enable
Use Rosetta for x86_64/amd64 emulation on Apple Silicon
in Docker settings:- Open Docker Desktop settings.
- Go to the General tab.
- Check the
Use Rosetta for x86_64/amd64 emulation on Apple Silicon
option. - Click
Apply & Restart
.
-
Clean the images and run:
docker compose up --build -d
While the project provides a jupyterlab
container within Docker for development, you may also choose to set up a local Python development environment:
-
Install Graphviz and its Python package for drawing ER diagrams:
- On macOS:
brew install graphviz
- Install the Python package:
pip install graphviz
- On macOS:
-
Create a virtual environment with Python 3.9 or higher:
- Using
venv
:python3 -m venv venv source venv/bin/activate
- Using
conda
:conda create -n dw python=3.9 conda activate dw
- Using
-
Install the project dependencies:
pip install -r requirements.txt
-
Run Jupyter Lab in the project directory:
jupyter lab
This project encourages you to explore both the Docker and local development environments, understand their setups, and be prepared to discuss the advantages and disadvantages of each approach. This exploration is valuable for understanding the full lifecycle of a data warehouse project, from data collection to analysis and visualisation.