62 commits
7b8816d
Add Berlin Marathon dataset (1974-2019) to raw folder
irmacfw Sep 22, 2025
67dd802
Merge pull request #1 from jomachase/irma
irmacfw Sep 22, 2025
dc2e66d
Add Berlin Marathon dataset (1974-2024)
RCabral91 Sep 22, 2025
7c68053
Merge pull request #2 from jomachase/Rafael
RCabral91 Sep 22, 2025
a0f78c9
Updated yaml file with relative path and JupLab with read function ap…
KmkGska Sep 22, 2025
5ba089b
Updated files and Notebook
RCabral91 Sep 23, 2025
13d3d69
Merge pull request #4 from jomachase/Rafael
RCabral91 Sep 23, 2025
c9aef0d
Merge pull request #3 from jomachase/kinga
KmkGska Sep 23, 2025
faeb10c
Files cleanup
jomachase Sep 23, 2025
1fff266
Merge branch 'main' into joma
jomachase Sep 23, 2025
4e1cc09
Merge pull request #5 from jomachase/joma
jomachase Sep 23, 2025
e482201
Few updates
RCabral91 Sep 23, 2025
faa8cbc
Merge branch 'main' into Rafael
RCabral91 Sep 23, 2025
0fc57bc
Local changes: new CSVs, notebooks and adjustments in config
irmacfw Sep 23, 2025
2a6b7c0
Merge pull request #6 from jomachase/irma
irmacfw Sep 23, 2025
154a4bc
Update notebook with charts
RCabral91 Sep 23, 2025
0712a64
Merge pull request #7 from jomachase/Rafael
RCabral91 Sep 23, 2025
1ace869
Reorganization of project files
jomachase Sep 23, 2025
cbdbd80
Update notebook and clean data
irmacfw Sep 23, 2025
aafe334
Merge branch 'main' into irma
irmacfw Sep 23, 2025
4a88a5b
Merge pull request #8 from jomachase/irma
irmacfw Sep 23, 2025
3329863
2_Gender_gap_analysis_v1
KmkGska Sep 23, 2025
010c4ad
Update charts in the notebook
RCabral91 Sep 23, 2025
a22a004
Merge pull request #9 from jomachase/Rafael
RCabral91 Sep 23, 2025
0267cba
Merge pull request #10 from jomachase/kinga
KmkGska Sep 24, 2025
11a0d26
Solved a couple of conflict with notebooks
jomachase Sep 23, 2025
f1daa88
Solved several conflicts
jomachase Sep 24, 2025
dabe6bb
Merge pull request #11 from jomachase/joma
jomachase Sep 24, 2025
eb9e787
Files reorganization
irmacfw Sep 24, 2025
2fd6f5c
Merge branch 'main' into irma
irmacfw Sep 24, 2025
507af6b
Merge pull request #12 from jomachase/irma
irmacfw Sep 24, 2025
8def945
Updated JupNot 2_Gender_gap_analysis file and corresponding figure ex…
KmkGska Sep 24, 2025
903e554
chore: move functions.py into src folder
irmacfw Sep 24, 2025
1434b2c
added Readme, functions.py, deleted figures, create src folder
irmacfw Sep 24, 2025
aed4937
updated project structure in README with src folder
irmacfw Sep 24, 2025
644ad22
Merge pull request #14 from jomachase/irma
irmacfw Sep 24, 2025
a34faf5
Updating the Joma and Marathon_webscrape files
jomachase Sep 25, 2025
0fe0d85
Updating changes to status of first_project files
jomachase Sep 25, 2025
c90e4e5
Merge pull request #16 from jomachase/kinga
KmkGska Sep 25, 2025
1eaadfb
Conflict with Irma's notebook solved
jomachase Sep 25, 2025
888b8a0
Modfied README.md file
jomachase Sep 25, 2025
f1891b8
Modified notebooks
jomachase Sep 25, 2025
abc7499
Added figures in the figures folder
jomachase Sep 25, 2025
feaffa6
Merge pull request #17 from jomachase/joma
jomachase Sep 25, 2025
5d92b84
Final version of 2_Gender_time_difference_analysis_Kinga jup not
KmkGska Sep 25, 2025
d6cfbc2
Final analysis time difference
KmkGska Sep 25, 2025
3f6849c
Merge pull request #18 from jomachase/kinga
KmkGska Sep 25, 2025
1567ece
merged functions.py and update notebooks
irmacfw Sep 25, 2025
22be872
Merge branch 'main' into irma
irmacfw Sep 25, 2025
cff2938
Merge pull request #19 from jomachase/irma
irmacfw Sep 25, 2025
3c494b6
Update README.md
KmkGska Sep 25, 2025
1d046e9
Delete functions.py and clean __pycache__
irmacfw Sep 25, 2025
4008be5
Merge pull request #20 from jomachase/irma
irmacfw Sep 25, 2025
4aa7f5a
Update the time on the charts
RCabral91 Sep 25, 2025
47ff8ac
Jup Kinga analysis revised
KmkGska Sep 25, 2025
9009010
Merge pull request #22 from jomachase/kinga
KmkGska Sep 25, 2025
2a0eac4
Delete notebooks/country_dominance.ipynb
irmacfw Sep 25, 2025
d65f445
Remove duplicate notebook country_dominance.ipynb, keep only country_…
irmacfw Sep 25, 2025
f78ff3f
Merge branch 'main' into Rafael
RCabral91 Sep 25, 2025
31fd079
Merge pull request #21 from jomachase/Rafael
RCabral91 Sep 25, 2025
54d52d2
Update Irma.ipynb, README, gitignore
irmacfw Sep 25, 2025
9e1f559
Merge branch 'main' into irma
irmacfw Sep 25, 2025
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -6,3 +6,6 @@ notebooks/.env
notebooks/.DS_Store
.DS_Store
*.in
# Ignore Python caches
__pycache__/
*.pyc
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.13
183 changes: 132 additions & 51 deletions README.md
@@ -1,77 +1,158 @@
# Project overview
...
# **Berlin Marathon Project**

# Installation
## **Introduction**
The Berlin Marathon is one of the world’s most prestigious long-distance running events, known for its fast course and record-breaking performances.

1. **Clone the repository**:
For this project, we positioned ourselves as a **data analytics company** hired by a **sports magazine** to analyze the Berlin Marathon’s history.
The goal is to provide insights that will support the magazine in writing a feature article about the marathon, focusing on performance trends, gender dynamics, country dominance, and female participation.

```bash
git clone https://github.com/YourUsername/repository_name.git
```
Throughout this project, we applied data wrangling, cleaning, and exploratory data analysis (EDA) techniques to answer key research questions and test hypotheses about the evolution of the Berlin Marathon.

2. **Install UV**
---

If you're a MacOS/Linux user type:
## **Datasets**
We used **two datasets**, both of which were cleaned and transformed before analysis:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
### **1. Berlin Marathon Runners Dataset (1974–2024)**
- Source: [Kaggle – Berlin Marathons Data](https://www.kaggle.com/datasets/aiaiaidavid/berlin-marathons-data)
- Shape after cleaning: ~884,000 rows, 5 columns (`year`, `gender`, `time`, `finish_time`, `finish_seconds`).
- Strengths: Covers all participants, allows analysis of participation and performance distributions.
- Weaknesses: Missing country data for many runners.

If you're a Windows user open an Anaconda Powershell Prompt and type :
### **2. Berlin Marathon Winners Dataset (1974–2024)**
- Source: [Wikipedia – Berlin Marathon](https://en.wikipedia.org/w/index.php?title=Berlin_Marathon&action=edit)
- Shape after cleaning: 100 rows, 5 columns (`year`, `winner`, `country`, `time`, `gender`).
- Strengths: Clean overview of winners per year, good for studying records and country dominance.
- Weaknesses: Only includes winners, does not represent the full participant population.

```bash
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
---

3. **Create an environment**
## **Research Questions**

```bash
uv venv
```
### **1. Evolution of performance (Rafael)**
- How have winning times evolved in the Berlin Marathon from 1974 to 2024?
- Are world records being broken more frequently in Berlin than in other marathons?

3. **Activate the environment**
### **2. Gender time score difference between female and male winners (Kinga)**
- What is the time difference between male and female winners?
- Has the gender time scores difference decreased over the years?

If you're a MacOS/Linux user type (if you're using a bash shell):

```bash
source ./venv/bin/activate
```
### **3. Country dominance (Irma)**
- Which countries have produced the most winners in the Berlin Marathon?
- How has the country distribution of winners shifted across decades?

If you're a MacOS/Linux user type (if you're using a csh/tcsh shell):
### **4. Female participation (Joma)**
- How has the percentage of female runners changed since 1974?
- Is the finishing time distribution of female runners approaching that of male runners?

```bash
source ./venv/bin/activate.csh
```
---

If you're a Windows user type:
## **Hypotheses**
1. Winning times in the Berlin Marathon have significantly improved over the years.
2. The time score difference between men and women has decreased over time.
3. The number of participants from Africa has increased over the years.
4. Female participation has increased since 1974.

```bash
.\venv\Scripts\activate
```
---

4. **Install dependencies**:
## **Methodology**
We applied multiple **data cleaning and wrangling techniques** to both datasets:

```bash
uv pip install -r requirements.txt
```
- Standardized column names to `snake_case`.
- Removed duplicates and irrelevant columns (`country`, `age` in the raw runners dataset).
- Handled missing values (nulls in `country`, `age`).
- Normalized categorical values (`gender`: male/female/unknown).
- Converted `time` to `timedelta` and created `finish_seconds`.
- Reshaped winners dataset (wide → long format) and added `gender` column.
- Saved final cleaned datasets to `data/clean/`.
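
The time-conversion step in the list above can be sketched roughly as follows. This is a minimal standard-library illustration, not the project's own code (which this README does not show). The function names `to_timedelta` and `to_finish_seconds` and the `HH:MM:SS` input format are assumptions.

```python
from datetime import timedelta


def to_timedelta(time_str: str) -> timedelta:
    """Parse an 'HH:MM:SS' finish time into a timedelta (input format is assumed)."""
    hours, minutes, seconds = (int(part) for part in time_str.strip().split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def to_finish_seconds(time_str: str) -> int:
    """Return the finish time in whole seconds, mirroring a `finish_seconds` column."""
    return int(to_timedelta(time_str).total_seconds())


if __name__ == "__main__":
    # Example value only; it is not read from the dataset.
    print(to_finish_seconds("02:01:09"))
```

Storing the duration as plain seconds next to the `timedelta` keeps later aggregation and plotting simple, since numeric columns can be averaged directly.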

# Questions
...
For analysis, we used:
- **Aggregation & filtering** with `groupby`, `value_counts`, and pivot tables.
- **Visualizations** with Matplotlib and Seaborn.
- Exported figures to `figures/` for presentation.
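
As a rough illustration of the aggregation step, the sketch below groups runners by year and computes the share of female finishers. It uses plain Python instead of the `groupby` call named above so that it runs without dependencies. The function name `female_share_by_year` and the toy records are hypothetical and are not taken from the marathon data.

```python
from collections import defaultdict

# Toy (year, gender) records invented for illustration; not real marathon data.
records = [
    (1974, "male"), (1974, "male"), (1974, "male"), (1974, "female"),
    (2024, "male"), (2024, "female"), (2024, "female"), (2024, "male"),
]


def female_share_by_year(rows):
    """Return {year: percentage of female runners} from (year, gender) pairs."""
    totals = defaultdict(int)
    females = defaultdict(int)
    for year, gender in rows:
        totals[year] += 1
        if gender == "female":
            females[year] += 1
    return {year: 100 * females[year] / totals[year] for year in sorted(totals)}


if __name__ == "__main__":
    print(female_share_by_year(records))
```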

# Dataset
...
---

## Main dataset issues
## **Results & Insights**

- ...
- ...
- ...
### **1. Evolution of performance (Rafael)**
- Winning times in the Berlin Marathon have consistently improved.
- Men’s performances have stabilized just above **2h01**.
- Women achieved a major breakthrough in **2023 with 2h11**.
- Berlin is the **leading marathon for world records**, hosting nearly all men’s records since 2003 and the women’s record in 2023.
- This far surpasses other cities like London or Chicago.

## Solutions for the dataset issues
...
### **2. Gender time scores difference (Kinga)**
- Time results changed over the years for both genders.
- The gender time scores difference has decreased.

# Conclussions
...

# Next steps
...
### **3. Country dominance (Irma)**
- **Kenya (25 wins)** and **Ethiopia (20 wins)** dominate the overall history.
- In early decades, winners came mainly from Europe (West Germany, UK).
- Since the 1990s, East African countries have taken the lead.

### **4. Female participation (Joma)**
- Female participation has increased explosively since 1974 (5.0%). There has been an average 40% increase in female participation. In 2024, there is a 45% increase.
- The gap between finishing times for female and male runners is steadily decreasing. In 1974, the gap was 61.3 min. As of 2024, it is 30.4 min. At the current rate, times will be equal in 49 years.
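
The 49-year figure is consistent with a simple linear extrapolation of the two gap values quoted above. The sketch below assumes the gap keeps shrinking at a constant rate, which is a simplification. The function name `years_until_equal` is hypothetical.

```python
def years_until_equal(start_year, start_gap, end_year, end_gap):
    """Years after end_year until the gap reaches zero, assuming a constant linear rate."""
    rate_per_year = (start_gap - end_gap) / (end_year - start_year)
    return end_gap / rate_per_year


if __name__ == "__main__":
    # Gap values in minutes, as quoted in the README (1974 and 2024).
    years = years_until_equal(1974, 61.3, 2024, 30.4)
    print(f"Gap closes in roughly {years:.0f} years")
```

A linear trend is the simplest possible model here; if the yearly gaps flatten out over time, the real convergence would take longer or might not occur at all.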

---

## **Conclusions**
- **Hypothesis 1:** Supported – Winning times have consistently improved; men are near the 2h01 barrier and women set a breakthrough with 2h11 in 2023. Berlin is confirmed as the fastest course, with nearly all world records since 2003.
- **Hypothesis 2:** Supported – The performance gap score between men and women has narrowed.
- **Hypothesis 3:** Supported – African participation and victories have grown strongly (Kenya and Ethiopia lead since the 1990s).
- **Hypothesis 4:** Supported – Female participation has risen steadily since 1974.

---

## **Future Questions**
- How do Berlin results compare to other World Marathon Majors (London, Boston, Tokyo)?
- What external factors (weather, temperature, technology, shoes) influence performance trends?
- How do amateur vs. elite trends differ?

---

## **Project Organization**

### 📂 Repository Structure
first_project/
├── 📂 data
│ ├── 📂 raw
│ └── 📂 clean
├── 📂 figures
├── 📂 notebooks
├── 📂 slides
├── 📂 sql_scripts

├── 📄 config.yaml
├── 📄 main.py
├── 📄 pyproject.toml
├── 📄 README.md
└── 📄 uv.lock

---

## **Teamwork**
- Project managed with **Trello**: [Berlin Marathon Project Trello](https://trello.com/invite/b/68d3cd5c7e316c4e2b2dd967/ATTI582c762f1ce0f9a2b5cf205744fc802d5600E419/berlin-marathon-data-analysis-project)
- GitHub collaboration with branches per member (Irma, Kinga, Rafael, Joma).
- Each member responsible for one research question.

---

## **Deliverables**
- Cleaned datasets in `/data/clean/`.
- Analysis notebooks in `/notebooks/`.
- Figures in `/figures/`.
- Final presentation: [Google Slides](https://docs.google.com/presentation/d/1ppWy2WtMmzKg47Q6yj-O25uKoOPrUDnsZRDllIabRcA/edit?usp=sharing)

---

## **Authors**
- Rafael – Analysis of Evolution of performance
- Kinga – Analysis of Gender gap
- Irma – Analysis of Country dominance
- Joma (Project Manager) – Analysis of Female participation
7 changes: 5 additions & 2 deletions config.yaml
@@ -1,5 +1,8 @@
input_data:
file: "../data/raw/raw_data_file.csv"
marathon_data: "../data/raw/Berlin_Marathon_data_1974_2019.csv"
marathon_winners: "../data/raw/berlin_marathon_winners_1974_2024.csv"

output_data:
file: "../data/clean/cleaned_data_file.csv"
cleaned_data: "../data/clean/cleaned_marathon.csv"
cleaned_winners: "../data/clean/cleaned_marathon_winners.csv"

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
,rafael,rafael-HP-Laptop-17-ca1xxx,23.09.2025 16:07,file:///home/rafael/.config/libreoffice/4;
70 changes: 70 additions & 0 deletions data/clean/Untitled.ipynb
@@ -0,0 +1,70 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "16d7efbc-c66e-4b2e-bca3-a1f7e6eae160",
"metadata": {},
"outputs": [
{
"ename": "ModuleNotFoundError",
"evalue": "No module named 'sqlalchemy'",
"output_type": "error",
"traceback": [
"\u001b[31m---------------------------------------------------------------------------\u001b[39m",
"\u001b[31mModuleNotFoundError\u001b[39m Traceback (most recent call last)",
"\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[1]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[34;01msqlalchemy\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m create_engine, text\n\u001b[32m 3\u001b[39m DB_USER = \u001b[33m\"\u001b[39m\u001b[33mroot\u001b[39m\u001b[33m\"\u001b[39m \u001b[38;5;66;03m# <-- your MySQL user\u001b[39;00m\n\u001b[32m 4\u001b[39m DB_PASS = \u001b[33m\"\u001b[39m\u001b[33mYOUR_PASSWORD\u001b[39m\u001b[33m\"\u001b[39m \u001b[38;5;66;03m# <-- your password\u001b[39;00m\n",
"\u001b[31mModuleNotFoundError\u001b[39m: No module named 'sqlalchemy'"
]
}
],
"source": [
"from sqlalchemy import create_engine, text\n",
"\n",
"DB_USER = \"root\" # <-- your MySQL user\n",
"DB_PASS = \"YOUR_PASSWORD\" # <-- your password\n",
"DB_HOST = \"localhost\"\n",
"DB_PORT = 3306\n",
"DB_NAME = \"berlin_marathon\"\n",
"\n",
"engine = create_engine(\n",
" f\"mysql+pymysql://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}\",\n",
" pool_pre_ping=True\n",
")\n",
"\n",
"# quick check\n",
"with engine.connect() as con:\n",
" print(\"Connected:\", con.execute(text(\"SELECT 1\")).scalar() == 1)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2cbb81c8-20d9-46d9-84eb-406b750e8b26",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Empty file removed data/clean/cleaned_data_file.csv
Empty file.