439 changes: 439 additions & 0 deletions ai-ml/AI001/sample_records.json

Large diffs are not rendered by default.

122 changes: 122 additions & 0 deletions ai-ml/README.md
@@ -0,0 +1,122 @@
# AI/ML GitHub Workflow Guide

This guide outlines how the AI/ML team should use GitHub for all project tasks.

---

## Branch Structure

- `main` → final stable version (**do not work here**)
- `dev` → integration branch (**do not work here directly**)
- `<task-branch>` → branch created for each assigned task (for example, `ai-ml/ai005-synthetic-data`)
- `<your-branch>` → your personal working branch, created from the task branch

---

## Workflow

### 1. Start from your assigned task branch

Examples:
- `ai-ml/ai003-data-cleaning`
- `ai-ml/ai004-feature-engineering`
- `ai-ml/ai005-synthetic-data`

Pull the latest version:

```bash
git fetch origin
git checkout <task-branch>
git pull origin <task-branch>
```

---

### 2. Create your own personal branch

Create your branch **from the task branch** using:

```bash
git checkout -b ai-ml/<task-id>/<your-name>-<short-description>
```

Example:

```bash
git checkout -b ai-ml/ai005/john-scenario-logic
```

Recommended naming format:

```text
ai-ml/<task-id>/<your-name>-<short-description>
```

Examples:

```text
ai-ml/ai001/john-scenario-logic
ai-ml/ai002/sarah-data-generator
ai-ml/ai003/alex-validation
```
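
The naming format above can also be checked mechanically, for example before pushing. The following is a minimal sketch, assuming task IDs always look like `ai` followed by three digits (as in the examples) and that names and descriptions use lowercase letters, digits, and hyphens. `is_valid_branch_name` is a hypothetical helper, not something this repository provides.

```python
import re

# Assumed shape: ai-ml/<task-id>/<your-name>-<short-description>
# The task-id and character rules are inferred from the examples in this guide.
BRANCH_PATTERN = re.compile(r"ai-ml/ai\d{3}/[a-z]+(-[a-z0-9]+)+")


def is_valid_branch_name(name: str) -> bool:
    """Return True if the name matches the assumed personal-branch format."""
    return BRANCH_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("ai-ml/ai005/john-scenario-logic", "ai-ml/ai005-synthetic-data"):
        print(candidate, is_valid_branch_name(candidate))
```

Under these assumptions a task branch such as `ai-ml/ai005-synthetic-data` is rejected, which helps catch commits made directly on the task branch.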

---

### 3. Do your work and push

```bash
git add .
git commit -m "add initial task logic"
git push -u origin <your-branch>
```

---

### 4. Open a Pull Request (PR)

Create a PR with:

* **base:** task branch
* **compare:** your branch

Example:

```text
ai-ml/ai005/john-scenario-logic → ai-ml/ai005-synthetic-data
```

Task leads have access to review and merge PRs into the task branch.

Once the task is complete, the task branch will be merged into `dev`.

---

## Important Rules

* do **not** work directly on `main`
* do **not** work directly on `dev`
* do **not** work directly on the task branch
* always create your own branch
* always open your PR against the task branch
* only task leads should merge PRs into the task branch

---

## Workflow Summary

```text
your branch → task branch → dev → main
```

---

## Need Help?

If you face any issues with:

* permissions
* merge conflicts
* branch creation
* PR approvals

please reach out to the AI/ML lead.
21 changes: 21 additions & 0 deletions ai-ml/cleaning/ai003/README.md
@@ -0,0 +1,21 @@
# AI003 - Workstream 3

This folder contains starter work for AI003 Workstream 3: logging, testing, before-vs-after comparison, and documentation.

## Files
- `test_data.csv` - dummy dataset with common data issues
- `logging_utils.py` - helper functions for logging transformations
- `comparison.py` - compares dataset quality before and after cleaning
- `run_demo.py` - demo script to test logging and comparison flow
- `documentation.md` - documentation notes for AI003

## Current Scope
This work is schema-independent and uses dummy CSV data for early development.

## Covered in Workstream 3
- logging rows removed
- logging missing values found
- logging simple transformations
- before vs after comparison
- test dataset preparation
- initial documentation
6 changes: 6 additions & 0 deletions ai-ml/cleaning/ai003/cleaned_output.csv
@@ -0,0 +1,6 @@
id,timestamp,location,event_type,severity,status
1,2026-03-24 10:00:00,Melbourne,phishing,5,open
2,,Sydney,phishing,8,open
3,24/03/2026,melbourne,misinformation,11,closed
4,2026/03/25 09:30,Brisbane,scam,-1,open
5,2026-03-25T12:00:00,,phishing,4,
17 changes: 17 additions & 0 deletions ai-ml/cleaning/ai003/comparison.py
@@ -0,0 +1,17 @@
import pandas as pd


def dataset_summary(df: pd.DataFrame) -> dict:
    return {
        "rows": len(df),
        "columns": len(df.columns),
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def compare_before_after(before_df: pd.DataFrame, after_df: pd.DataFrame) -> dict:
    return {
        "before": dataset_summary(before_df),
        "after": dataset_summary(after_df),
    }
53 changes: 53 additions & 0 deletions ai-ml/cleaning/ai003/documentation.md
@@ -0,0 +1,53 @@
# AI003 Documentation Notes

## 1. Task Overview
Task ID: AI003
Task Name: Data Cleaning Pipeline Logic
Workstream: 3 - Logging, Testing, and Documentation

## 2. Objective
Support the reusable data cleaning pipeline by:
- tracking transformations
- preparing test datasets
- comparing before vs after outputs
- documenting cleaning behaviour

## 3. Input Data Description
Dataset Name: Dummy AI003 test dataset
Source: Synthetic / manually created
Format: CSV
Fields:
- id
- timestamp
- location
- event_type
- severity
- status

## 4. Identified Data Issues
- Missing timestamp values
- Missing location/status values
- Duplicate rows
- Timestamp inconsistencies
- Categorical inconsistencies (`phishing`, `Phishing`, `phish`)
- Invalid severity values
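
The timestamp inconsistencies listed above could be handled by trying each known format in turn. This is a minimal sketch using only the standard library, not the pipeline's actual logic. The format list is taken from the values in the dummy dataset, and reading `24/03/2026` as day-first is an assumption. `normalise_timestamp` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional

# Formats observed in the dummy dataset. Treating "24/03/2026" as
# day/month/year is an assumption, not something the schema confirms.
KNOWN_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%Y/%m/%d %H:%M",
    "%d/%m/%Y",
)


def normalise_timestamp(value: Optional[str]) -> Optional[str]:
    """Return an ISO 8601 string, or None if the value is missing or unparseable."""
    if not value or not value.strip():
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).isoformat()
        except ValueError:
            continue  # try the next known format
    return None
```

Returning `None` for unparseable values keeps missing and malformed timestamps visible to the null-count logging instead of silently guessing a date.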

## 5. Logging & Traceability
Track:
- rows removed
- nulls found
- category normalisation
- other transformations
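
If the tracked items above need to be filtered or counted later, a structured record may be easier to work with than a formatted string. The sketch below is one possible shape, not the format used by `logging_utils.py`. `make_log_entry` and the `details` key are hypothetical names.

```python
from datetime import datetime, timezone


def make_log_entry(step: str, **details) -> dict:
    """Build one structured log record for a cleaning step (hypothetical format)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "step": step,
        "details": details,
    }


if __name__ == "__main__":
    audit_log = [
        make_log_entry("missing_values", null_values_found=4),
        make_log_entry("remove_duplicates", rows_removed=1),
    ]
    for entry in audit_log:
        print(entry)
```

Keeping step-specific values under `details` avoids clashes with the fixed `timestamp` and `step` keys.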

## 6. Before vs After Comparison
Compare:
- row count
- column count
- missing values
- duplicate rows
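
The four metrics above can be reduced to a single set of deltas, which makes a cleaning run easier to read at a glance. This is a minimal sketch assuming pandas is available. `summarise` mirrors the metrics listed here, and `summary_delta` is a hypothetical helper that does not exist in `comparison.py`.

```python
import pandas as pd


def summarise(df: pd.DataFrame) -> dict:
    """Compute the four comparison metrics for one DataFrame."""
    return {
        "rows": len(df),
        "columns": len(df.columns),
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def summary_delta(before_df: pd.DataFrame, after_df: pd.DataFrame) -> dict:
    """Return after minus before for each metric (negative means a reduction)."""
    before, after = summarise(before_df), summarise(after_df)
    return {key: after[key] - before[key] for key in before}
```

A negative `rows` delta with a matching negative `duplicate_rows` delta suggests that only duplicates were dropped, while a larger row loss would be worth investigating.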

## 7. Testing
A dummy CSV dataset is used to simulate common data quality issues.

## 8. Notes
This work is currently schema-independent and will later integrate with AI001 once the schema is finalised.
18 changes: 18 additions & 0 deletions ai-ml/cleaning/ai003/logging_utils.py
@@ -0,0 +1,18 @@
from datetime import datetime


def log_message(step: str, details: str) -> str:
    timestamp = datetime.now().isoformat()
    return f"[{timestamp}] {step}: {details}"


def log_rows_removed(count: int) -> str:
    return log_message("remove_duplicates", f"rows_removed={count}")


def log_nulls_found(count: int) -> str:
    return log_message("missing_values", f"null_values_found={count}")


def log_other_transformations(details: str) -> str:
    return log_message("transformation", details)
39 changes: 39 additions & 0 deletions ai-ml/cleaning/ai003/run_demo.py
@@ -0,0 +1,39 @@
import pandas as pd
from logging_utils import log_rows_removed, log_nulls_found, log_other_transformations
from comparison import compare_before_after


def demo_clean(df: pd.DataFrame) -> pd.DataFrame:
    cleaned = df.copy()

    null_count = int(cleaned.isnull().sum().sum())
    print(log_nulls_found(null_count))

    duplicate_count = int(cleaned.duplicated().sum())
    cleaned = cleaned.drop_duplicates()
    print(log_rows_removed(duplicate_count))

    if "event_type" in cleaned.columns:
        cleaned["event_type"] = cleaned["event_type"].replace({
            "Phishing": "phishing",
            "phish": "phishing"
        })
        print(log_other_transformations("normalised event_type values"))

    return cleaned


def main():
    before_df = pd.read_csv("test_data.csv")
    after_df = demo_clean(before_df)

    result = compare_before_after(before_df, after_df)
    print("\nBefore vs After Summary")
    print(result)

    after_df.to_csv("cleaned_output.csv", index=False)
    print("\nSaved cleaned_output.csv")


if __name__ == "__main__":
    main()
7 changes: 7 additions & 0 deletions ai-ml/cleaning/ai003/test_data.csv
@@ -0,0 +1,7 @@
id,timestamp,location,event_type,severity,status
1,2026-03-24 10:00:00,Melbourne,phishing,5,open
2,,Sydney,Phishing,8,open
2,,Sydney,Phishing,8,open
3,24/03/2026,melbourne,misinformation,11,closed
4,2026/03/25 09:30,Brisbane,scam,-1,open
5,2026-03-25T12:00:00,,phish,4,
21 changes: 21 additions & 0 deletions ai-ml/cleaning/logging/README.md
@@ -0,0 +1,21 @@
# AI003 - Workstream 3

This folder contains starter work for AI003 Workstream 3: logging, testing, before-vs-after comparison, and documentation.

## Files
- `test_data.csv` - dummy dataset with common data issues
- `logging_utils.py` - helper functions for logging transformations
- `comparison.py` - compares dataset quality before and after cleaning
- `run_demo.py` - demo script to test logging and comparison flow
- `documentation.md` - documentation notes for AI003

## Current Scope
This work is schema-independent and uses dummy CSV data for early development.

## Covered in Workstream 3
- logging rows removed
- logging missing values found
- logging simple transformations
- before vs after comparison
- test dataset preparation
- initial documentation
6 changes: 6 additions & 0 deletions ai-ml/cleaning/logging/cleaned_output.csv
@@ -0,0 +1,6 @@
id,timestamp,location,event_type,severity,status
1,2026-03-24 10:00:00,Melbourne,phishing,5,open
2,,Sydney,phishing,8,open
3,24/03/2026,melbourne,misinformation,11,closed
4,2026/03/25 09:30,Brisbane,scam,-1,open
5,2026-03-25T12:00:00,,phishing,4,
17 changes: 17 additions & 0 deletions ai-ml/cleaning/logging/comparison.py
@@ -0,0 +1,17 @@
import pandas as pd


def dataset_summary(df: pd.DataFrame) -> dict:
    return {
        "rows": len(df),
        "columns": len(df.columns),
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def compare_before_after(before_df: pd.DataFrame, after_df: pd.DataFrame) -> dict:
    return {
        "before": dataset_summary(before_df),
        "after": dataset_summary(after_df),
    }