Hardhat-Enterprises · digrajsaini100 · Mar 27, 2026 · Mar 30, 2026
@@ -0,0 +1,122 @@
+# AI/ML GitHub Workflow Guide
+
+This guide outlines how the AI/ML team should use GitHub for all project tasks.
+
+---
+
+## Branch Structure
+
+- `main` → final stable version (**do not work here**)
+- `dev` → integration branch (**do not work here directly**)
+- `task branch` → branch created for each assigned task
+- `your branch` → your personal working branch created from the task branch
+
+---
+
+## Workflow
+
+### 1. Start from your assigned task branch
+
+Examples:
+- `ai-ml/ai003-data-cleaning`
+- `ai-ml/ai004-feature-engineering`
+- `ai-ml/ai005-synthetic-data`
+
+Pull the latest version:
+
+```bash
+git fetch origin
+git checkout <task-branch>
+git pull origin <task-branch>
+```
+
+---
+
+### 2. Create your own personal branch
+
+Create your branch **from the task branch** using:
+
+```bash
+git checkout -b ai-ml/<task-id>/<your-name>-<short-description>
+```
+
+Example:
+
+```bash
+git checkout -b ai-ml/ai005/john-scenario-logic
+```
+
+Recommended naming format:
+
+```text
+ai-ml/<task-id>/<your-name>-<short-description>
+```
+
+Examples:
+
+```text
+ai-ml/ai001/john-scenario-logic
+ai-ml/ai002/sarah-data-generator
+ai-ml/ai003/alex-validation
+```
+
+---
+
+### 3. Do your work and push
+
+```bash
+git add .
+git commit -m "add initial task logic"
+git push -u origin <your-branch>
+```
+
+---
+
+### 4. Open a Pull Request (PR)
+
+Create a PR with:
+
+* **base:** task branch
+* **compare:** your branch
+
+Example:
+
+```text
+ai-ml/ai005/john-scenario-logic → ai-ml/ai005-synthetic-data
+```
+
+Task leads have access to review and merge PRs into the task branch.
+
+Once the task is complete, the task branch will be merged into `dev`.
+
+---
+
+## Important Rules
+
+* do **not** work directly on `main`
+* do **not** work directly on `dev`
+* do **not** work directly on the task branch
+* always create your own branch
+* always open a PR into the task branch
+* only task leads should merge PRs into the task branch
+
+---
+
+## Workflow Summary
+
+```text
+your branch → task branch → dev → main
+```
+
+---
+
+## Need Help?
+
+If you face any issues with:
+
+* permissions
+* merge conflicts
+* branch creation
+* PR approvals
+
+please reach out to the AI/ML lead.
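The personal-branch naming format in step 2 of the guide could be checked mechanically before pushing. The following is a minimal sketch, not part of the team's tooling: the helper name `is_valid_personal_branch` and the exact character rules (a task ID of `ai` plus three digits, a lowercase name, a lowercase hyphenated description) are assumptions inferred from the examples in the guide.

```python
import re

# Assumed pattern, inferred from the guide's examples:
#   ai-ml/<task-id>/<your-name>-<short-description>
BRANCH_PATTERN = re.compile(
    r"ai-ml/"
    r"(?P<task_id>ai\d{3})/"                         # e.g. ai005 (assumed: 'ai' + 3 digits)
    r"(?P<name>[a-z]+)-"                             # e.g. john (assumed: lowercase letters)
    r"(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)"     # e.g. scenario-logic
)


def is_valid_personal_branch(branch: str) -> bool:
    """Return True if the whole branch name matches the assumed format."""
    return BRANCH_PATTERN.fullmatch(branch) is not None


if __name__ == "__main__":
    for candidate in ("ai-ml/ai005/john-scenario-logic", "ai-ml/ai005-synthetic-data"):
        print(candidate, is_valid_personal_branch(candidate))
```

Task branches such as `ai-ml/ai005-synthetic-data` are rejected on purpose, since they have no `<task-id>/` segment; the pattern would need loosening if the team allows other characters in names.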
@@ -0,0 +1,21 @@
+# AI003 - Workstream 3
+
+This folder contains starter work for AI003 Workstream 3: logging, testing, before-vs-after comparison, and documentation.
+
+## Files
+- `test_data.csv` - dummy dataset with common data issues
+- `logging_utils.py` - helper functions for logging transformations
+- `comparison.py` - compares dataset quality before and after cleaning
+- `run_demo.py` - demo script to test logging and comparison flow
+- `documentation.md` - documentation notes for AI003
+
+## Current Scope
+This work is schema-independent and uses dummy CSV data for early development.
+
+## Covered in Workstream 3
+- logging rows removed
+- logging missing values found
+- logging simple transformations
+- before vs after comparison
+- test dataset preparation
+- initial documentation
@@ -0,0 +1,6 @@
+id,timestamp,location,event_type,severity,status
+1,2026-03-24 10:00:00,Melbourne,phishing,5,open
+2,,Sydney,phishing,8,open
+3,24/03/2026,melbourne,misinformation,11,closed
+4,2026/03/25 09:30,Brisbane,scam,-1,open
+5,2026-03-25T12:00:00,,phishing,4,
@@ -0,0 +1,17 @@
+import pandas as pd
+
+
+def dataset_summary(df: pd.DataFrame) -> dict:
+    return {
+        "rows": len(df),
+        "columns": len(df.columns),
+        "missing_values": int(df.isnull().sum().sum()),
+        "duplicate_rows": int(df.duplicated().sum()),
+    }
+
+
+def compare_before_after(before_df: pd.DataFrame, after_df: pd.DataFrame) -> dict:
+    return {
+        "before": dataset_summary(before_df),
+        "after": dataset_summary(after_df),
+    }
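The dictionary returned by `compare_before_after` holds two summaries side by side; a per-metric difference is often easier to read. Below is a minimal sketch that works on plain dictionaries, assuming a hypothetical helper `summary_delta`. The numbers in `example_result` are illustrative only, not measured output from the demo.

```python
def summary_delta(result: dict) -> dict:
    """Subtract each 'before' metric from the matching 'after' metric.

    `result` is expected to have the shape produced by compare_before_after:
    {"before": {...}, "after": {...}} with identical integer-valued keys.
    """
    before, after = result["before"], result["after"]
    return {key: after[key] - before[key] for key in before}


# Illustrative values only (hypothetical, shaped like compare_before_after output).
example_result = {
    "before": {"rows": 6, "columns": 6, "missing_values": 4, "duplicate_rows": 1},
    "after": {"rows": 5, "columns": 6, "missing_values": 3, "duplicate_rows": 0},
}

delta = summary_delta(example_result)
print(delta)
```

A negative value means the metric dropped after cleaning, so a removed duplicate shows up as `-1` for both `rows` and `duplicate_rows`.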
@@ -0,0 +1,53 @@
+# AI003 Documentation Notes
+
+## 1. Task Overview
+Task ID: AI003  
+Task Name: Data Cleaning Pipeline Logic  
+Workstream: 3 - Logging, Testing, and Documentation
+
+## 2. Objective
+Support the reusable data cleaning pipeline by:
+- tracking transformations
+- preparing test datasets
+- comparing before vs after outputs
+- documenting cleaning behaviour
+
+## 3. Input Data Description
+Dataset Name: Dummy AI003 test dataset  
+Source: Synthetic / manually created  
+Format: CSV  
+Fields:
+- id
+- timestamp
+- location
+- event_type
+- severity
+- status
+
+## 4. Identified Data Issues
+- Missing timestamp values
+- Missing location/status values
+- Duplicate rows
+- Timestamp inconsistencies
+- Categorical inconsistencies (`phishing`, `Phishing`, `phish`)
+- Invalid severity values
+
+## 5. Logging & Traceability
+Track:
+- rows removed
+- nulls found
+- category normalisation
+- other transformations
+
+## 6. Before vs After Comparison
+Compare:
+- row count
+- column count
+- missing values
+- duplicate rows
+
+## 7. Testing
+A dummy CSV dataset is used to simulate common data quality issues.
+
+## 8. Notes
+This work is currently schema-independent and will later integrate with AI001 once the schema is finalised.
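Two of the issues listed in section 4, timestamp inconsistencies and invalid severity values, could be detected with small helpers. The following is a sketch under stated assumptions: the four timestamp formats are taken from the dummy CSV, `24/03/2026` is read as day-first, and the valid severity range of 1 to 10 is a guess, since the notes do not define it. The names `parse_timestamp` and `is_valid_severity` are hypothetical.

```python
from datetime import datetime
from typing import Optional

# Formats observed in the dummy dataset (assumption: dd/mm/yyyy is day-first).
KNOWN_FORMATS = (
    "%Y-%m-%d %H:%M:%S",   # 2026-03-24 10:00:00
    "%Y-%m-%dT%H:%M:%S",   # 2026-03-25T12:00:00
    "%Y/%m/%d %H:%M",      # 2026/03/25 09:30
    "%d/%m/%Y",            # 24/03/2026
)


def parse_timestamp(raw: str) -> Optional[datetime]:
    """Try each known format in turn; return None for empty or unparseable input."""
    text = raw.strip()
    if not text:
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


def is_valid_severity(value: int, low: int = 1, high: int = 10) -> bool:
    """Check the severity against an ASSUMED inclusive range (default 1 to 10)."""
    return low <= value <= high
```

Once the AI001 schema is finalised, the format list and severity bounds should come from the schema rather than from these defaults.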
@@ -0,0 +1,18 @@
+from datetime import datetime
+
+
+def log_message(step: str, details: str) -> str:
+    timestamp = datetime.now().isoformat()
+    return f"[{timestamp}] {step}: {details}"
+
+
+def log_rows_removed(count: int) -> str:
+    return log_message("remove_duplicates", f"rows_removed={count}")
+
+
+def log_nulls_found(count: int) -> str:
+    return log_message("missing_values", f"null_values_found={count}")
+
+
+def log_other_transformations(details: str) -> str:
+    return log_message("transformation", details)
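The helpers in `logging_utils.py` return formatted strings and leave it to the caller to print or store them. One possible way to keep a trace of a whole cleaning run is to collect the entries in memory. This is only a sketch: the class `TransformationLog` is hypothetical, and `log_message` is reproduced from the file above so the block runs on its own.

```python
from datetime import datetime
from typing import List


def log_message(step: str, details: str) -> str:
    # Same behaviour as logging_utils.log_message.
    timestamp = datetime.now().isoformat()
    return f"[{timestamp}] {step}: {details}"


class TransformationLog:
    """Hypothetical collector that keeps every log entry of one cleaning run."""

    def __init__(self) -> None:
        self.entries: List[str] = []

    def record(self, step: str, details: str) -> str:
        """Format an entry, store it, and return it so it can still be printed."""
        entry = log_message(step, details)
        self.entries.append(entry)
        return entry

    def as_text(self) -> str:
        """Join all entries, one per line, ready to be written to a file."""
        return "\n".join(self.entries)


if __name__ == "__main__":
    run_log = TransformationLog()
    print(run_log.record("missing_values", "null_values_found=4"))
    print(run_log.record("remove_duplicates", "rows_removed=1"))
```

Python's standard `logging` module would be a reasonable alternative if the pipeline later needs log levels or file handlers.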
@@ -0,0 +1,39 @@
+import pandas as pd
+from logging_utils import log_rows_removed, log_nulls_found, log_other_transformations
+from comparison import compare_before_after
+
+
+def demo_clean(df: pd.DataFrame) -> pd.DataFrame:
+    cleaned = df.copy()
+
+    null_count = int(cleaned.isnull().sum().sum())
+    print(log_nulls_found(null_count))
+
+    duplicate_count = int(cleaned.duplicated().sum())
+    cleaned = cleaned.drop_duplicates()
+    print(log_rows_removed(duplicate_count))
+
+    if "event_type" in cleaned.columns:
+        cleaned["event_type"] = cleaned["event_type"].replace({
+            "Phishing": "phishing",
+            "phish": "phishing"
+        })
+        print(log_other_transformations("normalised event_type values"))
+
+    return cleaned
+
+
+def main():
+    before_df = pd.read_csv("test_data.csv")
+    after_df = demo_clean(before_df)
+
+    result = compare_before_after(before_df, after_df)
+    print("\nBefore vs After Summary")
+    print(result)
+
+    after_df.to_csv("cleaned_output.csv", index=False)
+    print("\nSaved cleaned_output.csv")
+
+
+if __name__ == "__main__":
+    main()
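The `replace` mapping in `demo_clean` only matches the exact spellings `Phishing` and `phish`, so variants such as `PHISHING` or ` phish ` would pass through unchanged. A minimal sketch of a more tolerant rule is shown below, assuming a hypothetical helper `normalise_event_type` and an alias table that contains only the one abbreviation seen in the test data.

```python
from typing import Optional

# Assumed alias table: only 'phish' appears in the dummy dataset.
ALIASES = {"phish": "phishing"}


def normalise_event_type(value: Optional[str]) -> Optional[str]:
    """Trim whitespace, lowercase, then map known aliases to a canonical label."""
    if value is None:
        return None
    key = value.strip().lower()
    return ALIASES.get(key, key)


if __name__ == "__main__":
    for raw in ("Phishing", " phish ", "scam", None):
        print(repr(raw), "->", repr(normalise_event_type(raw)))
```

In pandas, the same idea could be applied to a string column with `cleaned["event_type"].str.strip().str.lower().replace(ALIASES)`; whether lowercasing every category is acceptable depends on the final schema.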
@@ -0,0 +1,7 @@
+id,timestamp,location,event_type,severity,status
+1,2026-03-24 10:00:00,Melbourne,phishing,5,open
+2,,Sydney,Phishing,8,open
+2,,Sydney,Phishing,8,open
+3,24/03/2026,melbourne,misinformation,11,closed
+4,2026/03/25 09:30,Brisbane,scam,-1,open
+5,2026-03-25T12:00:00,,phish,4,
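As a sanity check on this version of `test_data.csv`, the empty cells and exact duplicate rows can be counted with the standard library alone. This is a sketch for verification only; the function name `count_issues` is hypothetical, and the CSV text is copied from the hunk above.

```python
import csv
import io

CSV_TEXT = """id,timestamp,location,event_type,severity,status
1,2026-03-24 10:00:00,Melbourne,phishing,5,open
2,,Sydney,Phishing,8,open
2,,Sydney,Phishing,8,open
3,24/03/2026,melbourne,misinformation,11,closed
4,2026/03/25 09:30,Brisbane,scam,-1,open
5,2026-03-25T12:00:00,,phish,4,
"""


def count_issues(csv_text: str) -> dict:
    """Count data rows, empty cells, and exact duplicate rows in CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    rows = [tuple(row) for row in reader]
    empty_cells = sum(cell.strip() == "" for row in rows for cell in row)
    duplicate_rows = len(rows) - len(set(rows))
    return {
        "rows": len(rows),
        "empty_cells": empty_cells,
        "duplicate_rows": duplicate_rows,
    }


issues = count_issues(CSV_TEXT)
print(issues)  # {'rows': 6, 'empty_cells': 4, 'duplicate_rows': 1}
```

The check treats rows as duplicates only when every field matches exactly, so `Melbourne` and `melbourne` are still counted as different values.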