Skip to content

Thangam-11/end_to_end_ml

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Complete ML Project Structure - Teaching Guide 🎓

What We're Building:

A Water Quality Classification System that predicts if water is safe to drink based on chemical parameters.

Why This Structure Matters:

  • Industry Standard: Used by companies like Google, Netflix, Uber
  • Maintainable: Easy to debug, extend, and collaborate
  • Scalable: Can handle growing complexity
  • Reproducible: Anyone can run and get same results

2. **Project Architecture

Complete File Structure:

ml_project/
├── config/
│   └── config.yaml              # 🔧 Configuration management
├── src/                         # 📦 Source code package
│   ├── __init__.py             # Makes it a Python package
│   ├── components/             # 🧩 Individual ML components
│   │   ├── __init__.py
│   │   ├── data_ingestion.py   # 📥 Load raw data
│   │   ├── data_cleaning.py    # 🧹 Clean and preprocess
│   │   ├── data_splitter.py    # ✂️ Train/test split + SMOTE
│   │   ├── model_trainer.py    # 🤖 Train multiple models
│   │   └── model_evaluator.py  # 📊 Evaluate performance
│   ├── pipeline/               # 🔄 Orchestration layer
│   │   ├── __init__.py
│   │   └── main_pipeline.py    # 🎯 Complete workflow
│   ├── config_manager.py       # ⚙️ YAML config loader
│   ├── logging_system.py       # 📝 Centralized logging
│   ├── exceptions.py           # ⚠️ Custom error handling
│   ├── feature_engineering.py  # 🔬 Create new features
│   └── feature_scaling.py      # 📏 Normalize features
├── artifacts/                   # 💾 Generated outputs
├── requirements.txt            # 📋 Dependencies
├── setup.py                   # 📦 Package installer
└── .gitignore                 # 🚫 Version control exclusions

3. Core Components Explanation

A. Data Ingestion (data_ingestion.py)

class DataIngestion:
    def load_data(self):
        data = pd.read_csv(self.data_path)
        # Save copy to artifacts/raw_data/
        return data

Purpose:

  • Loads raw CSV data
  • Creates backup in artifacts folder
  • Logs data loading status

B. Data Cleaning (data_cleaning.py)

class DataCleaning:
    def clean_data(self):
        self.convert_to_numeric()      # Convert text to numbers
        self.handle_missing_values()   # Fill missing data
        self.encode_categorical()      # One-hot encoding
        self.clean_target()           # Fix target column
        # Save to artifacts/processed_data/
        return cleaned_data

Key Features:

  • Comprehensive logging of every step
  • Handles missing values intelligently
  • Saves cleaned data for reproducibility

C. Data Splitting (data_splitter.py)

class DataSplitter:
    def split_and_save(self):
        X_train, X_test, y_train, y_test = train_test_split(...)
        # Apply SMOTE for class imbalance
        X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)
        # Save both splits
        return train_df, test_df

Advanced Features:

  • Stratified splitting (maintains class distribution)
  • SMOTE for handling imbalanced data
  • Saves train/test splits separately

D. Model Training (model_trainer.py)

class ModelTrainer:
    def train_models(self, X_train, y_train, X_test, y_test):
        models = {
            "LogisticRegression": LogisticRegression(),
            "RandomForest": RandomForestClassifier(),
            "XGBoost": XGBClassifier(),  # If available
        }
        
        results = []
        for name, model in models.items():
            model.fit(X_train, y_train)
            metrics = self._evaluate_model(...)
            results.append({...})
            
        # Save best model based on F1 score
        best_model = max(results, key=lambda x: x["test_metrics"]["f1_score"])
        joblib.dump(best_model["model"], "artifacts/models/best_model.pkl")
        
        return results

E. Model Evaluation (model_evaluator.py)

class ModelEvaluator:
    def calculate_metrics(self, y_true, y_pred, y_proba=None):
        return {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, average="weighted"),
            "recall": recall_score(y_true, y_pred, average="weighted"),
            "f1_score": f1_score(y_true, y_pred, average="weighted"),
            "roc_auc": roc_auc_score(y_true, y_proba[:, 1])
        }

4. **Support Systems **

A. Configuration Management (config_manager.py)

# config/config.yaml
models:
  random_forest:
    module: sklearn.ensemble
    class: RandomForestClassifier
    params:
      n_estimators: 200
      max_depth: 10
      random_state: 42

paths:
  raw_data: "data/raw_data/project_data.csv"
  trained_model: "models/trained_models/best_model.pkl"

Benefits:

  • No hardcoded values in Python files
  • Easy experimentation
  • Environment-specific configurations

B. Logging System (logging_system.py)

class MLProjectLogger:
    def log_pipeline_stage(self, stage, status):
        self.logger.info(f"PIPELINE_STAGE | {stage} | STATUS: {status}")
    
    def log_data_info(self, name, shape, **kwargs):
        info = {"rows": shape[0], "columns": shape[1]}
        self.logger.info(f"DATA_INFO | {name} | {info}")

Features:

  • Console + file logging
  • Structured log messages
  • Exception tracking with full stack traces

C. Custom Exceptions (exceptions.py)

class MLProjectException(Exception):
    def __init__(self, error_message: str, error_detail: Exception):
        self.error_message = error_message
        self.lineno = exc_tb.tb_lineno
        self.file_name = exc_tb.tb_frame.f_code.co_filename

Advantages:

  • Clear error messages with file/line info
  • Consistent error handling across project
  • Better debugging experience

5. Pipeline Orchestration (main_pipeline.py)

class WaterSafetyMLPipeline:
    def run_complete_pipeline(self):
        # 1. Load data
        raw_data = DataIngestion().load_data()
        
        # 2. Clean data
        cleaned_data = DataCleaning(raw_data).clean_data()
        
        # 3. Split data
        train_df, test_df = DataSplitter(cleaned_data).split_and_save()
        
        # 4. Train models
        results = ModelTrainer().train_models(X_train, y_train, X_test, y_test)
        
        # 5. Generate summary
        self._generate_results_summary(results)

Key Benefits:

  • Single Entry Point: Run entire pipeline with one command
  • Error Handling: Graceful failure with detailed logs
  • Results Summary: Automatic model comparison table

6. **Advanced Features Walkthrough **

A. Feature Engineering (feature_engineering.py)

class FeatureEngineering:
    def danger_flags(self):
        # Create binary flags for dangerous levels
        for col, threshold in self.thresholds.items():
            self.data[f"{col}_high"] = (self.data[col] > threshold).astype(int)
    
    def danger_count(self):
        # Count total dangerous parameters
        flags = [f"{c}_high" for c in self.thresholds]
        self.data["danger_count"] = self.data[flags].sum(axis=1)

B. Artifacts System

artifacts/
├── raw_data/           # Original data backup
├── processed_data/     # Cleaned data
├── data/              # Train/test splits
├── models/            # Trained models
└── evaluation_results/ # Performance metrics

Purpose:

  • Reproducibility: Save every intermediate step
  • Debugging: Inspect data at each stage
  • Model Serving: Easy access to trained models

7. **Interactive Demo Script *

Live Coding Session:

# 1. Show configuration loading
from src.config_manager import ConfigManager
config = ConfigManager()
print("Model configs:", config.get_all_models_config())

# 2. Demonstrate pipeline execution
from src.pipeline.main_pipeline import WaterSafetyMLPipeline
pipeline = WaterSafetyMLPipeline()
results = pipeline.run_complete_pipeline()

# 3. Show results
for result in results:
    print(f"{result['model_name']}: {result['test_metrics']['f1_score']:.4f}")

8. Key Teaching Moments

Professional Standards:

  1. Separation of Concerns: Each class has one responsibility
  2. DRY Principle: Don't Repeat Yourself (config management)
  3. Error Handling: Graceful failures with informative messages
  4. Documentation: Clear docstrings and comments
  5. Version Control: Proper .gitignore for Python projects

Industry Best Practices:

  1. Package Structure: Makes code importable and reusable
  2. Logging: Essential for production debugging
  3. Configuration: Environment-specific settings
  4. Artifacts: Reproducible experiments
  5. Model Comparison: Data-driven model selection

9. Hands-On Exercise (15 minutes)

Student Challenge:

  1. Modify Config: Add a new model (SVM) to config.yaml
  2. Update Trainer: Modify model_trainer.py to load SVM from config
  3. Test Pipeline: Run the complete pipeline with new model
  4. Check Results: Verify SVM appears in model comparison

Expected Learning:

  • How configuration drives behavior
  • Adding new models without changing core logic
  • Understanding pipeline flow
  • Interpreting evaluation results

10. **Real-World Connections *

This Structure Powers:

  • Netflix: Recommendation systems
  • Uber: Demand prediction
  • Google: Search ranking algorithms
  • Tesla: Autopilot vision systems

Career Relevance:

  • MLOps Engineer: Knows pipeline orchestration
  • Data Scientist: Understands end-to-end workflow
  • ML Engineer: Can productionize models
  • Software Engineer: Appreciates clean architecture

11. Troubleshooting Common Issues

Import Errors:

# Add to sys.path if needed
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), ".."))

Missing Directories:

os.makedirs("artifacts/models", exist_ok=True)

Configuration Not Found:

  • Check config/config.yaml exists
  • Verify paths in config file
  • Use absolute paths if needed

12. Next Steps & Extensions

Immediate Improvements:

  1. Add hyperparameter tuning (GridSearch/RandomSearch)
  2. Implement cross-validation
  3. Add feature selection methods
  4. Create prediction API with FastAPI

Advanced Extensions:

  1. MLOps: Docker containerization
  2. CI/CD: GitHub Actions for testing
  3. Monitoring: Model drift detection
  4. Deployment: AWS/Azure cloud deployment

Learning Path:

  1. Master this structure
  2. Learn MLOps tools (MLflow, DVC)
  3. Study deployment (Docker, Kubernetes)
  4. Practice on real projects (Kaggle competitions)

Summary

This project structure teaches:

  • Clean Code: Professional Python development
  • System Design: Modular, scalable architecture
  • ML Engineering: End-to-end pipeline thinking
  • Industry Standards: Tools and practices used in production

Key Takeaway: "It's not just about building a model—it's about building a system that can be maintained, extended, and deployed reliably."

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published