A Water Quality Classification System that predicts whether water is safe to drink based on its chemical parameters.
- Industry Standard: a layout similar to what companies like Google, Netflix, and Uber use
- Maintainable: easy to debug, extend, and collaborate on
- Scalable: handles growing complexity
- Reproducible: anyone can run it and get the same results
```
ml_project/
├── config/
│   └── config.yaml             # 🔧 Configuration management
├── src/                        # 📦 Source code package
│   ├── __init__.py             # Makes it a Python package
│   ├── components/             # 🧩 Individual ML components
│   │   ├── __init__.py
│   │   ├── data_ingestion.py   # 📥 Load raw data
│   │   ├── data_cleaning.py    # 🧹 Clean and preprocess
│   │   ├── data_splitter.py    # ✂️ Train/test split + SMOTE
│   │   ├── model_trainer.py    # 🤖 Train multiple models
│   │   └── model_evaluator.py  # 📊 Evaluate performance
│   ├── pipeline/               # 🔄 Orchestration layer
│   │   ├── __init__.py
│   │   └── main_pipeline.py    # 🎯 Complete workflow
│   ├── config_manager.py       # ⚙️ YAML config loader
│   ├── logging_system.py       # 📝 Centralized logging
│   ├── exceptions.py           # ⚠️ Custom error handling
│   ├── feature_engineering.py  # 🔬 Create new features
│   └── feature_scaling.py      # 📏 Normalize features
├── artifacts/                  # 💾 Generated outputs
├── requirements.txt            # 📋 Dependencies
├── setup.py                    # 📦 Package installer
└── .gitignore                  # 🚫 Version control exclusions
```
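The `setup.py` is what makes `src/` importable as a package from anywhere. A minimal sketch (the package name, version, and dependency list are assumptions):

```python
# setup.py — minimal sketch; name, version, and dependencies are assumptions
from setuptools import find_packages, setup

setup(
    name="ml_project",
    version="0.1.0",
    packages=find_packages(),
    install_requires=["pandas", "scikit-learn", "imbalanced-learn", "pyyaml", "joblib"],
)
```

After `pip install -e .`, imports such as `from src.components.data_ingestion import DataIngestion` work regardless of the current working directory.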
```python
class DataIngestion:
    def load_data(self):
        data = pd.read_csv(self.data_path)
        # Save copy to artifacts/raw_data/
        return data
```

Purpose:
- Loads raw CSV data
- Creates backup in artifacts folder
- Logs data loading status
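Fleshed out slightly, the component might look like the sketch below; the artifacts path, logger name, and constructor are assumptions consistent with the bullets above.

```python
# A fuller sketch of data_ingestion.py — paths and logger name are assumptions
import logging
import os

import pandas as pd

logger = logging.getLogger("ml_project")


class DataIngestion:
    def __init__(self, data_path: str, artifacts_dir: str = "artifacts/raw_data"):
        self.data_path = data_path
        self.artifacts_dir = artifacts_dir

    def load_data(self) -> pd.DataFrame:
        data = pd.read_csv(self.data_path)
        logger.info("Loaded %d rows, %d columns from %s", data.shape[0], data.shape[1], self.data_path)

        # Keep a backup of the untouched input for reproducibility
        os.makedirs(self.artifacts_dir, exist_ok=True)
        data.to_csv(os.path.join(self.artifacts_dir, "raw_data.csv"), index=False)
        return data
```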
```python
class DataCleaning:
    def clean_data(self):
        self.convert_to_numeric()      # Convert text to numbers
        self.handle_missing_values()   # Fill missing data
        self.encode_categorical()      # One-hot encoding
        self.clean_target()            # Fix target column
        # Save to artifacts/processed_data/
        return self.data
```

Key Features:
- Comprehensive logging of every step
- Handles missing values intelligently
- Saves cleaned data for reproducibility
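One plausible implementation of the missing-value step (median/mode imputation is an assumption about what "intelligently" means here):

```python
# Possible missing-value handling — the imputation strategy is an assumption
import pandas as pd


def handle_missing_values(data: pd.DataFrame) -> pd.DataFrame:
    for col in data.columns:
        if data[col].isna().any():
            if pd.api.types.is_numeric_dtype(data[col]):
                # Median is robust to the outliers common in chemical measurements
                data[col] = data[col].fillna(data[col].median())
            else:
                # Fall back to the most frequent value for text columns
                data[col] = data[col].fillna(data[col].mode().iloc[0])
    return data
```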
```python
class DataSplitter:
    def split_and_save(self):
        X_train, X_test, y_train, y_test = train_test_split(...)
        # Apply SMOTE for class imbalance
        X_train_res, y_train_res = SMOTE().fit_resample(X_train, y_train)
        # Save both splits
        return train_df, test_df
```

Advanced Features:
- Stratified splitting (maintains class distribution)
- SMOTE for handling imbalanced data
- Saves train/test splits separately
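A sketch of how the stratified split and SMOTE fit together; the target column name (`is_safe`) and the 80/20 ratio are assumptions:

```python
# Stratified split + SMOTE sketch — target column and split ratio are assumptions
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split


def split_and_resample(df: pd.DataFrame, target: str = "is_safe", test_size: float = 0.2):
    X, y = df.drop(columns=[target]), df[target]

    # stratify=y keeps the safe/unsafe ratio identical in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42
    )

    # SMOTE is applied to the training set only, so the test set stays untouched
    X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    return X_train_res, X_test, y_train_res, y_test
```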
```python
class ModelTrainer:
    def train_models(self, X_train, y_train, X_test, y_test):
        models = {
            "LogisticRegression": LogisticRegression(),
            "RandomForest": RandomForestClassifier(),
            "XGBoost": XGBClassifier(),  # If available
        }
        results = []
        for name, model in models.items():
            model.fit(X_train, y_train)
            metrics = self._evaluate_model(...)
            results.append({...})
        # Save best model based on F1 score
        best_model = max(results, key=lambda x: x["test_metrics"]["f1_score"])
        joblib.dump(best_model["model"], "artifacts/models/best_model.pkl")
        return results
```
```python
class ModelEvaluator:
    def calculate_metrics(self, y_true, y_pred, y_proba=None):
        metrics = {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred, average="weighted"),
            "recall": recall_score(y_true, y_pred, average="weighted"),
            "f1_score": f1_score(y_true, y_pred, average="weighted"),
        }
        if y_proba is not None:  # ROC AUC needs probabilities, so skip it otherwise
            metrics["roc_auc"] = roc_auc_score(y_true, y_proba[:, 1])
        return metrics
```
```yaml
# config/config.yaml
models:
  random_forest:
    module: sklearn.ensemble
    class: RandomForestClassifier
    params:
      n_estimators: 200
      max_depth: 10
      random_state: 42

paths:
  raw_data: "data/raw_data/project_data.csv"
  trained_model: "models/trained_models/best_model.pkl"
```

Benefits:
- No hardcoded values in Python files
- Easy experimentation
- Environment-specific configurations
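This is what lets configuration drive behavior: the trainer can build each estimator dynamically from its `module` / `class` / `params` entry instead of hardcoding it. A sketch (the exact dictionary shape is an assumption):

```python
# Build estimators from config entries instead of hardcoding them — a sketch;
# the shape of the config dict is an assumption
import importlib


def build_model(model_cfg: dict):
    module = importlib.import_module(model_cfg["module"])  # e.g. sklearn.ensemble
    model_class = getattr(module, model_cfg["class"])      # e.g. RandomForestClassifier
    return model_class(**model_cfg.get("params", {}))      # e.g. n_estimators=200, ...


# Example with the random_forest entry shown above
rf = build_model({
    "module": "sklearn.ensemble",
    "class": "RandomForestClassifier",
    "params": {"n_estimators": 200, "max_depth": 10, "random_state": 42},
})
```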
```python
class MLProjectLogger:
    def log_pipeline_stage(self, stage, status):
        self.logger.info(f"PIPELINE_STAGE | {stage} | STATUS: {status}")

    def log_data_info(self, name, shape, **kwargs):
        info = {"rows": shape[0], "columns": shape[1]}
        self.logger.info(f"DATA_INFO | {name} | {info}")
```

Features:
- Console + file logging
- Structured log messages
- Exception tracking with full stack traces
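A sketch of how the console and file handlers might be wired up (the log file path and format string are assumptions):

```python
# Console + file logging setup — log file path and format are assumptions
import logging
import os


def build_logger(name: str = "ml_project", log_file: str = "artifacts/logs/pipeline.log") -> logging.Logger:
    os.makedirs(os.path.dirname(log_file), exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")

    # Same format goes to the terminal and to the log file
    for handler in (logging.StreamHandler(), logging.FileHandler(log_file)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```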
```python
class MLProjectException(Exception):
    def __init__(self, error_message: str, error_detail: Exception):
        super().__init__(error_message)
        self.error_message = error_message
        exc_tb = error_detail.__traceback__  # traceback attached to the original error
        self.lineno = exc_tb.tb_lineno if exc_tb else None
        self.file_name = exc_tb.tb_frame.f_code.co_filename if exc_tb else None
```

Advantages:
- Clear error messages with file/line info
- Consistent error handling across project
- Better debugging experience
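A typical usage pattern inside a component might look like this (a sketch; the import path assumes the package is installed with `pip install -e .`):

```python
# Usage sketch — wrap low-level errors in the project exception
import pandas as pd

from src.exceptions import MLProjectException

try:
    data = pd.read_csv("data/raw_data/project_data.csv")
except Exception as e:
    # Re-raise with file/line context so the log points at the failing component
    raise MLProjectException("Failed to load raw data", e) from e
```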
```python
class WaterSafetyMLPipeline:
    def run_complete_pipeline(self):
        # 1. Load data
        raw_data = DataIngestion().load_data()
        # 2. Clean data
        cleaned_data = DataCleaning(raw_data).clean_data()
        # 3. Split data
        train_df, test_df = DataSplitter(cleaned_data).split_and_save()
        # 4. Train models
        results = ModelTrainer().train_models(X_train, y_train, X_test, y_test)
        # 5. Generate summary
        self._generate_results_summary(results)
```

Key Benefits:
- Single Entry Point: Run entire pipeline with one command
- Error Handling: Graceful failure with detailed logs
- Results Summary: Automatic model comparison table
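The "one command" entry point can be as small as the sketch below at the bottom of `main_pipeline.py`, so something like `python -m src.pipeline.main_pipeline` runs everything from the project root:

```python
# Single entry point sketch for main_pipeline.py
if __name__ == "__main__":
    pipeline = WaterSafetyMLPipeline()
    results = pipeline.run_complete_pipeline()
```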
```python
class FeatureEngineering:
    def danger_flags(self):
        # Create binary flags for dangerous levels
        for col, threshold in self.thresholds.items():
            self.data[f"{col}_high"] = (self.data[col] > threshold).astype(int)

    def danger_count(self):
        # Count total dangerous parameters
        flags = [f"{c}_high" for c in self.thresholds]
        self.data["danger_count"] = self.data[flags].sum(axis=1)
```
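Applied to a couple of rows, this could look like the following; the constructor signature, column names, and threshold values are illustrative assumptions, not the project's actual limits:

```python
# Usage sketch — constructor, columns, and thresholds are assumptions
import pandas as pd

data = pd.DataFrame({"ph": [6.8, 9.4], "lead": [0.002, 0.03]})

fe = FeatureEngineering(data, thresholds={"ph": 8.5, "lead": 0.015})  # assumed __init__(data, thresholds)
fe.danger_flags()
fe.danger_count()
print(fe.data[["ph_high", "lead_high", "danger_count"]])
```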
```
artifacts/
├── raw_data/             # Original data backup
├── processed_data/       # Cleaned data
├── data/                 # Train/test splits
├── models/               # Trained models
└── evaluation_results/   # Performance metrics
```
Purpose:
- Reproducibility: Save every intermediate step
- Debugging: Inspect data at each stage
- Model Serving: Easy access to trained models
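For serving, loading the saved model is a one-liner; the split file name and target column below are assumptions:

```python
# Reload the best model from artifacts — split file name and target column are assumptions
import joblib
import pandas as pd

model = joblib.load("artifacts/models/best_model.pkl")
test_df = pd.read_csv("artifacts/data/test.csv")       # assumed split file name
X_test = test_df.drop(columns=["is_safe"])             # assumed target column
print(model.predict(X_test)[:10])
```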
```python
# 1. Show configuration loading
from src.config_manager import ConfigManager

config = ConfigManager()
print("Model configs:", config.get_all_models_config())

# 2. Demonstrate pipeline execution
from src.pipeline.main_pipeline import WaterSafetyMLPipeline

pipeline = WaterSafetyMLPipeline()
results = pipeline.run_complete_pipeline()

# 3. Show results
for result in results:
    print(f"{result['model_name']}: {result['test_metrics']['f1_score']:.4f}")
```

- Separation of Concerns: Each class has one responsibility
- DRY Principle: Don't Repeat Yourself (config management)
- Error Handling: Graceful failures with informative messages
- Documentation: Clear docstrings and comments
- Version Control: Proper .gitignore for Python projects
- Package Structure: Makes code importable and reusable
- Logging: Essential for production debugging
- Configuration: Environment-specific settings
- Artifacts: Reproducible experiments
- Model Comparison: Data-driven model selection
- Modify Config: Add a new model (SVM) to config.yaml
- Update Trainer: Modify model_trainer.py to load SVM from the config
- Test Pipeline: Run the complete pipeline with the new model
- Check Results: Verify SVM appears in the model comparison
- How configuration drives behavior
- Adding new models without changing core logic
- Understanding pipeline flow
- Interpreting evaluation results
- Netflix: Recommendation systems
- Uber: Demand prediction
- Google: Search ranking algorithms
- Tesla: Autopilot vision systems
- MLOps Engineer: Knows pipeline orchestration
- Data Scientist: Understands end-to-end workflow
- ML Engineer: Can productionize models
- Software Engineer: Appreciates clean architecture
```python
# Add to sys.path if needed
import os
import sys

sys.path.append(os.path.join(os.path.dirname(__file__), ".."))

# Create output folders before saving artifacts
os.makedirs("artifacts/models", exist_ok=True)
```

- Check that config/config.yaml exists
- Verify the paths in the config file
- Use absolute paths if needed
- Add hyperparameter tuning (GridSearch/RandomSearch)
- Implement cross-validation
- Add feature selection methods
- Create prediction API with FastAPI (a minimal sketch follows the list below)
- MLOps: Docker containerization
- CI/CD: GitHub Actions for testing
- Monitoring: Model drift detection
- Deployment: AWS/Azure cloud deployment
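As a taste of the FastAPI item above, a minimal prediction service might look like this sketch (field names, endpoint, and model path are assumptions; a real service would also apply the same feature engineering and scaling as the training pipeline):

```python
# app.py — prediction API sketch; fields, endpoint, and model path are assumptions
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Water Safety API")
model = joblib.load("artifacts/models/best_model.pkl")


class WaterSample(BaseModel):
    ph: float
    hardness: float
    solids: float
    # remaining chemical parameters would be listed here


@app.post("/predict")
def predict(sample: WaterSample):
    features = pd.DataFrame([sample.dict()])
    prediction = int(model.predict(features)[0])
    return {"is_safe": bool(prediction)}
```

Run locally with `uvicorn app:app --reload`.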
- Master this structure ✅
- Learn MLOps tools (MLflow, DVC)
- Study deployment (Docker, Kubernetes)
- Practice on real projects (Kaggle competitions)
This project structure teaches:
- Clean Code: Professional Python development
- System Design: Modular, scalable architecture
- ML Engineering: End-to-end pipeline thinking
- Industry Standards: Tools and practices used in production
Key Takeaway: "It's not just about building a model—it's about building a system that can be maintained, extended, and deployed reliably."