This project investigates the application of machine learning models to two distinct prediction tasks:
- Forecasting health outcomes in horses
- Predicting passenger transportation in the hypothetical Spaceship Titanic scenario
The study implements and compares four machine learning algorithms on these two datasets, providing insight into model performance and optimization techniques.
- Implementation of four machine learning algorithms:
  - CatBoost
  - K-Nearest Neighbors (KNN)
  - Support Vector Machine (SVM)
  - Naive Bayes
- Comprehensive data preprocessing pipeline
- Hyperparameter tuning for model optimization
- Ensemble methods implementation (an illustrative soft-voting sketch appears after the methodology list below)
- Cross-validation techniques for robust evaluation
- Data cleaning and missing value handling (see the preprocessing sketch below):
  - Numeric columns: mean-value imputation
  - Non-numeric columns: mode-value imputation
- Label encoding of categorical variables
- Feature engineering and dimensionality reduction
- Dataset augmentation by generating random values within each attribute's observed range (sketched below)
- Splitting of each dataset into training and validation sets
- 10-fold cross-validation
- Stratified sampling for balanced class distributions
- Model training with hyperparameter optimization
- Performance evaluation using the F1-Score metric (a combined cross-validation and tuning sketch follows this list)
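The preprocessing steps above can be summarized in a short sketch. The snippet below assumes a pandas DataFrame `df` standing in for either dataset; the function name `preprocess` is illustrative and not taken from the project code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Mean/mode imputation followed by label encoding of categorical columns."""
    df = df.copy()
    numeric_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.select_dtypes(exclude="number").columns

    # Numeric columns: fill missing values with the column mean.
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

    for col in categorical_cols:
        # Non-numeric columns: fill missing values with the column mode.
        df[col] = df[col].fillna(df[col].mode().iloc[0])
        # Encode category labels as integer codes.
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
    return df
```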
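Dataset augmentation by random value generation within attribute ranges could look roughly like the sketch below: synthetic rows draw numeric values uniformly between each column's observed minimum and maximum, and categorical values are resampled from the observed ones. The names `augment` and `n_new` are illustrative; in practice the target column would typically be excluded or handled separately so that synthetic rows carry valid labels.

```python
import numpy as np
import pandas as pd

def augment(df: pd.DataFrame, n_new: int, seed: int = 0) -> pd.DataFrame:
    """Append n_new synthetic rows sampled within each attribute's observed range."""
    rng = np.random.default_rng(seed)
    synthetic = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric attribute: draw uniformly between the observed min and max.
            synthetic[col] = rng.uniform(df[col].min(), df[col].max(), size=n_new)
        else:
            # Categorical attribute: resample from the observed values.
            synthetic[col] = rng.choice(df[col].dropna().to_numpy(), size=n_new)
    return pd.concat([df, pd.DataFrame(synthetic)], ignore_index=True)
```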
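A minimal sketch of the evaluation pipeline follows: a stratified train/validation split, stratified 10-fold cross-validation inside a grid search for hyperparameters, and a final F1-Score on the held-out validation data. The SVM parameter grid and the toy `make_classification` data are placeholders for the project's preprocessed features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Placeholder data; in the project, X and y come from the preprocessed datasets.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=6, random_state=42)

# Stratified split keeps the class distribution balanced across train and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameter search scored by macro-averaged F1 over stratified 10-fold CV.
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]},
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

# Final evaluation on the held-out validation set.
val_f1 = f1_score(y_val, search.best_estimator_.predict(X_val), average="macro")
print(f"best params: {search.best_params_}, validation F1: {val_f1:.4f}")
```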
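The ensemble methods mentioned in the features list could be realized in several ways (voting, stacking, blending); the exact method used in the project is not detailed here. Below is an illustrative soft-voting sketch that combines the four base models; it requires the `catboost` package alongside scikit-learn.

```python
from catboost import CatBoostClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("catboost", CatBoostClassifier(verbose=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("svm", SVC(probability=True)),  # probability=True is needed for soft voting
        ("nb", GaussianNB()),
    ],
    voting="soft",  # average predicted class probabilities across models
)
# Usage: ensemble.fit(X_train, y_train); ensemble.predict(X_val)
```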
F1-Scores (%) achieved by each model, with one set of results per dataset:
- First dataset:
  - CatBoost: 80.64
  - KNN: 79.23
  - SVM: 75.55
  - Naive Bayes: 76.6
- Second dataset:
  - CatBoost: 78.65
  - KNN: 67.07
  - SVM: 46.95
  - Naive Bayes: 39.63
CatBoost was the most robust model, achieving the highest F1-Score on both datasets.
Several limitations should be noted:
- Dataset Size
  - The original datasets were too small for reliable training
  - Artificial data augmentation was required
  - This may limit how well the models generalize
- Feature Independence
  - The Naive Bayes assumption of feature independence may not hold for these datasets
  - This could reduce accuracy in real-world scenarios
- Computational Resources
  - The SVM implementation can be computationally intensive
  - This could limit scalability to larger datasets
- Model Complexity
  - Advanced models such as CatBoost require more extensive tuning
  - This increases implementation complexity