This repository contains a machine learning project for the Kaggle competition "Spaceship Titanic." The goal is to predict which passengers were transported to an alternate dimension during a collision with a spacetime anomaly.
In this competition, we use machine learning techniques to analyze data from the Spaceship Titanic's damaged computer system and predict whether passengers were transported.
- Introduction
- Dependencies Installation
- Data Loading
- Initial Data Exploration
- Feature Engineering
- Data Preprocessing
- Model Training and Evaluation
- Hyperparameter Optimization
- Feature Importance
- Submission
- Conclusion
- Python 3.x
- Required Libraries: numpy, pandas, matplotlib, seaborn, scikit-learn
Install the required libraries using pip:
pip install numpy pandas matplotlib seaborn scikit-learn
- Clone the repository:
  git clone https://github.com/yourusername/spaceship-titanic.git
- Navigate to the project directory:
  cd spaceship-titanic
- Run the main script:
  python main.py
The goal of this project is to use machine learning models to predict whether each passenger was transported to an alternate dimension.
# Installing necessary libraries
!pip install numpy pandas matplotlib seaborn scikit-learn
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# Ensuring visualizations are displayed inline
%matplotlib inline
# Loading the data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
# Displaying the first few rows of the dataset
train_data.head()
# Checking general dataset information
train_data.info()
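# Optional check (added, not in the original script): counting missing values per column,
# since several columns are imputed later in the preprocessing pipeline
train_data.isnull().sum()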
# Plotting distribution of 'Transported'
plt.figure(figsize=(8, 6))
sns.countplot(x='Transported', data=train_data)
plt.title('Distribution of Transported')
plt.show()
# Extracting cabin components
train_data[['Deck', 'Num', 'Side']] = train_data['Cabin'].str.split('/', expand=True)
# Creating a total spend feature
train_data['TotalSpend'] = (train_data['RoomService'] + train_data['FoodCourt'] +
                            train_data['ShoppingMall'] + train_data['Spa'] + train_data['VRDeck'])
# Displaying feature engineering results
train_data[['Cabin', 'Deck', 'Num', 'Side', 'TotalSpend']].head()
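# Note (added): the spending columns contain missing values, so TotalSpend is NaN wherever any
# component is missing; the median imputer in the preprocessing pipeline below fills these values
train_data['TotalSpend'].isnull().sum()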
# Pipelines for data preprocessing
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'TotalSpend']),
        ('cat', categorical_transformer, ['HomePlanet', 'Destination', 'Deck', 'Side'])
    ])
# Splitting data into training and validation sets
X = train_data.drop(columns=['Transported'])
y = train_data['Transported']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Defining models
models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier()
}
# Training and evaluating models
for name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_val)
    print(f"\n{name} - Accuracy: {accuracy_score(y_val, y_pred):.4f}")

    # Confusion matrix
    cm = confusion_matrix(y_val, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(f'Confusion Matrix - {name}')
    plt.show()
# Hyperparameter grid for Random Forest
param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}
# Grid search for Random Forest
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', RandomForestClassifier())])
grid_search = GridSearchCV(rf_pipeline, param_grid, cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
# Best parameters
print("Best parameters:", grid_search.best_params_)
# Best model after grid search
best_model = grid_search.best_estimator_
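# Optional check (added, not in the original script): reporting the grid search's cross-validated
# accuracy and re-evaluating the tuned model on the held-out validation set
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
print(f"Tuned Random Forest - Validation accuracy: {accuracy_score(y_val, best_model.predict(X_val)):.4f}")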
# Feature importance
feature_importance = best_model.named_steps['classifier'].feature_importances_
feature_names = best_model.named_steps['preprocessor'].get_feature_names_out()
# Creating and visualizing feature importance DataFrame
feature_importance_df = pd.DataFrame({'feature': feature_names, 'importance': feature_importance})
feature_importance_df = feature_importance_df.sort_values(by='importance', ascending=False).head(15)
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=feature_importance_df)
plt.title('Top 15 Feature Importance')
plt.tight_layout()
plt.show()
# Applying the same feature engineering to the test data
test_data[['Deck', 'Num', 'Side']] = test_data['Cabin'].str.split('/', expand=True)
test_data['TotalSpend'] = (test_data['RoomService'] + test_data['FoodCourt'] +
                           test_data['ShoppingMall'] + test_data['Spa'] + test_data['VRDeck'])
# Applying the best model to test data
test_predictions = best_model.predict(test_data)
# Creating and saving the submission file
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'],
                           'Transported': test_predictions})
submission.to_csv('submission.csv', index=False)
print("Submission file 'submission.csv' created successfully.")
This project demonstrates a complete machine learning workflow, from data exploration and feature engineering through preprocessing, model training and evaluation, hyperparameter tuning, and generation of a competition submission. Organizing the preprocessing steps and models into scikit-learn pipelines keeps the transformations consistent across the training, validation, and test data.