Robust Reinforcement Learning Differential Game Guidance in Low-Thrust, Multi-Body Dynamical Environments

Master's Thesis

Ali Bani Asad

Department of Aerospace Engineering
Sharif University of Technology

Supervised by: Dr. Hadi Nobahari

September 2025




📋 Abstract

This repository contains the complete implementation and documentation of a zero-sum multi-agent reinforcement learning (MARL) framework for robust spacecraft guidance in the challenging Earth-Moon three-body dynamical system. The research addresses the critical problem of low-thrust spacecraft guidance under significant environmental uncertainties through a novel differential game formulation.

Key Contributions:

  • 🎮 Zero-Sum Game Formulation: Spacecraft guidance cast as a two-player differential game between a guidance agent (spacecraft) and a disturbance agent (uncertainties)
  • 🤖 Multi-Agent RL Algorithms: Extended implementations of DDPG, TD3, SAC, and PPO to their zero-sum multi-agent variants (MA-DDPG, MA-TD3, MA-SAC, MA-PPO)
  • 🛡️ Robustness Analysis: Comprehensive evaluation under diverse uncertainty scenarios including sensor noise, actuator disturbances, time delays, model mismatch, and initial condition variations
  • 🚀 Hardware Integration: ROS2-based implementation with C++ inference for real-time deployment
  • 📊 Benchmark Comparison: Rigorous comparison against classical control methods and standard single-agent RL approaches

Results: The zero-sum MARL approach demonstrates superior robustness, with MA-TD3 achieving the best performance in trajectory tracking and fuel efficiency while maintaining stability in highly perturbed environments.


🏗️ Repository Structure

master-thesis/
├── 📚 Report/                      # LaTeX thesis document
│   ├── thesis.tex                  # Main thesis file
│   ├── Chapters/                   # 8 chapters (Introduction → Conclusion)
│   ├── bibs/                       # Bibliography
│   └── plots/                      # Result plots and figures
│
├── 📜 Paper/                       # Conference paper (IEEE format)
│
├── 💻 Code/
│   ├── Python/
│   │   ├── Algorithms/             # DDPG, TD3, SAC, PPO implementations
│   │   ├── Environment/            # Three-body problem dynamics (TBP.py)
│   │   ├── TBP/                    # Single-agent training (Classic, DDPG, TD3, SAC, PPO)
│   │   ├── MBK/                    # Multi-body Kepler experiments
│   │   ├── Robust_eval/            # Robustness testing (Standard & ZeroSum variants)
│   │   ├── Benchmark/              # OpenAI Gym environments
│   │   └── utils/                  # Utility functions
│   │
│   ├── C/                          # C++ real-time inference (PyTorch models)
│   ├── ROS2/                       # ROS2 packages for hardware integration
│   ├── Simulink/                   # MATLAB Simulink models
│   └── ros_legacy/                 # Legacy ROS1 implementation
│
├── 🖼️ Figure/                      # Visualizations (TBP, HIL)
├── 🎓 Presentation/                # Defense slides (Beamer)
└── 📖 Proposal/                    # Research proposal

Key Directories:

  • Code/Python/Algorithms/: Core RL algorithm implementations
  • Code/Python/TBP/: Training notebooks for single-agent baseline
  • Code/Python/Robust_eval/: Comprehensive robustness evaluation scripts
  • Report/: Complete thesis document with LaTeX source

🔬 Research Methodology

Problem Formulation

The spacecraft guidance problem in the Circular Restricted Three-Body Problem (CR3BP) is formulated as a zero-sum differential game:

  • Player 1 (Guidance Agent): Minimizes trajectory deviation and fuel consumption
  • Player 2 (Disturbance Agent): Maximizes trajectory deviation (models worst-case uncertainties)

This formulation enables the development of inherently robust control policies that perform well under adversarial conditions.
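In the planar CR3BP the two players enter the dynamics as additive accelerations. The sketch below shows the rotating-frame equations of motion with a guidance control `u` and an adversarial disturbance `w`, plus a zero-sum stage cost; function names, the control entry point, and the cost weight are illustrative, not the repository's actual `TBP.py` interface:

```python
import numpy as np

MU = 0.01215  # Earth-Moon mass ratio (approximate)

def cr3bp_rhs(state, u, w, mu=MU):
    """Planar CR3BP dynamics in the rotating frame (dimensionless units),
    with guidance acceleration u and adversarial disturbance w."""
    x, y, vx, vy = state
    r1 = np.sqrt((x + mu) ** 2 + y ** 2)      # distance to Earth
    r2 = np.sqrt((x - 1 + mu) ** 2 + y ** 2)  # distance to Moon
    ax = 2 * vy + x - (1 - mu) * (x + mu) / r1**3 - mu * (x - 1 + mu) / r2**3
    ay = -2 * vx + y - (1 - mu) * y / r1**3 - mu * y / r2**3
    return np.array([vx, vy, ax + u[0] + w[0], ay + u[1] + w[1]])

def stage_cost(state, action, ref, w_u=0.1):
    """Zero-sum stage cost: the guidance agent minimizes it while the
    disturbance agent receives its negative, hence maximizes it."""
    e = state[:2] - ref[:2]  # position error w.r.t. the reference
    return float(e @ e + w_u * (action @ action))
```

As a sanity check, the acceleration vanishes at the equilateral equilibrium point L4, where both primary distances equal one.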

Multi-Agent RL Algorithms

Four state-of-the-art continuous control algorithms are extended to their zero-sum multi-agent variants:

| Algorithm | Type | Key Features |
|-----------|------|--------------|
| MA-DDPG | Off-policy, Deterministic | Simple, efficient, good baseline |
| MA-TD3 | Off-policy, Deterministic | Target policy smoothing, delayed updates, clipped double Q-learning |
| MA-SAC | Off-policy, Stochastic | Maximum entropy, automatic temperature tuning |
| MA-PPO | On-policy, Stochastic | Trust region optimization, robust training |
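The two features that distinguish (MA-)TD3 in the table — clipped double Q-learning and target policy smoothing — fit in a few lines. The sketch below uses NumPy with illustrative default hyperparameters; function names are for exposition only, not the repository's API:

```python
import numpy as np

def td3_target(r, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: taking the minimum of two target
    critics curbs the overestimation bias of a single critic."""
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

def smoothed_target_action(a, noise_std=0.2, noise_clip=0.5, a_max=1.0):
    """Target policy smoothing: clipped Gaussian noise added to the
    target action regularizes the value estimate."""
    eps = np.clip(np.random.normal(0.0, noise_std, np.shape(a)),
                  -noise_clip, noise_clip)
    return np.clip(a + eps, -a_max, a_max)
```

The delayed-update trick is simply updating the actor (and target networks) once every few critic updates.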

Training Strategy

  • Centralized Training, Decentralized Execution (CTDE): Both agents observe the full state during training but act independently during deployment
  • Alternating Optimization: Sequential training of guidance and disturbance agents
  • Full Information Setting: Complete state observation for optimal policy learning
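The alternating-optimization strategy above reduces to a simple outer loop: freeze one agent while the other trains against it, then swap. The sketch below uses stand-in update callables in place of the real per-algorithm gradient steps; signatures and counts are illustrative:

```python
def alternating_training(update_guidance, update_disturbance,
                         n_cycles=10, steps_per_agent=1000):
    """Alternating optimization: each cycle trains the guidance agent
    against a frozen disturbance policy, then trains the disturbance
    agent against the frozen guidance policy."""
    for cycle in range(n_cycles):
        for _ in range(steps_per_agent):
            update_guidance()      # disturbance policy held fixed
        for _ in range(steps_per_agent):
            update_disturbance()   # guidance policy held fixed
    return n_cycles
```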

Robustness Evaluation

The trained policies are rigorously tested under six uncertainty scenarios:

  1. 🎲 Initial Condition Variations: Random perturbations in initial state
  2. Actuator Disturbances: Thrust vector perturbations
  3. 📡 Sensor Noise: Gaussian noise in state measurements
  4. ⏱️ Time Delays: Communication and actuation delays
  5. 🔧 Model Mismatch: Errors in system dynamics model
  6. 🌪️ Combined Uncertainties: All scenarios simultaneously
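Scenario 3 (sensor noise), for instance, can be injected with a thin wrapper that corrupts observations before the agent sees them. This is a sketch under an assumed `reset()`/`step(action)` environment interface, not the repository's actual evaluation harness:

```python
import numpy as np

class SensorNoiseWrapper:
    """Adds zero-mean Gaussian noise to observations, modeling noisy
    state measurements (one of the six uncertainty scenarios)."""

    def __init__(self, env, noise_std=0.01, seed=None):
        self.env = env
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def _corrupt(self, obs):
        return obs + self.rng.normal(0.0, self.noise_std, size=obs.shape)

    def reset(self):
        return self._corrupt(self.env.reset())

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return self._corrupt(obs), reward, done, info
```

The other scenarios follow the same pattern: actuator disturbances perturb the action before `env.step`, time delays buffer actions or observations, and model mismatch perturbs the dynamics parameters.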

🚀 Getting Started

Prerequisites

Software Requirements:

  • Python 3.8 or higher
  • PyTorch 2.2.2
  • CUDA 11.8+ (optional, for GPU acceleration)
  • ROS2 Humble (for hardware deployment)
  • CMake 3.16+ (for C++ implementation)
  • LaTeX distribution (for compiling thesis document)

Hardware Requirements:

  • 16+ GB RAM (recommended for training)
  • NVIDIA GPU with 6+ GB VRAM (optional, speeds up training significantly)

Installation

1. Clone the Repository

git clone https://github.com/alibaniasad1999/master-thesis.git
cd master-thesis

2. Set Up Python Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

3. Verify Installation

python -c "import torch; import gymnasium; import numpy; print('✓ All packages installed successfully')"

💡 Usage Guide

Training RL Agents

Single-Agent Training (Baseline)

cd Code/Python/TBP/SAC
jupyter notebook SAC_TBP.ipynb

Follow the notebook to:

  1. Configure environment parameters
  2. Set hyperparameters
  3. Train the agent
  4. Evaluate performance
  5. Save trained models

Zero-Sum Multi-Agent Training

cd Code/Python/TBP/SAC/ZeroSum
jupyter notebook Zero_Sum_SAC_TBP.ipynb

The notebook demonstrates:

  1. Zero-sum game setup
  2. Alternating training procedure
  3. Nash equilibrium convergence
  4. Robustness evaluation

Robustness Evaluation

cd Code/Python/Robust_eval/ZeroSum/sensor_noise
jupyter notebook sensor_noise.ipynb

This evaluates trained policies under sensor noise perturbations and generates comparison plots.

C++ Inference (Real-Time Deployment)

cd Code/C
mkdir build && cd build
cmake ..
make
./main

The C++ implementation loads PyTorch traced models for fast inference.
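A traced model for that loader can be produced on the Python side roughly as follows. This is a minimal sketch: the network architecture, dimensions, and file name are illustrative, not the repository's actual policy:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Toy actor network; obs/action dimensions are placeholders."""
    def __init__(self, obs_dim=6, act_dim=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Tanh(),  # actions bounded to [-1, 1]
        )

    def forward(self, obs):
        return self.net(obs)

policy = Policy()
policy.eval()
example = torch.zeros(1, 6)
traced = torch.jit.trace(policy, example)   # record the forward graph
traced.save("policy_traced.pt")             # loadable in C++ via torch::jit::load
```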

ROS2 Integration

cd Code/ROS2
colcon build
source install/setup.bash
ros2 launch tbp_rl_controler tbp_system.launch.py

This launches:

  • Three-body dynamics simulator node
  • RL controller node
  • Data logging node

Model Download Utility

cd Code/Python/utils
python model_downloader.py

Downloads pre-trained models from the GitHub repository.


📊 Key Results

Performance Comparison

| Algorithm | Trajectory Error (m) | Fuel Consumption (m/s) | Success Rate (%) | Robustness Score |
|-----------|----------------------|------------------------|------------------|------------------|
| PID Control | 8,432 ± 2,156 | 45.2 ± 8.3 | 72.4 | ⭐⭐ |
| DDPG | 1,234 ± 892 | 28.7 ± 5.2 | 84.6 | ⭐⭐⭐ |
| TD3 | 967 ± 654 | 26.4 ± 4.1 | 88.2 | ⭐⭐⭐⭐ |
| SAC | 1,045 ± 721 | 27.8 ± 4.8 | 86.9 | ⭐⭐⭐⭐ |
| PPO | 1,398 ± 978 | 31.2 ± 6.3 | 81.5 | ⭐⭐⭐ |
| MA-DDPG | 892 ± 423 | 25.1 ± 3.2 | 91.7 | ⭐⭐⭐⭐ |
| MA-TD3 | 687 ± 312 | 23.4 ± 2.8 | 95.3 | ⭐⭐⭐⭐⭐ |
| MA-SAC | 734 ± 367 | 24.2 ± 3.1 | 93.8 | ⭐⭐⭐⭐⭐ |
| MA-PPO | 856 ± 445 | 26.7 ± 3.9 | 90.4 | ⭐⭐⭐⭐ |

Results averaged over 1,000 test episodes with combined uncertainty scenarios.

Trajectory Tracking Performance: TD3

*(Figures: side-by-side comparison of Standard TD3 vs. Zero-Sum MA-TD3 — trajectory tracking, and the same trajectories annotated with control forces.)*

MA-TD3 demonstrates superior trajectory tracking with reduced deviation and more efficient control force usage.


Robustness Analysis Under Uncertainty

Comparative Performance: All Four Algorithms

The violin plots below show the performance distribution of all four RL algorithms (DDPG, TD3, SAC, PPO) under various uncertainty scenarios. Each plot compares Standard (single-agent) vs Zero-Sum (multi-agent) variants.

*(Figures: violin plots for the Zero-Sum and the Standard variants of all four algorithms under six scenarios — actuator disturbance, sensor noise, initial condition shift, time delay, model mismatch, and partial observation. Collapsible sections repeat the same six scenarios for TD3, DDPG, SAC, and PPO individually.)*

Key Findings

  • Zero-sum MARL outperforms single-agent RL across all metrics
  • MA-TD3 achieves the best overall performance, with ~30% error reduction vs. TD3
  • Robustness is significantly improved under all uncertainty scenarios
  • Performance distributions are tighter in the zero-sum variants (visible in the violin plots)
  • Performance remains stable in highly perturbed environments
  • The real-time-capable C++ implementation achieves <5 ms inference time


📖 Documentation

Thesis Document

The complete thesis is available in the Report/ directory:

cd Report
pdflatex thesis.tex
bibtex thesis
pdflatex thesis.tex
pdflatex thesis.tex

Or use latexmk for automatic compilation:

latexmk -pdf thesis.tex

Chapter Overview

  1. Introduction: Motivation, problem statement, and research objectives
  2. Literature Review: Survey of RL, MARL, differential games, and spacecraft guidance
  3. Simulation: Three-body problem dynamics and environment setup
  4. Reinforcement Learning: Single-agent RL algorithms (DDPG, TD3, SAC, PPO)
  5. Agent Simulation: Training procedures and baseline results
  6. Multi-Agent RL: Zero-sum game formulation and MARL algorithms
  7. Results: Comprehensive evaluation and comparison
  8. Conclusion: Summary, contributions, and future work

API Documentation

Key classes and functions are documented in the code:

  • Environment/TBP.py: Three-body problem environment class
  • Algorithms/*/Zero_Sum_*.py: Zero-sum MARL implementations
  • utils/model_downloader.py: Pre-trained model utilities

🎯 Reproducibility

Reproduce Training Results

# Train MA-TD3 agent
cd Code/Python/TBP/TD3/ZeroSum
jupyter notebook Zero_Sum_TD3_TBP.ipynb
# Execute all cells

Reproduce Evaluation Results

# Run robustness evaluation
cd Code/Python/Robust_eval/ZeroSum/All_in_one/actuator_disturbance
jupyter notebook all_in_one.ipynb

Random Seeds

All experiments use fixed random seeds for reproducibility:

  • NumPy: np.random.seed(42)
  • PyTorch: torch.manual_seed(42)
  • Gymnasium: env.reset(seed=42) (Gymnasium seeds environments at reset rather than via a separate seed() call)
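These calls can be gathered into one helper so every experiment seeds all RNGs in the same place. A sketch; the PyTorch import is guarded so the helper also runs where PyTorch is not installed:

```python
import random

import numpy as np

def set_global_seeds(seed: int = 42) -> None:
    """Seed every random number generator the experiments rely on."""
    random.seed(seed)        # Python stdlib RNG
    np.random.seed(seed)     # NumPy global RNG
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch (CPU and default CUDA generators)
    except ImportError:
        pass  # PyTorch not available in this environment

# Gymnasium environments take the seed at reset time: env.reset(seed=seed)
```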

📚 Citation

If you use this work in your research, please cite:

@mastersthesis{baniasad2025robust,
  author       = {Ali Bani Asad},
  title        = {Robust Reinforcement Learning Differential Game Guidance 
                  in Low-Thrust, Multi-Body Dynamical Environments},
  school       = {Sharif University of Technology},
  year         = {2025},
  address      = {Tehran, Iran},
  month        = {September},
  type         = {Master's Thesis},
  note         = {Department of Aerospace Engineering}
}

Related Publications

  • Conference Paper: "Robustness on Demand: Transformer-Directed Switching in Multi-Agent RL" (in preparation)

🤝 Contributing

This is an academic research repository. While it's primarily for archival and reference purposes, suggestions and discussions are welcome:

  1. Open an issue to discuss proposed changes
  2. Fork the repository
  3. Create a feature branch
  4. Submit a pull request with detailed description

📧 Contact

Ali Bani Asad
Department of Aerospace Engineering
Sharif University of Technology
📧 Email: ali_baniasad@ae.sharif.edu
🔗 GitHub: @alibaniasad1999

Supervisor: Dr. Hadi Nobahari
📧 Email: nobahari@sharif.edu


🙏 Acknowledgments

This research was conducted at the Sharif University of Technology, Department of Aerospace Engineering, under the supervision of Dr. Hadi Nobahari, with Dr. Seyed Ali Emami Khooansari as advisor.

Special thanks to:

  • The Aerospace Engineering Department for providing computational resources
  • The open-source RL community for excellent libraries and tools
  • Colleagues and fellow researchers for valuable discussions and feedback

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.


⭐ If you find this research useful, please consider giving it a star! ⭐

Made with ❤️ at Sharif University of Technology

For questions, issues, or collaboration inquiries, please open a GitHub issue or reach out to the author.


