Robust Reinforcement Learning Differential Game Guidance in Low-Thrust, Multi-Body Dynamical Environments
Ali Bani Asad
Department of Aerospace Engineering
Sharif University of Technology
Supervised by: Dr. Hadi Nobahari
September 2025
This repository contains the complete implementation and documentation of a zero-sum multi-agent reinforcement learning (MARL) framework for robust spacecraft guidance in the challenging Earth-Moon three-body dynamical system. The research addresses the critical problem of low-thrust spacecraft guidance under significant environmental uncertainties through a novel differential game formulation.
Key Contributions:
- 🎮 Zero-Sum Game Formulation: Spacecraft guidance cast as a two-player differential game between a guidance agent (spacecraft) and a disturbance agent (uncertainties)
- 🤖 Multi-Agent RL Algorithms: Extended implementations of DDPG, TD3, SAC, and PPO to their zero-sum multi-agent variants (MA-DDPG, MA-TD3, MA-SAC, MA-PPO)
- 🛡️ Robustness Analysis: Comprehensive evaluation under diverse uncertainty scenarios including sensor noise, actuator disturbances, time delays, model mismatch, and initial condition variations
- 🚀 Hardware Integration: ROS2-based implementation with C++ inference for real-time deployment
- 📊 Benchmark Comparison: Rigorous comparison against classical control methods and standard single-agent RL approaches
Results: The zero-sum MARL approach demonstrates superior robustness, with MA-TD3 achieving the best performance in trajectory tracking and fuel efficiency while maintaining stability in highly perturbed environments.
master-thesis/
├── 📚 Report/ # LaTeX thesis document
│ ├── thesis.tex # Main thesis file
│ ├── Chapters/ # 8 chapters (Introduction → Conclusion)
│ ├── bibs/ # Bibliography
│ └── plots/ # Result plots and figures
│
├── 📜 Paper/ # Conference paper (IEEE format)
│
├── 💻 Code/
│ ├── Python/
│ │ ├── Algorithms/ # DDPG, TD3, SAC, PPO implementations
│ │ ├── Environment/ # Three-body problem dynamics (TBP.py)
│ │ ├── TBP/ # Single-agent training (Classic, DDPG, TD3, SAC, PPO)
│ │ ├── MBK/ # Multi-body Kepler experiments
│ │ ├── Robust_eval/ # Robustness testing (Standard & ZeroSum variants)
│ │ ├── Benchmark/ # OpenAI Gym environments
│ │ └── utils/ # Utility functions
│ │
│ ├── C/ # C++ real-time inference (PyTorch models)
│ ├── ROS2/ # ROS2 packages for hardware integration
│ ├── Simulink/ # MATLAB Simulink models
│ └── ros_legacy/ # Legacy ROS1 implementation
│
├── 🖼️ Figure/ # Visualizations (TBP, HIL)
├── 🎓 Presentation/ # Defense slides (Beamer)
└── 📖 Proposal/ # Research proposal
Key Directories:
- `Code/Python/Algorithms/`: Core RL algorithm implementations
- `Code/Python/TBP/`: Training notebooks for single-agent baselines
- `Code/Python/Robust_eval/`: Comprehensive robustness evaluation scripts
- `Report/`: Complete thesis document with LaTeX source
The spacecraft guidance problem in the Circular Restricted Three-Body Problem (CR3BP) is formulated as a zero-sum differential game:
- Player 1 (Guidance Agent): Minimizes trajectory deviation and fuel consumption
- Player 2 (Disturbance Agent): Maximizes trajectory deviation (models worst-case uncertainties)
This formulation enables the development of inherently robust control policies that perform well under adversarial conditions.
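To make the setting concrete, the sketch below shows the planar CR3BP equations of motion in the rotating frame together with the zero-sum reward coupling. This is an illustrative sketch, not the repository's `TBP.py`: the function names, the planar (2D) simplification, and the reward terms are assumptions for exposition; the Earth-Moon mass parameter is the standard nondimensional value.

```python
import numpy as np

MU = 0.012150585  # Earth-Moon mass parameter (nondimensional, assumed standard value)

def cr3bp_accel(state, thrust=np.zeros(2), mu=MU):
    """Planar CR3BP dynamics in the rotating frame (nondimensional units).

    state = [x, y, vx, vy]; thrust is the low-thrust control acceleration.
    """
    x, y, vx, vy = state
    r1 = np.hypot(x + mu, y)          # distance to Earth (primary)
    r2 = np.hypot(x - 1 + mu, y)      # distance to Moon (secondary)
    ax = x + 2 * vy - (1 - mu) * (x + mu) / r1**3 - mu * (x - 1 + mu) / r2**3
    ay = y - 2 * vx - (1 - mu) * y / r1**3 - mu * y / r2**3
    return np.array([vx, vy, ax + thrust[0], ay + thrust[1]])

def zero_sum_rewards(tracking_error, fuel_cost):
    """Zero-sum coupling: the disturbance agent's reward is the exact negation."""
    r_guidance = -(tracking_error + fuel_cost)
    return r_guidance, -r_guidance
```

As a sanity check, the net acceleration vanishes at the triangular libration points, e.g. L4 at `(0.5 - MU, sqrt(3)/2)` with zero velocity.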
Four state-of-the-art continuous control algorithms are extended to their zero-sum multi-agent variants:
| Algorithm | Type | Key Features |
|---|---|---|
| MA-DDPG | Off-policy, Deterministic | Simple, efficient, good baseline |
| MA-TD3 | Off-policy, Deterministic | Target policy smoothing, delayed updates, clipped double Q-learning |
| MA-SAC | Off-policy, Stochastic | Maximum entropy, automatic temperature tuning |
| MA-PPO | On-policy, Stochastic | Trust region optimization, robust training |
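The TD3 refinements listed above (target policy smoothing and clipped double Q-learning) can be summarized in the critic target computation. The sketch below uses hypothetical network handles (`critic1_t`, `critic2_t`, `actor_t` are target networks), not the repository's own classes; in the zero-sum variant the action would concatenate the guidance and disturbance agents' outputs.

```python
import torch

def td3_target(critic1_t, critic2_t, actor_t, next_obs, reward, done,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Clipped double-Q target with target policy smoothing (TD3-style).

    Adds clipped Gaussian noise to the target action, then bootstraps from
    the minimum of the two target critics to curb overestimation.
    """
    with torch.no_grad():
        mu = actor_t(next_obs)
        noise = (torch.randn_like(mu) * noise_std).clamp(-noise_clip, noise_clip)
        next_act = (mu + noise).clamp(-act_limit, act_limit)
        q_min = torch.min(critic1_t(next_obs, next_act),
                          critic2_t(next_obs, next_act))
        return reward + gamma * (1.0 - done) * q_min
```

The delayed-update rule (actor updated once per several critic updates) is orthogonal to this target and lives in the training loop.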
- Centralized Training, Decentralized Execution (CTDE): Both agents observe full state during training but can act independently during deployment
- Alternating Optimization: Sequential training of guidance and disturbance agents
- Full Information Setting: Complete state observation for optimal policy learning
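The alternating optimization above can be sketched as a loop that freezes one player while the other learns. The agent interface (`act`, `update`, `freeze`, `unfreeze`) and the tuple-action `env.step` are hypothetical names for illustration, not the repository's API; the zero-sum coupling appears as the sign flip on the disturbance agent's reward.

```python
def alternating_train(guidance, disturbance, env, cycles=10, steps_per_cycle=5000):
    """Alternate updates between the guidance and disturbance players.

    Even cycles train the guidance agent against a frozen disturbance;
    odd cycles do the reverse (hypothetical agent interface).
    """
    for cycle in range(cycles):
        learner, frozen = ((guidance, disturbance) if cycle % 2 == 0
                           else (disturbance, guidance))
        frozen.freeze()
        learner.unfreeze()
        obs, _ = env.reset()
        for _ in range(steps_per_cycle):
            action_g = guidance.act(obs)
            action_d = disturbance.act(obs)
            obs, reward, terminated, truncated, _ = env.step((action_g, action_d))
            # Zero-sum: the disturbance agent maximizes what guidance minimizes.
            learner.update(reward if learner is guidance else -reward)
            if terminated or truncated:
                obs, _ = env.reset()
```

Under CTDE, both `act` calls would see the full state during training, while the deployed guidance policy acts on its own observations alone.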
The trained policies are rigorously tested under six uncertainty scenarios:
- 🎲 Initial Condition Variations: Random perturbations in initial state
- ⚡ Actuator Disturbances: Thrust vector perturbations
- 📡 Sensor Noise: Gaussian noise in state measurements
- ⏱️ Time Delays: Communication and actuation delays
- 🔧 Model Mismatch: Errors in system dynamics model
- 🌪️ Combined Uncertainties: All scenarios simultaneously
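Three of these perturbations are simple enough to sketch directly; the snippet below is an illustrative sketch (function names, noise magnitudes, and the delay mechanism are assumptions, not the repository's evaluation code).

```python
import numpy as np

rng = np.random.default_rng(42)

def add_sensor_noise(obs, sigma=0.01):
    """Sensor noise: additive Gaussian noise on the measured state."""
    return obs + rng.normal(0.0, sigma, size=obs.shape)

def perturb_thrust(action, magnitude=0.05):
    """Actuator disturbance: multiplicative perturbation on the thrust vector."""
    return action * (1.0 + rng.uniform(-magnitude, magnitude, size=action.shape))

class DelayBuffer:
    """Time delay: the plant receives the action issued k steps earlier."""
    def __init__(self, k, zero_action):
        self.buf = [zero_action] * k

    def push(self, action):
        self.buf.append(action)
        return self.buf.pop(0)
```

Model mismatch would instead perturb the dynamics parameters (e.g. the mass ratio), and the combined scenario stacks all of the above.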
Software Requirements:
- Python 3.8 or higher
- PyTorch 2.2.2
- CUDA 11.8+ (optional, for GPU acceleration)
- ROS2 Humble (for hardware deployment)
- CMake 3.16+ (for C++ implementation)
- LaTeX distribution (for compiling thesis document)
Hardware Requirements:
- 16+ GB RAM (recommended for training)
- NVIDIA GPU with 6+ GB VRAM (optional, speeds up training significantly)
```bash
git clone https://github.com/alibaniasad1999/master-thesis.git
cd master-thesis

# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# Verify the installation
python -c "import torch; import gymnasium; import numpy; print('✓ All packages installed successfully')"
```

To train a single-agent SAC baseline:

```bash
cd Code/Python/TBP/SAC
jupyter notebook SAC_TBP.ipynb
```

Follow the notebook to:
- Configure environment parameters
- Set hyperparameters
- Train the agent
- Evaluate performance
- Save trained models
```bash
cd Code/Python/TBP/SAC/ZeroSum
jupyter notebook Zero_Sum_SAC_TBP.ipynb
```

The notebook demonstrates:
- Zero-sum game setup
- Alternating training procedure
- Nash equilibrium convergence
- Robustness evaluation
```bash
cd Code/Python/Robust_eval/ZeroSum/sensor_noise
jupyter notebook sensor_noise.ipynb
```

This evaluates trained policies under sensor-noise perturbations and generates comparison plots.
```bash
cd Code/C
mkdir build && cd build
cmake ..
make
./main
```

The C++ implementation loads PyTorch traced models for fast inference.
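The traced models the C++ binary consumes are produced on the Python side with `torch.jit.trace`. The sketch below is illustrative: the actor architecture, state dimension, and output file name are assumptions, not the repository's actual export script.

```python
import torch
import torch.nn as nn

# Hypothetical actor network; the repository's layer sizes may differ.
actor = nn.Sequential(
    nn.Linear(6, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3), nn.Tanh(),   # thrust command bounded to [-1, 1]
)
actor.eval()

example_state = torch.zeros(1, 6)   # e.g. [x, y, z, vx, vy, vz]
traced = torch.jit.trace(actor, example_state)
traced.save("actor_traced.pt")      # loadable from C++ via torch::jit::load
```

On the C++ side, LibTorch's `torch::jit::load` reads the `.pt` file and runs `forward` without a Python interpreter, which is what makes sub-millisecond inference feasible.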
```bash
cd Code/ROS2
colcon build
source install/setup.bash
ros2 launch tbp_rl_controler tbp_system.launch.py
```

This launches:
- Three-body dynamics simulator node
- RL controller node
- Data logging node
```bash
cd Code/Python/utils
python model_downloader.py
```

This downloads pre-trained models from the GitHub repository.
| Algorithm | Trajectory Error (m) | Fuel Consumption (m/s) | Success Rate (%) | Robustness Score |
|---|---|---|---|---|
| PID Control | 8,432 ± 2,156 | 45.2 ± 8.3 | 72.4 | ⭐⭐ |
| DDPG | 1,234 ± 892 | 28.7 ± 5.2 | 84.6 | ⭐⭐⭐ |
| TD3 | 967 ± 654 | 26.4 ± 4.1 | 88.2 | ⭐⭐⭐⭐ |
| SAC | 1,045 ± 721 | 27.8 ± 4.8 | 86.9 | ⭐⭐⭐⭐ |
| PPO | 1,398 ± 978 | 31.2 ± 6.3 | 81.5 | ⭐⭐⭐ |
| MA-DDPG | 892 ± 423 | 25.1 ± 3.2 | 91.7 | ⭐⭐⭐⭐ |
| MA-TD3 | 687 ± 312 | 23.4 ± 2.8 | 95.3 | ⭐⭐⭐⭐⭐ |
| MA-SAC | 734 ± 367 | 24.2 ± 3.1 | 93.8 | ⭐⭐⭐⭐⭐ |
| MA-PPO | 856 ± 445 | 26.7 ± 3.9 | 90.4 | ⭐⭐⭐⭐ |
Results averaged over 1,000 test episodes with combined uncertainty scenarios.
Trajectory Tracking
*(Figures: Standard TD3 trajectory | Zero-Sum MA-TD3 trajectory)*
Trajectory with Control Forces
*(Figures: Standard TD3 | Zero-Sum MA-TD3, trajectories with control forces)*
MA-TD3 demonstrates superior trajectory tracking with reduced deviation and more efficient control force usage.
The violin plots below show the performance distribution of all four RL algorithms (DDPG, TD3, SAC, PPO) under various uncertainty scenarios. Each plot compares Standard (single-agent) vs Zero-Sum (multi-agent) variants.
Zero-Sum Multi-Agent RL - All Algorithms Combined
*(Violin plots: Actuator Disturbance, Sensor Noise, Initial Condition Shift, Time Delay, Model Mismatch, Partial Observation)*
Standard Single-Agent RL - All Algorithms Combined
*(Violin plots: Actuator Disturbance, Sensor Noise, Initial Condition Shift, Time Delay, Model Mismatch, Partial Observation)*
📊 Click to view individual algorithm robustness (TD3)
*(Violin plots: Actuator Disturbance, Sensor Noise, Initial Condition Shift, Time Delay, Model Mismatch, Partial Observation)*
📊 Click to view individual algorithm robustness (DDPG)
*(Violin plots: Actuator Disturbance, Sensor Noise, Initial Condition Shift, Time Delay, Model Mismatch, Partial Observation)*
📊 Click to view individual algorithm robustness (SAC)
*(Violin plots: Actuator Disturbance, Sensor Noise, Initial Condition Shift, Time Delay, Model Mismatch, Partial Observation)*
📊 Click to view individual algorithm robustness (PPO)
*(Violin plots: Actuator Disturbance, Sensor Noise, Initial Condition Shift, Time Delay, Model Mismatch, Partial Observation)*
✅ Zero-sum MARL outperforms single-agent RL across all metrics
✅ MA-TD3 achieves best overall performance with 30% error reduction vs. TD3
✅ Robustness significantly improved under all uncertainty scenarios
✅ Tighter performance distributions in zero-sum variants (visible in violin plots)
✅ Stable performance in highly perturbed environments
✅ Real-time capable C++ implementation achieves <5ms inference time
The complete thesis is available in the Report/ directory:
```bash
cd Report
pdflatex thesis.tex
bibtex thesis
pdflatex thesis.tex
pdflatex thesis.tex
```

Or use latexmk for automatic compilation:

```bash
latexmk -pdf thesis.tex
```

The thesis comprises eight chapters:

- Introduction: Motivation, problem statement, and research objectives
- Literature Review: Survey of RL, MARL, differential games, and spacecraft guidance
- Simulation: Three-body problem dynamics and environment setup
- Reinforcement Learning: Single-agent RL algorithms (DDPG, TD3, SAC, PPO)
- Agent Simulation: Training procedures and baseline results
- Multi-Agent RL: Zero-sum game formulation and MARL algorithms
- Results: Comprehensive evaluation and comparison
- Conclusion: Summary, contributions, and future work
Key classes and functions are documented in the code:
- `Environment/TBP.py`: Three-body problem environment class
- `Algorithms/*/Zero_Sum_*.py`: Zero-sum MARL implementations
- `utils/model_downloader.py`: Pre-trained model utilities
```bash
# Train MA-TD3 agent
cd Code/Python/TBP/TD3/ZeroSum
jupyter notebook Zero_Sum_TD3_TBP.ipynb
# Execute all cells
```

```bash
# Run robustness evaluation
cd Code/Python/Robust_eval/ZeroSum/All_in_one/actuator_disturbance
jupyter notebook all_in_one.ipynb
```

All experiments use fixed random seeds for reproducibility:
- NumPy: `np.random.seed(42)`
- PyTorch: `torch.manual_seed(42)`
- Gymnasium: `env.reset(seed=42)`
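The seed settings above can be collected into one helper, a minimal sketch (the function name is illustrative, not the repository's utility):

```python
import random

import numpy as np
import torch

def set_global_seed(seed=42):
    """Seed every RNG the experiments touch, for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```

Calling this once at the top of each notebook, and passing the same seed to `env.reset(seed=...)`, makes independent runs bitwise comparable on a fixed software stack.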
If you use this work in your research, please cite:
```bibtex
@mastersthesis{baniasad2025robust,
  author  = {Ali Bani Asad},
  title   = {Robust Reinforcement Learning Differential Game Guidance
             in Low-Thrust, Multi-Body Dynamical Environments},
  school  = {Sharif University of Technology},
  year    = {2025},
  address = {Tehran, Iran},
  month   = {September},
  type    = {Master's Thesis},
  note    = {Department of Aerospace Engineering}
}
```

Related publication:
- Conference Paper: "Robustness on Demand: Transformer-Directed Switching in Multi-Agent RL" (in preparation)
This is an academic research repository, maintained primarily for archival and reference purposes, but suggestions and discussions are welcome:
- Open an issue to discuss proposed changes
- Fork the repository
- Create a feature branch
- Submit a pull request with detailed description
Ali Bani Asad
Department of Aerospace Engineering
Sharif University of Technology
📧 Email: ali_baniasad@ae.sharif.edu
🔗 GitHub: @alibaniasad1999
Supervisor: Dr. Hadi Nobahari
📧 Email: nobahari@sharif.edu
This research was conducted at the Sharif University of Technology, Department of Aerospace Engineering, under the supervision of Dr. Hadi Nobahari and the advisory of Dr. Seyed Ali Emami Khooansari.
Special thanks to:
- The Aerospace Engineering Department for providing computational resources
- The open-source RL community for excellent libraries and tools
- Colleagues and fellow researchers for valuable discussions and feedback
This project is licensed under the MIT License - see the LICENSE file for details.
- PyTorch: https://pytorch.org/
- Gymnasium: https://gymnasium.farama.org/
- ROS2: https://docs.ros.org/en/humble/
- Stable-Baselines3: https://stable-baselines3.readthedocs.io/
- Three-Body Problem: https://en.wikipedia.org/wiki/Three-body_problem
⭐ If you find this research useful, please consider giving it a star! ⭐
Made with ❤️ at Sharif University of Technology
For questions, issues, or collaboration inquiries, please open a GitHub issue or reach out to the author:
- GitHub: @alibaniasad1999
- Email: alibaniasad1999@yahoo.com