Feat/ddp mlflow #655

Open · wants to merge 5 commits into main
9 changes: 0 additions & 9 deletions 3.test_cases/pytorch/cpu-ddp/README.md

This file was deleted.

124 changes: 0 additions & 124 deletions 3.test_cases/pytorch/cpu-ddp/ddp.py

This file was deleted.

118 changes: 0 additions & 118 deletions 3.test_cases/pytorch/cpu-ddp/kubernetes/fsdp-simple.yaml

This file was deleted.

17 changes: 0 additions & 17 deletions 3.test_cases/pytorch/cpu-ddp/slurm/0.create-conda-env.sh

This file was deleted.

@@ -1,4 +1,7 @@
Miniconda3-latest*
miniconda3
pt_cpu
pt
*.yaml
data
*.pt
mlruns
@@ -2,6 +2,6 @@ FROM pytorch/pytorch:latest

RUN apt update && apt upgrade -y

RUN pip install mlflow==2.13.2 sagemaker-mlflow==0.1.0
COPY ddp.py /workspace


71 changes: 71 additions & 0 deletions 3.test_cases/pytorch/ddp/README.md
@@ -0,0 +1,71 @@
# PyTorch DDP <!-- omit in toc -->

Isolated environments are crucial for reproducible machine learning: they pin specific software versions and dependencies, so models remain retrainable, shareable, and deployable without compatibility issues.

[Anaconda](https://www.anaconda.com/) uses conda environments to create distinct spaces for projects, allowing different Python versions and libraries to coexist without conflict. [Docker](https://www.docker.com/), a containerization platform, packages applications and their dependencies into containers that run consistently on any Linux server, providing OS-level virtualization of the entire runtime environment.

This example demonstrates how to set up a [PyTorch DDP](https://pytorch.org/tutorials/beginner/ddp_series_theory.html) training environment using either of these approaches. The implementation supports both CPU and GPU computation:

- **CPU Training**: Uses the GLOO backend for distributed training on CPU nodes
- **GPU Training**: Automatically switches to NCCL backend when GPUs are available, providing optimized multi-GPU training
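
The backend switch described above can be sketched as a small helper. This is an illustrative sketch, not the exact code in `ddp.py`; the helper name and the use of `torch.cuda.is_available()` are assumptions:

```python
def pick_backend(cuda_available: bool) -> str:
    """Choose the process-group backend: NCCL for GPUs, GLOO for CPU-only nodes."""
    return "nccl" if cuda_available else "gloo"


# In the training script this would feed torch.distributed (sketch):
#
#   import torch
#   import torch.distributed as dist
#
#   backend = pick_backend(torch.cuda.is_available())
#   dist.init_process_group(backend=backend)  # torchrun supplies RANK/WORLD_SIZE
```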

## Training

### Basic Usage

To run the training with GPUs, use `torchrun` with the appropriate number of GPUs:
```bash
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32
```
where `N` is the number of GPUs you want to use.

## MLFlow Integration

This implementation includes [MLFlow](https://mlflow.org/) integration for experiment tracking and model management. MLFlow helps you track metrics, parameters, and artifacts during training, making it easier to compare different runs and manage model versions.
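
In a DDP job every rank runs the same script, so experiment tracking is typically gated to rank 0 to avoid duplicate runs on the tracking server. A minimal sketch, assuming the standard `RANK` environment variable set by `torchrun` (the helper name is hypothetical):

```python
import os


def is_main_process() -> bool:
    """True only on rank 0; torchrun sets RANK for each spawned process."""
    return int(os.environ.get("RANK", "0")) == 0


# Sketch of how tracking could be gated (assumes mlflow is installed):
#
#   if args.use_mlflow and is_main_process():
#       import mlflow
#       mlflow.set_tracking_uri(args.tracking_uri)
#       with mlflow.start_run():
#           ...  # train and log from rank 0 only
```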

### Setup

1. Install MLFlow:
```bash
pip install mlflow
```

2. Start the MLFlow tracking server:
```bash
mlflow ui
```

### Usage

To enable MLFlow logging, add the `--use_mlflow` flag when running the training script:
```bash
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow
```

By default, MLFlow connects to `http://localhost:5000`. To use a different tracking server, pass `--tracking_uri` (the host below is a placeholder):
```bash
torchrun --nproc_per_node=N ddp.py --total_epochs=10 --save_every=1 --batch_size=32 --use_mlflow --tracking_uri=http://your-tracking-server:5000
```
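
The command-line flags above can be parsed with a plain `argparse` setup. This is a hedged sketch: the flag names match the commands shown in this README, but the defaults (other than the documented `http://localhost:5000`) are assumptions rather than the script's actual values:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="DDP training flags (sketch)")
    parser.add_argument("--total_epochs", type=int, required=True,
                        help="total number of training epochs")
    parser.add_argument("--save_every", type=int, required=True,
                        help="checkpoint every N epochs")
    parser.add_argument("--batch_size", type=int, default=32,
                        help="per-process batch size (default assumed)")
    parser.add_argument("--use_mlflow", action="store_true",
                        help="enable MLFlow experiment tracking")
    parser.add_argument("--tracking_uri", default="http://localhost:5000",
                        help="MLFlow tracking server URI")
    return parser


args = build_parser().parse_args(
    ["--total_epochs", "10", "--save_every", "1", "--use_mlflow"]
)
```

Omitting `--tracking_uri` falls back to the documented local default, matching the behavior described above.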

### What's Tracked

MLFlow will track:
- Training metrics (loss per epoch)
- Model hyperparameters
- Model checkpoints
- Training configuration
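
These tracked items map onto standard MLflow logging calls. The helper below only assembles the parameter dictionary (the key names are illustrative, not taken from `ddp.py`); the commented lines show the corresponding `mlflow` calls a rank-0 process would make, assuming mlflow is installed:

```python
def run_params(total_epochs: int, batch_size: int, world_size: int) -> dict:
    """Hyperparameters and training configuration a run would record (illustrative names)."""
    return {
        "total_epochs": total_epochs,
        "batch_size": batch_size,
        "world_size": world_size,
    }


# Sketch of the corresponding logging calls (assumes mlflow is installed):
#
#   mlflow.log_params(run_params(args.total_epochs, args.batch_size, world_size))
#   mlflow.log_metric("loss", epoch_loss, step=epoch)  # training loss per epoch
#   mlflow.log_artifact("checkpoint.pt")               # model checkpoint file
```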

### Viewing Results

Open your browser and navigate to `http://localhost:5000` (or your specified tracking URI).

The MLFlow UI provides:
- Experiment comparison
- Metric visualization
- Parameter tracking
- Model artifact management
- Run history

## Deployment

We provide guides for both Slurm and Kubernetes. However, please note that the Conda example is only compatible with Slurm. For detailed instructions, proceed to the [slurm](slurm) or [kubernetes](kubernetes) subdirectory.