Add Modal GPU testing infrastructure #1032

GeorgePearse · 2025-10-26T21:14:49Z

Summary

This PR adds comprehensive GPU testing infrastructure using Modal, a serverless GPU compute platform. Tests are automatically triggered on PRs labeled with gpu-tests or modal-tests, enabling cost-effective GPU testing without maintaining dedicated hardware.

Key Features

GPU Test Suites (1,571 lines across 11 files):

Training tests (test_training_gpu.py): GPU memory management, forward/backward passes, loss computation, gradient accumulation, end-to-end training
Inference/benchmark tests (test_inference_gpu.py): Batch inference benchmarks, mixed precision inference, throughput measurement, large batch handling
Multi-GPU tests (test_distributed.py): GPU availability, DataParallel wrapper, gradient synchronization, device consistency, multi-GPU training loops

Infrastructure:

Modal runner (tests/modal_runner.py): Serverless GPU functions with 4 test suites + orchestrator
Test configuration (tests/modal_config.py): Pytest fixtures, markers, and configuration constants
GitHub Actions workflow (.github/workflows/modal-gpu-tests.yml): Label-triggered CI/CD on 3 Python versions with T4 GPUs
Comprehensive documentation (docs/modal-testing.md): 434-line guide covering setup, usage, cost breakdown, and troubleshooting

Cost Optimization:

Uses T4 GPUs (~$0.35/hr) instead of enterprise GPUs
Label-based triggering (only runs on demand)
Python 3.12 only for multi-GPU tests (expensive 2xT4)
Estimated cost: ~$0.33 per full test suite run

Test Coverage

Category	Count	Type
Training	6	Unit + Integration
Inference	6	Benchmark
Distributed	6	Multi-GPU
Total	18	GPU tests

Setup Requirements

For CI/CD: Add GitHub secrets MODAL_TOKEN_ID and MODAL_TOKEN_SECRET (obtainable from Modal dashboard)
For local testing: pip install modal>=0.63.0 + modal token new
Optional: Label PRs with gpu-tests or modal-tests to trigger automated testing

Files Added/Modified

✅ .github/workflows/modal-gpu-tests.yml (164 lines) - CI/CD workflow
✅ tests/modal_runner.py (327 lines) - Modal serverless functions
✅ tests/modal_config.py (106 lines) - Test configuration & fixtures
✅ tests/test_gpu/test_training_gpu.py (174 lines) - Training tests
✅ tests/test_gpu/test_inference_gpu.py (158 lines) - Inference tests
✅ tests/test_gpu/test_distributed.py (197 lines) - Distributed tests
✅ docs/modal-testing.md (434 lines) - Complete documentation
✅ modal.toml (2 lines) - Modal configuration
✅ requirements-modal.txt (7 lines) - Modal dependencies
✅ pyproject.toml (+1 line) - Added modal>=0.63.0

How to Test

Run locally (requires Modal setup):

modal run tests.modal_runner --test-type all

Trigger in CI/CD:

Add gpu-tests label to PR
Tests run automatically on 3 Python versions (3.10, 3.11, 3.12)
Results and artifacts posted to PR

Manual workflow dispatch (coming in next update):

Actions > Modal GPU Tests > Run workflow

Documentation

See docs/modal-testing.md for:

Step-by-step setup instructions
Running tests locally vs. in CI/CD
Cost breakdown and optimization strategies
Troubleshooting guide for common issues
Advanced usage (custom GPU tiers, parallel execution, volume persistence)

Dependencies: modal>=0.63.0 (optional, only needed for CI/CD and local Modal testing)

Adds comprehensive GPU testing via Modal serverless platform: - Modal runner module (tests/modal_runner.py) with functions for: - GPU unit tests (30 min timeout) - Integration tests for training (40 min timeout) - Performance benchmarks (20 min timeout) - Multi-GPU distributed tests (20 min timeout) - GPU test suites: - tests/test_gpu/test_training_gpu.py: 6 GPU training tests - tests/test_gpu/test_inference_gpu.py: 6 inference/benchmark tests - tests/test_gpu/test_distributed.py: 6 multi-GPU tests - Configuration: - tests/modal_config.py: pytest fixtures and test config - modal.toml: Modal app configuration - requirements-modal.txt: Modal dependencies - GitHub Actions: - .github/workflows/modal-gpu-tests.yml: Label-triggered CI workflow - Runs on PR labels: gpu-tests or modal-tests - Tests on 3 Python versions (3.10, 3.11, 3.12) - Uses T4 GPUs for cost-effectiveness - Documentation: - docs/modal-testing.md: Complete guide for Modal testing - Setup instructions for local and CI/CD - Cost breakdown and optimization tips - Troubleshooting guide Updates: - pyproject.toml: Added modal>=0.63.0 to dev dependencies

github-actions · 2025-10-26T21:16:45Z

✅ Skylos Scan: No dead code or security issues detected.

github-actions · 2025-10-26T21:30:55Z

✅ Skylos Scan: No dead code or security issues detected.

GeorgePearse force-pushed the feature/modal-gpu-tests branch from 9f49dd6 to c34708f Compare October 26, 2025 21:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add Modal GPU testing infrastructure #1032

Add Modal GPU testing infrastructure #1032

Uh oh!

GeorgePearse commented Oct 26, 2025

Uh oh!

github-actions bot commented Oct 26, 2025

Uh oh!

github-actions bot commented Oct 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add Modal GPU testing infrastructure #1032

Are you sure you want to change the base?

Add Modal GPU testing infrastructure #1032

Uh oh!

Conversation

GeorgePearse commented Oct 26, 2025

Summary

Key Features

Test Coverage

Setup Requirements

Files Added/Modified

How to Test

Documentation

Uh oh!

github-actions bot commented Oct 26, 2025

Uh oh!

github-actions bot commented Oct 26, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants