Skip to content

Conversation

@GeorgePearse
Copy link
Collaborator

Summary

This PR adds comprehensive GPU testing infrastructure using Modal, a serverless GPU compute platform. Tests are automatically triggered on PRs labeled with gpu-tests or modal-tests, enabling cost-effective GPU testing without maintaining dedicated hardware.

Key Features

GPU Test Suites (1,571 lines across 11 files):

  • Training tests (test_training_gpu.py): GPU memory management, forward/backward passes, loss computation, gradient accumulation, end-to-end training
  • Inference/benchmark tests (test_inference_gpu.py): Batch inference benchmarks, mixed precision inference, throughput measurement, large batch handling
  • Multi-GPU tests (test_distributed.py): GPU availability, DataParallel wrapper, gradient synchronization, device consistency, multi-GPU training loops

Infrastructure:

  • Modal runner (tests/modal_runner.py): Serverless GPU functions with 4 test suites + orchestrator
  • Test configuration (tests/modal_config.py): Pytest fixtures, markers, and configuration constants
  • GitHub Actions workflow (.github/workflows/modal-gpu-tests.yml): Label-triggered CI/CD on 3 Python versions with T4 GPUs
  • Comprehensive documentation (docs/modal-testing.md): 434-line guide covering setup, usage, cost breakdown, and troubleshooting

Cost Optimization:

  • Uses T4 GPUs (~$0.35/hr) instead of enterprise GPUs
  • Label-based triggering (only runs on demand)
  • Python 3.12 only for multi-GPU tests (expensive 2xT4)
  • Estimated cost: ~$0.33 per full test suite run

Test Coverage

Category Count Type
Training 6 Unit + Integration
Inference 6 Benchmark
Distributed 6 Multi-GPU
Total 18 GPU tests

Setup Requirements

  1. For CI/CD: Add GitHub secrets MODAL_TOKEN_ID and MODAL_TOKEN_SECRET (obtainable from Modal dashboard)
  2. For local testing: pip install modal>=0.63.0 + modal token new
  3. Optional: Label PRs with gpu-tests or modal-tests to trigger automated testing

Files Added/Modified

  • .github/workflows/modal-gpu-tests.yml (164 lines) - CI/CD workflow
  • tests/modal_runner.py (327 lines) - Modal serverless functions
  • tests/modal_config.py (106 lines) - Test configuration & fixtures
  • tests/test_gpu/test_training_gpu.py (174 lines) - Training tests
  • tests/test_gpu/test_inference_gpu.py (158 lines) - Inference tests
  • tests/test_gpu/test_distributed.py (197 lines) - Distributed tests
  • docs/modal-testing.md (434 lines) - Complete documentation
  • modal.toml (2 lines) - Modal configuration
  • requirements-modal.txt (7 lines) - Modal dependencies
  • pyproject.toml (+1 line) - Added modal>=0.63.0

How to Test

Run locally (requires Modal setup):

modal run tests.modal_runner --test-type all

Trigger in CI/CD:

  1. Add gpu-tests label to PR
  2. Tests run automatically on 3 Python versions (3.10, 3.11, 3.12)
  3. Results and artifacts posted to PR

Manual workflow dispatch (coming in next update):

  • Actions > Modal GPU Tests > Run workflow

Documentation

See docs/modal-testing.md for:

  • Step-by-step setup instructions
  • Running tests locally vs. in CI/CD
  • Cost breakdown and optimization strategies
  • Troubleshooting guide for common issues
  • Advanced usage (custom GPU tiers, parallel execution, volume persistence)

Dependencies: modal>=0.63.0 (optional, only needed for CI/CD and local Modal testing)

Adds comprehensive GPU testing via Modal serverless platform:

- Modal runner module (tests/modal_runner.py) with functions for:
  - GPU unit tests (30 min timeout)
  - Integration tests for training (40 min timeout)
  - Performance benchmarks (20 min timeout)
  - Multi-GPU distributed tests (20 min timeout)

- GPU test suites:
  - tests/test_gpu/test_training_gpu.py: 6 GPU training tests
  - tests/test_gpu/test_inference_gpu.py: 6 inference/benchmark tests
  - tests/test_gpu/test_distributed.py: 6 multi-GPU tests

- Configuration:
  - tests/modal_config.py: pytest fixtures and test config
  - modal.toml: Modal app configuration
  - requirements-modal.txt: Modal dependencies

- GitHub Actions:
  - .github/workflows/modal-gpu-tests.yml: Label-triggered CI workflow
  - Runs on PR labels: gpu-tests or modal-tests
  - Tests on 3 Python versions (3.10, 3.11, 3.12)
  - Uses T4 GPUs for cost-effectiveness

- Documentation:
  - docs/modal-testing.md: Complete guide for Modal testing
  - Setup instructions for local and CI/CD
  - Cost breakdown and optimization tips
  - Troubleshooting guide

Updates:
- pyproject.toml: Added modal>=0.63.0 to dev dependencies
@github-actions
Copy link

Skylos Scan: No dead code or security issues detected.

@GeorgePearse GeorgePearse force-pushed the feature/modal-gpu-tests branch from 9f49dd6 to c34708f Compare October 26, 2025 21:29
@github-actions
Copy link

Skylos Scan: No dead code or security issues detected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants