
Auto tune learning rate based on validation loss #930

Draft
mcgibbon wants to merge 9 commits into main from feature/candidate_lr

Conversation


@mcgibbon mcgibbon commented Mar 9, 2026

Short description of why the PR is needed and how it satisfies those requirements, in sentence form.

Changes:

  • symbol (e.g. fme.core.my_function) or script and concise description of changes or added feature

  • Can group multiple related symbols on a single bullet

  • Tests added

  • If dependencies changed, "deps only" image rebuilt and "latest_deps_only_image.txt" file updated

Resolves # (delete if none)

mcgibbon and others added 9 commits March 9, 2026 21:13
Enable reuse of the validation loop outside the Trainer (e.g. for upcoming
LR tuning trials) by extracting it into a module-level function that accepts
a stepper, data, aggregator, and optional EMA. The Trainer method retains
its epoch-boundary assertion and flush_diagnostics responsibility, delegating
only the core loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
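The extraction this commit describes could look roughly like the sketch below. All names here (`run_validation`, `predict`, `record_batch`, `averaged_weights`, `get_logs`) are illustrative stand-ins, not the repo's actual API; the real function in fme has a different signature.

```python
from contextlib import nullcontext

def run_validation(stepper, batches, aggregator, ema=None):
    """Core validation loop, hoisted out of the Trainer so other callers
    (such as LR-tuning trials) can reuse it.

    If an EMA wrapper is supplied, the loop evaluates under the averaged
    weights; otherwise it uses the stepper's live weights.
    """
    context = ema.averaged_weights(stepper) if ema is not None else nullcontext()
    with context:
        for batch in batches:
            outputs = stepper.predict(batch)
            aggregator.record_batch(batch, outputs)
    return aggregator.get_logs()
```

With this shape, the Trainer method keeps its epoch-boundary assertion and flush_diagnostics call and delegates only the loop body.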
Introduce an isolated trial function that creates two stepper forks
(baseline at current LR, candidate at current_lr * lr_factor), trains
both on the first N batches, validates both, and compares validation loss
improvements. Returns the candidate LR only if both improve and the
candidate exceeds a configurable threshold. The original model, optimizer,
and EMA are never mutated.

The trial function accepts copy_stepper and copy_ema callables so the
caller controls how forks are created — this avoids copy.deepcopy on the
stepper (which has deeply nested tensor structures) and allows the trainer
to seed the fork EMAs from its own EMA state.

Not yet wired into the Trainer — that will follow in a subsequent commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
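A minimal sketch of the trial logic described in this commit, under simplifying assumptions: `copy_stepper`, `train`, and `validate` are collapsed into plain callables, and the exact comparison rule (candidate gain must exceed baseline gain by `threshold`) is an assumption, not necessarily the repo's precise criterion.

```python
def run_lr_trial(current_lr, lr_factor, copy_stepper, train, validate,
                 initial_loss, threshold=0.0):
    """Return the candidate LR if its fork clearly wins, else None.

    Both forks are created by the caller-supplied `copy_stepper`, so the
    original model, optimizer, and EMA are never mutated.
    """
    baseline = copy_stepper(lr=current_lr)
    candidate = copy_stepper(lr=current_lr * lr_factor)
    train(baseline)
    train(candidate)
    baseline_gain = initial_loss - validate(baseline)
    candidate_gain = initial_loss - validate(candidate)
    # Accept only if both forks improved and the candidate beats the
    # baseline by the configured margin.
    if baseline_gain > 0 and candidate_gain > baseline_gain + threshold:
        return current_lr * lr_factor
    return None
```

Passing fork factories instead of calling `copy.deepcopy` inside the trial keeps the copying strategy in the caller's hands, which matters for steppers with deeply nested tensor structures.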
Hook the LR tuning trial into the Trainer's epoch loop, running
_maybe_tune_lr before each training epoch. Extract run_validation into
its own module (fme/core/generics/validation.py) to break the circular
import between trainer.py and lr_tuning.py, replacing the previous
TYPE_CHECKING workaround. Use OptimizationABC instead of the concrete
Optimization class in lr_tuning.py to respect the generics layer's
import conventions. Add lr_tuning config field to ace, coupled, and
diffusion TrainConfig classes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
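The `lr_tuning` config field added to the TrainConfig classes might look something like the following. The field names and defaults here are hypothetical, not the actual fme schema.

```python
from dataclasses import dataclass

@dataclass
class LRTuningConfig:
    enabled: bool = False               # opt-in per training run
    lr_factor: float = 2.0              # candidate LR = current LR * lr_factor
    n_trial_batches: int = 10           # train each fork on the first N batches
    improvement_threshold: float = 0.0  # margin the candidate must exceed
```

As the commit describes, the Trainer would consult this config in a `_maybe_tune_lr` hook before each training epoch, importing the trial from `lr_tuning.py` and the loop from `fme/core/generics/validation.py` so neither module needs to import the Trainer.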
The LR tuning trial creates baseline and candidate forks to evaluate a
candidate learning rate. These forks must be fully isolated from the
original stepper, optimizer, and EMA so that the trial has zero side
effects when the candidate does not win.

Two isolation failures were found and fixed:

1. Optimizer state (momentum buffers, Adam step counters) was shared
   between the forks and the original optimizer because
   `optimization.get_state()` returns references to live tensors.
   The forks' training incremented the original's step counter,
   corrupting Adam's bias correction and changing the effective
   learning rate for subsequent real training. Fixed by deepcopying
   the optimization state before loading into forks.

2. EMA `num_updates` tensor was shared because `get_state()` returned
   it by reference and `from_state()` assigned it directly. The forks'
   in-place `+=` mutated the original's counter, corrupting the EMA
   decay schedule. Fixed by cloning tensors in `get_state()`.

The test helpers were also updated to use the real Trainer copy patterns
(deepcopy + load_state for steppers, from_state for EMA) instead of
creating fresh objects, so these isolation bugs are now caught. Three
new tests verify that the original EMA num_updates, EMA params, and
optimizer state are not mutated by a trial.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
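Both isolation failures reduce to the same aliasing pattern, sketched below. Plain Python lists stand in for torch tensors, and the class names are illustrative only; the real fixes are in fme's optimization and EMA state handling.

```python
import copy

class Ema:
    def __init__(self):
        self._num_updates = [0]  # stand-in for the shared counter tensor

    def get_state(self):
        # Fix 2: clone (here, deepcopy) before handing state out, so a
        # fork's in-place `+=` cannot advance the original's counter.
        return {"num_updates": copy.deepcopy(self._num_updates)}

class Optimization:
    def __init__(self):
        # Stand-in for momentum buffers and Adam step counters.
        self.state = {"step": [0]}

    def get_state(self):
        return self.state  # returns references to live tensors

    def load_state(self, state):
        self.state = state

def fork_optimization(original):
    fork = Optimization()
    # Fix 1: deepcopy the state before loading it into the fork, so the
    # fork's training cannot advance the original's step counter and
    # corrupt Adam's bias correction.
    fork.load_state(copy.deepcopy(original.get_state()))
    return fork
```

The choice of where to copy differs between the two fixes: the optimizer is copied at the fork site, while the EMA clones inside `get_state()` so every consumer gets an isolated snapshot.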