Skip to content

add gradient accumulation support#21

Open
jonscheunemann wants to merge 1 commit into
mainfrom
jonny_batch_accumulation
Open

add gradient accumulation support#21
jonscheunemann wants to merge 1 commit into
mainfrom
jonny_batch_accumulation

Conversation

@jonscheunemann

Copy link
Copy Markdown
Collaborator
  • Introduced micro_batch_size and effective_batch_size parameters to the analyzer function.
  • Implemented gradient accumulation logic, allowing for optimization steps to occur after a specified number of micro-batches.
  • Added validation for batch size parameters to ensure compatibility.
  • Updated tests to cover new functionality, including parameter validation, optimizer behavior, and logging behavior during accumulation cycles.

- Introduced `micro_batch_size` and `effective_batch_size` parameters to the analyzer function.
- Implemented gradient accumulation logic, allowing for optimization steps to occur after a specified number of micro-batches.
- Added validation for batch size parameters to ensure compatibility.
- Updated tests to cover new functionality, including parameter validation, optimizer behavior, and logging behavior during accumulation cycles.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds gradient accumulation support to perspic.analyzer.analyzer() so optimization steps and analysis/logging can be aligned to an “effective batch size” composed of multiple micro-batches.

Changes:

  • Introduces micro_batch_size / effective_batch_size parameters, validation, and an internal accumulation_steps computation.
  • Implements optimizer stepping every accumulation_steps micro-batches and switches scheduling/logging to an effective_step counter.
  • Adds extensive unit tests for validation, optimizer behavior, scheduling, and metric/logging behavior under accumulation.

Reviewed changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated 3 comments.

File Description
perspic/analyzer.py Implements gradient accumulation, effective-step scheduling, accumulated metric buffers, and effective batch size logging.
tests/unit/test_analyzer.py Adds a large new test suite for accumulation behavior and updates existing tests to use the new effective-step counter.
.gitignore Expands ignored log/checkpoint/dataset patterns.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread perspic/analyzer.py
Comment on lines +557 to +590
self.model.zero_grad()
loss = self.criterion(self.model(x), y)
loss.backward()

loss_val = loss.detach().item()
if is_train:
self._accum_train_loss += loss_val
if self._accum_grad_train is None:
self._accum_grad_train = [
p.grad.clone() if p.grad is not None else None
for p in self.model.parameters()
]
else:
for acc, p in zip(
self._accum_grad_train, self.model.parameters()
):
if acc is not None and p.grad is not None:
acc.add_(p.grad)
else:
self._accum_measure_loss += loss_val
if self._accum_grad_measure is None:
self._accum_grad_measure = [
p.grad.clone() if p.grad is not None else None
for p in self.model.parameters()
]
else:
for acc, p in zip(
self._accum_grad_measure, self.model.parameters()
):
if acc is not None and p.grad is not None:
acc.add_(p.grad)

self.model.zero_grad()

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With gradient accumulation enabled, _before_training_step() runs before manual_backward() and mid-cycle training gradients are expected to remain on parameters. However, the accumulated-analysis path calls self.sample_calc.compute(...) (which internally calls model.zero_grad()) and _accumulate_linearizer_grads() also does self.model.zero_grad() + loss.backward() + zero_grad() again. This will wipe the already-accumulated training gradients from previous micro-batches, breaking gradient accumulation when the analyzer is enabled. Consider preserving/restoring parameter .grad around analysis, or refactoring calculators to use torch.autograd.grad (no writes to .grad), or running analysis on a separate model copy so training gradients aren’t mutated.

Suggested change
self.model.zero_grad()
loss = self.criterion(self.model(x), y)
loss.backward()
loss_val = loss.detach().item()
if is_train:
self._accum_train_loss += loss_val
if self._accum_grad_train is None:
self._accum_grad_train = [
p.grad.clone() if p.grad is not None else None
for p in self.model.parameters()
]
else:
for acc, p in zip(
self._accum_grad_train, self.model.parameters()
):
if acc is not None and p.grad is not None:
acc.add_(p.grad)
else:
self._accum_measure_loss += loss_val
if self._accum_grad_measure is None:
self._accum_grad_measure = [
p.grad.clone() if p.grad is not None else None
for p in self.model.parameters()
]
else:
for acc, p in zip(
self._accum_grad_measure, self.model.parameters()
):
if acc is not None and p.grad is not None:
acc.add_(p.grad)
self.model.zero_grad()
loss = self.criterion(self.model(x), y)
params = list(self.model.parameters())
grad_params = [p for p in params if p.requires_grad]
grad_values = torch.autograd.grad(
loss,
grad_params,
allow_unused=True,
)
grads = []
grad_iter = iter(grad_values)
for p in params:
if p.requires_grad:
grads.append(next(grad_iter))
else:
grads.append(None)
loss_val = loss.detach().item()
if is_train:
self._accum_train_loss += loss_val
if self._accum_grad_train is None:
self._accum_grad_train = [
g.clone() if g is not None else None for g in grads
]
else:
for acc, g in zip(self._accum_grad_train, grads):
if acc is not None and g is not None:
acc.add_(g)
else:
self._accum_measure_loss += loss_val
if self._accum_grad_measure is None:
self._accum_grad_measure = [
g.clone() if g is not None else None for g in grads
]
else:
for acc, g in zip(self._accum_grad_measure, grads):
if acc is not None and g is not None:
acc.add_(g)

Copilot uses AI. Check for mistakes.
Comment thread perspic/analyzer.py
Comment on lines +193 to +227
# Gradient accumulation setup
self.micro_batch_size = micro_batch_size
self.effective_batch_size = effective_batch_size

if (
effective_batch_size is not None
and micro_batch_size is None
):
raise ValueError(
"micro_batch_size must be specified when "
"effective_batch_size is set."
)

if (
micro_batch_size is not None
and effective_batch_size is not None
):
if effective_batch_size < micro_batch_size:
raise ValueError(
f"effective_batch_size "
f"({effective_batch_size}) must be "
f">= micro_batch_size ({micro_batch_size})."
)
if effective_batch_size % micro_batch_size != 0:
raise ValueError(
f"effective_batch_size "
f"({effective_batch_size}) must be "
f"divisible by micro_batch_size "
f"({micro_batch_size})."
)
self.accumulation_steps = (
effective_batch_size // micro_batch_size
)
else:
self.accumulation_steps = 1

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

micro_batch_size/effective_batch_size are only validated for presence and divisibility, but not for being positive integers. Passing 0 (or negative values) will lead to ZeroDivisionError/nonsensical accumulation_steps. Add explicit validation that any provided batch-size parameter is an int and > 0 before using it in comparisons/modulus/division.

Copilot uses AI. Check for mistakes.
Comment thread perspic/analyzer.py
Comment on lines +539 to +544
# TODO: How would a measure batchsize different to the effective batch size work here?
# !!! We would need to accumulate separately and then combine at the end.
if x2 is not None and y2 is not None:
self._accumulate_linearizer_grads(
x2, y2, is_train=False
)

Copilot AI Apr 14, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The accumulated-analysis implementation assumes the measure batch (cross-response) can be accumulated with the same accumulation_steps/batch sizing as the train batch, but the new TODO comment indicates this case is not actually handled. To avoid silently incorrect cross metrics, either (a) enforce/validate that train & measure micro-batch sizes match and document the requirement, or (b) implement separate accumulation logic for the measure side as suggested by the comment.

Copilot uses AI. Check for mistakes.
@KonstiNik

Copy link
Copy Markdown
Member

Looks nice! I have questions about the general implementation here:

  • Lightning offers a native batch-accumulation implementation for training, see trainer = Trainer(accumulate_grad_batches=7). What if I would select that? How does it interact with your implementation?
  • Why can't we build our implementation on top of the built-in one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants