add gradient accumulation support#21
Conversation
- Introduced `micro_batch_size` and `effective_batch_size` parameters to the analyzer function. - Implemented gradient accumulation logic, allowing for optimization steps to occur after a specified number of micro-batches. - Added validation for batch size parameters to ensure compatibility. - Updated tests to cover new functionality, including parameter validation, optimizer behavior, and logging behavior during accumulation cycles.
There was a problem hiding this comment.
Pull request overview
Adds gradient accumulation support to perspic.analyzer.analyzer() so optimization steps and analysis/logging can be aligned to an “effective batch size” composed of multiple micro-batches.
Changes:
- Introduces
micro_batch_size/effective_batch_sizeparameters, validation, and an internalaccumulation_stepscomputation. - Implements optimizer stepping every
accumulation_stepsmicro-batches and switches scheduling/logging to aneffective_stepcounter. - Adds extensive unit tests for validation, optimizer behavior, scheduling, and metric/logging behavior under accumulation.
Reviewed changes
Copilot reviewed 2 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
perspic/analyzer.py |
Implements gradient accumulation, effective-step scheduling, accumulated metric buffers, and effective batch size logging. |
tests/unit/test_analyzer.py |
Adds a large new test suite for accumulation behavior and updates existing tests to use the new effective-step counter. |
.gitignore |
Expands ignored log/checkpoint/dataset patterns. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| self.model.zero_grad() | ||
| loss = self.criterion(self.model(x), y) | ||
| loss.backward() | ||
|
|
||
| loss_val = loss.detach().item() | ||
| if is_train: | ||
| self._accum_train_loss += loss_val | ||
| if self._accum_grad_train is None: | ||
| self._accum_grad_train = [ | ||
| p.grad.clone() if p.grad is not None else None | ||
| for p in self.model.parameters() | ||
| ] | ||
| else: | ||
| for acc, p in zip( | ||
| self._accum_grad_train, self.model.parameters() | ||
| ): | ||
| if acc is not None and p.grad is not None: | ||
| acc.add_(p.grad) | ||
| else: | ||
| self._accum_measure_loss += loss_val | ||
| if self._accum_grad_measure is None: | ||
| self._accum_grad_measure = [ | ||
| p.grad.clone() if p.grad is not None else None | ||
| for p in self.model.parameters() | ||
| ] | ||
| else: | ||
| for acc, p in zip( | ||
| self._accum_grad_measure, self.model.parameters() | ||
| ): | ||
| if acc is not None and p.grad is not None: | ||
| acc.add_(p.grad) | ||
|
|
||
| self.model.zero_grad() | ||
|
|
There was a problem hiding this comment.
With gradient accumulation enabled, _before_training_step() runs before manual_backward() and mid-cycle training gradients are expected to remain on parameters. However, the accumulated-analysis path calls self.sample_calc.compute(...) (which internally calls model.zero_grad()) and _accumulate_linearizer_grads() also does self.model.zero_grad() + loss.backward() + zero_grad() again. This will wipe the already-accumulated training gradients from previous micro-batches, breaking gradient accumulation when the analyzer is enabled. Consider preserving/restoring parameter .grad around analysis, or refactoring calculators to use torch.autograd.grad (no writes to .grad), or running analysis on a separate model copy so training gradients aren’t mutated.
| self.model.zero_grad() | |
| loss = self.criterion(self.model(x), y) | |
| loss.backward() | |
| loss_val = loss.detach().item() | |
| if is_train: | |
| self._accum_train_loss += loss_val | |
| if self._accum_grad_train is None: | |
| self._accum_grad_train = [ | |
| p.grad.clone() if p.grad is not None else None | |
| for p in self.model.parameters() | |
| ] | |
| else: | |
| for acc, p in zip( | |
| self._accum_grad_train, self.model.parameters() | |
| ): | |
| if acc is not None and p.grad is not None: | |
| acc.add_(p.grad) | |
| else: | |
| self._accum_measure_loss += loss_val | |
| if self._accum_grad_measure is None: | |
| self._accum_grad_measure = [ | |
| p.grad.clone() if p.grad is not None else None | |
| for p in self.model.parameters() | |
| ] | |
| else: | |
| for acc, p in zip( | |
| self._accum_grad_measure, self.model.parameters() | |
| ): | |
| if acc is not None and p.grad is not None: | |
| acc.add_(p.grad) | |
| self.model.zero_grad() | |
| loss = self.criterion(self.model(x), y) | |
| params = list(self.model.parameters()) | |
| grad_params = [p for p in params if p.requires_grad] | |
| grad_values = torch.autograd.grad( | |
| loss, | |
| grad_params, | |
| allow_unused=True, | |
| ) | |
| grads = [] | |
| grad_iter = iter(grad_values) | |
| for p in params: | |
| if p.requires_grad: | |
| grads.append(next(grad_iter)) | |
| else: | |
| grads.append(None) | |
| loss_val = loss.detach().item() | |
| if is_train: | |
| self._accum_train_loss += loss_val | |
| if self._accum_grad_train is None: | |
| self._accum_grad_train = [ | |
| g.clone() if g is not None else None for g in grads | |
| ] | |
| else: | |
| for acc, g in zip(self._accum_grad_train, grads): | |
| if acc is not None and g is not None: | |
| acc.add_(g) | |
| else: | |
| self._accum_measure_loss += loss_val | |
| if self._accum_grad_measure is None: | |
| self._accum_grad_measure = [ | |
| g.clone() if g is not None else None for g in grads | |
| ] | |
| else: | |
| for acc, g in zip(self._accum_grad_measure, grads): | |
| if acc is not None and g is not None: | |
| acc.add_(g) |
| # Gradient accumulation setup | ||
| self.micro_batch_size = micro_batch_size | ||
| self.effective_batch_size = effective_batch_size | ||
|
|
||
| if ( | ||
| effective_batch_size is not None | ||
| and micro_batch_size is None | ||
| ): | ||
| raise ValueError( | ||
| "micro_batch_size must be specified when " | ||
| "effective_batch_size is set." | ||
| ) | ||
|
|
||
| if ( | ||
| micro_batch_size is not None | ||
| and effective_batch_size is not None | ||
| ): | ||
| if effective_batch_size < micro_batch_size: | ||
| raise ValueError( | ||
| f"effective_batch_size " | ||
| f"({effective_batch_size}) must be " | ||
| f">= micro_batch_size ({micro_batch_size})." | ||
| ) | ||
| if effective_batch_size % micro_batch_size != 0: | ||
| raise ValueError( | ||
| f"effective_batch_size " | ||
| f"({effective_batch_size}) must be " | ||
| f"divisible by micro_batch_size " | ||
| f"({micro_batch_size})." | ||
| ) | ||
| self.accumulation_steps = ( | ||
| effective_batch_size // micro_batch_size | ||
| ) | ||
| else: | ||
| self.accumulation_steps = 1 |
There was a problem hiding this comment.
micro_batch_size/effective_batch_size are only validated for presence and divisibility, but not for being positive integers. Passing 0 (or negative values) will lead to ZeroDivisionError/nonsensical accumulation_steps. Add explicit validation that any provided batch-size parameter is an int and > 0 before using it in comparisons/modulus/division.
| # TODO: How would a measure batchsize different to the effective batch size work here? | ||
| # !!! We would need to accumulate separately and then combine at the end. | ||
| if x2 is not None and y2 is not None: | ||
| self._accumulate_linearizer_grads( | ||
| x2, y2, is_train=False | ||
| ) |
There was a problem hiding this comment.
The accumulated-analysis implementation assumes the measure batch (cross-response) can be accumulated with the same accumulation_steps/batch sizing as the train batch, but the new TODO comment indicates this case is not actually handled. To avoid silently incorrect cross metrics, either (a) enforce/validate that train & measure micro-batch sizes match and document the requirement, or (b) implement separate accumulation logic for the measure side as suggested by the comment.
|
Looks nice! I have questions about the general implementation here:
|
micro_batch_sizeandeffective_batch_sizeparameters to the analyzer function.