Fix EfficientNet AMP NaN grads (FP16-only guard) #170
nkundiushuti wants to merge 1 commit into main from
Conversation
Pull request overview
Fixes a fine-tuning instability in the EfficientNet AVEX backbone where CUDA AMP FP16 can yield NaN gradients (while the forward pass and loss remain finite). The fix forces the EfficientNet feature extractor to run in FP32, but only under CUDA autocast FP16.
Changes:
- Add a CUDA-autocast-FP16 guard in `EfficientNet.forward()` to run `model.features(...)` in FP32 (autocast disabled).
- Preserve the existing gradient-checkpointing behavior inside both the guarded and un-guarded branches.
```python
needs_guard = x.is_cuda and torch.is_autocast_enabled() and torch.get_autocast_dtype("cuda") == torch.float16
if needs_guard:
    with torch.autocast(device_type="cuda", enabled=False):
        if self.gradient_checkpointing and self.training:
            features = self._checkpointed_features(x.float())
        else:
            features = self.model.features(x.float())
else:
    if self.gradient_checkpointing and self.training:
        features = self._checkpointed_features(x)
    else:
        features = self.model.features(x)
```
`needs_guard` currently triggers whenever CUDA autocast FP16 is enabled, including `model.eval()` / `torch.no_grad()` paths. Since the reported issue is NaN gradients, consider gating this guard on `self.training` and/or `torch.is_grad_enabled()` so FP16 autocast inference/eval keeps the expected throughput and memory benefits.
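The suggested gating could be sketched as a small predicate (hypothetical helper name `needs_fp32_guard`; in the real code the inputs would come from `x.is_cuda`, `torch.is_autocast_enabled()`, `torch.get_autocast_dtype("cuda")`, `self.training`, and `torch.is_grad_enabled()`):

```python
def needs_fp32_guard(is_cuda, autocast_enabled, autocast_dtype,
                     training, grad_enabled):
    """Decide whether to force FP32 for model.features(...).

    Hypothetical sketch of the reviewer's suggestion: only guard when
    gradients can actually flow, so FP16 autocast inference keeps its
    throughput and memory benefits.
    """
    return (
        is_cuda
        and autocast_enabled
        and autocast_dtype == "float16"   # leave BF16 autocast untouched
        and training                      # skip the guard in model.eval()
        and grad_enabled                  # skip the guard under torch.no_grad()
    )

# Training under CUDA FP16 autocast: guard applies.
assert needs_fp32_guard(True, True, "float16", True, True)
# Eval or no_grad under FP16 autocast: guard skipped.
assert not needs_fp32_guard(True, True, "float16", False, True)
assert not needs_fp32_guard(True, True, "float16", True, False)
# BF16 autocast: guard skipped regardless of mode.
assert not needs_fp32_guard(True, True, "bfloat16", True, True)
```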
This AMP stability behavior change isn't covered by the existing EfficientNet unit tests. Please add a CUDA-only test that exercises `forward()` under `torch.autocast("cuda", dtype=torch.float16)` and asserts the guarded path is taken (e.g., by checking the returned feature dtype when `return_features_only=True`, or by instrumenting/patching `model.features`).
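A CUDA-only test along those lines might look like the following sketch. The `EfficientNet()` constructor call and the `return_features_only=True` flag are assumptions about the repo's API; torch is imported lazily inside the test so the module loads without it.

```python
def test_fp16_autocast_runs_features_in_fp32():
    """Sketch of the requested CUDA-only regression test.

    Assumes the backbone can return raw features (return_features_only=True)
    and that the guarded path yields FP32 features even under FP16 autocast.
    """
    import pytest
    torch = pytest.importorskip("torch")
    if not torch.cuda.is_available():
        pytest.skip("CUDA required to exercise the FP16 autocast guard")

    model = EfficientNet().cuda().train()  # hypothetical constructor
    x = torch.randn(2, 3, 224, 224, device="cuda")

    with torch.autocast("cuda", dtype=torch.float16):
        features = model(x, return_features_only=True)

    # The FP32 guard should win over autocast's FP16 cast.
    assert features.dtype == torch.float32
```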
Fix for #169
This fixes a reproducible fine-tuning failure where `esp_aves2_effnetb0_all` (and the Animalspeak checkpoint) produce NaN gradients under CUDA autocast FP16 while the forward pass and loss remain finite. The fix runs EfficientNet's `model.features(...)` in FP32 only when the autocast dtype is FP16, leaving BF16 autocast unaffected. Verified via `tmp_bn_nan_repro.py` on `gs://representation-learning/models/efficientnet_animalspeak.pt` with AMP enabled.
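For reference, a minimal NaN-gradient scan of the kind such a repro script performs could look like this hypothetical helper (the actual `tmp_bn_nan_repro.py` is not shown here; torch is imported lazily so the sketch loads without it):

```python
def find_nonfinite_grads(model):
    """Return names of parameters whose gradients contain NaN/Inf.

    Hypothetical sketch: call after loss.backward() run under
    torch.autocast("cuda", dtype=torch.float16) to check for the bug.
    """
    import torch  # lazy import: only needed when actually scanning grads
    return [
        name
        for name, param in model.named_parameters()
        if param.grad is not None and not torch.isfinite(param.grad).all()
    ]
```

An empty list after `backward()` indicates the guarded FP32 path is holding; a non-empty list reproduces the original failure.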