From b0414d86e8d354ea6434338ddfc06818df862780 Mon Sep 17 00:00:00 2001
From: Faizaan Gagan
Date: Sun, 3 Aug 2025 15:12:54 +0530
Subject: [PATCH] Update 10-scaling-up-road-to-the-top-part-3.ipynb

At [this point](https://youtu.be/p4ZZq0736Po?t=1104) in Lesson 7 of Practical
Deep Learning for Coders, you explained that we pass the effective batch size
to the constructor of the `GradientAccumulation` class. But the comment in the
notebook doesn't match that, which made it a bit confusing to figure out what
is actually the case.
---
 10-scaling-up-road-to-the-top-part-3.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/10-scaling-up-road-to-the-top-part-3.ipynb b/10-scaling-up-road-to-the-top-part-3.ipynb
index f21dbc819..476413cdb 100644
--- a/10-scaling-up-road-to-the-top-part-3.ipynb
+++ b/10-scaling-up-road-to-the-top-part-3.ipynb
@@ -161,7 +161,7 @@
   "id": "0eb24c71",
   "metadata": {},
   "source": [
-    "*Gradient accumulation* refers to a very simple trick: rather than updating the model weights after every batch based on that batch's gradients, instead keep *accumulating* (adding up) the gradients for a few batches, and them update the model weights with those accumulated gradients. In fastai, the parameter you pass to `GradientAccumulation` defines how many batches of gradients are accumulated. Since we're adding up the gradients over `accum` batches, we therefore need to divide the batch size by that same number. The resulting training loop is nearly mathematically identical to using the original batch size, but the amount of memory used is the same as using a batch size `accum` times smaller!\n",
+    "*Gradient accumulation* refers to a very simple trick: rather than updating the model weights after every batch based on that batch's gradients, instead keep *accumulating* (adding up) the gradients for a few batches, and then update the model weights with those accumulated gradients. In fastai, the parameter you pass to `GradientAccumulation` defines for how many input items gradients are accumulated. Since we're adding up the gradients over `accum` batches, we therefore need to divide the batch size by that same number. The resulting training loop is nearly mathematically identical to using the original batch size, but the amount of memory used is the same as using a batch size `accum` times smaller!\n",
     "\n",
     "For instance, here's a basic example of a single epoch of a training loop without gradient accumulation:\n",
     "\n",
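For context, the idea the corrected comment describes can be sketched as a plain PyTorch loop. This is an illustrative example only, not part of the patch: the toy model, the synthetic data, and the name `effective_bs` are all made up here, with `effective_bs` playing the role of the value passed to fastai's `GradientAccumulation` (a count of input items, not of batches):

```python
import torch
from torch import nn

# Hypothetical toy setup for illustration.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
lr = 0.01
effective_bs = 64  # accumulate gradients until this many items are seen

# A stand-in "DataLoader": 8 small batches of 16 items each.
dl = [(torch.randn(16, 10), torch.randint(0, 2, (16,))) for _ in range(8)]

count = 0
for xb, yb in dl:
    count += len(xb)                   # count *items*, not batches
    loss_fn(model(xb), yb).backward()  # .backward() *adds* into .grad
    if count >= effective_bs:          # enough items accumulated:
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad       # one step with the summed gradients
                p.grad.zero_()         # reset for the next accumulation
        count = 0
```

Because each small batch here has 16 items, the weights are only updated every `64 / 16 = 4` batches, while peak memory is that of a 16-item batch.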