diff --git a/how-to-guides/02-convert-pytorch-to-ignite.ipynb b/how-to-guides/02-convert-pytorch-to-ignite.ipynb index cf320d6..bf9c491 100644 --- a/how-to-guides/02-convert-pytorch-to-ignite.ipynb +++ b/how-to-guides/02-convert-pytorch-to-ignite.ipynb @@ -1,398 +1,408 @@ { - "nbformat": 4, - "nbformat_minor": 2, - "metadata": { - "colab": { - "name": "convert-pytorch-to-ignite.ipynb", - "provenance": [], - "collapsed_sections": [], - "toc_visible": true - }, - "kernelspec": { - "name": "python3", - "display_name": "Python 3" - }, - "language_info": { - "name": "python" - } + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "xo0JaCAvVI64" + }, + "source": [ + "\n", + "# How to convert pure PyTorch code to Ignite " + ] }, - "cells": [ - { - "cell_type": "markdown", - "source": [ - "\n", - "# How to convert pure PyTorch code to Ignite " - ], - "metadata": { - "id": "xo0JaCAvVI64" - } - }, - { - "cell_type": "markdown", - "source": [ - "In this guide, we will show how PyTorch code components can be converted into compact and flexible PyTorch-Ignite code. \n", - "\n", - "\n", - "\n", - "![Convert PyTorch to Ignite](assets/convert-pytorch2ignite.gif)" - ], - "metadata": { - "id": "CXNZ4XPeV8_I" - } - }, - { - "cell_type": "markdown", - "source": [ - "Since Ignite focuses on the training and validation pipeline, the code for models, datasets, optimizers, etc will remain user-defined and in pure PyTorch." - ], - "metadata": {} - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "model = ...\n", - "train_loader = ...\n", - "val_loader = ...\n", - "optimizer = ...\n", - "criterion = ..." - ], - "outputs": [], - "metadata": { - "id": "L6zvxAsVjP-Z" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Training Loop to `trainer`\n", - "\n", - "A typical PyTorch training loop processes a single batch of data, passes it through the `model`, calculates `loss`, etc as below:\n", - "\n", - "```python\n", - "for batch in train_loader:\n", - " model.train()\n", - " inputs, targets = batch\n", - " optimizer.zero_grad()\n", - " outputs = model(inputs)\n", - " loss = criterion(outputs, targets)\n", - " loss.backward()\n", - " optimizer.step()\n", - "```" - ], - "metadata": { - "id": "2EmmpiTX6huF" - } - }, - { - "cell_type": "markdown", - "source": [ - "To convert the above code into Ignite we need to move the code or steps taken to process a single batch of data while training under a function (`train_step()` below). This function will take `engine` and `batch` (current batch of data) as arguments and can return any data (usually the loss) that can be accessed via `engine.state.output`. We pass this function to `Engine` which creates a `trainer` object." 
- ], - "metadata": { - "id": "zDkeEWz58hCJ" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "from ignite.engine import Engine\n", - "\n", - "\n", - "def train_step(engine, batch):\n", - " model.train()\n", - " inputs, targets = batch\n", - " optimizer.zero_grad()\n", - " outputs = model(inputs)\n", - " loss = criterion(outputs, targets)\n", - " loss.backward()\n", - " optimizer.step()\n", - " return loss.item()\n", - "\n", - "\n", - "trainer = Engine(train_step)" - ], - "outputs": [], - "metadata": { - "id": "lkWiJVuvh-LC" - } - }, - { - "cell_type": "markdown", - "source": [ - "There are other [helper methods](https://pytorch.org/ignite/engine.html#helper-methods-to-define-supervised-trainer-and-evaluator) that directly create the `trainer` object without writing a custom function for some common use cases like [supervised training](https://pytorch.org/ignite/generated/ignite.engine.create_supervised_trainer.html#ignite.engine.create_supervised_trainer) and [truncated backprop through time](https://pytorch.org/ignite/contrib/engines.html#ignite.contrib.engines.tbptt.create_supervised_tbptt_trainer)." - ], - "metadata": { - "id": "4MWJzKK8-AiC" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Validation Loop to `evaluator`\n", - "\n", - "The validation loop typically makes predictions (`y_pred` below) on the `val_loader` batch by batch and uses them to calculate evaluation metrics (Accuracy, Intersection over Union, etc) as below:\n", - "\n", - "```python\n", - "model.eval()\n", - "num_correct = 0\n", - "num_examples = 0\n", - "\n", - "for batch in val_loader:\n", - " x, y = batch\n", - " y_pred = model(x)\n", - "\n", - " correct = torch.eq(torch.round(y_pred).type(y.type()), y).view(-1)\n", - " num_correct = torch.sum(correct).item()\n", - " num_examples = correct.shape[0]\n", - " print(f\"Epoch: {epoch}, Accuracy: {num_correct / num_examples}\")\n", - "```" - ], - "metadata": { - "id": "cocfuUFZ8okw" - } - }, - { - "cell_type": "markdown", - "source": [ - "We will convert this to Ignite in two steps by separating the validation and metrics logic.\n", - "\n", - "We will move the model evaluation logic under another function (`validation_step()` below) which receives the same parameters as `train_step()` and processes a single batch of data to return some output (usually the predicted and actual value which can be used to calculate metrics) stored in `engine.state.output`. Another instance (called `evaluator` below) of `Engine` is created by passing the `validation_step()` function." - ], - "metadata": { - "id": "N0ETiWo9E0D4" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "def validation_step(engine, batch):\n", - " model.eval()\n", - " with torch.no_grad():\n", - " x, y = batch\n", - " y_pred = model(x)\n", - "\n", - " return y_pred, y\n", - " \n", - " \n", - "evaluator = Engine(validation_step)" - ], - "outputs": [], - "metadata": { - "id": "zv2kceT0CS-L" - } - }, - { - "cell_type": "markdown", - "source": [ - "Similar to the training loop, there are [helper methods](https://pytorch.org/ignite/engine.html#helper-methods-to-define-supervised-trainer-and-evaluator) to avoid writing this custom evaluation function like [`create_supervised_evaluator`](https://pytorch.org/ignite/generated/ignite.engine.create_supervised_evaluator.html#ignite.engine.create_supervised_evaluator).\n", - "\n", - "**Note**: You can create different evaluators for training, validation, and testing if they serve different purposes. 
A common practice is to have two separate evaluators for training and validation, since the results of the validation evaluator are helpful in determining the best model to save after training." - ], - "metadata": { - "id": "EAIBqfFm8oqS" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Switch to built-in Metrics\n", - "\n", - "Then we can replace the code for calculating metrics like accuracy and instead use several [out-of-the-box metrics](https://pytorch.org/ignite/metrics.html#complete-list-of-metrics) that Ignite provides or write a custom one (refer [here](https://pytorch.org/ignite/metrics.html#how-to-create-a-custom-metric)). The metrics will be computed using the `evaluator`'s output. Finally, we attach these metrics to the `evaluator` by providing a key name (\"accuracy\" below) so they can be accessed via `engine.state.metrics[key_name]`." - ], - "metadata": { - "id": "4t4PsYXn8ost" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "from ignite.metrics import Accuracy\n", - "\n", - "Accuracy().attach(evaluator, \"accuracy\")" - ], - "outputs": [], - "metadata": { - "id": "iUVAOP6kFdA-" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Organizing code into Events and Handlers\n", - "\n", - "Next, we need to identify any code that is triggered when an event occurs. Examples of events can be the start of an iteration, completion of an epoch, or even the start of backprop. We already provide some predefined events (complete list [here](https://pytorch.org/ignite/generated/ignite.engine.events.Events.html#ignite.engine.events.Events)) however we can also create custom ones (refer [here](https://pytorch.org/ignite/concepts.html#custom-events)). We move the event-specific code to different handlers (named functions, lambdas, class functions) which are attached to these events and executed whenever a specific event happens. Here are some common handlers:" - ], - "metadata": { - "id": "WnGK925N5AR7" - } - }, - { - "cell_type": "markdown", - "source": [ - "### Running `evaluator`\n", - "\n", - "We can convert the code that runs the `evaluator` on the training/validation/test dataset after `validate_every` epoch:\n", - "\n", - "```python\n", - "if epoch % validate_every == 0:\n", - " # Validation logic\n", - "```\n", - "\n", - "by attaching a handler to a built-in event `EPOCH_COMPLETED` like:" - ], - "metadata": { - "id": "uZIdI39b-rB4" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "from ignite.engine import Events\n", - "\n", - "validate_every = 10\n", - "\n", - "\n", - "@trainer.on(Events.EPOCH_COMPLETED(every=validate_every))\n", - "def run_validation():\n", - " evaluator.run(val_loader)" - ], - "outputs": [], - "metadata": { - "id": "62Z6RmfJVn7s" - } - }, - { - "cell_type": "markdown", - "source": [ - "### Logging metrics\n", - "\n", - "Similarly, we can log the validation metrics in another handler or combine it with the above handler." 
- ], - "metadata": { - "id": "7bkte_sKb-vr" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "@trainer.on(Events.EPOCH_COMPLETED(every=validate_every))\n", - "def log_validation():\n", - " metrics = evaluator.state.metrics\n", - " print(f\"Epoch: {trainer.state.epoch}, Accuracy: {metrics['accuracy']}\")" - ], - "outputs": [], - "metadata": { - "id": "ZExU6_CscHyf" - } - }, - { - "cell_type": "markdown", - "source": [ - "### Progress Bar\n", - "\n", - "We use a built-in wrapper around `tqdm` called [`ProgressBar()`](https://pytorch.org/ignite/generated/ignite.contrib.handlers.tqdm_logger.html#module-ignite.contrib.handlers.tqdm_logger)." - ], - "metadata": { - "id": "sRgDrTgi5AU_" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "from ignite.contrib.handlers import ProgressBar\n", - "\n", - "ProgressBar().attach(trainer)" - ], - "outputs": [], - "metadata": { - "id": "0j79aG7ddmk6" - } - }, - { - "cell_type": "markdown", - "source": [ - "### Checkpointing\n", - "\n", - "Instead of saving all models after `checkpoint_every` epoch:\n", - "```python\n", - "if epoch % checkpoint_every == 0:\n", - " checkpoint(model, optimizer, \"checkpoint_dir\")\n", - "```\n", - "\n", - "we can smartly save the best `n_saved` models (depending on `evaluator.state.metrics`), and the state of `optimizer` and `trainer` via the built-in [`Checkpoint()`](https://pytorch.org/ignite/generated/ignite.handlers.checkpoint.Checkpoint.html#checkpoint).\n" - ], - "metadata": { - "id": "vkqMcVnA5AZ3" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "from ignite.handlers import Checkpoint, DiskSaver\n", - "\n", - "checkpoint_every = 5\n", - "checkpoint_dir = ...\n", - "\n", - "\n", - "checkpointer = Checkpoint(\n", - " to_save={'model': model, 'optimizer': optimizer, 'trainer': trainer},\n", - " save_handler=DiskSaver(checkpoint_dir, create_dir=True), n_saved=2\n", - ")\n", - "trainer.add_event_handler(\n", - " Events.EPOCH_COMPLETED(every=checkpoint_every), checkpointer\n", - ")" - ], - "outputs": [], - "metadata": { - "id": "VAkDj1fpoSij" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Run for a number of epochs\n", - "\n", - "Finally, instead of:\n", - "```python\n", - "max_epochs = ...\n", - "\n", - "for epoch in range(max_epochs):\n", - "```\n", - "we begin training on `train_loader` via:\n", - "```python\n", - "trainer.run(train_loader, max_epochs)\n", - "```" - ], - "metadata": { - "id": "WbByMD6xYpgM" - } - }, - { - "cell_type": "markdown", - "source": [ - "An end-to-end example implementing the above principles can be found [here](https://pytorch-ignite.ai/tutorials/getting-started/#complete-code)." - ], - "metadata": {} - } - ] + { + "cell_type": "markdown", + "metadata": { + "id": "CXNZ4XPeV8_I" + }, + "source": [ + "In this guide, we will show how PyTorch code components can be converted into compact and flexible PyTorch-Ignite code. \n", + "\n", + "\n", + "\n", + "![Convert PyTorch to Ignite](assets/convert-pytorch2ignite.gif)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since Ignite focuses on the training and validation pipeline, the code for models, datasets, optimizers, etc will remain user-defined and in pure PyTorch." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L6zvxAsVjP-Z" + }, + "outputs": [], + "source": [ + "model = ...\n", + "train_loader = ...\n", + "val_loader = ...\n", + "optimizer = ...\n", + "criterion = ..." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2EmmpiTX6huF" + }, + "source": [ + "## Training Loop to `trainer`\n", + "\n", + "A typical PyTorch training loop processes a single batch of data, passes it through the `model`, calculates `loss`, etc as below:\n", + "\n", + "```python\n", + "for batch in train_loader:\n", + " model.train()\n", + " inputs, targets = batch\n", + " optimizer.zero_grad()\n", + " outputs = model(inputs)\n", + " loss = criterion(outputs, targets)\n", + " loss.backward()\n", + " optimizer.step()\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zDkeEWz58hCJ" + }, + "source": [ + "To convert the above code into Ignite we need to move the code or steps taken to process a single batch of data while training under a function (`train_step()` below). This function will take `engine` and `batch` (current batch of data) as arguments and can return any data (usually the loss) that can be accessed via `engine.state.output`. We pass this function to `Engine` which creates a `trainer` object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lkWiJVuvh-LC" + }, + "outputs": [], + "source": [ + "from ignite.engine import Engine\n", + "\n", + "\n", + "def train_step(engine, batch):\n", + " model.train()\n", + " inputs, targets = batch\n", + " optimizer.zero_grad()\n", + " outputs = model(inputs)\n", + " loss = criterion(outputs, targets)\n", + " loss.backward()\n", + " optimizer.step()\n", + " return loss.item()\n", + "\n", + "\n", + "trainer = Engine(train_step)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4MWJzKK8-AiC" + }, + "source": [ + "There are other [helper methods](https://pytorch.org/ignite/engine.html#helper-methods-to-define-supervised-trainer-and-evaluator) that directly create the `trainer` object without writing a custom function for some common use cases like [supervised training](https://pytorch.org/ignite/generated/ignite.engine.create_supervised_trainer.html#ignite.engine.create_supervised_trainer) and [truncated backprop through time](https://pytorch.org/ignite/contrib/engines.html#ignite.contrib.engines.tbptt.create_supervised_tbptt_trainer)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cocfuUFZ8okw" + }, + "source": [ + "## Validation Loop to `evaluator`\n", + "\n", + "The validation loop typically makes predictions (`y_pred` below) on the `val_loader` batch by batch and uses them to calculate evaluation metrics (Accuracy, Intersection over Union, etc) as below:\n", + "\n", + "```python\n", + "model.eval()\n", + "num_correct = 0\n", + "num_examples = 0\n", + "\n", + "for batch in val_loader:\n", + " x, y = batch\n", + " y_pred = model(x)\n", + "\n", + " correct = torch.eq(torch.round(y_pred).type(y.type()), y).view(-1)\n", + " num_correct = torch.sum(correct).item()\n", + " num_examples = correct.shape[0]\n", + " print(f\"Epoch: {epoch}, Accuracy: {num_correct / num_examples}\")\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "N0ETiWo9E0D4" + }, + "source": [ + "We will convert this to Ignite in two steps by separating the validation and metrics logic.\n", + "\n", + "We will move the model evaluation logic under another function (`validation_step()` below) which receives the same parameters as `train_step()` and processes a single batch of data to return some output (usually the predicted and actual value which can be used to calculate metrics) stored in `engine.state.output`. 
Another instance (called `evaluator` below) of `Engine` is created by passing the `validation_step()` function." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zv2kceT0CS-L" + }, + "outputs": [], + "source": [ + "def validation_step(engine, batch):\n", + " model.eval()\n", + " with torch.no_grad():\n", + " x, y = batch\n", + " y_pred = model(x)\n", + "\n", + " return y_pred, y\n", + " \n", + " \n", + "evaluator = Engine(validation_step)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EAIBqfFm8oqS" + }, + "source": [ + "Similar to the training loop, there are [helper methods](https://pytorch.org/ignite/engine.html#helper-methods-to-define-supervised-trainer-and-evaluator) to avoid writing this custom evaluation function like [`create_supervised_evaluator`](https://pytorch.org/ignite/generated/ignite.engine.create_supervised_evaluator.html#ignite.engine.create_supervised_evaluator).\n", + "\n", + "**Note**: You can create different evaluators for training, validation, and testing if they serve different purposes. A common practice is to have two separate evaluators for training and validation, since the results of the validation evaluator are helpful in determining the best model to save after training." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4t4PsYXn8ost" + }, + "source": [ + "## Switch to built-in Metrics\n", + "\n", + "Then we can replace the code for calculating metrics like accuracy and instead use several [out-of-the-box metrics](https://pytorch.org/ignite/metrics.html#complete-list-of-metrics) that Ignite provides or write a custom one (refer [here](https://pytorch.org/ignite/metrics.html#how-to-create-a-custom-metric)). The metrics will be computed using the `evaluator`'s output. Finally, we attach these metrics to the `evaluator` by providing a key name (\"accuracy\" below) so they can be accessed via `engine.state.metrics[key_name]`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "iUVAOP6kFdA-" + }, + "outputs": [], + "source": [ + "from ignite.metrics import Accuracy\n", + "\n", + "Accuracy().attach(evaluator, \"accuracy\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WnGK925N5AR7" + }, + "source": [ + "## Organizing code into Events and Handlers\n", + "\n", + "Next, we need to identify any code that is triggered when an event occurs. Examples of events can be the start of an iteration, completion of an epoch, or even the start of backprop. We already provide some predefined events (complete list [here](https://pytorch.org/ignite/generated/ignite.engine.events.Events.html#ignite.engine.events.Events)) however we can also create custom ones (refer [here](https://pytorch.org/ignite/concepts.html#custom-events)). We move the event-specific code to different handlers (named functions, lambdas, class functions) which are attached to these events and executed whenever a specific event happens. 
Here are some common handlers:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uZIdI39b-rB4" + }, + "source": [ + "### Running `evaluator`\n", + "\n", + "We can convert the code that runs the `evaluator` on the training/validation/test dataset after `validate_every` epoch:\n", + "\n", + "```python\n", + "if epoch % validate_every == 0:\n", + " # Validation logic\n", + "```\n", + "\n", + "by attaching a handler to a built-in event `EPOCH_COMPLETED` like:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "62Z6RmfJVn7s" + }, + "outputs": [], + "source": [ + "from ignite.engine import Events\n", + "\n", + "validate_every = 10\n", + "\n", + "\n", + "@trainer.on(Events.EPOCH_COMPLETED(every=validate_every))\n", + "def run_validation():\n", + " evaluator.run(val_loader)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7bkte_sKb-vr" + }, + "source": [ + "### Logging metrics\n", + "\n", + "Similarly, we can log the validation metrics in another handler or combine it with the above handler." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZExU6_CscHyf" + }, + "outputs": [], + "source": [ + "@trainer.on(Events.EPOCH_COMPLETED(every=validate_every))\n", + "def log_validation():\n", + " metrics = evaluator.state.metrics\n", + " print(f\"Epoch: {trainer.state.epoch}, Accuracy: {metrics['accuracy']}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sRgDrTgi5AU_" + }, + "source": [ + "### Progress Bar\n", + "\n", + "We use a built-in wrapper around `tqdm` called [`ProgressBar()`](https://pytorch.org/ignite/generated/ignite.contrib.handlers.tqdm_logger.html#module-ignite.contrib.handlers.tqdm_logger)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0j79aG7ddmk6" + }, + "outputs": [], + "source": [ + "from ignite.contrib.handlers import ProgressBar\n", + "\n", + "ProgressBar().attach(trainer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vkqMcVnA5AZ3" + }, + "source": [ + "### Checkpointing\n", + "\n", + "Instead of saving all models after `checkpoint_every` epoch:\n", + "```python\n", + "if epoch % checkpoint_every == 0:\n", + " checkpoint(model, optimizer, \"checkpoint_dir\")\n", + "```\n", + "\n", + "we can smartly save the best `n_saved` models (depending on `evaluator.state.metrics`), and the state of `optimizer` and `trainer` via the built-in [`Checkpoint()`](https://pytorch.org/ignite/generated/ignite.handlers.checkpoint.Checkpoint.html#checkpoint).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VAkDj1fpoSij" + }, + "outputs": [], + "source": [ + "from ignite.handlers import Checkpoint\n", + "\n", + "checkpoint_every = 5\n", + "checkpoint_dir = ...\n", + "\n", + "\n", + "checkpointer = Checkpoint(\n", + " to_save={'model': model, 'optimizer': optimizer, 'trainer': trainer},\n", + " save_handler=checkpoint_dir, n_saved=2\n", + ")\n", + "trainer.add_event_handler(\n", + " Events.EPOCH_COMPLETED(every=checkpoint_every), checkpointer\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WbByMD6xYpgM" + }, + "source": [ + "## Run for a number of epochs\n", + "\n", + "Finally, instead of:\n", + "```python\n", + "max_epochs = ...\n", + "\n", + "for epoch in range(max_epochs):\n", + "```\n", + "we begin training on `train_loader` via:\n", + "```python\n", + "trainer.run(train_loader, max_epochs)\n", + "```" + ] + }, + { + "cell_type": "markdown", + 
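+   "metadata": {},
+   "source": [
+    "To see how these pieces fit together, below is a minimal, self-contained sketch that wires up a `trainer`, an `evaluator` with an attached metric, and an epoch-completed handler. The toy model, random data and hyperparameter values are placeholders for illustration only and are not part of the guide's running example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "import torch.nn as nn\n",
+    "from torch.utils.data import DataLoader, TensorDataset\n",
+    "\n",
+    "from ignite.engine import Engine, Events\n",
+    "from ignite.metrics import Accuracy\n",
+    "\n",
+    "# Toy binary-classification setup (placeholders for your own model and data)\n",
+    "model = nn.Sequential(nn.Linear(10, 1), nn.Sigmoid())\n",
+    "criterion = nn.BCELoss()\n",
+    "optimizer = torch.optim.SGD(model.parameters(), lr=0.1)\n",
+    "\n",
+    "X = torch.randn(256, 10)\n",
+    "y = torch.randint(0, 2, (256, 1)).float()\n",
+    "train_loader = DataLoader(TensorDataset(X, y), batch_size=32)\n",
+    "val_loader = DataLoader(TensorDataset(X, y), batch_size=32)\n",
+    "\n",
+    "\n",
+    "def train_step(engine, batch):\n",
+    "    model.train()\n",
+    "    inputs, targets = batch\n",
+    "    optimizer.zero_grad()\n",
+    "    loss = criterion(model(inputs), targets)\n",
+    "    loss.backward()\n",
+    "    optimizer.step()\n",
+    "    return loss.item()\n",
+    "\n",
+    "\n",
+    "def validation_step(engine, batch):\n",
+    "    model.eval()\n",
+    "    with torch.no_grad():\n",
+    "        x, y_true = batch\n",
+    "        # Round probabilities to 0/1 so that Accuracy can consume them directly\n",
+    "        return torch.round(model(x)), y_true\n",
+    "\n",
+    "\n",
+    "trainer = Engine(train_step)\n",
+    "evaluator = Engine(validation_step)\n",
+    "Accuracy().attach(evaluator, \"accuracy\")\n",
+    "\n",
+    "\n",
+    "@trainer.on(Events.EPOCH_COMPLETED)\n",
+    "def run_validation_and_log():\n",
+    "    evaluator.run(val_loader)\n",
+    "    acc = evaluator.state.metrics[\"accuracy\"]\n",
+    "    print(f\"Epoch: {trainer.state.epoch}, Accuracy: {acc:.2f}\")\n",
+    "\n",
+    "\n",
+    "trainer.run(train_loader, max_epochs=2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",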
"metadata": {}, + "source": [ + "An end-to-end example implementing the above principles can be found [here](https://pytorch-ignite.ai/tutorials/getting-started/#complete-code)." + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "convert-pytorch-to-ignite.ipynb", + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 2 } \ No newline at end of file diff --git a/tutorials/intermediate/01-cifar10-distributed.ipynb b/tutorials/intermediate/01-cifar10-distributed.ipynb index 8aa3535..ed365e3 100644 --- a/tutorials/intermediate/01-cifar10-distributed.ipynb +++ b/tutorials/intermediate/01-cifar10-distributed.ipynb @@ -1,1558 +1,1556 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "_9NEVKMz0v5s" - }, - "source": [ - "\n", - "# Distributed Training with Ignite on CIFAR10 " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fHmvDGFx10HT" - }, - "source": [ - "This tutorial is a brief introduction on how you can do distributed training with Ignite on one or more CPUs, GPUs or TPUs. We will also introduce several helper functions and Ignite concepts (setup common training handlers, save to/ load from checkpoints, etc.) which you can easily incorporate in your code.\n", - "\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "trJ7_a7f17pg" - }, - "source": [ - "We will use distributed training to train a predefined [ResNet18](https://pytorch.org/vision/stable/models.html#torchvision.models.resnet18) on [CIFAR10](https://pytorch.org/vision/stable/datasets.html#torchvision.datasets.CIFAR10) using either of the following configurations:\n", - "\n", - "* Single Node, One or More GPUs\n", - "* Multiple Nodes, Multiple GPUs\n", - "* Single Node, Multiple CPUs\n", - "* TPUs on Google Colab\n", - "* On Jupyter Notebooks\n", - "\n", - "The type of distributed training we will use is called data parallelism in which we:\n", - "\n", - "> 1. Copy the model on each GPU\n", - "> 2. Split the dataset and fit the models on different subsets\n", - "> 3. Communicate the gradients at each iteration to keep the models in sync\n", - ">\n", - "> -- [Distributed Deep Learning 101: Introduction](https://towardsdatascience.com/distributed-deep-learning-101-introduction-ebfc1bcd59d9)\n", - "\n", - "PyTorch provides a [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) API for this task however the implementation that supports different backends + configurations is tedious. 
In this example, we will see how we can enable distributed data parallel training, adaptable to various backends, in just a few lines of code, along with:\n",
-    "* Computing training and validation metrics\n",
-    "* Setting up logging (and connecting with ClearML)\n",
-    "* Saving the best model weights\n",
-    "* Setting up the LR scheduler\n",
-    "* Using Automatic Mixed Precision"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "id": "kWLrQ6EH4uoD"
-   },
-   "source": [
-    "## Required Dependencies"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "id": "e7k6WVw5_uts"
-   },
-   "outputs": [],
-   "source": [
-    "!pip install pytorch-ignite"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "id": "pcvTTo1s8Huq"
-   },
-   "source": [
-    "### For parsing arguments"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "id": "OIgDDJbS8Fy6"
-   },
-   "outputs": [],
-   "source": [
-    "!pip install fire"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "id": "d0c2e4I4FWoT"
-   },
-   "source": [
-    "### For TPUs"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "id": "Gg74Dc3UFUcG"
-   },
-   "outputs": [],
-   "source": [
-    "VERSION = !curl -s https://api.github.com/repos/pytorch/xla/releases/latest | grep -Po '\"tag_name\": \"v\\K.*?(?=\")'\n",
-    "VERSION = VERSION[0].rstrip('.0') # remove trailing zero\n",
-    "!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-{VERSION}-cp37-cp37m-linux_x86_64.whl"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "id": "DMGTfKx7o1zr"
-   },
-   "source": [
-    "### With ClearML (Optional)\n",
-    "\n",
-    "We can enable logging with ClearML to track experiments as follows:\n",
-    "\n",
-    "- Make sure you have a ClearML account: https://app.community.clear.ml/\n",
-    "- Create a credential: Profile > Create new credentials > Copy to clipboard\n",
-    "- Run `clearml-init` and paste the credentials"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "id": "Ds3-t1PqpAO0"
-   },
-   "outputs": [],
-   "source": [
-    "!pip install clearml\n",
-    "!clearml-init"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "id": "i4QuWxYFL-Da"
-   },
-   "source": [
-    "Specify `with_clearml=True` in `config` below and monitor the experiment on the dashboard. Refer to the end of this tutorial to see an example of such an experiment."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "id": "09xlncFZfmeS"
-   },
-   "source": [
-    "## Download Data\n",
-    "\n",
-    "Let's download our data first so that it can later be used by all the processes to instantiate our dataloaders. The following command will download the CIFAR10 dataset to a folder `cifar10`."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "id": "gRpACWtUfyn1",
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "!python -c \"from torchvision.datasets import CIFAR10; CIFAR10('cifar10', download=True)\""
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "id": "D71VkD74he9J"
-   },
-   "source": [
-    "## Common Configuration\n",
-    "\n",
-    "We maintain a `config` dictionary which can be extended or changed to store parameters required during training. We can refer back to this code when we use these parameters later."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T16:59:33.061456Z", - "iopub.status.busy": "2021-09-14T16:59:33.061157Z", - "iopub.status.idle": "2021-09-14T16:59:33.069066Z", - "shell.execute_reply": "2021-09-14T16:59:33.068110Z", - "shell.execute_reply.started": "2021-09-14T16:59:33.061424Z" - }, - "id": "9bg7unvqhegL" - }, - "outputs": [], - "source": [ - "config = {\n", - " \"seed\": 543,\n", - " \"data_path\": \"cifar10\",\n", - " \"output_path\": \"output-cifar10/\",\n", - " \"model\": \"resnet18\",\n", - " \"batch_size\": 512,\n", - " \"momentum\": 0.9,\n", - " \"weight_decay\": 1e-4,\n", - " \"num_workers\": 2,\n", - " \"num_epochs\": 5,\n", - " \"learning_rate\": 0.4,\n", - " \"num_warmup_epochs\": 1,\n", - " \"validate_every\": 3,\n", - " \"checkpoint_every\": 200,\n", - " \"backend\": None,\n", - " \"resume_from\": None,\n", - " \"log_every_iters\": 15,\n", - " \"nproc_per_node\": None,\n", - " \"with_clearml\": False,\n", - " \"with_amp\": False,\n", - "}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "DzuG8QAr5Djf" - }, - "source": [ - "## Basic Setup" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bIgzky7Q7kUk" - }, - "source": [ - "### Imports" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T16:59:34.353096Z", - "iopub.status.busy": "2021-09-14T16:59:34.352425Z", - "iopub.status.idle": "2021-09-14T17:00:15.082104Z", - "shell.execute_reply": "2021-09-14T17:00:15.080743Z", - "shell.execute_reply.started": "2021-09-14T16:59:34.353037Z" - }, - "id": "0RVISbXd_h1F" - }, - "outputs": [], - "source": [ - "from datetime import datetime\n", - "from pathlib import Path\n", - "\n", - "import torch\n", - "import torch.nn as nn\n", - "import torch.optim as optim\n", - "from torchvision import datasets, models\n", - "from torchvision.transforms import (\n", - " Compose,\n", - " Normalize,\n", - " Pad,\n", - " RandomCrop,\n", - " RandomHorizontalFlip,\n", - " ToTensor,\n", - ")\n", - "\n", - "import ignite\n", - "import ignite.distributed as idist\n", - "from ignite.contrib.engines import common\n", - "from ignite.handlers import PiecewiseLinear\n", - "from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator\n", - "from ignite.handlers import Checkpoint, DiskSaver, global_step_from_engine\n", - "from ignite.metrics import Accuracy, Loss\n", - "from ignite.utils import manual_seed, setup_logger" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "fVXiYInWikTn" - }, - "source": [ - "Next we will take the help of `auto_` methods in `idist` ([`ignite.distributed`](https://pytorch.org/ignite/distributed.html#)) to make our dataloaders, model and optimizer automatically adapt to the current configuration `backend=None` (non-distributed) or for backends like `nccl`, `gloo`, and `xla-tpu` (distributed).\n", - "\n", - "Note that we are free to partially use or not use `auto_` methods at all and instead can implement something custom." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1r83mIS4QCdg" - }, - "source": [ - "### Dataloaders\n", - "\n", - "Next we are going to instantiate the train and test datasets from `data_path`, apply transforms to it and return them via `get_train_test_datasets()`." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:17.777175Z", - "iopub.status.busy": "2021-09-14T17:01:17.776871Z", - "iopub.status.idle": "2021-09-14T17:01:17.785864Z", - "shell.execute_reply": "2021-09-14T17:01:17.784669Z", - "shell.execute_reply.started": "2021-09-14T17:01:17.777146Z" - }, - "id": "Y3BKYL7XGpZL" - }, - "outputs": [], - "source": [ - "def get_train_test_datasets(path):\n", - " train_transform = Compose(\n", - " [\n", - " Pad(4),\n", - " RandomCrop(32, fill=128),\n", - " RandomHorizontalFlip(),\n", - " ToTensor(),\n", - " Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),\n", - " ]\n", - " )\n", - " test_transform = Compose(\n", - " [\n", - " ToTensor(),\n", - " Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),\n", - " ]\n", - " )\n", - "\n", - " train_ds = datasets.CIFAR10(\n", - " root=path, train=True, download=False, transform=train_transform\n", - " )\n", - " test_ds = datasets.CIFAR10(\n", - " root=path, train=False, download=False, transform=test_transform\n", - " )\n", - "\n", - " return train_ds, test_ds" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2GkEcAYiRbgO" - }, - "source": [ - "Finally, we pass the datasets to [`auto_dataloader()`](https://pytorch.org/ignite/generated/ignite.distributed.auto.auto_dataloader.html#ignite.distributed.auto.auto_dataloader)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:18.697479Z", - "iopub.status.busy": "2021-09-14T17:01:18.697159Z", - "iopub.status.idle": "2021-09-14T17:01:18.704779Z", - "shell.execute_reply": "2021-09-14T17:01:18.703467Z", - "shell.execute_reply.started": "2021-09-14T17:01:18.697445Z" - }, - "id": "7rNV-UDwRPtO" - }, - "outputs": [], - "source": [ - "def get_dataflow(config):\n", - " train_dataset, test_dataset = get_train_test_datasets(config[\"data_path\"])\n", - "\n", - " train_loader = idist.auto_dataloader(\n", - " train_dataset,\n", - " batch_size=config[\"batch_size\"],\n", - " num_workers=config[\"num_workers\"],\n", - " shuffle=True,\n", - " drop_last=True,\n", - " )\n", - "\n", - " test_loader = idist.auto_dataloader(\n", - " test_dataset,\n", - " batch_size=2 * config[\"batch_size\"],\n", - " num_workers=config[\"num_workers\"],\n", - " shuffle=False,\n", - " )\n", - " return train_loader, test_loader" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "iNLvDK-cS2sH" - }, - "source": [ - "### Model\n", - "\n", - "We check if the model given in `config` is present in [torchvision.models](https://pytorch.org/vision/stable/models.html), change the last layer to output 10 classes (as present in CIFAR10) and pass it to [`auto_model()`](https://pytorch.org/ignite/generated/ignite.distributed.auto.auto_model.html#auto-model) which makes it automatically adaptable for non-distributed and distributed configurations.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:18.782444Z", - "iopub.status.busy": "2021-09-14T17:01:18.782146Z", - "iopub.status.idle": "2021-09-14T17:01:18.789078Z", - "shell.execute_reply": "2021-09-14T17:01:18.787728Z", - "shell.execute_reply.started": "2021-09-14T17:01:18.782416Z" - }, - "id": "toShlIcW5oFd" - }, - "outputs": [], - "source": [ - "def get_model(config):\n", - " model_name = config[\"model\"]\n", - " if model_name in models.__dict__:\n", - " fn = 
models.__dict__[model_name]\n", - " else:\n", - " raise RuntimeError(f\"Unknown model name {model_name}\")\n", - "\n", - " model = idist.auto_model(fn(num_classes=10))\n", - "\n", - " return model" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "V14TfyCT8jQW" - }, - "source": [ - "### Optimizer\n", - "\n", - "Then we can setup the optimizer using hyperameters from `config` and pass it through [`auto_optim()`](https://pytorch.org/ignite/generated/ignite.distributed.auto.auto_optim.html#ignite.distributed.auto.auto_optim)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:18.842622Z", - "iopub.status.busy": "2021-09-14T17:01:18.842285Z", - "iopub.status.idle": "2021-09-14T17:01:18.849476Z", - "shell.execute_reply": "2021-09-14T17:01:18.848516Z", - "shell.execute_reply.started": "2021-09-14T17:01:18.842592Z" - }, - "id": "Iddv29eh5qU9" - }, - "outputs": [], - "source": [ - "def get_optimizer(config, model):\n", - " optimizer = optim.SGD(\n", - " model.parameters(),\n", - " lr=config[\"learning_rate\"],\n", - " momentum=config[\"momentum\"],\n", - " weight_decay=config[\"weight_decay\"],\n", - " nesterov=True,\n", - " )\n", - " optimizer = idist.auto_optim(optimizer)\n", - "\n", - " return optimizer" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VI0o7hgr8l9q" - }, - "source": [ - "### Criterion\n", - "\n", - "We put the loss function on `device`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:19.324780Z", - "iopub.status.busy": "2021-09-14T17:01:19.324452Z", - "iopub.status.idle": "2021-09-14T17:01:19.329773Z", - "shell.execute_reply": "2021-09-14T17:01:19.328546Z", - "shell.execute_reply.started": "2021-09-14T17:01:19.324748Z" - }, - "id": "DVDKkYqS5siE" - }, - "outputs": [], - "source": [ - "def get_criterion():\n", - " return nn.CrossEntropyLoss().to(idist.device())" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9No0Ockx8oRC" - }, - "source": [ - "### LR Scheduler\n", - "\n", - "We will use [PiecewiseLinear](https://pytorch.org/ignite/generated/ignite.handlers.param_scheduler.PiecewiseLinear.html#ignite.handlers.param_scheduler.PiecewiseLinear) which is one of the [various LR Schedulers](https://pytorch.org/ignite/handlers.html#parameter-scheduler) Ignite provides.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:19.737913Z", - "iopub.status.busy": "2021-09-14T17:01:19.737620Z", - "iopub.status.idle": "2021-09-14T17:01:19.746210Z", - "shell.execute_reply": "2021-09-14T17:01:19.744390Z", - "shell.execute_reply.started": "2021-09-14T17:01:19.737884Z" - }, - "id": "UcYuVeYic_e7" - }, - "outputs": [], - "source": [ - "def get_lr_scheduler(config, optimizer):\n", - " milestones_values = [\n", - " (0, 0.0),\n", - " (config[\"num_iters_per_epoch\"] * config[\"num_warmup_epochs\"], config[\"learning_rate\"]),\n", - " (config[\"num_iters_per_epoch\"] * config[\"num_epochs\"], 0.0),\n", - " ]\n", - " lr_scheduler = PiecewiseLinear(\n", - " optimizer, param_name=\"lr\", milestones_values=milestones_values\n", - " )\n", - " return lr_scheduler" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jjVYZdn49PKD" - }, - "source": [ - "## Trainer" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jp6sWINP9CAq" - }, - "source": [ - "### Save Models\n", - 
"\n", - "We can create checkpoints using either of the two handlers:\n", - "\n", - "1. If specified `with-clearml=True`, we will save the models in ClearML's File Server using [`ClearMLSaver()`](https://pytorch.org/ignite/generated/ignite.contrib.handlers.clearml_logger.html#ignite.contrib.handlers.clearml_logger.ClearMLSaver).\n", - "2. Else save the models to disk using [`DiskSaver()`](https://pytorch.org/ignite/generated/ignite.handlers.DiskSaver.html#ignite.handlers.DiskSaver)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:20.256676Z", - "iopub.status.busy": "2021-09-14T17:01:20.256397Z", - "iopub.status.idle": "2021-09-14T17:01:20.262069Z", - "shell.execute_reply": "2021-09-14T17:01:20.261000Z", - "shell.execute_reply.started": "2021-09-14T17:01:20.256647Z" - }, - "id": "-DG0Pj4pJJFw" - }, - "outputs": [], - "source": [ - "def get_save_handler(config):\n", - " if config[\"with_clearml\"]:\n", - " from ignite.contrib.handlers.clearml_logger import ClearMLSaver\n", - "\n", - " return ClearMLSaver(dirname=config[\"output_path\"])\n", - "\n", - " return DiskSaver(config[\"output_path\"], require_empty=False)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "T1N1iR2f9R0n" - }, - "source": [ - "### Resume from Checkpoint\n", - "\n", - "If a checkpoint file path is provided, we can resume training from there by loading the file." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:20.613959Z", - "iopub.status.busy": "2021-09-14T17:01:20.613673Z", - "iopub.status.idle": "2021-09-14T17:01:20.620012Z", - "shell.execute_reply": "2021-09-14T17:01:20.618881Z", - "shell.execute_reply.started": "2021-09-14T17:01:20.613923Z" - }, - "id": "j9La55z97PVa" - }, - "outputs": [], - "source": [ - "def load_checkpoint(resume_from):\n", - " checkpoint_fp = Path(resume_from)\n", - " assert (\n", - " checkpoint_fp.exists()\n", - " ), f\"Checkpoint '{checkpoint_fp.as_posix()}' is not found\"\n", - " checkpoint = torch.load(checkpoint_fp.as_posix(), map_location=\"cpu\")\n", - " return checkpoint" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "aBd6WDVE_KmO" - }, - "source": [ - "### Create Trainer\n", - "\n", - "Finally, we can create our `trainer` in four steps:\n", - "1. Create a `trainer` object using [`create_supervised_trainer()`](https://pytorch.org/ignite/generated/ignite.engine.create_supervised_trainer.html#ignite.engine.create_supervised_trainer) which internally defines the steps taken to process a single batch:\n", - " 1. Move the batch to `device` used in current distributed configuration.\n", - " 2. Put `model` in `train()` mode.\n", - " 3. Perform forward pass by passing the inputs through the `model` and calculating `loss`. If AMP is enabled then this step happens with [`autocast`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.autocast) on which allows this step to run in mixed precision.\n", - " 4. Perform backward pass. 
If [Automatic Mixed Precision](https://pytorch.org/docs/stable/amp.html) (AMP) is enabled (it speeds up computations on large neural networks and reduces memory usage while retaining accuracy), the loss is [scaled](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.scale) before calling `backward()`, the optimizer `step()` skips the update for batches whose gradients contain NaNs or infs, and [update()](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.update) adjusts the scale for the next iteration.\n",
-    "    5. Store the loss as `batch loss` in `state.output`.\n",
-    "\n",
-    "Internally, the above steps to create the `trainer` would look like:\n",
-    "   ```python\n",
-    "   def train_step(engine, batch):\n",
-    "\n",
-    "       x, y = batch[0], batch[1]\n",
-    "       if x.device != device:\n",
-    "           x = x.to(device, non_blocking=True)\n",
-    "           y = y.to(device, non_blocking=True)\n",
-    "\n",
-    "       model.train()\n",
-    "\n",
-    "       with autocast(enabled=with_amp):\n",
-    "           y_pred = model(x)\n",
-    "           loss = criterion(y_pred, y)\n",
-    "\n",
-    "       optimizer.zero_grad()\n",
-    "       scaler.scale(loss).backward() # If with_amp=False, this is equivalent to loss.backward()\n",
-    "       scaler.step(optimizer) # If with_amp=False, this is equivalent to optimizer.step()\n",
-    "       scaler.update() # If with_amp=False, this step does nothing\n",
-    "\n",
-    "       return {\"batch loss\": loss.item()}\n",
-    "\n",
-    "   trainer = Engine(train_step)\n",
-    "   ```\n",
-    "3. Set up some common Ignite training handlers. You can do this individually or use [setup_common_training_handlers()](https://pytorch.org/ignite/contrib/engines.html#ignite.contrib.engines.common.setup_common_training_handlers), which takes the `trainer` and the training sampler (`train_sampler`) along with:\n",
-    "    * A dictionary (`to_save`) of the objects to save in the checkpoint, and how often to save them (`save_every_iters`).\n",
-    "    * The LR Scheduler\n",
-    "    * The output of `train_step()`\n",
-    "    * Other handlers\n",
-    "4. If a `resume_from` file path is provided, load the states of the objects in `to_save` from the checkpoint file."
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:21.143122Z", - "iopub.status.busy": "2021-09-14T17:01:21.142824Z", - "iopub.status.idle": "2021-09-14T17:01:21.153455Z", - "shell.execute_reply": "2021-09-14T17:01:21.152265Z", - "shell.execute_reply.started": "2021-09-14T17:01:21.143089Z" - }, - "id": "ptmfSvESbEPE" - }, - "outputs": [], - "source": [ - "def create_trainer(\n", - " model, optimizer, criterion, lr_scheduler, train_sampler, config, logger\n", - "):\n", - "\n", - " device = idist.device()\n", - " amp_mode = None\n", - " scaler = False\n", - " \n", - " trainer = create_supervised_trainer(\n", - " model,\n", - " optimizer,\n", - " criterion,\n", - " device=device,\n", - " non_blocking=True,\n", - " output_transform=lambda x, y, y_pred, loss: {\"batch loss\": loss.item()},\n", - " amp_mode=\"amp\" if config[\"with_amp\"] else None,\n", - " scaler=config[\"with_amp\"],\n", - " )\n", - " trainer.logger = logger\n", - "\n", - " to_save = {\n", - " \"trainer\": trainer,\n", - " \"model\": model,\n", - " \"optimizer\": optimizer,\n", - " \"lr_scheduler\": lr_scheduler,\n", - " }\n", - " metric_names = [\n", - " \"batch loss\",\n", - " ]\n", - "\n", - " common.setup_common_training_handlers(\n", - " trainer=trainer,\n", - " train_sampler=train_sampler,\n", - " to_save=to_save,\n", - " save_every_iters=config[\"checkpoint_every\"],\n", - " save_handler=get_save_handler(config),\n", - " lr_scheduler=lr_scheduler,\n", - " output_names=metric_names if config[\"log_every_iters\"] > 0 else None,\n", - " with_pbars=False,\n", - " clear_cuda_cache=False,\n", - " )\n", - "\n", - " if config[\"resume_from\"] is not None:\n", - " checkpoint = load_checkpoint(config[\"resume_from\"])\n", - " Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)\n", - "\n", - " return trainer" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "U5ius6EM9aiG" - }, - "source": [ - "## Evaluator\n", - "\n", - "The evaluator will be created via [`create_supervised_evaluator()`](https://pytorch.org/ignite/generated/ignite.engine.create_supervised_evaluator.html#ignite.engine.create_supervised_evaluator) which internally will:\n", - "1. Set the `model` to `eval()` mode.\n", - "2. Move the batch to `device` used in current distributed configuration.\n", - "3. Perform forward pass. If AMP is enabled, `autocast` will be on.\n", - "4. Store the predictions and labels in `state.output` to compute metrics.\n", - "\n", - "It will also attach the Ignite metrics passed to the `evaluator`. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:21.765740Z", - "iopub.status.busy": "2021-09-14T17:01:21.765110Z", - "iopub.status.idle": "2021-09-14T17:01:21.771498Z", - "shell.execute_reply": "2021-09-14T17:01:21.770611Z", - "shell.execute_reply.started": "2021-09-14T17:01:21.765695Z" - }, - "id": "JPVzU6CqNlE4" - }, - "outputs": [], - "source": [ - "def create_evaluator(model, metrics, config):\n", - " device = idist.device()\n", - "\n", - " amp_mode = \"amp\" if config[\"with_amp\"] else None\n", - " evaluator = create_supervised_evaluator(\n", - " model, metrics=metrics, device=device, non_blocking=True, amp_mode=amp_mode\n", - " )\n", - " \n", - " return evaluator" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "opzRc2DR9dz-" - }, - "source": [ - "## Training\n", - "\n", - "Before we begin training, we must setup a few things on the master process (`rank` = 0):\n", - "* Create folder to store checkpoints, best models and output of tensorboard logging in the format - model_backend_rank_time.\n", - "* If ClearML FileServer is used to save models, then a `Task` has to be created, and we pass our `config` dictionary and the specific hyper parameters that are part of the experiment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:22.274841Z", - "iopub.status.busy": "2021-09-14T17:01:22.273903Z", - "iopub.status.idle": "2021-09-14T17:01:22.283701Z", - "shell.execute_reply": "2021-09-14T17:01:22.283030Z", - "shell.execute_reply.started": "2021-09-14T17:01:22.274775Z" - }, - "id": "3_ahtRZ_i-k8" - }, - "outputs": [], - "source": [ - "def setup_rank_zero(logger, config):\n", - " device = idist.device()\n", - "\n", - " now = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n", - " output_path = config[\"output_path\"]\n", - " folder_name = (\n", - " f\"{config['model']}_backend-{idist.backend()}-{idist.get_world_size()}_{now}\"\n", - " )\n", - " output_path = Path(output_path) / folder_name\n", - " if not output_path.exists():\n", - " output_path.mkdir(parents=True)\n", - " config[\"output_path\"] = output_path.as_posix()\n", - " logger.info(f\"Output path: {config['output_path']}\")\n", - "\n", - " if config[\"with_clearml\"]:\n", - " from clearml import Task\n", - "\n", - " task = Task.init(\"CIFAR10-Training\", task_name=output_path.stem)\n", - " task.connect_configuration(config)\n", - " # Log hyper parameters\n", - " hyper_params = [\n", - " \"model\",\n", - " \"batch_size\",\n", - " \"momentum\",\n", - " \"weight_decay\",\n", - " \"num_epochs\",\n", - " \"learning_rate\",\n", - " \"num_warmup_epochs\",\n", - " ]\n", - " task.connect({k: v for k, v in config.items()})" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KXpew9g3PUKI" - }, - "source": [ - "### Logging\n", - "\n", - "This step is optional, however, we can pass a [`setup_logger()`](https://pytorch.org/ignite/utils.html#ignite.utils.setup_logger) object to `log_basic_info()` and log all basic information such as different versions, current configuration, `device` and `backend` used by the current process (identified by its local rank), and number of processes (world size). 
`idist` (`ignite.distributed`) provides several utility functions like [`get_local_rank()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.get_local_rank), [`backend()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.backend), [`get_world_size()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.get_world_size), etc. to make this possible.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:22.883100Z", - "iopub.status.busy": "2021-09-14T17:01:22.882657Z", - "iopub.status.idle": "2021-09-14T17:01:22.891892Z", - "shell.execute_reply": "2021-09-14T17:01:22.891225Z", - "shell.execute_reply.started": "2021-09-14T17:01:22.883044Z" - }, - "id": "g61dzVvnEVvG" - }, - "outputs": [], - "source": [ - "def log_basic_info(logger, config):\n", - " logger.info(f\"Train on CIFAR10\")\n", - " logger.info(f\"- PyTorch version: {torch.__version__}\")\n", - " logger.info(f\"- Ignite version: {ignite.__version__}\")\n", - " if torch.cuda.is_available():\n", - " # explicitly import cudnn as torch.backends.cudnn can not be pickled with hvd spawning procs\n", - " from torch.backends import cudnn\n", - "\n", - " logger.info(\n", - " f\"- GPU Device: {torch.cuda.get_device_name(idist.get_local_rank())}\"\n", - " )\n", - " logger.info(f\"- CUDA version: {torch.version.cuda}\")\n", - " logger.info(f\"- CUDNN version: {cudnn.version()}\")\n", - "\n", - " logger.info(\"\\n\")\n", - " logger.info(\"Configuration:\")\n", - " for key, value in config.items():\n", - " logger.info(f\"\\t{key}: {value}\")\n", - " logger.info(\"\\n\")\n", - "\n", - " if idist.get_world_size() > 1:\n", - " logger.info(\"\\nDistributed setting:\")\n", - " logger.info(f\"\\tbackend: {idist.backend()}\")\n", - " logger.info(f\"\\tworld size: {idist.get_world_size()}\")\n", - " logger.info(\"\\n\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5QT7tiAcJEk0" - }, - "source": [ - "This is a standard utility function to log `train` and `val` metrics after `validate_every` epochs." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:23.571959Z", - "iopub.status.busy": "2021-09-14T17:01:23.571522Z", - "iopub.status.idle": "2021-09-14T17:01:23.578820Z", - "shell.execute_reply": "2021-09-14T17:01:23.577344Z", - "shell.execute_reply.started": "2021-09-14T17:01:23.571919Z" - }, - "id": "9hqmFjOJN8kK" - }, - "outputs": [], - "source": [ - "def log_metrics(logger, epoch, elapsed, tag, metrics):\n", - " metrics_output = \"\\n\".join([f\"\\t{k}: {v}\" for k, v in metrics.items()])\n", - " logger.info(\n", - " f\"\\nEpoch {epoch} - Evaluation time (seconds): {elapsed:.2f} - {tag} metrics:\\n {metrics_output}\"\n", - " )" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "7fKzNAa1JVRR" - }, - "source": [ - "### Begin Training\n", - "\n", - "This is where the main logic resides, i.e. we will call all the above functions from within here:\n", - "1. Basic Setup\n", - " 1. We set a [`manual_seed()`](https://pytorch.org/ignite/utils.html#ignite.utils.manual_seed) and [`setup_logger()`](https://pytorch.org/ignite/utils.html#ignite.utils.setup_logger), then log all basic information.\n", - " 2. Initialise `dataloaders`, `model`, `optimizer`, `criterion` and `lr_scheduler`.\n", - "2. We use the above objects to create a `trainer`.\n", - "3. Evaluator\n", - " 1. 
Define some relevant Ignite metrics like [`Accuracy()`](https://pytorch.org/ignite/generated/ignite.metrics.Accuracy.html#accuracy) and [`Loss()`](https://pytorch.org/ignite/generated/ignite.metrics.Loss.html#loss).\n", - " 2. Create two evaluators: `train_evaluator` and `val_evaluator` to compute metrics on the `train_dataloader` and `val_dataloader` respectively, however `val_evaluator` will store the best models based on validation metrics.\n", - " 3. Define `run_validation()` to compute metrics on both dataloaders and log them. Then we attach this function to `trainer` to run after `validate_every` epochs and after training is complete.\n", - "4. Setup TensorBoard logging using [`setup_tb_logging()`](https://pytorch.org/ignite/contrib/engines.html#ignite.contrib.engines.common.setup_tb_logging) on the master process for the trainer and evaluators so that training and validation metrics along with the learning rate can be logged.\n", - "5. Define a [`Checkpoint()`](https://pytorch.org/ignite/generated/ignite.handlers.checkpoint.Checkpoint.html#ignite.handlers.checkpoint.Checkpoint) object to store the two best models (`n_saved`) by validation accuracy (defined in `metrics` as `Accuracy()`) and attach it to `val_evaluator` so that it can be executed everytime `val_evaluator` runs.\n", - "6. Try training on `train_loader` for `num_epochs`\n", - "7. Close Tensorboard logger once training is completed.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:24.568735Z", - "iopub.status.busy": "2021-09-14T17:01:24.568265Z", - "iopub.status.idle": "2021-09-14T17:01:24.585423Z", - "shell.execute_reply": "2021-09-14T17:01:24.584535Z", - "shell.execute_reply.started": "2021-09-14T17:01:24.568684Z" - }, - "id": "M3l1U7_vL8pg" - }, - "outputs": [], - "source": [ - "def training(local_rank, config):\n", - "\n", - " rank = idist.get_rank()\n", - " manual_seed(config[\"seed\"] + rank)\n", - "\n", - " logger = setup_logger(name=\"CIFAR10-Training\")\n", - " log_basic_info(logger, config)\n", - "\n", - " if rank == 0:\n", - " setup_rank_zero(logger, config)\n", - "\n", - " train_loader, val_loader = get_dataflow(config)\n", - " model = get_model(config)\n", - " optimizer = get_optimizer(config, model)\n", - " criterion = get_criterion()\n", - " config[\"num_iters_per_epoch\"] = len(train_loader)\n", - " lr_scheduler = get_lr_scheduler(config, optimizer)\n", - "\n", - " trainer = create_trainer(\n", - " model, optimizer, criterion, lr_scheduler, train_loader.sampler, config, logger\n", - " )\n", - "\n", - " metrics = {\n", - " \"Accuracy\": Accuracy(),\n", - " \"Loss\": Loss(criterion),\n", - " }\n", - "\n", - " train_evaluator = create_evaluator(model, metrics, config)\n", - " val_evaluator = create_evaluator(model, metrics, config)\n", - "\n", - " def run_validation(engine):\n", - " epoch = trainer.state.epoch\n", - " state = train_evaluator.run(train_loader)\n", - " log_metrics(logger, epoch, state.times[\"COMPLETED\"], \"train\", state.metrics)\n", - " state = val_evaluator.run(val_loader)\n", - " log_metrics(logger, epoch, state.times[\"COMPLETED\"], \"val\", state.metrics)\n", - "\n", - " trainer.add_event_handler(\n", - " Events.EPOCH_COMPLETED(every=config[\"validate_every\"]) | Events.COMPLETED,\n", - " run_validation,\n", - " )\n", - "\n", - " if rank == 0:\n", - " evaluators = {\"train\": train_evaluator, \"val\": val_evaluator}\n", - " tb_logger = common.setup_tb_logging(\n", - " 
config[\"output_path\"], trainer, optimizer, evaluators=evaluators\n", - " )\n", - "\n", - " best_model_handler = Checkpoint(\n", - " {\"model\": model},\n", - " get_save_handler(config),\n", - " filename_prefix=\"best\",\n", - " n_saved=2,\n", - " global_step_transform=global_step_from_engine(trainer),\n", - " score_name=\"val_accuracy\",\n", - " score_function=Checkpoint.get_default_score_fn(\"Accuracy\"),\n", - " )\n", - " val_evaluator.add_event_handler(\n", - " Events.COMPLETED,\n", - " best_model_handler,\n", - " )\n", - "\n", - " try:\n", - " trainer.run(train_loader, max_epochs=config[\"num_epochs\"])\n", - " except Exception as e:\n", - " logger.exception(\"\")\n", - " raise e\n", - "\n", - " if rank == 0:\n", - " tb_logger.close()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "WioiRM5U9ipQ" - }, - "source": [ - "## Running Distributed Code\n", - "\n", - "We can easily run the above code with the context manager [Parallel](https://pytorch.org/ignite/generated/ignite.distributed.launcher.Parallel.html#ignite.distributed.launcher.Parallel):\n", - "\n", - "```python\n", - "with idist.Parallel(backend=backend, nproc_per_node=nproc_per_node) as parallel:\n", - " parallel.run(training, config)\n", - "```\n", - "`Parallel` enables us to run the same code across all supported distributed backends and non-distributed configurations in a seamless manner. Here backend refers to a distributed communication framework. Read more about which backend to choose [here](https://pytorch.org/docs/stable/distributed.html#backends). `Parallel` accepts a `backend` and either:\n", - "\n", - "> Spawns `nproc_per_node` child processes and initialize a processing group according to provided backend (useful for standalone scripts).\n", - "\n", - "This way uses `torch.multiprocessing.spawn` and is the default way to spawn processes. However, this way is slower due to initialization overhead. \n", - "\n", - "or\n", - "> Only initialize a processing group given the backend (useful with tools like torch.distributed.launch, horovodrun, etc).\n", - "\n", - "This way is recommended since training is faster and easier to extend to multiple scripts.\n", - "\n", - "We can pass additional information to `Parallel` collectively as `spawn_kwargs` as we will see below." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "YJuv6mqHzpVJ" - }, - "source": [ - "**Note:** It is recommended to run distributed code as scripts for ease of use, however we can also spawn processes in a Jupyter notebook (see end of tutorial). The complete code as a script can be found [here](https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py). Choose one of the suggested ways below to run the script." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "8Nxfg7j7deRD" - }, - "source": [ - "## Single Node, One or More GPUs\n", - "\n", - "We will use `fire` to convert `run()` into a CLI, use the arguments parsed inside `run()` directly and begin training in the script:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "QlVEECk_kNs7" - }, - "outputs": [], - "source": [ - "import fire\n", - "\n", - "def run(backend=None, **spawn_kwargs):\n", - " config[\"backend\"] = backend\n", - " \n", - " with idist.Parallel(backend=config[\"backend\"], **spawn_kwargs) as parallel:\n", - " parallel.run(training, config)\n", - "\n", - "if __name__ == \"__main__\":\n", - " fire.Fire({\"run\": run})" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "lLvosFNn8mb8" - }, - "source": [ - "Then we can run the script (e.g. for 2 GPUs) as:" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "RdfXz_0cWZ_9" - }, - "source": [ - "### Run with `torch.distributed.launch` (Recommended)\n", - "\n", - "```\n", - "python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend=\"nccl\"\n", - "```\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "b_k_0YkyX1Yn" - }, - "source": [ - "### Run with internal spawining (`torch.multiprocessing.spawn`)\n", - "\n", - "```\n", - "python -u main.py run --backend=\"nccl\" --nproc_per_node=2\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "osAHgAyJWomh" - }, - "source": [ - "### Run with horovodrun\n", - "\n", - "Please make sure that `backend=horovod`. `np` below is number of processes.\n", - "\n", - "```\n", - "horovodrun -np=2 python -u main.py run --backend=\"horovod\"\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "G-jBxovGdee1" - }, - "source": [ - "## Multiple Nodes, Multiple GPUs\n", - "\n", - "The code inside the script is similar to Single Node, One or More GPUs:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "uhmj6meIMk2w" - }, - "outputs": [], - "source": [ - "import fire\n", - "\n", - "def run(backend=None, **spawn_kwargs):\n", - " config[\"backend\"] = backend\n", - " \n", - " with idist.Parallel(backend=config[\"backend\"], **spawn_kwargs) as parallel:\n", - " parallel.run(training, config)\n", - "\n", - "if __name__ == \"__main__\":\n", - " fire.Fire({\"run\": run})" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_hMbi0-z-E-T" - }, - "source": [ - "The only change is how we run the script. We need to provide the IP address of the master node and its port along with the node rank. 
For example, for 2 nodes (`nnodes`) and 2 GPUs (`nproc_per_node`), we can:" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "rnOI--qIJ0ZN" - }, - "source": [ - "### Run with `torch.distributed.launch` (Recommended)\n", - "\n", - "On node 0 (master node):\n", - "\n", - "```\n", - "python -u -m torch.distributed.launch \\\n", - " --nnodes=2 \\\n", - " --nproc_per_node=2 \\\n", - " --node_rank=0 \\\n", - " --master_addr=master --master_port=2222 --use_env \\\n", - " main.py run --backend=\"nccl\"\n", - "```\n", - "\n", - "On node 1 (worker node):\n", - "\n", - "```\n", - "python -u -m torch.distributed.launch \\\n", - " --nnodes=2 \\\n", - " --nproc_per_node=2 \\\n", - " --node_rank=1 \\\n", - " --master_addr=master --master_port=2222 --use_env \\\n", - " main.py run --backend=\"nccl\"\n", - "```\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "b1Z4JfvJJ6bt" - }, - "source": [ - "### Run with internal spawning\n", - "\n", - "On node 0:\n", - "```\n", - "python -u main.py run\n", - " --nnodes=2 \\\n", - " --nproc_per_node=2 \\\n", - " --node_rank=0 \\\n", - " --master_addr=master --master_port=2222 \\\n", - " --backend=\"nccl\"\n", - "```\n", - "\n", - "On node 1:\n", - "```\n", - "python -u main.py run\n", - " --nnodes=2 \\\n", - " --nproc_per_node=2 \\\n", - " --node_rank=1 \\\n", - " --master_addr=master --master_port=2222 \\\n", - " --backend=\"nccl\"\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "KPUtXHI3KG9w" - }, - "source": [ - "### Run with horovodrun\n", - "\n", - "`np` below is calculated by `nnodes` x `nproc_per_node`.\n", - "\n", - "```\n", - "horovodrun -np 4 -H hostname1:2,hostname2:2 python -u main.py run --backend=\"horovod\"\n", - "```" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "99enig2IdNiF" - }, - "source": [ - "## Single Node, Multiple CPUs\n", - "\n", - "This is similar to Single Node, One or More GPUs. The only difference is while running the script, `backend=gloo` instead of `nccl`." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "g6s0Rn_mdewW" - }, - "source": [ - "## TPUs on Google Colab\n", - "\n", - "Go to Runtime > Change runtime type and select Hardware accelerator = TPU." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-09-14T17:01:35.423328Z", - "iopub.status.busy": "2021-09-14T17:01:35.423010Z", - "iopub.status.idle": "2021-09-14T17:04:10.319065Z", - "shell.execute_reply": "2021-09-14T17:04:10.317535Z", - "shell.execute_reply.started": "2021-09-14T17:01:35.423298Z" - }, - "id": "z6e2NW3kofu8", - "outputId": "fca27701-6133-4949-c2a4-484356e32131" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2021-09-14 17:01:35,425 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'xla-tpu'\n", - "2021-09-14 17:01:35,427 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes: \n", - "\tnproc_per_node: 8\n", - "\tnnodes: 1\n", - "\tnode_rank: 0\n", - "2021-09-14 17:01:35,428 ignite.distributed.launcher.Parallel INFO: Spawn function '' in 8 processes\n", - "2021-09-14 17:01:47,607 CIFAR10-Training INFO: Train on CIFAR10\n", - "2021-09-14 17:01:47,639 CIFAR10-Training INFO: - PyTorch version: 1.8.2+cpu\n", - "2021-09-14 17:01:47,658 CIFAR10-Training INFO: - Ignite version: 0.4.6\n", - "2021-09-14 17:01:47,678 CIFAR10-Training INFO: \n", - "\n", - "2021-09-14 17:01:47,697 CIFAR10-Training INFO: Configuration:\n", - "2021-09-14 17:01:47,721 CIFAR10-Training INFO: \tseed: 543\n", - "2021-09-14 17:01:47,739 CIFAR10-Training INFO: \tdata_path: cifar10\n", - "2021-09-14 17:01:47,765 CIFAR10-Training INFO: \toutput_path: output-cifar10/\n", - "2021-09-14 17:01:47,786 CIFAR10-Training INFO: \tmodel: resnet18\n", - "2021-09-14 17:01:47,810 CIFAR10-Training INFO: \tbatch_size: 512\n", - "2021-09-14 17:01:47,833 CIFAR10-Training INFO: \tmomentum: 0.9\n", - "2021-09-14 17:01:47,854 CIFAR10-Training INFO: \tweight_decay: 0.0001\n", - "2021-09-14 17:01:47,867 CIFAR10-Training INFO: \tnum_workers: 2\n", - "2021-09-14 17:01:47,887 CIFAR10-Training INFO: \tnum_epochs: 5\n", - "2021-09-14 17:01:47,902 CIFAR10-Training INFO: \tlearning_rate: 0.4\n", - "2021-09-14 17:01:47,922 CIFAR10-Training INFO: \tnum_warmup_epochs: 1\n", - "2021-09-14 17:01:47,940 CIFAR10-Training INFO: \tvalidate_every: 3\n", - "2021-09-14 17:01:47,949 CIFAR10-Training INFO: \tcheckpoint_every: 200\n", - "2021-09-14 17:01:47,960 CIFAR10-Training INFO: \tbackend: xla-tpu\n", - "2021-09-14 17:01:47,967 CIFAR10-Training INFO: \tresume_from: None\n", - "2021-09-14 17:01:47,975 CIFAR10-Training INFO: \tlog_every_iters: 15\n", - "2021-09-14 17:01:47,984 CIFAR10-Training INFO: \tnproc_per_node: None\n", - "2021-09-14 17:01:48,003 CIFAR10-Training INFO: \twith_clearml: False\n", - "2021-09-14 17:01:48,019 CIFAR10-Training INFO: \twith_amp: False\n", - "2021-09-14 17:01:48,040 CIFAR10-Training INFO: \n", - "\n", - "2021-09-14 17:01:48,059 CIFAR10-Training INFO: \n", - "Distributed setting:\n", - "2021-09-14 17:01:48,079 CIFAR10-Training INFO: \tbackend: xla-tpu\n", - "2021-09-14 17:01:48,098 CIFAR10-Training INFO: \tworld size: 8\n", - "2021-09-14 17:01:48,109 CIFAR10-Training INFO: \n", - "\n", - "2021-09-14 17:01:48,130 CIFAR10-Training INFO: Output path: output-cifar10/resnet18_backend-xla-tpu-8_20210914-170148\n", - "2021-09-14 17:01:50,917 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10': \n", - "\t{'batch_size': 64, 'num_workers': 2, 'drop_last': True, 'sampler': , 'pin_memory': False}\n", - "2021-09-14 17:01:50,950 ignite.distributed.auto.auto_dataloader INFO: DataLoader is wrapped by `MpDeviceLoader` 
on XLA\n", - "2021-09-14 17:01:50,975 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10': \n", - "\t{'batch_size': 128, 'num_workers': 2, 'sampler': , 'pin_memory': False}\n", - "2021-09-14 17:01:51,000 ignite.distributed.auto.auto_dataloader INFO: DataLoader is wrapped by `MpDeviceLoader` on XLA\n", - "2021-09-14 17:01:53,866 CIFAR10-Training INFO: Engine run starting with max_epochs=5.\n", - "2021-09-14 17:02:23,913 CIFAR10-Training INFO: Epoch[1] Complete. Time taken: 00:00:30\n", - "2021-09-14 17:02:41,945 CIFAR10-Training INFO: Epoch[2] Complete. Time taken: 00:00:18\n", - "2021-09-14 17:03:13,870 CIFAR10-Training INFO: \n", - "Epoch 3 - Evaluation time (seconds): 14.00 - train metrics:\n", - " \tAccuracy: 0.32997744845360827\n", - "\tLoss: 1.7080145767054606\n", - "2021-09-14 17:03:19,283 CIFAR10-Training INFO: \n", - "Epoch 3 - Evaluation time (seconds): 5.39 - val metrics:\n", - " \tAccuracy: 0.3424\n", - "\tLoss: 1.691359375\n", - "2021-09-14 17:03:19,289 CIFAR10-Training INFO: Epoch[3] Complete. Time taken: 00:00:37\n", - "2021-09-14 17:03:37,535 CIFAR10-Training INFO: Epoch[4] Complete. Time taken: 00:00:18\n", - "2021-09-14 17:03:55,927 CIFAR10-Training INFO: Epoch[5] Complete. Time taken: 00:00:18\n", - "2021-09-14 17:04:07,598 CIFAR10-Training INFO: \n", - "Epoch 5 - Evaluation time (seconds): 11.66 - train metrics:\n", - " \tAccuracy: 0.42823775773195877\n", - "\tLoss: 1.4969784451514174\n", - "2021-09-14 17:04:10,190 CIFAR10-Training INFO: \n", - "Epoch 5 - Evaluation time (seconds): 2.56 - val metrics:\n", - " \tAccuracy: 0.4412\n", - "\tLoss: 1.47838994140625\n", - "2021-09-14 17:04:10,244 CIFAR10-Training INFO: Engine run complete. Time taken: 00:02:16\n", - "2021-09-14 17:04:10,313 ignite.distributed.launcher.Parallel INFO: End of run\n" - ] - } - ], - "source": [ - "nproc_per_node = 8\n", - "config[\"backend\"] = \"xla-tpu\"\n", - "\n", - "with idist.Parallel(backend=config[\"backend\"], nproc_per_node=nproc_per_node) as parallel:\n", - " parallel.run(training, config)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FYq3RGY1t9me" - }, - "source": [ - "## Run in Jupyter Notebook\n", - "\n", - "We will have to spawn processes in a notebook and therefore, we will use internal spawning to achieve that. For multiple GPUs, use `backend=nccl` and `backend=gloo` for multiple CPUs." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6DWUEfZHuTDz", - "outputId": "c7110f87-05a0-4c7b-c910-efd72b2606e3" - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2021-09-14 19:15:15,335 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'\n", - "2021-09-14 19:15:15,337 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes: \n", - "\tnproc_per_node: 2\n", - "\tnnodes: 1\n", - "\tnode_rank: 0\n", - "\tstart_method: fork\n", - "2021-09-14 19:15:15,338 ignite.distributed.launcher.Parallel INFO: Spawn function '' in 2 processes\n", - "2021-09-14 19:15:18,910 CIFAR10-Training INFO: Train on CIFAR10\n", - "2021-09-14 19:15:18,911 CIFAR10-Training INFO: - PyTorch version: 1.9.0\n", - "2021-09-14 19:15:18,912 CIFAR10-Training INFO: - Ignite version: 0.4.6\n", - "2021-09-14 19:15:18,913 CIFAR10-Training INFO: - GPU Device: GeForce GTX 1080 Ti\n", - "2021-09-14 19:15:18,913 CIFAR10-Training INFO: - CUDA version: 11.1\n", - "2021-09-14 19:15:18,914 CIFAR10-Training INFO: - CUDNN version: 8005\n", - "2021-09-14 19:15:18,915 CIFAR10-Training INFO: \n", - "\n", - "2021-09-14 19:15:18,916 CIFAR10-Training INFO: Configuration:\n", - "2021-09-14 19:15:18,917 CIFAR10-Training INFO: \tseed: 543\n", - "2021-09-14 19:15:18,918 CIFAR10-Training INFO: \tdata_path: cifar10\n", - "2021-09-14 19:15:18,919 CIFAR10-Training INFO: \toutput_path: output-cifar10/\n", - "2021-09-14 19:15:18,920 CIFAR10-Training INFO: \tmodel: resnet18\n", - "2021-09-14 19:15:18,921 CIFAR10-Training INFO: \tbatch_size: 512\n", - "2021-09-14 19:15:18,922 CIFAR10-Training INFO: \tmomentum: 0.9\n", - "2021-09-14 19:15:18,923 CIFAR10-Training INFO: \tweight_decay: 0.0001\n", - "2021-09-14 19:15:18,924 CIFAR10-Training INFO: \tnum_workers: 2\n", - "2021-09-14 19:15:18,925 CIFAR10-Training INFO: \tnum_epochs: 5\n", - "2021-09-14 19:15:18,925 CIFAR10-Training INFO: \tlearning_rate: 0.4\n", - "2021-09-14 19:15:18,926 CIFAR10-Training INFO: \tnum_warmup_epochs: 1\n", - "2021-09-14 19:15:18,927 CIFAR10-Training INFO: \tvalidate_every: 3\n", - "2021-09-14 19:15:18,928 CIFAR10-Training INFO: \tcheckpoint_every: 200\n", - "2021-09-14 19:15:18,929 CIFAR10-Training INFO: \tbackend: nccl\n", - "2021-09-14 19:15:18,929 CIFAR10-Training INFO: \tresume_from: None\n", - "2021-09-14 19:15:18,930 CIFAR10-Training INFO: \tlog_every_iters: 15\n", - "2021-09-14 19:15:18,931 CIFAR10-Training INFO: \tnproc_per_node: None\n", - "2021-09-14 19:15:18,931 CIFAR10-Training INFO: \twith_clearml: False\n", - "2021-09-14 19:15:18,932 CIFAR10-Training INFO: \twith_amp: False\n", - "2021-09-14 19:15:18,933 CIFAR10-Training INFO: \n", - "\n", - "2021-09-14 19:15:18,933 CIFAR10-Training INFO: \n", - "Distributed setting:\n", - "2021-09-14 19:15:18,934 CIFAR10-Training INFO: \tbackend: nccl\n", - "2021-09-14 19:15:18,935 CIFAR10-Training INFO: \tworld size: 2\n", - "2021-09-14 19:15:18,935 CIFAR10-Training INFO: \n", - "\n", - "2021-09-14 19:15:18,936 CIFAR10-Training INFO: Output path: output-cifar10/resnet18_backend-nccl-2_20210914-191518\n", - "2021-09-14 19:15:19,725 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10': \n", - "\t{'batch_size': 256, 'num_workers': 1, 'drop_last': True, 'sampler': , 'pin_memory': True}\n", - "2021-09-14 19:15:19,727 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10': \n", - "\t{'batch_size': 512, 
'num_workers': 1, 'sampler': , 'pin_memory': True}\n", - "2021-09-14 19:15:19,873 ignite.distributed.auto.auto_model INFO: Apply torch DistributedDataParallel on model, device id: 0\n", - "2021-09-14 19:15:20,049 CIFAR10-Training INFO: Engine run starting with max_epochs=5.\n", - "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448265233/work/c10/core/TensorImpl.h:1156.)\n", - " return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)\n", - "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448265233/work/c10/core/TensorImpl.h:1156.)\n", - " return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)\n", - "2021-09-14 19:15:28,800 CIFAR10-Training INFO: Epoch[1] Complete. Time taken: 00:00:09\n", - "2021-09-14 19:15:37,474 CIFAR10-Training INFO: Epoch[2] Complete. Time taken: 00:00:09\n", - "2021-09-14 19:15:54,675 CIFAR10-Training INFO: \n", - "Epoch 3 - Evaluation time (seconds): 8.50 - train metrics:\n", - " \tAccuracy: 0.5533988402061856\n", - "\tLoss: 1.2227583423103254\n", - "2021-09-14 19:15:56,077 CIFAR10-Training INFO: \n", - "Epoch 3 - Evaluation time (seconds): 1.36 - val metrics:\n", - " \tAccuracy: 0.5699\n", - "\tLoss: 1.1869916015625\n", - "2021-09-14 19:15:56,079 CIFAR10-Training INFO: Epoch[3] Complete. Time taken: 00:00:19\n", - "2021-09-14 19:16:04,686 CIFAR10-Training INFO: Epoch[4] Complete. Time taken: 00:00:09\n", - "2021-09-14 19:16:13,347 CIFAR10-Training INFO: Epoch[5] Complete. Time taken: 00:00:09\n", - "2021-09-14 19:16:21,857 CIFAR10-Training INFO: \n", - "Epoch 5 - Evaluation time (seconds): 8.46 - train metrics:\n", - " \tAccuracy: 0.6584246134020618\n", - "\tLoss: 0.9565292830319748\n", - "2021-09-14 19:16:23,269 CIFAR10-Training INFO: \n", - "Epoch 5 - Evaluation time (seconds): 1.38 - val metrics:\n", - " \tAccuracy: 0.6588\n", - "\tLoss: 0.9517111328125\n", - "2021-09-14 19:16:23,271 CIFAR10-Training INFO: Engine run complete. Time taken: 00:01:03\n", - "2021-09-14 19:16:23,547 ignite.distributed.launcher.Parallel INFO: End of run\n" - ] - } - ], - "source": [ - "spawn_kwargs = {}\n", - "spawn_kwargs[\"start_method\"] = \"fork\"\n", - "spawn_kwargs[\"nproc_per_node\"] = 2\n", - "config[\"backend\"] = \"nccl\"\n", - "\n", - "with idist.Parallel(backend=config[\"backend\"], **spawn_kwargs) as parallel:\n", - " parallel.run(training, config)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hNIC_h9fXeKI" - }, - "source": [ - "## Important Links\n", - "\n", - "1. Complete code can be found [here](https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py).\n", - "2. 
Example of the logs of a ClearML experiment run on this code:\n", - " - [With torch.distributed.launch](https://app.community.clear.ml/projects/14efa0ee4c114401bd06b7748314b465/experiments/83ebffd99a3f47f49dff1075252e3371/output/execution) \n", - " - [With default internal spawning](https://app.community.clear.ml/projects/14efa0ee4c114401bd06b7748314b465/experiments/c2b82ec98e8445f29044c94f7efc8215/output/execution)\n", - " - [On Jupyter](https://app.community.clear.ml/projects/14efa0ee4c114401bd06b7748314b465/experiments/2fedd7447b114b36af7066cdb81fddae/output/execution)\n", - " - [On Colab with XLA](https://app.community.clear.ml/projects/14efa0ee4c114401bd06b7748314b465/experiments/fbffb4d7f9324c57979a833a789df857/output/execution)" - ] + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "_9NEVKMz0v5s" + }, + "source": [ + "\n", + "# Distributed Training with Ignite on CIFAR10 " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fHmvDGFx10HT" + }, + "source": [ + "This tutorial is a brief introduction on how you can do distributed training with Ignite on one or more CPUs, GPUs or TPUs. We will also introduce several helper functions and Ignite concepts (setup common training handlers, save to/ load from checkpoints, etc.) which you can easily incorporate in your code.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "trJ7_a7f17pg" + }, + "source": [ + "We will use distributed training to train a predefined [ResNet18](https://pytorch.org/vision/stable/models.html#torchvision.models.resnet18) on [CIFAR10](https://pytorch.org/vision/stable/datasets.html#torchvision.datasets.CIFAR10) using either of the following configurations:\n", + "\n", + "* Single Node, One or More GPUs\n", + "* Multiple Nodes, Multiple GPUs\n", + "* Single Node, Multiple CPUs\n", + "* TPUs on Google Colab\n", + "* On Jupyter Notebooks\n", + "\n", + "The type of distributed training we will use is called data parallelism in which we:\n", + "\n", + "> 1. Copy the model on each GPU\n", + "> 2. Split the dataset and fit the models on different subsets\n", + "> 3. Communicate the gradients at each iteration to keep the models in sync\n", + ">\n", + "> -- [Distributed Deep Learning 101: Introduction](https://towardsdatascience.com/distributed-deep-learning-101-introduction-ebfc1bcd59d9)\n", + "\n", + "PyTorch provides a [torch.nn.parallel.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) API for this task however the implementation that supports different backends + configurations is tedious. 
In this example, we will see how we can enable data distributed training, which is adaptable to various backends, in just a few lines of code, along with:\n",
+        "* Computing training and validation metrics\n",
+        "* Setting up logging (and connecting with ClearML)\n",
+        "* Saving the best model weights\n",
+        "* Setting the LR scheduler\n",
+        "* Using Automatic Mixed Precision"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "kWLrQ6EH4uoD"
+      },
+      "source": [
+        "## Required Dependencies"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "e7k6WVw5_uts"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install pytorch-ignite"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pcvTTo1s8Huq"
+      },
+      "source": [
+        "### For parsing arguments"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "OIgDDJbS8Fy6"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install fire"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "d0c2e4I4FWoT"
+      },
+      "source": [
+        "### For TPUs"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "Gg74Dc3UFUcG"
+      },
+      "outputs": [],
+      "source": [
+        "VERSION = !curl -s https://api.github.com/repos/pytorch/xla/releases/latest | grep -Po '\"tag_name\": \"v\\K.*?(?=\")'\n",
+        "VERSION = VERSION[0].rsplit('.', 1)[0]  # strip the trailing patch version, e.g. '1.9.0' -> '1.9'\n",
+        "!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-{VERSION}-cp37-cp37m-linux_x86_64.whl"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DMGTfKx7o1zr"
+      },
+      "source": [
+        "### With ClearML (Optional)\n",
+        "\n",
+        "We can enable logging with ClearML to track experiments as follows:\n",
+        "\n",
+        "- Make sure you have a ClearML account: https://app.community.clear.ml/\n",
+        "- Create a credential: Profile > Create new credentials > Copy to clipboard\n",
+        "- Run `clearml-init` and paste the credentials"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "Ds3-t1PqpAO0"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install clearml\n",
+        "!clearml-init"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "i4QuWxYFL-Da"
+      },
+      "source": [
+        "Specify `with_clearml=True` in `config` below and monitor the experiment on the dashboard. Refer to the end of this tutorial to see an example of such an experiment."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "09xlncFZfmeS"
+      },
+      "source": [
+        "## Download Data\n",
+        "\n",
+        "Let's download our data first so that it can later be used by all the processes to instantiate our dataloaders. The following command will download the CIFAR10 dataset to a folder `cifar10`."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "gRpACWtUfyn1",
+        "scrolled": true
+      },
+      "outputs": [],
+      "source": [
+        "!python -c \"from torchvision.datasets import CIFAR10; CIFAR10('cifar10', download=True)\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "D71VkD74he9J"
+      },
+      "source": [
+        "## Common Configuration\n",
+        "\n",
+        "We maintain a `config` dictionary which can be extended or changed to store the parameters required during training. We will refer back to these parameters as we use them later in the tutorial.\n",
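+        "\n",
+        "Some of these entries are also updated or added at run time rather than here; for example, the two lines below appear later in this tutorial:\n",
+        "\n",
+        "```python\n",
+        "config[\"backend\"] = backend  # set from the command line / `Parallel` launcher in run()\n",
+        "config[\"num_iters_per_epoch\"] = len(train_loader)  # added inside training()\n",
+        "```"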
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T16:59:33.061456Z", + "iopub.status.busy": "2021-09-14T16:59:33.061157Z", + "iopub.status.idle": "2021-09-14T16:59:33.069066Z", + "shell.execute_reply": "2021-09-14T16:59:33.068110Z", + "shell.execute_reply.started": "2021-09-14T16:59:33.061424Z" + }, + "id": "9bg7unvqhegL" + }, + "outputs": [], + "source": [ + "config = {\n", + " \"seed\": 543,\n", + " \"data_path\": \"cifar10\",\n", + " \"output_path\": \"output-cifar10/\",\n", + " \"model\": \"resnet18\",\n", + " \"batch_size\": 512,\n", + " \"momentum\": 0.9,\n", + " \"weight_decay\": 1e-4,\n", + " \"num_workers\": 2,\n", + " \"num_epochs\": 5,\n", + " \"learning_rate\": 0.4,\n", + " \"num_warmup_epochs\": 1,\n", + " \"validate_every\": 3,\n", + " \"checkpoint_every\": 200,\n", + " \"backend\": None,\n", + " \"resume_from\": None,\n", + " \"log_every_iters\": 15,\n", + " \"nproc_per_node\": None,\n", + " \"with_clearml\": False,\n", + " \"with_amp\": False,\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DzuG8QAr5Djf" + }, + "source": [ + "## Basic Setup" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bIgzky7Q7kUk" + }, + "source": [ + "### Imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T16:59:34.353096Z", + "iopub.status.busy": "2021-09-14T16:59:34.352425Z", + "iopub.status.idle": "2021-09-14T17:00:15.082104Z", + "shell.execute_reply": "2021-09-14T17:00:15.080743Z", + "shell.execute_reply.started": "2021-09-14T16:59:34.353037Z" + }, + "id": "0RVISbXd_h1F" + }, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "from pathlib import Path\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.optim as optim\n", + "from torchvision import datasets, models\n", + "from torchvision.transforms import (\n", + " Compose,\n", + " Normalize,\n", + " Pad,\n", + " RandomCrop,\n", + " RandomHorizontalFlip,\n", + " ToTensor,\n", + ")\n", + "\n", + "import ignite\n", + "import ignite.distributed as idist\n", + "from ignite.contrib.engines import common\n", + "from ignite.handlers import PiecewiseLinear\n", + "from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator\n", + "from ignite.handlers import Checkpoint, global_step_from_engine\n", + "from ignite.metrics import Accuracy, Loss\n", + "from ignite.utils import manual_seed, setup_logger" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fVXiYInWikTn" + }, + "source": [ + "Next we will take the help of `auto_` methods in `idist` ([`ignite.distributed`](https://pytorch.org/ignite/distributed.html#)) to make our dataloaders, model and optimizer automatically adapt to the current configuration `backend=None` (non-distributed) or for backends like `nccl`, `gloo`, and `xla-tpu` (distributed).\n", + "\n", + "Note that we are free to partially use or not use `auto_` methods at all and instead can implement something custom." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1r83mIS4QCdg" + }, + "source": [ + "### Dataloaders\n", + "\n", + "Next we are going to instantiate the train and test datasets from `data_path`, apply transforms to it and return them via `get_train_test_datasets()`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:17.777175Z", + "iopub.status.busy": "2021-09-14T17:01:17.776871Z", + "iopub.status.idle": "2021-09-14T17:01:17.785864Z", + "shell.execute_reply": "2021-09-14T17:01:17.784669Z", + "shell.execute_reply.started": "2021-09-14T17:01:17.777146Z" + }, + "id": "Y3BKYL7XGpZL" + }, + "outputs": [], + "source": [ + "def get_train_test_datasets(path):\n", + " train_transform = Compose(\n", + " [\n", + " Pad(4),\n", + " RandomCrop(32, fill=128),\n", + " RandomHorizontalFlip(),\n", + " ToTensor(),\n", + " Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),\n", + " ]\n", + " )\n", + " test_transform = Compose(\n", + " [\n", + " ToTensor(),\n", + " Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),\n", + " ]\n", + " )\n", + "\n", + " train_ds = datasets.CIFAR10(\n", + " root=path, train=True, download=False, transform=train_transform\n", + " )\n", + " test_ds = datasets.CIFAR10(\n", + " root=path, train=False, download=False, transform=test_transform\n", + " )\n", + "\n", + " return train_ds, test_ds" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2GkEcAYiRbgO" + }, + "source": [ + "Finally, we pass the datasets to [`auto_dataloader()`](https://pytorch.org/ignite/generated/ignite.distributed.auto.auto_dataloader.html#ignite.distributed.auto.auto_dataloader)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:18.697479Z", + "iopub.status.busy": "2021-09-14T17:01:18.697159Z", + "iopub.status.idle": "2021-09-14T17:01:18.704779Z", + "shell.execute_reply": "2021-09-14T17:01:18.703467Z", + "shell.execute_reply.started": "2021-09-14T17:01:18.697445Z" + }, + "id": "7rNV-UDwRPtO" + }, + "outputs": [], + "source": [ + "def get_dataflow(config):\n", + " train_dataset, test_dataset = get_train_test_datasets(config[\"data_path\"])\n", + "\n", + " train_loader = idist.auto_dataloader(\n", + " train_dataset,\n", + " batch_size=config[\"batch_size\"],\n", + " num_workers=config[\"num_workers\"],\n", + " shuffle=True,\n", + " drop_last=True,\n", + " )\n", + "\n", + " test_loader = idist.auto_dataloader(\n", + " test_dataset,\n", + " batch_size=2 * config[\"batch_size\"],\n", + " num_workers=config[\"num_workers\"],\n", + " shuffle=False,\n", + " )\n", + " return train_loader, test_loader" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iNLvDK-cS2sH" + }, + "source": [ + "### Model\n", + "\n", + "We check if the model given in `config` is present in [torchvision.models](https://pytorch.org/vision/stable/models.html), change the last layer to output 10 classes (as present in CIFAR10) and pass it to [`auto_model()`](https://pytorch.org/ignite/generated/ignite.distributed.auto.auto_model.html#auto-model) which makes it automatically adaptable for non-distributed and distributed configurations.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:18.782444Z", + "iopub.status.busy": "2021-09-14T17:01:18.782146Z", + "iopub.status.idle": "2021-09-14T17:01:18.789078Z", + "shell.execute_reply": "2021-09-14T17:01:18.787728Z", + "shell.execute_reply.started": "2021-09-14T17:01:18.782416Z" + }, + "id": "toShlIcW5oFd" + }, + "outputs": [], + "source": [ + "def get_model(config):\n", + " model_name = config[\"model\"]\n", + " if model_name in models.__dict__:\n", + " fn = 
models.__dict__[model_name]\n", + " else:\n", + " raise RuntimeError(f\"Unknown model name {model_name}\")\n", + "\n", + " model = idist.auto_model(fn(num_classes=10))\n", + "\n", + " return model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V14TfyCT8jQW" + }, + "source": [ + "### Optimizer\n", + "\n", + "Then we can setup the optimizer using hyperameters from `config` and pass it through [`auto_optim()`](https://pytorch.org/ignite/generated/ignite.distributed.auto.auto_optim.html#ignite.distributed.auto.auto_optim)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:18.842622Z", + "iopub.status.busy": "2021-09-14T17:01:18.842285Z", + "iopub.status.idle": "2021-09-14T17:01:18.849476Z", + "shell.execute_reply": "2021-09-14T17:01:18.848516Z", + "shell.execute_reply.started": "2021-09-14T17:01:18.842592Z" + }, + "id": "Iddv29eh5qU9" + }, + "outputs": [], + "source": [ + "def get_optimizer(config, model):\n", + " optimizer = optim.SGD(\n", + " model.parameters(),\n", + " lr=config[\"learning_rate\"],\n", + " momentum=config[\"momentum\"],\n", + " weight_decay=config[\"weight_decay\"],\n", + " nesterov=True,\n", + " )\n", + " optimizer = idist.auto_optim(optimizer)\n", + "\n", + " return optimizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VI0o7hgr8l9q" + }, + "source": [ + "### Criterion\n", + "\n", + "We put the loss function on `device`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:19.324780Z", + "iopub.status.busy": "2021-09-14T17:01:19.324452Z", + "iopub.status.idle": "2021-09-14T17:01:19.329773Z", + "shell.execute_reply": "2021-09-14T17:01:19.328546Z", + "shell.execute_reply.started": "2021-09-14T17:01:19.324748Z" + }, + "id": "DVDKkYqS5siE" + }, + "outputs": [], + "source": [ + "def get_criterion():\n", + " return nn.CrossEntropyLoss().to(idist.device())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9No0Ockx8oRC" + }, + "source": [ + "### LR Scheduler\n", + "\n", + "We will use [PiecewiseLinear](https://pytorch.org/ignite/generated/ignite.handlers.param_scheduler.PiecewiseLinear.html#ignite.handlers.param_scheduler.PiecewiseLinear) which is one of the [various LR Schedulers](https://pytorch.org/ignite/handlers.html#parameter-scheduler) Ignite provides.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:19.737913Z", + "iopub.status.busy": "2021-09-14T17:01:19.737620Z", + "iopub.status.idle": "2021-09-14T17:01:19.746210Z", + "shell.execute_reply": "2021-09-14T17:01:19.744390Z", + "shell.execute_reply.started": "2021-09-14T17:01:19.737884Z" + }, + "id": "UcYuVeYic_e7" + }, + "outputs": [], + "source": [ + "def get_lr_scheduler(config, optimizer):\n", + " milestones_values = [\n", + " (0, 0.0),\n", + " (config[\"num_iters_per_epoch\"] * config[\"num_warmup_epochs\"], config[\"learning_rate\"]),\n", + " (config[\"num_iters_per_epoch\"] * config[\"num_epochs\"], 0.0),\n", + " ]\n", + " lr_scheduler = PiecewiseLinear(\n", + " optimizer, param_name=\"lr\", milestones_values=milestones_values\n", + " )\n", + " return lr_scheduler" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jjVYZdn49PKD" + }, + "source": [ + "## Trainer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jp6sWINP9CAq" + }, + "source": [ + "### Save Models\n", + 
"\n", + "We can create checkpoints using either a handler (in case of ClearML) or by simply passing the path of the checkpoint file to `save_handler`:\n", + "If specified `with-clearml=True`, we will save the models in ClearML's File Server using [`ClearMLSaver()`](https://pytorch.org/ignite/generated/ignite.contrib.handlers.clearml_logger.html#ignite.contrib.handlers.clearml_logger.ClearMLSaver)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:20.256676Z", + "iopub.status.busy": "2021-09-14T17:01:20.256397Z", + "iopub.status.idle": "2021-09-14T17:01:20.262069Z", + "shell.execute_reply": "2021-09-14T17:01:20.261000Z", + "shell.execute_reply.started": "2021-09-14T17:01:20.256647Z" + }, + "id": "-DG0Pj4pJJFw" + }, + "outputs": [], + "source": [ + "def get_save_handler(config):\n", + " if config[\"with_clearml\"]:\n", + " from ignite.contrib.handlers.clearml_logger import ClearMLSaver\n", + "\n", + " return ClearMLSaver(dirname=config[\"output_path\"])\n", + "\n", + " return config[\"output_path\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T1N1iR2f9R0n" + }, + "source": [ + "### Resume from Checkpoint\n", + "\n", + "If a checkpoint file path is provided, we can resume training from there by loading the file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:20.613959Z", + "iopub.status.busy": "2021-09-14T17:01:20.613673Z", + "iopub.status.idle": "2021-09-14T17:01:20.620012Z", + "shell.execute_reply": "2021-09-14T17:01:20.618881Z", + "shell.execute_reply.started": "2021-09-14T17:01:20.613923Z" + }, + "id": "j9La55z97PVa" + }, + "outputs": [], + "source": [ + "def load_checkpoint(resume_from):\n", + " checkpoint_fp = Path(resume_from)\n", + " assert (\n", + " checkpoint_fp.exists()\n", + " ), f\"Checkpoint '{checkpoint_fp.as_posix()}' is not found\"\n", + " checkpoint = torch.load(checkpoint_fp.as_posix(), map_location=\"cpu\")\n", + " return checkpoint" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aBd6WDVE_KmO" + }, + "source": [ + "### Create Trainer\n", + "\n", + "Finally, we can create our `trainer` in four steps:\n", + "1. Create a `trainer` object using [`create_supervised_trainer()`](https://pytorch.org/ignite/generated/ignite.engine.create_supervised_trainer.html#ignite.engine.create_supervised_trainer) which internally defines the steps taken to process a single batch:\n", + " 1. Move the batch to `device` used in current distributed configuration.\n", + " 2. Put `model` in `train()` mode.\n", + " 3. Perform forward pass by passing the inputs through the `model` and calculating `loss`. If AMP is enabled then this step happens with [`autocast`](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.autocast) on which allows this step to run in mixed precision.\n", + " 4. Perform backward pass. If [Automatic Mixed Precision](https://pytorch.org/docs/stable/amp.html) (AMP) is enabled (speeds up computations on large neural networks and reduces memory usage while retaining performance), then the losses will be [scaled](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.scale) before calling `backward()`, `step()` the optimizer while discarding batches that contain NaNs and [update()](https://pytorch.org/docs/stable/amp.html#torch.cuda.amp.GradScaler.update) the scale for the next iteration.\n", + " 5. 
Store the loss as `batch loss` in `state.output`.\n", + "\n", + "Internally, the above steps to create the `trainer` would look like:\n", + " ```python\n", + " def train_step(engine, batch):\n", + "\n", + " x, y = batch[0], batch[1]\n", + " if x.device != device:\n", + " x = x.to(device, non_blocking=True)\n", + " y = y.to(device, non_blocking=True)\n", + "\n", + " model.train()\n", + "\n", + " with autocast(enabled=with_amp):\n", + " y_pred = model(x)\n", + " loss = criterion(y_pred, y)\n", + "\n", + " optimizer.zero_grad()\n", + " scaler.scale(loss).backward() # If with_amp=False, this is equivalent to loss.backward()\n", + " scaler.step(optimizer) # If with_amp=False, this is equivalent to optimizer.step()\n", + " scaler.update() # If with_amp=False, this step does nothing\n", + "\n", + " return {\"batch loss\": loss.item()}\n", + "\n", + " trainer = Engine(train_step)\n", + " ```\n", + "3. Setup some common Ignite training handlers. You can do this individually or use [setup_common_training_handlers()](https://pytorch.org/ignite/contrib/engines.html#ignite.contrib.engines.common.setup_common_training_handlers) that takes the `trainer` and a subset of the dataset (`train_sampler`) alongwith:\n", + " * A dictionary mapping on what to save in the checkpoint (`to_save`) and how often (`save_every_iters`).\n", + " * The LR Scheduler\n", + " * The output of `train_step()`\n", + " * Other handlers\n", + "4. If `resume_from` file path is provided, load the states of objects `to_save` from the checkpoint file." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:21.143122Z", + "iopub.status.busy": "2021-09-14T17:01:21.142824Z", + "iopub.status.idle": "2021-09-14T17:01:21.153455Z", + "shell.execute_reply": "2021-09-14T17:01:21.152265Z", + "shell.execute_reply.started": "2021-09-14T17:01:21.143089Z" + }, + "id": "ptmfSvESbEPE" + }, + "outputs": [], + "source": [ + "def create_trainer(\n", + " model, optimizer, criterion, lr_scheduler, train_sampler, config, logger\n", + "):\n", + "\n", + " device = idist.device()\n", + " amp_mode = None\n", + " scaler = False\n", + " \n", + " trainer = create_supervised_trainer(\n", + " model,\n", + " optimizer,\n", + " criterion,\n", + " device=device,\n", + " non_blocking=True,\n", + " output_transform=lambda x, y, y_pred, loss: {\"batch loss\": loss.item()},\n", + " amp_mode=\"amp\" if config[\"with_amp\"] else None,\n", + " scaler=config[\"with_amp\"],\n", + " )\n", + " trainer.logger = logger\n", + "\n", + " to_save = {\n", + " \"trainer\": trainer,\n", + " \"model\": model,\n", + " \"optimizer\": optimizer,\n", + " \"lr_scheduler\": lr_scheduler,\n", + " }\n", + " metric_names = [\n", + " \"batch loss\",\n", + " ]\n", + "\n", + " common.setup_common_training_handlers(\n", + " trainer=trainer,\n", + " train_sampler=train_sampler,\n", + " to_save=to_save,\n", + " save_every_iters=config[\"checkpoint_every\"],\n", + " save_handler=get_save_handler(config),\n", + " lr_scheduler=lr_scheduler,\n", + " output_names=metric_names if config[\"log_every_iters\"] > 0 else None,\n", + " with_pbars=False,\n", + " clear_cuda_cache=False,\n", + " )\n", + "\n", + " if config[\"resume_from\"] is not None:\n", + " checkpoint = load_checkpoint(config[\"resume_from\"])\n", + " Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)\n", + "\n", + " return trainer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U5ius6EM9aiG" + }, + "source": [ + "## Evaluator\n", + 
"\n", + "The evaluator will be created via [`create_supervised_evaluator()`](https://pytorch.org/ignite/generated/ignite.engine.create_supervised_evaluator.html#ignite.engine.create_supervised_evaluator) which internally will:\n", + "1. Set the `model` to `eval()` mode.\n", + "2. Move the batch to `device` used in current distributed configuration.\n", + "3. Perform forward pass. If AMP is enabled, `autocast` will be on.\n", + "4. Store the predictions and labels in `state.output` to compute metrics.\n", + "\n", + "It will also attach the Ignite metrics passed to the `evaluator`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:21.765740Z", + "iopub.status.busy": "2021-09-14T17:01:21.765110Z", + "iopub.status.idle": "2021-09-14T17:01:21.771498Z", + "shell.execute_reply": "2021-09-14T17:01:21.770611Z", + "shell.execute_reply.started": "2021-09-14T17:01:21.765695Z" + }, + "id": "JPVzU6CqNlE4" + }, + "outputs": [], + "source": [ + "def create_evaluator(model, metrics, config):\n", + " device = idist.device()\n", + "\n", + " amp_mode = \"amp\" if config[\"with_amp\"] else None\n", + " evaluator = create_supervised_evaluator(\n", + " model, metrics=metrics, device=device, non_blocking=True, amp_mode=amp_mode\n", + " )\n", + " \n", + " return evaluator" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "opzRc2DR9dz-" + }, + "source": [ + "## Training\n", + "\n", + "Before we begin training, we must setup a few things on the master process (`rank` = 0):\n", + "* Create folder to store checkpoints, best models and output of tensorboard logging in the format - model_backend_rank_time.\n", + "* If ClearML FileServer is used to save models, then a `Task` has to be created, and we pass our `config` dictionary and the specific hyper parameters that are part of the experiment." 
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2021-09-14T17:01:22.274841Z",
+          "iopub.status.busy": "2021-09-14T17:01:22.273903Z",
+          "iopub.status.idle": "2021-09-14T17:01:22.283701Z",
+          "shell.execute_reply": "2021-09-14T17:01:22.283030Z",
+          "shell.execute_reply.started": "2021-09-14T17:01:22.274775Z"
+        },
+        "id": "3_ahtRZ_i-k8"
+      },
+      "outputs": [],
+      "source": [
+        "def setup_rank_zero(logger, config):\n",
+        "    device = idist.device()\n",
+        "\n",
+        "    now = datetime.now().strftime(\"%Y%m%d-%H%M%S\")\n",
+        "    output_path = config[\"output_path\"]\n",
+        "    folder_name = (\n",
+        "        f\"{config['model']}_backend-{idist.backend()}-{idist.get_world_size()}_{now}\"\n",
+        "    )\n",
+        "    output_path = Path(output_path) / folder_name\n",
+        "    if not output_path.exists():\n",
+        "        output_path.mkdir(parents=True)\n",
+        "    config[\"output_path\"] = output_path.as_posix()\n",
+        "    logger.info(f\"Output path: {config['output_path']}\")\n",
+        "\n",
+        "    if config[\"with_clearml\"]:\n",
+        "        from clearml import Task\n",
+        "\n",
+        "        task = Task.init(\"CIFAR10-Training\", task_name=output_path.stem)\n",
+        "        task.connect_configuration(config)\n",
+        "        # Log only the listed hyperparameters of the experiment\n",
+        "        hyper_params = [\n",
+        "            \"model\",\n",
+        "            \"batch_size\",\n",
+        "            \"momentum\",\n",
+        "            \"weight_decay\",\n",
+        "            \"num_epochs\",\n",
+        "            \"learning_rate\",\n",
+        "            \"num_warmup_epochs\",\n",
+        "        ]\n",
+        "        task.connect({k: v for k, v in config.items() if k in hyper_params})"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "KXpew9g3PUKI"
+      },
+      "source": [
+        "### Logging\n",
+        "\n",
+        "This step is optional; however, we can pass a [`setup_logger()`](https://pytorch.org/ignite/utils.html#ignite.utils.setup_logger) object to `log_basic_info()` and log all basic information such as different versions, current configuration, `device` and `backend` used by the current process (identified by its local rank), and the number of processes (world size). `idist` (`ignite.distributed`) provides several utility functions like [`get_local_rank()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.get_local_rank), [`backend()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.backend), [`get_world_size()`](https://pytorch.org/ignite/distributed.html#ignite.distributed.utils.get_world_size), etc. 
to make this possible.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:22.883100Z", + "iopub.status.busy": "2021-09-14T17:01:22.882657Z", + "iopub.status.idle": "2021-09-14T17:01:22.891892Z", + "shell.execute_reply": "2021-09-14T17:01:22.891225Z", + "shell.execute_reply.started": "2021-09-14T17:01:22.883044Z" + }, + "id": "g61dzVvnEVvG" + }, + "outputs": [], + "source": [ + "def log_basic_info(logger, config):\n", + " logger.info(f\"Train on CIFAR10\")\n", + " logger.info(f\"- PyTorch version: {torch.__version__}\")\n", + " logger.info(f\"- Ignite version: {ignite.__version__}\")\n", + " if torch.cuda.is_available():\n", + " # explicitly import cudnn as torch.backends.cudnn can not be pickled with hvd spawning procs\n", + " from torch.backends import cudnn\n", + "\n", + " logger.info(\n", + " f\"- GPU Device: {torch.cuda.get_device_name(idist.get_local_rank())}\"\n", + " )\n", + " logger.info(f\"- CUDA version: {torch.version.cuda}\")\n", + " logger.info(f\"- CUDNN version: {cudnn.version()}\")\n", + "\n", + " logger.info(\"\\n\")\n", + " logger.info(\"Configuration:\")\n", + " for key, value in config.items():\n", + " logger.info(f\"\\t{key}: {value}\")\n", + " logger.info(\"\\n\")\n", + "\n", + " if idist.get_world_size() > 1:\n", + " logger.info(\"\\nDistributed setting:\")\n", + " logger.info(f\"\\tbackend: {idist.backend()}\")\n", + " logger.info(f\"\\tworld size: {idist.get_world_size()}\")\n", + " logger.info(\"\\n\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5QT7tiAcJEk0" + }, + "source": [ + "This is a standard utility function to log `train` and `val` metrics after `validate_every` epochs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:23.571959Z", + "iopub.status.busy": "2021-09-14T17:01:23.571522Z", + "iopub.status.idle": "2021-09-14T17:01:23.578820Z", + "shell.execute_reply": "2021-09-14T17:01:23.577344Z", + "shell.execute_reply.started": "2021-09-14T17:01:23.571919Z" + }, + "id": "9hqmFjOJN8kK" + }, + "outputs": [], + "source": [ + "def log_metrics(logger, epoch, elapsed, tag, metrics):\n", + " metrics_output = \"\\n\".join([f\"\\t{k}: {v}\" for k, v in metrics.items()])\n", + " logger.info(\n", + " f\"\\nEpoch {epoch} - Evaluation time (seconds): {elapsed:.2f} - {tag} metrics:\\n {metrics_output}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7fKzNAa1JVRR" + }, + "source": [ + "### Begin Training\n", + "\n", + "This is where the main logic resides, i.e. we will call all the above functions from within here:\n", + "1. Basic Setup\n", + " 1. We set a [`manual_seed()`](https://pytorch.org/ignite/utils.html#ignite.utils.manual_seed) and [`setup_logger()`](https://pytorch.org/ignite/utils.html#ignite.utils.setup_logger), then log all basic information.\n", + " 2. Initialise `dataloaders`, `model`, `optimizer`, `criterion` and `lr_scheduler`.\n", + "2. We use the above objects to create a `trainer`.\n", + "3. Evaluator\n", + " 1. Define some relevant Ignite metrics like [`Accuracy()`](https://pytorch.org/ignite/generated/ignite.metrics.Accuracy.html#accuracy) and [`Loss()`](https://pytorch.org/ignite/generated/ignite.metrics.Loss.html#loss).\n", + " 2. 
Create two evaluators: `train_evaluator` and `val_evaluator` to compute metrics on the `train_dataloader` and `val_dataloader` respectively, however `val_evaluator` will store the best models based on validation metrics.\n", + " 3. Define `run_validation()` to compute metrics on both dataloaders and log them. Then we attach this function to `trainer` to run after `validate_every` epochs and after training is complete.\n", + "4. Setup TensorBoard logging using [`setup_tb_logging()`](https://pytorch.org/ignite/contrib/engines.html#ignite.contrib.engines.common.setup_tb_logging) on the master process for the trainer and evaluators so that training and validation metrics along with the learning rate can be logged.\n", + "5. Define a [`Checkpoint()`](https://pytorch.org/ignite/generated/ignite.handlers.checkpoint.Checkpoint.html#ignite.handlers.checkpoint.Checkpoint) object to store the two best models (`n_saved`) by validation accuracy (defined in `metrics` as `Accuracy()`) and attach it to `val_evaluator` so that it can be executed everytime `val_evaluator` runs.\n", + "6. Try training on `train_loader` for `num_epochs`\n", + "7. Close Tensorboard logger once training is completed.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:24.568735Z", + "iopub.status.busy": "2021-09-14T17:01:24.568265Z", + "iopub.status.idle": "2021-09-14T17:01:24.585423Z", + "shell.execute_reply": "2021-09-14T17:01:24.584535Z", + "shell.execute_reply.started": "2021-09-14T17:01:24.568684Z" + }, + "id": "M3l1U7_vL8pg" + }, + "outputs": [], + "source": [ + "def training(local_rank, config):\n", + "\n", + " rank = idist.get_rank()\n", + " manual_seed(config[\"seed\"] + rank)\n", + "\n", + " logger = setup_logger(name=\"CIFAR10-Training\")\n", + " log_basic_info(logger, config)\n", + "\n", + " if rank == 0:\n", + " setup_rank_zero(logger, config)\n", + "\n", + " train_loader, val_loader = get_dataflow(config)\n", + " model = get_model(config)\n", + " optimizer = get_optimizer(config, model)\n", + " criterion = get_criterion()\n", + " config[\"num_iters_per_epoch\"] = len(train_loader)\n", + " lr_scheduler = get_lr_scheduler(config, optimizer)\n", + "\n", + " trainer = create_trainer(\n", + " model, optimizer, criterion, lr_scheduler, train_loader.sampler, config, logger\n", + " )\n", + "\n", + " metrics = {\n", + " \"Accuracy\": Accuracy(),\n", + " \"Loss\": Loss(criterion),\n", + " }\n", + "\n", + " train_evaluator = create_evaluator(model, metrics, config)\n", + " val_evaluator = create_evaluator(model, metrics, config)\n", + "\n", + " def run_validation(engine):\n", + " epoch = trainer.state.epoch\n", + " state = train_evaluator.run(train_loader)\n", + " log_metrics(logger, epoch, state.times[\"COMPLETED\"], \"train\", state.metrics)\n", + " state = val_evaluator.run(val_loader)\n", + " log_metrics(logger, epoch, state.times[\"COMPLETED\"], \"val\", state.metrics)\n", + "\n", + " trainer.add_event_handler(\n", + " Events.EPOCH_COMPLETED(every=config[\"validate_every\"]) | Events.COMPLETED,\n", + " run_validation,\n", + " )\n", + "\n", + " if rank == 0:\n", + " evaluators = {\"train\": train_evaluator, \"val\": val_evaluator}\n", + " tb_logger = common.setup_tb_logging(\n", + " config[\"output_path\"], trainer, optimizer, evaluators=evaluators\n", + " )\n", + "\n", + " best_model_handler = Checkpoint(\n", + " {\"model\": model},\n", + " get_save_handler(config),\n", + " filename_prefix=\"best\",\n", + " n_saved=2,\n", + 
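"            # The checkpoints are ranked by the val_evaluator's Accuracy metric (score_function),\n",
+            "            # and global_step_transform stamps the saved files with the trainer's epoch number.\n",
+            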
" global_step_transform=global_step_from_engine(trainer),\n", + " score_name=\"val_accuracy\",\n", + " score_function=Checkpoint.get_default_score_fn(\"Accuracy\"),\n", + " )\n", + " val_evaluator.add_event_handler(\n", + " Events.COMPLETED,\n", + " best_model_handler,\n", + " )\n", + "\n", + " try:\n", + " trainer.run(train_loader, max_epochs=config[\"num_epochs\"])\n", + " except Exception as e:\n", + " logger.exception(\"\")\n", + " raise e\n", + "\n", + " if rank == 0:\n", + " tb_logger.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WioiRM5U9ipQ" + }, + "source": [ + "## Running Distributed Code\n", + "\n", + "We can easily run the above code with the context manager [Parallel](https://pytorch.org/ignite/generated/ignite.distributed.launcher.Parallel.html#ignite.distributed.launcher.Parallel):\n", + "\n", + "```python\n", + "with idist.Parallel(backend=backend, nproc_per_node=nproc_per_node) as parallel:\n", + " parallel.run(training, config)\n", + "```\n", + "`Parallel` enables us to run the same code across all supported distributed backends and non-distributed configurations in a seamless manner. Here backend refers to a distributed communication framework. Read more about which backend to choose [here](https://pytorch.org/docs/stable/distributed.html#backends). `Parallel` accepts a `backend` and either:\n", + "\n", + "> Spawns `nproc_per_node` child processes and initialize a processing group according to provided backend (useful for standalone scripts).\n", + "\n", + "This way uses `torch.multiprocessing.spawn` and is the default way to spawn processes. However, this way is slower due to initialization overhead. \n", + "\n", + "or\n", + "> Only initialize a processing group given the backend (useful with tools like torch.distributed.launch, horovodrun, etc).\n", + "\n", + "This way is recommended since training is faster and easier to extend to multiple scripts.\n", + "\n", + "We can pass additional information to `Parallel` collectively as `spawn_kwargs` as we will see below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YJuv6mqHzpVJ" + }, + "source": [ + "**Note:** It is recommended to run distributed code as scripts for ease of use, however we can also spawn processes in a Jupyter notebook (see end of tutorial). The complete code as a script can be found [here](https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py). Choose one of the suggested ways below to run the script." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Nxfg7j7deRD" + }, + "source": [ + "## Single Node, One or More GPUs\n", + "\n", + "We will use `fire` to convert `run()` into a CLI, use the arguments parsed inside `run()` directly and begin training in the script:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QlVEECk_kNs7" + }, + "outputs": [], + "source": [ + "import fire\n", + "\n", + "def run(backend=None, **spawn_kwargs):\n", + " config[\"backend\"] = backend\n", + " \n", + " with idist.Parallel(backend=config[\"backend\"], **spawn_kwargs) as parallel:\n", + " parallel.run(training, config)\n", + "\n", + "if __name__ == \"__main__\":\n", + " fire.Fire({\"run\": run})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lLvosFNn8mb8" + }, + "source": [ + "Then we can run the script (e.g. 
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "RdfXz_0cWZ_9"
+   },
+   "source": [
+    "### Run with `torch.distributed.launch` (Recommended)\n",
+    "\n",
+    "```\n",
+    "python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend=\"nccl\"\n",
+    "```\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "b_k_0YkyX1Yn"
+   },
+   "source": [
+    "### Run with internal spawning (`torch.multiprocessing.spawn`)\n",
+    "\n",
+    "```\n",
+    "python -u main.py run --backend=\"nccl\" --nproc_per_node=2\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "osAHgAyJWomh"
+   },
+   "source": [
+    "### Run with horovodrun\n",
+    "\n",
+    "Please make sure that `backend=horovod`. `np` below is the number of processes.\n",
+    "\n",
+    "```\n",
+    "horovodrun -np=2 python -u main.py run --backend=\"horovod\"\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "G-jBxovGdee1"
+   },
+   "source": [
+    "## Multiple Nodes, Multiple GPUs\n",
+    "\n",
+    "The code inside the script is the same as for Single Node, One or More GPUs:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "uhmj6meIMk2w"
+   },
+   "outputs": [],
+   "source": [
+    "import fire\n",
+    "\n",
+    "def run(backend=None, **spawn_kwargs):\n",
+    "    config[\"backend\"] = backend\n",
+    "\n",
+    "    with idist.Parallel(backend=config[\"backend\"], **spawn_kwargs) as parallel:\n",
+    "        parallel.run(training, config)\n",
+    "\n",
+    "if __name__ == \"__main__\":\n",
+    "    fire.Fire({\"run\": run})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "_hMbi0-z-E-T"
+   },
+   "source": [
+    "The only change is how we run the script. We need to provide the IP address of the master node and its port, along with the node rank. For example, for 2 nodes (`nnodes`) with 2 GPUs each (`nproc_per_node`), we can run:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "rnOI--qIJ0ZN"
+   },
+   "source": [
+    "### Run with `torch.distributed.launch` (Recommended)\n",
+    "\n",
+    "On node 0 (master node):\n",
+    "\n",
+    "```\n",
+    "python -u -m torch.distributed.launch \\\n",
+    "    --nnodes=2 \\\n",
+    "    --nproc_per_node=2 \\\n",
+    "    --node_rank=0 \\\n",
+    "    --master_addr=master --master_port=2222 --use_env \\\n",
+    "    main.py run --backend=\"nccl\"\n",
+    "```\n",
+    "\n",
+    "On node 1 (worker node):\n",
+    "\n",
+    "```\n",
+    "python -u -m torch.distributed.launch \\\n",
+    "    --nnodes=2 \\\n",
+    "    --nproc_per_node=2 \\\n",
+    "    --node_rank=1 \\\n",
+    "    --master_addr=master --master_port=2222 --use_env \\\n",
+    "    main.py run --backend=\"nccl\"\n",
+    "```\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "b1Z4JfvJJ6bt"
+   },
+   "source": [
+    "### Run with internal spawning\n",
+    "\n",
+    "On node 0:\n",
+    "\n",
+    "```\n",
+    "python -u main.py run \\\n",
+    "    --nnodes=2 \\\n",
+    "    --nproc_per_node=2 \\\n",
+    "    --node_rank=0 \\\n",
+    "    --master_addr=master --master_port=2222 \\\n",
+    "    --backend=\"nccl\"\n",
+    "```\n",
+    "\n",
+    "On node 1:\n",
+    "\n",
+    "```\n",
+    "python -u main.py run \\\n",
+    "    --nnodes=2 \\\n",
+    "    --nproc_per_node=2 \\\n",
+    "    --node_rank=1 \\\n",
+    "    --master_addr=master --master_port=2222 \\\n",
+    "    --backend=\"nccl\"\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "KPUtXHI3KG9w"
+   },
+   "source": [
+    "### Run with horovodrun\n",
+    "\n",
+    "`np` below is calculated as `nnodes` x `nproc_per_node`.\n",
+    "\n",
+    "```\n",
+    "horovodrun -np 4 -H hostname1:2,hostname2:2 python -u main.py run --backend=\"horovod\"\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "99enig2IdNiF"
+   },
+   "source": [
+    "## Single Node, Multiple CPUs\n",
+    "\n",
+    "This is similar to Single Node, One or More GPUs. The only difference is that we pass `--backend=\"gloo\"` instead of `--backend=\"nccl\"` when running the script."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "g6s0Rn_mdewW"
+   },
+   "source": [
+    "## TPUs on Google Colab\n",
+    "\n",
+    "Go to Runtime > Change runtime type and select Hardware accelerator = TPU."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-09-14T17:01:35.423328Z", + "iopub.status.busy": "2021-09-14T17:01:35.423010Z", + "iopub.status.idle": "2021-09-14T17:04:10.319065Z", + "shell.execute_reply": "2021-09-14T17:04:10.317535Z", + "shell.execute_reply.started": "2021-09-14T17:01:35.423298Z" + }, + "id": "z6e2NW3kofu8", + "outputId": "fca27701-6133-4949-c2a4-484356e32131" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2021-09-14 17:01:35,425 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'xla-tpu'\n", + "2021-09-14 17:01:35,427 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes: \n", + "\tnproc_per_node: 8\n", + "\tnnodes: 1\n", + "\tnode_rank: 0\n", + "2021-09-14 17:01:35,428 ignite.distributed.launcher.Parallel INFO: Spawn function '' in 8 processes\n", + "2021-09-14 17:01:47,607 CIFAR10-Training INFO: Train on CIFAR10\n", + "2021-09-14 17:01:47,639 CIFAR10-Training INFO: - PyTorch version: 1.8.2+cpu\n", + "2021-09-14 17:01:47,658 CIFAR10-Training INFO: - Ignite version: 0.4.6\n", + "2021-09-14 17:01:47,678 CIFAR10-Training INFO: \n", + "\n", + "2021-09-14 17:01:47,697 CIFAR10-Training INFO: Configuration:\n", + "2021-09-14 17:01:47,721 CIFAR10-Training INFO: \tseed: 543\n", + "2021-09-14 17:01:47,739 CIFAR10-Training INFO: \tdata_path: cifar10\n", + "2021-09-14 17:01:47,765 CIFAR10-Training INFO: \toutput_path: output-cifar10/\n", + "2021-09-14 17:01:47,786 CIFAR10-Training INFO: \tmodel: resnet18\n", + "2021-09-14 17:01:47,810 CIFAR10-Training INFO: \tbatch_size: 512\n", + "2021-09-14 17:01:47,833 CIFAR10-Training INFO: \tmomentum: 0.9\n", + "2021-09-14 17:01:47,854 CIFAR10-Training INFO: \tweight_decay: 0.0001\n", + "2021-09-14 17:01:47,867 CIFAR10-Training INFO: \tnum_workers: 2\n", + "2021-09-14 17:01:47,887 CIFAR10-Training INFO: \tnum_epochs: 5\n", + "2021-09-14 17:01:47,902 CIFAR10-Training INFO: \tlearning_rate: 0.4\n", + "2021-09-14 17:01:47,922 CIFAR10-Training INFO: \tnum_warmup_epochs: 1\n", + "2021-09-14 17:01:47,940 CIFAR10-Training INFO: \tvalidate_every: 3\n", + "2021-09-14 17:01:47,949 CIFAR10-Training INFO: \tcheckpoint_every: 200\n", + "2021-09-14 17:01:47,960 CIFAR10-Training INFO: \tbackend: xla-tpu\n", + "2021-09-14 17:01:47,967 CIFAR10-Training INFO: \tresume_from: None\n", + "2021-09-14 17:01:47,975 CIFAR10-Training INFO: \tlog_every_iters: 15\n", + "2021-09-14 17:01:47,984 CIFAR10-Training INFO: \tnproc_per_node: None\n", + "2021-09-14 17:01:48,003 CIFAR10-Training INFO: \twith_clearml: False\n", + "2021-09-14 17:01:48,019 CIFAR10-Training INFO: \twith_amp: False\n", + "2021-09-14 17:01:48,040 CIFAR10-Training INFO: \n", + "\n", + "2021-09-14 17:01:48,059 CIFAR10-Training INFO: \n", + "Distributed setting:\n", + "2021-09-14 17:01:48,079 CIFAR10-Training INFO: \tbackend: xla-tpu\n", + "2021-09-14 17:01:48,098 CIFAR10-Training INFO: \tworld size: 8\n", + "2021-09-14 17:01:48,109 CIFAR10-Training INFO: \n", + "\n", + "2021-09-14 17:01:48,130 CIFAR10-Training INFO: Output path: output-cifar10/resnet18_backend-xla-tpu-8_20210914-170148\n", + "2021-09-14 17:01:50,917 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10': \n", + "\t{'batch_size': 64, 'num_workers': 2, 'drop_last': True, 'sampler': , 'pin_memory': False}\n", + "2021-09-14 17:01:50,950 ignite.distributed.auto.auto_dataloader INFO: DataLoader is wrapped by `MpDeviceLoader` 
on XLA\n", + "2021-09-14 17:01:50,975 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10': \n", + "\t{'batch_size': 128, 'num_workers': 2, 'sampler': , 'pin_memory': False}\n", + "2021-09-14 17:01:51,000 ignite.distributed.auto.auto_dataloader INFO: DataLoader is wrapped by `MpDeviceLoader` on XLA\n", + "2021-09-14 17:01:53,866 CIFAR10-Training INFO: Engine run starting with max_epochs=5.\n", + "2021-09-14 17:02:23,913 CIFAR10-Training INFO: Epoch[1] Complete. Time taken: 00:00:30\n", + "2021-09-14 17:02:41,945 CIFAR10-Training INFO: Epoch[2] Complete. Time taken: 00:00:18\n", + "2021-09-14 17:03:13,870 CIFAR10-Training INFO: \n", + "Epoch 3 - Evaluation time (seconds): 14.00 - train metrics:\n", + " \tAccuracy: 0.32997744845360827\n", + "\tLoss: 1.7080145767054606\n", + "2021-09-14 17:03:19,283 CIFAR10-Training INFO: \n", + "Epoch 3 - Evaluation time (seconds): 5.39 - val metrics:\n", + " \tAccuracy: 0.3424\n", + "\tLoss: 1.691359375\n", + "2021-09-14 17:03:19,289 CIFAR10-Training INFO: Epoch[3] Complete. Time taken: 00:00:37\n", + "2021-09-14 17:03:37,535 CIFAR10-Training INFO: Epoch[4] Complete. Time taken: 00:00:18\n", + "2021-09-14 17:03:55,927 CIFAR10-Training INFO: Epoch[5] Complete. Time taken: 00:00:18\n", + "2021-09-14 17:04:07,598 CIFAR10-Training INFO: \n", + "Epoch 5 - Evaluation time (seconds): 11.66 - train metrics:\n", + " \tAccuracy: 0.42823775773195877\n", + "\tLoss: 1.4969784451514174\n", + "2021-09-14 17:04:10,190 CIFAR10-Training INFO: \n", + "Epoch 5 - Evaluation time (seconds): 2.56 - val metrics:\n", + " \tAccuracy: 0.4412\n", + "\tLoss: 1.47838994140625\n", + "2021-09-14 17:04:10,244 CIFAR10-Training INFO: Engine run complete. Time taken: 00:02:16\n", + "2021-09-14 17:04:10,313 ignite.distributed.launcher.Parallel INFO: End of run\n" + ] } - ], - "metadata": { - "colab": { - "collapsed_sections": [ - "RdfXz_0cWZ_9", - "b_k_0YkyX1Yn", - "osAHgAyJWomh", - "rnOI--qIJ0ZN", - "b1Z4JfvJJ6bt", - "KPUtXHI3KG9w" - ], - "name": "cifar10-distributed.ipynb", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10" + ], + "source": [ + "nproc_per_node = 8\n", + "config[\"backend\"] = \"xla-tpu\"\n", + "\n", + "with idist.Parallel(backend=config[\"backend\"], nproc_per_node=nproc_per_node) as parallel:\n", + " parallel.run(training, config)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FYq3RGY1t9me" + }, + "source": [ + "## Run in Jupyter Notebook\n", + "\n", + "We will have to spawn processes in a notebook and therefore, we will use internal spawning to achieve that. For multiple GPUs, use `backend=nccl` and `backend=gloo` for multiple CPUs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6DWUEfZHuTDz", + "outputId": "c7110f87-05a0-4c7b-c910-efd72b2606e3" + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "2021-09-14 19:15:15,335 ignite.distributed.launcher.Parallel INFO: Initialized distributed launcher with backend: 'nccl'\n", + "2021-09-14 19:15:15,337 ignite.distributed.launcher.Parallel INFO: - Parameters to spawn processes: \n", + "\tnproc_per_node: 2\n", + "\tnnodes: 1\n", + "\tnode_rank: 0\n", + "\tstart_method: fork\n", + "2021-09-14 19:15:15,338 ignite.distributed.launcher.Parallel INFO: Spawn function '' in 2 processes\n", + "2021-09-14 19:15:18,910 CIFAR10-Training INFO: Train on CIFAR10\n", + "2021-09-14 19:15:18,911 CIFAR10-Training INFO: - PyTorch version: 1.9.0\n", + "2021-09-14 19:15:18,912 CIFAR10-Training INFO: - Ignite version: 0.4.6\n", + "2021-09-14 19:15:18,913 CIFAR10-Training INFO: - GPU Device: GeForce GTX 1080 Ti\n", + "2021-09-14 19:15:18,913 CIFAR10-Training INFO: - CUDA version: 11.1\n", + "2021-09-14 19:15:18,914 CIFAR10-Training INFO: - CUDNN version: 8005\n", + "2021-09-14 19:15:18,915 CIFAR10-Training INFO: \n", + "\n", + "2021-09-14 19:15:18,916 CIFAR10-Training INFO: Configuration:\n", + "2021-09-14 19:15:18,917 CIFAR10-Training INFO: \tseed: 543\n", + "2021-09-14 19:15:18,918 CIFAR10-Training INFO: \tdata_path: cifar10\n", + "2021-09-14 19:15:18,919 CIFAR10-Training INFO: \toutput_path: output-cifar10/\n", + "2021-09-14 19:15:18,920 CIFAR10-Training INFO: \tmodel: resnet18\n", + "2021-09-14 19:15:18,921 CIFAR10-Training INFO: \tbatch_size: 512\n", + "2021-09-14 19:15:18,922 CIFAR10-Training INFO: \tmomentum: 0.9\n", + "2021-09-14 19:15:18,923 CIFAR10-Training INFO: \tweight_decay: 0.0001\n", + "2021-09-14 19:15:18,924 CIFAR10-Training INFO: \tnum_workers: 2\n", + "2021-09-14 19:15:18,925 CIFAR10-Training INFO: \tnum_epochs: 5\n", + "2021-09-14 19:15:18,925 CIFAR10-Training INFO: \tlearning_rate: 0.4\n", + "2021-09-14 19:15:18,926 CIFAR10-Training INFO: \tnum_warmup_epochs: 1\n", + "2021-09-14 19:15:18,927 CIFAR10-Training INFO: \tvalidate_every: 3\n", + "2021-09-14 19:15:18,928 CIFAR10-Training INFO: \tcheckpoint_every: 200\n", + "2021-09-14 19:15:18,929 CIFAR10-Training INFO: \tbackend: nccl\n", + "2021-09-14 19:15:18,929 CIFAR10-Training INFO: \tresume_from: None\n", + "2021-09-14 19:15:18,930 CIFAR10-Training INFO: \tlog_every_iters: 15\n", + "2021-09-14 19:15:18,931 CIFAR10-Training INFO: \tnproc_per_node: None\n", + "2021-09-14 19:15:18,931 CIFAR10-Training INFO: \twith_clearml: False\n", + "2021-09-14 19:15:18,932 CIFAR10-Training INFO: \twith_amp: False\n", + "2021-09-14 19:15:18,933 CIFAR10-Training INFO: \n", + "\n", + "2021-09-14 19:15:18,933 CIFAR10-Training INFO: \n", + "Distributed setting:\n", + "2021-09-14 19:15:18,934 CIFAR10-Training INFO: \tbackend: nccl\n", + "2021-09-14 19:15:18,935 CIFAR10-Training INFO: \tworld size: 2\n", + "2021-09-14 19:15:18,935 CIFAR10-Training INFO: \n", + "\n", + "2021-09-14 19:15:18,936 CIFAR10-Training INFO: Output path: output-cifar10/resnet18_backend-nccl-2_20210914-191518\n", + "2021-09-14 19:15:19,725 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10': \n", + "\t{'batch_size': 256, 'num_workers': 1, 'drop_last': True, 'sampler': , 'pin_memory': True}\n", + "2021-09-14 19:15:19,727 ignite.distributed.auto.auto_dataloader INFO: Use data loader kwargs for dataset 'Dataset CIFAR10': \n", + "\t{'batch_size': 512, 
'num_workers': 1, 'sampler': , 'pin_memory': True}\n", + "2021-09-14 19:15:19,873 ignite.distributed.auto.auto_model INFO: Apply torch DistributedDataParallel on model, device id: 0\n", + "2021-09-14 19:15:20,049 CIFAR10-Training INFO: Engine run starting with max_epochs=5.\n", + "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448265233/work/c10/core/TensorImpl.h:1156.)\n", + " return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)\n", + "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448265233/work/c10/core/TensorImpl.h:1156.)\n", + " return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)\n", + "2021-09-14 19:15:28,800 CIFAR10-Training INFO: Epoch[1] Complete. Time taken: 00:00:09\n", + "2021-09-14 19:15:37,474 CIFAR10-Training INFO: Epoch[2] Complete. Time taken: 00:00:09\n", + "2021-09-14 19:15:54,675 CIFAR10-Training INFO: \n", + "Epoch 3 - Evaluation time (seconds): 8.50 - train metrics:\n", + " \tAccuracy: 0.5533988402061856\n", + "\tLoss: 1.2227583423103254\n", + "2021-09-14 19:15:56,077 CIFAR10-Training INFO: \n", + "Epoch 3 - Evaluation time (seconds): 1.36 - val metrics:\n", + " \tAccuracy: 0.5699\n", + "\tLoss: 1.1869916015625\n", + "2021-09-14 19:15:56,079 CIFAR10-Training INFO: Epoch[3] Complete. Time taken: 00:00:19\n", + "2021-09-14 19:16:04,686 CIFAR10-Training INFO: Epoch[4] Complete. Time taken: 00:00:09\n", + "2021-09-14 19:16:13,347 CIFAR10-Training INFO: Epoch[5] Complete. Time taken: 00:00:09\n", + "2021-09-14 19:16:21,857 CIFAR10-Training INFO: \n", + "Epoch 5 - Evaluation time (seconds): 8.46 - train metrics:\n", + " \tAccuracy: 0.6584246134020618\n", + "\tLoss: 0.9565292830319748\n", + "2021-09-14 19:16:23,269 CIFAR10-Training INFO: \n", + "Epoch 5 - Evaluation time (seconds): 1.38 - val metrics:\n", + " \tAccuracy: 0.6588\n", + "\tLoss: 0.9517111328125\n", + "2021-09-14 19:16:23,271 CIFAR10-Training INFO: Engine run complete. Time taken: 00:01:03\n", + "2021-09-14 19:16:23,547 ignite.distributed.launcher.Parallel INFO: End of run\n" + ] } + ], + "source": [ + "spawn_kwargs = {}\n", + "spawn_kwargs[\"start_method\"] = \"fork\"\n", + "spawn_kwargs[\"nproc_per_node\"] = 2\n", + "config[\"backend\"] = \"nccl\"\n", + "\n", + "with idist.Parallel(backend=config[\"backend\"], **spawn_kwargs) as parallel:\n", + " parallel.run(training, config)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hNIC_h9fXeKI" + }, + "source": [ + "## Important Links\n", + "\n", + "1. Complete code can be found [here](https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py).\n", + "2. 
Example of the logs of a ClearML experiment run on this code:\n", + " - [With torch.distributed.launch](https://app.community.clear.ml/projects/14efa0ee4c114401bd06b7748314b465/experiments/83ebffd99a3f47f49dff1075252e3371/output/execution) \n", + " - [With default internal spawning](https://app.community.clear.ml/projects/14efa0ee4c114401bd06b7748314b465/experiments/c2b82ec98e8445f29044c94f7efc8215/output/execution)\n", + " - [On Jupyter](https://app.community.clear.ml/projects/14efa0ee4c114401bd06b7748314b465/experiments/2fedd7447b114b36af7066cdb81fddae/output/execution)\n", + " - [On Colab with XLA](https://app.community.clear.ml/projects/14efa0ee4c114401bd06b7748314b465/experiments/fbffb4d7f9324c57979a833a789df857/output/execution)" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [ + "RdfXz_0cWZ_9", + "b_k_0YkyX1Yn", + "osAHgAyJWomh", + "rnOI--qIJ0ZN", + "b1Z4JfvJJ6bt", + "KPUtXHI3KG9w" + ], + "name": "cifar10-distributed.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" }, - "nbformat": 4, - "nbformat_minor": 0 -} + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.10" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/tutorials/intermediate/02-Machine_Translation_using_PyTorch_Ignite.ipynb b/tutorials/intermediate/02-Machine_Translation_using_PyTorch_Ignite.ipynb index db9702e..5ef7bc3 100644 --- a/tutorials/intermediate/02-Machine_Translation_using_PyTorch_Ignite.ipynb +++ b/tutorials/intermediate/02-Machine_Translation_using_PyTorch_Ignite.ipynb @@ -170,7 +170,7 @@ "import ignite.distributed as idist\n", "from ignite.contrib.engines import common\n", "from ignite.engine import Engine, Events\n", - "from ignite.handlers import Checkpoint, DiskSaver, global_step_from_engine\n", + "from ignite.handlers import Checkpoint, global_step_from_engine\n", "from ignite.metrics import Bleu\n", "from ignite.utils import manual_seed, setup_logger\n", "\n", @@ -722,7 +722,7 @@ "\n", "\n", "def get_save_handler(config):\n", - " return DiskSaver(config[\"output_path_\"], require_empty=False)" + " return config[\"output_path_\"]" ] }, { @@ -993,4 +993,4 @@ }, "nbformat": 4, "nbformat_minor": 1 -} +} \ No newline at end of file diff --git a/tutorials/intermediate/cifar10-distributed.py b/tutorials/intermediate/cifar10-distributed.py index b045673..aa9ce3b 100644 --- a/tutorials/intermediate/cifar10-distributed.py +++ b/tutorials/intermediate/cifar10-distributed.py @@ -24,7 +24,7 @@ create_supervised_trainer, create_supervised_evaluator, ) -from ignite.handlers import Checkpoint, DiskSaver, global_step_from_engine +from ignite.handlers import Checkpoint, global_step_from_engine from ignite.metrics import Accuracy, Loss from ignite.utils import manual_seed, setup_logger @@ -149,7 +149,7 @@ def get_save_handler(config): return ClearMLSaver(dirname=config["output_path"]) - return DiskSaver(config["output_path"], require_empty=False) + return config["output_path"] def load_checkpoint(resume_from):