diff --git "a/\352\261\264\353\247\220\354\262\234\352\260\204_Q1_FullyConnectedNets.ipynb" "b/\352\261\264\353\247\220\354\262\234\352\260\204_Q1_FullyConnectedNets.ipynb" new file mode 100644 index 0000000..67ec8f5 --- /dev/null +++ "b/\352\261\264\353\247\220\354\262\234\352\260\204_Q1_FullyConnectedNets.ipynb" @@ -0,0 +1,1719 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "GEzcoD2JRCEa", + "outputId": "27d3cb11-adfe-4fd5-95ca-0f7383cf88e8" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount(\"/content/drive\", force_remount=True).\n" + ] + } + ], + "source": [ + "from google.colab import drive\n", + "drive.mount('/content/drive')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "5tqKVNICRCsX", + "outputId": "30063944-8df3-4d53-dec0-6deabf735dc9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/content/drive/MyDrive/Colab Notebooks/EURON/cs231n-master/assignment2\n" + ] + } + ], + "source": [ + "%cd \"/content/drive/MyDrive/Colab Notebooks/EURON/cs231n-master/assignment2\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XF3UPQxRLo5W" + }, + "source": [ + "# Fully-Connected Neural Nets\n", + "In the previous homework you implemented a fully-connected two-layer neural network on CIFAR-10. The implementation was simple but not very modular since the loss and gradient were computed in a single monolithic function. This is manageable for a simple two-layer network, but would become impractical as we move to bigger models. 
Ideally we want to build networks using a more modular design so that we can implement different layer types in isolation and then snap them together into models with different architectures." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Oanh_J5eLo5b" + }, + "source": [ + "In this exercise we will implement fully-connected networks using a more modular approach. For each layer we will implement a `forward` and a `backward` function. The `forward` function will receive inputs, weights, and other parameters and will return both an output and a `cache` object storing data needed for the backward pass, like this:\n", + "\n", + "```python\n", + "def layer_forward(x, w):\n", + " \"\"\" Receive inputs x and weights w \"\"\"\n", + " # Do some computations ...\n", + " z = # ... some intermediate value\n", + " # Do some more computations ...\n", + " out = # the output\n", + " \n", + " cache = (x, w, z, out) # Values we need to compute gradients\n", + " \n", + " return out, cache\n", + "```\n", + "\n", + "The backward pass will receive upstream derivatives and the `cache` object, and will return gradients with respect to the inputs and weights, like this:\n", + "\n", + "```python\n", + "def layer_backward(dout, cache):\n", + " \"\"\"\n", + " Receive dout (derivative of loss with respect to outputs) and cache,\n", + " and compute derivative with respect to inputs.\n", + " \"\"\"\n", + " # Unpack cache values\n", + " x, w, z, out = cache\n", + " \n", + " # Use values in cache to compute derivatives\n", + " dx = # Derivative of loss with respect to x\n", + " dw = # Derivative of loss with respect to w\n", + " \n", + " return dx, dw\n", + "```\n", + "\n", + "After implementing a bunch of layers this way, we will be able to easily combine them to build classifiers with different architectures.\n", + "\n", + "In addition to implementing fully-connected networks of arbitrary depth, we will also explore different update rules for optimization, and introduce 
Dropout as a regularizer and Batch/Layer Normalization as a tool to more efficiently optimize deep networks.\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ysnIIM7xQzMA", + "outputId": "73e483ac-a52c-4279-f75c-c1908c1d68ef" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "240119_BatchNormalization.ipynb collectSubmission.sh\t frameworkpython requirements.txt\n", + "240119_Dropout.ipynb\t\t ConvolutionalNetworks.ipynb notebook_images start_ipython_osx.sh\n", + "240119_FullyConnectedNets.ipynb cs231n\t\t\t PyTorch.ipynb TensorFlow.ipynb\n" + ] + } + ], + "source": [ + "!ls" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "lrejozozQ0kc", + "outputId": "d056a337-721c-4c4b-b8d0-41cc808fe1e6" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/content/drive/MyDrive/Colab Notebooks/EURON/cs231n-master/assignment2\n" + ] + } + ], + "source": [ + "!pwd" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "V6Mb0HmQLo5d", + "outputId": "170499f9-7082-4f81-da9f-afa3f390ad89" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "The autoreload extension is already loaded. 
To reload it, use:\n", + " %reload_ext autoreload\n" + ] + } + ], + "source": [ + "# As usual, a bit of setup\n", + "from __future__ import print_function\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y)))) # 1e-8: threshold" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Rv2QKhUzLo5e", + "outputId": "0b2e9e0f-aec0-43b4-85a4-caca73fff900", + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "('X_train: ', (49000, 3, 32, 32))\n", + "('y_train: ', (49000,))\n", + "('X_val: ', (1000, 3, 32, 32))\n", + "('y_val: ', (1000,))\n", + "('X_test: ', (1000, 3, 32, 32))\n", + "('y_test: ', (1000,))\n" + ] + } + ], + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in list(data.items()):\n", + " print(('%s: ' % k, v.shape))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jes-Mi2KLo5f" + }, + "source": [ + "# Affine layer: forward\n", + "Open the file `cs231n/layers.py` and implement the `affine_forward` function.\n", + "\n", + "Once 
you are done you can test your implementation by running the following:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aTiM6cC3M79i" + }, + "source": [ + "`affine layer` = `FC layer`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "4t2yvr8TLo5f", + "outputId": "8e82a201-6dbd-47c7-fed0-abebbafcd63e", + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing affine_forward function:\n", + "difference: 9.769849468192957e-10\n" + ] + } + ], + "source": [ + "# Test the affine_forward function\n", + "\n", + "num_inputs = 2\n", + "input_shape = (4, 5, 6)\n", + "output_dim = 3\n", + "\n", + "input_size = num_inputs * np.prod(input_shape)\n", + "weight_size = output_dim * np.prod(input_shape)\n", + "\n", + "x = np.linspace(-0.1, 0.5, num=input_size).reshape(num_inputs, *input_shape)\n", + "w = np.linspace(-0.2, 0.3, num=weight_size).reshape(np.prod(input_shape), output_dim)\n", + "b = np.linspace(-0.3, 0.1, num=output_dim)\n", + "\n", + "out, _ = affine_forward(x, w, b)\n", + "correct_out = np.array([[ 1.49834967, 1.70660132, 1.91485297],\n", + " [ 3.25553199, 3.5141327, 3.77273342]])\n", + "\n", + "# Compare your output with ours. The error should be around e-9 or less.\n", + "print('Testing affine_forward function:')\n", + "print('difference: ', rel_error(out, correct_out))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O5KlbbWfLo5g" + }, + "source": [ + "# Affine layer: backward\n", + "Now implement the `affine_backward` function and test your implementation using numeric gradient checking." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "2S0CuPbxLo5g", + "outputId": "3499a77d-ff26-4527-8aed-18497317b09f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing affine_backward function:\n", + "dx error: 5.399100368651805e-11\n", + "dw error: 9.904211865398145e-11\n", + "db error: 2.4122867568119087e-11\n" + ] + } + ], + "source": [ + "# Test the affine_backward function\n", + "np.random.seed(231)\n", + "x = np.random.randn(10, 2, 3)\n", + "w = np.random.randn(6, 5)\n", + "b = np.random.randn(5)\n", + "dout = np.random.randn(10, 5)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: affine_forward(x, w, b)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: affine_forward(x, w, b)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: affine_forward(x, w, b)[0], b, dout)\n", + "\n", + "_, cache = affine_forward(x, w, b)\n", + "dx, dw, db = affine_backward(dout, cache)\n", + "\n", + "# The error should be around e-10 or less\n", + "print('Testing affine_backward function:')\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dw error: ', rel_error(dw_num, dw))\n", + "print('db error: ', rel_error(db_num, db))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tv98LyZeLo5h" + }, + "source": [ + "# ReLU activation: forward\n", + "Implement the forward pass for the ReLU activation function in the `relu_forward` function and test your implementation using the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mepcb3U_Lo5i", + "outputId": "7a4980e6-f5dc-489f-bf86-54d49f3d5d19" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing relu_forward function:\n", + "difference: 4.999999798022158e-08\n" + ] + } + ], + 
"source": [ + "# Test the relu_forward function\n", + "\n", + "x = np.linspace(-0.5, 0.5, num=12).reshape(3, 4)\n", + "\n", + "out, _ = relu_forward(x)\n", + "correct_out = np.array([[ 0., 0., 0., 0., ],\n", + " [ 0., 0., 0.04545455, 0.13636364,],\n", + " [ 0.22727273, 0.31818182, 0.40909091, 0.5, ]])\n", + "\n", + "# Compare your output with ours. The error should be on the order of e-8\n", + "print('Testing relu_forward function:')\n", + "print('difference: ', rel_error(out, correct_out))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1k7iVaCiLo5i" + }, + "source": [ + "# ReLU activation: backward\n", + "Now implement the backward pass for the ReLU activation function in the `relu_backward` function and test your implementation using numeric gradient checking:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KPH900P9Lo5i", + "outputId": "b9ef54ab-d0f5-4a49-a700-4fef25655ad6" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing relu_backward function:\n", + "dx error: 3.2756349136310288e-12\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(10, 10)\n", + "dout = np.random.randn(*x.shape)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: relu_forward(x)[0], x, dout)\n", + "\n", + "_, cache = relu_forward(x)\n", + "dx = relu_backward(dout, cache)\n", + "\n", + "# The error should be on the order of e-12\n", + "print('Testing relu_backward function:')\n", + "print('dx error: ', rel_error(dx_num, dx))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YhF3P5JdLo5j" + }, + "source": [ + "## Inline Question 1:\n", + "\n", + "We've only asked you to implement ReLU, but there are a number of different activation functions that one could use in neural networks, each with its pros and cons. 
In particular, an issue commonly seen with activation functions is getting zero (or close to zero) gradient flow during backpropagation. Which of the following activation functions have this problem? If you consider these functions in the one dimensional case, what types of input would lead to this behaviour?\n", + "1. Sigmoid\n", + "2. ReLU\n", + "3. Leaky ReLU\n", + "\n", + "## Answer:\n", + "1, 2.\n", + "\n", + "For 1 (Sigmoid), the gradient becomes nearly 0 when the input is a positive or negative value of large magnitude.\n", + "\n", + "For 2 (ReLU), the gradient is 0 when the input is negative; when the input is positive, the gradient is 1." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yxULPpXLLo5j" + }, + "source": [ + "# \"Sandwich\" layers\n", + "There are some common patterns of layers that are frequently used in neural nets. For example, affine layers are frequently followed by a ReLU nonlinearity. To make these common patterns easy, we define several convenience layers in the file `cs231n/layer_utils.py`.\n", + "\n", + "For now take a look at the `affine_relu_forward` and `affine_relu_backward` functions, and run the following to numerically gradient check the backward pass:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "YFTUoGN1Lo5j", + "outputId": "5fdc7b23-9c29-484c-e41a-cb8e6af08b4b" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing affine_relu_forward and affine_relu_backward:\n", + "dx error: 2.299579177309368e-11\n", + "dw error: 8.162011105764925e-11\n", + "db error: 7.826724021458994e-12\n" + ] + } + ], + "source": [ + "from cs231n.layer_utils import affine_relu_forward, affine_relu_backward\n", + "np.random.seed(231)\n", + "x = np.random.randn(2, 3, 4)\n", + "w = np.random.randn(12, 10)\n", + "b = np.random.randn(10)\n", + "dout = np.random.randn(2, 10)\n", + "\n", + "out, cache = affine_relu_forward(x, w, b)\n", + "dx, dw, db = affine_relu_backward(dout, cache)\n", + "\n", + "dx_num = 
eval_numerical_gradient_array(lambda x: affine_relu_forward(x, w, b)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: affine_relu_forward(x, w, b)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: affine_relu_forward(x, w, b)[0], b, dout)\n", + "\n", + "# Relative error should be around e-10 or less\n", + "print('Testing affine_relu_forward and affine_relu_backward:')\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dw error: ', rel_error(dw_num, dw))\n", + "print('db error: ', rel_error(db_num, db))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_0TdGRSVLo5j" + }, + "source": [ + "# Loss layers: Softmax and SVM\n", + "You implemented these loss functions in the last assignment, so we'll give them to you for free here. You should still make sure you understand how they work by looking at the implementations in `cs231n/layers.py`.\n", + "\n", + "You can make sure that the implementations are correct by running the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cd3-FkDNLo5j", + "outputId": "2dc4f482-bf8d-4d78-92af-86d838ba1872" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing svm_loss:\n", + "loss: 8.999602749096233\n", + "dx error: 1.4021566006651672e-09\n", + "\n", + "Testing softmax_loss:\n", + "loss: 2.3025458445007376\n", + "dx error: 8.234144091578429e-09\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "num_classes, num_inputs = 10, 50\n", + "x = 0.001 * np.random.randn(num_inputs, num_classes)\n", + "y = np.random.randint(num_classes, size=num_inputs)\n", + "\n", + "dx_num = eval_numerical_gradient(lambda x: svm_loss(x, y)[0], x, verbose=False)\n", + "loss, dx = svm_loss(x, y)\n", + "\n", + "# Test svm_loss function. 
Loss should be around 9 and dx error should be around the order of e-9\n", + "print('Testing svm_loss:')\n", + "print('loss: ', loss)\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "\n", + "dx_num = eval_numerical_gradient(lambda x: softmax_loss(x, y)[0], x, verbose=False)\n", + "loss, dx = softmax_loss(x, y)\n", + "\n", + "# Test softmax_loss function. Loss should be close to 2.3 and dx error should be around e-8\n", + "print('\\nTesting softmax_loss:')\n", + "print('loss: ', loss)\n", + "print('dx error: ', rel_error(dx_num, dx))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rHrkbVnELo5k" + }, + "source": [ + "# Two-layer network\n", + "In the previous assignment you implemented a two-layer neural network in a single monolithic class. Now that you have implemented modular versions of the necessary layers, you will reimplement the two layer network using these modular implementations.\n", + "\n", + "Open the file `cs231n/classifiers/fc_net.py` and complete the implementation of the `TwoLayerNet` class. This class will serve as a model for the other networks you will implement in this assignment, so read through it to make sure you understand the API. You can run the cell below to test your implementation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "YLH8m6zKLo5k", + "outputId": "603f314b-52fc-4a91-9ee6-1cc1dec7cfd2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing initialization ... \n", + "Testing test-time forward pass ... 
\n", + "Testing training loss (no regularization)\n", + "Running numeric gradient check with reg = 0.0\n", + "W1 relative error: 1.83e-08\n", + "W2 relative error: 3.20e-10\n", + "b1 relative error: 9.83e-09\n", + "b2 relative error: 4.33e-10\n", + "Running numeric gradient check with reg = 0.7\n", + "W1 relative error: 2.53e-07\n", + "W2 relative error: 2.85e-08\n", + "b1 relative error: 1.56e-08\n", + "b2 relative error: 9.09e-10\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "N, D, H, C = 3, 5, 50, 7\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=N)\n", + "\n", + "std = 1e-3\n", + "model = TwoLayerNet(input_dim=D, hidden_dim=H, num_classes=C, weight_scale=std)\n", + "\n", + "print('Testing initialization ... ')\n", + "W1_std = abs(model.params['W1'].std() - std)\n", + "b1 = model.params['b1']\n", + "W2_std = abs(model.params['W2'].std() - std)\n", + "b2 = model.params['b2']\n", + "assert W1_std < std / 10, 'First layer weights do not seem right'\n", + "assert np.all(b1 == 0), 'First layer biases do not seem right'\n", + "assert W2_std < std / 10, 'Second layer weights do not seem right'\n", + "assert np.all(b2 == 0), 'Second layer biases do not seem right'\n", + "\n", + "print('Testing test-time forward pass ... 
')\n", + "model.params['W1'] = np.linspace(-0.7, 0.3, num=D*H).reshape(D, H)\n", + "model.params['b1'] = np.linspace(-0.1, 0.9, num=H)\n", + "model.params['W2'] = np.linspace(-0.3, 0.4, num=H*C).reshape(H, C)\n", + "model.params['b2'] = np.linspace(-0.9, 0.1, num=C)\n", + "X = np.linspace(-5.5, 4.5, num=N*D).reshape(D, N).T\n", + "scores = model.loss(X)\n", + "correct_scores = np.asarray(\n", + " [[11.53165108, 12.2917344, 13.05181771, 13.81190102, 14.57198434, 15.33206765, 16.09215096],\n", + " [12.05769098, 12.74614105, 13.43459113, 14.1230412, 14.81149128, 15.49994135, 16.18839143],\n", + " [12.58373087, 13.20054771, 13.81736455, 14.43418138, 15.05099822, 15.66781506, 16.2846319 ]])\n", + "scores_diff = np.abs(scores - correct_scores).sum()\n", + "assert scores_diff < 1e-6, 'Problem with test-time forward pass'\n", + "\n", + "print('Testing training loss (no regularization)')\n", + "y = np.asarray([0, 5, 1])\n", + "loss, grads = model.loss(X, y)\n", + "correct_loss = 3.4702243556\n", + "assert abs(loss - correct_loss) < 1e-10, 'Problem with training-time loss'\n", + "\n", + "model.reg = 1.0\n", + "loss, grads = model.loss(X, y)\n", + "correct_loss = 26.5948426952\n", + "assert abs(loss - correct_loss) < 1e-10, 'Problem with regularization loss'\n", + "\n", + "# Errors should be around e-7 or less\n", + "for reg in [0.0, 0.7]:\n", + " print('Running numeric gradient check with reg = ', reg)\n", + " model.reg = reg\n", + " loss, grads = model.loss(X, y)\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False)\n", + " print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mAukdu9aLo5k" + }, + "source": [ + "# Solver\n", + "In the previous assignment, the logic for training models was coupled to the models themselves. 
Following a more modular design, for this assignment we have split the logic for training models into a separate class.\n", + "\n", + "Open the file `cs231n/solver.py` and read through it to familiarize yourself with the API. After doing so, use a `Solver` instance to train a `TwoLayerNet` that achieves at least `50%` accuracy on the validation set." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "k2UbGVP6Lo5k", + "outputId": "e732a0b5-4e8f-49c1-a217-210e71aa7b18" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Iteration 1 / 4900) loss: 2.316765\n", + "(Epoch 0 / 10) train acc: 0.164000; val_acc: 0.134000\n", + "(Epoch 1 / 10) train acc: 0.441000; val_acc: 0.456000\n", + "(Epoch 2 / 10) train acc: 0.491000; val_acc: 0.474000\n", + "(Iteration 1001 / 4900) loss: 1.491941\n", + "(Epoch 3 / 10) train acc: 0.497000; val_acc: 0.478000\n", + "(Epoch 4 / 10) train acc: 0.493000; val_acc: 0.485000\n", + "(Iteration 2001 / 4900) loss: 1.478132\n", + "(Epoch 5 / 10) train acc: 0.531000; val_acc: 0.486000\n", + "(Epoch 6 / 10) train acc: 0.560000; val_acc: 0.515000\n", + "(Iteration 3001 / 4900) loss: 1.212179\n", + "(Epoch 7 / 10) train acc: 0.522000; val_acc: 0.492000\n", + "(Epoch 8 / 10) train acc: 0.540000; val_acc: 0.495000\n", + "(Iteration 4001 / 4900) loss: 1.226817\n", + "(Epoch 9 / 10) train acc: 0.584000; val_acc: 0.515000\n", + "(Epoch 10 / 10) train acc: 0.566000; val_acc: 0.478000\n" + ] + } + ], + "source": [ + "model = TwoLayerNet()\n", + "solver = None\n", + "\n", + "##############################################################################\n", + "# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least #\n", + "# 50% accuracy on the validation set. 
#\n", + "##############################################################################\n", + "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "\n", + "model = TwoLayerNet(hidden_dim=100, reg=0.1)\n", + "\n", + "solver = Solver(model, data, update_rule='sgd', optim_config={\n", + " 'learning_rate': 1e-3\n", + " },\n", + " lr_decay=0.97, num_epochs=10, batch_size=100,print_every=1000)\n", + "solver.train()\n", + "\n", + "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "##############################################################################\n", + "# END OF YOUR CODE #\n", + "##############################################################################" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "1CTy9btVLo5l", + "outputId": "91305c7e-1fb9-41ea-aa46-e4a528ae6bbc" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Run this cell to visualize training loss and train / val accuracy\n", + "\n", + "plt.subplot(2, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(2, 1, 2)\n", + "plt.title('Accuracy')\n", + "plt.plot(solver.train_acc_history, '-o', label='train')\n", + "plt.plot(solver.val_acc_history, '-o', label='val')\n", + "plt.plot([0.5] * len(solver.val_acc_history), 'k--')\n", + "plt.xlabel('Epoch')\n", + "plt.legend(loc='lower right')\n", + "plt.gcf().set_size_inches(15, 12)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xFk3Rf7gLo5l" + }, + "source": [ + "# Multilayer network\n", + "Next you will implement a fully-connected network with an arbitrary number of hidden layers.\n", + "\n", + "Read through the `FullyConnectedNet` class in the file `cs231n/classifiers/fc_net.py`.\n", + "\n", + "Implement the initialization, the forward pass, and the backward pass. For the moment don't worry about implementing dropout or batch/layer normalization; we will add those features soon." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aSc2UXc6Lo5l" + }, + "source": [ + "## Initial loss and gradient check\n", + "\n", + "As a sanity check, run the following to check the initial loss and to gradient check the network both with and without regularization. Do the initial losses seem reasonable?\n", + "\n", + "For gradient checking, you should expect to see errors around 1e-7 or less." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "UWiUq-IxLo5l", + "outputId": "77fdae7a-d08c-4dcc-b860-803a4f66bd92" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Running check with reg = 0\n", + "Initial loss: 2.300479089768492\n", + "W1 relative error: 1.03e-07\n", + "W2 relative error: 2.21e-05\n", + "W3 relative error: 4.56e-07\n", + "b1 relative error: 4.66e-09\n", + "b2 relative error: 2.09e-09\n", + "b3 relative error: 1.69e-10\n", + "Running check with reg = 3.14\n", + "Initial loss: 7.052114776533016\n", + "W1 relative error: 1.41e-08\n", + "W2 relative error: 6.87e-08\n", + "W3 relative error: 2.13e-08\n", + "b1 relative error: 1.48e-08\n", + "b2 relative error: 1.72e-09\n", + "b3 relative error: 2.38e-10\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for reg in [0, 3.14]:\n", + " print('Running check with reg = ', reg)\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " reg=reg, weight_scale=5e-2, dtype=np.float64)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print('Initial loss: ', loss)\n", + "\n", + " # Most of the errors should be on the order of e-7 or smaller.\n", + " # NOTE: It is fine however to see an error for W2 on the order of e-5\n", + " # for the check when reg = 0.0\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8DhnODEpLo5l" + }, + "source": [ + "As another sanity check, make sure you can overfit a small dataset of 50 images. 
First we will try a three-layer network with 100 units in each hidden layer. In the following cell, tweak the **learning rate** and **weight initialization scale** to overfit and achieve 100% training accuracy within 20 epochs." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "hEE2q8VILo5m", + "outputId": "918c9a81-a1b3-4409-d2a9-07a37cfb4234", + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Iteration 1 / 40) loss: 2.363364\n", + "(Epoch 0 / 20) train acc: 0.020000; val_acc: 0.105000\n", + "(Epoch 1 / 20) train acc: 0.020000; val_acc: 0.106000\n", + "(Epoch 2 / 20) train acc: 0.020000; val_acc: 0.110000\n", + "(Epoch 3 / 20) train acc: 0.020000; val_acc: 0.110000\n", + "(Epoch 4 / 20) train acc: 0.040000; val_acc: 0.109000\n", + "(Epoch 5 / 20) train acc: 0.040000; val_acc: 0.111000\n", + "(Iteration 11 / 40) loss: 2.270022\n", + "(Epoch 6 / 20) train acc: 0.040000; val_acc: 0.111000\n", + "(Epoch 7 / 20) train acc: 0.060000; val_acc: 0.112000\n", + "(Epoch 8 / 20) train acc: 0.060000; val_acc: 0.111000\n", + "(Epoch 9 / 20) train acc: 0.040000; val_acc: 0.110000\n", + "(Epoch 10 / 20) train acc: 0.040000; val_acc: 0.109000\n", + "(Iteration 21 / 40) loss: 2.309562\n", + "(Epoch 11 / 20) train acc: 0.060000; val_acc: 0.110000\n", + "(Epoch 12 / 20) train acc: 0.060000; val_acc: 0.110000\n", + "(Epoch 13 / 20) train acc: 0.060000; val_acc: 0.110000\n", + "(Epoch 14 / 20) train acc: 0.060000; val_acc: 0.110000\n", + "(Epoch 15 / 20) train acc: 0.060000; val_acc: 0.113000\n", + "(Iteration 31 / 40) loss: 2.285026\n", + "(Epoch 16 / 20) train acc: 0.060000; val_acc: 0.117000\n", + "(Epoch 17 / 20) train acc: 0.080000; val_acc: 0.113000\n", + "(Epoch 18 / 20) train acc: 0.080000; val_acc: 0.118000\n", + "(Epoch 19 / 20) train acc: 0.100000; val_acc: 0.118000\n", + "(Epoch 20 / 20) train acc: 
0.100000; val_acc: 0.120000\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# TODO: Use a three-layer Net to overfit 50 training examples by\n", + "# tweaking just the learning rate and initialization scale.\n", + "\n", + "num_train = 50\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "weight_scale = 1e-2 # Experiment with this!\n", + "learning_rate = 1e-4 # Experiment with this!\n", + "model = FullyConnectedNet([100, 100],\n", + " weight_scale=weight_scale, dtype=np.float64)\n", + "solver = Solver(model, small_data,\n", + " print_every=10, num_epochs=20, batch_size=25,\n", + " update_rule='sgd',\n", + " optim_config={\n", + " 'learning_rate': learning_rate,\n", + " }\n", + " )\n", + "solver.train()\n", + "\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.title('Training loss history')\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Training loss')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LicqjxZNLo5m" + }, + "source": [ + "Now try to use a five-layer network with 100 units on each layer to overfit 50 training examples. Again, you will have to adjust the learning rate and weight initialization scale, but you should be able to achieve 100% training accuracy within 20 epochs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "7cDzWdPsLo5m", + "outputId": "00830ef4-a01e-4c24-9311-52b8dc77058a" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Iteration 1 / 40) loss: 2.302585\n", + "(Epoch 0 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 1 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 2 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 3 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 4 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 5 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Iteration 11 / 40) loss: 2.301962\n", + "(Epoch 6 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 7 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 8 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 9 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 10 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Iteration 21 / 40) loss: 2.301859\n", + "(Epoch 11 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 12 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 13 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 14 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 15 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Iteration 31 / 40) loss: 2.301798\n", + "(Epoch 16 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 17 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 18 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 19 / 20) train acc: 0.160000; val_acc: 0.079000\n", + "(Epoch 20 / 20) train acc: 0.160000; val_acc: 0.079000\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# TODO: Use a five-layer Net to overfit 50 training examples by\n", + "# tweaking just the learning rate and initialization scale.\n", + "\n", + "num_train = 50\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "learning_rate = 2e-3 # Experiment with this!\n", + "weight_scale = 1e-5 # Experiment with this!\n", + "model = FullyConnectedNet([100, 100, 100, 100],\n", + " weight_scale=weight_scale, dtype=np.float64)\n", + "solver = Solver(model, small_data,\n", + " print_every=10, num_epochs=20, batch_size=25,\n", + " update_rule='sgd',\n", + " optim_config={\n", + " 'learning_rate': learning_rate,\n", + " }\n", + " )\n", + "solver.train()\n", + "\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.title('Training loss history')\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Training loss')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kLwN_FVHLo5m" + }, + "source": [ + "## Inline Question 2:\n", + "Did you notice anything about the comparative difficulty of training the three-layer net vs training the five layer net? In particular, based on your experience, which network seemed more sensitive to the initialization scale? Why do you think that is the case?\n", + "\n", + "## Answer:\n", + "three-layer net이 더 initialization scale에 sensitive하다고 생각한다. 이 ipynb에서는 regularization에 대해 고려하지 않으므로, 모델의 복잡성은 layer 개수에 비례하게 된다. 또한 모델이 복잡할수록 parameter 개수가 많다. 따라서 three-layer는 parameter가 조금만 바뀌어도 그 결과에 영향을 미치는 정도가 크므로 initialization scale에 더 sensitive하다.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iC7ID4ARLo5m" + }, + "source": [ + "# Update rules\n", + "So far we have used vanilla stochastic gradient descent (SGD) as our update rule. 
More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ax6Ed3LqLo5n" + }, + "source": [ + "# SGD+Momentum\n", + "Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla stochastic gradient descent. See the Momentum Update section at http://cs231n.github.io/neural-networks-3/#sgd for more information.\n", + "\n", + "Open the file `cs231n/optim.py` and read the documentation at the top of the file to make sure you understand the API. Implement the SGD+momentum update rule in the function `sgd_momentum` and run the following to check your implementation. You should see errors less than e-8." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mVnqA5wuLo5n", + "outputId": "dcf1cc9b-726d-4c6f-bc6d-b892ac8fc9a7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "next_w error: 8.882347033505819e-09\n", + "velocity error: 4.269287743278663e-09\n" + ] + } + ], + "source": [ + "from cs231n.optim import sgd_momentum\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-3, 'velocity': v}\n", + "next_w, _ = sgd_momentum(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [ 0.1406, 0.20738947, 0.27417895, 0.34096842, 0.40775789],\n", + " [ 0.47454737, 0.54133684, 0.60812632, 0.67491579, 0.74170526],\n", + " [ 0.80849474, 0.87528421, 0.94207368, 1.00886316, 1.07565263],\n", + " [ 1.14244211, 1.20923158, 1.27602105, 1.34281053, 1.4096 ]])\n", + "expected_velocity = np.asarray([\n", + " [ 
0.5406, 0.55475789, 0.56891579, 0.58307368, 0.59723158],\n", + " [ 0.61138947, 0.62554737, 0.63970526, 0.65386316, 0.66802105],\n", + " [ 0.68217895, 0.69633684, 0.71049474, 0.72465263, 0.73881053],\n", + " [ 0.75296842, 0.76712632, 0.78128421, 0.79544211, 0.8096 ]])\n", + "\n", + "# Should see relative errors around e-8 or less\n", + "print('next_w error: ', rel_error(next_w, expected_next_w))\n", + "print('velocity error: ', rel_error(expected_velocity, config['velocity']))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "64ABZYdnLo5n" + }, + "source": [ + "Once you have done so, run the following to train a six-layer network with both SGD and SGD+momentum. You should see the SGD+momentum update rule converge faster." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "5YdC14A7Lo5n", + "outputId": "369cb155-299f-40b3-9904-9c9f766c3291", + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "running with sgd\n", + "(Iteration 1 / 200) loss: 2.559978\n", + "(Epoch 0 / 5) train acc: 0.104000; val_acc: 0.107000\n", + "(Iteration 11 / 200) loss: 2.356069\n", + "(Iteration 21 / 200) loss: 2.214091\n", + "(Iteration 31 / 200) loss: 2.205928\n", + "(Epoch 1 / 5) train acc: 0.225000; val_acc: 0.193000\n", + "(Iteration 41 / 200) loss: 2.132095\n", + "(Iteration 51 / 200) loss: 2.118950\n", + "(Iteration 61 / 200) loss: 2.116443\n", + "(Iteration 71 / 200) loss: 2.132549\n", + "(Epoch 2 / 5) train acc: 0.298000; val_acc: 0.260000\n", + "(Iteration 81 / 200) loss: 1.977227\n", + "(Iteration 91 / 200) loss: 2.007528\n", + "(Iteration 101 / 200) loss: 2.004762\n", + "(Iteration 111 / 200) loss: 1.885342\n", + "(Epoch 3 / 5) train acc: 0.343000; val_acc: 0.287000\n", + "(Iteration 121 / 200) loss: 1.891516\n", + "(Iteration 131 / 200) loss: 1.923677\n", + "(Iteration 141 / 200) loss: 1.957743\n", + 
"(Iteration 151 / 200) loss: 1.966736\n", + "(Epoch 4 / 5) train acc: 0.322000; val_acc: 0.305000\n", + "(Iteration 161 / 200) loss: 1.801483\n", + "(Iteration 171 / 200) loss: 1.973780\n", + "(Iteration 181 / 200) loss: 1.666573\n", + "(Iteration 191 / 200) loss: 1.909494\n", + "(Epoch 5 / 5) train acc: 0.372000; val_acc: 0.319000\n", + "\n", + "running with sgd_momentum\n", + "(Iteration 1 / 200) loss: 3.153778\n", + "(Epoch 0 / 5) train acc: 0.099000; val_acc: 0.088000\n", + "(Iteration 11 / 200) loss: 2.227203\n", + "(Iteration 21 / 200) loss: 2.125706\n", + "(Iteration 31 / 200) loss: 1.932695\n", + "(Epoch 1 / 5) train acc: 0.307000; val_acc: 0.260000\n", + "(Iteration 41 / 200) loss: 1.946488\n", + "(Iteration 51 / 200) loss: 1.778583\n", + "(Iteration 61 / 200) loss: 1.758119\n", + "(Iteration 71 / 200) loss: 1.849137\n", + "(Epoch 2 / 5) train acc: 0.382000; val_acc: 0.322000\n", + "(Iteration 81 / 200) loss: 2.048671\n", + "(Iteration 91 / 200) loss: 1.693223\n", + "(Iteration 101 / 200) loss: 1.511693\n", + "(Iteration 111 / 200) loss: 1.390754\n", + "(Epoch 3 / 5) train acc: 0.458000; val_acc: 0.338000\n", + "(Iteration 121 / 200) loss: 1.670614\n", + "(Iteration 131 / 200) loss: 1.540271\n", + "(Iteration 141 / 200) loss: 1.597365\n", + "(Iteration 151 / 200) loss: 1.609851\n", + "(Epoch 4 / 5) train acc: 0.490000; val_acc: 0.327000\n", + "(Iteration 161 / 200) loss: 1.472687\n", + "(Iteration 171 / 200) loss: 1.378620\n", + "(Iteration 181 / 200) loss: 1.378175\n", + "(Iteration 191 / 200) loss: 1.305934\n", + "(Epoch 5 / 5) train acc: 0.536000; val_acc: 0.368000\n", + "\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "num_train = 4000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "solvers = {}\n", + "\n", + "for update_rule in ['sgd', 'sgd_momentum']:\n", + " print('running with ', update_rule)\n", + " model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=5, batch_size=100,\n", + " update_rule=update_rule,\n", + " optim_config={\n", + " 'learning_rate': 5e-3,\n", + " },\n", + " verbose=True)\n", + " solvers[update_rule] = solver\n", + " solver.train()\n", + " print()\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Training accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Validation accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "for update_rule, solver in solvers.items():\n", + " plt.subplot(3, 1, 1)\n", + " plt.plot(solver.loss_history, 'o', label=\"loss_%s\" % update_rule)\n", + "\n", + " plt.subplot(3, 1, 2)\n", + " plt.plot(solver.train_acc_history, '-o', label=\"train_acc_%s\" % update_rule)\n", + "\n", + " plt.subplot(3, 1, 3)\n", + " plt.plot(solver.val_acc_history, '-o', label=\"val_acc_%s\" % update_rule)\n", + "\n", + "for i in [1, 2, 3]:\n", + " plt.subplot(3, 1, i)\n", + " plt.legend(loc='upper center', ncol=4)\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zGDbSeXTLo5n" + }, + "source": [ + "# RMSProp and Adam\n", + "RMSProp [1] and Adam [2] are update rules that set per-parameter learning rates by using a running average of the second moments of gradients.\n", + "\n", + "In the file `cs231n/optim.py`, 
implement the RMSProp update rule in the `rmsprop` function and implement the Adam update rule in the `adam` function, and check your implementations using the tests below.\n", + "\n", + "**NOTE:** Please implement the _complete_ Adam update rule (with the bias correction mechanism), not the first simplified version mentioned in the course notes.\n", + "\n", + "[1] Tijmen Tieleman and Geoffrey Hinton. \"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.\" COURSERA: Neural Networks for Machine Learning 4 (2012).\n", + "\n", + "[2] Diederik Kingma and Jimmy Ba, \"Adam: A Method for Stochastic Optimization\", ICLR 2015." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3b_Rv0K1Lo5o", + "outputId": "4ac61204-907b-47bc-c48c-d30cea967910" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "next_w error: 9.524687511038133e-08\n", + "cache error: 2.6477955807156126e-09\n" + ] + } + ], + "source": [ + "# Test RMSProp implementation\n", + "from cs231n.optim import rmsprop\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-2, 'cache': cache}\n", + "next_w, _ = rmsprop(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],\n", + " [-0.132737, -0.08078555, -0.02881884, 0.02316247, 0.07515774],\n", + " [ 0.12716641, 0.17918792, 0.23122175, 0.28326742, 0.33532447],\n", + " [ 0.38739248, 0.43947102, 0.49155973, 0.54365823, 0.59576619]])\n", + "expected_cache = np.asarray([\n", + " [ 0.5976, 0.6126277, 0.6277108, 0.64284931, 0.65804321],\n", + " [ 0.67329252, 0.68859723, 0.70395734, 0.71937285, 0.73484377],\n", + " [ 0.75037008, 0.7659518, 
0.78158892, 0.79728144, 0.81302936],\n", + " [ 0.82883269, 0.84469141, 0.86060554, 0.87657507, 0.8926 ]])\n", + "\n", + "# You should see relative errors around e-7 or less\n", + "print('next_w error: ', rel_error(expected_next_w, next_w))\n", + "print('cache error: ', rel_error(expected_cache, config['cache']))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "-dlu4pR6Lo5o", + "outputId": "565b7db6-fbcd-4592-bed8-91247dd2b6bc" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "next_w error: 1.1395691798535431e-07\n", + "v error: 4.208314038113071e-09\n", + "m error: 4.214963193114416e-09\n" + ] + } + ], + "source": [ + "# Test Adam implementation\n", + "from cs231n.optim import adam\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-2, 'm': m, 'v': v, 't': 5}\n", + "next_w, _ = adam(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],\n", + " [-0.1380274, -0.08544591, -0.03286534, 0.01971428, 0.0722929],\n", + " [ 0.1248705, 0.17744702, 0.23002243, 0.28259667, 0.33516969],\n", + " [ 0.38774145, 0.44031188, 0.49288093, 0.54544852, 0.59801459]])\n", + "expected_v = np.asarray([\n", + " [ 0.69966, 0.68908382, 0.67851319, 0.66794809, 0.65738853,],\n", + " [ 0.64683452, 0.63628604, 0.6257431, 0.61520571, 0.60467385,],\n", + " [ 0.59414753, 0.58362676, 0.57311152, 0.56260183, 0.55209767,],\n", + " [ 0.54159906, 0.53110598, 0.52061845, 0.51013645, 0.49966, ]])\n", + "expected_m = np.asarray([\n", + " [ 0.48, 0.49947368, 0.51894737, 0.53842105, 0.55789474],\n", + " [ 0.57736842, 0.59684211, 0.61631579, 
0.63578947, 0.65526316],\n", + " [ 0.67473684, 0.69421053, 0.71368421, 0.73315789, 0.75263158],\n", + " [ 0.77210526, 0.79157895, 0.81105263, 0.83052632, 0.85 ]])\n", + "\n", + "# You should see relative errors around e-7 or less\n", + "print('next_w error: ', rel_error(expected_next_w, next_w))\n", + "print('v error: ', rel_error(expected_v, config['v']))\n", + "print('m error: ', rel_error(expected_m, config['m']))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gVOP4c-WLo5o" + }, + "source": [ + "Once you have debugged your RMSProp and Adam implementations, run the following to train a pair of deep networks using these new update rules:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "id": "0wvM3bJELo5u", + "outputId": "f961f83c-5d6f-4ac4-9da6-7d07eb16e7f3" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "running with adam\n", + "(Iteration 1 / 200) loss: 3.476928\n", + "(Epoch 0 / 5) train acc: 0.126000; val_acc: 0.110000\n", + "(Iteration 11 / 200) loss: 2.027712\n", + "(Iteration 21 / 200) loss: 2.183358\n", + "(Iteration 31 / 200) loss: 1.744257\n", + "(Epoch 1 / 5) train acc: 0.363000; val_acc: 0.330000\n", + "(Iteration 41 / 200) loss: 1.707951\n", + "(Iteration 51 / 200) loss: 1.703835\n", + "(Iteration 61 / 200) loss: 2.094758\n", + "(Iteration 71 / 200) loss: 1.505558\n", + "(Epoch 2 / 5) train acc: 0.419000; val_acc: 0.362000\n", + "(Iteration 81 / 200) loss: 1.594431\n", + "(Iteration 91 / 200) loss: 1.511452\n", + "(Iteration 101 / 200) loss: 1.389237\n", + "(Iteration 111 / 200) loss: 1.463575\n", + "(Epoch 3 / 5) train acc: 0.497000; val_acc: 0.368000\n", + "(Iteration 121 / 200) loss: 1.231313\n", + "(Iteration 131 / 200) loss: 1.520198\n", + "(Iteration 141 / 200) loss: 1.363221\n", + "(Iteration 151 / 200) loss: 1.355143\n", + "(Epoch 4 / 5) train acc: 0.543000; val_acc: 0.347000\n", 
+ "(Iteration 161 / 200) loss: 1.436402\n", + "(Iteration 171 / 200) loss: 1.231426\n", + "(Iteration 181 / 200) loss: 1.153575\n", + "(Iteration 191 / 200) loss: 1.209479\n", + "(Epoch 5 / 5) train acc: 0.619000; val_acc: 0.374000\n", + "\n", + "running with rmsprop\n", + "(Iteration 1 / 200) loss: 2.589166\n", + "(Epoch 0 / 5) train acc: 0.119000; val_acc: 0.146000\n", + "(Iteration 11 / 200) loss: 2.032921\n", + "(Iteration 21 / 200) loss: 1.897278\n", + "(Iteration 31 / 200) loss: 1.770793\n", + "(Epoch 1 / 5) train acc: 0.381000; val_acc: 0.320000\n", + "(Iteration 41 / 200) loss: 1.895731\n", + "(Iteration 51 / 200) loss: 1.681091\n", + "(Iteration 61 / 200) loss: 1.487204\n", + "(Iteration 71 / 200) loss: 1.629973\n", + "(Epoch 2 / 5) train acc: 0.429000; val_acc: 0.350000\n", + "(Iteration 81 / 200) loss: 1.506686\n", + "(Iteration 91 / 200) loss: 1.610742\n", + "(Iteration 101 / 200) loss: 1.486124\n", + "(Iteration 111 / 200) loss: 1.559454\n", + "(Epoch 3 / 5) train acc: 0.492000; val_acc: 0.361000\n", + "(Iteration 121 / 200) loss: 1.497406\n", + "(Iteration 131 / 200) loss: 1.530736\n", + "(Iteration 141 / 200) loss: 1.550957\n", + "(Iteration 151 / 200) loss: 1.652046\n", + "(Epoch 4 / 5) train acc: 0.530000; val_acc: 0.361000\n", + "(Iteration 161 / 200) loss: 1.599574\n", + "(Iteration 171 / 200) loss: 1.401073\n", + "(Iteration 181 / 200) loss: 1.509365\n", + "(Iteration 191 / 200) loss: 1.365772\n", + "(Epoch 5 / 5) train acc: 0.531000; val_acc: 0.369000\n", + "\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "learning_rates = {'rmsprop': 1e-4, 'adam': 1e-3}\n", + "for update_rule in ['adam', 'rmsprop']:\n", + " print('running with ', update_rule)\n", + " model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=5, batch_size=100,\n", + " update_rule=update_rule,\n", + " optim_config={\n", + " 'learning_rate': learning_rates[update_rule]\n", + " },\n", + " verbose=True)\n", + " solvers[update_rule] = solver\n", + " solver.train()\n", + " print()\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Training accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Validation accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "for update_rule, solver in list(solvers.items()):\n", + " plt.subplot(3, 1, 1)\n", + " plt.plot(solver.loss_history, 'o', label=update_rule)\n", + "\n", + " plt.subplot(3, 1, 2)\n", + " plt.plot(solver.train_acc_history, '-o', label=update_rule)\n", + "\n", + " plt.subplot(3, 1, 3)\n", + " plt.plot(solver.val_acc_history, '-o', label=update_rule)\n", + "\n", + "for i in [1, 2, 3]:\n", + " plt.subplot(3, 1, i)\n", + " plt.legend(loc='upper center', ncol=4)\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3A6eh2MiLo5u" + }, + "source": [ + "## Inline Question 3:\n", + "\n", + "AdaGrad, like Adam, is a per-parameter optimization method that uses the following update rule:\n", + "\n", + "```\n", + "cache += dw**2\n", + "w += - learning_rate * dw / (np.sqrt(cache) + eps)\n", + "```\n", + "\n", + "John notices that when he was training a network with AdaGrad that the updates became very small, and that his network was learning slowly. 
Using your knowledge of the AdaGrad update rule, why do you think the updates would become very small? Would Adam have the same issue?\n", + "\n", + "\n", + "## Answer:\n", + "The cache keeps accumulating the squared gradient `dw**2`, so it grows steadily as training proceeds. The denominator of the update to `w` therefore keeps growing, and the parameter updates become smaller and smaller.\n", + "\n", + "Adam does not have this problem. Its denominator is built from a decaying running average of the squared gradients rather than an ever-growing sum, so it is adjusted automatically as training proceeds and can become smaller again.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WGC9kpJULo5v" + }, + "source": [ + "# Train a good model!\n", + "Train the best fully-connected model that you can on CIFAR-10, storing your best model in the `best_model` variable. We require you to get at least 50% accuracy on the validation set using a fully-connected net.\n", + "\n", + "If you are careful it should be possible to get accuracies above 55%, but we don't require it for this part and won't assign extra credit for doing so. Later in the assignment we will ask you to train the best convolutional network that you can on CIFAR-10, and we would prefer that you spend your effort working on convolutional nets rather than fully-connected nets.\n", + "\n", + "You might find it useful to complete the `BatchNormalization.ipynb` and `Dropout.ipynb` notebooks before completing this part, since those techniques can help you train powerful models." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mj_C4Z3bLo5v", + "outputId": "636c4b56-5a02-480b-8223-56e2527b0191", + "scrolled": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "learning_rates: 0.001 weight_scale: 0.05 batch_size: 50 dropout_ratio: 1 batch_norm: True acc: 0.505\n", + "learning_rates: 0.001 weight_scale: 0.05 batch_size: 100 dropout_ratio: 1 batch_norm: True acc: 0.505\n", + "learning_rates: 0.0001 weight_scale: 0.05 batch_size: 50 dropout_ratio: 1 batch_norm: True acc: 0.5\n", + "learning_rates: 0.0001 weight_scale: 0.05 batch_size: 100 dropout_ratio: 1 batch_norm: True acc: 0.455\n" + ] + } + ], + "source": [ + "best_model = None\n", + "################################################################################\n", + "# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might #\n", + "# find batch/layer normalization and dropout useful. Store your best model in #\n", + "# the best_model variable. 
#\n", + "################################################################################\n", + "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "best_acc = -1\n", + "\n", + "# Hyperparameter grid: only learning rate, weight scale and batch size are searched.\n", + "learning_rates = [1e-3, 1e-4]\n", + "weight_scale = [5e-2]\n", + "dropout_keep_ratio = [1, 0.5]  # not searched below\n", + "batch_size = [50, 100]\n", + "normalize = [True, False]  # not searched below\n", + "\n", + "for lr in learning_rates:\n", + "    for w in weight_scale:\n", + "        for bs in batch_size:\n", + "            # Every model uses the default FullyConnectedNet settings (no dropout, no normalization).\n", + "            model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=w)\n", + "            solver = Solver(\n", + "                model,\n", + "                data,\n", + "                num_epochs=5,\n", + "                batch_size=bs,\n", + "                update_rule='adam',\n", + "                optim_config={'learning_rate': lr},\n", + "                verbose=False)\n", + "\n", + "            solver.train()\n", + "\n", + "            # Score this configuration on the validation set.\n", + "            acc = solver.check_accuracy(data['X_val'], data['y_val'], batch_size=100)\n", + "            print('learning_rates:', lr, 'weight_scale:', w, 'batch_size:', bs, 'acc:', acc)\n", + "\n", + "            # Keep the model with the highest validation accuracy so far.\n", + "            if acc > best_acc:\n", + "                best_acc = acc\n", + "                best_model = model\n", + "\n", + "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1caBDEvTLo5v" + }, + "source": [ + "# Test your model!\n", + "Run your best model on the validation and test sets. You should achieve above 50% accuracy on the validation set." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "y_test_pred = np.argmax(best_model.loss(data['X_test']), axis=1)\n", + "y_val_pred = np.argmax(best_model.loss(data['X_val']), axis=1)\n", + "print('Validation set accuracy: ', (y_val_pred == data['y_val']).mean())\n", + "print('Test set accuracy: ', (y_test_pred == data['y_test']).mean())" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qyozqKethC62" + }, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.2" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git "a/\352\261\264\353\247\220\354\262\234\352\260\204_Q2_BatchNormalization.ipynb" "b/\352\261\264\353\247\220\354\262\234\352\260\204_Q2_BatchNormalization.ipynb" new file mode 100644 index 0000000..5418485 --- /dev/null +++ "b/\352\261\264\353\247\220\354\262\234\352\260\204_Q2_BatchNormalization.ipynb" @@ -0,0 +1,1309 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "code", + "metadata": { + "id": "0Ed5BBM7gNY3", + "outputId": "38069997-4205-49c4-fe8f-6bd5d1bfa527", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 69 + } + }, + "source": [ + "# this mounts your Google Drive to the Colab VM.\n", + "from google.colab import drive\n", + "drive.mount('/content/drive', force_remount=True)\n", + "\n", + "# enter the foldername in your Drive where you have saved the unzipped\n", + "# assignment 
folder, e.g. 'cs231n/assignment2'\n", + "FOLDERNAME = 'cs231n/assignment2'\n", + "assert FOLDERNAME is not None, \"[!] Enter the foldername.\"\n", + "\n", + "# now that we've mounted your Drive, this ensures that\n", + "# the Python interpreter of the Colab VM can load\n", + "# python files from within it.\n", + "import sys\n", + "sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))\n", + "\n", + "# this downloads the CIFAR-10 dataset to your Drive\n", + "# if it doesn't already exist.\n", + "%cd drive/My\\ Drive/$FOLDERNAME/cs231n/datasets/\n", + "!bash get_datasets.sh\n", + "%cd /content" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Mounted at /content/drive\n", + "/content/drive/My Drive/cs231n/assignment2/cs231n/datasets\n", + "/content\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-title" + ], + "id": "pyyct9pngNY_" + }, + "source": [ + "# Batch Normalization\n", + "One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train.\n", + "One idea along these lines is batch normalization which was proposed by [1] in 2015.\n", + "\n", + "The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this will ensure that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance since they are output from earlier layers in the network. 
Even worse, during the training process the distribution of features at each layer of the network will shift as the weights of each layer are updated.\n", + "\n", + "The authors of [1] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [1] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.\n", + "\n", + "It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. 
To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.\n", + "\n", + "[1] [Sergey Ioffe and Christian Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing\n", + "Internal Covariate Shift\", ICML 2015.](https://arxiv.org/abs/1502.03167)" + ] + }, + { + "cell_type": "code", + "metadata": { + "tags": [ + "pdf-ignore" + ], + "id": "YuFDz93TgNZA" + }, + "source": [ + "# As usual, a bit of setup\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))\n", + "\n", + "def print_mean_std(x,axis=0):\n", + " print(' means: ', x.mean(axis=axis))\n", + " print(' stds: ', x.std(axis=axis))\n", + " print()" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "tags": [ + "pdf-ignore" + ], + "id": "gLSFC2dOgNZD", + "outputId": "3ec1bb80-7c73-43c8-cb33-accd2ad494ae", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 121 + } + }, + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.items():\n", + " print('%s: ' % k, v.shape)" + ], + "execution_count": null, + 
"outputs": [ + { + "output_type": "stream", + "text": [ + "X_train: (49000, 3, 32, 32)\n", + "y_train: (49000,)\n", + "X_val: (1000, 3, 32, 32)\n", + "y_val: (1000,)\n", + "X_test: (1000, 3, 32, 32)\n", + "y_test: (1000,)\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l5z_w447gNZH" + }, + "source": [ + "## Batch normalization: forward\n", + "In the file `cs231n/layers.py`, implement the batch normalization forward pass in the function `batchnorm_forward`. Once you have done so, run the following to test your implementation.\n", + "\n", + "Referencing the paper linked to above in [1] may be helpful!" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "NuYItfh3gNZH", + "outputId": "776461e5-20eb-4f34-940e-289536b3a532", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 225 + } + }, + "source": [ + "# Check the training-time forward pass by checking means and variances\n", + "# of features both before and after batch normalization\n", + "\n", + "# Simulate the forward pass for a two-layer network\n", + "np.random.seed(231)\n", + "N, D1, D2, D3 = 200, 50, 60, 3\n", + "X = np.random.randn(N, D1)\n", + "W1 = np.random.randn(D1, D2)\n", + "W2 = np.random.randn(D2, D3)\n", + "a = np.maximum(0, X.dot(W1)).dot(W2)\n", + "\n", + "print('Before batch normalization:')\n", + "print_mean_std(a,axis=0)\n", + "\n", + "gamma = np.ones((D3,))\n", + "beta = np.zeros((D3,))\n", + "# Means should be close to zero and stds close to one\n", + "print('After batch normalization (gamma=1, beta=0)')\n", + "a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})\n", + "print_mean_std(a_norm,axis=0)\n", + "\n", + "gamma = np.asarray([1.0, 2.0, 3.0])\n", + "beta = np.asarray([11.0, 12.0, 13.0])\n", + "# Now means should be close to beta and stds close to gamma\n", + "print('After batch normalization (gamma=', gamma, ', beta=', beta, ')')\n", + "a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})\n", + 
"print_mean_std(a_norm,axis=0)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Before batch normalization:\n", + " means: [ -2.3814598 -13.18038246 1.91780462]\n", + " stds: [27.18502186 34.21455511 37.68611762]\n", + "\n", + "After batch normalization (gamma=1, beta=0)\n", + " means: [5.99520433e-17 6.93889390e-17 8.32667268e-19]\n", + " stds: [0.99999999 1. 1. ]\n", + "\n", + "After batch normalization (gamma= [1. 2. 3.] , beta= [11. 12. 13.] )\n", + " means: [11. 12. 13.]\n", + " stds: [0.99999999 1.99999999 2.99999999]\n", + "\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "L2VW5k0ZgNZK", + "outputId": "33a5456d-fac8-4b6c-99d3-c9c34429145c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + } + }, + "source": [ + "# Check the test-time forward pass by running the training-time\n", + "# forward pass many times to warm up the running averages, and then\n", + "# checking the means and variances of activations after a test-time\n", + "# forward pass.\n", + "\n", + "np.random.seed(231)\n", + "N, D1, D2, D3 = 200, 50, 60, 3\n", + "W1 = np.random.randn(D1, D2)\n", + "W2 = np.random.randn(D2, D3)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "gamma = np.ones(D3)\n", + "beta = np.zeros(D3)\n", + "\n", + "for t in range(50):\n", + " X = np.random.randn(N, D1)\n", + " a = np.maximum(0, X.dot(W1)).dot(W2)\n", + " batchnorm_forward(a, gamma, beta, bn_param)\n", + "\n", + "bn_param['mode'] = 'test'\n", + "X = np.random.randn(N, D1)\n", + "a = np.maximum(0, X.dot(W1)).dot(W2)\n", + "a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)\n", + "\n", + "# Means should be close to zero and stds close to one, but will be\n", + "# noisier than training-time forward passes.\n", + "print('After batch normalization (test-time):')\n", + "print_mean_std(a_norm,axis=0)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "After batch 
normalization (test-time):\n", + " means: [-0.03927354 -0.04349152 -0.10452688]\n", + " stds: [1.01531428 1.01238373 0.97819988]\n", + "\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qH-ut9MsgNZO" + }, + "source": [ + "## Batch normalization: backward\n", + "Now implement the backward pass for batch normalization in the function `batchnorm_backward`.\n", + "\n", + "To derive the backward pass you should write out the computation graph for batch normalization and backprop through each of the intermediate nodes. Some intermediates may have multiple outgoing branches; make sure to sum gradients across these branches in the backward pass.\n", + "\n", + "Once you have finished, run the following to numerically check your backward pass." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "tr2QXX8ZgNZP", + "outputId": "a2ad89fb-d78f-4270-96dc-d3ecc31437f7", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 69 + } + }, + "source": [ + "# Gradient check batchnorm backward pass\n", + "np.random.seed(231)\n", + "N, D = 4, 5\n", + "x = 5 * np.random.randn(N, D) + 12\n", + "gamma = np.random.randn(D)\n", + "beta = np.random.randn(D)\n", + "dout = np.random.randn(N, D)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "fg = lambda a: batchnorm_forward(x, a, beta, bn_param)[0]\n", + "fb = lambda b: batchnorm_forward(x, gamma, b, bn_param)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)\n", + "db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)\n", + "\n", + "_, cache = batchnorm_forward(x, gamma, beta, bn_param)\n", + "dx, dgamma, dbeta = batchnorm_backward(dout, cache)\n", + "#You should expect to see relative errors between 1e-13 and 1e-8\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dgamma error: ', rel_error(da_num, 
dgamma))\n", + "print('dbeta error: ', rel_error(db_num, dbeta))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "dx error: 1.6674621912029909e-09\n", + "dgamma error: 7.417225040694815e-13\n", + "dbeta error: 2.379446949959628e-12\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H1nLncVFgNZS" + }, + "source": [ + "## Batch normalization: alternative backward\n", + "In class we talked about two different implementations for the sigmoid backward pass. One strategy is to write out a computation graph composed of simple operations and backprop through all intermediate values. Another strategy is to work out the derivatives on paper. For example, you can derive a very simple formula for the sigmoid function's backward pass by simplifying gradients on paper.\n", + "\n", + "Surprisingly, it turns out that you can do a similar simplification for the batch normalization backward pass too! \n", + "\n", + "In the forward pass, given a set of inputs $X=\\begin{bmatrix}x_1\\\\x_2\\\\...\\\\x_N\\end{bmatrix}$,\n", + "\n", + "we first calculate the mean $\\mu$ and variance $v$.\n", + "With $\\mu$ and $v$ calculated, we can calculate the standard deviation $\\sigma$ and normalized data $Y$.\n", + "The equations and graph illustration below describe the computation ($y_i$ is the i-th element of the vector $Y$).\n", + "\n", + "\\begin{align}\n", + "& \\mu=\\frac{1}{N}\\sum_{k=1}^N x_k & v=\\frac{1}{N}\\sum_{k=1}^N (x_k-\\mu)^2 \\\\\n", + "& \\sigma=\\sqrt{v+\\epsilon} & y_i=\\frac{x_i-\\mu}{\\sigma}\n", + "\\end{align}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dwekPDMUgNZT" + }, + "source": [ + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-ignore" + ], + "id": "0wCU5Y3PgNZU" + }, + "source": [ + "The meat of our problem during backpropagation is to compute $\\frac{\\partial L}{\\partial X}$, given the upstream gradient we receive, 
$\\frac{\\partial L}{\\partial Y}.$ To do this, recall the chain rule in calculus gives us $\\frac{\\partial L}{\\partial X} = \\frac{\\partial L}{\\partial Y} \\cdot \\frac{\\partial Y}{\\partial X}$.\n", + "\n", + "The unknown/hard part is $\\frac{\\partial Y}{\\partial X}$. We can find this by first deriving step-by-step our local gradients at\n", + "$\\frac{\\partial v}{\\partial X}$, $\\frac{\\partial \\mu}{\\partial X}$,\n", + "$\\frac{\\partial \\sigma}{\\partial v}$,\n", + "$\\frac{\\partial Y}{\\partial \\sigma}$, and $\\frac{\\partial Y}{\\partial \\mu}$,\n", + "and then using the chain rule to compose these gradients (which appear in the form of vectors!) appropriately to compute $\\frac{\\partial Y}{\\partial X}$.\n", + "\n", + "If it's challenging to directly reason about the gradients over $X$ and $Y$ which require matrix multiplication, try reasoning about the gradients in terms of individual elements $x_i$ and $y_i$ first: in that case, you will need to come up with the derivations for $\\frac{\\partial L}{\\partial x_i}$, by relying on the Chain Rule to first calculate the intermediate $\\frac{\\partial \\mu}{\\partial x_i}, \\frac{\\partial v}{\\partial x_i}, \\frac{\\partial \\sigma}{\\partial x_i},$ then assemble these pieces to calculate $\\frac{\\partial y_i}{\\partial x_i}$.\n", + "\n", + "You should make sure each of the intermediate gradient derivations is simplified as far as possible, for ease of implementation.\n", + "\n", + "After doing so, implement the simplified batch normalization backward pass in the function `batchnorm_backward_alt` and compare the two implementations by running the following. Your two implementations should compute nearly identical results, but the alternative implementation should be a bit faster."
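+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The element-wise route suggested above can be carried through as follows. This is a sketch for a single feature column under the definitions of $\\mu$, $v$, $\\sigma$ and $y_i$ given earlier; the symbol $\\delta_{ki}$ (1 if $k=i$, otherwise 0) is notation introduced here and does not appear in the assignment text.\n", + "\n", + "\\begin{align}\n", + "& \\frac{\\partial \\mu}{\\partial x_i}=\\frac{1}{N} & \\frac{\\partial v}{\\partial x_i}=\\frac{2}{N}(x_i-\\mu) \\\\\n", + "& \\frac{\\partial \\sigma}{\\partial x_i}=\\frac{x_i-\\mu}{N\\sigma}=\\frac{y_i}{N} & \\frac{\\partial y_k}{\\partial x_i}=\\frac{1}{\\sigma}\\left(\\delta_{ki}-\\frac{1}{N}-\\frac{y_k y_i}{N}\\right) \\\\\n", + "& \\frac{\\partial L}{\\partial x_i}=\\frac{1}{\\sigma}\\left(\\frac{\\partial L}{\\partial y_i}-\\frac{1}{N}\\sum_{k=1}^N \\frac{\\partial L}{\\partial y_k}-\\frac{y_i}{N}\\sum_{k=1}^N \\frac{\\partial L}{\\partial y_k}y_k\\right)\n", + "\\end{align}\n", + "\n", + "Assuming the layer output is $\\gamma y_i + \\beta$, which is consistent with the forward-pass checks above, the upstream term $\\frac{\\partial L}{\\partial y_i}$ would be `dout` scaled by $\\gamma$ for that feature. The final expression needs only two sums over the batch, so it should translate into a few vectorized lines."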
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "JQBH5QHPgNZV", + "outputId": "df668bb2-c47d-4181-b6e2-4de9be282117", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + } + }, + "source": [ + "np.random.seed(231)\n", + "N, D = 100, 500\n", + "x = 5 * np.random.randn(N, D) + 12\n", + "gamma = np.random.randn(D)\n", + "beta = np.random.randn(D)\n", + "dout = np.random.randn(N, D)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "out, cache = batchnorm_forward(x, gamma, beta, bn_param)\n", + "\n", + "t1 = time.time()\n", + "dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)\n", + "t2 = time.time()\n", + "dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)\n", + "t3 = time.time()\n", + "\n", + "print('dx difference: ', rel_error(dx1, dx2))\n", + "print('dgamma difference: ', rel_error(dgamma1, dgamma2))\n", + "print('dbeta difference: ', rel_error(dbeta1, dbeta2))\n", + "print('speedup: %.2fx' % ((t2 - t1) / (t3 - t2)))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "dx difference: 6.680652523862896e-13\n", + "dgamma difference: 0.0\n", + "dbeta difference: 0.0\n", + "speedup: 0.87x\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Nw8sJbXEgNZY" + }, + "source": [ + "## Fully Connected Nets with Batch Normalization\n", + "Now that you have a working implementation for batch normalization, go back to your `FullyConnectedNet` in the file `cs231n/classifiers/fc_net.py`. Modify your implementation to add batch normalization.\n", + "\n", + "Concretely, when the `normalization` flag is set to `\"batchnorm\"` in the constructor, you should insert a batch normalization layer before each ReLU nonlinearity. The outputs from the last layer of the network should not be normalized. 
Once you are done, run the following to gradient-check your implementation.\n", + "\n", + "HINT: You might find it useful to define an additional helper layer similar to those in the file `cs231n/layer_utils.py`. If you decide to do so, do it in the file `cs231n/classifiers/fc_net.py`." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "K0DrUkyJgNZZ", + "outputId": "20c3ec45-f58c-464c-e37b-f3b555368300", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 449 + } + }, + "source": [ + "np.random.seed(231)\n", + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "# You should expect losses between 1e-4~1e-10 for W,\n", + "# losses between 1e-08~1e-10 for b,\n", + "# and losses between 1e-08~1e-09 for beta and gammas.\n", + "for reg in [0, 3.14]:\n", + " print('Running check with reg = ', reg)\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " reg=reg, weight_scale=5e-2, dtype=np.float64,\n", + " normalization='batchnorm')\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print('Initial loss: ', loss)\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))\n", + " if reg == 0: print()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Running check with reg = 0\n", + "Initial loss: 2.2611955101340957\n", + "W1 relative error: 1.10e-04\n", + "W2 relative error: 3.11e-06\n", + "W3 relative error: 4.05e-10\n", + "b1 relative error: 4.44e-08\n", + "b2 relative error: 2.22e-08\n", + "b3 relative error: 1.01e-10\n", + "beta1 relative error: 7.33e-09\n", + "beta2 relative error: 1.89e-09\n", + "gamma1 relative error: 6.96e-09\n", + "gamma2 relative error: 2.41e-09\n", + "\n", + "Running check with reg = 
3.14\n", + "Initial loss: 6.996533220108303\n", + "W1 relative error: 1.98e-06\n", + "W2 relative error: 2.28e-06\n", + "W3 relative error: 1.11e-08\n", + "b1 relative error: 5.55e-09\n", + "b2 relative error: 2.22e-08\n", + "b3 relative error: 2.10e-10\n", + "beta1 relative error: 6.65e-09\n", + "beta2 relative error: 3.39e-09\n", + "gamma1 relative error: 6.27e-09\n", + "gamma2 relative error: 5.28e-09\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DbBnmeEYgNZe" + }, + "source": [ + "# Batchnorm for deep networks\n", + "Run the following to train a six-layer network on a subset of 1000 training examples both with and without batch normalization." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "j_XgTX-igNZe", + "outputId": "4530d78f-1713-4e95-fcc3-c4c731a5acd9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 795 + } + }, + "source": [ + "np.random.seed(231)\n", + "# Try training a very deep net with batchnorm\n", + "hidden_dims = [100, 100, 100, 100, 100]\n", + "\n", + "num_train = 1000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "weight_scale = 2e-2\n", + "bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')\n", + "model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)\n", + "\n", + "print('Solver with batch norm:')\n", + "bn_solver = Solver(bn_model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True,print_every=20)\n", + "bn_solver.train()\n", + "\n", + "print('\\nSolver without batch norm:')\n", + "solver = Solver(model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 
'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=20)\n", + "solver.train()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Solver with batch norm:\n", + "(Iteration 1 / 200) loss: 2.340974\n", + "(Epoch 0 / 10) train acc: 0.107000; val_acc: 0.115000\n", + "(Epoch 1 / 10) train acc: 0.314000; val_acc: 0.266000\n", + "(Iteration 21 / 200) loss: 2.039345\n", + "(Epoch 2 / 10) train acc: 0.395000; val_acc: 0.280000\n", + "(Iteration 41 / 200) loss: 2.047471\n", + "(Epoch 3 / 10) train acc: 0.484000; val_acc: 0.316000\n", + "(Iteration 61 / 200) loss: 1.739554\n", + "(Epoch 4 / 10) train acc: 0.525000; val_acc: 0.318000\n", + "(Iteration 81 / 200) loss: 1.247064\n", + "(Epoch 5 / 10) train acc: 0.601000; val_acc: 0.338000\n", + "(Iteration 101 / 200) loss: 1.333654\n", + "(Epoch 6 / 10) train acc: 0.627000; val_acc: 0.323000\n", + "(Iteration 121 / 200) loss: 1.036104\n", + "(Epoch 7 / 10) train acc: 0.695000; val_acc: 0.332000\n", + "(Iteration 141 / 200) loss: 1.140680\n", + "(Epoch 8 / 10) train acc: 0.723000; val_acc: 0.298000\n", + "(Iteration 161 / 200) loss: 0.705776\n", + "(Epoch 9 / 10) train acc: 0.760000; val_acc: 0.322000\n", + "(Iteration 181 / 200) loss: 0.906980\n", + "(Epoch 10 / 10) train acc: 0.771000; val_acc: 0.315000\n", + "\n", + "Solver without batch norm:\n", + "(Iteration 1 / 200) loss: 2.302332\n", + "(Epoch 0 / 10) train acc: 0.129000; val_acc: 0.131000\n", + "(Epoch 1 / 10) train acc: 0.283000; val_acc: 0.250000\n", + "(Iteration 21 / 200) loss: 2.041970\n", + "(Epoch 2 / 10) train acc: 0.316000; val_acc: 0.277000\n", + "(Iteration 41 / 200) loss: 1.900473\n", + "(Epoch 3 / 10) train acc: 0.373000; val_acc: 0.282000\n", + "(Iteration 61 / 200) loss: 1.713156\n", + "(Epoch 4 / 10) train acc: 0.390000; val_acc: 0.310000\n", + "(Iteration 81 / 200) loss: 1.662209\n", + "(Epoch 5 / 10) train acc: 0.434000; val_acc: 0.300000\n", + "(Iteration 101 / 200) loss: 1.696059\n", + "(Epoch 6 
/ 10) train acc: 0.535000; val_acc: 0.345000\n", + "(Iteration 121 / 200) loss: 1.557987\n", + "(Epoch 7 / 10) train acc: 0.530000; val_acc: 0.304000\n", + "(Iteration 141 / 200) loss: 1.432189\n", + "(Epoch 8 / 10) train acc: 0.628000; val_acc: 0.339000\n", + "(Iteration 161 / 200) loss: 1.033932\n", + "(Epoch 9 / 10) train acc: 0.661000; val_acc: 0.340000\n", + "(Iteration 181 / 200) loss: 0.901034\n", + "(Epoch 10 / 10) train acc: 0.726000; val_acc: 0.318000\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8LUeqy6LgNZj" + }, + "source": [ + "Run the following to visualize the results from two networks trained above. You should find that using batch normalization helps the network to converge much faster." + ] + }, + { + "cell_type": "code", + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "id": "xIfiZ96hgNZj", + "outputId": "5e35088b-d8c5-4d64-e990-3f14ae0fa97e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 893 + } + }, + "source": [ + "def plot_training_history(title, label, baseline, bn_solvers, plot_fn, bl_marker='.', bn_marker='.', labels=None):\n", + " \"\"\"utility function for plotting training history\"\"\"\n", + " plt.title(title)\n", + " plt.xlabel(label)\n", + " bn_plots = [plot_fn(bn_solver) for bn_solver in bn_solvers]\n", + " bl_plot = plot_fn(baseline)\n", + " num_bn = len(bn_plots)\n", + " for i in range(num_bn):\n", + " label='with_norm'\n", + " if labels is not None:\n", + " label += str(labels[i])\n", + " plt.plot(bn_plots[i], bn_marker, label=label)\n", + " label='baseline'\n", + " if labels is not None:\n", + " label += str(labels[0])\n", + " plt.plot(bl_plot, bl_marker, label=label)\n", + " plt.legend(loc='lower center', ncol=num_bn+1)\n", + "\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plot_training_history('Training loss','Iteration', solver, [bn_solver], \\\n", + " lambda x: x.loss_history, bl_marker='o', bn_marker='o')\n", + "plt.subplot(3, 1, 2)\n", + 
"plot_training_history('Training accuracy','Epoch', solver, [bn_solver], \\\n", + " lambda x: x.train_acc_history, bl_marker='-o', bn_marker='-o')\n", + "plt.subplot(3, 1, 3)\n", + "plot_training_history('Validation accuracy','Epoch', solver, [bn_solver], \\\n", + " lambda x: x.val_acc_history, bl_marker='-o', bn_marker='-o')\n", + "\n", + "plt.gcf().set_size_inches(15, 15)\n", + "\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xIWsuQf5gNZn" + }, + "source": [ + "# Batch normalization and initialization\n", + "We will now run a small experiment to study the interaction of batch normalization and weight initialization.\n", + "\n", + "The first cell will train 8-layer networks both with and without batch normalization using different scales for weight initialization. The second layer will plot training accuracy, validation set accuracy, and training loss as a function of the weight initialization scale." + ] + }, + { + "cell_type": "code", + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "id": "nJ4dJpNpgNZo", + "outputId": "27fce21f-e03a-4d37-ff82-97cba852abee", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 363 + } + }, + "source": [ + "np.random.seed(231)\n", + "# Try training a very deep net with batchnorm\n", + "hidden_dims = [50, 50, 50, 50, 50, 50, 50]\n", + "num_train = 1000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "bn_solvers_ws = {}\n", + "solvers_ws = {}\n", + "weight_scales = np.logspace(-4, 0, num=20)\n", + "for i, weight_scale in enumerate(weight_scales):\n", + " print('Running weight scale %d / %d' % (i + 1, len(weight_scales)))\n", + " bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization='batchnorm')\n", + " model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)\n", + "\n", + " bn_solver = Solver(bn_model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=False, print_every=200)\n", + " bn_solver.train()\n", + " bn_solvers_ws[weight_scale] = bn_solver\n", + "\n", + " solver = 
Solver(model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=False, print_every=200)\n", + " solver.train()\n", + " solvers_ws[weight_scale] = solver" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Running weight scale 1 / 20\n", + "Running weight scale 2 / 20\n", + "Running weight scale 3 / 20\n", + "Running weight scale 4 / 20\n", + "Running weight scale 5 / 20\n", + "Running weight scale 6 / 20\n", + "Running weight scale 7 / 20\n", + "Running weight scale 8 / 20\n", + "Running weight scale 9 / 20\n", + "Running weight scale 10 / 20\n", + "Running weight scale 11 / 20\n", + "Running weight scale 12 / 20\n", + "Running weight scale 13 / 20\n", + "Running weight scale 14 / 20\n", + "Running weight scale 15 / 20\n", + "Running weight scale 16 / 20\n", + "Running weight scale 17 / 20\n", + "Running weight scale 18 / 20\n", + "Running weight scale 19 / 20\n", + "Running weight scale 20 / 20\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "id": "PajkSWVhgNZr", + "outputId": "fa8155b5-d62a-4662-ee6b-ca65f62f5faa", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 897 + } + }, + "source": [ + "# Plot results of weight scale experiment\n", + "best_train_accs, bn_best_train_accs = [], []\n", + "best_val_accs, bn_best_val_accs = [], []\n", + "final_train_loss, bn_final_train_loss = [], []\n", + "\n", + "for ws in weight_scales:\n", + " best_train_accs.append(max(solvers_ws[ws].train_acc_history))\n", + " bn_best_train_accs.append(max(bn_solvers_ws[ws].train_acc_history))\n", + "\n", + " best_val_accs.append(max(solvers_ws[ws].val_acc_history))\n", + " bn_best_val_accs.append(max(bn_solvers_ws[ws].val_acc_history))\n", + "\n", + " final_train_loss.append(np.mean(solvers_ws[ws].loss_history[-100:]))\n", + " 
bn_final_train_loss.append(np.mean(bn_solvers_ws[ws].loss_history[-100:]))\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Best val accuracy vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Best val accuracy')\n", + "plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')\n", + "plt.legend(ncol=2, loc='lower right')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Best train accuracy vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Best training accuracy')\n", + "plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')\n", + "plt.legend()\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Final training loss vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Final training loss')\n", + "plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')\n", + "plt.legend()\n", + "plt.gca().set_ylim(1.0, 3.5)\n", + "\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-inline" + ], + "id": "CwhbowpBgNZu" + }, + "source": [ + "## Inline Question 1:\n", + "Describe the results of this experiment. How does the scale of weight initialization affect models with/without batch normalization differently, and why?\n", + "\n", + "## Answer:\n", + "가중치 초기화를 할 때, 너무 크거나 작은 가중치로 시작하면 그래디언트 소실 또는 그래디언트 폭발과 같은 문제가 발생할 수 있으며 추후 이 문제들은 신경망이 제대로 학습할 수 없도록 한다.\n", + "\n", + "이때 배치 정규화를 사용하면 가중치 초기화의 스케일이 학습에 미치는 영향이 줄어들게 된다. 배치 정규화란 각 계층의 출력을 정규화하여, 그래디언트 소실 또는 그래디언트 폭발 문제를 완화시키는 기술이다.\n", + "\n", + "또한 배치 정규화는 내부 공변량 이동(internal covariate shift) 문제를 완화한다. 내부 공변량 이동은 학습 도중 각 계층의 입력 분포가 변화하는 현상을 말하는데, 이는 신경망 학습을 느리게 만들거나 학습 과정을 불안정하게 만들 수 있다. 이런 경우에 배치 정규화를 사용하면 각 계층의 입력 분포가 일정하게 유지되므로 해당 문제를 해결할 수 있다.\n", + "\n", + "즉 배치 정규화는 가중치 초기화의 스케일에 덜 민감하게 만들어 신경망의 성능을 향상시키는 데 도움이 된다. 각 계층의 출력을 배치 정규화 계층에서 강제로 정규화하기 때문에, 가중치 스케일이 잘 최적화되지 않았더라도 배치 정규화 모델은 적절한 정확도를 보인다. 반면 배치 정규화를 사용하지 않는 모델은 가중치 스케일이 잘 최적화되지 않을 경우 정확도가 떨어지는데, 가중치 초기화의 스케일은 배치 정규화를 사용하지 않는 모델에 더 큰 영향을 미치기 때문이다.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qv0zmRnlgNZv" + }, + "source": [ + "# Batch normalization and batch size\n", + "We will now run a small experiment to study the interaction of batch normalization and batch size.\n", + "\n", + "The first cell will train 6-layer networks both with and without batch normalization using different batch sizes. The second layer will plot training accuracy and validation set accuracy over time." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "id": "bRWP3mhxgNZv", + "outputId": "a31e2247-143d-4905-d524-977c93ea43c0", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 86 + } + }, + "source": [ + "def run_batchsize_experiments(normalization_mode):\n", + " np.random.seed(231)\n", + " # Try training a very deep net with batchnorm\n", + " hidden_dims = [100, 100, 100, 100, 100]\n", + " num_train = 1000\n", + " small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + " }\n", + " n_epochs=10\n", + " weight_scale = 2e-2\n", + " batch_sizes = [5,10,50]\n", + " lr = 10**(-3.5)\n", + " solver_bsize = batch_sizes[0]\n", + "\n", + " print('No normalization: batch size = ',solver_bsize)\n", + " model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=None)\n", + " solver = Solver(model, small_data,\n", + " num_epochs=n_epochs, batch_size=solver_bsize,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': lr,\n", + " },\n", + " verbose=False)\n", + " solver.train()\n", + "\n", + " bn_solvers = []\n", + " for i in range(len(batch_sizes)):\n", + " b_size=batch_sizes[i]\n", + " print('Normalization: batch size = ',b_size)\n", + " bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, normalization=normalization_mode)\n", + " bn_solver = Solver(bn_model, small_data,\n", + " num_epochs=n_epochs, batch_size=b_size,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': lr,\n", + " },\n", + " verbose=False)\n", + " bn_solver.train()\n", + " bn_solvers.append(bn_solver)\n", + "\n", + " return bn_solvers, solver, batch_sizes\n", + "\n", + "batch_sizes = [5,10,50]\n", + "bn_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('batchnorm')" + ], + "execution_count": null, + "outputs": [ + { + "output_type": 
"stream", + "text": [ + "No normalization: batch size = 5\n", + "Normalization: batch size = 5\n", + "Normalization: batch size = 10\n", + "Normalization: batch size = 50\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "l8m9PQSigNZz", + "outputId": "00762360-156a-48f1-9219-0f335be3c243", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 621 + } + }, + "source": [ + "plt.subplot(2, 1, 1)\n", + "plot_training_history('Training accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \\\n", + " lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n", + "plt.subplot(2, 1, 2)\n", + "plot_training_history('Validation accuracy (Batch Normalization)','Epoch', solver_bsize, bn_solvers_bsize, \\\n", + " lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n", + "\n", + "plt.gcf().set_size_inches(15, 10)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-inline" + ], + "id": "EOGo5xmGgNZ3" + }, + "source": [ + "## Inline Question 2:\n", + "Describe the results of this experiment. What does this imply about the relationship between batch normalization and batch size? Why is this relationship observed?\n", + "\n", + "## Answer:\n", + "배치 크기가 클수록 결과가 더 좋은 것으로 나타난다. 이는 큰 배치에서 더 많은 데이터를 통해 더 정확한 평균과 분산을 계산할 수 있기 때문이다. 반면 배치 크기가 작은 경우, 평균과 분산의 추정치가 부정확해지며, 이로 인해 배치 정규화의 효과가 감소하게 된다.\n", + "\n", + "예를 들어 배치 크기라 5와 같이 매우 작은 경우, 해당 배치의 데이터는 전체 데이터 분포를 정확히 반영하지 못할 가능성이 높다. 이러한 경우 배치 정규화는 이처럼 작은 배치의 평균과 분산을 사용하여 데이터를 정규화하는데, 이때 해당 배치의 평균과 분산이 전체 데이터의 평균과 분산과 크게 다르게 되면 배치 정규화 후의 데이터는 실제 데이터 분포를 잘 반영하지 못하게 되는 것이다. 결과적으로 이는 신경망 성능을 저하시키는 주요 원인이 될 수 있다." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xfrb6lVTgNZ3" + }, + "source": [ + "# Layer Normalization\n", + "Batch normalization has proved to be effective in making networks easier to train, but the dependency on batch size makes it less useful in complex networks which have a cap on the input batch size due to hardware limitations.\n", + "\n", + "Several alternatives to batch normalization have been proposed to mitigate this problem; one such technique is Layer Normalization [2]. Instead of normalizing over the batch, we normalize over the features. In other words, when using Layer Normalization, each feature vector corresponding to a single datapoint is normalized based on the sum of all terms within that feature vector.\n", + "\n", + "[2] [Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. 
\"Layer Normalization.\" stat 1050 (2016): 21.](https://arxiv.org/pdf/1607.06450.pdf)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-inline" + ], + "id": "4vtI3yupgNZ4" + }, + "source": [ + "## Inline Question 3:\n", + "Which of these data preprocessing steps is analogous to batch normalization, and which is analogous to layer normalization?\n", + "\n", + "1. Scaling each image in the dataset, so that the RGB channels for each row of pixels within an image sums up to 1.\n", + "2. Scaling each image in the dataset, so that the RGB channels for all pixels within an image sums up to 1. \n", + "3. Subtracting the mean image of the dataset from each image in the dataset.\n", + "4. Setting all RGB values to either 0 or 1 depending on a given threshold.\n", + "\n", + "## Answer:\n", + "\n", + "1&3은 배치 정규화와 유사한 반면 2&4는 레이어 정규화와 유사하다. 우선 1번의 경우 각 픽셀 행에 대해 RGB 채널의 합이 1이 되도록 스케일링 하고있으며, 3번의 경우 전체 데이터셋의 평균 이미지를 각 이미지에서 빼는 과정이다. 이 두 가지 모두 픽셀 단위로 작업을 수행하며, 이는 배치 정규화가 각 피쳐별로 정규화를 수행하는 것과 유사하다고 볼 수 있다. 여기서 전체 데이터셋의 평균 이미지가 배치 평균에 해당하며, 이를 각 이미지에서 빼내는 방식으로 정규화를 수행한다.\n", + "\n", + "반면 2번의 경우 각 이미지의 모든 픽셀에 대해 RGB 채널의 합이 1이 되도록 스케일링하고, 4번에서는 모든 RGB 값을 0 또는 1로 설정한다. 이 두 가지 모두 이미지 전체에 대해 작업을 수행하며, 이는 레이어 정규화가 각 층의 출력을 독립적으로 정규화하는 것과 유사하다고 볼 수 있다. 여기서 각 이미지가 하나의 층으로 간주되며, 이미지 내의 모든 픽셀이 동일한 스케일링을 받는다.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uR7ql9xWgNZ5" + }, + "source": [ + "# Layer Normalization: Implementation\n", + "\n", + "Now you'll implement layer normalization. This step should be relatively straightforward, as conceptually the implementation is almost identical to that of batch normalization. 
One significant difference though is that for layer normalization, we do not keep track of the moving moments, and the testing phase is identical to the training phase, where the mean and variance are directly calculated per datapoint.\n", + "\n", + "Here's what you need to do:\n", + "\n", + "* In `cs231n/layers.py`, implement the forward pass for layer normalization in the function `layernorm_forward`.\n", + "\n", + "Run the cell below to check your results.\n", + "* In `cs231n/layers.py`, implement the backward pass for layer normalization in the function `layernorm_backward`.\n", + "\n", + "Run the second cell below to check your results.\n", + "* Modify `cs231n/classifiers/fc_net.py` to add layer normalization to the `FullyConnectedNet`. When the `normalization` flag is set to `\"layernorm\"` in the constructor, you should insert a layer normalization layer before each ReLU nonlinearity.\n", + "\n", + "Run the third cell below to run the batch size experiment on layer normalization." 
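+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To make the per-datapoint statistics concrete, the forward computation can be written as below. This is a sketch that mirrors the batch normalization equations earlier in this notebook, with the sums running over the $D$ features of a single datapoint $x$ rather than over the $N$ datapoints in a batch. The symbols $\\hat{x}_j$ and $\\text{out}_j$ are notation introduced here, and placing $\\epsilon$ inside the square root is an assumption carried over from those equations.\n", + "\n", + "\\begin{align}\n", + "& \\mu=\\frac{1}{D}\\sum_{j=1}^D x_j & v=\\frac{1}{D}\\sum_{j=1}^D (x_j-\\mu)^2 \\\\\n", + "& \\hat{x}_j=\\frac{x_j-\\mu}{\\sqrt{v+\\epsilon}} & \\text{out}_j=\\gamma_j\\hat{x}_j+\\beta_j\n", + "\\end{align}\n", + "\n", + "Because $\\mu$ and $v$ depend only on the datapoint itself, nothing changes between training and test time. The gradient with respect to $x$ takes the same form as in batch normalization with the sums running over features, while the gradients with respect to $\\gamma$ and $\\beta$ are still summed over the batch."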
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "bHKlKrtJgNZ9", + "outputId": "d3e18e60-5e18-41ef-901a-7013a24401a8", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 225 + } + }, + "source": [ + "# Check the training-time forward pass by checking means and variances\n", + "# of features both before and after layer normalization\n", + "\n", + "# Simulate the forward pass for a two-layer network\n", + "np.random.seed(231)\n", + "N, D1, D2, D3 =4, 50, 60, 3\n", + "X = np.random.randn(N, D1)\n", + "W1 = np.random.randn(D1, D2)\n", + "W2 = np.random.randn(D2, D3)\n", + "a = np.maximum(0, X.dot(W1)).dot(W2)\n", + "\n", + "print('Before layer normalization:')\n", + "print_mean_std(a,axis=1)\n", + "\n", + "gamma = np.ones(D3)\n", + "beta = np.zeros(D3)\n", + "# Means should be close to zero and stds close to one\n", + "print('After layer normalization (gamma=1, beta=0)')\n", + "a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})\n", + "print_mean_std(a_norm,axis=1)\n", + "\n", + "gamma = np.asarray([3.0,3.0,3.0])\n", + "beta = np.asarray([5.0,5.0,5.0])\n", + "# Now means should be close to beta and stds close to gamma\n", + "print('After layer normalization (gamma=', gamma, ', beta=', beta, ')')\n", + "a_norm, _ = layernorm_forward(a, gamma, beta, {'mode': 'train'})\n", + "print_mean_std(a_norm,axis=1)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Before layer normalization:\n", + " means: [-59.06673243 -47.60782686 -43.31137368 -26.40991744]\n", + " stds: [10.07429373 28.39478981 35.28360729 4.01831507]\n", + "\n", + "After layer normalization (gamma=1, beta=0)\n", + " means: [ 4.81096644e-16 -7.40148683e-17 2.22044605e-16 -5.92118946e-16]\n", + " stds: [0.99999995 0.99999999 1. 0.99999969]\n", + "\n", + "After layer normalization (gamma= [3. 3. 3.] , beta= [5. 5. 5.] )\n", + " means: [5. 5. 5. 
5.]\n", + " stds: [2.99999985 2.99999998 2.99999999 2.99999907]\n", + "\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "10CwNM3zgNaB", + "outputId": "ff533c30-6279-42be-8dcd-cb41b434e046", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 69 + } + }, + "source": [ + "# Gradient check layernorm backward pass\n", + "np.random.seed(231)\n", + "N, D = 4, 5\n", + "x = 5 * np.random.randn(N, D) + 12\n", + "gamma = np.random.randn(D)\n", + "beta = np.random.randn(D)\n", + "dout = np.random.randn(N, D)\n", + "\n", + "ln_param = {}\n", + "fx = lambda x: layernorm_forward(x, gamma, beta, ln_param)[0]\n", + "fg = lambda a: layernorm_forward(x, a, beta, ln_param)[0]\n", + "fb = lambda b: layernorm_forward(x, gamma, b, ln_param)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "da_num = eval_numerical_gradient_array(fg, gamma.copy(), dout)\n", + "db_num = eval_numerical_gradient_array(fb, beta.copy(), dout)\n", + "\n", + "_, cache = layernorm_forward(x, gamma, beta, ln_param)\n", + "dx, dgamma, dbeta = layernorm_backward(dout, cache)\n", + "\n", + "# You should expect to see relative errors between 1e-12 and 1e-8\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dgamma error: ', rel_error(da_num, dgamma))\n", + "print('dbeta error: ', rel_error(db_num, dbeta))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "dx error: 2.107277051835058e-09\n", + "dgamma error: 4.519489546032799e-12\n", + "dbeta error: 2.5842537629899423e-12\n" + ], + "name": "stdout" + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3oGiJPu_gNaF" + }, + "source": [ + "# Layer Normalization and batch size\n", + "\n", + "We will now run the previous batch size experiment with layer normalization instead of batch normalization. Compared to the previous experiment, you should see a markedly smaller influence of batch size on the training history!" 
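
One way to see why batch size should matter less here is to normalize the same example inside two different batches. The sketch below is only an illustration under simplifying assumptions: `layer_normalize` and `batch_normalize` are hypothetical helpers, they omit the learnable scale and shift, and they use plain Python lists rather than the assignment's NumPy layers.

```python
import math

def layer_normalize(batch, eps=1e-5):
    """Normalize each row using that row's own mean and variance."""
    out = []
    for row in batch:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def batch_normalize(batch, eps=1e-5):
    """Normalize each column (feature) using statistics over the batch."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [[(v - m) / math.sqrt(s + eps)
             for v, m, s in zip(row, means, variances)]
            for row in batch]

example = [1.0, 2.0, 4.0]
batch_a = [example, [0.0, 5.0, 1.0]]                    # batch of 2
batch_b = [example, [9.0, -3.0, 7.0], [2.0, 2.0, 2.0]]  # batch of 3

ln_a, ln_b = layer_normalize(batch_a)[0], layer_normalize(batch_b)[0]
bn_a, bn_b = batch_normalize(batch_a)[0], batch_normalize(batch_b)[0]

print(ln_a == ln_b)  # layer norm ignores the rest of the batch
print(bn_a == bn_b)  # batch norm output changes with the batch contents
```

The layer-normalized example is identical in both batches, while the batch-normalized one shifts with the batch statistics, which become noisier as the batch shrinks.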
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "vf9Aa3rngNaG", + "outputId": "9fd576b1-de98-4c66-b287-219dd9e1bea4", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 690 + } + }, + "source": [ + "ln_solvers_bsize, solver_bsize, batch_sizes = run_batchsize_experiments('layernorm')\n", + "\n", + "plt.subplot(2, 1, 1)\n", + "plot_training_history('Training accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \\\n", + " lambda x: x.train_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n", + "plt.subplot(2, 1, 2)\n", + "plot_training_history('Validation accuracy (Layer Normalization)','Epoch', solver_bsize, ln_solvers_bsize, \\\n", + " lambda x: x.val_acc_history, bl_marker='-^', bn_marker='-o', labels=batch_sizes)\n", + "\n", + "plt.gcf().set_size_inches(15, 10)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "text": [ + "No normalization: batch size = 5\n", + "Normalization: batch size = 5\n", + "Normalization: batch size = 10\n", + "Normalization: batch size = 50\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "tags": [], + "needs_background": "light" + } + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-inline" + ], + "id": "HggXBdBUgNaI" + }, + "source": [ + "## Inline Question 4:\n", + "When is layer normalization likely to not work well, and why?\n", + "\n", + "1. Using it in a very deep network\n", + "2. Having a very small dimension of features\n", + "3. Having a high regularization term\n", + "\n", + "\n", + "## Answer:\n", + "Option 2, a very small feature dimension. Layer normalization is applied to each sample independently, computing the mean and variance over all input features of a given layer. When the feature dimension is very small, these statistics are estimated from only a few values and become sensitive to noise, so the computed mean and variance may not reflect the actual characteristics of the data.\n", + "\n", + "In other words, with few features each layer carries little information, so the statistics of those features are a poor basis for normalization. This is the same problem batch normalization has with small batches: the mean and variance of a small batch may not accurately represent those of the full dataset.\n" + ] + } + ] +} \ No newline at end of file diff --git "a/\352\261\264\353\247\220\354\262\234\352\260\204_Q3_Dropout.ipynb" "b/\352\261\264\353\247\220\354\262\234\352\260\204_Q3_Dropout.ipynb" new file mode 100644 index 0000000..d8170fd --- /dev/null +++ "b/\352\261\264\353\247\220\354\262\234\352\260\204_Q3_Dropout.ipynb" @@ -0,0 +1,617 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "cells": [ + { + "cell_type": "code", + "metadata": { + "id": "tfPkYvCIKpkX", + "outputId": "9496ffe9-9bbe-4ac4-ad7d-0943156c9e69", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# This mounts your Google Drive to the Colab VM.\n", + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "# Enter the foldername in your Drive where you have saved the unzipped\n", + "# assignment folder, e.g. 
'cs231n/assignments/assignment1/'\n", + "FOLDERNAME = \"euron_cs231n\"\n", + "assert FOLDERNAME is not None, \"[!] Enter the foldername.\"\n", + "\n", + "# Now that we've mounted your Drive, this ensures that\n", + "# the Python interpreter of the Colab VM can load\n", + "# python files from within it.\n", + "import sys\n", + "sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))\n", + "\n", + "# This downloads the CIFAR-10 dataset to your Drive\n", + "# if it doesn't already exist.\n", + "#%cd /content/drive/My\\ Drive/$FOLDERNAME/cs231n/datasets/\n", + "#!bash get_datasets.sh\n", + "%cd /content/drive/My\\ Drive/$FOLDERNAME" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Mounted at /content/drive\n", + "/content/drive/My Drive/euron_cs231n\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-title" + ], + "id": "kQkZ_okoKpka" + }, + "source": [ + "# Dropout\n", + "Dropout [1] is a technique for regularizing neural networks by randomly setting some output activations to zero during the forward pass. In this exercise you will implement a dropout layer and modify your fully-connected network to optionally use dropout.\n", + "\n", + "[1] [Geoffrey E. 
Hinton et al, \"Improving neural networks by preventing co-adaptation of feature detectors\", arXiv 2012](https://arxiv.org/abs/1207.0580)" + ] + }, + { + "cell_type": "code", + "metadata": { + "tags": [ + "pdf-ignore" + ], + "id": "qqih8tM3Kpkb", + "outputId": "ad448845-5fe6-49c3-fca0-ea70e35f158b", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# As usual, a bit of setup\n", + "from __future__ import print_function\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "=========== You can safely ignore the message below if you are NOT working on ConvolutionalNetworks.ipynb ===========\n", + "\tYou will need to compile a Cython extension for a portion of this assignment.\n", + "\tThe instructions to do this will be given in a section of the notebook below.\n", + "\tThere will be an option for Colab users and another for Jupyter (local) users.\n" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "tags": [ + "pdf-ignore" + ], + "id": "vsIl9Fa7Kpkd", + "outputId": 
"8dd91d66-ef04-4347-d60e-53df01e704cb", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.items():\n", + " print('%s: ' % k, v.shape)" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "X_train: (49000, 3, 32, 32)\n", + "y_train: (49000,)\n", + "X_val: (1000, 3, 32, 32)\n", + "y_val: (1000,)\n", + "X_test: (1000, 3, 32, 32)\n", + "y_test: (1000,)\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rOgSfusHKpkg" + }, + "source": [ + "# Dropout forward pass\n", + "In the file `cs231n/layers.py`, implement the forward pass for dropout. Since dropout behaves differently during training and testing, make sure to implement the operation for both modes.\n", + "\n", + "Once you have done so, run the cell below to test your implementation." + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "2JYhCAqhKpkg", + "outputId": "cd9835da-163a-412b-d0b0-ded4a3c3b901", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(500, 500) + 10\n", + "\n", + "for p in [0.25, 0.4, 0.7]:\n", + " out, _ = dropout_forward(x, {'mode': 'train', 'p': p})\n", + " out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})\n", + "\n", + " print('Running tests with p = ', p)\n", + " print('Mean of input: ', x.mean())\n", + " print('Mean of train-time output: ', out.mean())\n", + " print('Mean of test-time output: ', out_test.mean())\n", + " print('Fraction of train-time output set to zero: ', (out == 0).mean())\n", + " print('Fraction of test-time output set to zero: ', (out_test == 0).mean())\n", + " print()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Running tests with p = 0.25\n", + "Mean of input: 10.000207878477502\n", + "Mean of 
train-time output: 10.014059116977283\n", + "Mean of test-time output: 10.000207878477502\n", + "Fraction of train-time output set to zero: 0.749784\n", + "Fraction of test-time output set to zero: 0.0\n", + "\n", + "Running tests with p = 0.4\n", + "Mean of input: 10.000207878477502\n", + "Mean of train-time output: 9.977917658761159\n", + "Mean of test-time output: 10.000207878477502\n", + "Fraction of train-time output set to zero: 0.600796\n", + "Fraction of test-time output set to zero: 0.0\n", + "\n", + "Running tests with p = 0.7\n", + "Mean of input: 10.000207878477502\n", + "Mean of train-time output: 9.987811912159426\n", + "Mean of test-time output: 10.000207878477502\n", + "Fraction of train-time output set to zero: 0.30074\n", + "Fraction of test-time output set to zero: 0.0\n", + "\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r-rGN1gwKpkj" + }, + "source": [ + "# Dropout backward pass\n", + "In the file `cs231n/layers.py`, implement the backward pass for dropout. After doing so, run the following cell to numerically gradient-check your implementation." 
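
Before running the gradient check, it may help to see the shape of an inverted dropout layer. The following is a minimal sketch under stated assumptions, not the assignment's `dropout_forward` / `dropout_backward`: the function names are hypothetical, `p` is taken to be the keep probability (as the test output above suggests), and plain Python lists replace NumPy arrays so the example has no dependencies.

```python
import random

def dropout_forward_sketch(x, p, mode, seed=None):
    """Inverted dropout on a list of rows; p is the probability of keeping a unit."""
    rng = random.Random(seed)
    if mode == 'train':
        # Kept units are scaled by 1/p so the expected output equals the input.
        mask = [[(1.0 / p) if rng.random() < p else 0.0 for _ in row]
                for row in x]
        out = [[v * m for v, m in zip(row, mrow)]
               for row, mrow in zip(x, mask)]
    else:
        # Test time: identity, because the scaling already happened in training.
        mask = None
        out = [list(row) for row in x]
    return out, (mask, mode)

def dropout_backward_sketch(dout, cache):
    """Route upstream gradients through the same mask used in the forward pass."""
    mask, mode = cache
    if mode == 'train':
        return [[d * m for d, m in zip(drow, mrow)]
                for drow, mrow in zip(dout, mask)]
    return [list(drow) for drow in dout]

x = [[10.0] * 200 for _ in range(200)]
out, cache = dropout_forward_sketch(x, p=0.25, mode='train', seed=0)
mean_out = sum(map(sum, out)) / (200 * 200)
frac_zero = sum(v == 0.0 for row in out for v in row) / (200 * 200)
print(mean_out, frac_zero)  # mean stays near 10, about 75% of outputs are zero

dx = dropout_backward_sketch([[1.0] * 200 for _ in range(200)], cache)
```

Storing the already-scaled mask in the cache keeps the backward pass to a single elementwise product.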
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6Ih3gNJhKpkj", + "outputId": "144e0e0d-1a4a-40e6-8651-ebba4b660222", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(10, 10) + 10\n", + "dout = np.random.randn(*x.shape)\n", + "\n", + "dropout_param = {'mode': 'train', 'p': 0.2, 'seed': 123}\n", + "out, cache = dropout_forward(x, dropout_param)\n", + "dx = dropout_backward(dout, cache)\n", + "dx_num = eval_numerical_gradient_array(lambda xx: dropout_forward(xx, dropout_param)[0], x, dout)\n", + "\n", + "# Error should be around e-10 or less\n", + "print('dx relative error: ', rel_error(dx, dx_num))" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "dx relative error: 5.44560814873387e-11\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-inline" + ], + "id": "zDcnUzKRKpkm" + }, + "source": [ + "## Inline Question 1:\n", + "What happens if we do not divide the values being passed through inverse dropout by `p` in the dropout layer? Why does that happen?\n", + "\n", + "## Answer:\n", + "If the values are not divided by $p$ in the dropout layer, the scale of the output no longer matches the scale of the input. For example, if a neuron's output before dropout is $x$, its expected output during training with $p=0.5$ is $px+(1-p)0 = px$. 
To make the scales agree, the output then has to be multiplied by $p$ at test time.\n", + "\n", + "This is called vanilla dropout, and the code looks like this.\n", + "\n", + "```python\n", + "\"\"\" Vanilla Dropout \"\"\"\n", + "\n", + "import numpy as np\n", + "\n", + "# W1, b1, W2, b2, W3, b3 are assumed to be defined elsewhere\n", + "p = 0.5 # fraction of units kept active; a higher p means less dropout\n", + "\n", + "def train_step(X):\n", + " \"\"\" X contains the data \"\"\"\n", + " \n", + " # forward pass for an example 3-layer network\n", + " H1 = np.maximum(0, np.dot(W1, X) + b1)\n", + " U1 = np.random.rand(*H1.shape) < p # first dropout mask\n", + " H1 *= U1 # drop!\n", + " H2 = np.maximum(0, np.dot(W2, H1) + b2)\n", + " U2 = np.random.rand(*H2.shape) < p # second dropout mask\n", + " H2 *= U2 # drop!\n", + " out = np.dot(W3, H2) + b3\n", + " \n", + " # backward pass: compute gradients (not shown)\n", + " # parameter update (not shown)\n", + " \n", + "def predict(X):\n", + " # ensembled forward pass\n", + " H1 = np.maximum(0, np.dot(W1, X) + b1) * p # NOTE: scale the activations\n", + " H2 = np.maximum(0, np.dot(W2, H1) + b2) * p # NOTE: scale the activations\n", + " out = np.dot(W3, H2) + b3\n", + " return out\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "or23NNK1Kpkm" + }, + "source": [ + "# Fully-connected nets with Dropout\n", + "In the file `cs231n/classifiers/fc_net.py`, modify your implementation to use dropout. Specifically, if the constructor of the network receives a value that is not 1 for the `dropout` parameter, then the net should add a dropout layer immediately after every ReLU nonlinearity. After doing so, run the following to numerically gradient-check your implementation." 
+ ] + }, + { + "cell_type": "code", + "metadata": { + "id": "SO1jxK0VKpkn", + "outputId": "1f3ee320-5f30-4676-8a1c-6aad1e249771", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "np.random.seed(231)\n", + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for dropout in [1, 0.75, 0.5]:\n", + " print('Running check with dropout = ', dropout)\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " weight_scale=5e-2, dtype=np.float64,\n", + " dropout=dropout, seed=123)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print('Initial loss: ', loss)\n", + "\n", + " # Relative errors should be around e-6 or less; Note that it's fine\n", + " # if for dropout=1 you have W2 error be on the order of e-5.\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print('%s relative error: %.2e' % (name, rel_error(grad_num, grads[name])))\n", + " print()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Running check with dropout = 1\n", + "Initial loss: 2.3004790897684924\n", + "W1 relative error: 1.48e-07\n", + "W2 relative error: 2.21e-05\n", + "W3 relative error: 3.53e-07\n", + "b1 relative error: 5.38e-09\n", + "b2 relative error: 2.09e-09\n", + "b3 relative error: 5.80e-11\n", + "\n", + "Running check with dropout = 0.75\n", + "Initial loss: 2.302371489704412\n", + "W1 relative error: 1.90e-07\n", + "W2 relative error: 4.76e-06\n", + "W3 relative error: 2.60e-08\n", + "b1 relative error: 4.73e-09\n", + "b2 relative error: 1.82e-09\n", + "b3 relative error: 1.70e-10\n", + "\n", + "Running check with dropout = 0.5\n", + "Initial loss: 2.3042759220785896\n", + "W1 relative error: 3.11e-07\n", + "W2 relative error: 1.84e-08\n", + "W3 relative error: 5.35e-08\n", + 
"b1 relative error: 5.37e-09\n", + "b2 relative error: 2.99e-09\n", + "b3 relative error: 1.13e-10\n", + "\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hdw_cNDRKpkq" + }, + "source": [ + "# Regularization experiment\n", + "As an experiment, we will train a pair of two-layer networks on 500 training examples: one will use no dropout, and one will use a keep probability of 0.25. We will then visualize the training and validation accuracies of the two networks over time." + ] + }, + { + "cell_type": "code", + "metadata": { + "scrolled": false, + "id": "T3dkruOLKpkq", + "outputId": "af36d99d-264d-4631-c448-d63d181ae8a3", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "source": [ + "# Train two identical nets, one with dropout and one without\n", + "np.random.seed(231)\n", + "num_train = 500\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "solvers = {}\n", + "dropout_choices = [1, 0.25]\n", + "for dropout in dropout_choices:\n", + " model = FullyConnectedNet([500], dropout=dropout)\n", + " print(dropout)\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=25, batch_size=100,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 5e-4,\n", + " },\n", + " verbose=True, print_every=100)\n", + " solver.train()\n", + " solvers[dropout] = solver\n", + " print()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "1\n", + "(Iteration 1 / 125) loss: 7.856643\n", + "(Epoch 0 / 25) train acc: 0.260000; val_acc: 0.184000\n", + "(Epoch 1 / 25) train acc: 0.416000; val_acc: 0.258000\n", + "(Epoch 2 / 25) train acc: 0.482000; val_acc: 0.276000\n", + "(Epoch 3 / 25) train acc: 0.532000; val_acc: 0.277000\n", + "(Epoch 4 / 25) train acc: 0.600000; val_acc: 0.271000\n", + "(Epoch 5 / 25) 
train acc: 0.708000; val_acc: 0.299000\n", + "(Epoch 6 / 25) train acc: 0.722000; val_acc: 0.282000\n", + "(Epoch 7 / 25) train acc: 0.832000; val_acc: 0.255000\n", + "(Epoch 8 / 25) train acc: 0.880000; val_acc: 0.268000\n", + "(Epoch 9 / 25) train acc: 0.902000; val_acc: 0.277000\n", + "(Epoch 10 / 25) train acc: 0.898000; val_acc: 0.261000\n", + "(Epoch 11 / 25) train acc: 0.924000; val_acc: 0.263000\n", + "(Epoch 12 / 25) train acc: 0.960000; val_acc: 0.300000\n", + "(Epoch 13 / 25) train acc: 0.972000; val_acc: 0.314000\n", + "(Epoch 14 / 25) train acc: 0.972000; val_acc: 0.310000\n", + "(Epoch 15 / 25) train acc: 0.974000; val_acc: 0.314000\n", + "(Epoch 16 / 25) train acc: 0.994000; val_acc: 0.303000\n", + "(Epoch 17 / 25) train acc: 0.970000; val_acc: 0.304000\n", + "(Epoch 18 / 25) train acc: 0.992000; val_acc: 0.312000\n", + "(Epoch 19 / 25) train acc: 0.992000; val_acc: 0.309000\n", + "(Epoch 20 / 25) train acc: 0.992000; val_acc: 0.289000\n", + "(Iteration 101 / 125) loss: 0.001969\n", + "(Epoch 21 / 25) train acc: 0.996000; val_acc: 0.291000\n", + "(Epoch 22 / 25) train acc: 1.000000; val_acc: 0.306000\n", + "(Epoch 23 / 25) train acc: 0.996000; val_acc: 0.309000\n", + "(Epoch 24 / 25) train acc: 0.998000; val_acc: 0.314000\n", + "(Epoch 25 / 25) train acc: 0.998000; val_acc: 0.305000\n", + "\n", + "0.25\n", + "(Iteration 1 / 125) loss: 17.318478\n", + "(Epoch 0 / 25) train acc: 0.230000; val_acc: 0.177000\n", + "(Epoch 1 / 25) train acc: 0.378000; val_acc: 0.243000\n", + "(Epoch 2 / 25) train acc: 0.402000; val_acc: 0.254000\n", + "(Epoch 3 / 25) train acc: 0.502000; val_acc: 0.276000\n", + "(Epoch 4 / 25) train acc: 0.528000; val_acc: 0.298000\n", + "(Epoch 5 / 25) train acc: 0.562000; val_acc: 0.296000\n", + "(Epoch 6 / 25) train acc: 0.626000; val_acc: 0.291000\n", + "(Epoch 7 / 25) train acc: 0.622000; val_acc: 0.297000\n", + "(Epoch 8 / 25) train acc: 0.688000; val_acc: 0.313000\n", + "(Epoch 9 / 25) train acc: 0.712000; val_acc: 0.297000\n", + 
"(Epoch 10 / 25) train acc: 0.724000; val_acc: 0.306000\n", + "(Epoch 11 / 25) train acc: 0.768000; val_acc: 0.307000\n", + "(Epoch 12 / 25) train acc: 0.774000; val_acc: 0.284000\n", + "(Epoch 13 / 25) train acc: 0.828000; val_acc: 0.308000\n", + "(Epoch 14 / 25) train acc: 0.812000; val_acc: 0.346000\n", + "(Epoch 15 / 25) train acc: 0.850000; val_acc: 0.338000\n", + "(Epoch 16 / 25) train acc: 0.844000; val_acc: 0.307000\n", + "(Epoch 17 / 25) train acc: 0.858000; val_acc: 0.302000\n", + "(Epoch 18 / 25) train acc: 0.860000; val_acc: 0.318000\n", + "(Epoch 19 / 25) train acc: 0.884000; val_acc: 0.316000\n", + "(Epoch 20 / 25) train acc: 0.862000; val_acc: 0.315000\n", + "(Iteration 101 / 125) loss: 4.293572\n", + "(Epoch 21 / 25) train acc: 0.886000; val_acc: 0.330000\n", + "(Epoch 22 / 25) train acc: 0.898000; val_acc: 0.314000\n", + "(Epoch 23 / 25) train acc: 0.934000; val_acc: 0.323000\n", + "(Epoch 24 / 25) train acc: 0.918000; val_acc: 0.322000\n", + "(Epoch 25 / 25) train acc: 0.922000; val_acc: 0.324000\n", + "\n" + ] + } + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "jiOUcPvxKpks", + "outputId": "a25b4e64-6f84-4f6c-819a-412aa6b98c3c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 757 + } + }, + "source": [ + "# Plot train and validation accuracies of the two models\n", + "\n", + "train_accs = []\n", + "val_accs = []\n", + "for dropout in dropout_choices:\n", + " solver = solvers[dropout]\n", + " train_accs.append(solver.train_acc_history[-1])\n", + " val_accs.append(solver.val_acc_history[-1])\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "for dropout in dropout_choices:\n", + " plt.plot(solvers[dropout].train_acc_history, 'o', label='%.2f dropout' % dropout)\n", + "plt.title('Train accuracy')\n", + "plt.xlabel('Epoch')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(ncol=2, loc='lower right')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "for dropout in dropout_choices:\n", + " plt.plot(solvers[dropout].val_acc_history, 'o', 
label='%.2f dropout' % dropout)\n", + "plt.title('Val accuracy')\n", + "plt.xlabel('Epoch')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(ncol=2, loc='lower right')\n", + "\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "execution_count": null, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-inline" + ], + "id": "7sH-SVFmKpkv" + }, + "source": [ + "## Inline Question 2:\n", + "Compare the validation and training accuracies with and without dropout -- what do your results suggest about dropout as a regularizer?\n", + "\n", + "## Answer:\n", + "The results above show the training and validation accuracy with dropout at $p=0.25$ and without dropout.\n", + "\n", + "- Training accuracy was lower overall with dropout, although it rose with the number of epochs and reached about 0.93.\n", + "- Validation accuracy was somewhat higher with dropout.\n", + "\n", + "This suggests that dropout helped reduce overfitting. The specific accuracies are as follows.\n", + "\n", + "- Without dropout, the training and validation accuracies are 0.998 and 0.305, respectively (epoch 25).\n", + "- With dropout, the training and validation accuracies are 0.922 and 0.324, respectively (epoch 25).\n", + "\n", + "Dropout therefore acts as a regularizer that reduces overfitting: the gap between training and validation accuracy narrows from about 0.69 to about 0.60." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-inline" + ], + "id": "wdYffxs7Kpkv" + }, + "source": [ + "## Inline Question 3:\n", + "Suppose we are training a deep fully-connected network for image classification, with dropout after hidden layers (parameterized by keep probability p). If we are concerned about overfitting, how should we modify p (if at all) when we decide to decrease the size of the hidden layers (that is, the number of nodes in each layer)?\n", + "\n", + "## Answer:\n", + "If the goal is to prevent overfitting by effectively shrinking the hidden layers, lowering the dropout layer's $p$ can help. Lowering $p$ raises the fraction of units that are deactivated, so fewer nodes are active in each layer and the effective hidden layer size shrinks in proportion to $p$. By the same reasoning, once the hidden layers themselves are made smaller the network already has less capacity, so $p$ does not need to be lowered further and can be kept the same or raised." 
+ ] + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "uoRhfz77Q-QE" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file diff --git "a/\352\261\264\353\247\220\354\262\234\352\260\204_Q4_ConvolutionalNetworks.ipynb" "b/\352\261\264\353\247\220\354\262\234\352\260\204_Q4_ConvolutionalNetworks.ipynb" new file mode 100644 index 0000000..8cf32bc --- /dev/null +++ "b/\352\261\264\353\247\220\354\262\234\352\260\204_Q4_ConvolutionalNetworks.ipynb" @@ -0,0 +1,1304 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-title" + ] + }, + "source": [ + "# Convolutional Networks\n", + "So far we have worked with deep fully-connected networks, using them to explore different optimization strategies and network architectures. Fully-connected networks are a good testbed for experimentation because they are very computationally efficient, but in practice all state-of-the-art results use convolutional networks instead.\n", + "\n", + "First you will implement several layer types that are used in convolutional networks. You will then use these layers to train a convolutional network on the CIFAR-10 dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "C:\\Users\\2277007\\Desktop\\assignment2\n" + ] + } + ], + "source": [ + "%cd C:\\Users\\2277007\\Desktop\\assignment2" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "tags": [ + "pdf-ignore" + ] + }, + "outputs": [], + "source": [ + "# As usual, a bit of setup\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.cnn import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient_array, eval_numerical_gradient\n", + "from cs231n.layers import *\n", + "from cs231n.fast_layers import *\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "tags": [ + "pdf-ignore" + ] + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "X_train: (49000, 3, 32, 32)\n", + "y_train: (49000,)\n", + "X_val: (1000, 3, 32, 32)\n", + "y_val: (1000,)\n", + "X_test: (1000, 3, 32, 32)\n", + "y_test: (1000,)\n" + ] + } + ], + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.items():\n", + " print('%s: ' % k, v.shape)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "# Convolution: Naive forward pass\n", + "The core of a convolutional network is the convolution operation. In the file `cs231n/layers.py`, implement the forward pass for the convolution layer in the function `conv_forward_naive`. \n", + "\n", + "You don't have to worry too much about efficiency at this point; just write the code in whatever way you find most clear.\n", + "\n", + "You can test your implementation by running the following:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing conv_forward_naive\n", + "difference: 2.2121476417505994e-08\n" + ] + } + ], + "source": [ + "x_shape = (2, 3, 4, 4)\n", + "w_shape = (3, 3, 4, 4)\n", + "x = np.linspace(-0.1, 0.5, num=np.prod(x_shape)).reshape(x_shape)\n", + "w = np.linspace(-0.2, 0.3, num=np.prod(w_shape)).reshape(w_shape)\n", + "b = np.linspace(-0.1, 0.2, num=3)\n", + "\n", + "conv_param = {'stride': 2, 'pad': 1}\n", + "out, _ = conv_forward_naive(x, w, b, conv_param)\n", + "correct_out = np.array([[[[-0.08759809, -0.10987781],\n", + " [-0.18387192, -0.2109216 ]],\n", + " [[ 0.21027089, 0.21661097],\n", + " [ 0.22847626, 0.23004637]],\n", + " [[ 0.50813986, 0.54309974],\n", + " [ 0.64082444, 0.67101435]]],\n", + " [[[-0.98053589, -1.03143541],\n", + " [-1.19128892, -1.24695841]],\n", + " [[ 0.69108355, 0.66880383],\n", + " [ 0.59480972, 0.56776003]],\n", + " [[ 2.36270298, 2.36904306],\n", + " [ 2.38090835, 2.38247847]]]])\n", + "\n", + "# Compare your output to ours; difference should be around e-8\n", + "print('Testing conv_forward_naive')\n", + "print('difference: ', rel_error(out, correct_out))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Aside: Image processing via convolutions\n", + "\n", + "As a fun way to both check your implementation and gain a better understanding of the type of operation that convolutional layers can perform, we will set up 
an input containing two images and manually set up filters that perform common image processing operations (grayscale conversion and edge detection). The convolution forward pass will apply these operations to each of the input images. We can then visualize the results as a sanity check." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "tags": [ + "pdf-ignore-input" + ] + }, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\2277007\\AppData\\Local\\Temp\\ipykernel_9448\\2829536601.py:4: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.\n", + " kitten = imread('notebook_images/kitten.jpg')\n", + "C:\\Users\\2277007\\AppData\\Local\\Temp\\ipykernel_9448\\2829536601.py:5: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.\n", + " puppy = imread('notebook_images/puppy.jpg')\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from imageio import imread\n", + "from PIL import Image\n", + "\n", + "kitten = imread('notebook_images/kitten.jpg')\n", + "puppy = imread('notebook_images/puppy.jpg')\n", + "# kitten is wide, and puppy is already square\n", + "d = kitten.shape[1] - kitten.shape[0]\n", + "kitten_cropped = kitten[:, d//2:-d//2, :]\n", + "\n", + "img_size = 200 # Make this smaller if it runs too slow\n", + "resized_puppy = np.array(Image.fromarray(puppy).resize((img_size, img_size)))\n", + "resized_kitten = np.array(Image.fromarray(kitten_cropped).resize((img_size, img_size)))\n", + "x = np.zeros((2, 3, img_size, img_size))\n", + "x[0, :, :, :] = resized_puppy.transpose((2, 0, 1))\n", + "x[1, :, :, :] = resized_kitten.transpose((2, 0, 1))\n", + "\n", + "# Set up a convolutional weights holding 2 filters, each 3x3\n", + "w = np.zeros((2, 3, 3, 3))\n", + "\n", + "# The first filter converts the image to grayscale.\n", + "# Set up the red, green, and blue channels of the filter.\n", + "w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]\n", + "w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]\n", + "w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]\n", + "\n", + "# Second filter detects horizontal edges in the blue channel.\n", + "w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]\n", + "\n", + "# Vector of biases. 
We don't need any bias for the grayscale\n", + "# filter, but for the edge detection filter we want to add 128\n", + "# to each output so that nothing is negative.\n", + "b = np.array([0, 128])\n", + "\n", + "# Compute the result of convolving each input in x with each filter in w,\n", + "# offsetting by b, and storing the results in out.\n", + "out, _ = conv_forward_naive(x, w, b, {'stride': 1, 'pad': 1})\n", + "\n", + "def imshow_no_ax(img, normalize=True):\n", + " \"\"\" Tiny helper to show images as uint8 and remove axis labels \"\"\"\n", + " if normalize:\n", + " img_max, img_min = np.max(img), np.min(img)\n", + " img = 255.0 * (img - img_min) / (img_max - img_min)\n", + " plt.imshow(img.astype('uint8'))\n", + " plt.gca().axis('off')\n", + "\n", + "# Show the original images and the results of the conv operation\n", + "plt.subplot(2, 3, 1)\n", + "imshow_no_ax(puppy, normalize=False)\n", + "plt.title('Original image')\n", + "plt.subplot(2, 3, 2)\n", + "imshow_no_ax(out[0, 0])\n", + "plt.title('Grayscale')\n", + "plt.subplot(2, 3, 3)\n", + "imshow_no_ax(out[0, 1])\n", + "plt.title('Edges')\n", + "plt.subplot(2, 3, 4)\n", + "imshow_no_ax(kitten_cropped, normalize=False)\n", + "plt.subplot(2, 3, 5)\n", + "imshow_no_ax(out[1, 0])\n", + "plt.subplot(2, 3, 6)\n", + "imshow_no_ax(out[1, 1])\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Convolution: Naive backward pass\n", + "Implement the backward pass for the convolution operation in the function `conv_backward_naive` in the file `cs231n/layers.py`. Again, you don't need to worry too much about computational efficiency.\n", + "\n", + "When you are done, run the following to check your backward pass with a numeric gradient check." 
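, + "\n", + "\n", + "As a reference for what the gradient check verifies, here is a sketch of the gradients, assuming the forward pass is the usual cross-correlation with stride $s$ applied to the zero-padded input $\\tilde{x}$. The index names below are notation introduced here, not taken from `cs231n/layers.py`.\n", + "\n", + "$$\n", + "\\begin{aligned}\n", + "\\mathrm{out}_{n,f,i,j} &= b_f + \\sum_{c,p,q} \\tilde{x}_{n,c,si+p,sj+q}\\, w_{f,c,p,q} \\\\\n", + "\\frac{\\partial L}{\\partial b_f} &= \\sum_{n,i,j} \\mathrm{dout}_{n,f,i,j} \\\\\n", + "\\frac{\\partial L}{\\partial w_{f,c,p,q}} &= \\sum_{n,i,j} \\mathrm{dout}_{n,f,i,j}\\, \\tilde{x}_{n,c,si+p,sj+q} \\\\\n", + "\\frac{\\partial L}{\\partial \\tilde{x}_{n,c,u,v}} &= \\sum_{f} \\sum_{si+p=u} \\sum_{sj+q=v} \\mathrm{dout}_{n,f,i,j}\\, w_{f,c,p,q}\n", + "\\end{aligned}\n", + "$$\n", + "\n", + "Under the same assumption, `dx` is $\\partial L / \\partial \\tilde{x}$ with the padded border cropped off."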
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing conv_backward_naive function\n", + "dx error: 5.332438838802517e-09\n", + "dw error: 2.678293773461729e-10\n", + "db error: 3.8835192329918934e-11\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(4, 3, 5, 5)\n", + "w = np.random.randn(2, 3, 3, 3)\n", + "b = np.random.randn(2,)\n", + "dout = np.random.randn(4, 2, 5, 5)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_forward_naive(x, w, b, conv_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_forward_naive(x, w, b, conv_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_forward_naive(x, w, b, conv_param)[0], b, dout)\n", + "\n", + "out, cache = conv_forward_naive(x, w, b, conv_param)\n", + "dx, dw, db = conv_backward_naive(dout, cache)\n", + "\n", + "# Your errors should be around e-8 or less.\n", + "print('Testing conv_backward_naive function')\n", + "print('dx error: ', rel_error(dx, dx_num))\n", + "print('dw error: ', rel_error(dw, dw_num))\n", + "print('db error: ', rel_error(db, db_num))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Max-Pooling: Naive forward\n", + "Implement the forward pass for the max-pooling operation in the function `max_pool_forward_naive` in the file `cs231n/layers.py`. 
Again, don't worry too much about computational efficiency.\n", + "\n", + "Check your implementation by running the following:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing max_pool_forward_naive function:\n", + "difference: 4.1666665157267834e-08\n" + ] + } + ], + "source": [ + "x_shape = (2, 3, 4, 4)\n", + "x = np.linspace(-0.3, 0.4, num=np.prod(x_shape)).reshape(x_shape)\n", + "pool_param = {'pool_width': 2, 'pool_height': 2, 'stride': 2}\n", + "\n", + "out, _ = max_pool_forward_naive(x, pool_param)\n", + "\n", + "correct_out = np.array([[[[-0.26315789, -0.24842105],\n", + " [-0.20421053, -0.18947368]],\n", + " [[-0.14526316, -0.13052632],\n", + " [-0.08631579, -0.07157895]],\n", + " [[-0.02736842, -0.01263158],\n", + " [ 0.03157895, 0.04631579]]],\n", + " [[[ 0.09052632, 0.10526316],\n", + " [ 0.14947368, 0.16421053]],\n", + " [[ 0.20842105, 0.22315789],\n", + " [ 0.26736842, 0.28210526]],\n", + " [[ 0.32631579, 0.34105263],\n", + " [ 0.38526316, 0.4 ]]]])\n", + "\n", + "# Compare your output with ours. Difference should be on the order of e-8.\n", + "print('Testing max_pool_forward_naive function:')\n", + "print('difference: ', rel_error(out, correct_out))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Max-Pooling: Naive backward\n", + "Implement the backward pass for the max-pooling operation in the function `max_pool_backward_naive` in the file `cs231n/layers.py`. 
You don't need to worry about computational efficiency.\n", + "\n", + "Check your implementation with numeric gradient checking by running the following:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing max_pool_backward_naive function:\n", + "dx error: 3.27562514223145e-12\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "x = np.random.randn(3, 2, 8, 8)\n", + "dout = np.random.randn(3, 2, 4, 4)\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: max_pool_forward_naive(x, pool_param)[0], x, dout)\n", + "\n", + "out, cache = max_pool_forward_naive(x, pool_param)\n", + "dx = max_pool_backward_naive(dout, cache)\n", + "\n", + "# Your error should be on the order of e-12\n", + "print('Testing max_pool_backward_naive function:')\n", + "print('dx error: ', rel_error(dx, dx_num))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "%cd C:\\Users\\2277007\\Desktop\\assignment2\\cs231n\n", + "!python setup.py build_ext --inplace\n", + "%cd C:\\Users\\2277007\\Desktop\\assignment2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "%cd C:\\Users\\2277007\\Desktop\\cs231n-master-jabe\\cs231n-master\\assignment2\\cs231n\n", + "!python setup.py build_ext --inplace \n", + "%cd C:\\Users\\2277007\\Desktop\\cs231n-master-jabe\\cs231n-master\\assignment2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Fast layers\n", + "Making convolution and pooling layers fast can be challenging. 
To spare you the pain, we've provided fast implementations of the forward and backward passes for convolution and pooling layers in the file `cs231n/fast_layers.py`.\n", + "\n", + "The fast convolution implementation depends on a Cython extension; to compile it you need to run the following from the `cs231n` directory:\n", + "\n", + "```bash\n", + "python setup.py build_ext --inplace\n", + "```\n", + "\n", + "The API for the fast versions of the convolution and pooling layers is exactly the same as the naive versions that you implemented above: the forward pass receives data, weights, and parameters and produces outputs and a cache object; the backward pass receives upstream derivatives and the cache object and produces gradients with respect to the data and weights.\n", + "\n", + "**NOTE:** The fast implementation for pooling will only perform optimally if the pooling regions are non-overlapping and tile the input. If these conditions are not met then the fast pooling implementation will not be much faster than the naive implementation.\n", + "\n", + "You can compare the performance of the naive and fast versions of these layers by running the following:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing conv_forward_fast:\n", + "Naive: 0.004047s\n", + "Fast: 0.008284s\n", + "Speedup: 0.488516x\n", + "Difference: 0.0\n", + "\n", + "Testing conv_backward_fast:\n", + "Naive: 0.373406s\n", + "Fast: 0.009666s\n", + "Speedup: 38.631928x\n", + "dx difference: 1.949764775345631e-11\n", + "dw difference: 6.284191946907482e-13\n", + "db difference: 5.606046075994916e-15\n" + ] + } + ], + "source": [ + "# Rel errors should be around e-9 or less\n", + "from cs231n.fast_layers import conv_forward_fast, conv_backward_fast\n", + "from time import time\n", + "np.random.seed(231)\n", + "x = np.random.randn(100, 3, 31, 31)\n", + "w = 
np.random.randn(25, 3, 3, 3)\n", + "b = np.random.randn(25,)\n", + "dout = np.random.randn(100, 25, 16, 16)\n", + "conv_param = {'stride': 2, 'pad': 1}\n", + "\n", + "t0 = time()\n", + "out_naive, cache_naive = conv_forward_naive(x, w, b, conv_param)\n", + "t1 = time()\n", + "out_fast, cache_fast = conv_forward_fast(x, w, b, conv_param)\n", + "t2 = time()\n", + "\n", + "print('Testing conv_forward_fast:')\n", + "print('Naive: %fs' % (t1 - t0))\n", + "print('Fast: %fs' % (t2 - t1))\n", + "print('Speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n", + "print('Difference: ', rel_error(out_naive, out_fast))\n", + "\n", + "t0 = time()\n", + "dx_naive, dw_naive, db_naive = conv_backward_naive(dout, cache_naive)\n", + "t1 = time()\n", + "dx_fast, dw_fast, db_fast = conv_backward_fast(dout, cache_fast)\n", + "t2 = time()\n", + "\n", + "print('\\nTesting conv_backward_fast:')\n", + "print('Naive: %fs' % (t1 - t0))\n", + "print('Fast: %fs' % (t2 - t1))\n", + "print('Speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n", + "print('dx difference: ', rel_error(dx_naive, dx_fast))\n", + "print('dw difference: ', rel_error(dw_naive, dw_fast))\n", + "print('db difference: ', rel_error(db_naive, db_fast))" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing pool_forward_fast:\n", + "Naive: 0.003110s\n", + "fast: 0.005553s\n", + "speedup: 0.560069x\n", + "difference: 0.0\n", + "\n", + "Testing pool_backward_fast:\n", + "Naive: 0.021852s\n", + "fast: 0.011665s\n", + "speedup: 1.873319x\n", + "dx difference: 0.0\n" + ] + } + ], + "source": [ + "# Relative errors should be close to 0.0\n", + "from cs231n.fast_layers import max_pool_forward_fast, max_pool_backward_fast\n", + "np.random.seed(231)\n", + "x = np.random.randn(100, 3, 32, 32)\n", + "dout = np.random.randn(100, 3, 16, 16)\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "t0 = time()\n", + 
"out_naive, cache_naive = max_pool_forward_naive(x, pool_param)\n", + "t1 = time()\n", + "out_fast, cache_fast = max_pool_forward_fast(x, pool_param)\n", + "t2 = time()\n", + "\n", + "print('Testing pool_forward_fast:')\n", + "print('Naive: %fs' % (t1 - t0))\n", + "print('fast: %fs' % (t2 - t1))\n", + "print('speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n", + "print('difference: ', rel_error(out_naive, out_fast))\n", + "\n", + "t0 = time()\n", + "dx_naive = max_pool_backward_naive(dout, cache_naive)\n", + "t1 = time()\n", + "dx_fast = max_pool_backward_fast(dout, cache_fast)\n", + "t2 = time()\n", + "\n", + "print('\\nTesting pool_backward_fast:')\n", + "print('Naive: %fs' % (t1 - t0))\n", + "print('fast: %fs' % (t2 - t1))\n", + "print('speedup: %fx' % ((t1 - t0) / (t2 - t1)))\n", + "print('dx difference: ', rel_error(dx_naive, dx_fast))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Convolutional \"sandwich\" layers\n", + "Previously we introduced the concept of \"sandwich\" layers that combine multiple operations into commonly used patterns. In the file `cs231n/layer_utils.py` you will find sandwich layers that implement a few commonly used patterns for convolutional networks. Run the cells below to sanity check they're working." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing conv_relu_pool\n", + "dx error: 9.591132621921372e-09\n", + "dw error: 5.802455944849637e-09\n", + "db error: 3.57960501324485e-10\n" + ] + } + ], + "source": [ + "from cs231n.layer_utils import conv_relu_pool_forward, conv_relu_pool_backward\n", + "np.random.seed(231)\n", + "x = np.random.randn(2, 3, 16, 16)\n", + "w = np.random.randn(3, 3, 3, 3)\n", + "b = np.random.randn(3,)\n", + "dout = np.random.randn(2, 3, 8, 8)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "out, cache = conv_relu_pool_forward(x, w, b, conv_param, pool_param)\n", + "dx, dw, db = conv_relu_pool_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], b, dout)\n", + "\n", + "# Relative errors should be around e-8 or less\n", + "print('Testing conv_relu_pool')\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dw error: ', rel_error(dw_num, dw))\n", + "print('db error: ', rel_error(db_num, db))" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Testing conv_relu:\n", + "dx error: 1.5218619980349303e-09\n", + "dw error: 3.3715893156038223e-10\n", + "db error: 4.8422803898140394e-11\n" + ] + } + ], + "source": [ + "from cs231n.layer_utils import conv_relu_forward, conv_relu_backward\n", + "np.random.seed(231)\n", + "x = np.random.randn(2, 3, 8, 8)\n", + "w = np.random.randn(3, 3, 3, 3)\n", + "b = 
np.random.randn(3,)\n", + "dout = np.random.randn(2, 3, 8, 8)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "\n", + "out, cache = conv_relu_forward(x, w, b, conv_param)\n", + "dx, dw, db = conv_relu_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_relu_forward(x, w, b, conv_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_relu_forward(x, w, b, conv_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_relu_forward(x, w, b, conv_param)[0], b, dout)\n", + "\n", + "# Relative errors should be around e-8 or less\n", + "print('Testing conv_relu:')\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dw error: ', rel_error(dw_num, dw))\n", + "print('db error: ', rel_error(db_num, db))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Three-layer ConvNet\n", + "Now that you have implemented all the necessary layers, we can put them together into a simple convolutional network.\n", + "\n", + "Open the file `cs231n/classifiers/cnn.py` and complete the implementation of the `ThreeLayerConvNet` class. Remember you can use the fast/sandwich layers (already imported for you) in your implementation. Run the following cells to help you debug:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Sanity check loss\n", + "After you build a new network, one of the first things you should do is sanity check the loss. When we use the softmax loss, we expect the loss for random weights (and no regularization) to be about `log(C)` for `C` classes. When we add regularization the loss should go up slightly." 
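, + "\n", + "\n", + "To see where `log(C)` comes from, here is a short derivation, assuming the small random initial weights make the softmax output close to uniform and that `log` is the natural logarithm. Each class then gets probability about $1/C$, so the loss on example $i$ with label $y_i$ is\n", + "\n", + "$$\n", + "L_i = -\\log p_{y_i} \\approx -\\log \\frac{1}{C} = \\log C\n", + "$$\n", + "\n", + "For CIFAR-10, $C = 10$ gives $\\log 10 \\approx 2.3026$. A regularization penalty only adds a non-negative term to this value, which is why the regularized loss comes out slightly higher."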
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Initial loss (no regularization): 2.302586071243987\n", + "Initial loss (with regularization): 2.508255638232932\n" + ] + } + ], + "source": [ + "model = ThreeLayerConvNet()\n", + "\n", + "N = 50\n", + "X = np.random.randn(N, 3, 32, 32)\n", + "y = np.random.randint(10, size=N)\n", + "\n", + "loss, grads = model.loss(X, y)\n", + "print('Initial loss (no regularization): ', loss)\n", + "\n", + "model.reg = 0.5\n", + "loss, grads = model.loss(X, y)\n", + "print('Initial loss (with regularization): ', loss)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Gradient check\n", + "After the loss looks reasonable, use numeric gradient checking to make sure that your backward pass is correct. When you use numeric gradient checking you should use a small amount of artificial data and a small number of neurons at each layer. Note: correct implementations may still have relative errors up to the order of e-2."
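, + "\n", + "\n", + "As a sketch of what the check computes, assume that `eval_numerical_gradient` uses a centered difference with step $h$ and that `rel_error` is a maximum elementwise relative error with a small floor in the denominator. Both are assumptions about helper code that is not shown in this section; $e_k$ denotes the $k$-th unit vector and $\\mathrm{relerr}$ stands for `rel_error`.\n", + "\n", + "$$\n", + "\\frac{\\partial f}{\\partial \\theta_k} \\approx \\frac{f(\\theta + h e_k) - f(\\theta - h e_k)}{2h}, \\qquad \\mathrm{relerr}(a, b) = \\max_k \\frac{|a_k - b_k|}{\\max(10^{-8},\\, |a_k| + |b_k|)}\n", + "$$\n", + "\n", + "With a centered difference, every parameter entry costs two full forward passes, which is why the check uses only a few inputs and few neurons per layer."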
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "W1 max relative error: 3.053965e-04\n", + "W2 max relative error: 1.822723e-02\n", + "W3 max relative error: 3.422399e-04\n", + "b1 max relative error: 3.397321e-06\n", + "b2 max relative error: 2.517459e-03\n", + "b3 max relative error: 9.711800e-10\n" + ] + } + ], + "source": [ + "num_inputs = 2\n", + "input_dim = (3, 16, 16)\n", + "reg = 0.0\n", + "num_classes = 10\n", + "np.random.seed(231)\n", + "X = np.random.randn(num_inputs, *input_dim)\n", + "y = np.random.randint(num_classes, size=num_inputs)\n", + "\n", + "model = ThreeLayerConvNet(num_filters=3, filter_size=3,\n", + " input_dim=input_dim, hidden_dim=7,\n", + " dtype=np.float64)\n", + "loss, grads = model.loss(X, y)\n", + "# Errors should be small, but correct implementations may have\n", + "# relative errors up to the order of e-2\n", + "for param_name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)\n", + " e = rel_error(param_grad_num, grads[param_name])\n", + " print('%s max relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name])))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overfit small data\n", + "A nice trick is to train your model with just a few training samples. You should be able to overfit small datasets, which will result in very high training accuracy and comparatively low validation accuracy." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Iteration 1 / 30) loss: 2.414060\n", + "(Epoch 0 / 15) train acc: 0.200000; val_acc: 0.137000\n", + "(Iteration 2 / 30) loss: 3.102925\n", + "(Epoch 1 / 15) train acc: 0.140000; val_acc: 0.087000\n", + "(Iteration 3 / 30) loss: 2.270330\n", + "(Iteration 4 / 30) loss: 2.096705\n", + "(Epoch 2 / 15) train acc: 0.240000; val_acc: 0.094000\n", + "(Iteration 5 / 30) loss: 1.838880\n", + "(Iteration 6 / 30) loss: 1.934188\n", + "(Epoch 3 / 15) train acc: 0.510000; val_acc: 0.173000\n", + "(Iteration 7 / 30) loss: 1.827912\n", + "(Iteration 8 / 30) loss: 1.639574\n", + "(Epoch 4 / 15) train acc: 0.520000; val_acc: 0.188000\n", + "(Iteration 9 / 30) loss: 1.330082\n", + "(Iteration 10 / 30) loss: 1.756115\n", + "(Epoch 5 / 15) train acc: 0.630000; val_acc: 0.167000\n", + "(Iteration 11 / 30) loss: 1.024162\n", + "(Iteration 12 / 30) loss: 1.041826\n", + "(Epoch 6 / 15) train acc: 0.750000; val_acc: 0.229000\n", + "(Iteration 13 / 30) loss: 1.142777\n", + "(Iteration 14 / 30) loss: 0.835706\n", + "(Epoch 7 / 15) train acc: 0.790000; val_acc: 0.247000\n", + "(Iteration 15 / 30) loss: 0.587786\n", + "(Iteration 16 / 30) loss: 0.645509\n", + "(Epoch 8 / 15) train acc: 0.820000; val_acc: 0.252000\n", + "(Iteration 17 / 30) loss: 0.786844\n", + "(Iteration 18 / 30) loss: 0.467054\n", + "(Epoch 9 / 15) train acc: 0.820000; val_acc: 0.178000\n", + "(Iteration 19 / 30) loss: 0.429880\n", + "(Iteration 20 / 30) loss: 0.635498\n", + "(Epoch 10 / 15) train acc: 0.900000; val_acc: 0.206000\n", + "(Iteration 21 / 30) loss: 0.365807\n", + "(Iteration 22 / 30) loss: 0.284220\n", + "(Epoch 11 / 15) train acc: 0.820000; val_acc: 0.201000\n", + "(Iteration 23 / 30) loss: 0.469343\n", + "(Iteration 24 / 30) loss: 0.509369\n", + "(Epoch 12 / 15) train acc: 0.920000; val_acc: 0.211000\n", + "(Iteration 25 / 30) loss: 0.111638\n", + 
"(Iteration 26 / 30) loss: 0.145388\n", + "(Epoch 13 / 15) train acc: 0.930000; val_acc: 0.213000\n", + "(Iteration 27 / 30) loss: 0.155575\n", + "(Iteration 28 / 30) loss: 0.143398\n", + "(Epoch 14 / 15) train acc: 0.960000; val_acc: 0.212000\n", + "(Iteration 29 / 30) loss: 0.158160\n", + "(Iteration 30 / 30) loss: 0.118934\n", + "(Epoch 15 / 15) train acc: 0.990000; val_acc: 0.220000\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "\n", + "num_train = 100\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "model = ThreeLayerConvNet(weight_scale=1e-2)\n", + "\n", + "solver = Solver(model, small_data,\n", + " num_epochs=15, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=1)\n", + "solver.train()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Plotting the loss, training accuracy, and validation accuracy should show clear overfitting:" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.subplot(2, 1, 1)\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.xlabel('iteration')\n", + "plt.ylabel('loss')\n", + "\n", + "plt.subplot(2, 1, 2)\n", + "plt.plot(solver.train_acc_history, '-o')\n", + "plt.plot(solver.val_acc_history, '-o')\n", + "plt.legend(['train', 'val'], loc='upper left')\n", + "plt.xlabel('epoch')\n", + "plt.ylabel('accuracy')\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Train the net\n", + "By training the three-layer convolutional network for one epoch, you should achieve greater than 40% accuracy on the training set:" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(Iteration 1 / 980) loss: 2.304740\n", + "(Epoch 0 / 1) train acc: 0.103000; val_acc: 0.107000\n", + "(Iteration 21 / 980) loss: 2.098229\n", + "(Iteration 41 / 980) loss: 1.949788\n", + "(Iteration 61 / 980) loss: 1.888398\n", + "(Iteration 81 / 980) loss: 1.877093\n", + "(Iteration 101 / 980) loss: 1.851877\n", + "(Iteration 121 / 980) loss: 1.859353\n", + "(Iteration 141 / 980) loss: 1.800181\n", + "(Iteration 161 / 980) loss: 2.143292\n", + "(Iteration 181 / 980) loss: 1.830573\n", + "(Iteration 201 / 980) loss: 2.037280\n", + "(Iteration 221 / 980) loss: 2.020304\n", + "(Iteration 241 / 980) loss: 1.823728\n", + "(Iteration 261 / 980) loss: 1.692679\n", + "(Iteration 281 / 980) loss: 1.882594\n", + "(Iteration 301 / 980) loss: 1.798261\n", + "(Iteration 321 / 980) loss: 1.851960\n", + "(Iteration 341 / 980) loss: 1.716323\n", + "(Iteration 361 / 980) loss: 1.897655\n", + "(Iteration 381 / 980) loss: 1.319744\n", + "(Iteration 401 / 980) loss: 1.738790\n", + "(Iteration 421 / 980) loss: 1.488866\n", + "(Iteration 441 / 980) loss: 1.718409\n", + "(Iteration 461 / 980) loss: 1.744440\n", + "(Iteration 481 / 980) loss: 
1.605460\n", + "(Iteration 501 / 980) loss: 1.494847\n", + "(Iteration 521 / 980) loss: 1.835179\n", + "(Iteration 541 / 980) loss: 1.483923\n", + "(Iteration 561 / 980) loss: 1.676871\n", + "(Iteration 581 / 980) loss: 1.438325\n", + "(Iteration 601 / 980) loss: 1.443469\n", + "(Iteration 621 / 980) loss: 1.529369\n", + "(Iteration 641 / 980) loss: 1.763475\n", + "(Iteration 661 / 980) loss: 1.790329\n", + "(Iteration 681 / 980) loss: 1.693343\n", + "(Iteration 701 / 980) loss: 1.637078\n", + "(Iteration 721 / 980) loss: 1.644564\n", + "(Iteration 741 / 980) loss: 1.708919\n", + "(Iteration 761 / 980) loss: 1.494252\n", + "(Iteration 781 / 980) loss: 1.901751\n", + "(Iteration 801 / 980) loss: 1.898991\n", + "(Iteration 821 / 980) loss: 1.489988\n", + "(Iteration 841 / 980) loss: 1.377615\n", + "(Iteration 861 / 980) loss: 1.763751\n", + "(Iteration 881 / 980) loss: 1.540284\n", + "(Iteration 901 / 980) loss: 1.525582\n", + "(Iteration 921 / 980) loss: 1.674166\n", + "(Iteration 941 / 980) loss: 1.714316\n", + "(Iteration 961 / 980) loss: 1.534668\n", + "(Epoch 1 / 1) train acc: 0.504000; val_acc: 0.499000\n" + ] + } + ], + "source": [ + "model = ThreeLayerConvNet(weight_scale=0.001, hidden_dim=500, reg=0.001)\n", + "\n", + "solver = Solver(model, data,\n", + " num_epochs=1, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=20)\n", + "solver.train()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Visualize Filters\n", + "You can visualize the first-layer convolutional filters from the trained network by running the following:" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "from cs231n.vis_utils import visualize_grid\n", + "\n", + "grid = visualize_grid(model.params['W1'].transpose(0, 2, 3, 1))\n", + "plt.imshow(grid.astype('uint8'))\n", + "plt.axis('off')\n", + "plt.gcf().set_size_inches(5, 5)\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Spatial Batch Normalization\n", + "We already saw that batch normalization is a very useful technique for training deep fully-connected networks. As proposed in the original paper (link in `BatchNormalization.ipynb`), batch normalization can also be used for convolutional networks, but we need to tweak it a bit; the modification will be called \"spatial batch normalization.\"\n", + "\n", + "Normally batch-normalization accepts inputs of shape `(N, D)` and produces outputs of shape `(N, D)`, where we normalize across the minibatch dimension `N`. For data coming from convolutional layers, batch normalization needs to accept inputs of shape `(N, C, H, W)` and produce outputs of shape `(N, C, H, W)` where the `N` dimension gives the minibatch size and the `(H, W)` dimensions give the spatial size of the feature map.\n", + "\n", + "If the feature map was produced using convolutions, then we expect every feature channel's statistics e.g. mean, variance to be relatively consistent both between different images, and different locations within the same image -- after all, every feature channel is produced by the same convolutional filter! 
Therefore spatial batch normalization computes a mean and variance for each of the `C` feature channels by computing statistics over the minibatch dimension `N` as well as the spatial dimensions `H` and `W`.\n", + "\n", + "\n", + "[1] [Sergey Ioffe and Christian Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing\n", + "Internal Covariate Shift\", ICML 2015.](https://arxiv.org/abs/1502.03167)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Spatial batch normalization: forward\n", + "\n", + "In the file `cs231n/layers.py`, implement the forward pass for spatial batch normalization in the function `spatial_batchnorm_forward`. Check your implementation by running the following:" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Before spatial batch normalization:\n", + " Shape: (2, 3, 4, 5)\n", + " Means: [9.33463814 8.90909116 9.11056338]\n", + " Stds: [3.61447857 3.19347686 3.5168142 ]\n", + "After spatial batch normalization:\n", + " Shape: (2, 3, 4, 5)\n", + " Means: [ 6.18949336e-16 5.99520433e-16 -1.22124533e-16]\n", + " Stds: [0.99999962 0.99999951 0.9999996 ]\n", + "After spatial batch normalization (nontrivial gamma, beta):\n", + " Shape: (2, 3, 4, 5)\n", + " Means: [6. 7. 
8.]\n", + " Stds: [2.99999885 3.99999804 4.99999798]\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "# Check the training-time forward pass by checking means and variances\n", + "# of features both before and after spatial batch normalization\n", + "\n", + "N, C, H, W = 2, 3, 4, 5\n", + "x = 4 * np.random.randn(N, C, H, W) + 10\n", + "\n", + "print('Before spatial batch normalization:')\n", + "print(' Shape: ', x.shape)\n", + "print(' Means: ', x.mean(axis=(0, 2, 3)))\n", + "print(' Stds: ', x.std(axis=(0, 2, 3)))\n", + "\n", + "# Means should be close to zero and stds close to one\n", + "gamma, beta = np.ones(C), np.zeros(C)\n", + "bn_param = {'mode': 'train'}\n", + "out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "print('After spatial batch normalization:')\n", + "print(' Shape: ', out.shape)\n", + "print(' Means: ', out.mean(axis=(0, 2, 3)))\n", + "print(' Stds: ', out.std(axis=(0, 2, 3)))\n", + "\n", + "# Means should be close to beta and stds close to gamma\n", + "gamma, beta = np.asarray([3, 4, 5]), np.asarray([6, 7, 8])\n", + "out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "print('After spatial batch normalization (nontrivial gamma, beta):')\n", + "print(' Shape: ', out.shape)\n", + "print(' Means: ', out.mean(axis=(0, 2, 3)))\n", + "print(' Stds: ', out.std(axis=(0, 2, 3)))" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "After spatial batch normalization (test-time):\n", + " means: [-0.08034406 0.07562881 0.05716371 0.04378383]\n", + " stds: [0.96718744 1.0299714 1.02887624 1.00585577]\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "# Check the test-time forward pass by running the training-time\n", + "# forward pass many times to warm up the running averages, and then\n", + "# checking the means and variances of activations after a test-time\n", + "# forward pass.\n", + "N, C, H, W = 
10, 4, 11, 12\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "gamma = np.ones(C)\n", + "beta = np.zeros(C)\n", + "for t in range(50):\n", + " x = 2.3 * np.random.randn(N, C, H, W) + 13\n", + " spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "bn_param['mode'] = 'test'\n", + "x = 2.3 * np.random.randn(N, C, H, W) + 13\n", + "a_norm, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "\n", + "# Means should be close to zero and stds close to one, but will be\n", + "# noisier than training-time forward passes.\n", + "print('After spatial batch normalization (test-time):')\n", + "print(' means: ', a_norm.mean(axis=(0, 2, 3)))\n", + "print(' stds: ', a_norm.std(axis=(0, 2, 3)))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Spatial batch normalization: backward\n", + "In the file `cs231n/layers.py`, implement the backward pass for spatial batch normalization in the function `spatial_batchnorm_backward`. Run the following to check your implementation using a numeric gradient check:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "dx error: 2.786648193872555e-07\n", + "dgamma error: 7.0974817113608705e-12\n", + "dbeta error: 3.275608725278405e-12\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "N, C, H, W = 2, 3, 4, 5\n", + "x = 5 * np.random.randn(N, C, H, W) + 12\n", + "gamma = np.random.randn(C)\n", + "beta = np.random.randn(C)\n", + "dout = np.random.randn(N, C, H, W)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "fx = lambda x: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "fg = lambda a: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "fb = lambda b: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "da_num = eval_numerical_gradient_array(fg, gamma, dout)\n", + "db_num = 
eval_numerical_gradient_array(fb, beta, dout)\n", + "\n", + "#You should expect errors of magnitudes between 1e-12~1e-06\n", + "_, cache = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "dx, dgamma, dbeta = spatial_batchnorm_backward(dout, cache)\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dgamma error: ', rel_error(da_num, dgamma))\n", + "print('dbeta error: ', rel_error(db_num, dbeta))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Group Normalization\n", + "In the previous notebook, we mentioned that Layer Normalization is an alternative normalization technique that mitigates the batch size limitations of Batch Normalization. However, as the authors of [2] observed, Layer Normalization does not perform as well as Batch Normalization when used with Convolutional Layers:\n", + "\n", + ">With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction, and re-centering and rescaling the summed inputs to a layer works well. However, the assumption of similar contributions is no longer true for convolutional neural networks. The large number of the hidden units whose\n", + "receptive fields lie near the boundary of the image are rarely turned on and thus have very different\n", + "statistics from the rest of the hidden units within the same layer.\n", + "\n", + "The authors of [3] propose an intermediary technique. In contrast to Layer Normalization, where you normalize over the entire feature per-datapoint, they suggest a consistent splitting of each per-datapoint feature into G groups, and a per-group per-datapoint normalization instead. \n", + "\n", + "![Comparison of normalization techniques discussed so far](notebook_images/normalization.png)\n", + "
**Visual comparison of the normalization techniques discussed so far (image edited from [3])**
\n", + "\n", + "Even though an assumption of equal contribution is still being made within each group, the authors hypothesize that this is not as problematic, as innate grouping arises within features for visual recognition. One example they use to illustrate this is that many high-performance handcrafted features in traditional Computer Vision have terms that are explicitly grouped together. Take for example Histogram of Oriented Gradients [4]-- after computing histograms per spatially local block, each per-block histogram is normalized before being concatenated together to form the final feature vector.\n", + "\n", + "You will now implement Group Normalization. Note that this normalization technique that you are to implement in the following cells was introduced and published to ECCV just in 2018 -- this truly is still an ongoing and excitingly active field of research!\n", + "\n", + "[2] [Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. \"Layer Normalization.\" stat 1050 (2016): 21.](https://arxiv.org/pdf/1607.06450.pdf)\n", + "\n", + "\n", + "[3] [Wu, Yuxin, and Kaiming He. \"Group Normalization.\" arXiv preprint arXiv:1803.08494 (2018).](https://arxiv.org/abs/1803.08494)\n", + "\n", + "\n", + "[4] [N. Dalal and B. Triggs. Histograms of oriented gradients for\n", + "human detection. In Computer Vision and Pattern Recognition\n", + "(CVPR), 2005.](https://ieeexplore.ieee.org/abstract/document/1467360/)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Group normalization: forward\n", + "\n", + "In the file `cs231n/layers.py`, implement the forward pass for group normalization in the function `spatial_groupnorm_forward`. 
Check your implementation by running the following:" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Before spatial group normalization:\n", + " Shape: (2, 6, 4, 5)\n", + " Means: [9.72505327 8.51114185 8.9147544 9.43448077]\n", + " Stds: [3.67070958 3.09892597 4.27043622 3.97521327]\n", + "After spatial group normalization:\n", + " Shape: (2, 6, 4, 5)\n", + " Means: [-2.14643118e-16 5.25505565e-16 2.65528340e-16 -3.38618023e-16]\n", + " Stds: [0.99999963 0.99999948 0.99999973 0.99999968]\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "# Check the training-time forward pass by checking means and variances\n", + "# of features both before and after spatial group normalization\n", + "\n", + "N, C, H, W = 2, 6, 4, 5\n", + "G = 2\n", + "x = 4 * np.random.randn(N, C, H, W) + 10\n", + "x_g = x.reshape((N*G,-1))\n", + "print('Before spatial group normalization:')\n", + "print(' Shape: ', x.shape)\n", + "print(' Means: ', x_g.mean(axis=1))\n", + "print(' Stds: ', x_g.std(axis=1))\n", + "\n", + "# Means should be close to zero and stds close to one\n", + "gamma, beta = np.ones((1,C,1,1)), np.zeros((1,C,1,1))\n", + "bn_param = {'mode': 'train'}\n", + "\n", + "out, _ = spatial_groupnorm_forward(x, gamma, beta, G, bn_param)\n", + "out_g = out.reshape((N*G,-1))\n", + "print('After spatial group normalization:')\n", + "print(' Shape: ', out.shape)\n", + "print(' Means: ', out_g.mean(axis=1))\n", + "print(' Stds: ', out_g.std(axis=1))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Spatial group normalization: backward\n", + "In the file `cs231n/layers.py`, implement the backward pass for spatial group normalization in the function `spatial_groupnorm_backward`. 
Run the following to check your implementation using a numeric gradient check:" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "dx error: 7.413109437563619e-08\n", + "dgamma error: 9.468195772749234e-12\n", + "dbeta error: 3.354494437653335e-12\n" + ] + } + ], + "source": [ + "np.random.seed(231)\n", + "N, C, H, W = 2, 6, 4, 5\n", + "G = 2\n", + "x = 5 * np.random.randn(N, C, H, W) + 12\n", + "gamma = np.random.randn(1,C,1,1)\n", + "beta = np.random.randn(1,C,1,1)\n", + "dout = np.random.randn(N, C, H, W)\n", + "\n", + "gn_param = {}\n", + "fx = lambda x: spatial_groupnorm_forward(x, gamma, beta, G, gn_param)[0]\n", + "fg = lambda a: spatial_groupnorm_forward(x, gamma, beta, G, gn_param)[0]\n", + "fb = lambda b: spatial_groupnorm_forward(x, gamma, beta, G, gn_param)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "da_num = eval_numerical_gradient_array(fg, gamma, dout)\n", + "db_num = eval_numerical_gradient_array(fb, beta, dout)\n", + "\n", + "_, cache = spatial_groupnorm_forward(x, gamma, beta, G, gn_param)\n", + "dx, dgamma, dbeta = spatial_groupnorm_backward(dout, cache)\n", + "#You should expect errors of magnitudes between 1e-12~1e-07\n", + "print('dx error: ', rel_error(dx_num, dx))\n", + "print('dgamma error: ', rel_error(da_num, dgamma))\n", + "print('dbeta error: ', rel_error(db_num, dbeta))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git "a/\352\261\264\353\247\220\354\262\234\352\260\204_Q5_PyTorch.ipynb" 
"b/\352\261\264\353\247\220\354\262\234\352\260\204_Q5_PyTorch.ipynb" new file mode 100644 index 0000000..e5ff705 --- /dev/null +++ "b/\352\261\264\353\247\220\354\262\234\352\260\204_Q5_PyTorch.ipynb" @@ -0,0 +1,1809 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KtukLD6hUyN1", + "outputId": "50919433-76f5-4baf-bd6e-10aae9c792c1" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Mounted at /content/drive\n", + "/content/drive/My Drive/euron_cs231n\n" + ] + } + ], + "source": [ + "# This mounts your Google Drive to the Colab VM.\n", + "from google.colab import drive\n", + "drive.mount('/content/drive')\n", + "\n", + "# Enter the foldername in your Drive where you have saved the unzipped\n", + "# assignment folder, e.g. 'cs231n/assignments/assignment1/'\n", + "FOLDERNAME = \"euron_cs231n\"\n", + "assert FOLDERNAME is not None, \"[!] Enter the foldername.\"\n", + "\n", + "# Now that we've mounted your Drive, this ensures that\n", + "# the Python interpreter of the Colab VM can load\n", + "# python files from within it.\n", + "import sys\n", + "sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))\n", + "\n", + "# This downloads the CIFAR-10 dataset to your Drive\n", + "# if it doesn't already exist.\n", + "#%cd /content/drive/My\\ Drive/$FOLDERNAME/cs231n/datasets/\n", + "#!bash get_datasets.sh\n", + "%cd /content/drive/My\\ Drive/$FOLDERNAME" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-title" + ], + "id": "BE6MuJ2oUyN4" + }, + "source": [ + "# Introduction to PyTorch\n", + "\n", + "You've written a lot of code in this assignment to provide a whole host of neural network functionality. Dropout, Batch Norm, and 2D convolutions are some of the workhorses of deep learning in computer vision. 
You've also worked hard to make your code efficient and vectorized.\n", + "\n", + "For the last part of this assignment, though, we're going to leave behind your beautiful codebase and instead migrate to one of two popular deep learning frameworks: in this instance, PyTorch." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-ignore" + ], + "id": "3OaFf-aSUyN5" + }, + "source": [ + "## Why do we use deep learning frameworks?\n", + "\n", + "* Our code will now run on GPUs! This will allow our models to train much faster. When using a framework like PyTorch you can harness the power of the GPU for your own custom neural network architectures without having to write CUDA code directly (which is beyond the scope of this class).\n", + "* In this class, we want you to be ready to use one of these frameworks for your project so you can experiment more efficiently than if you were writing every feature you want to use by hand.\n", + "* We want you to stand on the shoulders of giants! PyTorch is an excellent framework that will make your life a lot easier, and now that you understand its guts, you are free to use it :)\n", + "* Finally, we want you to be exposed to the sort of deep learning code you might run into in academia or industry.\n", + "\n", + "## What is PyTorch?\n", + "\n", + "PyTorch is a system for executing dynamic computational graphs over Tensor objects that behave similarly to numpy ndarrays. It comes with a powerful automatic differentiation engine that removes the need for manual back-propagation.\n", + "\n", + "## How do I learn PyTorch?\n", + "\n", + "One of our former instructors, Justin Johnson, made an excellent [tutorial](https://github.com/jcjohnson/pytorch-examples) for PyTorch.\n", + "\n", + "You can also find the detailed [API doc](http://pytorch.org/docs/stable/index.html) here. 
If you have other questions that are not addressed by the API docs, the [PyTorch forum](https://discuss.pytorch.org/) is a much better place to ask than StackOverflow." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MVM_F3e6UyN6" + }, + "source": [ + "# Table of Contents\n", + "\n", + "This assignment has 5 parts. You will learn PyTorch at **three different levels of abstraction**, which will help you understand it better and prepare you for the final project.\n", + "\n", + "1. Part I, Preparation: we will use the CIFAR-10 dataset.\n", + "2. Part II, Barebones PyTorch: **Abstraction level 1**, we will work directly with the lowest-level PyTorch Tensors.\n", + "3. Part III, PyTorch Module API: **Abstraction level 2**, we will use `nn.Module` to define arbitrary neural network architectures.\n", + "4. Part IV, PyTorch Sequential API: **Abstraction level 3**, we will use `nn.Sequential` to define a linear feed-forward network very conveniently.\n", + "5. Part V, CIFAR-10 open-ended challenge: please implement your own network to get as high an accuracy as possible on CIFAR-10. You can experiment with any layer, optimizer, hyperparameters or other advanced features.\n", + "\n", + "Here is a table of comparison:\n", + "\n", + "| API | Flexibility | Convenience |\n", + "|---------------|-------------|-------------|\n", + "| Barebones | High | Low |\n", + "| `nn.Module` | High | Medium |\n", + "| `nn.Sequential` | Low | High |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-ignore" + ], + "id": "LxopyWIEUyN7" + }, + "source": [ + "# GPU\n", + "\n", + "You can manually switch to a GPU device on Colab by clicking `Runtime -> Change runtime type` and selecting `GPU` under `Hardware Accelerator`. You should do this before running the following cells to import packages, since the kernel gets restarted upon switching runtimes." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "pdf-ignore" + ], + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "eIXZDL0mUyN8", + "outputId": "796ad7fa-d47c-41e5-b31e-92c4a3312949" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "using device: cpu\n" + ] + } + ], + "source": [ + "import torch\n", + "import torch.nn as nn\n", + "import torch.optim as optim\n", + "from torch.utils.data import DataLoader\n", + "from torch.utils.data import sampler\n", + "\n", + "import torchvision.datasets as dset\n", + "import torchvision.transforms as T\n", + "\n", + "import numpy as np\n", + "\n", + "USE_GPU = True\n", + "dtype = torch.float32 # We will be using float throughout this tutorial.\n", + "\n", + "if USE_GPU and torch.cuda.is_available():\n", + " device = torch.device('cuda')\n", + "else:\n", + " device = torch.device('cpu')\n", + "\n", + "# Constant to control how frequently we print train loss.\n", + "print_every = 100\n", + "print('using device:', device)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DL4vC4b5UyN9" + }, + "source": [ + "# Part I. Preparation\n", + "\n", + "Now, let's load the CIFAR-10 dataset. This might take a couple minutes the first time you do it, but the files should stay cached after that.\n", + "\n", + "In previous parts of the assignment we had to write our own code to download the CIFAR-10 dataset, preprocess it, and iterate through it in minibatches; PyTorch provides convenient tools to automate this process for us." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "pdf-ignore" + ], + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "h8gJ_G3kUyN9", + "outputId": "6143ce9d-4b27-4c3f-bd95-c6f2c55fe1a1" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Files already downloaded and verified\n", + "Files already downloaded and verified\n", + "Files already downloaded and verified\n" + ] + } + ], + "source": [ + "NUM_TRAIN = 49000\n", + "\n", + "# The torchvision.transforms package provides tools for preprocessing data\n", + "# and for performing data augmentation; here we set up a transform to\n", + "# preprocess the data by subtracting the mean RGB value and dividing by the\n", + "# standard deviation of each RGB value; we've hardcoded the mean and std.\n", + "transform = T.Compose([\n", + " T.ToTensor(),\n", + " T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))\n", + " ])\n", + "\n", + "# We set up a Dataset object for each split (train / val / test); Datasets load\n", + "# training examples one at a time, so we wrap each Dataset in a DataLoader which\n", + "# iterates through the Dataset and forms minibatches. 
We divide the CIFAR-10\n", + "# training set into train and val sets by passing a Sampler object to the\n", + "# DataLoader telling how it should sample from the underlying Dataset.\n", + "cifar10_train = dset.CIFAR10('./cs231n/datasets', train=True, download=True,\n", + " transform=transform)\n", + "loader_train = DataLoader(cifar10_train, batch_size=64,\n", + " sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))\n", + "\n", + "cifar10_val = dset.CIFAR10('./cs231n/datasets', train=True, download=True,\n", + " transform=transform)\n", + "loader_val = DataLoader(cifar10_val, batch_size=64,\n", + " sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))\n", + "\n", + "cifar10_test = dset.CIFAR10('./cs231n/datasets', train=False, download=True,\n", + " transform=transform)\n", + "loader_test = DataLoader(cifar10_test, batch_size=64)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zg0DMc80UyN-" + }, + "source": [ + "# Part II. Barebones PyTorch\n", + "\n", + "PyTorch ships with high-level APIs to help us define model architectures conveniently, which we will cover in Part III of this tutorial. In this section, we will start with the barebones PyTorch elements to understand the autograd engine better. After this exercise, you will come to appreciate the high-level model API more.\n", + "\n", + "We will start with a simple two-layer fully-connected ReLU network (one hidden layer) with no biases for CIFAR-10 classification.\n", + "This implementation computes the forward pass using operations on PyTorch Tensors, and uses PyTorch autograd to compute gradients. 
It is important that you understand every line, because you will write a harder version after the example.\n", + "\n", + "When we create a PyTorch Tensor with `requires_grad=True`, then operations involving that Tensor will not just compute values; they will also build up a computational graph in the background, allowing us to easily backpropagate through the graph to compute gradients of a downstream loss with respect to some Tensors. Concretely, if x is a Tensor with `x.requires_grad == True` then after backpropagation `x.grad` will be another Tensor holding the gradient of the scalar loss at the end with respect to x." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-ignore" + ], + "id": "dqcyLkMCUyN-" + }, + "source": [ + "### PyTorch Tensors: Flatten Function\n", + "A PyTorch Tensor is conceptually similar to a numpy array: it is an n-dimensional grid of numbers, and like numpy PyTorch provides many functions to efficiently operate on Tensors. As a simple example, we provide a `flatten` function below which reshapes image data for use in a fully-connected neural network.\n", + "\n", + "Recall that image data is typically stored in a Tensor of shape N x C x H x W, where:\n", + "\n", + "* N is the number of datapoints\n", + "* C is the number of channels\n", + "* H is the height of the intermediate feature map in pixels\n", + "* W is the width of the intermediate feature map in pixels\n", + "\n", + "This is the right way to represent the data when we are doing something like a 2D convolution, which needs spatial understanding of where the intermediate features are relative to each other. When we use fully connected affine layers to process the image, however, we want each datapoint to be represented by a single vector -- it's no longer useful to segregate the different channels, rows, and columns of the data. So, we use a \"flatten\" operation to collapse the `C x H x W` values per representation into a single long vector. 
The flatten function below first reads in N, the batch size, from a given batch of data, and then returns a \"view\" of that data. \"View\" is analogous to numpy's \"reshape\" method: it reshapes x's dimensions to be N x ??, where ?? is allowed to be anything (in this case, it will be C x H x W, but we don't need to specify that explicitly)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wexwtAMmUyN_", + "outputId": "5849cc6f-766a-4457-8562-96b51b7fb673" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Before flattening: tensor([[[[ 0, 1],\n", + " [ 2, 3],\n", + " [ 4, 5]]],\n", + "\n", + "\n", + " [[[ 6, 7],\n", + " [ 8, 9],\n", + " [10, 11]]]])\n", + "After flattening: tensor([[ 0, 1, 2, 3, 4, 5],\n", + " [ 6, 7, 8, 9, 10, 11]])\n" + ] + } + ], + "source": [ + "def flatten(x):\n", + " N = x.shape[0] # read in N, the batch size\n", + " return x.view(N, -1) # \"flatten\" the C * H * W values into a single vector per image\n", + "\n", + "def test_flatten():\n", + " x = torch.arange(12).view(2, 1, 3, 2)\n", + " print('Before flattening: ', x)\n", + " print('After flattening: ', flatten(x))\n", + "\n", + "test_flatten()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-ignore" + ], + "id": "uO-w1QQ2UyN_" + }, + "source": [ + "### Barebones PyTorch: Two-Layer Network\n", + "\n", + "Here we define a function `two_layer_fc` which performs the forward pass of a two-layer fully-connected ReLU network on a batch of image data. After defining the forward pass we check that it doesn't crash and that it produces outputs of the right shape by running zeros through the network.\n", + "\n", + "You don't have to write any code here, but it's important that you read and understand the implementation." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "gDU0jE0iUyN_", + "outputId": "2630ca67-ce34-4550-d303-eff0d6d96644" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "torch.Size([64, 10])\n" + ] + } + ], + "source": [ + "import torch.nn.functional as F # useful stateless functions\n", + "\n", + "def two_layer_fc(x, params):\n", + " \"\"\"\n", + " A fully-connected neural network; the architecture is:\n", + " fully connected layer -> ReLU -> fully connected layer.\n", + " Note that this function only defines the forward pass;\n", + " PyTorch will take care of the backward pass for us.\n", + "\n", + " The input to the network will be a minibatch of data, of shape\n", + " (N, d1, ..., dM) where d1 * ... * dM = D. The hidden layer will have H units,\n", + " and the output layer will produce scores for C classes.\n", + "\n", + " Inputs:\n", + " - x: A PyTorch Tensor of shape (N, d1, ..., dM) giving a minibatch of\n", + " input data.\n", + " - params: A list [w1, w2] of PyTorch Tensors giving weights for the network;\n", + " w1 has shape (D, H) and w2 has shape (H, C).\n", + "\n", + " Returns:\n", + " - scores: A PyTorch Tensor of shape (N, C) giving classification scores for\n", + " the input data x.\n", + " \"\"\"\n", + " # first we flatten the image\n", + " x = flatten(x) # shape: [batch_size, C x H x W]\n", + "\n", + " w1, w2 = params\n", + "\n", + " # Forward pass: compute predicted y using operations on Tensors. Since w1 and\n", + " # w2 have requires_grad=True, operations involving these Tensors will cause\n", + " # PyTorch to build a computational graph, allowing automatic computation of\n", + " # gradients. 
Since we are no longer implementing the backward pass by hand we\n", + " # don't need to keep references to intermediate values.\n", + " # you can also use `.clamp(min=0)`, equivalent to F.relu()\n", + " x = F.relu(x.mm(w1))\n", + " x = x.mm(w2)\n", + " return x\n", + "\n", + "\n", + "def two_layer_fc_test():\n", + " hidden_layer_size = 42\n", + " x = torch.zeros((64, 50), dtype=dtype) # minibatch size 64, feature dimension 50\n", + " w1 = torch.zeros((50, hidden_layer_size), dtype=dtype)\n", + " w2 = torch.zeros((hidden_layer_size, 10), dtype=dtype)\n", + " scores = two_layer_fc(x, [w1, w2])\n", + " print(scores.size()) # you should see [64, 10]\n", + "\n", + "two_layer_fc_test()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_Gbp1mR7UyOA" + }, + "source": [ + "### Barebones PyTorch: Three-Layer ConvNet\n", + "\n", + "Here you will complete the implementation of the function `three_layer_convnet`, which will perform the forward pass of a three-layer convolutional network. As above, we can immediately test our implementation by passing zeros through the network. The network should have the following architecture:\n", + "\n", + "1. A convolutional layer (with bias) with `channel_1` filters, each with shape `KW1 x KH1`, and zero-padding of two\n", + "2. ReLU nonlinearity\n", + "3. A convolutional layer (with bias) with `channel_2` filters, each with shape `KW2 x KH2`, and zero-padding of one\n", + "4. ReLU nonlinearity\n", + "5. Fully-connected layer with bias, producing scores for C classes.\n", + "\n", + "Note that we have **no softmax activation** here after our fully-connected layer: this is because PyTorch's cross entropy loss performs a softmax activation for you, and bundling that step in makes computation more efficient.\n", + "\n", + "**HINT**: For convolutions: http://pytorch.org/docs/stable/nn.html#torch.nn.functional.conv2d; pay attention to the shapes of convolutional filters!" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OwNP4NleUyOA" + }, + "outputs": [], + "source": [ + "def three_layer_convnet(x, params):\n", + " \"\"\"\n", + " Performs the forward pass of a three-layer convolutional network with the\n", + " architecture defined above.\n", + "\n", + " Inputs:\n", + " - x: A PyTorch Tensor of shape (N, 3, H, W) giving a minibatch of images\n", + " - params: A list of PyTorch Tensors giving the weights and biases for the\n", + " network; should contain the following:\n", + " - conv_w1: PyTorch Tensor of shape (channel_1, 3, KH1, KW1) giving weights\n", + " for the first convolutional layer\n", + " - conv_b1: PyTorch Tensor of shape (channel_1,) giving biases for the first\n", + " convolutional layer\n", + " - conv_w2: PyTorch Tensor of shape (channel_2, channel_1, KH2, KW2) giving\n", + " weights for the second convolutional layer\n", + " - conv_b2: PyTorch Tensor of shape (channel_2,) giving biases for the second\n", + " convolutional layer\n", + " - fc_w: PyTorch Tensor giving weights for the fully-connected layer. Can you\n", + " figure out what the shape should be?\n", + " - fc_b: PyTorch Tensor giving biases for the fully-connected layer. Can you\n", + " figure out what the shape should be?\n", + "\n", + " Returns:\n", + " - scores: PyTorch Tensor of shape (N, C) giving classification scores for x\n", + " \"\"\"\n", + " conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b = params\n", + " scores = None\n", + " ################################################################################\n", + " # TODO: Implement the forward pass for the three-layer ConvNet. 
#\n", + " ################################################################################\n", + " # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "\n", + " conv1 = F.conv2d(x, weight=conv_w1, bias=conv_b1, padding=2)\n", + " relu1 = F.relu(conv1)\n", + " conv2 = F.conv2d(relu1, weight=conv_w2, bias=conv_b2, padding=1)\n", + " relu2 = F.relu(conv2)\n", + " relu2_flat = flatten(relu2)\n", + " scores = relu2_flat.mm(fc_w) + fc_b\n", + "\n", + " # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + " ################################################################################\n", + " # END OF YOUR CODE #\n", + " ################################################################################\n", + " return scores" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jgzyYemtUyOB" + }, + "source": [ + "After defining the forward pass of the ConvNet above, run the following cell to test your implementation.\n", + "\n", + "When you run this function, scores should have shape (64, 10)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "test": "barebones_output_shape", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Goe1XLttUyOB", + "outputId": "2fb93952-2b79-47e5-8aaf-eb9d44c47e01" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "torch.Size([64, 10])\n" + ] + } + ], + "source": [ + "def three_layer_convnet_test():\n", + " x = torch.zeros((64, 3, 32, 32), dtype=dtype) # minibatch size 64, image size [3, 32, 32]\n", + "\n", + " conv_w1 = torch.zeros((6, 3, 5, 5), dtype=dtype) # [out_channel, in_channel, kernel_H, kernel_W]\n", + " conv_b1 = torch.zeros((6,)) # out_channel\n", + " conv_w2 = torch.zeros((9, 6, 3, 3), dtype=dtype) # [out_channel, in_channel, kernel_H, kernel_W]\n", + " conv_b2 = torch.zeros((9,)) # out_channel\n", + "\n", + " # you must calculate the shape of the tensor after two conv layers, before the fully-connected layer\n", + " fc_w = torch.zeros((9 * 32 * 32, 10))\n", + " fc_b = torch.zeros(10)\n", + "\n", + " scores = three_layer_convnet(x, [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b])\n", + " print(scores.size()) # you should see [64, 10]\n", + "three_layer_convnet_test()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Tu5m0VJsUyOB" + }, + "source": [ + "### Barebones PyTorch: Initialization\n", + "Let's write a couple of utility functions to initialize the weight matrices for our models.\n", + "\n", + "- `random_weight(shape)` initializes a weight tensor with the Kaiming normal initialization method.\n", + "- `zero_weight(shape)` initializes a weight tensor with all zeros. 
Useful for instantiating bias parameters.\n", + "\n", + "The `random_weight` function uses the Kaiming normal initialization method, described in:\n", + "\n", + "He et al, *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*, ICCV 2015, https://arxiv.org/abs/1502.01852" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "eaFeJ2zkUyOB", + "outputId": "6e2ba8cc-65ec-4df5-85ba-22ce464f86b1" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "tensor([[ 0.2756, 0.6988, 0.7051, -1.0388, -0.6274],\n", + " [ 0.5087, 1.6098, -0.6274, 0.1564, -0.5640],\n", + " [-0.5564, 0.1941, -0.3109, 1.0692, 1.2826]], requires_grad=True)" + ] + }, + "metadata": {}, + "execution_count": 8 + } + ], + "source": [ + "def random_weight(shape):\n", + " \"\"\"\n", + " Create random Tensors for weights; setting requires_grad=True means that we\n", + " want to compute gradients for these Tensors during the backward pass.\n", + " We use Kaiming normalization: sqrt(2 / fan_in)\n", + " \"\"\"\n", + " if len(shape) == 2: # FC weight\n", + " fan_in = shape[0]\n", + " else:\n", + " fan_in = np.prod(shape[1:]) # conv weight [out_channel, in_channel, kH, kW]\n", + " # randn is standard normal distribution generator.\n", + " w = torch.randn(shape, device=device, dtype=dtype) * np.sqrt(2. 
/ fan_in)\n", + " w.requires_grad = True\n", + " return w\n", + "\n", + "def zero_weight(shape):\n", + " return torch.zeros(shape, device=device, dtype=dtype, requires_grad=True)\n", + "\n", + "# create a weight of shape [3 x 5]\n", + "# you should see the type `torch.cuda.FloatTensor` if you use GPU.\n", + "# Otherwise it should be `torch.FloatTensor`\n", + "random_weight((3, 5))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nqzxxcnjUyOC" + }, + "source": [ + "### Barebones PyTorch: Check Accuracy\n", + "When training the model we will use the following function to check the accuracy of our model on the training or validation sets.\n", + "\n", + "When checking accuracy we don't need to compute any gradients; as a result we don't need PyTorch to build a computational graph for us when we compute scores. To prevent a graph from being built we scope our computation under a `torch.no_grad()` context manager." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "id": "dMcKcF2rUyOC" + }, + "outputs": [], + "source": [ + "def check_accuracy_part2(loader, model_fn, params):\n", + " \"\"\"\n", + " Check the accuracy of a classification model.\n", + "\n", + " Inputs:\n", + " - loader: A DataLoader for the data split we want to check\n", + " - model_fn: A function that performs the forward pass of the model,\n", + " with the signature scores = model_fn(x, params)\n", + " - params: List of PyTorch Tensors giving parameters of the model\n", + "\n", + " Returns: Nothing, but prints the accuracy of the model\n", + " \"\"\"\n", + " split = 'val' if loader.dataset.train else 'test'\n", + " print('Checking accuracy on the %s set' % split)\n", + " num_correct, num_samples = 0, 0\n", + " with torch.no_grad():\n", + " for x, y in loader:\n", + " x = x.to(device=device, dtype=dtype) # move to device, e.g. 
GPU\n", + " y = y.to(device=device, dtype=torch.int64)\n", + " scores = model_fn(x, params)\n", + " _, preds = scores.max(1)\n", + " num_correct += (preds == y).sum()\n", + " num_samples += preds.size(0)\n", + " acc = float(num_correct) / num_samples\n", + " print('Got %d / %d correct (%.2f%%)' % (num_correct, num_samples, 100 * acc))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BfASB7kUUyOC" + }, + "source": [ + "### BareBones PyTorch: Training Loop\n", + "We can now set up a basic training loop to train our network. We will train the model using stochastic gradient descent without momentum. We will use `torch.nn.functional.cross_entropy` to compute the loss; you can [read about it here](http://pytorch.org/docs/stable/nn.html#cross-entropy).\n", + "\n", + "The training loop takes as input the neural network function, a list of initialized parameters (`[w1, w2]` in our example), and a learning rate." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "tags": [ + "pdf-ignore-input" + ], + "id": "N_z54s43UyOD" + }, + "outputs": [], + "source": [ + "def train_part2(model_fn, params, learning_rate):\n", + " \"\"\"\n", + " Train a model on CIFAR-10.\n", + "\n", + " Inputs:\n", + " - model_fn: A Python function that performs the forward pass of the model.\n", + " It should have the signature scores = model_fn(x, params) where x is a\n", + " PyTorch Tensor of image data, params is a list of PyTorch Tensors giving\n", + " model weights, and scores is a PyTorch Tensor of shape (N, C) giving\n", + " scores for the elements in x.\n", + " - params: List of PyTorch Tensors giving weights for the model\n", + " - learning_rate: Python scalar giving the learning rate to use for SGD\n", + "\n", + " Returns: Nothing\n", + " \"\"\"\n", + " for t, (x, y) in enumerate(loader_train):\n", + " # Move the data to the proper device (GPU or CPU)\n", + " x = x.to(device=device, dtype=dtype)\n", + " y = y.to(device=device, dtype=torch.long)\n", +
"\n", + " # Forward pass: compute scores and loss\n", + " scores = model_fn(x, params)\n", + " loss = F.cross_entropy(scores, y)\n", + "\n", + " # Backward pass: PyTorch figures out which Tensors in the computational\n", + " # graph has requires_grad=True and uses backpropagation to compute the\n", + " # gradient of the loss with respect to these Tensors, and stores the\n", + " # gradients in the .grad attribute of each Tensor.\n", + " loss.backward()\n", + "\n", + " # Update parameters. We don't want to backpropagate through the\n", + " # parameter updates, so we scope the updates under a torch.no_grad()\n", + " # context manager to prevent a computational graph from being built.\n", + " with torch.no_grad():\n", + " for w in params:\n", + " w -= learning_rate * w.grad\n", + "\n", + " # Manually zero the gradients after running the backward pass\n", + " w.grad.zero_()\n", + "\n", + " if t % print_every == 0:\n", + " print('Iteration %d, loss = %.4f' % (t, loss.item()))\n", + " check_accuracy_part2(loader_val, model_fn, params)\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "thb69nw_UyOD" + }, + "source": [ + "### BareBones PyTorch: Train a Two-Layer Network\n", + "Now we are ready to run the training loop. We need to explicitly allocate tensors for the fully connected weights, `w1` and `w2`.\n", + "\n", + "Each minibatch of CIFAR has 64 examples, so the tensor shape is `[64, 3, 32, 32]`.\n", + "\n", + "After flattening, `x` shape should be `[64, 3 * 32 * 32]`. This will be the size of the first dimension of `w1`.\n", + "The second dimension of `w1` is the hidden layer size, which will also be the first dimension of `w2`.\n", + "\n", + "Finally, the output of the network is a 10-dimensional vector that represents the probability distribution over 10 classes.\n", + "\n", + "You don't need to tune any hyperparameters but you should see accuracies above 40% after training for one epoch." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "vUHZGfkyUyOD", + "outputId": "54403c3c-e3db-49ac-966e-0c9b5e424f0b" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Iteration 0, loss = 3.5433\n", + "Checking accuracy on the val set\n", + "Got 161 / 1000 correct (16.10%)\n", + "\n", + "Iteration 100, loss = 1.9514\n", + "Checking accuracy on the val set\n", + "Got 341 / 1000 correct (34.10%)\n", + "\n", + "Iteration 200, loss = 2.0469\n", + "Checking accuracy on the val set\n", + "Got 306 / 1000 correct (30.60%)\n", + "\n", + "Iteration 300, loss = 1.7676\n", + "Checking accuracy on the val set\n", + "Got 413 / 1000 correct (41.30%)\n", + "\n", + "Iteration 400, loss = 1.7364\n", + "Checking accuracy on the val set\n", + "Got 409 / 1000 correct (40.90%)\n", + "\n", + "Iteration 500, loss = 1.6219\n", + "Checking accuracy on the val set\n", + "Got 429 / 1000 correct (42.90%)\n", + "\n", + "Iteration 600, loss = 2.0924\n", + "Checking accuracy on the val set\n", + "Got 415 / 1000 correct (41.50%)\n", + "\n", + "Iteration 700, loss = 1.9333\n", + "Checking accuracy on the val set\n", + "Got 411 / 1000 correct (41.10%)\n", + "\n" + ] + } + ], + "source": [ + "hidden_layer_size = 4000\n", + "learning_rate = 1e-2\n", + "\n", + "w1 = random_weight((3 * 32 * 32, hidden_layer_size))\n", + "w2 = random_weight((hidden_layer_size, 10))\n", + "\n", + "train_part2(two_layer_fc, [w1, w2], learning_rate)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K7zRNB0BUyOD" + }, + "source": [ + "### BareBones PyTorch: Training a ConvNet\n", + "\n", + "Below, you should use the functions defined above to train a three-layer convolutional network on CIFAR. The network should have the following architecture:\n", + "\n", + "1. Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2\n", + "2. ReLU\n", + "3.
Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1\n", + "4. ReLU\n", + "5. Fully-connected layer (with bias) to compute scores for 10 classes\n", + "\n", + "You should initialize your weight matrices using the `random_weight` function defined above, and you should initialize your bias vectors using the `zero_weight` function above.\n", + "\n", + "You don't need to tune any hyperparameters, but if everything works correctly you should achieve an accuracy above 42% after one epoch." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "test": "barebones_accuracy", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ELUIv164UyOD", + "outputId": "8d3ac589-8ed1-412e-8c0e-6e19c26279d6" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Iteration 0, loss = 2.8413\n", + "Checking accuracy on the val set\n", + "Got 108 / 1000 correct (10.80%)\n", + "\n", + "Iteration 100, loss = 1.8054\n", + "Checking accuracy on the val set\n", + "Got 367 / 1000 correct (36.70%)\n", + "\n", + "Iteration 200, loss = 1.7080\n", + "Checking accuracy on the val set\n", + "Got 416 / 1000 correct (41.60%)\n", + "\n", + "Iteration 300, loss = 1.4180\n", + "Checking accuracy on the val set\n", + "Got 416 / 1000 correct (41.60%)\n", + "\n", + "Iteration 400, loss = 1.6381\n", + "Checking accuracy on the val set\n", + "Got 461 / 1000 correct (46.10%)\n", + "\n", + "Iteration 500, loss = 1.5293\n", + "Checking accuracy on the val set\n", + "Got 458 / 1000 correct (45.80%)\n", + "\n", + "Iteration 600, loss = 1.4177\n", + "Checking accuracy on the val set\n", + "Got 473 / 1000 correct (47.30%)\n", + "\n", + "Iteration 700, loss = 1.4980\n", + "Checking accuracy on the val set\n", + "Got 489 / 1000 correct (48.90%)\n", + "\n" + ] + } + ], + "source": [ + "learning_rate = 3e-3\n", + "\n", + "channel_1 = 32\n", + "channel_2 = 16\n", + "\n", + "conv_w1 = None\n", + "conv_b1 = None\n", + "conv_w2 = 
None\n", + "conv_b2 = None\n", + "fc_w = None\n", + "fc_b = None\n", + "\n", + "################################################################################\n", + "# TODO: Initialize the parameters of a three-layer ConvNet. #\n", + "################################################################################\n", + "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "\n", + "conv_w1 = random_weight((channel_1, 3, 5, 5))\n", + "conv_b1 = zero_weight((channel_1,))\n", + "conv_w2 = random_weight((channel_2, channel_1, 3, 3))\n", + "conv_b2 = zero_weight((channel_2,))\n", + "fc_w = random_weight((channel_2*32*32, 10))\n", + "fc_b = zero_weight((10,))\n", + "\n", + "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + "\n", + "params = [conv_w1, conv_b1, conv_w2, conv_b2, fc_w, fc_b]\n", + "train_part2(three_layer_convnet, params, learning_rate)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1QyKTSf6UyOD" + }, + "source": [ + "# Part III. PyTorch Module API\n", + "\n", + "Barebones PyTorch requires that we track all the parameter tensors by hand. This is fine for small networks with a few tensors, but it would be extremely inconvenient and error-prone to track tens or hundreds of tensors in larger networks.\n", + "\n", + "PyTorch provides the `nn.Module` API for you to define arbitrary network architectures, while tracking all learnable parameters for you. In Part II, we implemented SGD ourselves. PyTorch also provides the `torch.optim` package that implements all the common optimizers, such as RMSProp, Adagrad, and Adam. It even supports approximate second-order methods like L-BFGS!
You can refer to the [doc](http://pytorch.org/docs/master/optim.html) for the exact specifications of each optimizer.\n", + "\n", + "To use the Module API, follow the steps below:\n", + "\n", + "1. Subclass `nn.Module`. Give your network class an intuitive name like `TwoLayerFC`.\n", + "\n", + "2. In the constructor `__init__()`, define all the layers you need as class attributes. Layer objects like `nn.Linear` and `nn.Conv2d` are themselves `nn.Module` subclasses and contain learnable parameters, so you don't have to instantiate the raw tensors yourself. `nn.Module` will track these internal parameters for you. Refer to the [doc](http://pytorch.org/docs/master/nn.html) to learn more about the dozens of built-in layers. **Warning**: don't forget to call `super().__init__()` first!\n", + "\n", + "3. In the `forward()` method, define the *connectivity* of your network. You should use the attributes defined in `__init__` as function calls that take a tensor as input and return the \"transformed\" tensor. Do *not* create any new layers with learnable parameters in `forward()`!
All of them must be declared upfront in `__init__`.\n", + "\n", + "After you define your Module subclass, you can instantiate it as an object and call it just like the NN forward function in Part II.\n", + "\n", + "### Module API: Two-Layer Network\n", + "Here is a concrete example of a two-layer fully connected network:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9AO25FPdUyOE", + "outputId": "fb79a0d0-0ae3-4232-ae52-6c6ece5d696e" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "torch.Size([64, 10])\n" + ] + } + ], + "source": [ + "class TwoLayerFC(nn.Module):\n", + " def __init__(self, input_size, hidden_size, num_classes):\n", + " super().__init__()\n", + " # assign layer objects to class attributes\n", + " self.fc1 = nn.Linear(input_size, hidden_size)\n", + " # nn.init package contains convenient initialization methods\n", + " # http://pytorch.org/docs/master/nn.html#torch-nn-init\n", + " nn.init.kaiming_normal_(self.fc1.weight)\n", + " self.fc2 = nn.Linear(hidden_size, num_classes)\n", + " nn.init.kaiming_normal_(self.fc2.weight)\n", + "\n", + " def forward(self, x):\n", + " # forward always defines connectivity\n", + " x = flatten(x)\n", + " scores = self.fc2(F.relu(self.fc1(x)))\n", + " return scores\n", + "\n", + "def test_TwoLayerFC():\n", + " input_size = 50\n", + " x = torch.zeros((64, input_size), dtype=dtype) # minibatch size 64, feature dimension 50\n", + " model = TwoLayerFC(input_size, 42, 10)\n", + " scores = model(x)\n", + " print(scores.size()) # you should see [64, 10]\n", + "test_TwoLayerFC()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UD5xHzPpUyOE" + }, + "source": [ + "### Module API: Three-Layer ConvNet\n", + "It's your turn to implement a three-layer ConvNet: two convolutional layers followed by a fully connected layer. The network architecture should be the same as in Part II:\n", + "\n", + "1.
Convolutional layer with `channel_1` 5x5 filters with zero-padding of 2\n", + "2. ReLU\n", + "3. Convolutional layer with `channel_2` 3x3 filters with zero-padding of 1\n", + "4. ReLU\n", + "5. Fully-connected layer to `num_classes` classes\n", + "\n", + "You should initialize the weight matrices of the model using the Kaiming normal initialization method.\n", + "\n", + "**HINT**: http://pytorch.org/docs/stable/nn.html#conv2d\n", + "\n", + "After you implement the three-layer ConvNet, the `test_ThreeLayerConvNet` function will run your implementation; it should print `torch.Size([64, 10])` for the shape of the output scores." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "test": "module_output_shape", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "XcHcf9oQUyOE", + "outputId": "38bfa47a-4350-4260-b4e1-e472dc10ac3b" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "torch.Size([64, 10])\n" + ] + } + ], + "source": [ + "class ThreeLayerConvNet(nn.Module):\n", + " def __init__(self, in_channel, channel_1, channel_2, num_classes):\n", + " super().__init__()\n", + " ########################################################################\n", + " # TODO: Set up the layers you need for a three-layer ConvNet with the #\n", + " # architecture defined above.
#\n", + " ########################################################################\n", + " # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "\n", + " self.conv1 = nn.Conv2d(in_channel, channel_1, kernel_size=5, padding=2, bias=True)\n", + " nn.init.kaiming_normal_(self.conv1.weight)\n", + " nn.init.constant_(self.conv1.bias, 0)\n", + "\n", + " self.conv2 = nn.Conv2d(channel_1, channel_2, kernel_size=3, padding=1, bias=True)\n", + " nn.init.kaiming_normal_(self.conv2.weight)\n", + " nn.init.constant_(self.conv2.bias, 0)\n", + "\n", + " self.fc = nn.Linear(channel_2*32*32, num_classes)\n", + " nn.init.kaiming_normal_(self.fc.weight)\n", + " nn.init.constant_(self.fc.bias, 0)\n", + "\n", + " # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + " ########################################################################\n", + " # END OF YOUR CODE #\n", + " ########################################################################\n", + "\n", + " def forward(self, x):\n", + " scores = None\n", + " ########################################################################\n", + " # TODO: Implement the forward function for a 3-layer ConvNet. 
you #\n", + " # should use the layers you defined in __init__ and specify the #\n", + " # connectivity of those layers in forward() #\n", + " ########################################################################\n", + " # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "\n", + " relu1 = F.relu(self.conv1(x))\n", + " relu2 = F.relu(self.conv2(relu1))\n", + " scores = self.fc(flatten(relu2))\n", + "\n", + " # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + " ########################################################################\n", + " # END OF YOUR CODE #\n", + " ########################################################################\n", + " return scores\n", + "\n", + "\n", + "def test_ThreeLayerConvNet():\n", + " x = torch.zeros((64, 3, 32, 32), dtype=dtype) # minibatch size 64, image size [3, 32, 32]\n", + " model = ThreeLayerConvNet(in_channel=3, channel_1=12, channel_2=8, num_classes=10)\n", + " scores = model(x)\n", + " print(scores.size()) # you should see [64, 10]\n", + "test_ThreeLayerConvNet()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AmtCNhxoUyOE" + }, + "source": [ + "### Module API: Check Accuracy\n", + "Given the validation or test set, we can check the classification accuracy of a neural network.\n", + "\n", + "This version is slightly different from the one in part II. You don't manually pass in the parameters anymore." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Pt-k9_eWUyOF" + }, + "outputs": [], + "source": [ + "def check_accuracy_part34(loader, model):\n", + " if loader.dataset.train:\n", + " print('Checking accuracy on validation set')\n", + " else:\n", + " print('Checking accuracy on test set')\n", + " num_correct = 0\n", + " num_samples = 0\n", + " model.eval() # set model to evaluation mode\n", + " with torch.no_grad():\n", + " for x, y in loader:\n", + " x = x.to(device=device, dtype=dtype) # move to device, e.g. 
GPU\n", + " y = y.to(device=device, dtype=torch.long)\n", + " scores = model(x)\n", + " _, preds = scores.max(1)\n", + " num_correct += (preds == y).sum()\n", + " num_samples += preds.size(0)\n", + " acc = float(num_correct) / num_samples\n", + " print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I4QmW61lUyOF" + }, + "source": [ + "### Module API: Training Loop\n", + "We also use a slightly different training loop. Rather than updating the values of the weights ourselves, we use an Optimizer object from the `torch.optim` package, which abstracts the notion of an optimization algorithm and provides implementations of most of the algorithms commonly used to optimize neural networks." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hqRJUk8kUyOF" + }, + "outputs": [], + "source": [ + "def train_part34(model, optimizer, epochs=1):\n", + " \"\"\"\n", + " Train a model on CIFAR-10 using the PyTorch Module API.\n", + "\n", + " Inputs:\n", + " - model: A PyTorch Module giving the model to train.\n", + " - optimizer: An Optimizer object we will use to train the model\n", + " - epochs: (Optional) A Python integer giving the number of epochs to train for\n", + "\n", + " Returns: Nothing, but prints model accuracies during training.\n", + " \"\"\"\n", + " model = model.to(device=device) # move the model parameters to CPU/GPU\n", + " for e in range(epochs):\n", + " for t, (x, y) in enumerate(loader_train):\n", + " model.train() # put model to training mode\n", + " x = x.to(device=device, dtype=dtype) # move to device, e.g.
GPU\n", + " y = y.to(device=device, dtype=torch.long)\n", + "\n", + " scores = model(x)\n", + " loss = F.cross_entropy(scores, y)\n", + "\n", + " # Zero out all of the gradients for the variables which the optimizer\n", + " # will update.\n", + " optimizer.zero_grad()\n", + "\n", + " # This is the backwards pass: compute the gradient of the loss with\n", + " # respect to each parameter of the model.\n", + " loss.backward()\n", + "\n", + " # Actually update the parameters of the model using the gradients\n", + " # computed by the backwards pass.\n", + " optimizer.step()\n", + "\n", + " if t % print_every == 0:\n", + " print('Iteration %d, loss = %.4f' % (t, loss.item()))\n", + " check_accuracy_part34(loader_val, model)\n", + " print()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Q3rhB6LWUyOF" + }, + "source": [ + "### Module API: Train a Two-Layer Network\n", + "Now we are ready to run the training loop. In contrast to part II, we don't explicitly allocate parameter tensors anymore.\n", + "\n", + "Simply pass the input size, hidden layer size, and number of classes (i.e. output size) to the constructor of `TwoLayerFC`.\n", + "\n", + "You also need to define an optimizer that tracks all the learnable parameters inside `TwoLayerFC`.\n", + "\n", + "You don't need to tune any hyperparameters, but you should see model accuracies above 40% after training for one epoch." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "E0LpnI-GUyOG", + "outputId": "00e5ef44-e875-4122-ea0c-e525bfd37f39" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Iteration 0, loss = 2.7778\n", + "Checking accuracy on validation set\n", + "Got 161 / 1000 correct (16.10)\n", + "\n", + "Iteration 100, loss = 2.3438\n", + "Checking accuracy on validation set\n", + "Got 320 / 1000 correct (32.00)\n", + "\n", + "Iteration 200, loss = 1.8845\n", + "Checking accuracy on validation set\n", + "Got 383 / 1000 correct (38.30)\n", + "\n", + "Iteration 300, loss = 1.6858\n", + "Checking accuracy on validation set\n", + "Got 390 / 1000 correct (39.00)\n", + "\n", + "Iteration 400, loss = 1.4889\n", + "Checking accuracy on validation set\n", + "Got 402 / 1000 correct (40.20)\n", + "\n", + "Iteration 500, loss = 1.7451\n", + "Checking accuracy on validation set\n", + "Got 441 / 1000 correct (44.10)\n", + "\n", + "Iteration 600, loss = 1.7219\n", + "Checking accuracy on validation set\n", + "Got 413 / 1000 correct (41.30)\n", + "\n", + "Iteration 700, loss = 1.5190\n", + "Checking accuracy on validation set\n", + "Got 433 / 1000 correct (43.30)\n", + "\n" + ] + } + ], + "source": [ + "hidden_layer_size = 4000\n", + "learning_rate = 1e-2\n", + "model = TwoLayerFC(3 * 32 * 32, hidden_layer_size, 10)\n", + "optimizer = optim.SGD(model.parameters(), lr=learning_rate)\n", + "\n", + "train_part34(model, optimizer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Kl3tLz9aUyOG" + }, + "source": [ + "### Module API: Train a Three-Layer ConvNet\n", + "You should now use the Module API to train a three-layer ConvNet on CIFAR. This should look very similar to training the two-layer network! 
You don't need to tune any hyperparameters, but you should achieve above 45% accuracy after training for one epoch.\n", + "\n", + "You should train the model using stochastic gradient descent without momentum." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "test": "module_accuracy", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0ahJ6sMLUyOG", + "outputId": "25630b9e-3b6b-4bf5-c57f-16ee95ef07ec" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Iteration 0, loss = 3.1185\n", + "Checking accuracy on validation set\n", + "Got 89 / 1000 correct (8.90)\n", + "\n", + "Iteration 100, loss = 1.5767\n", + "Checking accuracy on validation set\n", + "Got 354 / 1000 correct (35.40)\n", + "\n", + "Iteration 200, loss = 1.6303\n", + "Checking accuracy on validation set\n", + "Got 366 / 1000 correct (36.60)\n", + "\n", + "Iteration 300, loss = 1.8178\n", + "Checking accuracy on validation set\n", + "Got 411 / 1000 correct (41.10)\n", + "\n", + "Iteration 400, loss = 1.7186\n", + "Checking accuracy on validation set\n", + "Got 413 / 1000 correct (41.30)\n", + "\n", + "Iteration 500, loss = 1.7037\n", + "Checking accuracy on validation set\n", + "Got 443 / 1000 correct (44.30)\n", + "\n", + "Iteration 600, loss = 1.4189\n", + "Checking accuracy on validation set\n", + "Got 460 / 1000 correct (46.00)\n", + "\n", + "Iteration 700, loss = 1.5092\n", + "Checking accuracy on validation set\n", + "Got 465 / 1000 correct (46.50)\n", + "\n" + ] + } + ], + "source": [ + "learning_rate = 3e-3\n", + "channel_1 = 32\n", + "channel_2 = 16\n", + "\n", + "model = None\n", + "optimizer = None\n", + "################################################################################\n", + "# TODO: Instantiate your ThreeLayerConvNet model and a corresponding optimizer #\n", + "################################################################################\n", + "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS
LINE)*****\n", + "\n", + "model = ThreeLayerConvNet(3, channel_1, channel_2, 10)\n", + "optimizer = optim.SGD(model.parameters(), lr=learning_rate)\n", + "\n", + "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + "\n", + "train_part34(model, optimizer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rbGjpixRUyOL" + }, + "source": [ + "# Part IV. PyTorch Sequential API\n", + "\n", + "Part III introduced the PyTorch Module API, which allows you to define arbitrary learnable layers and their connectivity.\n", + "\n", + "For simple models like a stack of feed-forward layers, you still need to go through 3 steps: subclass `nn.Module`, assign layers to class attributes in `__init__`, and call each layer one by one in `forward()`. Is there a more convenient way?\n", + "\n", + "Fortunately, PyTorch provides a container Module called `nn.Sequential`, which merges the above steps into one. It is not as flexible as `nn.Module`, because you cannot specify a more complex topology than a feed-forward stack, but it's good enough for many use cases.\n", + "\n", + "### Sequential API: Two-Layer Network\n", + "Let's see how to rewrite our two-layer fully connected network example with `nn.Sequential`, and train it using the training loop defined above.\n", + "\n", + "Again, you don't need to tune any hyperparameters here, but you should achieve above 40% accuracy after one epoch of training."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "d9ZX-PIFUyOM", + "outputId": "350f149d-351b-436e-b80d-badbdea12f98" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Iteration 0, loss = 2.3044\n", + "Checking accuracy on validation set\n", + "Got 143 / 1000 correct (14.30)\n", + "\n", + "Iteration 100, loss = 1.7458\n", + "Checking accuracy on validation set\n", + "Got 368 / 1000 correct (36.80)\n", + "\n", + "Iteration 200, loss = 1.7914\n", + "Checking accuracy on validation set\n", + "Got 409 / 1000 correct (40.90)\n", + "\n", + "Iteration 300, loss = 1.5567\n", + "Checking accuracy on validation set\n", + "Got 430 / 1000 correct (43.00)\n", + "\n", + "Iteration 400, loss = 1.9446\n", + "Checking accuracy on validation set\n", + "Got 433 / 1000 correct (43.30)\n", + "\n", + "Iteration 500, loss = 1.7955\n", + "Checking accuracy on validation set\n", + "Got 419 / 1000 correct (41.90)\n", + "\n", + "Iteration 600, loss = 1.5037\n", + "Checking accuracy on validation set\n", + "Got 450 / 1000 correct (45.00)\n", + "\n", + "Iteration 700, loss = 1.7519\n", + "Checking accuracy on validation set\n", + "Got 436 / 1000 correct (43.60)\n", + "\n" + ] + } + ], + "source": [ + "# We need to wrap `flatten` function in a module in order to stack it\n", + "# in nn.Sequential\n", + "class Flatten(nn.Module):\n", + " def forward(self, x):\n", + " return flatten(x)\n", + "\n", + "hidden_layer_size = 4000\n", + "learning_rate = 1e-2\n", + "\n", + "model = nn.Sequential(\n", + " Flatten(),\n", + " nn.Linear(3 * 32 * 32, hidden_layer_size),\n", + " nn.ReLU(),\n", + " nn.Linear(hidden_layer_size, 10),\n", + ")\n", + "\n", + "# you can use Nesterov momentum in optim.SGD\n", + "optimizer = optim.SGD(model.parameters(), lr=learning_rate,\n", + " momentum=0.9, nesterov=True)\n", + "\n", + "train_part34(model, optimizer)" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "EpSb3WojUyOM" + }, + "source": [ + "### Sequential API: Three-Layer ConvNet\n", + "Here you should use `nn.Sequential` to define and train a three-layer ConvNet with the same architecture we used in Part III:\n", + "\n", + "1. Convolutional layer (with bias) with 32 5x5 filters, with zero-padding of 2\n", + "2. ReLU\n", + "3. Convolutional layer (with bias) with 16 3x3 filters, with zero-padding of 1\n", + "4. ReLU\n", + "5. Fully-connected layer (with bias) to compute scores for 10 classes\n", + "\n", + "You can use the default PyTorch weight initialization.\n", + "\n", + "You should optimize your model using stochastic gradient descent with Nesterov momentum 0.9.\n", + "\n", + "Again, you don't need to tune any hyperparameters but you should see accuracy above 55% after one epoch of training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "test": "sequential_accuracy", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "8WvN-HqcUyOM", + "outputId": "ab9524cf-5922-413a-a781-7b1e24e99e45" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Iteration 0, loss = 2.3104\n", + "Checking accuracy on validation set\n", + "Got 126 / 1000 correct (12.60)\n", + "\n", + "Iteration 100, loss = 1.8362\n", + "Checking accuracy on validation set\n", + "Got 414 / 1000 correct (41.40)\n", + "\n", + "Iteration 200, loss = 1.4738\n", + "Checking accuracy on validation set\n", + "Got 469 / 1000 correct (46.90)\n", + "\n", + "Iteration 300, loss = 1.3190\n", + "Checking accuracy on validation set\n", + "Got 531 / 1000 correct (53.10)\n", + "\n", + "Iteration 400, loss = 1.2590\n", + "Checking accuracy on validation set\n", + "Got 529 / 1000 correct (52.90)\n", + "\n", + "Iteration 500, loss = 1.3656\n", + "Checking accuracy on validation set\n", + "Got 528 / 1000 correct (52.80)\n", + "\n", + "Iteration 600, loss = 1.1432\n", + "Checking accuracy on validation 
set\n", + "Got 572 / 1000 correct (57.20)\n", + "\n", + "Iteration 700, loss = 1.5138\n", + "Checking accuracy on validation set\n", + "Got 578 / 1000 correct (57.80)\n", + "\n" + ] + } + ], + "source": [ + "channel_1 = 32\n", + "channel_2 = 16\n", + "learning_rate = 1e-2\n", + "\n", + "model = None\n", + "optimizer = None\n", + "\n", + "################################################################################\n", + "# TODO: Rewrite the 2-layer ConvNet with bias from Part III with the #\n", + "# Sequential API. #\n", + "################################################################################\n", + "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "\n", + "model = nn.Sequential(nn.Conv2d(3, channel_1, 5, padding = 2),\n", + " nn.ReLU(),\n", + " nn.Conv2d(channel_1, channel_2, 3, padding = 1),\n", + " nn.ReLU(),\n", + " Flatten(),\n", + " nn.Linear(channel_2 * 32 * 32, 10))\n", + "\n", + "optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9, nesterov=True)\n", + "\n", + "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################\n", + "\n", + "train_part34(model, optimizer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aqOl0qLiUyOM" + }, + "source": [ + "# Part V. CIFAR-10 open-ended challenge\n", + "\n", + "In this section, you can experiment with whatever ConvNet architecture you'd like on CIFAR-10.\n", + "\n", + "Now it's your job to experiment with architectures, hyperparameters, loss functions, and optimizers to train a model that achieves **at least 70%** accuracy on the CIFAR-10 **validation** set within 10 epochs. You can use the check_accuracy and train functions from above. 
You can use either the `nn.Module` or the `nn.Sequential` API.\n", + "\n", + "Describe what you did at the end of this notebook.\n", + "\n", + "Here is the official API documentation for each component. One note: what we call \"spatial batch norm\" in class is called `BatchNorm2d` in PyTorch.\n", + "\n", + "* Layers in torch.nn package: http://pytorch.org/docs/stable/nn.html\n", + "* Activations: http://pytorch.org/docs/stable/nn.html#non-linear-activations\n", + "* Loss functions: http://pytorch.org/docs/stable/nn.html#loss-functions\n", + "* Optimizers: http://pytorch.org/docs/stable/optim.html\n", + "\n", + "\n", + "### Things you might try:\n", + "- **Filter size**: Above we used 5x5; would smaller filters be more efficient?\n", + "- **Number of filters**: Above we used 32 filters. Do more or fewer do better?\n", + "- **Pooling vs Strided Convolution**: Do you use max pooling or just strided convolutions?\n", + "- **Batch normalization**: Try adding spatial batch normalization after convolution layers and vanilla batch normalization after affine layers. Do your networks train faster?\n", + "- **Network architecture**: The network above has three layers of trainable parameters. Can you do better with a deeper network? Good architectures to try include:\n", + " - [conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]\n", + " - [conv-relu-conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]\n", + " - [batchnorm-relu-conv]xN -> [affine]xM -> [softmax or SVM]\n", + "- **Global Average Pooling**: Instead of flattening and then having multiple affine layers, perform convolutions until your image gets small (7x7 or so) and then perform an average pooling operation to get a 1x1 feature map of shape (1, 1, Filter#), which is then reshaped into a (Filter#) vector.
This is used in [Google's Inception Network](https://arxiv.org/abs/1512.00567) (See Table 1 for their architecture).\n", + "- **Regularization**: Add L2 weight regularization, or perhaps use Dropout.\n", + "\n", + "### Tips for training\n", + "For each network architecture that you try, you should tune the learning rate and other hyperparameters. When doing this there are a couple of important things to keep in mind:\n", + "\n", + "- If the parameters are working well, you should see improvement within a few hundred iterations.\n", + "- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.\n", + "- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.\n", + "- You should use the validation set for hyperparameter search, and save your test set for evaluating your architecture on the best parameters as selected by the validation set.\n", + "\n", + "### Going above and beyond\n", + "If you are feeling adventurous there are many other features you can implement to try and improve your performance. 
You are **not required** to implement any of these, but don't miss the fun if you have time!\n", + "\n", + "- Alternative optimizers: you can try Adam, Adagrad, RMSprop, etc.\n", + "- Alternative activation functions such as leaky ReLU, parametric ReLU, ELU, or MaxOut.\n", + "- Model ensembles\n", + "- Data augmentation\n", + "- New Architectures\n", + " - [ResNets](https://arxiv.org/abs/1512.03385) where the input from the previous layer is added to the output.\n", + " - [DenseNets](https://arxiv.org/abs/1608.06993) where the outputs of all previous layers are concatenated together and fed in as the input.\n", + " - [This blog has an in-depth overview](https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32)\n", + "\n", + "### Have fun and happy training!" + ] + }, + { + "cell_type": "code", + "source": [ + "################################################################################\n", + "# TODO: #\n", + "# Experiment with any architectures, optimizers, and hyperparameters. #\n", + "# Achieve AT LEAST 70% accuracy on the *validation set* within 10 epochs. #\n", + "# #\n", + "# Note that you can use the check_accuracy function to evaluate on either #\n", + "# the test set or the validation set, by passing either loader_test or #\n", + "# loader_val as the second argument to check_accuracy. You should not touch #\n", + "# the test set until you have finished your architecture and hyperparameter #\n", + "# tuning, and only run the test set once at the end to report a final value. 
#\n", + "################################################################################\n", + "model = None\n", + "optimizer = None\n", + "\n", + "# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "\n", + "num_classes = 10\n", + "\n", + "conv1 = nn.Sequential(nn.Conv2d(3, 16, kernel_size=5, padding=2),\n", + " nn.BatchNorm2d(16),\n", + " nn.ReLU(),\n", + " nn.MaxPool2d(2))\n", + "\n", + "conv2 = nn.Sequential(nn.Conv2d(16, 32, kernel_size=3, padding=1),\n", + " nn.BatchNorm2d(32),\n", + " nn.ReLU(),\n", + " nn.MaxPool2d(2))\n", + "\n", + "conv3 = nn.Sequential(nn.Conv2d(32, 64, kernel_size=3, padding=1),\n", + " nn.BatchNorm2d(64),\n", + " nn.ReLU(),\n", + " nn.MaxPool2d(2))\n", + "\n", + "fc = nn.Sequential(nn.Dropout(0.2, inplace=True),\n", + " nn.Linear(64*4*4, num_classes))\n", + "\n", + "model = nn.Sequential(conv1,\n", + " conv2,\n", + " conv3,\n", + " Flatten(),\n", + " fc)\n", + "\n", + "learning_rate = 1e-3\n", + "optimizer = optim.Adam(model.parameters(), lr=learning_rate)\n", + "\n", + "# Print training status every epoch: set print_every to a large number\n", + "print_every = 10000\n", + "\n", + "# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****\n", + "################################################################################\n", + "# END OF YOUR CODE\n", + "################################################################################\n", + "\n", + "# You should get at least 70% accuracy\n", + "train_part34(model, optimizer, epochs=10)" + ], + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "GqjkvkxQx68S", + "outputId": "927c5261-73d0-4f23-9a0a-0619c61aac66" + }, + "execution_count": null, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Iteration 0, loss = 2.5435\n", + "Checking accuracy on validation set\n", + "Got 104 / 1000 correct (10.40)\n", + "\n", + "Iteration 0, loss = 0.9266\n", + "Checking accuracy on validation set\n", + "Got 611 / 1000 
correct (61.10)\n", + "\n", + "Iteration 0, loss = 1.0116\n", + "Checking accuracy on validation set\n", + "Got 677 / 1000 correct (67.70)\n", + "\n", + "Iteration 0, loss = 0.9464\n", + "Checking accuracy on validation set\n", + "Got 684 / 1000 correct (68.40)\n", + "\n", + "Iteration 0, loss = 0.7033\n", + "Checking accuracy on validation set\n", + "Got 708 / 1000 correct (70.80)\n", + "\n", + "Iteration 0, loss = 1.1305\n", + "Checking accuracy on validation set\n", + "Got 703 / 1000 correct (70.30)\n", + "\n", + "Iteration 0, loss = 0.7823\n", + "Checking accuracy on validation set\n", + "Got 734 / 1000 correct (73.40)\n", + "\n", + "Iteration 0, loss = 0.5802\n", + "Checking accuracy on validation set\n", + "Got 724 / 1000 correct (72.40)\n", + "\n", + "Iteration 0, loss = 0.7206\n", + "Checking accuracy on validation set\n", + "Got 755 / 1000 correct (75.50)\n", + "\n", + "Iteration 0, loss = 0.5087\n", + "Checking accuracy on validation set\n", + "Got 761 / 1000 correct (76.10)\n", + "\n" + ] + } + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-inline" + ], + "id": "dksAoNI2UyOM" + }, + "source": [ + "## Describe what you did\n", + "\n", + "In the cell below you should write an explanation of what you did, any additional features that you implemented, and/or any graphs that you made in the process of training and evaluating your network." 
 + ] + }, + { + "cell_type": "markdown", + "metadata": { + "tags": [ + "pdf-inline" + ], + "id": "YMqzag5iUyON" + }, + "source": [ + "- Model architecture: \n", + "`(Conv -> BatchNorm -> ReLU -> MaxPool) * 3 -> fc`\n", + "\n", + "\n", + "- Filter sizes: 5x5, 3x3, 3x3\n", + "- Number of filters: 16, 32, 64\n", + "- Pooling: `2x2` Max pooling\n", + "- Normalization: Batch normalization\n", + "- Regularization: Dropout (p=0.2) before the final fully-connected layer\n", + "- Optimizer: Adam (learning rate 1e-3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XMva051WUyON" + }, + "source": [ + "## Test set -- run this only once\n", + "\n", + "Now that we've gotten a result we're happy with, we test our final model (which you should store in best_model) on the test set. Think about how this compares to your validation set accuracy." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "VaT6Zb-dUyON", + "outputId": "ab1811c2-eb32-4c8c-a583-b147a54071f0" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Checking accuracy on test set\n", + "Got 7440 / 10000 correct (74.40)\n" + ] + } + ], + "source": [ + "best_model = model\n", + "check_accuracy_part34(loader_test, best_model)" + ] + }, + { + "cell_type": "code", + "source": [], + "metadata": { + "id": "F5CuE2TNakuS" + }, + "execution_count": null, + "outputs": [] + } + ], + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "language_info": { + "name": "python" + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file