diff --git a/tutorials/W2D1_Macrocircuits/W2D1_Intro.ipynb b/tutorials/W2D1_Macrocircuits/W2D1_Intro.ipynb index 4f3ce7bdb..015c5bbe5 100644 --- a/tutorials/W2D1_Macrocircuits/W2D1_Intro.ipynb +++ b/tutorials/W2D1_Macrocircuits/W2D1_Intro.ipynb @@ -57,7 +57,9 @@ "source": [ "## Prerequisites\n", "\n", - "Materials of this day assume you have had the experience of model building in `pytorch` earlier. It would be beneficial too if you had the basics of Linear Algebra before as well as if you had played around with Actor-Critic model in Reinforcement Learning setup." + "In order to get the most out of today's tutorials, it would greatly help if you had experience building (simple) neural network models in PyTorch. We will also be using some concepts from Linear Algebra, so some familiarity with that domain will come in handy. Finally, we will be looking at a specific algorithm in Reinforcement Learning (RL) called the Actor-Critic model, so some prior exposure to RL would help. We touched a little bit on RL in W1D2 (\"Comparing Tasks\"), specifically in Tutorial 3 (\"Reinforcement Learning Across Temporal Scales\"). It could be good to refer back to that tutorial and to check out the two videos on Meta-RL in that tutorial notebook.\n", + "\n", + "Today is a little more technical and more theory-driven, but it will give you the skills and the appreciation needed to work with these very interesting ideas in NeuroAI. What we encourage you to keep in mind is how this knowledge helps you to appreciate the concept of generalization, the overarching theme of this entire course. Many points today will show how learning dynamics arrive at solutions that **generalize well**!"
] }, { @@ -210,7 +212,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/W2D1_Tutorial1.ipynb b/tutorials/W2D1_Macrocircuits/W2D1_Tutorial1.ipynb index b099c0d2d..05d3b1286 100644 --- a/tutorials/W2D1_Macrocircuits/W2D1_Tutorial1.ipynb +++ b/tutorials/W2D1_Macrocircuits/W2D1_Tutorial1.ipynb @@ -25,9 +25,9 @@ "\n", "__Content creators:__ Gabriel Mel de Fontenay\n", "\n", - "__Content reviewers:__ Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Patrick Mineault\n", + "__Content reviewers:__ Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Patrick Mineault, Alex Murphy\n", "\n", - "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk\n" + "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy\n" ] }, { @@ -43,13 +43,13 @@ "\n", "*Estimated timing of tutorial: 1 hour*\n", "\n", - "In this tutorial we will take a closer look at the expressivity of neural networks by observing the following:\n", + "In this tutorial we will take a closer look at the **expressivity** of neural networks, the ability of neural networks to model a wide range of functions. We will make the following observations:\n", "\n", - "- The **universal approximator theorem** guarantees that we can approximate any complex function using a network with a single hidden layer. The catch is that the approximating network might need to be extremely *wide*.\n", - "- We will explore this issue by constructing a complex function and attempting to fit it with shallow networks of varying widths.\n", - "- To create this complex function, we'll build a random deep neural network. 
This is an example of the **student-teacher setting**, where we attempt to fit a known *teacher* function (the deep network) using a *student* model (the shallow/wide network).\n", - "- We will find that the deep teacher network can be either very easy or very hard to approximate and that the difficulty level is related to a form of **chaos** in the network activities.\n", - "- Each layer of a neural network can effectively expand and fold the input it receives from the previous layer. This repeated expansion and folding grants deep neural networks models high **expressivity** - ie. allows them to implement a large number of different functions.\n", + "- The **universal approximator theorem** guarantees that we can approximate any complex function using a network with a single hidden layer. The catch is that the approximating network might need to be extremely *wide* and the theorem only states the existence of such a model (not exactly how many neurons are required per task)\n", + "- We will explore this issue by constructing a complex function and attempting to fit it with shallow networks of varying widths\n", + "- To create this complex function, we'll build a random deep neural network. This is an example of the **student-teacher setting**, where we attempt to fit a known *teacher* function (the deep network) using a *student* model (the shallow/wide network)\n", + "- We will see that it can be either very easy or very difficult to learn from the deep (teacher) network and this difficulty is related to a form of **chaos** in the network activations\n", + "- Each layer of a neural network can effectively expand and fold the input it receives from the previous layer. This repeated expansion and folding grants deep neural network models high **expressivity** - ie. allows them to capture the behavior of a large number of different functions\n", "\n", "Let's get started!"
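The expand-and-fold intuition in the last bullet can be made concrete in a few lines of code. The sketch below is a NumPy stand-in for the tutorials' PyTorch networks (the widths, depths, and the $\sigma = 2$ weight scale are illustrative choices, not code from the tutorial): we push a 1-D input through a random deep tanh network and count how often the output changes direction, a crude proxy for how complex the implemented function is.

```python
# Hedged sketch: a random deep tanh network viewed as a 1-D function,
# with weights drawn N(0, (sigma/sqrt(n_in))^2) as in the tutorial's init rule.
import numpy as np

rng = np.random.default_rng(0)

def random_tanh_net(depth, width=50, sigma=2.0):
    """Return a function computing a random deep tanh network R -> R.

    Convention as in the tutorial: depth counts the input as a layer,
    so depth=2 means a single hidden layer followed by a linear readout.
    """
    Ws = [rng.normal(0, sigma / np.sqrt(1), size=(width, 1))]   # input layer (n_in=1)
    for _ in range(depth - 2):
        Ws.append(rng.normal(0, sigma / np.sqrt(width), size=(width, width)))
    w_out = rng.normal(0, sigma / np.sqrt(width), size=(1, width))

    def f(x):                        # x: (n,) array of scalar inputs
        h = np.tanh(Ws[0] @ x[None, :])
        for W in Ws[1:]:
            h = np.tanh(W @ h)
        return (w_out @ h).ravel()
    return f

def n_direction_changes(y):
    """Count sign flips in the discrete derivative -- a crude complexity proxy."""
    d = np.sign(np.diff(y))
    return int(np.sum(d[:-1] * d[1:] < 0))

x = np.linspace(-1, 1, 2000)
for depth in (2, 5, 10):
    print(f"depth={depth:2d}  direction changes: {n_direction_changes(random_tanh_net(depth)(x))}")
```

For typical random draws in this large-$\sigma$ regime, deeper networks wiggle more: each extra layer can fold the function again, which is exactly why a shallow student will struggle to match a deep teacher.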
] @@ -363,7 +363,9 @@ "\n", "# Section 1: Introduction\n", "\n", - "In this section we will create functions to capture the snippets of code that we will use repeatedly in what follows." + "In this section we will write some Python functions to help build some neural networks that will allow us to effectively examine the expressivity of shallow versus deep networks. We will specifically look at this issue through the lens of the universal approximation theorem and ask ourselves what deeper neural networks give us in terms of the ability of those models to capture a wide range of functions. As you will recall from today's introduction video, each layer's ability to fold its activations via an activation function greatly increases a network's capacity to model nonlinear functions. After going through this tutorial, this idea will hopefully be much clearer.\n", + "\n", + "By **shallow network**, we mean one with a very small number of layers (e.g. one). A shallow network can be **wide** if it has many, many neurons in this layer, or it can be smaller, having only a limited number of neurons. In contrast, **deep** refers to the number of layers in the network. It's important to keep in mind that the term **wide** in the terminology we will use specifically refers to *the number of neurons in a layer, not the number of layers in a network*. If we take a single layer in a shallow or a deep network, we can describe it as being **wide** if it has a very large number of neurons. " ] }, { @@ -427,17 +429,17 @@ }, "source": [ "\n", - "The [universal approximator theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (UAT) guarantees that we can approximate any function arbitrarily well using a shallow network - ie. a network with a single hidden layer (figure below, left). So why do we need depth? 
The \"catch\" in the UAT is that approximating a complex function with a shallow network can require a very large number of hidden units - ie. the network must be very wide. The inability of shallow networks to efficiently implement certain functions suggests that network depth may be one of the brain's computational \"secret sauces\".\n", + "The [universal approximator theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (UAT) guarantees that we can approximate any function arbitrarily well using a shallow network - ie. a network with a single hidden layer (figure below, left). So why do we need depth? The *catch* in the UAT is that approximating a complex function with a shallow network can require a very large number of hidden units - ie. the network must be very wide. The inability of shallow networks to efficiently implement certain functions suggests that network depth may be one of the brain's computational *secret sauces*.\n", "\n", "\"Shallow\n", "\n", "To illustrate this fact, we'll create a complex function and then attempt to fit it with single-hidden-layer neural networks of different widths. What we'll find is that although the UAT guarantees that sufficiently wide networks can approximate our function, the performance will actually not be very good for our shallow nets of modest width.\n", "\n", - "One easy way to create a complex function is to build a random deep neural network (figure above, right). We then have a teacher network which generates the ground truth outputs, and a student network whose goal is to learn the mapping implemented by the teacher. This approach - known as the **student-teacher setting** - is useful for both computational and mathematical study of neural networks since it gives us complete control of the data generation process. 
Unlike with real-world data, we know the exact distribution of inputs and correct outputs.\n", + "One easy way to create a complex function is to build a random deep neural network (figure above, right), which serves as a teacher network (generating the ground truth outputs), and we'll also have a student network, whose goal is to learn the function defined by the teacher network. This approach - known as the **student-teacher setting** - is useful for both the computational and mathematical study of neural networks since it gives us complete control of the data generation process. Unlike with real-world data, we know the exact distribution of inputs and correct outputs. This means we aren't restricted by having to factor in any *noisy* signals that are not connected to our inputs.\n", "\n", - "Finally, we will show that depending on the distribution of the weights, a random deep neural network can be either very difficult or very easy to approximate with a shallow network. The \"complexity\" of the function computed by a random deep network thus depends crucially on the weight distribution. One can actually understand the boundary between hard and easy cases as a kind of boundary between chaos and non-chaos in a certain dynamical system. We will confirm that on the non-chaotic side, a random deep neural network can be effectively approximated by a shallow net. This demonstration will be based on ideas from the paper:\n", + "Finally, we will show that depending on the distribution of the weights, a random deep neural network can be either very difficult or very easy to approximate with a shallow network. The *complexity* of the function computed by a random deep network thus depends crucially on the weight distribution. One can actually understand the boundary between hard and easy cases as a kind of boundary between **chaos** and **non-chaos** in a certain dynamical system. 
We will confirm that on the non-chaotic side, a random deep neural network can be effectively approximated by a shallow net. This demonstration will be based on ideas from the following paper:\n", "\n", - "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) Poole et al. Neurips (2016)." + "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) (Poole et al., 2016)." ] }, { @@ -528,9 +530,9 @@ "source": [ "## Coding Exercise 1: Create an MLP\n", "\n", - "The code below implements a function that takes in an input dimension, a layer width, and a number of layers and creates a simple MLP in pytorch. In between each layer, we insert a hyperbolic tangent nonlinearity layer (`nn.Tanh()`).\n", + "The code below implements a function that takes in an input dimension, a layer width, and a number of layers and creates a simple MLP in pytorch where each layer has the same width. In between each layer, we insert a hyperbolic tangent nonlinearity layer (`nn.Tanh()`).\n", "\n", - "Convention: Because we will count the input as a layer, a depth of 2 will mean a network with just one hidden layer, followed by the output neuron. A depth of 3 will mean 2 hidden layers, and so on." + "Convention: Because we will count the input as a layer, a depth of 2 will mean a network with just one hidden layer, followed by the output neuron. A depth of 3 will mean 2 hidden layers, and so on." 
] }, { @@ -568,14 +570,14 @@ "\n", " # Assemble D-1 hidden layers and one output layer\n", "\n", - " #input layer\n", + " # input layer\n", " layers = [nn.Linear(n_in, W, bias = False), nonlin]\n", " for i in range(D - 2):\n", - " #linear layer\n", + " # linear layer\n", " layers.append(nn.Linear(W, W, bias = False))\n", - " #activation function\n", + " # activation function\n", " layers.append(nonlin)\n", - " #output layer\n", + " # output layer\n", " layers.append(nn.Linear(W, 1, bias = False))\n", "\n", " return nn.Sequential(*layers)\n", @@ -676,7 +678,7 @@ "source": [ "## Coding Exercise 2: Initialize model weights\n", "\n", - "Write a function that, given a model and a $\\sigma$, initializes all weights in the model according to a normal distribution with mean $0$ and standard deviation\n", + "Write a function that, given a model and $\\sigma$, initializes all weights in the model according to a normal (Gaussian) distribution with mean $0$ and standard deviation\n", " \n", " $$\\frac{\\sigma}{\\sqrt{n_{in}}},$$\n", " \n", @@ -866,7 +868,15 @@ "execution": {} }, "source": [ - "In this coding exercise, write a function that will train a given net on a given dataset. Function parameters include the network, the training inputs and outputs, the number of steps, and the learning rate. Set up loss function as MSE." + "In this coding exercise, write a function that will train a given net on a given dataset. Function parameters include:\n", + "\n", + "* the network (`net`)\n", + "* training inputs (`X`)\n", + "* the outputs (`y`)\n", + "* the number of steps (`n_epochs`)\n", + "* the learning rate (`lr`)\n", + "\n", + "Use the mean-squared error (MSE) loss function in the learning algorithm. You might need to check the pytorch documentation to see the exact layer name you will need to call for this." ] }, { @@ -1002,7 +1012,11 @@ "execution": {} }, "source": [ - "Now, write a helper function that computes the loss of a net on a dataset. 
It takes the following parameters: the network and the dataset inputs and outputs." + "Now, write a helper function that computes the loss of a net on a dataset. It takes the following parameters:\n", + "\n", + "* the network (`net`)\n", + "* the dataset inputs (`X`)\n", + "* the dataset outputs (`y`)" ] }, { @@ -1098,7 +1112,9 @@ "\n", "Estimated timing to here from start of tutorial: 20 minutes\n", "\n", - "We will now use the functions we've created to experiment with deep network fitting. In particular, we will see to what extent it is possible to fit a deep net using a shallow net. Specifically, we will fix a deep teacher and then fit it with a single-hidden-layer net with varying width value. In principle, if the number of hidden units is large enough, the error should be low. Let's see!" + "We will now use the functions we created to experiment with fitting various student models to our complex function (which we defined earlier to be a randomly initialized deep neural network, what we defined as the teacher network). In particular, we will see to what extent it is possible to fit a deep net using a shallow net. We will freeze a deep teacher network and then fit it with a single-hidden-layer net with varying width sizes. In principle, if the number of hidden units is large enough, the error should be low (according to the universal approximation theorem)\n", + "\n", + "Let's see if that's the case!" ] }, { @@ -1176,7 +1192,7 @@ "source": [ "## Coding Exercise 5: Create learning problem\n", "\n", - "Create a \"deep\" teacher network that accepts inputs of size 5. Give the network a width of 5 and a depth of 5. Use this to generate both a training and test set with 4000 examples for training and 1000 for testing. Initialize weights with a standard deviation of 2.0." + "Create a *deep* teacher network that accepts inputs of size `5`. Give the network a width of `5` and a depth of `5`. 
Use this to generate both a training and test set with 4,000 examples for training and 1,000 for testing. Initialize weights with a standard deviation of `2.0`." ] }, { @@ -1314,7 +1330,7 @@ "execution": {} }, "source": [ - "Now, let's train the student and observe the loss on a semi-log plot (the y-axis is logarithmic)! Your task is to complete the missing parts of the code. While the model is training training, you can go to the next coding exercise and return back to observe the results (it will take approximately 5 minutes)." + "Now, let's train the student and observe the loss on a semi-log plot (the y-axis is logarithmic)! Your task is to complete the missing parts of the code. While the model is being trained, you can go to the next coding exercise and return back to observe the results shortly. It will take approximately 5 minutes for the model to complete its training call." ] }, { @@ -1387,7 +1403,7 @@ "execution": {} }, "source": [ - "## Coding Exercise 7: Train a 2 layer neural net with varying width" + "## Coding Exercise 7: Train a 2 layer neural net with varying widths" ] }, { @@ -1576,9 +1592,9 @@ "source": [ "---\n", "\n", - "# Section 3: Deep networks in the quasilinear regime\n", + "# Section 3: Deep networks in the quasi-linear regime\n", "\n", - "Estimated timing to here from start of tutorial: 45 minutes\n", + "*Estimated timing to here from start of tutorial: 45 minutes*\n", "\n", "We've just shown that certain deep networks are difficult to fit. In this section, we will discuss a regime in which a shallow network is able to approximate a deep teacher relatively well." ] @@ -1658,13 +1674,13 @@ "source": [ "One of the reasons that shallow nets cannot fit deep nets, in general, is that random deep nets, in certain regimes, behave like chaotic systems: each layer can be thought of as a single step of a dynamical system, and the number of layers plays the role of the number of time steps. 
A deep network, therefore, effectively subjects its input to long-time chaotic dynamics, which are, almost by definition, very difficult to predict accurately. In particular, *shallow* nets simply cannot capture the complex mapping implemented by deeper networks without resorting to an astronomical number of hidden units. Another way to interpret this behavior is that the many layers of a deep network repeatedly stretch and fold their inputs, allowing the network to implement a large number of complex functions - an idea known as **expressivity** ([Poole et al. 2016](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html)).\n", "\n", - "However, in other regimes, for example, when the weights of the teacher network are small, the dynamics implemented by the teacher network are no longer chaotic. In fact, for small enough weights, they are nearly linear. In this regime, we'd expect a shallow network to be able to approximate a deep teacher relatively well.\n", + "However, in other regimes, for example, when the weights of the teacher network are small, the dynamics implemented by the teacher network are no longer chaotic. In fact, for small enough weights, they are nearly linear. In this regime, we'd expect a shallow network to be able to approximate a deep teacher relatively well. This is what we mean by neural networks in a **quasi-linear** regime.\n", "\n", - "For more on these ideas, see the paper\n", + "For more on these ideas, see the paper:\n", "\n", - "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) Poole et al. 
Neurips (2016).\n", + "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) (Poole et al., 2016).\n", "\n", - "To test this idea, we'll repeat the exercise above, this time initializing the teacher weights with a small $\sigma$, say, $0.4$, so that the teacher network is quasi-linear." + "To test this idea, we'll repeat the exercise above, this time initializing the teacher weights with a small $\sigma$, say, $0.4$, so that the teacher network is in the so-called quasi-linear regime." ] }, { @@ -1673,7 +1689,7 @@ "execution": {} }, "source": [ - "## Coding Exercise 9: Create dataset & Train a student network\n", + "## Coding Exercise 9: Create Dataset & Train a Student Network\n", "\n", "Create training and test sets. Initialize the teacher network with $\sigma_{t} = 0.4$." ] }, { @@ -1896,9 +1912,9 @@ "\n", "In this demo, we invite you to explore the expressivity of two distinct deep networks already introduced earlier: one with $\sigma = 2$ and another (quasi-linear) with $\sigma = 0.4$. \n", "\n", - "We initialize two deep networks with $D=20$ layers with $W = 100$ hidden units each but different variances in their random parameters. Then, 400 input data points are generated on a unit circle. We will examine how these points are propagated through the networks.\n", + "We initialize two deep networks with $D=20$ layers with $W = 100$ hidden units each but different variances in their weight initializations. Then, 400 input data points are generated on a unit circle. We will examine how these points are propagated through the networks by looking at the effect of the transformations that each neural network layer applies to the data.\n", "\n", - "To visualize each layer's activity, we randomly project it into 3 dimensions. The slider below controls which layer you are seeing. 
On the left, you'll see how a standard network processes its inputs, and on the right, how a quasi-linear network does so. As outlined in the video, the principal take-home message is that low values for the variance parameter in the weight initializations mean that each layer effectively performs a linear transformation, which only rotates and stretches the circular input we put into both networks. The chaotic regime of the standard network, by contrast, allows for much greater expressivity!" ] }, { @@ -2002,6 +2018,17 @@ "- We discussed how the fitting difficulty is related to whether the teacher is initialized in the **chaotic** regime.\n", "- Chaotic behavior is related to network **expressivity**, the network's ability to implement a large number of complex functions." ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": {} + }, + "source": [ + "# The Big Picture\n", + "\n", + "So, how do the topics covered in this tutorial relate to our exploration of the theme of generalization? We have seen that deep neural networks in certain regimes differentially affect the transformation of the inputs and this has an effect on the expressivity of the network. Shallow and deep networks transform their inputs in different ways, and these differences give us testable settings for probing generalization capacity. Taking what you have learned in this tutorial, we leave you to think about what kind of relationship there might be between these regimes and models that generalize well to inputs outside of the training distribution. Do shallow networks capture the specific details of training inputs? 
Do they model the problem at a level that pays more attention to surface features, or do they pick up on the important underlying features (that generalize better)? There is no correct answer to this question, but it's a good exercise to think about and start forming your own thoughts and ideas." ] } ], "metadata": { @@ -2032,7 +2059,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/W2D1_Tutorial2.ipynb b/tutorials/W2D1_Macrocircuits/W2D1_Tutorial2.ipynb index 936877ab5..e732bf6af 100644 --- a/tutorials/W2D1_Macrocircuits/W2D1_Tutorial2.ipynb +++ b/tutorials/W2D1_Macrocircuits/W2D1_Tutorial2.ipynb @@ -25,9 +25,9 @@ "\n", "__Content creators:__ Andrew Saxe, Vidya Muthukumar\n", "\n", - "__Content reviewers:__ Max Kanwal, Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault\n", + "__Content reviewers:__ Max Kanwal, Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault, Alex Murphy\n", "\n", - "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk" + "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy" ] }, { @@ -39,17 +39,16 @@ "execution": {} }, "source": [ "___\n", "\n", - "\n", "# Tutorial Objectives\n", "\n", "*Estimated timing of tutorial: 1 hour*\n", "\n", "In this tutorial, we'll look at the sometimes surprising behavior of large neural networks, which is called double descent. This empirical phenomenon puts the classical understanding of the bias-variance tradeoff in question: in double descent, highly overparametrized models can display good performance. 
In particular, we will explore the following: \n", "\n", - "- notions of low/high bias/variance;\n", - "- improvement of test performance with the network's overparameterization, which leads to large model trends;\n", - "- the conditions under which double descent is observed and what affects its significance;\n", - "- the conditions under which double descent does not occur.\n", + "- notions of low/high bias/variance\n", + "- improvement of test performance with the network's overparameterization, which leads to large model trends\n", + "- the conditions under which double descent is observed and what affects its significance\n", + "- the conditions under which double descent does not occur\n", " \n", "Let's jump in!" ] @@ -496,7 +495,7 @@ "---\n", "# Section 1: Overfitting in overparameterized models\n", "\n", - "In this section we will observe the classical behaviour of overparametrized networks - overfitting." + "In this section we will observe the classical behaviour of overparametrized networks: overfitting. This is where a model becomes tuned specifically to the features of the training data, beyond the general patterns. For example, if data points are measured with an imperfect system that introduces noise into the recording values, then overfitting can be thought of as a model that learns the signal **and** the noise associated with each data point, instead of just the **signal**. This is characterized by a low training error and a higher test error." ] }, { @@ -579,7 +578,7 @@ "\n", "We start by generating a simple sinusoidal dataset.\n", "\n", - "This dataset contains 100 datapoints. We've selected a subset of 10 points for training. " + "This dataset contains 100 data points. We've selected a subset of 10 points for training. " ] }, { @@ -613,7 +612,7 @@ "\n", "The input $x\\in R$ is a scalar. There are $N_h$ hidden units, and the output $\\hat y\\in R$ is a scalar.\n", "\n", - "We will initialize $W_1$ with i.i.d. 
random Gaussian values with a variance of one and $b$ with values drawn i.i.d. uniformly between $-\\pi$ and $\\pi$. Finally, we will initialize the weights $W_2$ to zero.\n", + "We will initialize $W_1$ with i.i.d. random Gaussian values with a variance of `1.0` and $b$ with values drawn i.i.d. uniformly between $-\\pi$ and $\\pi$. Finally, we will initialize the weights $W_2$ to `0`.\n", "\n", "We only train $W_2$, leaving $W_1$ and $b$ fixed. We can train $W_2$ to minimize the mean squared error between the training labels $y$ and the network's output on those datapoints. \n", "\n", @@ -830,11 +829,11 @@ "execution": {} }, "source": [ - "## Coding Exercise 2: The bias-variance tradeoff\n", + "## Coding Exercise 2: The Bias-Variance Trade-off\n", "\n", "With the network implemented, we now investigate how the size of the network (the number of hidden units it has, $N_h$) relates to its ability to generalize. \n", "\n", - "Ultimately, the true measure of a learning system is how well it performs on novel inputs, that is, its ability to generalize. The classical story of how model size relates to generalization is the bias-variance tradeoff.\n", + "Ultimately, the true measure of a learning system is how well it performs on novel inputs, that is, its ability to generalize. The classical story of how model size relates to generalization is captured in the concept of the bias-variance tradeoff. We assume you are familiar with this concept already. If not, take some time to discuss in your group or search out a verified explanation to review.\n", "\n", "To start, complete the code below to train several small networks with just two hidden neurons and plot their predictions." ] @@ -937,7 +936,7 @@ "execution": {} }, "source": [ - "With just two hidden units, the model cannot fit the training data, nor can it do well on the test data. A network of this size has a high bias.\n", + "With just two hidden units, the model cannot fit the training data, nor the test data. 
A network of this size has **high bias**.\n", "\n", "Now, let's train a network with five hidden units.\n", "\n", @@ -972,9 +971,9 @@ "execution": {} }, "source": [ - "With five hidden units, the model can do a better job of fitting the training data, and also follows the test data more closely - though still with errors.\n", + "With five hidden units, the model can do a better job of fitting the training data, and also follows the test data more closely, though still with errors.\n", "\n", - "Next let's try 10 hidden units." + "Next let's try 10 hidden units. Try to visualize how you think the plot will look before running the cell below." ] }, { @@ -1005,7 +1004,7 @@ "source": [ "With 10 hidden units, the network often fits every training datapoint, but generalizes poorly--sometimes catastrophically so. We say that this size network has high variance. Intuitively, it is so complex that it can fit the training data perfectly, but this same complexity means it can take many different shapes in between datapoints.\n", "\n", - "We have just traced out the bias-variance tradeoff: the models with 2 hidden units had high bias, while the models with 10 hidden units had high variance. The models with 5 hidden units struck a balance--they were complex enough to achieve relatively low error on the training datapoints, but simple enough to be well constrained by the training data." + "We have just traced out the bias-variance tradeoff: the models with 2 hidden units had high bias, while the models with 10 hidden units had high variance. The models with 5 hidden units struck a balance--they were complex enough to achieve relatively low error on the training data points, but simple enough to be well constrained by the training data. The best choice of neural network architecture (e.g. choosing the number of neurons in a layer) is therefore highly dependent on the structure of the problem you are trying to solve and the format of the input data. 
It also involves trying out a few values and checking where on the bias-variance trade-off curve you find yourself. This was extremely important in classical approaches to understanding how to develop good neural networks. That was then; now we move on to the **Modern Regime**!" ] }, { @@ -1030,7 +1029,7 @@ }, "source": [ "---\n", - "# Section 2: The modern regime\n", + "# Section 2: The Modern Regime\n", "\n", "Estimated timing to here from start of tutorial: 20 minutes\n", "\n", @@ -1113,9 +1112,9 @@ "execution": {} }, "source": [ - "We just saw that a network with 10 hidden units trained on 10 training datapoints could fail to generalize. If we add even more hidden units, it seems unlikely that the network could perform well. How could hundreds of weights be correctly constrained with just these ten datapoints?\n", + "We just saw that a network with 10 hidden units trained on 10 training data points could fail to generalize. If we add even more hidden units, it seems unlikely that the network could perform well. How could hundreds of weights be correctly constrained by just these ten data points, when even ten hidden units already failed to generalize?\n", "\n", - "But let's try it. Throw caution to the wind and train a network with 500 hidden units." + "Let's go crazy and train a network with `500` hidden units and see what happens! " ] }, { @@ -1154,11 +1153,11 @@ "execution": {} }, "source": [ - "Remarkably, this very large network fits the training datapoints and generalizes well.\n", + "Remarkably, this very large network fits the training data points and generalizes well. We've managed to get predictions that look like they have learned the distribution of our input data correctly.\n", "\n", - "This network has fifty times as many parameters as datapoints. How can this be?\n", + "This network has fifty times as many parameters as data points. 
How can this be?\n", "\n", - "We've tested four different network sizes and seen the qualitative behavior of the predictions. Now, let's systematically compute the average test error for different network sizes.\n", + "We have tested four different network sizes (`2`, `5`, `10`, `500`) and we saw the qualitative behavior of the predictions. Now, let's systematically compute the average test error for different network sizes.\n", "\n", "For each network size in the array below, train 100 networks and plot their mean test error." ] @@ -1260,9 +1259,9 @@ "\n", "Hence, in this scenario, larger models perform better--even when they contain many more parameters than datapoints.\n", "\n", - "The peak (worst generalization) is at an intermediate model size when the number of hidden units is equal to the number of examples in this case. More generally, it turns out the peak occurs when the model first becomes complex enough to reach zero training error. This point is known as the interpolation point.\n", + "The peak (worst generalization) is at an intermediate model size when the number of hidden units is equal to the number of examples in this case. More generally, it turns out the peak occurs when the model first becomes complex enough to reach zero training error. This point is known as the **interpolation threshold**.\n", "\n", - "The trend for deep learning models to grow in size is in part due to this phenomenon of double descent. Let's now see its limits." + "The trend of deep learning models growing in size is in part due to the phenomenon of double descent. But does it always hold? Let's now see where its limits are and what modulates the ability to learn in this overparameterized regime."
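The sweep this hunk describes can also be reproduced outside the notebook. Below is a minimal NumPy sketch, not the tutorial's actual PyTorch code: it uses one fixed random tanh feature layer and fits only the output weights by minimum-norm least squares (a stand-in for gradient-descent training from a small initialization); `target_fn`, the weight scale `4.0`, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_fn(x):
    # Hypothetical smooth target, standing in for the tutorial's teacher function.
    return np.sin(2 * np.pi * x)

n_train = 10
x_train = rng.uniform(-1, 1, n_train)
x_test = np.linspace(-1, 1, 200)
y_train, y_test = target_fn(x_train), target_fn(x_test)

# One fixed random tanh layer; widths share the same first-layer weights
# so that larger models literally contain the smaller ones.
w_in = rng.normal(0.0, 4.0, 500)

def features(x, width):
    return np.tanh(np.outer(x, w_in[:width]))

train_errors, test_errors = {}, {}
for width in (2, 5, 10, 500):
    # Minimum-norm least-squares fit of the output weights.
    w_out, *_ = np.linalg.lstsq(features(x_train, width), y_train, rcond=None)
    train_errors[width] = np.mean((features(x_train, width) @ w_out - y_train) ** 2)
    test_errors[width] = np.mean((features(x_test, width) @ w_out - y_test) ** 2)
```

With 500 features and only 10 points, the system is underdetermined, so `lstsq` returns the minimum-norm interpolator — the same small-norm inductive bias the tutorial credits for the second descent; the test error at the width-10 interpolation threshold is typically the worst of the sweep.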
] }, { @@ -1274,7 +1273,7 @@ "source": [ "## Interactive Demo 1: Interpolation point & predictions\n", "\n", - "In this interactive demo, you can move the slider for the number of hidden units in the network to be trained on and observe one representative trial of predicted values." + "In this interactive demo, you can move a slider that sets the number of hidden units in the network to be trained, and then observe one representative trial of predicted values." ] }, { @@ -1303,7 +1302,7 @@ "execution": {} }, "source": [ - "The trend for deep learning models to grow in size is in part due to the phenomenon of double descent. Let's now see its limits." + "Having experimented with this interactive tool for a little while, are you able to see the relationship between the results shown here and the double descent plot above (the previous figure)?" ] }, { @@ -1329,11 +1328,11 @@ "source": [ "\n", "---\n", - "# Section 3: Double descent, noise & regularization\n", + "# Section 3: Double Descent, Noise & Regularization\n", "\n", - "Estimated timing to here from start of tutorial: 35 minutes\n", + "*Estimated timing to here from start of tutorial: 35 minutes*\n", "\n", - "In this section, we are going to explore the effect of noise and regularization on double descent behavior." + "In this section, we are going to explore the effect of noise and regularization on double descent." ] }, { @@ -1422,7 +1421,7 @@ "execution": {} }, "source": [ - "So far, our training datapoints have been noiseless. Intuitively, a noisy training dataset might hurt the ability of complex models to generalize. In this section, we are going to explore the effect of noise on double descent behavior.\n", + "So far, our training data points have been noiseless. Intuitively, a noisy training dataset might hurt the ability of complex models to generalize. In this section, we are going to explore the effect of noise on double descent behavior.\n", "\n", "Let's test this. Add i.i.d. 
Gaussian noise of different standard deviations to the training labels, and plot the resulting double descent curves." ] @@ -1525,7 +1524,7 @@ "execution": {} }, "source": [ - "Though we are still able to observe the double descent effect, its strength is reduced with the increase in noise level." + "Though we are still able to observe the effect of double descent, its nowhere near as clear when we introduce noise into the training data." ] }, { @@ -1567,10 +1566,14 @@ "source": [ "We observe that the \"peak\" disappears, and the test error roughly monotonically decreases, although it is generally higher for higher noise levels in the training data.\n", "\n", + "
\n", + " A note about the use of the term \"regularization\" in multiple contexts in ML (optional)\n", + "
\n", "The word *regularization* is commonly used in statistics/ML parlance in two different contexts to ensure the good generalization of overparameterized models:\n", "\n", "- The first context, which is emphasized throughout the tutorial, is explicit regularization which means that the model is not trained to completion (zero training error) in order to avoid overfitting of noise. Without explicit regularization, we observe the double descent behavior – i.e. catastrophic overfitting when the number of model parameters is too close to the number of training examples – but also a vast reduction in this overfitting effect as we heavily overparameterize the model. With explicit regularization (when tuned correctly), the double descent behavior disappears because we no longer run the risk of overfitting to noise at all.\n", - "- The second context is the one of inductive bias – overparameterized models, when trained with popular optimization algorithms like gradient descent, tend to converge to a particularly “simple” solution that perfectly fits the data. By “simple”, we usually mean that the size of the parameters (in terms of magnitude) is very small. This inductive bias is a big reason why double descent occurs as well, in particular, the benefit of overparameterization in reducing overfitting." + "- The second context is the one of inductive bias – overparameterized models, when trained with popular optimization algorithms like gradient descent, tend to converge to a particularly “simple” solution that perfectly fits the data. By “simple”, we usually mean that the size of the parameters (in terms of magnitude) is very small. This inductive bias is a big reason why double descent occurs as well, in particular, the benefit of overparameterization in reducing overfitting.\n", + "
" ] }, { @@ -1658,9 +1661,9 @@ "execution": {} }, "source": [ - "The network smoothly interpolates between the training datapoints. Even when noisy, these can still somewhat track the test data. Depending on the noise level, though, a smaller and more constrained model can be better.\n", + "The network smoothly interpolates between the training data points. Even when noisy, these can still somewhat track the test data. Depending on the noise level, though, a smaller and more constrained model can be better.\n", "\n", - "From this, we might expect that large models will work particularly well for datasets with little label noise. Many real-world datasets fit this requirement: image classification datasets strive to have accurate labels for all datapoints, for instance. Other datasets may not. For instance, predicting DSM-V diagnoses from structural MRI data is a noisy task, as the diagnoses themselves are noisy." + "From this, we might expect that large models will work particularly well for datasets with little label noise. Many real world datasets fit this requirement: image classification datasets strive to have accurate labels for all data points, for instance. Other datasets may not. For instance, predicting DSM-V diagnoses from structural MRI data is a noisy task, as the diagnoses themselves are noisy due to the inherent difficulty of mapping observations to clinically defined classes." ] }, { @@ -1685,11 +1688,12 @@ }, "source": [ "---\n", - "# Section 4: Double descent and initialization\n", + "# Section 4: Double Descent and Initialization\n", "\n", - "Estimated timing to here from start of tutorial: 50 minutes\n", + "*Estimated timing to here from start of tutorial: 50 minutes*\n", "\n", - "So far, we have considered one important aspect of architecture, namely the size or number of hidden neurons. A second critical aspect is initialization." 
+ "\n", + "So far, we have considered one important aspect of neural network architectures, namely the width of a hidden layer (the number of neurons in the layer). However, another critical aspect connected to the emergence of the double descent phenomenon is that of weight initialization. In the last tutorial, we explored weight initialization from the perspective of a chaotic and non-chaotic regime. We saw that low variance initialization of weights led some deep MLPs to exhibit transformations in the quasi-linear regime. Let's now explore what the effects of weight initialization are on the double descent phenomenon." ] }, { @@ -1841,11 +1845,11 @@ "execution": {} }, "source": [ - "We see that for overparametrized models, where the number of parameters is larger than the number of training examples, the initialization scale strongly impacts the test error. The good performance of these large models thus depends on our choice of initializing $W_2$ equal to zero.\n", + "We see that for our overparametrized model (where the number of parameters is larger than the number of training examples) the initialization scale strongly impacts the test error. The better performance of these large models thus depends on our choice of initializing $W_2$ equal to zero.\n", "\n", "Intuitively, this is because directions of weight space in which we have no training data are not changed by gradient descent, so poor initialization can continue to affect the model even after training. Large initializations implement random functions that generalize poorly.\n", "\n", - "Let's see what the predictions of a large-variance-initialization network with 500 hidden neurons look like." + "Let's see what the predictions of a large-variance-initialization network with 500 hidden neurons look like. We will set `init_scale = 1.0` in this example below." 
] }, { @@ -1999,9 +2003,9 @@ }, "source": [ "---\n", - "# Summary\n", + "# The Big Picture\n", "\n", - "Estimated timing of tutorial: 1 hour\n", + "*Estimated timing of tutorial: 1 hour*\n", "\n", "In this tutorial, we observed the phenomenon of double descent: the situation when the overparameterized network was expected to behave as overfitted but instead generalized better to the unseen data. Moreover, we discovered how noise, regularization & initial scale impact the effect of double descent and, in some cases, can fully cancel it.\n", "\n", @@ -2036,7 +2040,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/W2D1_Tutorial3.ipynb b/tutorials/W2D1_Macrocircuits/W2D1_Tutorial3.ipynb index 1bd3e0e82..2e5857b7f 100644 --- a/tutorials/W2D1_Macrocircuits/W2D1_Tutorial3.ipynb +++ b/tutorials/W2D1_Macrocircuits/W2D1_Tutorial3.ipynb @@ -23,9 +23,9 @@ "\n", "__Content creators:__ Ruiyi Zhang\n", "\n", - "__Content reviewers:__ Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault\n", + "__Content reviewers:__ Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault, Alex Murphy\n", "\n", - "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk" + "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy" ] }, { @@ -96,7 +96,8 @@ "# @title Install and import feedback gadget\n", "\n", "!pip install vibecheck datatops --quiet\n", - "!pip install pandas~=2.0.0 --quiet\n", + "!pip install pandas --quiet\n", + "!pip install scikit-learn --quiet\n", "\n", "from vibecheck import DatatopsContentReviewContainer\n", "def content_review(notebook_section: str):\n", @@ -1789,7 +1790,7 @@ "---\n", "# Section 2: Evaluate agents in the training task\n", "\n", - "Estimated timing to here from start of tutorial: 25 minutes" + "*Estimated timing to here from 
start of tutorial: 25 minutes*" ] }, { @@ -1798,7 +1799,7 @@ "execution": {} }, "source": [ - "With the code for the environment and agents done, we will now write an evaluation function allowing the agent to interact with the environment." + "With the code for the environment and agents complete, we will now write an evaluation function allowing the agent to interact with the environment so that the quality of the model can be assessed." ] }, { @@ -1816,7 +1817,7 @@ "execution": {} }, "source": [ - "We first sample 1000 targets for agents to steer to." + "We first sample 1,000 targets for the RL agent to steer towards." ] }, { @@ -2061,7 +2062,7 @@ "execution": {} }, "source": [ - "Since training RL agents takes a lot of time, here we load the pre-trained modular and holistic agents and evaluate these two agents on the same sampled 1000 targets. We will then store the evaluation data in pandas dataframes." + "Since training RL agents takes a lot of time, here we load the pre-trained modular and holistic agents and evaluate these two agents on the same sampled 1,000 targets. We will then store the evaluation data in `pandas` DataFrame objects." ] }, { @@ -2499,9 +2500,11 @@ "execution": {} }, "source": [ - "It is well known that an RL agent's performance can vary significantly with different random seeds. Therefore, no conclusions can be drawn based on one training run with a single random seed.\n", + "It is well known that an RL agent's performance can vary significantly with different random seeds. Therefore, no conclusions can be drawn based on one training run with a single random seed. To draw more convincing conclusions, we must run the same experiment across several random initializations and confirm that the result holds robustly across all of them.\n", "\n", - "Both agents were trained with eight random seeds, and all of them were evaluated using the same sample of $1000$ targets. 
Let's load this saved trajectory data." + "Both agents were trained across 8 random seeds. All of them were evaluated using the same sample of 1,000 targets.\n", + "\n", + "Let's load this saved trajectory data." ] }, { @@ -2536,7 +2539,7 @@ "execution": {} }, "source": [ - "We first compute the fraction of rewarded trials in the total $1000$ trials for all training runs with different random seeds for the modular and holistic agents. We visualize this using a bar plot, with each red dot denoting the performance of a random seed." + "We first compute the fraction of rewarded trials in the total 1,000 trials for all training runs with different random seeds for the modular and holistic agents. We visualize this using a bar plot, with each red dot denoting the performance of a random seed." ] }, { @@ -2592,9 +2595,9 @@ "execution": {} }, "source": [ - "Despite similar performance measured by a rewarded fraction, we dis observe qualitative differences in the trajectories of the two agents in the previous sections. It is possible that the holistic agent's more curved trajectories, although reaching the target, are less efficient, i.e., they waste more time.\n", + "Despite similar performance measured by a rewarded fraction, we did observe qualitative differences in the trajectories of the two agents in the previous sections. It is possible that the holistic agent's more curved trajectories, although reaching the target, are less efficient, i.e., they waste more time.\n", "\n", - "Therefore, we also plot the time spent by both agents for the same 1000 targets." + "Therefore, we also plot the time spent by both agents for the same 1,000 targets." ] }, { @@ -2774,7 +2777,9 @@ "---\n", "# Section 3: A novel gain task\n", "\n", - "Estimated timing to here from start of tutorial: 50 minutes" + "*Estimated timing to here from start of tutorial: 50 minutes*\n", + "\n", + "The prior task had a fixed joystick gain that meant consistent linear and angular velocities. 
We will now look at a novel task that tests the generalization capabilities of these models by varying this setting between training and testing. Will the model generalize well?" ] }, { @@ -3208,22 +3213,7 @@ "execution": {} }, "source": [ - "---\n", - "# Summary\n", - "\n", - "*Estimated timing of tutorial: 1 hour*\n", - "\n", - "In this tutorial, we explored the difference in agents' performance based on their architecture. We revealed that modular architecture, with separate modules for learning different aspects of behavior, is superior to a holistic architecture with a single module. The modular architecture with stronger inductive bias achieves good performance faster and has the capability to generalize to other tasks as well. Intriguingly, this modularity is a property we also observe in the brains, which could be important for generalization in the brain as well." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "execution": {} - }, - "source": [ - "---\n", - "# Bonus Section 1: Decoding analysis" + "## Decoding analysis" ] }, { @@ -3355,7 +3345,21 @@ }, "source": [ "---\n", - "# Bonus Section 2: Generalization, but no free lunch\n", + "# The Big Picture\n", + "\n", + "*Estimated timing of tutorial: 1 hour*\n", + "\n", + "In this tutorial, we explored the difference in agents' performance based on their architecture. We revealed that a modular architecture, with separate modules for learning different aspects of behavior, is superior to a holistic architecture with a single module. The modular architecture with stronger inductive bias achieves good performance faster and has the capability to generalize to other tasks as well. Intriguingly, this modularity is a property we also observe in biological brains, which could be important for generalization in the brain as well."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": {} + }, + "source": [ + "---\n", + "# Addendum: Generalization, but no free lunch\n", "\n", "The No Free Lunch theorems proved that no inductive bias can excel across all tasks. It has been studied in the [paper](https://www.science.org/doi/10.1126/sciadv.adk1256) that agents with a modular architecture can acquire the underlying structure of the training task. In contrast, holistic agents tend to acquire different knowledge than modular agents during training, such as forming beliefs based on unreliable information sources or exhibiting less efficient control actions. The novel gain task has a structure similar to the training task, consequently, a modular agent that accurately learns the training task's structure can leverage its knowledge in these novel tasks.\n", "\n", @@ -3406,7 +3410,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/instructor/W2D1_Intro.ipynb b/tutorials/W2D1_Macrocircuits/instructor/W2D1_Intro.ipynb index 4f3ce7bdb..015c5bbe5 100644 --- a/tutorials/W2D1_Macrocircuits/instructor/W2D1_Intro.ipynb +++ b/tutorials/W2D1_Macrocircuits/instructor/W2D1_Intro.ipynb @@ -57,7 +57,9 @@ "source": [ "## Prerequisites\n", "\n", - "Materials of this day assume you have had the experience of model building in `pytorch` earlier. It would be beneficial too if you had the basics of Linear Algebra before as well as if you had played around with Actor-Critic model in Reinforcement Learning setup." + "In order to get the most out of today's tutorials, it would greatly help if you had experience building (simple) neural network models in PyTorch. We will also be using some concepts from Linear Algebra, so some familiarity with concepts from that domain will come in handy. 
We will also be looking at a specific algorithm in Reinforcement Learning (RL) called the Actor-Critic model, so it would help if you had some familiarity with Reinforcement Learning. We touched a little bit on RL in W1D2 (\"Comparing Tasks\"), specifically in Tutorial 3 (\"Reinforcement Learning Across Temporal Scales\"). It could be good to refer back to that tutorial and to check out the two videos on Meta-RL in that tutorial notebook.\n", + "\n", + "Today is a little more technical, more theory-driven, but it will give you a lot of skills and appreciation to work with these very interesting ideas in NeuroAI. What we encourage you to keep in mind is how this knowledge helps you to appreciate the concept of generalization, the over-arching theme of this entire course. Lots of points today will indicate how learning dynamics will arrive at solutions that **generalize well**!" ] }, { @@ -210,7 +212,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial1.ipynb b/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial1.ipynb index 85cc81484..d285973fa 100644 --- a/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial1.ipynb +++ b/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial1.ipynb @@ -25,9 +25,9 @@ "\n", "__Content creators:__ Gabriel Mel de Fontenay\n", "\n", - "__Content reviewers:__ Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Patrick Mineault\n", + "__Content reviewers:__ Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Patrick Mineault, Alex Murphy\n", "\n", - "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk\n" + "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros 
Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy\n" ] }, { @@ -43,13 +43,13 @@ "\n", "*Estimated timing of tutorial: 1 hour*\n", "\n", - "In this tutorial we will take a closer look at the expressivity of neural networks by observing the following:\n", + "In this tutorial we will take a closer look at the **expressivity** of neural networks, the ability of neural networks to model a wide range of functions. We will make the following observations:\n", "\n", - "- The **universal approximator theorem** guarantees that we can approximate any complex function using a network with a single hidden layer. The catch is that the approximating network might need to be extremely *wide*.\n", - "- We will explore this issue by constructing a complex function and attempting to fit it with shallow networks of varying widths.\n", - "- To create this complex function, we'll build a random deep neural network. This is an example of the **student-teacher setting**, where we attempt to fit a known *teacher* function (the deep network) using a *student* model (the shallow/wide network).\n", - "- We will find that the deep teacher network can be either very easy or very hard to approximate and that the difficulty level is related to a form of **chaos** in the network activities.\n", - "- Each layer of a neural network can effectively expand and fold the input it receives from the previous layer. This repeated expansion and folding grants deep neural networks models high **expressivity** - ie. allows them to implement a large number of different functions.\n", + "- The **universal approximator theorem** guarantees that we can approximate any complex function using a network with a single hidden layer. 
The catch is that the approximating network might need to be extremely *wide* and the theorem only states the existence of such a model (not exactly how many neurons are required per task)\n", + "- We will explore this issue by constructing a complex function and attempting to fit it with shallow networks of varying widths\n", + "- To create this complex function, we'll build a random deep neural network. This is an example of the **student-teacher setting**, where we attempt to fit a known *teacher* function (the deep network) using a *student* model (the shallow/wide network)\n", + "- We will see that it can be either very easy or very difficult to learn from the deep (teacher) network and this difficulty is related to a form of **chaos** in the network activations\n", + "- Each layer of a neural network can effectively expand and fold the input it receives from the previous layer. This repeated expansion and folding grants deep neural network models high **expressivity** - ie. allows them to capture the behavior of a large number of different functions\n", "\n", "Let's get started!" ] @@ -363,7 +363,9 @@ "\n", "# Section 1: Introduction\n", "\n", - "In this section we will create functions to capture the snippets of code that we will use repeatedly in what follows." + "In this section we will write some Python functions to help build some neural networks that will allow us to effectively examine the expressivity of shallow versus deep networks. We will specifically look at this issue through the lens of the universal approximation theorem and ask ourselves what deeper neural networks give us in terms of the ability of those models to capture a wide range of functions. As you will recall from today's introduction video, the idea of each layer being able to fold activations via an activation function increases the ability to model nonlinear functions much more effectively. 
After going through this tutorial, this idea will hopefully be much clearer.\n", + "\n", + "By **shallow network**, we mean one with a very small number of layers (e.g. one). A shallow network can be **wide** if it has many, many neurons in this layer, or it can be smaller, having only a limited number of neurons. In contrast, by **deep networks**, we refer to the number of layers in the network. It's important to keep in mind that the term **wide** in the terminology we will use specifically refers to *the number of neurons in a layer, not the number of layers in a network*. If we take a single layer in a shallow or a deep network, we can describe it as being **wide** if it has a very large number of neurons. " ] }, { @@ -427,17 +429,17 @@ }, "source": [ "\n", - "The [universal approximator theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (UAT) guarantees that we can approximate any function arbitrarily well using a shallow network - ie. a network with a single hidden layer (figure below, left). So why do we need depth? The \"catch\" in the UAT is that approximating a complex function with a shallow network can require a very large number of hidden units - ie. the network must be very wide. The inability of shallow networks to efficiently implement certain functions suggests that network depth may be one of the brain's computational \"secret sauces\".\n", + "The [universal approximator theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (UAT) guarantees that we can approximate any function arbitrarily well using a shallow network - ie. a network with a single hidden layer (figure below, left). So why do we need depth? The *catch* in the UAT is that approximating a complex function with a shallow network can require a very large number of hidden units - ie. the network must be very wide. 
The inability of shallow networks to efficiently implement certain functions suggests that network depth may be one of the brain's computational *secret sauces*.\n", "\n", "\"Shallow\n", "\n", "To illustrate this fact, we'll create a complex function and then attempt to fit it with single-hidden-layer neural networks of different widths. What we'll find is that although the UAT guarantees that sufficiently wide networks can approximate our function, the performance will actually not be very good for our shallow nets of modest width.\n", "\n", - "One easy way to create a complex function is to build a random deep neural network (figure above, right). We then have a teacher network which generates the ground truth outputs, and a student network whose goal is to learn the mapping implemented by the teacher. This approach - known as the **student-teacher setting** - is useful for both computational and mathematical study of neural networks since it gives us complete control of the data generation process. Unlike with real-world data, we know the exact distribution of inputs and correct outputs.\n", + "One easy way to create a complex function is to build a random deep neural network (figure above, right), which serves as a teacher network (generating the ground truth outputs), and we'll also have a student network, whose goal is to learn the function defined by the teacher network. This approach - known as the **student-teacher setting** - is useful for both the computational and mathematical study of neural networks since it gives us complete control of the data generation process. Unlike with real-world data, we know the exact distribution of inputs and correct outputs. 
This means we aren't restricted by having to factor in any *noisy* signals that are not connected to our inputs.\n", "\n", - "Finally, we will show that depending on the distribution of the weights, a random deep neural network can be either very difficult or very easy to approximate with a shallow network. The \"complexity\" of the function computed by a random deep network thus depends crucially on the weight distribution. One can actually understand the boundary between hard and easy cases as a kind of boundary between chaos and non-chaos in a certain dynamical system. We will confirm that on the non-chaotic side, a random deep neural network can be effectively approximated by a shallow net. This demonstration will be based on ideas from the paper:\n", + "Finally, we will show that depending on the distribution of the weights, a random deep neural network can be either very difficult or very easy to approximate with a shallow network. The *complexity* of the function computed by a random deep network thus depends crucially on the weight distribution. One can actually understand the boundary between hard and easy cases as a kind of boundary between **chaos** and **non-chaos** in a certain dynamical system. We will confirm that on the non-chaotic side, a random deep neural network can be effectively approximated by a shallow net. This demonstration will be based on ideas from the following paper:\n", "\n", - "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) Poole et al. Neurips (2016)." + "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) (Poole et al., 2016)." 
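The student-teacher pipeline described above can be sketched end-to-end. This is an editor-added NumPy stand-in for the tutorial's PyTorch `make_mlp` (no biases, tanh between layers, the input counted as a layer, weights drawn with standard deviation sigma divided by the square root of the fan-in); the sizes mirror the later exercise, but the function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_teacher_weights(n_in, width, depth, sigma, rng):
    # Depth counts the input as a layer: Linear(n_in, W) + tanh,
    # then (depth - 2) blocks of Linear(W, W) + tanh, then Linear(W, 1).
    dims = [n_in] + [width] * (depth - 1) + [1]
    return [rng.normal(0.0, sigma / np.sqrt(fan_in), size=(fan_in, fan_out))
            for fan_in, fan_out in zip(dims[:-1], dims[1:])]

def teacher_forward(X, weights):
    h = X
    for W in weights[:-1]:
        h = np.tanh(h @ W)      # every layer except the readout is followed by tanh
    return h @ weights[-1]      # linear readout

# Teacher with input size 5, width 5, depth 5, sigma = 2.0.
weights = make_teacher_weights(n_in=5, width=5, depth=5, sigma=2.0, rng=rng)
X_train = rng.normal(size=(4000, 5))
X_test = rng.normal(size=(1000, 5))
y_train = teacher_forward(X_train, weights)
y_test = teacher_forward(X_test, weights)
```

Because we built the teacher ourselves, the "dataset" `(X_train, y_train)` is noiseless by construction — exactly the complete control over the data-generating process that makes this setting convenient.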
] }, { @@ -528,9 +530,9 @@ "source": [ "## Coding Exercise 1: Create an MLP\n", "\n", - "The code below implements a function that takes in an input dimension, a layer width, and a number of layers and creates a simple MLP in pytorch. In between each layer, we insert a hyperbolic tangent nonlinearity layer (`nn.Tanh()`).\n", + "The code below implements a function that takes in an input dimension, a layer width, and a number of layers and creates a simple MLP in pytorch where each layer has the same width. In between each layer, we insert a hyperbolic tangent nonlinearity layer (`nn.Tanh()`).\n", "\n", - "Convention: Because we will count the input as a layer, a depth of 2 will mean a network with just one hidden layer, followed by the output neuron. A depth of 3 will mean 2 hidden layers, and so on." + "Convention: Because we will count the input as a layer, a depth of 2 will mean a network with just one hidden layer, followed by the output neuron. A depth of 3 will mean 2 hidden layers, and so on." 
] }, { @@ -568,14 +570,14 @@ "\n", " # Assemble D-1 hidden layers and one output layer\n", "\n", - " #input layer\n", + " # input layer\n", " layers = [nn.Linear(n_in, W, bias = False), nonlin]\n", " for i in range(D - 2):\n", - " #linear layer\n", + " # linear layer\n", " layers.append(nn.Linear(W, W, bias = False))\n", - " #activation function\n", + " # activation function\n", " layers.append(nonlin)\n", - " #output layer\n", + " # output layer\n", " layers.append(nn.Linear(W, 1, bias = False))\n", "\n", " return nn.Sequential(*layers)\n", @@ -678,7 +680,7 @@ "source": [ "## Coding Exercise 2: Initialize model weights\n", "\n", - "Write a function that, given a model and a $\\sigma$, initializes all weights in the model according to a normal distribution with mean $0$ and standard deviation\n", + "Write a function that, given a model and $\\sigma$, initializes all weights in the model according to a normal (Gaussian) distribution with mean $0$ and standard deviation\n", " \n", " $$\\frac{\\sigma}{\\sqrt{n_{in}}},$$\n", " \n", @@ -872,7 +874,15 @@ "execution": {} }, "source": [ - "In this coding exercise, write a function that will train a given net on a given dataset. Function parameters include the network, the training inputs and outputs, the number of steps, and the learning rate. Set up loss function as MSE." + "In this coding exercise, write a function that will train a given net on a given dataset. Function parameters include:\n", + "\n", + "* the network (`net`)\n", + "* training inputs (`X`)\n", + "* the outputs (`y`)\n", + "* the number of steps (`n_epochs`)\n", + "* the learning rate (`lr`)\n", + "\n", + "Use the mean-squared error (MSE) loss function in the learning algorithm. You might need to check the pytorch documentation to see the exact layer name you will need to call for this." ] }, { @@ -1010,7 +1020,11 @@ "execution": {} }, "source": [ - "Now, write a helper function that computes the loss of a net on a dataset. 
It takes the following parameters: the network and the dataset inputs and outputs." + "Now, write a helper function that computes the loss of a net on a dataset. It takes the following parameters:\n", + "\n", + "* the network (`net`)\n", + "* the dataset inputs (`X`)\n", + "* the dataset outputs (`y`)" ] }, { @@ -1108,7 +1122,9 @@ "\n", "Estimated timing to here from start of tutorial: 20 minutes\n", "\n", - "We will now use the functions we've created to experiment with deep network fitting. In particular, we will see to what extent it is possible to fit a deep net using a shallow net. Specifically, we will fix a deep teacher and then fit it with a single-hidden-layer net with varying width value. In principle, if the number of hidden units is large enough, the error should be low. Let's see!" + "We will now use the functions we created to experiment with fitting various student models to our complex function (the randomly initialized deep neural network we defined earlier as the teacher network). In particular, we will see to what extent it is possible to fit a deep net using a shallow net. We will freeze a deep teacher network and then fit it with a single-hidden-layer net of varying widths. In principle, if the number of hidden units is large enough, the error should be low (according to the universal approximation theorem).\n", + "\n", + "Let's see if that's the case!" ] }, { @@ -1186,7 +1202,7 @@ "source": [ "## Coding Exercise 5: Create learning problem\n", "\n", - "Create a \"deep\" teacher network that accepts inputs of size 5. Give the network a width of 5 and a depth of 5. Use this to generate both a training and test set with 4000 examples for training and 1000 for testing. Initialize weights with a standard deviation of 2.0." + "Create a *deep* teacher network that accepts inputs of size `5`. Give the network a width of `5` and a depth of `5`.
Use this to generate both a training and test set with 4,000 examples for training and 1,000 for testing. Initialize weights with a standard deviation of `2.0`." ] }, { @@ -1326,7 +1342,7 @@ "execution": {} }, "source": [ - "Now, let's train the student and observe the loss on a semi-log plot (the y-axis is logarithmic)! Your task is to complete the missing parts of the code. While the model is training training, you can go to the next coding exercise and return back to observe the results (it will take approximately 5 minutes)." + "Now, let's train the student and observe the loss on a semi-log plot (the y-axis is logarithmic)! Your task is to complete the missing parts of the code. While the model is being trained, you can go to the next coding exercise and return back to observe the results shortly. It will take approximately 5 minutes for the model to complete its training call." ] }, { @@ -1401,7 +1417,7 @@ "execution": {} }, "source": [ - "## Coding Exercise 7: Train a 2 layer neural net with varying width" + "## Coding Exercise 7: Train a 2 layer neural net with varying widths" ] }, { @@ -1594,9 +1610,9 @@ "source": [ "---\n", "\n", - "# Section 3: Deep networks in the quasilinear regime\n", + "# Section 3: Deep networks in the quasi-linear regime\n", "\n", - "Estimated timing to here from start of tutorial: 45 minutes\n", + "*Estimated timing to here from start of tutorial: 45 minutes*\n", "\n", "We've just shown that certain deep networks are difficult to fit. In this section, we will discuss a regime in which a shallow network is able to approximate a deep teacher relatively well." ] @@ -1676,13 +1692,13 @@ "source": [ "One of the reasons that shallow nets cannot fit deep nets, in general, is that random deep nets, in certain regimes, behave like chaotic systems: each layer can be thought of as a single step of a dynamical system, and the number of layers plays the role of the number of time steps. 
A deep network, therefore, effectively subjects its input to long-time chaotic dynamics, which are, almost by definition, very difficult to predict accurately. In particular, *shallow* nets simply cannot capture the complex mapping implemented by deeper networks without resorting to an astronomical number of hidden units. Another way to interpret this behavior is that the many layers of a deep network repeatedly stretch and fold their inputs, allowing the network to implement a large number of complex functions - an idea known as **expressivity** ([Poole et al. 2016](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html)).\n", "\n", - "However, in other regimes, for example, when the weights of the teacher network are small, the dynamics implemented by the teacher network are no longer chaotic. In fact, for small enough weights, they are nearly linear. In this regime, we'd expect a shallow network to be able to approximate a deep teacher relatively well.\n", + "However, in other regimes, for example, when the weights of the teacher network are small, the dynamics implemented by the teacher network are no longer chaotic. In fact, for small enough weights, they are nearly linear. In this regime, we'd expect a shallow network to be able to approximate a deep teacher relatively well. This is what we mean by neural networks in a **quasi-linear** regime.\n", "\n", - "For more on these ideas, see the paper\n", + "For more on these ideas, see the paper:\n", "\n", - "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) Poole et al. 
Neurips (2016).\n", + "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) (Poole et al., 2016).\n", "\n", - "To test this idea, we'll repeat the exercise above, this time initializing the teacher weights with a small $\sigma$, say, $0.4$, so that the teacher network is quasi-linear." + "To test this idea, we'll repeat the exercise above, this time initializing the teacher weights with a small $\sigma$, say, $0.4$, so that the teacher network is in the so-called quasi-linear regime." ] }, { @@ -1691,7 +1707,7 @@ "execution": {} }, "source": [ - "## Coding Exercise 9: Create dataset & Train a student network\n", + "## Coding Exercise 9: Create Dataset & Train a Student Network\n", "\n", "Create training and test sets. Initialize the teacher network with $\sigma_{t} = 0.4$." ] }, { @@ -1918,9 +1934,9 @@ "\n", "In this demo, we invite you to explore the expressivity of two distinct deep networks already introduced earlier: one with $\sigma = 2$ and another (quasi-linear) with $\sigma = 0.4$. \n", "\n", - "We initialize two deep networks with $D=20$ layers with $W = 100$ hidden units each but different variances in their random parameters. Then, 400 input data points are generated on a unit circle. We will examine how these points are propagated through the networks.\n", + "We initialize two deep networks with $D=20$ layers with $W = 100$ hidden units each but different variances in their weight initializations. Then, 400 input data points are generated on a unit circle. We will examine how these points are propagated through the networks by looking at the effect of the transformations that each neural network layer applies to the data.\n", "\n", - "To visualize each layer's activity, we randomly project it into 3 dimensions. The slider below controls which layer you are seeing.
On the left, you'll see how a standard network processes its inputs, and on the right, how a quasi-linear network does so. " + "To visualize each layer's activity, we randomly project it into 3 dimensions. The slider below controls which layer you are seeing. On the left, you'll see how a standard network processes its inputs, and on the right, how a quasi-linear network does so. As outlined in the video, the principal take-home message is that low values for the variance parameter in the weight initializations mean that each layer effectively performs a linear transformation, which only rotates and stretches the circular input we put into both networks. The chaotic regime of the standard network allows for much greater expressivity due to this phenomenon!" ] }, { @@ -2024,6 +2040,17 @@ "- We discussed how the fitting difficulty is related to whether the teacher is initialized in the **chaotic** regime.\n", "- Chaotic behavior is related to network **expressivity**, the network's ability to implement a large number of complex functions." ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": {} + }, + "source": [ + "# The Big Picture\n", + "\n", + "So, how do the topics covered in this tutorial relate to our exploration of the theme of generalization? We have seen that deep neural networks in certain regimes differentially affect the transformation of the inputs, and this has an effect on the expressivity of the network. The transformations that take place in shallow versus deep neural networks create different testable environments for generalization capacity. Taking what you have learned in this tutorial, we leave you to think about what kind of relationship there might be for models that generalize well to inputs outside of the training distribution. Do shallow networks capture the specific details of training inputs?
Do they model the problem at a level that attends mostly to surface details, or at a more abstract level that captures general structure (which tends to generalize better)? There is no correct answer to this question, but it's a good exercise to think about and start forming your own thoughts and ideas." ] } ], "metadata": { @@ -2054,7 +2081,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial2.ipynb b/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial2.ipynb index e31b18a6b..91d9e56fa 100644 --- a/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial2.ipynb +++ b/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial2.ipynb @@ -25,9 +25,9 @@ "\n", "__Content creators:__ Andrew Saxe, Vidya Muthukumar\n", "\n", - "__Content reviewers:__ Max Kanwal, Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault\n", + "__Content reviewers:__ Max Kanwal, Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault, Alex Murphy\n", "\n", - "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk" + "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy" ] }, { @@ -39,17 +39,16 @@ "source": [ "___\n", "\n", - "\n", "# Tutorial Objectives\n", "\n", "*Estimated timing of tutorial: 1 hour*\n", "\n", "In this tutorial, we'll look at the sometimes surprising behavior of large neural networks, which is called double descent. This empirical phenomenon puts the classical understanding of the bias-variance tradeoff in question: in double descent, highly overparametrized models can display good performance.
In particular, we will explore the following: \n", "\n", - "- notions of low/high bias/variance;\n", - "- improvement of test performance with the network's overparameterization, which leads to large model trends;\n", - "- the conditions under which double descent is observed and what affects its significance;\n", - "- the conditions under which double descent does not occur.\n", + "- notions of low/high bias/variance\n", + "- improvement of test performance with the network's overparameterization, which leads to large model trends\n", + "- the conditions under which double descent is observed and what affects its significance\n", + "- the conditions under which double descent does not occur\n", " \n", "Let's jump in!" ] @@ -496,7 +495,7 @@ "---\n", "# Section 1: Overfitting in overparameterized models\n", "\n", - "In this section we will observe the classical behaviour of overparametrized networks - overfitting." + "In this section we will observe the classical behaviour of overparametrized networks: overfitting. This is where a model becomes tuned specifically to the features of the training data, beyond the general patterns. For example, if data points are measured with an imperfect system that introduces noise into the recording values, then overfitting can be thought of as a model that learns the signal **and** the noise associated with each data point, instead of just the **signal**. This is characterized by a low training error and a higher test error." ] }, { @@ -579,7 +578,7 @@ "\n", "We start by generating a simple sinusoidal dataset.\n", "\n", - "This dataset contains 100 datapoints. We've selected a subset of 10 points for training. " + "This dataset contains 100 data points. We've selected a subset of 10 points for training. " ] }, { @@ -613,7 +612,7 @@ "\n", "The input $x\\in R$ is a scalar. There are $N_h$ hidden units, and the output $\\hat y\\in R$ is a scalar.\n", "\n", - "We will initialize $W_1$ with i.i.d. 
random Gaussian values with a variance of one and $b$ with values drawn i.i.d. uniformly between $-\\pi$ and $\\pi$. Finally, we will initialize the weights $W_2$ to zero.\n", + "We will initialize $W_1$ with i.i.d. random Gaussian values with a variance of `1.0` and $b$ with values drawn i.i.d. uniformly between $-\\pi$ and $\\pi$. Finally, we will initialize the weights $W_2$ to `0`.\n", "\n", "We only train $W_2$, leaving $W_1$ and $b$ fixed. We can train $W_2$ to minimize the mean squared error between the training labels $y$ and the network's output on those datapoints. \n", "\n", @@ -832,11 +831,11 @@ "execution": {} }, "source": [ - "## Coding Exercise 2: The bias-variance tradeoff\n", + "## Coding Exercise 2: The Bias-Variance Trade-off\n", "\n", "With the network implemented, we now investigate how the size of the network (the number of hidden units it has, $N_h$) relates to its ability to generalize. \n", "\n", - "Ultimately, the true measure of a learning system is how well it performs on novel inputs, that is, its ability to generalize. The classical story of how model size relates to generalization is the bias-variance tradeoff.\n", + "Ultimately, the true measure of a learning system is how well it performs on novel inputs, that is, its ability to generalize. The classical story of how model size relates to generalization is captured in the concept of the bias-variance tradeoff. We assume you are familiar with this concept already. If not, take some time to discuss in your group or search out a verified explanation to review.\n", "\n", "To start, complete the code below to train several small networks with just two hidden neurons and plot their predictions." ] @@ -941,7 +940,7 @@ "execution": {} }, "source": [ - "With just two hidden units, the model cannot fit the training data, nor can it do well on the test data. A network of this size has a high bias.\n", + "With just two hidden units, the model cannot fit the training data, nor the test data. 
A network of this size has **high bias**.\n", "\n", "Now, let's train a network with five hidden units.\n", "\n", @@ -976,9 +975,9 @@ "execution": {} }, "source": [ - "With five hidden units, the model can do a better job of fitting the training data, and also follows the test data more closely - though still with errors.\n", + "With five hidden units, the model can do a better job of fitting the training data, and also follows the test data more closely, though still with errors.\n", "\n", - "Next let's try 10 hidden units." + "Next let's try 10 hidden units. Before running the cell below, try to visualize how you think the plot will look." ] }, { @@ -1009,7 +1008,7 @@ "source": [ "With 10 hidden units, the network often fits every training datapoint, but generalizes poorly--sometimes catastrophically so. We say that this size network has high variance. Intuitively, it is so complex that it can fit the training data perfectly, but this same complexity means it can take many different shapes in between datapoints.\n", "\n", - "We have just traced out the bias-variance tradeoff: the models with 2 hidden units had high bias, while the models with 10 hidden units had high variance. The models with 5 hidden units struck a balance--they were complex enough to achieve relatively low error on the training datapoints, but simple enough to be well constrained by the training data." + "We have just traced out the bias-variance tradeoff: the models with 2 hidden units had high bias, while the models with 10 hidden units had high variance. The models with 5 hidden units struck a balance--they were complex enough to achieve relatively low error on the training data points, but simple enough to be well constrained by the training data. The best choice of neural network architecture (e.g. choosing the number of neurons in a layer) is therefore highly dependent on the structure of the problem you are trying to solve and the format of the input data.
It also involves trying out a few values and checking for where on the bias-variance trade-off line you find yourself. This was extremely important in classical approaches to understanding how to develop good neural networks. That was the classical picture; let's now move on to the **Modern Regime**!" ] }, { @@ -1034,7 +1033,7 @@ }, "source": [ "---\n", - "# Section 2: The modern regime\n", + "# Section 2: The Modern Regime\n", "\n", "Estimated timing to here from start of tutorial: 20 minutes\n", "\n", @@ -1117,9 +1116,9 @@ "execution": {} }, "source": [ - "We just saw that a network with 10 hidden units trained on 10 training datapoints could fail to generalize. If we add even more hidden units, it seems unlikely that the network could perform well. How could hundreds of weights be correctly constrained with just these ten datapoints?\n", + "We just saw that a network with 10 hidden units trained on 10 training data points could fail to generalize. If we add even more hidden units, it seems unlikely that the network could perform well. How could hundreds of weights be correctly constrained by just ten data points, when even ten hidden units already failed to generalize?\n", "\n", - "But let's try it. Throw caution to the wind and train a network with 500 hidden units." + "Let's go crazy and train a network with `500` hidden units and see what happens! " ] }, { @@ -1158,11 +1157,11 @@ "execution": {} }, "source": [ - "Remarkably, this very large network fits the training datapoints and generalizes well.\n", + "Remarkably, this very large network fits the training data points and generalizes well. We've managed to get predictions that look like they have learned the distribution of our input data correctly.\n", "\n", - "This network has fifty times as many parameters as datapoints. How can this be?\n", + "This network has fifty times as many parameters as data points.
How can this be?\n", "\n", - "We've tested four different network sizes and seen the qualitative behavior of the predictions. Now, let's systematically compute the average test error for different network sizes.\n", + "We have tested four different network sizes (`2`, `5`, `10`, `500`) and have seen the qualitative behavior of the predictions. Now, let's systematically compute the average test error for different network sizes.\n", "\n", "For each network size in the array below, train 100 networks and plot their mean test error." ] }, { @@ -1266,9 +1265,9 @@ "\n", "Hence, in this scenario, larger models perform better--even when they contain many more parameters than datapoints.\n", "\n", - "The peak (worst generalization) is at an intermediate model size when the number of hidden units is equal to the number of examples in this case. More generally, it turns out the peak occurs when the model first becomes complex enough to reach zero training error. This point is known as the interpolation point.\n", + "The peak (worst generalization) is at an intermediate model size when the number of hidden units is equal to the number of examples in this case. More generally, it turns out the peak occurs when the model first becomes complex enough to reach zero training error. This point is known as the **interpolation threshold**.\n", "\n", - "The trend for deep learning models to grow in size is in part due to this phenomenon of double descent. Let's now see its limits." + "The trend of deep learning models growing ever larger is due in part to the phenomenon of double descent. But does it always hold? Let's now see where its limits are and what modulates the ability to learn in this overparameterized regime."
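The shape of this curve can be reproduced in miniature with plain least squares, because only $W_2$ is trained: the hidden layer is a fixed random feature map, and the minimum-norm least-squares solution is exactly what gradient descent converges to from a zero-initialized $W_2$. The sketch below is illustrative only (the sizes, names, and trial counts are ours, not the tutorial's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, W1, b):
    # Fixed random hidden layer: tanh(W1 * x + b), shape (len(x), N_h)
    return np.tanh(np.outer(x, W1) + b)

x_train = np.linspace(-np.pi, np.pi, 10)   # 10 training points, as above
y_train = np.sin(x_train)
x_test = np.linspace(-np.pi, np.pi, 100)
y_test = np.sin(x_test)

test_err = {}
for N_h in (2, 5, 10, 500):
    errs = []
    for _ in range(20):  # average over random feature draws
        W1 = rng.normal(0.0, 1.0, N_h)
        b = rng.uniform(-np.pi, np.pi, N_h)
        Phi = features(x_train, W1, b)
        # Minimum-norm least squares = gradient-descent limit from W2 = 0
        W2 = np.linalg.pinv(Phi) @ y_train
        pred = features(x_test, W1, b) @ W2
        errs.append(np.mean((pred - y_test) ** 2))
    test_err[N_h] = np.mean(errs)

# We expect the worst average error near N_h == number of training points
# (the interpolation threshold), and low error again for very wide networks.
```

In this toy version the double descent peak appears for the same reason as in the tutorial's trained networks: near the interpolation threshold the feature matrix is barely invertible, so the fitted weights blow up.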
] }, { @@ -1280,7 +1279,7 @@ "source": [ "## Interactive Demo 1: Interpolation point & predictions\n", "\n", - "In this interactive demo, you can move the slider for the number of hidden units in the network to be trained on and observe one representative trial of predicted values." + "In this interactive demo, you have a slider that controls the number of hidden units in the network to be trained; move it and observe one representative trial of predicted values." ] }, { @@ -1309,7 +1308,7 @@ "execution": {} }, "source": [ - "The trend for deep learning models to grow in size is in part due to the phenomenon of double descent. Let's now see its limits." + "Having experimented with this interactive tool for a little while, are you able to see the relationship between the results shown here and the double descent plot above (the previous figure)?" ] }, { @@ -1335,11 +1334,11 @@ "source": [ "\n", "---\n", - "# Section 3: Double descent, noise & regularization\n", + "# Section 3: Double Descent, Noise & Regularization\n", "\n", - "Estimated timing to here from start of tutorial: 35 minutes\n", + "*Estimated timing to here from start of tutorial: 35 minutes*\n", "\n", - "In this section, we are going to explore the effect of noise and regularization on double descent behavior." + "In this section, we are going to explore the effect of noise and regularization on double descent." ] }, { @@ -1428,7 +1427,7 @@ "execution": {} }, "source": [ - "So far, our training datapoints have been noiseless. Intuitively, a noisy training dataset might hurt the ability of complex models to generalize. In this section, we are going to explore the effect of noise on double descent behavior.\n", + "So far, our training data points have been noiseless. Intuitively, a noisy training dataset might hurt the ability of complex models to generalize. In this section, we are going to explore the effect of noise on double descent behavior.\n", "\n", "Let's test this. Add i.i.d.
Gaussian noise of different standard deviations to the training labels, and plot the resulting double descent curves." ] }, { @@ -1533,7 +1532,7 @@ "execution": {} }, "source": [ - "Though we are still able to observe the double descent effect, its strength is reduced with the increase in noise level." + "Though we are still able to observe the effect of double descent, it's nowhere near as clear when we introduce noise into the training data." ] }, { @@ -1575,10 +1574,14 @@ "source": [ "We observe that the \"peak\" disappears, and the test error roughly monotonically decreases, although it is generally higher for higher noise levels in the training data.\n", "\n", + "
\n", + " A note about the use of the term \"regularization\" in multiple contexts in ML (optional)\n", + "
\n", "The word *regularization* is commonly used in statistics/ML parlance in two different contexts to ensure the good generalization of overparameterized models:\n", "\n", "- The first context, which is emphasized throughout the tutorial, is explicit regularization which means that the model is not trained to completion (zero training error) in order to avoid overfitting of noise. Without explicit regularization, we observe the double descent behavior – i.e. catastrophic overfitting when the number of model parameters is too close to the number of training examples – but also a vast reduction in this overfitting effect as we heavily overparameterize the model. With explicit regularization (when tuned correctly), the double descent behavior disappears because we no longer run the risk of overfitting to noise at all.\n", - "- The second context is the one of inductive bias – overparameterized models, when trained with popular optimization algorithms like gradient descent, tend to converge to a particularly “simple” solution that perfectly fits the data. By “simple”, we usually mean that the size of the parameters (in terms of magnitude) is very small. This inductive bias is a big reason why double descent occurs as well, in particular, the benefit of overparameterization in reducing overfitting." + "- The second context is the one of inductive bias – overparameterized models, when trained with popular optimization algorithms like gradient descent, tend to converge to a particularly “simple” solution that perfectly fits the data. By “simple”, we usually mean that the size of the parameters (in terms of magnitude) is very small. This inductive bias is a big reason why double descent occurs as well, in particular, the benefit of overparameterization in reducing overfitting.\n", + "
" ] }, { @@ -1668,9 +1671,9 @@ "execution": {} }, "source": [ - "The network smoothly interpolates between the training datapoints. Even when noisy, these can still somewhat track the test data. Depending on the noise level, though, a smaller and more constrained model can be better.\n", + "The network smoothly interpolates between the training data points. Even when noisy, these can still somewhat track the test data. Depending on the noise level, though, a smaller and more constrained model can be better.\n", "\n", - "From this, we might expect that large models will work particularly well for datasets with little label noise. Many real-world datasets fit this requirement: image classification datasets strive to have accurate labels for all datapoints, for instance. Other datasets may not. For instance, predicting DSM-V diagnoses from structural MRI data is a noisy task, as the diagnoses themselves are noisy." + "From this, we might expect that large models will work particularly well for datasets with little label noise. Many real world datasets fit this requirement: image classification datasets strive to have accurate labels for all data points, for instance. Other datasets may not. For instance, predicting DSM-V diagnoses from structural MRI data is a noisy task, as the diagnoses themselves are noisy due to the inherent difficulty of mapping observations to clinically defined classes." ] }, { @@ -1695,11 +1698,12 @@ }, "source": [ "---\n", - "# Section 4: Double descent and initialization\n", + "# Section 4: Double Descent and Initialization\n", "\n", - "Estimated timing to here from start of tutorial: 50 minutes\n", + "*Estimated timing to here from start of tutorial: 50 minutes*\n", "\n", - "So far, we have considered one important aspect of architecture, namely the size or number of hidden neurons. A second critical aspect is initialization." 
+ "\n", + "So far, we have considered one important aspect of neural network architectures, namely the width of a hidden layer (the number of neurons in the layer). However, another critical aspect connected to the emergence of the double descent phenomenon is that of weight initialization. In the last tutorial, we explored weight initialization from the perspective of a chaotic and non-chaotic regime. We saw that low variance initialization of weights led some deep MLPs to exhibit transformations in the quasi-linear regime. Let's now explore what the effects of weight initialization are on the double descent phenomenon." ] }, { @@ -1853,11 +1857,11 @@ "execution": {} }, "source": [ - "We see that for overparametrized models, where the number of parameters is larger than the number of training examples, the initialization scale strongly impacts the test error. The good performance of these large models thus depends on our choice of initializing $W_2$ equal to zero.\n", + "We see that for our overparametrized model (where the number of parameters is larger than the number of training examples) the initialization scale strongly impacts the test error. The better performance of these large models thus depends on our choice of initializing $W_2$ equal to zero.\n", "\n", "Intuitively, this is because directions of weight space in which we have no training data are not changed by gradient descent, so poor initialization can continue to affect the model even after training. Large initializations implement random functions that generalize poorly.\n", "\n", - "Let's see what the predictions of a large-variance-initialization network with 500 hidden neurons look like." + "Let's see what the predictions of a large-variance-initialization network with 500 hidden neurons look like. We will set `init_scale = 1.0` in this example below." 
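Why the initialization scale matters can be seen directly in this linear-in-$W_2$ setting: gradient descent never changes the component of $W_2$ that lies in the null space of the training features, so it converges to the interpolating solution closest to its initialization. A rough numerical sketch of that claim follows; the sizes and variable names are our own illustrative choices, not the tutorial's code:

```python
import numpy as np

rng = np.random.default_rng(1)

N_h, n_train = 500, 10
W1 = rng.normal(0.0, 1.0, N_h)                 # fixed random hidden weights
b = rng.uniform(-np.pi, np.pi, N_h)            # fixed random biases

def features(x):
    return np.tanh(np.outer(x, W1) + b)        # shape (len(x), N_h)

x_tr = np.linspace(-np.pi, np.pi, n_train)
y_tr = np.sin(x_tr)
x_te = np.linspace(-np.pi, np.pi, 100)
y_te = np.sin(x_te)

Phi = features(x_tr)
P = np.linalg.pinv(Phi)

results = {}
for init_scale in (0.0, 1.0):
    W0 = rng.normal(0.0, init_scale, N_h)
    # Gradient descent from W0 converges to the interpolator closest to W0:
    # W2 = W0 + pinv(Phi) @ (y - Phi @ W0)
    W2 = W0 + P @ (y_tr - Phi @ W0)
    results[init_scale] = {
        "train_mse": np.mean((Phi @ W2 - y_tr) ** 2),
        "test_mse": np.mean((features(x_te) @ W2 - y_te) ** 2),
    }
```

Both initializations interpolate the training set, but the large-scale initialization carries a leftover random function, untouched by training, into the test predictions, which is one way to see why zero-initializing $W_2$ helps the overparameterized model generalize.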
] }, { @@ -2011,9 +2015,9 @@ }, "source": [ "---\n", - "# Summary\n", + "# The Big Picture\n", "\n", - "Estimated timing of tutorial: 1 hour\n", + "*Estimated timing of tutorial: 1 hour*\n", "\n", "In this tutorial, we observed the phenomenon of double descent: the situation when the overparameterized network was expected to behave as overfitted but instead generalized better to the unseen data. Moreover, we discovered how noise, regularization & initial scale impact the effect of double descent and, in some cases, can fully cancel it.\n", "\n", @@ -2048,7 +2052,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial3.ipynb b/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial3.ipynb index 651b867d4..bef38757c 100644 --- a/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial3.ipynb +++ b/tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial3.ipynb @@ -23,9 +23,9 @@ "\n", "__Content creators:__ Ruiyi Zhang\n", "\n", - "__Content reviewers:__ Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault\n", + "__Content reviewers:__ Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault, Alex Murphy\n", "\n", - "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk" + "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy" ] }, { @@ -96,7 +96,8 @@ "# @title Install and import feedback gadget\n", "\n", "!pip install vibecheck datatops --quiet\n", - "!pip install pandas~=2.0.0 --quiet\n", + "!pip install pandas --quiet\n", + "!pip install scikit-learn --quiet\n", "\n", "from vibecheck import DatatopsContentReviewContainer\n", "def content_review(notebook_section: str):\n", @@ -1795,7 +1796,7 @@ "---\n", "# Section 2: Evaluate agents in the training task\n", "\n", - "Estimated timing to here from start of tutorial: 25 
minutes" + "*Estimated timing to here from start of tutorial: 25 minutes*" ] }, { @@ -1804,7 +1805,7 @@ "execution": {} }, "source": [ - "With the code for the environment and agents done, we will now write an evaluation function allowing the agent to interact with the environment." + "With the code for the environment and agents complete, we will now write an evaluation function that lets the agent interact with the environment so that the quality of the model can be assessed." ] }, { @@ -1822,7 +1823,7 @@ "execution": {} }, "source": [ - "We first sample 1000 targets for agents to steer to." + "We first sample 1,000 targets for the RL agent to steer towards." ] }, { @@ -2069,7 +2070,7 @@ "execution": {} }, "source": [ - "Since training RL agents takes a lot of time, here we load the pre-trained modular and holistic agents and evaluate these two agents on the same sampled 1000 targets. We will then store the evaluation data in pandas dataframes." + "Since training RL agents takes a lot of time, here we load the pre-trained modular and holistic agents and evaluate these two agents on the same sampled 1,000 targets. We will then store the evaluation data in `pandas` DataFrame objects." ] }, { @@ -2509,9 +2510,11 @@ "execution": {} }, "source": [ - "It is well known that an RL agent's performance can vary significantly with different random seeds. Therefore, no conclusions can be drawn based on one training run with a single random seed.\n", + "It is well known that an RL agent's performance can vary significantly with different random seeds. Therefore, no conclusions can be drawn based on one training run with a single random seed.
To draw more convincing conclusions, we must run the same experiment across several random initializations and confirm that any result is robustly seen across them.\n", "\n", - "Both agents were trained with eight random seeds, and all of them were evaluated using the same sample of $1000$ targets. Let's load this saved trajectory data." + "Both agents were trained across 8 random seeds. All of them were evaluated using the same sample of 1,000 targets.\n", + "\n", + "Let's load this saved trajectory data." ] }, { @@ -2546,7 +2549,7 @@ "execution": {} }, "source": [ - "We first compute the fraction of rewarded trials in the total $1000$ trials for all training runs with different random seeds for the modular and holistic agents. We visualize this using a bar plot, with each red dot denoting the performance of a random seed." + "We first compute the fraction of rewarded trials in the total 1,000 trials for all training runs with different random seeds for the modular and holistic agents. We visualize this using a bar plot, with each red dot denoting the performance of a random seed." ] }, { @@ -2602,9 +2605,9 @@ "execution": {} }, "source": [ - "Despite similar performance measured by a rewarded fraction, we dis observe qualitative differences in the trajectories of the two agents in the previous sections. It is possible that the holistic agent's more curved trajectories, although reaching the target, are less efficient, i.e., they waste more time.\n", + "Despite similar performance measured by a rewarded fraction, we did observe qualitative differences in the trajectories of the two agents in the previous sections. It is possible that the holistic agent's more curved trajectories, although reaching the target, are less efficient, i.e., they waste more time.\n", "\n", - "Therefore, we also plot the time spent by both agents for the same 1000 targets."
+ "Therefore, we also plot the time spent by both agents for the same 1,000 targets." ] }, { @@ -2784,7 +2787,9 @@ "---\n", "# Section 3: A novel gain task\n", "\n", - "Estimated timing to here from start of tutorial: 50 minutes" + "*Estimated timing to here from start of tutorial: 50 minutes*\n", + "\n", + "The prior task had a fixed joystick gain that meant consistent linear and angular velocities. We will now look at a novel task that tests the generalization capabilities of these models by varying this setting between training and testing. Will the model generalize well?" ] }, { @@ -3218,22 +3223,7 @@ "execution": {} }, "source": [ - "---\n", - "# Summary\n", - "\n", - "*Estimated timing of tutorial: 1 hour*\n", - "\n", - "In this tutorial, we explored the difference in agents' performance based on their architecture. We revealed that modular architecture, with separate modules for learning different aspects of behavior, is superior to a holistic architecture with a single module. The modular architecture with stronger inductive bias achieves good performance faster and has the capability to generalize to other tasks as well. Intriguingly, this modularity is a property we also observe in the brains, which could be important for generalization in the brain as well." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "execution": {} - }, - "source": [ - "---\n", - "# Bonus Section 1: Decoding analysis" + "## Decoding analysis" ] }, { @@ -3365,7 +3355,21 @@ }, "source": [ "---\n", - "# Bonus Section 2: Generalization, but no free lunch\n", + "# The Big Picture\n", + "\n", + "*Estimated timing of tutorial: 1 hour*\n", + "\n", + "In this tutorial, we explored the difference in agents' performance based on their architecture. We revealed that modular architecture, with separate modules for learning different aspects of behavior, is superior to a holistic architecture with a single module. 
The modular architecture with stronger inductive bias achieves good performance faster and has the capability to generalize to other tasks as well. Intriguingly, this modularity is a property we also observe in biological brains, where it could be important for generalization as well." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": {} + }, + "source": [ + "---\n", + "# Addendum: Generalization, but no free lunch\n", "\n", "The No Free Lunch theorems proved that no inductive bias can excel across all tasks. It has been studied in the [paper](https://www.science.org/doi/10.1126/sciadv.adk1256) that agents with a modular architecture can acquire the underlying structure of the training task. In contrast, holistic agents tend to acquire different knowledge than modular agents during training, such as forming beliefs based on unreliable information sources or exhibiting less efficient control actions. The novel gain task has a structure similar to the training task, consequently, a modular agent that accurately learns the training task's structure can leverage its knowledge in these novel tasks.\n", @@ -3416,7 +3420,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/student/W2D1_Intro.ipynb b/tutorials/W2D1_Macrocircuits/student/W2D1_Intro.ipynb index 4f3ce7bdb..015c5bbe5 100644 --- a/tutorials/W2D1_Macrocircuits/student/W2D1_Intro.ipynb +++ b/tutorials/W2D1_Macrocircuits/student/W2D1_Intro.ipynb @@ -57,7 +57,9 @@ "source": [ "## Prerequisites\n", "\n", - "Materials of this day assume you have had the experience of model building in `pytorch` earlier. It would be beneficial too if you had the basics of Linear Algebra before as well as if you had played around with Actor-Critic model in Reinforcement Learning setup."
+ "In order to get the most out of today's tutorials, it would greatly help if you had experience building (simple) neural network models in PyTorch. We will also be using some concepts from Linear Algebra, so some familiarity with that domain will come in handy. Finally, we will be looking at a specific algorithm in Reinforcement Learning (RL) called the Actor-Critic model, so some prior exposure to RL will help. We touched a little bit on RL in W1D2 (\"Comparing Tasks\"), specifically in Tutorial 3 (\"Reinforcement Learning Across Temporal Scales\"). It could be good to refer back to that tutorial and to check out the two videos on Meta-RL in that tutorial notebook.\n", + "\n", + "Today is a little more technical and theory-driven, but it will give you the skills to work with, and an appreciation of, some very interesting ideas in NeuroAI. What we encourage you to keep in mind is how this knowledge helps you to appreciate the concept of generalization, the over-arching theme of this entire course. At many points today, we will see how learning dynamics arrive at solutions that **generalize well**!"
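If the Actor-Critic idea is new to you, the core structure is simply a policy (the "actor") and a state-value estimate (the "critic") computed from a shared representation of the observation. A minimal sketch, using `numpy` rather than the `pytorch` agents of the tutorials, with illustrative names of our own:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions, hidden = 8, 3, 16

# One shared feature layer feeding two linear "heads":
# the actor (a policy over actions) and the critic (a value estimate).
w_body = rng.normal(size=(obs_dim, hidden))
w_actor = rng.normal(size=(hidden, n_actions))
w_critic = rng.normal(size=(hidden, 1))

def actor_critic(obs):
    h = np.tanh(obs @ w_body)
    logits = h @ w_actor
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax -> action probabilities
    value = float(h @ w_critic)          # scalar state-value estimate
    return probs, value

probs, value = actor_critic(rng.normal(size=obs_dim))
```

During training, the critic's value estimate is used to judge how much better or worse a sampled action turned out than expected, and that signal updates the actor — the tutorials build on exactly this division of labor.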
] }, { @@ -210,7 +212,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial1.ipynb b/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial1.ipynb index e2f5481c9..ea644c2d5 100644 --- a/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial1.ipynb +++ b/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial1.ipynb @@ -25,9 +25,9 @@ "\n", "__Content creators:__ Gabriel Mel de Fontenay\n", "\n", - "__Content reviewers:__ Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Patrick Mineault\n", + "__Content reviewers:__ Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Patrick Mineault, Alex Murphy\n", "\n", - "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk\n" + "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy\n" ] }, { @@ -43,13 +43,13 @@ "\n", "*Estimated timing of tutorial: 1 hour*\n", "\n", - "In this tutorial we will take a closer look at the expressivity of neural networks by observing the following:\n", + "In this tutorial we will take a closer look at the **expressivity** of neural networks, the ability of neural networks to model a wide range of functions. We will make the following observations:\n", "\n", - "- The **universal approximator theorem** guarantees that we can approximate any complex function using a network with a single hidden layer. 
The catch is that the approximating network might need to be extremely *wide*.\n", - "- We will explore this issue by constructing a complex function and attempting to fit it with shallow networks of varying widths.\n", - "- To create this complex function, we'll build a random deep neural network. This is an example of the **student-teacher setting**, where we attempt to fit a known *teacher* function (the deep network) using a *student* model (the shallow/wide network).\n", - "- We will find that the deep teacher network can be either very easy or very hard to approximate and that the difficulty level is related to a form of **chaos** in the network activities.\n", - "- Each layer of a neural network can effectively expand and fold the input it receives from the previous layer. This repeated expansion and folding grants deep neural networks models high **expressivity** - ie. allows them to implement a large number of different functions.\n", + "- The **universal approximator theorem** guarantees that we can approximate any complex function using a network with a single hidden layer. The catch is that the approximating network might need to be extremely *wide*, and the theorem only states the existence of such a model (not how many neurons are required per task)\n", + "- We will explore this issue by constructing a complex function and attempting to fit it with shallow networks of varying widths\n", + "- To create this complex function, we'll build a random deep neural network. This is an example of the **student-teacher setting**, where we attempt to fit a known *teacher* function (the deep network) using a *student* model (the shallow/wide network)\n", + "- We will see that it can be either very easy or very difficult to learn from the deep (teacher) network and this difficulty is related to a form of **chaos** in the network activations\n", + "- Each layer of a neural network can effectively expand and fold the input it receives from the previous layer. 
This repeated expansion and folding grants deep neural networks high **expressivity** - ie. allows them to capture the behavior of a large number of different functions\n", "\n", "Let's get started!" ] @@ -363,7 +363,9 @@ "\n", "# Section 1: Introduction\n", "\n", - "In this section we will create functions to capture the snippets of code that we will use repeatedly in what follows." + "In this section we will write some Python functions to help build some neural networks that will allow us to effectively examine the expressivity of shallow versus deep networks. We will specifically look at this issue through the lens of the universal approximation theorem and ask ourselves what deeper neural networks give us in terms of the ability to capture a wide range of functions. As you will recall from today's introduction video, the idea of each layer being able to fold activations via an activation function increases the ability to model nonlinear functions much more effectively. After going through this tutorial, this idea will hopefully be much clearer.\n", + "\n", + "By **shallow network**, we mean one with a very small number of layers (e.g. one). A shallow network can be **wide** if it has many, many neurons in this layer, or it can be smaller, having only a limited number of neurons. In contrast, a **deep network** is one with many layers. It's important to keep in mind that the term **wide** in the terminology we will use specifically refers to *the number of neurons in a layer, not the number of layers in a network*. If we take a single layer in a shallow or a deep network, we can describe it as being **wide** if it has a very large number of neurons. " ] }, { @@ -427,17 +429,17 @@ }, "source": [ "\n", - "The [universal approximator theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (UAT) guarantees that we can approximate any function arbitrarily well using a shallow network - ie. 
a network with a single hidden layer (figure below, left). So why do we need depth? The \"catch\" in the UAT is that approximating a complex function with a shallow network can require a very large number of hidden units - ie. the network must be very wide. The inability of shallow networks to efficiently implement certain functions suggests that network depth may be one of the brain's computational \"secret sauces\".\n", + "The [universal approximator theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (UAT) guarantees that we can approximate any function arbitrarily well using a shallow network - ie. a network with a single hidden layer (figure below, left). So why do we need depth? The *catch* in the UAT is that approximating a complex function with a shallow network can require a very large number of hidden units - ie. the network must be very wide. The inability of shallow networks to efficiently implement certain functions suggests that network depth may be one of the brain's computational *secret sauces*.\n", "\n", "\"Shallow\n", "\n", "To illustrate this fact, we'll create a complex function and then attempt to fit it with single-hidden-layer neural networks of different widths. What we'll find is that although the UAT guarantees that sufficiently wide networks can approximate our function, the performance will actually not be very good for our shallow nets of modest width.\n", "\n", - "One easy way to create a complex function is to build a random deep neural network (figure above, right). We then have a teacher network which generates the ground truth outputs, and a student network whose goal is to learn the mapping implemented by the teacher. This approach - known as the **student-teacher setting** - is useful for both computational and mathematical study of neural networks since it gives us complete control of the data generation process. 
Unlike with real-world data, we know the exact distribution of inputs and correct outputs.\n", + "One easy way to create a complex function is to build a random deep neural network (figure above, right), which serves as a teacher network (generating the ground truth outputs), and we'll also have a student network, whose goal is to learn the function defined by the teacher network. This approach - known as the **student-teacher setting** - is useful for both the computational and mathematical study of neural networks since it gives us complete control of the data generation process. Unlike with real-world data, we know the exact distribution of inputs and correct outputs. This means we aren't restricted by having to factor in any *noisy* signals that are not connected to our inputs.\n", "\n", - "Finally, we will show that depending on the distribution of the weights, a random deep neural network can be either very difficult or very easy to approximate with a shallow network. The \"complexity\" of the function computed by a random deep network thus depends crucially on the weight distribution. One can actually understand the boundary between hard and easy cases as a kind of boundary between chaos and non-chaos in a certain dynamical system. We will confirm that on the non-chaotic side, a random deep neural network can be effectively approximated by a shallow net. This demonstration will be based on ideas from the paper:\n", + "Finally, we will show that depending on the distribution of the weights, a random deep neural network can be either very difficult or very easy to approximate with a shallow network. The *complexity* of the function computed by a random deep network thus depends crucially on the weight distribution. One can actually understand the boundary between hard and easy cases as a kind of boundary between **chaos** and **non-chaos** in a certain dynamical system. 
We will confirm that on the non-chaotic side, a random deep neural network can be effectively approximated by a shallow net. This demonstration will be based on ideas from the following paper:\n", "\n", - "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) Poole et al. Neurips (2016)." + "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) (Poole et al., 2016)." ] }, { @@ -528,9 +530,9 @@ "source": [ "## Coding Exercise 1: Create an MLP\n", "\n", - "The code below implements a function that takes in an input dimension, a layer width, and a number of layers and creates a simple MLP in pytorch. In between each layer, we insert a hyperbolic tangent nonlinearity layer (`nn.Tanh()`).\n", + "The code below implements a function that takes in an input dimension, a layer width, and a number of layers and creates a simple MLP in pytorch where each layer has the same width. In between each layer, we insert a hyperbolic tangent nonlinearity layer (`nn.Tanh()`).\n", "\n", - "Convention: Because we will count the input as a layer, a depth of 2 will mean a network with just one hidden layer, followed by the output neuron. A depth of 3 will mean 2 hidden layers, and so on." + "Convention: Because we will count the input as a layer, a depth of 2 will mean a network with just one hidden layer, followed by the output neuron. A depth of 3 will mean 2 hidden layers, and so on." 
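As a quick sanity check on this depth convention, here is a minimal `numpy` sketch (the function names are our own, not the tutorial's exact code; bias-free tanh layers with the fan-in scaling $\sigma/\sqrt{n_{in}}$ mirror the exercises):

```python
import numpy as np

def make_mlp_weights(n_in, width, depth, sigma=1.0, seed=0):
    # Counting the input as a layer: depth 2 -> one hidden layer,
    # so there are depth - 1 weight matrices in total (no biases).
    rng = np.random.default_rng(seed)
    sizes = [n_in] + [width] * (depth - 1) + [1]
    return [rng.normal(0.0, sigma / np.sqrt(fan_in), size=(fan_in, fan_out))
            for fan_in, fan_out in zip(sizes[:-1], sizes[1:])]

def mlp_forward(weights, x):
    # tanh after every layer except the final output projection
    for w in weights[:-1]:
        x = np.tanh(x @ w)
    return x @ weights[-1]

ws = make_mlp_weights(n_in=5, width=5, depth=2)   # 2 matrices: one hidden layer
out = mlp_forward(ws, np.ones((3, 5)))            # 3 inputs of dimension 5 -> (3, 1)
```

With `depth=3` the same function returns three weight matrices, i.e. two hidden layers, matching the convention stated above.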
] }, { @@ -568,14 +570,14 @@ "\n", " # Assemble D-1 hidden layers and one output layer\n", "\n", - " #input layer\n", + " # input layer\n", " layers = [nn.Linear(n_in, W, bias = False), nonlin]\n", " for i in range(D - 2):\n", - " #linear layer\n", + " # linear layer\n", " layers.append(nn.Linear(W, W, bias = False))\n", - " #activation function\n", + " # activation function\n", " layers.append(nonlin)\n", - " #output layer\n", + " # output layer\n", " layers.append(nn.Linear(W, 1, bias = False))\n", "\n", " return nn.Sequential(*layers)\n", @@ -657,7 +659,7 @@ "source": [ "## Coding Exercise 2: Initialize model weights\n", "\n", - "Write a function that, given a model and a $\\sigma$, initializes all weights in the model according to a normal distribution with mean $0$ and standard deviation\n", + "Write a function that, given a model and $\\sigma$, initializes all weights in the model according to a normal (Gaussian) distribution with mean $0$ and standard deviation\n", " \n", " $$\\frac{\\sigma}{\\sqrt{n_{in}}},$$\n", " \n", @@ -810,7 +812,15 @@ "execution": {} }, "source": [ - "In this coding exercise, write a function that will train a given net on a given dataset. Function parameters include the network, the training inputs and outputs, the number of steps, and the learning rate. Set up loss function as MSE." + "In this coding exercise, write a function that will train a given net on a given dataset. Function parameters include:\n", + "\n", + "* the network (`net`)\n", + "* training inputs (`X`)\n", + "* the outputs (`y`)\n", + "* the number of steps (`n_epochs`)\n", + "* the learning rate (`lr`)\n", + "\n", + "Use the mean-squared error (MSE) loss function in the learning algorithm. You might need to check the pytorch documentation to see the exact layer name you will need to call for this." ] }, { @@ -902,7 +912,11 @@ "execution": {} }, "source": [ - "Now, write a helper function that computes the loss of a net on a dataset. 
It takes the following parameters: the network and the dataset inputs and outputs." + "Now, write a helper function that computes the loss of a net on a dataset. It takes the following parameters:\n", + "\n", + "* the network (`net`)\n", + "* the dataset inputs (`X`)\n", + "* the dataset outputs (`y`)" ] }, { @@ -976,7 +990,9 @@ "\n", "Estimated timing to here from start of tutorial: 20 minutes\n", "\n", - "We will now use the functions we've created to experiment with deep network fitting. In particular, we will see to what extent it is possible to fit a deep net using a shallow net. Specifically, we will fix a deep teacher and then fit it with a single-hidden-layer net with varying width value. In principle, if the number of hidden units is large enough, the error should be low. Let's see!" + "We will now use the functions we created to experiment with fitting various student models to our complex function (the randomly initialized deep neural network we defined earlier as the teacher network). In particular, we will see to what extent it is possible to fit a deep net using a shallow net. We will freeze a deep teacher network and then fit it with a single-hidden-layer net with varying widths. In principle, if the number of hidden units is large enough, the error should be low (according to the universal approximation theorem).\n", + "\n", + "Let's see if that's the case!" ] }, { @@ -1054,7 +1070,7 @@ "source": [ "## Coding Exercise 5: Create learning problem\n", "\n", - "Create a \"deep\" teacher network that accepts inputs of size 5. Give the network a width of 5 and a depth of 5. Use this to generate both a training and test set with 4000 examples for training and 1000 for testing. Initialize weights with a standard deviation of 2.0." + "Create a *deep* teacher network that accepts inputs of size `5`. Give the network a width of `5` and a depth of `5`. 
Use this to generate both a training and test set with 4,000 examples for training and 1,000 for testing. Initialize weights with a standard deviation of `2.0`." ] }, { @@ -1169,7 +1185,7 @@ "execution": {} }, "source": [ - "Now, let's train the student and observe the loss on a semi-log plot (the y-axis is logarithmic)! Your task is to complete the missing parts of the code. While the model is training training, you can go to the next coding exercise and return back to observe the results (it will take approximately 5 minutes)." + "Now, let's train the student and observe the loss on a semi-log plot (the y-axis is logarithmic)! Your task is to complete the missing parts of the code. While the model is being trained, you can go to the next coding exercise and return to observe the results shortly. It will take approximately 5 minutes for the model to complete training." ] }, { @@ -1233,7 +1249,7 @@ "execution": {} }, "source": [ - "## Coding Exercise 7: Train a 2 layer neural net with varying width" + "## Coding Exercise 7: Train a 2 layer neural net with varying widths" ] }, { @@ -1396,9 +1412,9 @@ "source": [ "---\n", "\n", - "# Section 3: Deep networks in the quasilinear regime\n", + "# Section 3: Deep networks in the quasi-linear regime\n", "\n", - "Estimated timing to here from start of tutorial: 45 minutes\n", + "*Estimated timing to here from start of tutorial: 45 minutes*\n", "\n", "We've just shown that certain deep networks are difficult to fit. In this section, we will discuss a regime in which a shallow network is able to approximate a deep teacher relatively well." ] @@ -1478,13 +1494,13 @@ "source": [ "One of the reasons that shallow nets cannot fit deep nets, in general, is that random deep nets, in certain regimes, behave like chaotic systems: each layer can be thought of as a single step of a dynamical system, and the number of layers plays the role of the number of time steps. 
A deep network, therefore, effectively subjects its input to long-time chaotic dynamics, which are, almost by definition, very difficult to predict accurately. In particular, *shallow* nets simply cannot capture the complex mapping implemented by deeper networks without resorting to an astronomical number of hidden units. Another way to interpret this behavior is that the many layers of a deep network repeatedly stretch and fold their inputs, allowing the network to implement a large number of complex functions - an idea known as **expressivity** ([Poole et al. 2016](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html)).\n", "\n", - "However, in other regimes, for example, when the weights of the teacher network are small, the dynamics implemented by the teacher network are no longer chaotic. In fact, for small enough weights, they are nearly linear. In this regime, we'd expect a shallow network to be able to approximate a deep teacher relatively well.\n", + "However, in other regimes, for example, when the weights of the teacher network are small, the dynamics implemented by the teacher network are no longer chaotic. In fact, for small enough weights, they are nearly linear. In this regime, we'd expect a shallow network to be able to approximate a deep teacher relatively well. This is what we mean by neural networks in a **quasi-linear** regime.\n", "\n", - "For more on these ideas, see the paper\n", + "For more on these ideas, see the paper:\n", "\n", - "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) Poole et al. 
Neurips (2016).\n", + "[*Exponential expressivity in deep neural networks through transient chaos*](https://papers.nips.cc/paper_files/paper/2016/hash/148510031349642de5ca0c544f31b2ef-Abstract.html) (Poole et al., 2016).\n", "\n", - "To test this idea, we'll repeat the exercise above, this time initializing the teacher weights with a small $\sigma$, say, $0.4$, so that the teacher network is quasi-linear." + "To test this idea, we'll repeat the exercise above, this time initializing the teacher weights with a small $\sigma$, say, $0.4$, so that the teacher network is in the so-called quasi-linear regime." ] }, { @@ -1493,7 +1509,7 @@ "execution": {} }, "source": [ - "## Coding Exercise 9: Create dataset & Train a student network\n", + "## Coding Exercise 9: Create Dataset & Train a Student Network\n", "\n", "Create training and test sets. Initialize the teacher network with $\sigma_{t} = 0.4$." ] }, { @@ -1686,9 +1702,9 @@ "\n", "In this demo, we invite you to explore the expressivity of two distinct deep networks already introduced earlier: one with $\sigma = 2$ and another (quasi-linear) with $\sigma = 0.4$. \n", "\n", - "We initialize two deep networks with $D=20$ layers with $W = 100$ hidden units each but different variances in their random parameters. Then, 400 input data points are generated on a unit circle. We will examine how these points are propagated through the networks.\n", + "We initialize two deep networks with $D=20$ layers with $W = 100$ hidden units each but different variances in their weight initializations. Then, 400 input data points are generated on a unit circle. We will examine how these points are propagated through the networks by looking at the effect of the transformations that each neural network layer applies to the data.\n", "\n", - "To visualize each layer's activity, we randomly project it into 3 dimensions. The slider below controls which layer you are seeing. 
On the left, you'll see how a standard network processes its inputs, and on the right, how a quasi-linear network does so. " + "To visualize each layer's activity, we randomly project it into 3 dimensions. The slider below controls which layer you are seeing. On the left, you'll see how a standard network processes its inputs, and on the right, how a quasi-linear network does so. As outlined in the video, the principal take-home message is that low values for the variance parameter in the weight initializations mean that each layer effectively performs a linear transformation, which only rotates and stretches the circular input we put into both networks. The chaotic regime of the standard network allows for a much greater expressivity due to this phenomenon!" ] }, { @@ -1785,6 +1801,17 @@ "- We discussed how the fitting difficulty is related to whether the teacher is initialized in the **chaotic** regime.\n", "- Chaotic behavior is related to network **expressivity**, the network's ability to implement a large number of complex functions." ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": {} + }, + "source": [ + "# The Big Picture\n", + "\n", + "So, how do the topics covered in this tutorial relate to our exploration of the theme of generalization? We have seen that deep neural networks in different regimes transform their inputs differently, and that this affects the expressivity of the network. Shallow and deep networks therefore provide different settings in which to test generalization capacity. Taking what you have learned in this tutorial, we leave you to think about which kinds of models generalize well to inputs outside of the training distribution. Do shallow networks capture the specific details of training inputs? 
Do they model the problem at the level of surface features, or do they capture the deeper structure that generalizes better? There is no single correct answer to this question, but it's a good exercise to think about and start forming your own thoughts and ideas." ] } ], "metadata": { @@ -1815,7 +1842,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial2.ipynb b/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial2.ipynb index fbadf8795..3299e70c9 100644 --- a/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial2.ipynb +++ b/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial2.ipynb @@ -25,9 +25,9 @@ "\n", "__Content creators:__ Andrew Saxe, Vidya Muthukumar\n", "\n", - "__Content reviewers:__ Max Kanwal, Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault\n", + "__Content reviewers:__ Max Kanwal, Surya Ganguli, Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault, Alex Murphy\n", "\n", - "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk" + "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy" ] }, { @@ -39,17 +39,16 @@ "source": [ "___\n", "\n", - "\n", "# Tutorial Objectives\n", "\n", "*Estimated timing of tutorial: 1 hour*\n", "\n", "In this tutorial, we'll look at the sometimes surprising behavior of large neural networks, which is called double descent. This empirical phenomenon puts the classical understanding of the bias-variance tradeoff in question: in double descent, highly overparametrized models can display good performance. 
In particular, we will explore the following: \n", "\n", - "- notions of low/high bias/variance;\n", - "- improvement of test performance with the network's overparameterization, which leads to large model trends;\n", - "- the conditions under which double descent is observed and what affects its significance;\n", - "- the conditions under which double descent does not occur.\n", + "- notions of low/high bias/variance\n", + "- improvement of test performance with the network's overparameterization, which leads to large model trends\n", + "- the conditions under which double descent is observed and what affects its significance\n", + "- the conditions under which double descent does not occur\n", " \n", "Let's jump in!" ] @@ -496,7 +495,7 @@ "---\n", "# Section 1: Overfitting in overparameterized models\n", "\n", - "In this section we will observe the classical behaviour of overparametrized networks - overfitting." + "In this section we will observe the classical behaviour of overparametrized networks: overfitting. This is where a model becomes tuned specifically to the features of the training data, beyond the general patterns. For example, if data points are measured with an imperfect system that introduces noise into the recording values, then overfitting can be thought of as a model that learns the signal **and** the noise associated with each data point, instead of just the **signal**. This is characterized by a low training error and a higher test error." ] }, { @@ -579,7 +578,7 @@ "\n", "We start by generating a simple sinusoidal dataset.\n", "\n", - "This dataset contains 100 datapoints. We've selected a subset of 10 points for training. " + "This dataset contains 100 data points. We've selected a subset of 10 points for training. " ] }, { @@ -613,7 +612,7 @@ "\n", "The input $x\\in R$ is a scalar. There are $N_h$ hidden units, and the output $\\hat y\\in R$ is a scalar.\n", "\n", - "We will initialize $W_1$ with i.i.d. 
random Gaussian values with a variance of one and $b$ with values drawn i.i.d. uniformly between $-\\pi$ and $\\pi$. Finally, we will initialize the weights $W_2$ to zero.\n", + "We will initialize $W_1$ with i.i.d. random Gaussian values with a variance of `1.0` and $b$ with values drawn i.i.d. uniformly between $-\\pi$ and $\\pi$. Finally, we will initialize the weights $W_2$ to `0`.\n", "\n", "We only train $W_2$, leaving $W_1$ and $b$ fixed. We can train $W_2$ to minimize the mean squared error between the training labels $y$ and the network's output on those datapoints. \n", "\n", @@ -773,11 +772,11 @@ "execution": {} }, "source": [ - "## Coding Exercise 2: The bias-variance tradeoff\n", + "## Coding Exercise 2: The Bias-Variance Trade-off\n", "\n", "With the network implemented, we now investigate how the size of the network (the number of hidden units it has, $N_h$) relates to its ability to generalize. \n", "\n", - "Ultimately, the true measure of a learning system is how well it performs on novel inputs, that is, its ability to generalize. The classical story of how model size relates to generalization is the bias-variance tradeoff.\n", + "Ultimately, the true measure of a learning system is how well it performs on novel inputs, that is, its ability to generalize. The classical story of how model size relates to generalization is captured in the concept of the bias-variance tradeoff. We assume you are familiar with this concept already. If not, take some time to discuss in your group or search out a verified explanation to review.\n", "\n", "To start, complete the code below to train several small networks with just two hidden neurons and plot their predictions." ] @@ -852,7 +851,7 @@ "execution": {} }, "source": [ - "With just two hidden units, the model cannot fit the training data, nor can it do well on the test data. A network of this size has a high bias.\n", + "With just two hidden units, the model cannot fit the training data, nor the test data. 
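Because only $W_2$ is trained while $W_1$ and $b$ stay fixed, the network described above is linear in its trainable weights. The sketch below illustrates this in closed form; it is not the notebook's own code — the `tanh` nonlinearity and all helper names are assumptions, and `lstsq`'s minimum-norm solution stands in for gradient descent on $W_2$ from zero, which converges to the same point for this linear problem.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_features(x, W1, b):
    # Hidden-layer activations for scalar inputs x, shape (len(x), N_h).
    # tanh is an assumed nonlinearity; the notebook's may differ.
    return np.tanh(np.outer(x, W1) + b)

def fit_w2(x_train, y_train, n_hidden):
    # W1 ~ N(0, 1) (variance 1.0), b ~ Uniform(-pi, pi), as in the text.
    W1 = rng.normal(0.0, 1.0, size=n_hidden)
    b = rng.uniform(-np.pi, np.pi, size=n_hidden)
    F = make_features(x_train, W1, b)
    # Minimum-norm least-squares solution for W2; gradient descent on the
    # mean squared error starting from W2 = 0 converges to this solution.
    W2, *_ = np.linalg.lstsq(F, y_train, rcond=None)
    return W1, b, W2

x_train = np.linspace(-np.pi, np.pi, 10)  # 10 training points
y_train = np.sin(x_train)                 # noiseless sinusoidal labels
W1, b, W2 = fit_w2(x_train, y_train, n_hidden=5)
train_mse = np.mean((make_features(x_train, W1, b) @ W2 - y_train) ** 2)
```

With few hidden units some training error remains; once `n_hidden` comfortably exceeds the number of training points, the fit is essentially exact.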
A network of this size has **high bias**.\n", "\n", "Now, let's train a network with five hidden units.\n", "\n", @@ -887,9 +886,9 @@ "execution": {} }, "source": [ - "With five hidden units, the model can do a better job of fitting the training data, and also follows the test data more closely - though still with errors.\n", + "With five hidden units, the model can do a better job of fitting the training data, and also follows the test data more closely, though still with errors.\n", "\n", - "Next let's try 10 hidden units." + "Next let's try 10 hidden units. Try to visualize how you think the plot might look before running the cell below." ] }, { @@ -920,7 +919,7 @@ "source": [ "With 10 hidden units, the network often fits every training datapoint, but generalizes poorly--sometimes catastrophically so. We say that this size network has high variance. Intuitively, it is so complex that it can fit the training data perfectly, but this same complexity means it can take many different shapes in between datapoints.\n", "\n", - "We have just traced out the bias-variance tradeoff: the models with 2 hidden units had high bias, while the models with 10 hidden units had high variance. The models with 5 hidden units struck a balance--they were complex enough to achieve relatively low error on the training datapoints, but simple enough to be well constrained by the training data." + "We have just traced out the bias-variance tradeoff: the models with 2 hidden units had high bias, while the models with 10 hidden units had high variance. The models with 5 hidden units struck a balance--they were complex enough to achieve relatively low error on the training data points, but simple enough to be well constrained by the training data. The best choice of neural network architecture (e.g. choosing the number of neurons in a layer) is therefore highly dependent on the structure of the problem you are trying to solve and the format of the input data.
It also involves trying out a few values and checking for where on the bias-variance trade-off line you find yourself. This was extremely important in classical approaches to understanding how to develop good neural networks. That was the classical picture; next, we move on to the **Modern Regime**!" ] }, { @@ -945,7 +944,7 @@ }, "source": [ "---\n", - "# Section 2: The modern regime\n", + "# Section 2: The Modern Regime\n", "\n", "Estimated timing to here from start of tutorial: 20 minutes\n", "\n", @@ -1028,9 +1027,9 @@ "execution": {} }, "source": [ - "We just saw that a network with 10 hidden units trained on 10 training datapoints could fail to generalize. If we add even more hidden units, it seems unlikely that the network could perform well. How could hundreds of weights be correctly constrained with just these ten datapoints?\n", + "We just saw that a network with 10 hidden units trained on 10 training data points could fail to generalize. If we add even more hidden units, it seems unlikely that the network could perform well. How could hundreds of weights be correctly constrained by just these ten data points, when even ten hidden units already failed to generalize?\n", "\n", - "But let's try it. Throw caution to the wind and train a network with 500 hidden units." + "Let's go crazy and train a network with `500` hidden units and see what happens! " ] }, { @@ -1069,11 +1068,11 @@ "execution": {} }, "source": [ - "Remarkably, this very large network fits the training datapoints and generalizes well.\n", + "Remarkably, this very large network fits the training data points and generalizes well. We've managed to get predictions that look like they have learned the distribution of our input data correctly.\n", "\n", - "This network has fifty times as many parameters as datapoints. How can this be?\n", + "This network has fifty times as many parameters as data points.
How can this be?\n", "\n", - "We've tested four different network sizes and seen the qualitative behavior of the predictions. Now, let's systematically compute the average test error for different network sizes.\n", + "We have tested four different network sizes (`2`, `5`, `10`, `500`) and we saw the qualitative behavior of the predictions. Now, let's systematically compute the average test error for different network sizes.\n", "\n", "For each network size in the array below, train 100 networks and plot their mean test error." ] @@ -1150,9 +1149,9 @@ "\n", "Hence, in this scenario, larger models perform better--even when they contain many more parameters than datapoints.\n", "\n", - "The peak (worst generalization) is at an intermediate model size when the number of hidden units is equal to the number of examples in this case. More generally, it turns out the peak occurs when the model first becomes complex enough to reach zero training error. This point is known as the interpolation point.\n", + "The peak (worst generalization) is at an intermediate model size when the number of hidden units is equal to the number of examples in this case. More generally, it turns out the peak occurs when the model first becomes complex enough to reach zero training error. This point is known as the **interpolation threshold**.\n", "\n", - "The trend for deep learning models to grow in size is in part due to this phenomenon of double descent. Let's now see its limits." + "The trend for deep learning models growing in size is in part due to the implications of the phenomenon of double descent. But does it always hold? Let's now see where its limits are and what modulates the ability to learn in this overparameterized regime." 
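The size sweep just described can be sketched compactly with the same fixed-random-feature model. This is a hedged illustration rather than the notebook's code: it assumes `tanh` features, substitutes the closed-form minimum-norm fit for gradient descent from $W_2 = 0$, and averages over 20 redraws instead of 100 to keep it fast.

```python
import numpy as np

def mean_test_error(n_hidden, n_repeats=20, seed=0):
    """Average test MSE of random-feature networks of a given width.

    Each repeat redraws the frozen random layer (W1, b) and solves for
    the minimum-norm W2 on 10 training points.
    """
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-np.pi, np.pi, 10)
    y_train = np.sin(x_train)
    x_test = np.linspace(-np.pi, np.pi, 100)
    y_test = np.sin(x_test)
    errs = []
    for _ in range(n_repeats):
        W1 = rng.normal(0.0, 1.0, size=n_hidden)
        b = rng.uniform(-np.pi, np.pi, size=n_hidden)
        F_train = np.tanh(np.outer(x_train, W1) + b)
        W2, *_ = np.linalg.lstsq(F_train, y_train, rcond=None)
        F_test = np.tanh(np.outer(x_test, W1) + b)
        errs.append(np.mean((F_test @ W2 - y_test) ** 2))
    return float(np.mean(errs))

# Error tends to spike near the interpolation threshold (width ~ number
# of training points, 10 here) and fall again for much wider networks.
errors = {n: mean_test_error(n) for n in (2, 5, 10, 500)}
```

In this sketch the mean test error is typically worst near width 10, where the feature matrix is square and often ill-conditioned, and improves again deep in the overparameterized regime.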
] }, { @@ -1164,7 +1163,7 @@ "source": [ "## Interactive Demo 1: Interpolation point & predictions\n", "\n", - "In this interactive demo, you can move the slider for the number of hidden units in the network to be trained on and observe one representative trial of predicted values." + "In this interactive demo, you can move a slider that sets the number of hidden units in the network to be trained, and then observe one representative trial of predicted values." ] }, { @@ -1193,7 +1192,7 @@ "execution": {} }, "source": [ - "The trend for deep learning models to grow in size is in part due to the phenomenon of double descent. Let's now see its limits." + "Having experimented with this interactive tool for a little while, are you able to see the relationship between the results shown here and the double descent plot above (the previous figure)?" ] }, { @@ -1219,11 +1218,11 @@ "source": [ "\n", "---\n", - "# Section 3: Double descent, noise & regularization\n", + "# Section 3: Double Descent, Noise & Regularization\n", "\n", - "Estimated timing to here from start of tutorial: 35 minutes\n", + "*Estimated timing to here from start of tutorial: 35 minutes*\n", "\n", - "In this section, we are going to explore the effect of noise and regularization on double descent behavior." + "In this section, we are going to explore the effect of noise and regularization on double descent." ] }, { @@ -1312,7 +1311,7 @@ "execution": {} }, "source": [ - "So far, our training datapoints have been noiseless. Intuitively, a noisy training dataset might hurt the ability of complex models to generalize. In this section, we are going to explore the effect of noise on double descent behavior.\n", + "So far, our training data points have been noiseless. Intuitively, a noisy training dataset might hurt the ability of complex models to generalize. In this section, we are going to explore the effect of noise on double descent behavior.\n", "\n", "Let's test this. Add i.i.d.
Gaussian noise of different standard deviations to the training labels, and plot the resulting double descent curves." ] }, { @@ -1387,7 +1386,7 @@ "execution": {} }, "source": [ - "Though we are still able to observe the double descent effect, its strength is reduced with the increase in noise level." + "Though we are still able to observe the effect of double descent, it's nowhere near as clear when we introduce noise into the training data." ] }, { @@ -1429,10 +1428,14 @@ "source": [ "We observe that the \"peak\" disappears, and the test error roughly monotonically decreases, although it is generally higher for higher noise levels in the training data.\n", "\n", + "
\n", + " A note about the use of the term \"regularization\" in multiple contexts in ML (optional)\n", + "
\n", "The word *regularization* is commonly used in statistics/ML parlance in two different contexts to ensure the good generalization of overparameterized models:\n", "\n", "- The first context, which is emphasized throughout the tutorial, is explicit regularization which means that the model is not trained to completion (zero training error) in order to avoid overfitting of noise. Without explicit regularization, we observe the double descent behavior – i.e. catastrophic overfitting when the number of model parameters is too close to the number of training examples – but also a vast reduction in this overfitting effect as we heavily overparameterize the model. With explicit regularization (when tuned correctly), the double descent behavior disappears because we no longer run the risk of overfitting to noise at all.\n", - "- The second context is the one of inductive bias – overparameterized models, when trained with popular optimization algorithms like gradient descent, tend to converge to a particularly “simple” solution that perfectly fits the data. By “simple”, we usually mean that the size of the parameters (in terms of magnitude) is very small. This inductive bias is a big reason why double descent occurs as well, in particular, the benefit of overparameterization in reducing overfitting." + "- The second context is the one of inductive bias – overparameterized models, when trained with popular optimization algorithms like gradient descent, tend to converge to a particularly “simple” solution that perfectly fits the data. By “simple”, we usually mean that the size of the parameters (in terms of magnitude) is very small. This inductive bias is a big reason why double descent occurs as well, in particular, the benefit of overparameterization in reducing overfitting.\n", + "
" ] }, { @@ -1505,9 +1508,9 @@ "execution": {} }, "source": [ - "The network smoothly interpolates between the training datapoints. Even when noisy, these can still somewhat track the test data. Depending on the noise level, though, a smaller and more constrained model can be better.\n", + "The network smoothly interpolates between the training data points. Even when noisy, these can still somewhat track the test data. Depending on the noise level, though, a smaller and more constrained model can be better.\n", "\n", - "From this, we might expect that large models will work particularly well for datasets with little label noise. Many real-world datasets fit this requirement: image classification datasets strive to have accurate labels for all datapoints, for instance. Other datasets may not. For instance, predicting DSM-V diagnoses from structural MRI data is a noisy task, as the diagnoses themselves are noisy." + "From this, we might expect that large models will work particularly well for datasets with little label noise. Many real world datasets fit this requirement: image classification datasets strive to have accurate labels for all data points, for instance. Other datasets may not. For instance, predicting DSM-V diagnoses from structural MRI data is a noisy task, as the diagnoses themselves are noisy due to the inherent difficulty of mapping observations to clinically defined classes." ] }, { @@ -1532,11 +1535,12 @@ }, "source": [ "---\n", - "# Section 4: Double descent and initialization\n", + "# Section 4: Double Descent and Initialization\n", "\n", - "Estimated timing to here from start of tutorial: 50 minutes\n", + "*Estimated timing to here from start of tutorial: 50 minutes*\n", "\n", - "So far, we have considered one important aspect of architecture, namely the size or number of hidden neurons. A second critical aspect is initialization." 
+ "\n", + "So far, we have considered one important aspect of neural network architectures, namely the width of a hidden layer (the number of neurons in the layer). However, another critical aspect connected to the emergence of the double descent phenomenon is that of weight initialization. In the last tutorial, we explored weight initialization from the perspective of a chaotic and non-chaotic regime. We saw that low variance initialization of weights led some deep MLPs to exhibit transformations in the quasi-linear regime. Let's now explore what the effects of weight initialization are on the double descent phenomenon." ] }, { @@ -1678,11 +1682,11 @@ "execution": {} }, "source": [ - "We see that for overparametrized models, where the number of parameters is larger than the number of training examples, the initialization scale strongly impacts the test error. The good performance of these large models thus depends on our choice of initializing $W_2$ equal to zero.\n", + "We see that for our overparametrized model (where the number of parameters is larger than the number of training examples) the initialization scale strongly impacts the test error. The better performance of these large models thus depends on our choice of initializing $W_2$ equal to zero.\n", "\n", "Intuitively, this is because directions of weight space in which we have no training data are not changed by gradient descent, so poor initialization can continue to affect the model even after training. Large initializations implement random functions that generalize poorly.\n", "\n", - "Let's see what the predictions of a large-variance-initialization network with 500 hidden neurons look like." + "Let's see what the predictions of a large-variance-initialization network with 500 hidden neurons look like. We will set `init_scale = 1.0` in this example below." 
] }, { @@ -1836,9 +1840,9 @@ }, "source": [ "---\n", - "# Summary\n", + "# The Big Picture\n", "\n", - "Estimated timing of tutorial: 1 hour\n", + "*Estimated timing of tutorial: 1 hour*\n", "\n", "In this tutorial, we observed the phenomenon of double descent: the situation when the overparameterized network was expected to behave as overfitted but instead generalized better to the unseen data. Moreover, we discovered how noise, regularization & initial scale impact the effect of double descent and, in some cases, can fully cancel it.\n", "\n", @@ -1873,7 +1877,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4, diff --git a/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial3.ipynb b/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial3.ipynb index 96043c6af..a7283f901 100644 --- a/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial3.ipynb +++ b/tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial3.ipynb @@ -23,9 +23,9 @@ "\n", "__Content creators:__ Ruiyi Zhang\n", "\n", - "__Content reviewers:__ Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault\n", + "__Content reviewers:__ Xaq Pitkow, Hlib Solodzhuk, Patrick Mineault, Alex Murphy\n", "\n", - "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk" + "__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy" ] }, { @@ -96,7 +96,8 @@ "# @title Install and import feedback gadget\n", "\n", "!pip install vibecheck datatops --quiet\n", - "!pip install pandas~=2.0.0 --quiet\n", + "!pip install pandas --quiet\n", + "!pip install scikit-learn --quiet\n", "\n", "from vibecheck import DatatopsContentReviewContainer\n", "def content_review(notebook_section: str):\n", @@ -1576,7 +1577,7 @@ "---\n", "# Section 2: Evaluate agents in the training task\n", "\n", - "Estimated timing to here from start of tutorial: 25 minutes" + 
"*Estimated timing to here from start of tutorial: 25 minutes*" ] }, { @@ -1585,7 +1586,7 @@ "execution": {} }, "source": [ - "With the code for the environment and agents done, we will now write an evaluation function allowing the agent to interact with the environment." + "With the code for the environment and agents complete, we will now write an evaluation function that lets the agent interact with the environment so that the quality of the model can be assessed." ] }, { @@ -1603,7 +1604,7 @@ "execution": {} }, "source": [ - "We first sample 1000 targets for agents to steer to." + "We first sample 1,000 targets for the RL agents to steer towards." ] }, { @@ -1759,7 +1760,7 @@ "execution": {} }, "source": [ - "Since training RL agents takes a lot of time, here we load the pre-trained modular and holistic agents and evaluate these two agents on the same sampled 1000 targets. We will then store the evaluation data in pandas dataframes." + "Since training RL agents takes a lot of time, here we load the pre-trained modular and holistic agents and evaluate these two agents on the same sampled 1,000 targets. We will then store the evaluation data in `pandas` DataFrame objects." ] }, { @@ -2153,9 +2154,11 @@ "execution": {} }, "source": [ - "It is well known that an RL agent's performance can vary significantly with different random seeds. Therefore, no conclusions can be drawn based on one training run with a single random seed.\n", + "It is well known that an RL agent's performance can vary significantly with different random seeds. Therefore, no conclusions can be drawn based on one training run with a single random seed.
To draw more convincing conclusions, we must repeat the same experiment across different random initializations and confirm that the result is robustly reproduced across them.\n", "\n", - "Both agents were trained with eight random seeds, and all of them were evaluated using the same sample of $1000$ targets. Let's load this saved trajectory data." + "Both agents were trained across 8 random seeds. All of them were evaluated using the same sample of 1,000 targets.\n", + "\n", + "Let's load this saved trajectory data." ] }, { @@ -2190,7 +2193,7 @@ "execution": {} }, "source": [ - "We first compute the fraction of rewarded trials in the total $1000$ trials for all training runs with different random seeds for the modular and holistic agents. We visualize this using a bar plot, with each red dot denoting the performance of a random seed." + "We first compute the fraction of rewarded trials in the total 1,000 trials for all training runs with different random seeds for the modular and holistic agents. We visualize this using a bar plot, with each red dot denoting the performance of a random seed." ] }, { @@ -2246,9 +2249,9 @@ "execution": {} }, "source": [ - "Despite similar performance measured by a rewarded fraction, we dis observe qualitative differences in the trajectories of the two agents in the previous sections. It is possible that the holistic agent's more curved trajectories, although reaching the target, are less efficient, i.e., they waste more time.\n", + "Despite similar performance measured by a rewarded fraction, we did observe qualitative differences in the trajectories of the two agents in the previous sections. It is possible that the holistic agent's more curved trajectories, although reaching the target, are less efficient, i.e., they waste more time.\n", "\n", - "Therefore, we also plot the time spent by both agents for the same 1000 targets."
+ "Therefore, we also plot the time spent by both agents for the same 1,000 targets." ] }, { @@ -2428,7 +2431,9 @@ "---\n", "# Section 3: A novel gain task\n", "\n", - "Estimated timing to here from start of tutorial: 50 minutes" + "*Estimated timing to here from start of tutorial: 50 minutes*\n", + "\n", + "The prior task had a fixed joystick gain that meant consistent linear and angular velocities. We will now look at a novel task that tests the generalization capabilities of these models by varying this setting between training and testing. Will the model generalize well?" ] }, { @@ -2862,22 +2867,7 @@ "execution": {} }, "source": [ - "---\n", - "# Summary\n", - "\n", - "*Estimated timing of tutorial: 1 hour*\n", - "\n", - "In this tutorial, we explored the difference in agents' performance based on their architecture. We revealed that modular architecture, with separate modules for learning different aspects of behavior, is superior to a holistic architecture with a single module. The modular architecture with stronger inductive bias achieves good performance faster and has the capability to generalize to other tasks as well. Intriguingly, this modularity is a property we also observe in the brains, which could be important for generalization in the brain as well." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "execution": {} - }, - "source": [ - "---\n", - "# Bonus Section 1: Decoding analysis" + "## Decoding analysis" ] }, { @@ -3009,7 +2999,21 @@ }, "source": [ "---\n", - "# Bonus Section 2: Generalization, but no free lunch\n", + "# The Big Picture\n", + "\n", + "*Estimated timing of tutorial: 1 hour*\n", + "\n", + "In this tutorial, we explored the difference in agents' performance based on their architecture. We revealed that modular architecture, with separate modules for learning different aspects of behavior, is superior to a holistic architecture with a single module. 
The modular architecture with stronger inductive bias achieves good performance faster and has the capability to generalize to other tasks as well. Intriguingly, this modularity is a property we also observe in the brain, where it could likewise be important for generalization." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "execution": {} + }, + "source": [ + "---\n", + "# Addendum: Generalization, but no free lunch\n", "\n", "The No Free Lunch theorems proved that no inductive bias can excel across all tasks. It has been studied in the [paper](https://www.science.org/doi/10.1126/sciadv.adk1256) that agents with a modular architecture can acquire the underlying structure of the training task. In contrast, holistic agents tend to acquire different knowledge than modular agents during training, such as forming beliefs based on unreliable information sources or exhibiting less efficient control actions. The novel gain task has a structure similar to the training task, consequently, a modular agent that accurately learns the training task's structure can leverage its knowledge in these novel tasks.\n", "\n", @@ -3060,7 +3064,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.19" + "version": "3.9.22" } }, "nbformat": 4,