63 changes: 42 additions & 21 deletions tutorials/W2D3_Microlearning/W2D3_Tutorial1.ipynb
@@ -21,11 +21,11 @@
"\n",
"**By Neuromatch Academy** \n",
"\n",
"__Content creators:__ Blake Richards, Roman Pogodin, Daniel Levenstein, Colin Bredenberg, Jonathan Cornford\n",
"__Content creators:__ Blake Richards, Roman Pogodin, Daniel Levenstein, Colin Bredenberg, Jonathan Cornford, Alex Murphy\n",
"\n",
"__Content reviewers:__ Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Samuele Bolotta, Patrick Mineault, Hlib Solodzhuk\n",
"__Content reviewers:__ Aakash Agrawal, Alish Dipani, Hossein Rezaei, Yousef Ghanbari, Mostafa Abdollahi, Samuele Bolotta, Patrick Mineault, Hlib Solodzhuk, Alex Murphy\n",
"\n",
"__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk"
"__Production editors:__ Konstantine Tsafatinos, Ella Batty, Spiros Chavlis, Samuele Bolotta, Hlib Solodzhuk, Alex Murphy"
]
},
{
@@ -631,7 +631,7 @@
"---\n",
"# Section 1: Weight Perturbation\n",
"\n",
"In this section, we will start exploring the learning algorithms which exhibit increased variance."
"In this section, we will start exploring more bioligcally plausible learning algorithms that are known to exhibit increased variance, specifically the *weight perturbation* algorithm."
]
},
{
@@ -730,16 +730,19 @@
"\\newcommand{\\sqbrackets}[1]{\\left[#1\\right]}\n",
"\\newcommand{\\var}[1]{\\mathbb{V}\\mathrm{ar}\\brackets{#1}}$\n",
"\n",
"In this first section, we will be deriving and implementing the __Weight Perturbation__ algorithm. In the next section, we will be deriving and implementing the __Node Perturbation__ algorithm. Both of these methods of gradient estimation are very closely related to *finite differences* derivative approximation.\n",
"In this first section, we will be deriving and implementing the __Weight Perturbation__ algorithm. In the next section, we will be deriving and implementing the __Node Perturbation__ algorithm. Both of these methods of gradient estimation are very closely related to *finite differences* derivative approximation (the idea that a derivative can be understood as an approximation of the slope of the tangent line between two very close points).\n",
"\n",
"Suppose that we have some loss function, $\\loss(\\weight)$, which we would like to minimize by making some change in our synaptic weights, $\\weight$. The most natural way to decrease the loss would be to perform gradient descent; however, it is not reasonable to assume that a synapse in the brain could perform analytic gradient calculations for general loss functions $\\loss(\\weight)$, which may depend on the activity of many downstream neurons and the external environment. We know neurons in the brain can connect to many distant neurons, but not to the extent to mirror large-scale awareness of error signals that gradient descent would imply. Biological systems could solve this problem by *approximating* the gradient of the loss, which could be accomplished in many ways. \n",
"\n",
"To start, we will provide the __weight perturbation__ update rule, and will subsequently demonstrate why it provides an estimate of the gradient. We will first add noise to our weight matrix, using $\\weight' = \\weight + \\noisew$, where $\\noisew \\sim \\mathcal N(0, \\sigma^2)$ is a diagonal matrix with a fixed sigma on each diagonal element. We take as our update:\n",
"\n",
" Suppose that we have some loss function, $\\loss(\\Delta \\weight)$, which we would like to minimize by making some change in our synaptic weights, $\\Delta \\weight$. The most natural way to decrease the loss would be to perform gradient descent; however, it is not reasonable to assume that a synapse in the brain could perform analytic gradient calculations for general loss functions $\\loss(\\Delta \\weight)$, which may depend on the activity of many downstream neurons and the external environment. Biological systems could solve this problem by *approximating* the gradient of the loss, which could be accomplished in many ways. \n",
" \n",
" To start, we will provide the __weight perturbation__ update, and will subsequently demonstrate why it provides an estimate of the gradient. We will first add noise to our weights, using $\\weight' = \\weight + \\noisew$, where $\\noisew \\sim \\mathcal N(0, \\sigma^2)$. We take as our update:\n",
"\n",
"\\begin{equation}\n",
" \\Delta \\weight = - \\eta \\mathbb{E}_{\\noisew} \\left [\\left (\\loss(\\noisew) - \\loss(0)\\right ) \\frac{(\\weight' - \\weight)}{\\sigma^2} \\right ].\n",
"\\end{equation}\n",
"First, we will clarify why this parameter update is interesting from a neuroscientific perspective. If we look at the parameter update for a *single synapse*, $\\weight_{ij}$, we have:\n",
"\n",
"In this example, the notation $\\loss(\\noisew)$ stands for the loss value for the perturbed weights and $\\loss(0)$ represents the loss value for the original weights. This notation can also be understood also relating to $\\loss(W')$ and $\\loss(W)$, respectively, without any loss of generality. We will clarify why this parameter update is interesting from a neuroscientific perspective. If we look at the parameter update for a *single synapse*, $\\weight_{ij}$, we have:\n",
"\n",
"\\begin{align}\n",
" \\Delta \\weight_{ij} &= - \\eta \\mathbb{E}_{\\noisew} \\left [\\left (\\loss(\\noisew) - \\loss(0)\\right ) \\frac{(\\weight'_{ij} - \\weight_{ij})}{\\sigma^2} \\right ] \\\\\n",
" & \\approx - \\eta \\frac{1}{K}\\sum_{k = 0}^K\\left [\\left (\\loss(\\noisew^{(k)}) - \\loss(0)\\right ) \\frac{(\\weight'^{(k)}_{ij} - \\weight_{ij})}{\\sigma^2} \\right ],\n",
@@ -776,7 +779,7 @@
"source": [
"## Exercise 1: Perturb the weights\n",
"\n",
"In this section, fill out the function 'perturb' for the WeightPerturbMLP class. This function is used to update the parameters of our MLP network using the weight perturbation algorithm, using the parameter update equations from the preceding section."
"In this section, your task is to complete the `perturb` function for the `WeightPerturbMLP` class. This function is used to update the parameters of our MLP network using the weight perturbation algorithm, using the parameter update equations from the preceding section. You might benefit from expanding the hidden cells above and look at the definition of the `MLP` class to understand the structure of the model a bit clearer. Note that `W_h` relates to the weights mapping the inputs to the hidden layer neurons and `W_y` relates to the mapping from the hidden layer neurons to the outputs. It is also useful to look at the definition of `mse_loss` in the cells to better understand the loss calculated for the case of weight perturbations being applied."
]
},
{
@@ -981,13 +984,14 @@
"---\n",
"# Section 2: Node Perturbation\n",
"\n",
"Estimated timing to here from start of tutorial: 30 minutes\n",
"*Estimated timing to here from start of tutorial: 30 minutes*\n",
"\n",
"While we can get an unbiased derivative approximation based solely on perturbations of the weights, we will show later on that this is actually a very inefficient method, because it requires averaging out $MN$ noise sources, where $M$ is the dimension of the input $\\stim$ and $N$ is the dimension of the hidden activity $\\rate$. \n",
"\n",
"![Network.](https://github.com/neuromatch/NeuroAI_Course/blob/main/tutorials/W2D3_Microlearning/static/network.png?raw=true)\n",
"\n",
"For simplicity, consider what happens in a network with a linear hidden layer. If we add noise at the level of the hidden units $\\rate = \\weight \\stim$, we will only have to average over $N$ noise sources. To do this, we can use the following update, taking $\\rate' = \\rate + \\noiser$, where $\\noiser \\sim \\mathcal{N}(0,\\sigma^2)$:\n",
"For simplicity, consider what happens in a network with a linear hidden layer (the activation function is just an identity function, simply allowing the inputs to pass through unchanged). If we add noise at the level of the hidden units $\\rate = \\weight \\stim$, we will only have to average over $N$ noise sources. To do this, we can use the following update, taking $\\rate' = \\rate + \\noiser$, where $\\noiser \\sim \\mathcal{N}(0,\\sigma^2)$:\n",
"\n",
"\n",
"\\begin{equation}\n",
" \\Delta \\weight = - \\eta \\mathbb{E}_{\\noiser} \\left [\\left(\\loss(\\noiser) - \\loss(0) \\right ) \\frac{(\\rate' - \\rate)}{\\sigma^2} \\stim\\T \\right ].\n",
@@ -1392,7 +1396,7 @@
" colors = ['b', 'c', 'r']\n",
" labels = ['Weight Perturbation', 'Node Perturbation', 'Backprop']\n",
" plt.bar(x, snr_vals, color=colors, tick_label=labels)\n",
" plt.xticks(rotation=90)\n",
" plt.xticks(rotation=0)\n",
" plt.ylabel('SNR')\n",
" plt.xlabel('Algorithm')\n",
" plt.title('Gradient SNR')\n",
Expand All @@ -1405,7 +1409,7 @@
"execution": {}
},
"source": [
"As should be evident, the signal-to-noise ratio for both weight and node perturbation are much worse than for backpropagation. This is also reflected in the poor performance of both algorithms relative to backpropagation. This shows that locality of parameter updates often comes at the price of poor performance."
"As should be evident, the signal-to-noise ratio for both weight and node perturbation are much worse than for backpropagation. This is also reflected in the poor performance of both algorithms relative to backpropagation. This shows that locality of parameter updates often comes at the price of poor performance. However, we refer back to the motivations of studying biological plausibility and the importance of having a diverse set of tools in our toolkit to be able to apply these more easily in other scenarios in the future."
]
},
{
Expand All @@ -1430,9 +1434,9 @@
"---\n",
"# Section 4: Feedback Alignment \n",
"\n",
"Estimated timing to here from start of tutorial: 1 hour\n",
"*Estimated timing to here from start of tutorial: 1 hour*\n",
"\n",
"This section will introduce another family of learning algorithms that exhibit no variance but become biased."
"This section will introduce another family of learning algorithms (Feedback Alignment) that exhibit no variance but sit on the other end of the bias-variance tradeoff in that they exhibit a high bias compared to the other learning algorithms we have covered."
]
},
{
@@ -1572,7 +1576,7 @@
"\n",
"*From Lillicrap et al. (2016), CC-BY*\n",
"\n",
"Feedback alignment replaces $\\weight_{out}^T $ with a random matrix, $\\backweight$. This resolves the 'weight transport' problem, because the feedback weights are no longer the same as the feedforward weights. However, by replacing $\\weight_{out}^T$ with $\\backweight$, we are no longer calculating an accurate gradient! Interestingly, we will see empirically in subsequent sections that this replacement still produces reasonably good gradient estimates, though it still introduces *bias*."
"Feedback alignment replaces $\\weight_{out}^T $ with a random matrix, $\\backweight$. This resolves the 'weight transport' problem, because the feedback weights are no longer the same as the feedforward weights. However, by replacing $\\weight_{out}^T$ with $\\backweight$, we are no longer calculating an accurate gradient! Interestingly, we will see empirically in subsequent sections that this replacement still produces reasonably good gradient estimates, though it still introduces *bias*, because the backward weights are not the same as the forward weights (as explained in the video above)."
]
},
{
@@ -1790,9 +1794,9 @@
"---\n",
"# Section 5: Kolen-Pollack\n",
"\n",
"Estimated timing to here from start of tutorial: 1 hour 20 minutes\n",
"*Estimated timing to here from start of tutorial: 1 hour 20 minutes*\n",
"\n",
"This section presents the last method for this day, which lies in the cohort of biased ones, Kolen-Pollack method."
"This section presents the last method for this day, which is a method that leans towards exhibiting more biased solutions than ones that exhibit higher-variance solutions. Specifically, the metho we are going to look at today is known as the Kolen-Pollack method. While in the previous section we looked at Feedback Alignment, in that case, we hinted at the fact that this works well for simple tasks. However, feedback alignment, as will be shown below, does not do very well in tasks of the level of complexity we are typically interested in. The Kolen-Pollack method attempts to fix some of the problems of Feedback Alignment in order to be better at more complex and interesting tasks."
]
},
{
@@ -2093,7 +2097,7 @@
"---\n",
"# Section 6: Assessing the bias of learning algorithms\n",
"\n",
"Estimated timing to here from start of tutorial: 1 hour 50 minutes\n"
"*Estimated timing to here from start of tutorial: 1 hour 50 minutes*\n"
]
},
{
@@ -2198,6 +2202,23 @@
"* Therefore, good generalization (which depends on large, powerful network architectures), will come from learning algorithms that have as little variance and bias in their gradient estimates as possible.\n",
"* The neuroscience community does not yet know which, if any, of the learning algorithms discussed in this tutorial map onto learning in the brain. The algorithms we have introduced are best thought of as 'candidate models' for how the brain could be learning."
]
},
{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"# The Big Picture\n",
"\n",
"While the summary above recaps the main takeaway points of today's tutorial, let's also stop and think a bit bigger. Let's think back to the opening section of today's tutorial about biological plausibility and what it means for both the future of neuroscience and AI. As we have seen today, learning in brains is restricted by the directional nature of information transfer and biologically plausible learning algorithms are those algorithms that better mirror these properties. This is where we run into a dilemma: on the one hand in standard AI training, backpropagation is so successful because we get exactly the correct set of error signals to make updates to our weights. This has worked well in AI, but this method in the current set up might lead to a wall that we cannot break and extend into significant further advances in AI. This is also an issue working in neuroscience, where AI models are often used as *in silico* representations to model biological processes or as candidate representational spaces to model different stimuli. \n",
"\n",
"The main idea we want you to take away from today is to be aware of alternate approaches that better mirror computational constraints from a system (the brain) that we know in many ways is better than frontier / state of the art AI models. The exact techniques are only candidates, but there is a wide belief that the NeuroAI community might be in an excellent position to study and propose learning algortihms that are not only biologically plausible, but also show promise as future widely-adopted learning algorithms in large-scale deep learning networks.\n",
"\n",
"We hope this idea sticks around in your mind and that you have found today's tutorial insightful.\n",
"\n",
"Tomorrow we'll be looking at **macro**learning, with interesting applications from Reinforcement Learning and continuing on our journey to explore the broad concepts underlying how systems learn and how they generalize!"
]
}
],
"metadata": {
@@ -2228,7 +2249,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.19"
"version": "3.9.22"
}
},
"nbformat": 4,