
Commit cbbea35

docs: switch links from PDF to abstract (#305)
1 parent 30b6397 commit cbbea35

11 files changed: +34 -37 lines

.github/workflows/ci_docs.yml (+2 -8)

@@ -34,13 +34,7 @@ jobs:
       - uses: actions/setup-python@v5
         with:
           python-version: 3.8
-
-      - name: Cache pip
-        uses: actions/cache@v4
-        with:
-          path: ~/.cache/pip
-          key: pip-${{ hashFiles('requirements.txt') }}-${{ hashFiles('_requirements/docs.txt') }}
-          restore-keys: pip-
+          cache: pip

       - name: Install Texlive & tree
         run: |
@@ -60,7 +54,7 @@ jobs:
           head=$(git rev-parse origin/"${{ github.base_ref }}")
           git diff --name-only $head --output=master-diff.txt
           python .actions/assistant.py group-folders master-diff.txt
-          printf "Changed folders:\n"
+          printf "Changed folders:\n----------------\n"
           cat changed-folders.txt

       - name: Count changed notebooks

_docs/source/conf.py (+4 -1)

@@ -240,4 +240,7 @@
 linkcheck_exclude_documents = []

 # ignore the following relative links (false positive errors during linkcheck)
-linkcheck_ignore = []
+linkcheck_ignore = [
+    # Implicit generation and generalization methods for energy-based models
+    "https://openai.com/index/energy-based-models/",
+]
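
For context, Sphinx treats every entry in linkcheck_ignore as a regular expression matched against the URI, so a literal URL like the one added above works, but a single pattern can also cover a whole family of links. A minimal sketch of that variant, where the broader pattern is a hypothetical example and not what this commit adds:

# hypothetical alternative for _docs/source/conf.py (not part of this commit)
linkcheck_ignore = [
    # skip every OpenAI index page instead of listing each URL individually
    r"https://openai\.com/index/.*",
]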

course_UvA-DL/03-initialization-and-optimization/Initialization_and_Optimization.py (+3 -3)

@@ -524,7 +524,7 @@ def xavier_init(model):
 #
 # Thus, we see that we have an additional factor of 1/2 in the equation, so that our desired weight variance becomes $2/d_x$.
 # This gives us the Kaiming initialization (see [He, K. et al.
-# (2015)](https://arxiv.org/pdf/1502.01852.pdf)).
+# (2015)](https://arxiv.org/abs/1502.01852)).
 # Note that the Kaiming initialization does not use the harmonic mean between input and output size.
 # In their paper (Section 2.2, Backward Propagation, last paragraph), they argue that using $d_x$ or $d_y$ both lead to stable gradients throughout the network, and only depend on the overall input and output size of the network.
 # Hence, we can use here only the input $d_x$:
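
For reference, the rule described in this hunk's context, weights drawn with variance 2/d_x where d_x is the fan-in, fits in a few lines of PyTorch. A minimal sketch under the assumption of plain linear and convolutional layers, separate from the tutorial's own initialization code:

import torch
from torch import nn

@torch.no_grad()
def kaiming_init_sketch(model: nn.Module) -> None:
    # Weights ~ N(0, 2 / d_x) with d_x the fan-in of the layer; biases set to zero.
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            fan_in = module.weight[0].numel()  # number of inputs feeding each output unit
            module.weight.normal_(0.0, (2.0 / fan_in) ** 0.5)
            if module.bias is not None:
                module.bias.zero_()

net = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
kaiming_init_sketch(net)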
@@ -1098,7 +1098,7 @@ def comb_func(w1, w2):
 # The short answer: no.
 # There are many papers saying that in certain situations, SGD (with momentum) generalizes better where Adam often tends to overfit [5,6].
 # This is related to the idea of finding wider optima.
-# For instance, see the illustration of different optima below (credit: [Keskar et al., 2017](https://arxiv.org/pdf/1609.04836.pdf)):
+# For instance, see the illustration of different optima below (credit: [Keskar et al., 2017](https://arxiv.org/abs/1609.04836)):
 #
 # <center width="100%"><img src="flat_vs_sharp_minima.svg" width="500px"></center>
 #
@@ -1128,7 +1128,7 @@ def comb_func(w1, w2):
 # "Understanding the difficulty of training deep feedforward neural networks."
 # Proceedings of the thirteenth international conference on artificial intelligence and statistics.
 # 2010.
-# [link](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf)
+# [link](https://proceedings.mlr.press/v9/glorot10a)
 #
 # [2] He, Kaiming, et al.
 # "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification."

course_UvA-DL/04-inception-resnet-densenet/Inception_ResNet_DenseNet.py (+3 -3)

@@ -244,7 +244,7 @@ def configure_optimizers(self):
         # We will support Adam or SGD as optimizers.
         if self.hparams.optimizer_name == "Adam":
             # AdamW is Adam with a correct implementation of weight decay (see here
-            # for details: https://arxiv.org/pdf/1711.05101.pdf)
+            # for details: https://arxiv.org/abs/1711.05101)
             optimizer = optim.AdamW(self.parameters(), **self.hparams.optimizer_hparams)
         elif self.hparams.optimizer_name == "SGD":
             optimizer = optim.SGD(self.parameters(), **self.hparams.optimizer_hparams)
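
As a side note on the comment in this hunk: AdamW decouples the weight-decay step from the gradient-based Adam update, whereas plain Adam folds the decay into the gradient as L2 regularization. A minimal usage sketch, with hyperparameter values that are assumptions rather than the tutorial's settings:

import torch
from torch import nn, optim

model = nn.Linear(16, 4)
# Plain Adam: weight decay is added to the gradient (L2 regularization).
adam = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# AdamW: the decay is applied directly to the weights after the adaptive update.
adamw = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)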
@@ -875,8 +875,8 @@ def forward(self, x):
 # One difference to the GoogleNet training is that we explicitly use SGD with Momentum as optimizer instead of Adam.
 # Adam often leads to a slightly worse accuracy on plain, shallow ResNets.
 # It is not 100% clear why Adam performs worse in this context, but one possible explanation is related to ResNet's loss surface.
-# ResNet has been shown to produce smoother loss surfaces than networks without skip connection (see [Li et al., 2018](https://arxiv.org/pdf/1712.09913.pdf) for details).
-# A possible visualization of the loss surface with/out skip connections is below (figure credit - [Li et al. ](https://arxiv.org/pdf/1712.09913.pdf)):
+# ResNet has been shown to produce smoother loss surfaces than networks without skip connection (see [Li et al., 2018](https://arxiv.org/abs/1712.09913) for details).
+# A possible visualization of the loss surface with/out skip connections is below (figure credit - [Li et al. ](https://arxiv.org/abs/1712.09913)):
 #
 # <center width="100%"><img src="resnet_loss_surface.png" style="display: block; margin-left: auto; margin-right: auto;" width="600px"/></center>
 #

course_UvA-DL/05-transformers-and-MH-attention/Transformers_MHAttention.py (+1 -1)

@@ -660,7 +660,7 @@ def forward(self, x):
 # In fact, training a deep Transformer without learning rate warm-up can make the model diverge
 # and achieve a much worse performance on training and testing.
 # Take for instance the following plot by [Liu et al.
-# (2019)](https://arxiv.org/pdf/1908.03265.pdf) comparing Adam-vanilla (i.e. Adam without warm-up)
+# (2019)](https://arxiv.org/abs/1908.03265) comparing Adam-vanilla (i.e. Adam without warm-up)
 # vs Adam with a warm-up:
 #
 # <center width="100%"><img src="warmup_loss_plot.svg" width="350px"></center>
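
For context, learning rate warm-up simply ramps the learning rate from near zero to its target value over the first updates before the regular schedule takes over. A minimal sketch of a linear warm-up, with an assumed warm-up length and independent of whichever scheduler the tutorial itself defines:

import torch
from torch import nn, optim

model = nn.Linear(16, 16)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

warmup_steps = 100  # assumed value for illustration
# Scale the base lr by step / warmup_steps until warm-up ends, then keep it constant.
scheduler = optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)
)

for step in range(200):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy objective
    loss.backward()
    optimizer.step()
    scheduler.step()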

course_UvA-DL/06-graph-neural-networks/GNN_overview.py (+1 -1)

@@ -750,7 +750,7 @@ def print_results(result_dict):
 # Tutorials and papers for this topic include:
 #
 # * [PyTorch Geometric example](https://github.com/rusty1s/pytorch_geometric/blob/master/examples/link_pred.py)
-# * [Graph Neural Networks: A Review of Methods and Applications](https://arxiv.org/pdf/1812.08434.pdf), Zhou et al.
+# * [Graph Neural Networks: A Review of Methods and Applications](https://arxiv.org/abs/1812.08434), Zhou et al.
 # 2019
 # * [Link Prediction Based on Graph Neural Networks](https://papers.nips.cc/paper/2018/file/53f0d7c537d99b3824f0f99d62ea2428-Paper.pdf), Zhang and Chen, 2018.

course_UvA-DL/09-normalizing-flows/NF_image_modeling.py (+1 -1)

@@ -1396,7 +1396,7 @@ def visualize_dequant_distribution(model: ImageFlow, imgs: Tensor, title: str =
 # and we have the guarantee that every possible input $x$ has a corresponding latent vector $z$.
 # However, even beyond continuous inputs and images, flows can be applied and allow us to exploit
 # the data structure in latent space, as e.g. on graphs for the task of molecule generation [6].
-# Recent advances in [Neural ODEs](https://arxiv.org/pdf/1806.07366.pdf) allow a flow with infinite number of layers,
+# Recent advances in [Neural ODEs](https://arxiv.org/abs/1806.07366) allow a flow with infinite number of layers,
 # called Continuous Normalizing Flows, whose potential is yet to fully explore.
 # Overall, normalizing flows are an exciting research area which will continue over the next couple of years.

course_UvA-DL/10-autoregressive-image-modeling/Autoregressive_Image_Modeling.py (+9 -9)

@@ -18,10 +18,10 @@
 # For instance, in autoregressive models, we cannot interpolate between two images because of the lack of a latent representation.
 # We will explore and discuss these benefits and drawbacks alongside with our implementation.
 #
-# Our implementation will focus on the [PixelCNN](https://arxiv.org/pdf/1606.05328.pdf) [2] model which has been discussed in detail in the lecture.
+# Our implementation will focus on the [PixelCNN](https://arxiv.org/abs/1606.05328) [2] model which has been discussed in detail in the lecture.
 # Most current SOTA models use PixelCNN as their fundamental architecture,
 # and various additions have been proposed to improve the performance
-# (e.g. [PixelCNN++](https://arxiv.org/pdf/1701.05517.pdf) and [PixelSNAIL](http://proceedings.mlr.press/v80/chen18h/chen18h.pdf)).
+# (e.g. [PixelCNN++](https://arxiv.org/abs/1701.05517) and [PixelSNAIL](http://proceedings.mlr.press/v80/chen18h/chen18h.pdf)).
 # Hence, implementing PixelCNN is a good starting point for our short tutorial.
 #
 # First of all, we need to import our standard libraries. Similarly as in
@@ -173,7 +173,7 @@ def show_imgs(imgs):
 # If we now want to apply this to our convolutions, we need to ensure that the prediction of pixel 1
 # is not influenced by its own "true" input, and all pixels on its right and in any lower row.
 # In convolutions, this means that we want to set those entries of the weight matrix to zero that take pixels on the right and below into account.
-# As an example for a 5x5 kernel, see a mask below (figure credit - [Aaron van den Oord](https://arxiv.org/pdf/1606.05328.pdf)):
+# As an example for a 5x5 kernel, see a mask below (figure credit - [Aaron van den Oord](https://arxiv.org/abs/1606.05328)):
 #
 # <center width="100%" style="padding: 10px"><img src="masked_convolution.svg" width="150px"></center>
 #
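
To make the masking described in this hunk concrete: keep the kernel entries above the center row and to the left of (and including) the center pixel, zero out the rest, and multiply that mask into the convolution weights. A minimal sketch of building such a 5x5 mask, separate from the tutorial's own masked-convolution code:

import torch

kernel_size = 5
mask = torch.ones(kernel_size, kernel_size)
# Zero every row strictly below the center row ...
mask[kernel_size // 2 + 1:, :] = 0
# ... and everything to the right of the center within the center row.
mask[kernel_size // 2, kernel_size // 2 + 1:] = 0
# For the first layer ("mask type A"), the center pixel itself would be masked too:
# mask[kernel_size // 2, kernel_size // 2] = 0
# In the forward pass, the convolution weights are multiplied element-wise by this mask.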
@@ -217,10 +217,10 @@ def forward(self, x):
 #
 # To build our own autoregressive image model, we could simply stack a few masked convolutions on top of each other.
 # This was actually the case for the original PixelCNN model, discussed in the paper
-# [Pixel Recurrent Neural Networks](https://arxiv.org/pdf/1601.06759.pdf), but this leads to a considerable issue.
+# [Pixel Recurrent Neural Networks](https://arxiv.org/abs/1601.06759), but this leads to a considerable issue.
 # When sequentially applying a couple of masked convolutions, the receptive field of a pixel
 # show to have a "blind spot" on the right upper side, as shown in the figure below
-# (figure credit - [Aaron van den Oord et al. ](https://arxiv.org/pdf/1606.05328.pdf)):
+# (figure credit - [Aaron van den Oord et al. ](https://arxiv.org/abs/1606.05328)):
 #
 # <center width="100%" style="padding: 10px"><img src="pixelcnn_blind_spot.svg" width="275px"></center>
 #
@@ -447,7 +447,7 @@ def show_center_recep_field(img, out):
 # For visualizing the receptive field, we assumed a very simplified stack of vertical and horizontal convolutions.
 # Obviously, there are more sophisticated ways of doing it, and PixelCNN uses gated convolutions for this.
 # Specifically, the Gated Convolution block in PixelCNN looks as follows
-# (figure credit - [Aaron van den Oord et al. ](https://arxiv.org/pdf/1606.05328.pdf)):
+# (figure credit - [Aaron van den Oord et al. ](https://arxiv.org/abs/1606.05328)):
 #
 # <center width="100%"><img src="PixelCNN_GatedConv.svg" width="700px" style="padding: 15px"/></center>
 #
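
At its core, the gated convolution referenced in this hunk splits one convolution's output into a value half and a gate half and combines them as tanh(value) * sigmoid(gate). A minimal sketch, with assumed layer sizes and without the vertical/horizontal stacks or skip connections of the full PixelCNN block:

import torch
from torch import nn

class GatedConvSketch(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # One convolution producing 2*channels features: half for the value, half for the gate.
        self.conv = nn.Conv2d(channels, 2 * channels, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.conv(x).chunk(2, dim=1)
        return torch.tanh(value) * torch.sigmoid(gate)

out = GatedConvSketch(16)(torch.randn(1, 16, 28, 28))  # shape: [1, 16, 28, 28]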
@@ -508,7 +508,7 @@ def forward(self, v_stack, h_stack):
 # The architecture consists of multiple stacked GatedMaskedConv blocks, where we add an additional dilation factor to a few convolutions.
 # This is used to increase the receptive field of the model and allows to take a larger context into account during generation.
 # As a reminder, dilation on a convolution works looks as follows
-# (figure credit - [Vincent Dumoulin and Francesco Visin](https://arxiv.org/pdf/1603.07285.pdf)):
+# (figure credit - [Vincent Dumoulin and Francesco Visin](https://arxiv.org/abs/1603.07285)):
 #
 # <center width="100%"><img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/dilation.gif" width="250px"></center>
 #
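
In code, the dilation factor is just an argument of the convolution: with dilation 2, a 3x3 kernel skips every other pixel and covers a 5x5 area without extra parameters. A minimal sketch with assumed channel counts:

import torch
from torch import nn

x = torch.randn(1, 16, 28, 28)
# Standard 3x3 convolution: 3x3 receptive field.
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1)
# Dilated 3x3 convolution: 5x5 receptive field, same number of weights.
dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)
print(conv(x).shape, dilated(x).shape)  # both preserve the 28x28 spatial size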
@@ -659,7 +659,7 @@ def test_step(self, batch, batch_idx):
 # %% [markdown]
 # The visualization shows that for predicting any pixel, we can take almost half of the image into account.
 # However, keep in mind that this is the "theoretical" receptive field and not necessarily
-# the [effective receptive field](https://arxiv.org/pdf/1701.04128.pdf), which is usually much smaller.
+# the [effective receptive field](https://arxiv.org/abs/1701.04128), which is usually much smaller.
 # For a stronger model, we should therefore try to increase the receptive
 # field even further. Especially, for the pixel on the bottom right, the
 # very last pixel, we would be allowed to take into account the whole
@@ -873,7 +873,7 @@ def autocomplete_image(img):
 # Interestingly, the pixel values 64, 128 and 191 also stand out which is likely due to the quantization used during the creation of the dataset.
 # For RGB images, we would also see two peaks around 0 and 255,
 # but the values in between would be much more frequent than in MNIST
-# (see Figure 1 in the [PixelCNN++](https://arxiv.org/pdf/1701.05517.pdf) for a visualization on CIFAR10).
+# (see Figure 1 in the [PixelCNN++](https://arxiv.org/abs/1701.05517) for a visualization on CIFAR10).
 #
 # Next, we can visualize the distribution our model predicts (in average):

course_UvA-DL/11-vision-transformer/Vision_Transformer.py (+1 -1)

@@ -515,7 +515,7 @@ def train_model(**kwargs):
 # Dosovitskiy, Alexey, et al.
 # "An image is worth 16x16 words: Transformers for image recognition at scale."
 # International Conference on Representation Learning (2021).
-# [link](https://arxiv.org/pdf/2010.11929.pdf)
+# [link](https://arxiv.org/abs/2010.11929)
 #
 # Chen, Xiangning, et al.
 # "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations."

course_UvA-DL/12-meta-learning/Meta_Learning.py (+3 -3)

@@ -1,6 +1,6 @@
 # %% [markdown]
 # <div class="center-wrapper"><div class="video-wrapper"><iframe src="https://www.youtube.com/embed/035rkmT8FfE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe></div></div>
-# Meta-Learning offers solutions to these situations, and we will discuss three popular algorithms: __Prototypical Networks__ ([Snell et al., 2017](https://arxiv.org/pdf/1703.05175.pdf)), __Model-Agnostic Meta-Learning / MAML__ ([Finn et al., 2017](http://proceedings.mlr.press/v70/finn17a.html)), and __Proto-MAML__ ([Triantafillou et al., 2020](https://openreview.net/pdf?id=rkgAGAVKPr)).
+# Meta-Learning offers solutions to these situations, and we will discuss three popular algorithms: __Prototypical Networks__ ([Snell et al., 2017](https://arxiv.org/abs/1703.05175)), __Model-Agnostic Meta-Learning / MAML__ ([Finn et al., 2017](http://proceedings.mlr.press/v70/finn17a.html)), and __Proto-MAML__ ([Triantafillou et al., 2020](https://openreview.net/pdf?id=rkgAGAVKPr)).
 # We will focus on the task of few-shot classification where the training and test set have distinct sets of classes.
 # For instance, we would train the model on the binary classifications of cats-birds and flowers-bikes, but during test time, the model would need to learn from 4 examples each the difference between dogs and otters, two classes we have not seen during training (Figure credit - [Lilian Weng](https://lilianweng.github.io/lil-log/2018/11/30/meta-learning.html)).
 #
@@ -418,7 +418,7 @@ def split_batch(imgs, targets):
 # $$\mathbf{v}_c=\frac{1}{|S_c|}\sum_{(\mathbf{x}_i,y_i)\in S_c}f_{\theta}(\mathbf{x}_i)$$
 #
 # where $S_c$ is the part of the support set $S$ for which $y_i=c$, and $\mathbf{v}_c$ represents the _prototype_ of class $c$.
-# The prototype calculation is visualized below for a 2-dimensional feature space and 3 classes (Figure credit - [Snell et al.](https://arxiv.org/pdf/1703.05175.pdf)).
+# The prototype calculation is visualized below for a 2-dimensional feature space and 3 classes (Figure credit - [Snell et al.](https://arxiv.org/abs/1703.05175)).
 # The colored dots represent encoded support elements with color-corresponding class label, and the black dots next to the class label are the averaged prototypes.
 #
 # <center width="100%"><img src="protonet_classification.svg" width="300px"></center>
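
The prototype formula in this hunk's context is a per-class mean of the encoded support examples. A minimal sketch of that computation, with assumed tensor shapes and a pre-computed feature tensor standing in for f_theta, separate from the tutorial's own ProtoNet code:

import torch

def compute_prototypes(features, targets):
    # features: [num_support, feat_dim] encoded support examples f_theta(x_i)
    # targets:  [num_support] integer class labels y_i
    classes = targets.unique()
    # v_c is the mean of all support features whose label equals c
    prototypes = torch.stack([features[targets == c].mean(dim=0) for c in classes])
    return prototypes, classes

# Example with assumed sizes: 8 support examples, 32-dim features, 2 classes.
protos, classes = compute_prototypes(torch.randn(8, 32), torch.tensor([0, 0, 1, 1, 0, 1, 0, 1]))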
@@ -1329,7 +1329,7 @@ def test_protomaml(model, dataset, k_shot=4):
 # [1] Snell, Jake, Kevin Swersky, and Richard S. Zemel.
 # "Prototypical networks for few-shot learning."
 # NeurIPS 2017.
-# ([link](https://arxiv.org/pdf/1703.05175.pdf))
+# ([link](https://arxiv.org/abs/1703.05175))
 #
 # [2] Chelsea Finn, Pieter Abbeel, Sergey Levine.
 # "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks."

lightning_examples/finetuning-scheduler/finetuning-scheduler.py (+6 -6)

@@ -611,18 +611,18 @@ def train() -> None:
 # %% [markdown]
 # ## Footnotes
 #
-# - [Howard, J., & Ruder, S. (2018)](https://arxiv.org/pdf/1801.06146.pdf). Fine-tuned Language
+# - [Howard, J., & Ruder, S. (2018)](https://arxiv.org/abs/1801.06146). Fine-tuned Language
 # Models for Text Classification. ArXiv, abs/1801.06146. [↩](#Scheduled-Fine-Tuning-with-the-Fine-Tuning-Scheduler-Extension)
-# - [Chronopoulou, A., Baziotis, C., & Potamianos, A. (2019)](https://arxiv.org/pdf/1902.10547.pdf).
+# - [Chronopoulou, A., Baziotis, C., & Potamianos, A. (2019)](https://arxiv.org/abs/1902.10547).
 # An embarrassingly simple approach for transfer learning from pretrained language models. arXiv
 # preprint arXiv:1902.10547. [↩](#Scheduled-Fine-Tuning-with-the-Fine-Tuning-Scheduler-Extension)
-# - [Peters, M. E., Ruder, S., & Smith, N. A. (2019)](https://arxiv.org/pdf/1903.05987.pdf). To tune or not to
+# - [Peters, M. E., Ruder, S., & Smith, N. A. (2019)](https://arxiv.org/abs/1903.05987). To tune or not to
 # tune? adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987. [↩](#Scheduled-Fine-Tuning-with-the-Fine-Tuning-Scheduler-Extension)
-# - [Sivaprasad, P. T., Mai, F., Vogels, T., Jaggi, M., & Fleuret, F. (2020)](https://arxiv.org/pdf/1910.11758.pdf).
+# - [Sivaprasad, P. T., Mai, F., Vogels, T., Jaggi, M., & Fleuret, F. (2020)](https://arxiv.org/abs/1910.11758).
 # Optimizer benchmarking needs to account for hyperparameter tuning. In International Conference on Machine Learning
 # (pp. 9036-9045). PMLR. [↩](#Optimizer-Configuration)
-# - [Mosbach, M., Andriushchenko, M., & Klakow, D. (2020)](https://arxiv.org/pdf/2006.04884.pdf). On the stability of
+# - [Mosbach, M., Andriushchenko, M., & Klakow, D. (2020)](https://arxiv.org/abs/2006.04884). On the stability of
 # fine-tuning bert: Misconceptions, explanations, and strong baselines. arXiv preprint arXiv:2006.04884. [↩](#Optimizer-Configuration)
-# - [Loshchilov, I., & Hutter, F. (2016)](https://arxiv.org/pdf/1608.03983.pdf). Sgdr: Stochastic gradient descent with
+# - [Loshchilov, I., & Hutter, F. (2016)](https://arxiv.org/abs/1608.03983). Sgdr: Stochastic gradient descent with
 # warm restarts. arXiv preprint arXiv:1608.03983. [↩](#LR-Scheduler-Configuration)
 #
