18 | 18 | # For instance, in autoregressive models, we cannot interpolate between two images because of the lack of a latent representation.
19 | 19 | # We will explore and discuss these benefits and drawbacks alongside our implementation.
20 | 20 | #
21 | | -# Our implementation will focus on the [PixelCNN](https://arxiv.org/pdf/1606.05328.pdf) [2] model which has been discussed in detail in the lecture.
| 21 | +# Our implementation will focus on the [PixelCNN](https://arxiv.org/abs/1606.05328) [2] model, which has been discussed in detail in the lecture.
22 | 22 | # Most current SOTA models use PixelCNN as their fundamental architecture,
23 | 23 | # and various additions have been proposed to improve the performance
24 | | -# (e.g. [PixelCNN++](https://arxiv.org/pdf/1701.05517.pdf) and [PixelSNAIL](http://proceedings.mlr.press/v80/chen18h/chen18h.pdf)).
| 24 | +# (e.g. [PixelCNN++](https://arxiv.org/abs/1701.05517) and [PixelSNAIL](http://proceedings.mlr.press/v80/chen18h/chen18h.pdf)).
25 | 25 | # Hence, implementing PixelCNN is a good starting point for our short tutorial.
26 | 26 | #
27 | 27 | # First of all, we need to import our standard libraries. As in
@@ -173,7 +173,7 @@ def show_imgs(imgs):
173 | 173 | # If we now want to apply this to our convolutions, we need to ensure that the prediction of pixel 1
174 | 174 | # is not influenced by its own "true" input, nor by any pixels to its right or in any lower row.
175 | 175 | # In convolutions, this means that we want to set those entries of the weight matrix to zero that take pixels to the right and below into account.
176 | | -# As an example for a 5x5 kernel, see a mask below (figure credit - [Aaron van den Oord](https://arxiv.org/pdf/1606.05328.pdf)):
| 176 | +# As an example for a 5x5 kernel, see the mask below (figure credit - [Aaron van den Oord](https://arxiv.org/abs/1606.05328)):
177 | 177 | #
178 | 178 | # <center width="100%" style="padding: 10px"><img src="masked_convolution.svg" width="150px"></center>
179 | 179 | #
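As a concrete illustration, here is a minimal sketch of such a masked convolution (a hypothetical helper, not the tutorial's exact implementation; the `MaskedConv2d` name is an assumption, while the "A"/"B" mask-type convention follows the PixelCNN paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Convolution whose weights are multiplied by a fixed binary mask.

    Mask type "A" also zeroes the center weight (first layer: a pixel must
    not see its own input); type "B" keeps the center (all later layers).
    """

    def __init__(self, *args, mask_type="B", **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones(kH, kW)
        # Zero everything right of the center (including the center for "A")...
        mask[kH // 2, kW // 2 + (1 if mask_type == "B" else 0):] = 0
        # ...and every row below the center.
        mask[kH // 2 + 1:, :] = 0
        self.register_buffer("mask", mask[None, None])  # broadcast over channels

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```

Multiplying the weights by the mask in `forward` (rather than overwriting them once) keeps the masked entries at zero throughout training, since their gradients are zeroed as well.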
@@ -217,10 +217,10 @@ def forward(self, x):
217 | 217 | #
218 | 218 | # To build our own autoregressive image model, we could simply stack a few masked convolutions on top of each other.
219 | 219 | # This was actually the case for the original PixelCNN model, discussed in the paper
220 | | -# [Pixel Recurrent Neural Networks](https://arxiv.org/pdf/1601.06759.pdf), but this leads to a considerable issue.
| 220 | +# [Pixel Recurrent Neural Networks](https://arxiv.org/abs/1601.06759), but this leads to a considerable issue.
221 | 221 | # When sequentially applying a couple of masked convolutions, the receptive field of a pixel
222 | 222 | # turns out to have a "blind spot" on the upper right side, as shown in the figure below
223 | | -# (figure credit - [Aaron van den Oord et al. ](https://arxiv.org/pdf/1606.05328.pdf)):
| 223 | +# (figure credit - [Aaron van den Oord et al.](https://arxiv.org/abs/1606.05328)):
224 | 224 | #
225 | 225 | # <center width="100%" style="padding: 10px"><img src="pixelcnn_blind_spot.svg" width="275px"></center>
226 | 226 | #
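One can make the blind spot visible directly by checking which input pixels receive gradient from a single output pixel, similar in spirit to the `show_center_recep_field` helper used later in this notebook. A small sketch, assuming the hypothetical `MaskedConv2d` from above:

```python
import torch

# One "A" layer followed by a few "B" layers, with constant positive weights
# so that gradient contributions cannot cancel out by chance.
layers = [MaskedConv2d(1, 1, kernel_size=3, padding=1, mask_type="A")]
layers += [MaskedConv2d(1, 1, kernel_size=3, padding=1, mask_type="B") for _ in range(4)]
for layer in layers:
    torch.nn.init.constant_(layer.weight, 0.1)
    torch.nn.init.zeros_(layer.bias)

img = torch.zeros(1, 1, 11, 11, requires_grad=True)
out = img
for layer in layers:
    out = layer(out)
out[0, 0, 5, 5].backward()  # gradient of the center output pixel w.r.t. the input

# Non-zero gradients mark the receptive field; the cone in the upper-right
# stays zero no matter how many layers we stack -- the blind spot.
print((img.grad[0, 0].abs() > 0).int())
```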
@@ -447,7 +447,7 @@ def show_center_recep_field(img, out):
447 | 447 | # For visualizing the receptive field, we assumed a very simplified stack of vertical and horizontal convolutions.
448 | 448 | # Obviously, there are more sophisticated ways of doing it, and PixelCNN uses gated convolutions for this.
449 | 449 | # Specifically, the Gated Convolution block in PixelCNN looks as follows
450 | | -# (figure credit - [Aaron van den Oord et al. ](https://arxiv.org/pdf/1606.05328.pdf)):
| 450 | +# (figure credit - [Aaron van den Oord et al.](https://arxiv.org/abs/1606.05328)):
451 | 451 | #
452 | 452 | # <center width="100%"><img src="PixelCNN_GatedConv.svg" width="700px" style="padding: 15px"/></center>
453 | 453 | #
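The heart of this block is the gated activation unit, `out = tanh(W_f * x) * sigmoid(W_g * x)`. A minimal sketch of just this nonlinearity (the function name and channel counts are illustrative; the full block additionally feeds the vertical stack into the horizontal one and adds a residual connection, as the figure shows):

```python
import torch
import torch.nn as nn

def gated_activation(x):
    """Split the channels in half: a tanh "value" modulated by a sigmoid gate."""
    val, gate = x.chunk(2, dim=1)
    return torch.tanh(val) * torch.sigmoid(gate)

# In practice, a single convolution with doubled output channels produces
# both halves at once instead of two separate convolutions.
conv = nn.Conv2d(16, 2 * 16, kernel_size=3, padding=1)
features = gated_activation(conv(torch.randn(1, 16, 28, 28)))  # -> (1, 16, 28, 28)
```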
@@ -508,7 +508,7 @@ def forward(self, v_stack, h_stack):
508 | 508 | # The architecture consists of multiple stacked GatedMaskedConv blocks, where we add an additional dilation factor to a few convolutions.
509 | 509 | # This is used to increase the receptive field of the model and allows it to take a larger context into account during generation.
510 | 510 | # As a reminder, dilation in a convolution looks as follows
511 | | -# (figure credit - [Vincent Dumoulin and Francesco Visin](https://arxiv.org/pdf/1603.07285.pdf)):
| 511 | +# (figure credit - [Vincent Dumoulin and Francesco Visin](https://arxiv.org/abs/1603.07285)):
512 | 512 | #
513 | 513 | # <center width="100%"><img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/dilation.gif" width="250px"></center>
514 | 514 | #
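In PyTorch, dilation is simply an argument of the standard convolution. A quick illustrative comparison (channel counts are arbitrary, not the tutorial's configuration):

```python
import torch.nn as nn

# A 3x3 kernel with dilation=2 samples a 5x5 window with the same nine weights,
# so the receptive field grows without extra parameters. "Same" padding must
# grow with the dilation: padding = dilation * (kernel_size - 1) // 2.
conv_dense   = nn.Conv2d(16, 16, kernel_size=3, padding=1, dilation=1)
conv_dilated = nn.Conv2d(16, 16, kernel_size=3, padding=2, dilation=2)
```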
@@ -659,7 +659,7 @@ def test_step(self, batch, batch_idx):
659 | 659 | # %% [markdown]
660 | 660 | # The visualization shows that for predicting any pixel, we can take almost half of the image into account.
661 | 661 | # However, keep in mind that this is the "theoretical" receptive field and not necessarily
662 | | -# the [effective receptive field](https://arxiv.org/pdf/1701.04128.pdf), which is usually much smaller.
| 662 | +# the [effective receptive field](https://arxiv.org/abs/1701.04128), which is usually much smaller.
663 | 663 | # For a stronger model, we should therefore try to increase the receptive
664 | 664 | # field even further. Especially for the pixel on the bottom right, the
665 | 665 | # very last pixel, we would be allowed to take into account the whole
@@ -873,7 +873,7 @@ def autocomplete_image(img):
873 | 873 | # Interestingly, the pixel values 64, 128 and 191 also stand out, which is likely due to the quantization used during the creation of the dataset.
874 | 874 | # For RGB images, we would also see two peaks around 0 and 255,
875 | 875 | # but the values in between would be much more frequent than in MNIST
876 | | -# (see Figure 1 in the [PixelCNN++](https://arxiv.org/pdf/1701.05517.pdf) for a visualization on CIFAR10).
| 876 | +# (see Figure 1 in the [PixelCNN++ paper](https://arxiv.org/abs/1701.05517) for a visualization on CIFAR10).
877 | 877 | #
878 | 878 | # Next, we can visualize the distribution our model predicts (on average):
879 | 879 |