[Chapter 8 - video #258] Expand position embedding to match the batch size #259
-
Hi @mrdbourke, I'm going through Chapter 8 and noticed that the class token embedding is expanded to match the batch size, but the position embedding is not. Shouldn't the position embedding also be expanded before it's added to the patch embeddings? Namely, I'd propose the following, where we call `expand` on the position embedding as well:

```python
def forward(self, x: torch.Tensor) -> torch.Tensor:
    batch_size = x.shape[0]
    # expand class token embedding to match the batch size
    class_token = self.class_embedding.expand(batch_size, -1, -1)
    x = self.patch_embedding(x)
    x = torch.cat(tensors=(class_token, x), dim=1)
    x = self.position_embedding.expand(batch_size, -1, -1) + x # expand here too? (the original code has no expand)
    # print(x.shape)
    x = self.embedding_dropout(x)
    x = self.transformer_encoder(x)
    x = self.classifier_head(x[:, 0, :])
    return x
```

If this is not the case, could you please shed light on where I'm wrong?
-
Hi @AlessandroMiola!

Great suggestion! And you'd be right in thinking that, however, due to the nature of addition in PyTorch, the line

```python
x = self.position_embedding + x
```

will add the `position_embedding` across every sample in the batch.

This is from equation 1 in the paper: https://www.learnpytorch.io/08_pytorch_paper_replicating/#47-creating-the-position-embedding

See an example on Google Colab here: https://www.learnpytorch.io/08_pytorch_paper_replicating/#47-creating-the-position-embedding

Let's see an example of creating a batched image tensor of all zeroes and then adding all ones to it:

```python
import torch
from torch import nn

# Set hyperparameters
batch_size = 32
embed_dim = 768
num_patches = 196

# Create batch of zeros
x = torch.zeros([batch_size,
                 num_patches + 1, # +1 is for learnable class_token (not shown here but see: https://www.learnpytorch.io/08_pytorch_paper_replicating/#46-creating-the-class-token-embedding)
                 embed_dim])
print(f"Patch embedding with class token shape: {x.shape} -> [batch_size, patch_embedding + class_token, embedding_dim]")
print(x[0])
```

Out:
Create the position embedding as a learnable tensor:

```python
# Create the position embedding, see here: https://www.learnpytorch.io/08_pytorch_paper_replicating/#47-creating-the-position-embedding
position_embedding = nn.Parameter(torch.ones(1,
                                             num_patches + 1, # +1 is for class_token
                                             embed_dim),
                                  requires_grad=True) # make sure it's learnable

# Show the first 10 sequences and 10 position embedding values and check the shape of the position embedding
print(position_embedding[:, :10, :10])
print(f"Position embedding shape: {position_embedding.shape} -> [batch_size, number_of_patches, embedding_dimension]")
```

Out:
Add the position embedding (ones) to the patched/batched image embedding (all zeroes), across all samples in a batch:

```python
# Add position embedding to x (due to nature of addition in PyTorch, it adds to all tensors in batch)
x_with_position_embedding = x + position_embedding
x_with_position_embedding
```

Out:
Check if the position embedding was added to all samples in the batch (this is known as broadcasting in NumPy/PyTorch):

```python
# Check to see if position embedding added to "x"
x_with_position_embedding == 1
```

Out:
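As one extra sanity check (a minimal sketch, assuming the same `x` and `position_embedding` as above), explicitly expanding the position embedding to the batch size gives exactly the same result as relying on broadcasting, so the `.expand()` call isn't needed:

```python
# Explicitly expand the position embedding to the batch size (as proposed in the question)
expanded_position_embedding = position_embedding.expand(batch_size, -1, -1)

# Broadcasting and explicit expanding produce identical results
print(torch.equal(x + position_embedding, x + expanded_position_embedding)) # True
print((x + expanded_position_embedding).shape) # torch.Size([32, 197, 768])
```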
A fair bit going on here but I hope that clears things up! Let me know if you're unsure of anything!
-
Shouldn't the position embeddings be a combination of sin and cos, as they are in the Attention Is All You Need paper?
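For reference, a minimal sketch of the fixed sin/cos encoding that paper describes (the hyperparameters below are assumed to match the example above; the ViT paper replicated in this chapter instead learns its position embedding as an `nn.Parameter`, as shown earlier in the thread):

```python
import torch

def sinusoidal_position_embedding(num_positions: int, embed_dim: int) -> torch.Tensor:
    """Fixed sin/cos position encoding in the style of "Attention Is All You Need" (illustrative sketch, not course code)."""
    positions = torch.arange(num_positions).unsqueeze(1).float()           # [num_positions, 1]
    div_term = torch.exp(torch.arange(0, embed_dim, 2).float() *
                         (-torch.log(torch.tensor(10000.0)) / embed_dim))  # [embed_dim/2]
    pe = torch.zeros(num_positions, embed_dim)
    pe[:, 0::2] = torch.sin(positions * div_term)  # even dimensions use sin
    pe[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions use cos
    return pe  # [num_positions, embed_dim]

# One encoding per patch + class token (197 positions, 768 dims, matching the example above)
print(sinusoidal_position_embedding(197, 768).shape)  # torch.Size([197, 768])
```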