- Flickr8k Dataset: This dataset contains 8,000 images, each paired with five descriptive captions. It is widely used in image captioning research due to its diverse and rich annotations.
- For the encoding phase, we employed ResNet50, a powerful convolutional neural network (CNN) pre-trained on ImageNet. ResNet50 is renowned for its deep architecture, which allows it to capture intricate patterns and features in images.By leveraging a pre-trained model, we benefit from the transfer learning capabilities, improving the model's performance on the image captioning task.
- For the decoding phase, we implemented a Transformer-based model. Since Transformers do not have a built-in sense of sequence, we incorporated positional embeddings to provide the model with information about the position of each word in the sequence.
Here are some readings
Dataset