
feat: Add Multimodal RAG cookbook #242


Closed
wants to merge 6 commits

Conversation

@sjrl (Contributor) commented Jul 15, 2025

review-notebook-app bot: Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

@sjrl sjrl marked this pull request as ready for review July 21, 2025 12:06
@sjrl sjrl requested a review from a team as a code owner July 21, 2025 12:06
@sjrl (Contributor, Author) commented Jul 21, 2025

@bilgeyucel this is ready for review!

@anakin87 anakin87 self-requested a review July 21, 2025 12:27
review-notebook-app bot commented Jul 22, 2025

anakin87 commented on 2025-07-22T13:40:17Z
----------------------------------------------------------------

I would also explain that in the following Pipeline we are:

  • computing embeddings based on images for image files
  • converting PDF files to textual Documents and then computing embeddings based on the text
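For illustration, the branching described above boils down to routing files by MIME type before the two embedding paths. A minimal sketch of that decision logic (`route_by_type` is a hypothetical helper, not the actual Haystack component used in the Pipeline):

```python
import mimetypes

def route_by_type(paths):
    """Split file paths into image files and PDFs, mirroring the
    two indexing branches described above (hypothetical helper)."""
    images, pdfs = [], []
    for path in paths:
        mime, _ = mimetypes.guess_type(path)
        if mime and mime.startswith("image/"):
            images.append(path)
        elif mime == "application/pdf":
            pdfs.append(path)
    return images, pdfs

images, pdfs = route_by_type(["apple.jpg", "paper.pdf", "chart.png"])
print(images, pdfs)  # ['apple.jpg', 'chart.png'] ['paper.pdf']
```

In the real Pipeline this routing is done by a dedicated component; the sketch only shows which files go down which branch.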

sjrl commented on 2025-07-23T11:27:00Z
----------------------------------------------------------------

sounds good!

review-notebook-app bot commented Jul 22, 2025

anakin87 commented on 2025-07-22T13:40:17Z
----------------------------------------------------------------

Is it worth mentioning that other models (Jina Embeddings 4) might work better?


sjrl commented on 2025-07-23T11:27:23Z
----------------------------------------------------------------

Yeah I was thinking about that. I'll mention that here.

review-notebook-app bot commented Jul 22, 2025

anakin87 commented on 2025-07-22T13:40:18Z
----------------------------------------------------------------

Something seems off in the first sentence.


@anakin87 (Member) left a comment


Nice work! I left some comments (while waiting on a review from @bilgeyucel)

We should also create a new entry in index.toml.

@anakin87 anakin87 requested a review from bilgeyucel July 22, 2025 13:42

@sjrl sjrl requested a review from anakin87 July 24, 2025 07:45
review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:08:58Z
----------------------------------------------------------------

As a future note, let's add docs link for these components


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:08:59Z
----------------------------------------------------------------

"Next, we load our embedders with the sentence-transformers/clip-ViT-L-14 model that maps text and images to a shared vector space. It's important that we use the same CLIP model for both text and images to calculate the similarity between them."


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:08:59Z
----------------------------------------------------------------

Let's run the embedders and create vector embeddings for images to see how semantically similar our query is to the two images
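For intuition, the comparison itself is just cosine similarity in the shared embedding space. A toy sketch with made-up vectors standing in for real CLIP embeddings of the query and the two images:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up stand-ins for CLIP embeddings; real ones come from the embedders.
query_emb = [1.0, 0.2, 0.0]
image_embs = {"apple.jpg": [0.9, 0.3, 0.1], "cat.jpg": [0.0, 0.1, 1.0]}

scores = {name: cosine_similarity(query_emb, emb) for name, emb in image_embs.items()}
best = max(scores, key=scores.get)
print(best)  # apple.jpg
```

With real CLIP embeddings the same score computation ranks the images by semantic closeness to the query text.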


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:09:00Z
----------------------------------------------------------------

As we can see, the text is most similar to our Apple image, as expected! So, the CLIP model can create correct representations for images and text.


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:09:01Z
----------------------------------------------------------------

Let's create an indexing pipeline to process our image and PDF files at once and write them to our Document Store.


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:09:01Z
----------------------------------------------------------------

Line #1.    #indexing_pipe.show()

This line creates a problem in the tutorial test, so let's comment it out.


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:09:02Z
----------------------------------------------------------------

"Run the indexing pipeline with a pdf and an image file"


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:09:03Z
----------------------------------------------------------------

Extra description here: Let's now set up our search and retrieve relevant data from our document store by passing a query.


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:09:03Z
----------------------------------------------------------------

Do we need the "but" here?


sjrl commented on 2025-07-28T09:26:04Z
----------------------------------------------------------------

Yes, I think so; otherwise users might think we only send the text version of the image to the LLM and not the image itself.

sjrl commented on 2025-07-28T09:26:22Z
----------------------------------------------------------------

But we can drop the "but" and leave the rest ;)

review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:09:04Z
----------------------------------------------------------------

Line #1.    indexing_pipe.show()

Again, commenting out


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:09:05Z
----------------------------------------------------------------

Can you explain what is happening in this pipeline? This is also the first time we have introduced this new prompt type, so it requires more explanation.

I think the important thing to mention here is that we do retrieval based on the image caption we created during indexing, but then we use the image itself for the generation part of the RAG pipeline. It's worth mentioning that we convert the image into a base64 string with DocumentToImageContent and render it in the prompt before we pass the prompt to a language model that can process both text and images.
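For reference, the base64 step mentioned here conceptually amounts to the following stdlib snippet (a sketch of the idea, not DocumentToImageContent's actual implementation):

```python
import base64

def image_to_base64(path):
    # Read the raw image bytes and encode them as a base64 string,
    # which can then be rendered into the prompt for a multimodal LLM.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# The prompt would then embed the string, e.g. as a data URL:
#   f"data:image/jpeg;base64,{image_to_base64('apple.jpg')}"
```

The component additionally handles things like Document metadata; the sketch only shows the encoding itself.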


review-notebook-app bot commented Jul 25, 2025

bilgeyucel commented on 2025-07-25T10:09:05Z
----------------------------------------------------------------

Line #1.    #pipe.show()

Same comment, we should delete the output cell though, the image should be there.


@bilgeyucel (Contributor) left a comment

Left my comments, @sjrl 🙌 We should have this as a tutorial, so I will make further adjustments when we transfer it to the haystack-tutorials repo.


@sjrl (Contributor, Author) commented Jul 28, 2025

Closing since we are moving this to the tutorial repo deepset-ai/haystack-tutorials#409

@sjrl sjrl closed this Jul 28, 2025
Successfully merging this pull request may close these issues.

Create Image Indexing Example
3 participants