feat: Add Multimodal RAG cookbook #242
Conversation
sjrl commented on Jul 15, 2025
- fixes Create Image Indexing Example haystack#9321
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.
@bilgeyucel this is ready for review!
anakin87 commented on 2025-07-22: I would also explain that in the following Pipeline we are:
sjrl replied on 2025-07-23: Sounds good!
anakin87 commented on 2025-07-22: Is it worth mentioning that other models (Jina Embeddings 4) might work better?
sjrl replied on 2025-07-23: Yeah, I was thinking about that. I'll mention that here.
anakin87 commented on 2025-07-22: Something seems off in the first sentence.
anakin87 reviewed: Nice work! I left some comments (while waiting on a review from @bilgeyucel).
We should also create a new entry in index.toml.
bilgeyucel commented on 2025-07-25: As a future note, let's add docs links for these components.
bilgeyucel commented on 2025-07-25: Suggested wording: "Next, we load our embedders with the sentence-transformers/clip-ViT-L-14 model, which maps text and images to a shared vector space. It's important that we use the same CLIP model for both text and images so that we can calculate the similarity between them."
bilgeyucel commented on 2025-07-25: Suggested wording: "Let's run the embedders and create vector embeddings for the images to see how semantically similar our query is to the two images."
bilgeyucel commented on 2025-07-25: Suggested wording: "As we can see, the text is most similar to our apple image, as expected! So the CLIP model creates correct representations for both images and text."
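For reference, here is a minimal sketch of the similarity check being discussed. It uses the sentence-transformers library directly rather than the cookbook's Haystack embedder components, and the image file names are hypothetical; the point is only that the same CLIP checkpoint embeds text and images into one vector space, so their cosine similarities are directly comparable.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

# Load one CLIP checkpoint and use it for BOTH the text query and the images.
model = SentenceTransformer("sentence-transformers/clip-ViT-L-14")

query_emb = model.encode("a photo of an apple", convert_to_tensor=True)
image_embs = model.encode(
    [Image.open("apple.jpg"), Image.open("city.jpg")],  # hypothetical example files
    convert_to_tensor=True,
)

# Cosine similarity between the query and each image embedding.
print(util.cos_sim(query_emb, image_embs))  # the apple image should score highest
```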
bilgeyucel commented on 2025-07-25: Suggested wording: "Let's create an indexing pipeline to process our image and PDF files at once and write them to our Document Store."
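As a rough illustration of such an indexing pipeline, the sketch below routes PDFs and images by MIME type and writes the resulting Documents to an InMemoryDocumentStore. It is a simplified outline under stated assumptions, not the cookbook's exact pipeline: the image-to-Document conversion and the embedding step are left as placeholders because their component names are not shown in this thread, and the input file names are hypothetical.

```python
from haystack import Pipeline
from haystack.components.routers import FileTypeRouter
from haystack.components.converters import PyPDFToDocument
from haystack.components.joiners import DocumentJoiner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing_pipe = Pipeline()
indexing_pipe.add_component("router", FileTypeRouter(mime_types=["application/pdf", "image/jpeg"]))
indexing_pipe.add_component("pdf_converter", PyPDFToDocument())
# An image-to-Document converter (and the CLIP/caption embedding step) would be
# added here; the exact component names are omitted because they are assumptions.
indexing_pipe.add_component("joiner", DocumentJoiner())
indexing_pipe.add_component("writer", DocumentWriter(document_store=document_store))

indexing_pipe.connect("router.application/pdf", "pdf_converter.sources")
indexing_pipe.connect("pdf_converter.documents", "joiner.documents")
indexing_pipe.connect("joiner.documents", "writer.documents")

# indexing_pipe.show()  # kept commented out, as suggested, so tutorial tests pass
indexing_pipe.run({"router": {"sources": ["sample.pdf", "apple.jpg"]}})  # hypothetical files
```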
bilgeyucel commented on 2025-07-25: Line #1 (#indexing_pipe.show()): this line creates a problem in the tutorial test, so let's comment it out.
bilgeyucel commented on 2025-07-25: Suggested wording: "Run the indexing pipeline with a PDF and an image file."
bilgeyucel commented on 2025-07-25: Extra description here: "Let's now set up our search and retrieve relevant data from our document store by passing a query."
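A minimal sketch of that search step might look like the following, assuming the Documents were embedded with the same clip-ViT-L-14 model at indexing time and reusing the document_store from the indexing sketch above; the query text is just an example.

```python
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

query_pipe = Pipeline()
query_pipe.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(model="sentence-transformers/clip-ViT-L-14"),
)
query_pipe.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=3))
query_pipe.connect("text_embedder.embedding", "retriever.query_embedding")

# Embed the query with the same CLIP model, then fetch the closest Documents.
result = query_pipe.run({"text_embedder": {"text": "What does an apple look like?"}})
print(result["retriever"]["documents"])
```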
bilgeyucel commented on 2025-07-25: Do we need the "but" here?
sjrl replied on 2025-07-28: Yes, I think so; otherwise users might think we only send the text version of the image to the LLM and not the image itself.
sjrl replied on 2025-07-28: But we can drop the "but" and leave the rest ;)
bilgeyucel commented on 2025-07-25: Line #1 (indexing_pipe.show()): again, commenting this out.
bilgeyucel commented on 2025-07-25: Can you explain what is happening in this pipeline? This is also the first time we have introduced this new prompt type, so it requires more explanation. The important thing to mention here is that we do retrieval based on the image caption we created during indexing, but then we use the image itself for the generation part of RAG. It's also worth mentioning that we convert the image into a base64 string with DocumentToImageContent and render it in the prompt before we pass the prompt to a language model that can process both text and images.
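To make the base64 step concrete, here is a small sketch of the underlying idea: the retrieved image is read from disk, base64-encoded, and placed in the prompt next to the question, so a vision-capable model sees the image itself rather than only its caption. It deliberately uses the OpenAI SDK directly instead of DocumentToImageContent and the cookbook's Haystack components, and the file path and model name are placeholders.

```python
import base64
from openai import OpenAI

client = OpenAI()

# Read the retrieved image Document's file and encode it as a base64 string.
with open("apple.jpg", "rb") as f:  # placeholder path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Render the image into the prompt as a data URI next to the text question.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # any model that accepts both text and images
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Based on the image, what fruit is shown?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```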
bilgeyucel commented on 2025-07-25: Line #1 (#pipe.show()): same comment. We should delete the output cell though; the image should be there.
bilgeyucel reviewed: Left my comments, @sjrl 🙌 We should have it as a tutorial, so I will make further adjustments when we transfer it to the haystack-tutorials repo.
Closing since we are moving this to the tutorial repo: deepset-ai/haystack-tutorials#409