Different similarity results when using text-embedding-3-small or text-embedding-3-large models #542
Replies: 4 comments 1 reply
-
I think that's expected behavior. Bigger and newer models capture more detail and understand content better. Something that might seem relevant to ada-002 might be less relevant to the other models, and the opposite can happen too. In general, when switching models it's recommended to also "fine tune" thresholds, prompts and other "semantic" settings. Similar scenarios occur with text generation when moving from GPT-3.5 to GPT-4, or to other models. It's like changing the image/sound/video compression algorithm at the core of a game: you notice different quality, performance and artifacts, and need to revisit settings and requirements.
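One practical way to handle this (purely an illustrative sketch, not something from Kernel Memory itself) is to keep the minimum-relevance threshold next to the embedding model name, so that switching models forces you to pick a calibrated value rather than reusing the old one. The numbers below are only the placeholders discussed later in this thread; each deployment should calibrate its own thresholds against known-good queries.

```csharp
// Illustrative sketch only: per-model relevance thresholds, not an official Kernel Memory API.
using System;
using System.Collections.Generic;

public static class RelevanceThresholds
{
    // Placeholder values: 0.75 and 0.5 are the figures discussed in this thread.
    private static readonly Dictionary<string, double> MinRelevanceByModel = new()
    {
        ["text-embedding-ada-002"] = 0.75,
        ["text-embedding-3-small"] = 0.50,
        ["text-embedding-3-large"] = 0.50,
    };

    public static double For(string embeddingModel) =>
        MinRelevanceByModel.TryGetValue(embeddingModel, out var threshold)
            ? threshold
            : 0.0; // unknown model: don't filter until a threshold has been calibrated

    public static void Main()
    {
        Console.WriteLine(For("text-embedding-3-small")); // 0.5
    }
}
```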
-
Thank you @dluc for the answer. Now I'm experiencing some weird situations in which a question that had a similarity of 0.79 with […]. Now I'm trying to increment […]
-
That's a pretty big difference. Are the chunks the same?
-
Yes, the chunks are the same. The text is relevant as […]
-
Context / Scenario
For the same document and question, when using the text-embedding-3-small or text-embedding-3-large models, similarity search returns results with lower relevance than when using the text-embedding-ada-002 model.
What happened?
I'm using the code available at https://github.com/marcominerva/KernelMemoryService with SimpleVectorDb. I have imported the file Taggia.pdf, which is the PDF of the Italian Wikipedia page about the town of Taggia, Italy. Then I searched for "Quante persone vivono a Taggia?" (in English, "How many people live in Taggia?").
If I use the text-embedding-ada-002 model and dig into the source code of SimpleVectorDb (kernel-memory/service/Core/MemoryStorage/DevTools/SimpleVectorDb.cs, lines 115 to 121 in d127063), I obtain this:
However, if I use text-embedding-3-small (I have of course deleted the previous memories and re-imported the document), with the same question I get:
So, with these models I need to change the minRelevance parameter I use for my query. With text-embedding-ada-002 I use a value of 0.75, while with the newer models it seems that anything greater than 0.5 is good. Do you agree?
NOTE: I get similar results with Qdrant as well.
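For context, here is a minimal sketch of where that threshold is applied on my side. It assumes an already configured IKernelMemory instance (embedding model, vector store and document ingestion set up as in the linked KernelMemoryService project) and the AskAsync overload that accepts a minRelevance argument; the 0.5 value is just the figure discussed above, and the builder setup is omitted.

```csharp
// Minimal sketch, assuming an IKernelMemory instance is already configured
// (embedding model, vector store, documents imported) as in the linked project.
using System;
using System.Threading.Tasks;
using Microsoft.KernelMemory;

public static class TaggiaExample
{
    public static async Task RunAsync(IKernelMemory memory)
    {
        // With text-embedding-ada-002 a minRelevance of ~0.75 worked for me;
        // with text-embedding-3-small/-large the thread suggests ~0.5.
        var answer = await memory.AskAsync(
            "Quante persone vivono a Taggia?",
            minRelevance: 0.5);

        Console.WriteLine(answer.Result);
    }
}
```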
Importance
edge case
Platform, Language, Versions
Kernel Memory v0.35.240318.1
Relevant log output
No response