Different similarity results when using text-embedding-3-small or text-embedding-3-large models #542
Replies: 4 comments 1 reply
-
I think that's expected behavior. Bigger and newer models capture more detail and understand content better. Something that might seem relevant to ada-002 might be less relevant to the other models, and the opposite can happen too. In general, when switching models it's recommended to also "fine tune" thresholds, prompts and other "semantic" settings. Similar scenarios occur with text generation when moving from GPT-3.5 to GPT-4, or to other models. It's like changing the image/sound/video compression algorithm at the core of a game: you notice different quality, performance and artifacts, and need to revisit settings and requirements.
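One practical way to handle this (purely an illustrative sketch, not something from Kernel Memory itself) is to keep the minimum-relevance threshold next to the embedding model name, so that switching models forces you to pick a calibrated value rather than reusing the old one. The numbers below are only the placeholders discussed later in this thread; each deployment should calibrate its own thresholds against known-good queries.

```csharp
// Illustrative sketch only: per-model relevance thresholds, not an official Kernel Memory API.
using System;
using System.Collections.Generic;

public static class RelevanceThresholds
{
    // Placeholder values: 0.75 and 0.5 are the figures discussed in this thread.
    private static readonly Dictionary<string, double> MinRelevanceByModel = new()
    {
        ["text-embedding-ada-002"] = 0.75,
        ["text-embedding-3-small"] = 0.50,
        ["text-embedding-3-large"] = 0.50,
    };

    public static double For(string embeddingModel) =>
        MinRelevanceByModel.TryGetValue(embeddingModel, out var threshold)
            ? threshold
            : 0.0; // unknown model: don't filter until a threshold has been calibrated

    public static void Main()
    {
        Console.WriteLine(For("text-embedding-3-small")); // 0.5
    }
}
```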
-
Thank you @dluc for the answer. Now I'm experiencing some weird situations in which a question that had a similarity of 0.79 with […]. Now I'm trying to increment […]
-
That's a pretty big difference. Are the chunks the same?
-
Yes, the chunks are the same. The text is relevant as […]
-
Context / Scenario
For the same document and question, when using the text-embedding-3-small or text-embedding-3-large models, similarity search returns results with lower relevance than when using the text-embedding-ada-002 model.
What happened?
I'm using the code available at https://github.com/marcominerva/KernelMemoryService with SimpleVectorDb. I have imported the file Taggia.pdf, which is the PDF of the Italian Wikipedia page about the town of Taggia, Italy. Then I searched for "Quante persone vivono a Taggia?" (in English, "How many people live in Taggia?").
If I use the text-embedding-ada-002 model and dig into the source code of SimpleVectorDb (kernel-memory/service/Core/MemoryStorage/DevTools/SimpleVectorDb.cs, lines 115 to 121 in d127063), I obtain this:
However, if I use text-embedding-3-small (I have of course deleted the previous memories and re-imported the document), with the same question I get:
So, with these models I need to change the minRelevance parameter I use for my query. With text-embedding-ada-002 I use a value of 0.75, while with the newer models it seems that anything greater than 0.5 is good. Do you agree?
NOTE: I get similar results with Qdrant as well.
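For context, here is a minimal sketch of where that threshold is applied on my side. It assumes an already configured IKernelMemory instance (embedding model, vector store and document ingestion set up as in the linked KernelMemoryService project) and the AskAsync overload that accepts a minRelevance argument; the 0.5 value is just the figure discussed above, and the builder setup is omitted.

```csharp
// Minimal sketch, assuming an IKernelMemory instance is already configured
// (embedding model, vector store, documents imported) as in the linked project.
using System;
using System.Threading.Tasks;
using Microsoft.KernelMemory;

public static class TaggiaExample
{
    public static async Task RunAsync(IKernelMemory memory)
    {
        // With text-embedding-ada-002 a minRelevance of ~0.75 worked for me;
        // with text-embedding-3-small/-large the thread suggests ~0.5.
        var answer = await memory.AskAsync(
            "Quante persone vivono a Taggia?",
            minRelevance: 0.5);

        Console.WriteLine(answer.Result);
    }
}
```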
Importance
edge case
Platform, Language, Versions
Kernel Memory v0.35.240318.1
Relevant log output
No response