
Conversation

@evan-cao-wb (Contributor) commented on Nov 18, 2025

User description

Add mean/max pooling embedding methods to improve vector query accuracy in the short-text scenario.


PR Type

Enhancement


Description

  • Introduces Mean-Max Pooling (MMP) embedding strategy for improved vector query accuracy

  • Implements multi-provider support (OpenAI, Azure OpenAI, DeepSeek) via ProviderHelper

  • Adds MMPEmbeddingProvider with token-level embedding and pooling capabilities

  • Registers new plugin in solution with dependency injection configuration


Diagram Walkthrough

flowchart LR
  A["Text Input"] --> B["Tokenization"]
  B --> C["Get Token Embeddings"]
  C --> D["ProviderHelper<br/>OpenAI/Azure/DeepSeek"]
  D --> E["Mean-Max Pooling"]
  E --> F["Combined Vector"]
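
For intuition, here is a minimal standalone sketch of the pooling step shown in the diagram (illustrative only; the method name, the 50/50 weights, and the explicit dimension parameter are assumptions, not the plugin's exact code):

// Combine per-token vectors into one vector: a weighted sum of
// element-wise mean pooling and element-wise max pooling.
// Requires System.Linq and System.Collections.Generic.
static float[] MeanMaxPool(IReadOnlyList<float[]> vectors, int dimension,
                           float meanWeight = 0.5f, float maxWeight = 0.5f)
{
    if (vectors.Count == 0) return new float[dimension];

    var pooled = new float[dimension];
    for (var i = 0; i < dimension; i++)
    {
        var mean = vectors.Average(v => v[i]); // average of component i across tokens
        var max = vectors.Max(v => v[i]);      // maximum of component i across tokens
        pooled[i] = meanWeight * mean + maxWeight * max;
    }
    return pooled;
}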

File Walkthrough

Relevant files
Enhancement
MMPEmbeddingPlugin.cs
Plugin registration and dependency injection setup             

src/Plugins/BotSharp.Plugin.MMPEmbedding/MMPEmbeddingPlugin.cs

  • Implements IBotSharpPlugin interface for plugin registration
  • Registers MMPEmbeddingProvider as ITextEmbedding service
  • Provides plugin metadata (Id, Name, Description)
+19/-0   
ProviderHelper.cs
Multi-provider client factory for embedding services         

src/Plugins/BotSharp.Plugin.MMPEmbedding/ProviderHelper.cs

  • Provides factory method to get OpenAI-compatible clients based on
    provider type
  • Handles Azure OpenAI client creation separately with endpoint and
    credentials
  • Supports OpenAI, DeepSeek, and other OpenAI-compatible providers
  • Retrieves provider settings from ILlmProviderService
+70/-0   
MMPEmbeddingProvider.cs
Core MMP embedding provider with pooling logic                     

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs

  • Implements ITextEmbedding interface with mean-max pooling strategy
  • Tokenizes input text using regex word pattern matching
  • Gets embeddings for individual tokens from underlying provider
  • Combines token embeddings using weighted mean and max pooling
  • Supports configurable model, dimension, and underlying provider
+167/-0 
Configuration changes
Using.cs
Global namespace imports for plugin                                           

src/Plugins/BotSharp.Plugin.MMPEmbedding/Using.cs

  • Defines global using statements for common namespaces
  • Includes BotSharp abstraction and ML task dependencies
+10/-0   
BotSharp.Plugin.MMPEmbedding.csproj
Project file for MMP embedding plugin                                       

src/Plugins/BotSharp.Plugin.MMPEmbedding/BotSharp.Plugin.MMPEmbedding.csproj

  • Creates new C# project file for MMPEmbedding plugin
  • References Azure.AI.OpenAI and OpenAI NuGet packages
  • Depends on BotSharp.Core infrastructure project
  • Enables implicit usings and nullable reference types
+18/-0   
BotSharp.sln
Solution file integration for new plugin                                 

BotSharp.sln

  • Adds BotSharp.Plugin.MMPEmbedding project to solution
  • Configures build configurations for Debug and Release
  • Assigns project to Plugins solution folder
+11/-0   

@qodo-merge-pro

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Unsafe array indexing

Description: The mean-max pooling assumes every token embedding vector has length equal to the
configured dimension _dimension, but the code never validates or adjusts vector lengths; an
upstream provider that returns shorter vectors will trigger an IndexOutOfRangeException, and one
that returns longer vectors will have its extra components silently dropped.
MMPEmbeddingProvider.cs [116-133]

Referred Code
private float[] MeanMaxPooling(IReadOnlyList<float[]> vectors, double meanWeight = 0.5, double maxWeight = 0.5)
{
    var numTokens = vectors.Count;

    if (numTokens == 0)
        return [];

    var meanPooled = Enumerable.Range(0, _dimension)
        .Select(i => vectors.Average(v => v[i]))
        .ToArray();
    var maxPooled = Enumerable.Range(0, _dimension)
        .Select(i => vectors.Max(v => v[i]))
        .ToArray();

    return Enumerable.Range(0, _dimension)
        .Select(i => (float)meanWeight * meanPooled[i] + (float)maxWeight * maxPooled[i])
        .ToArray();
}
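
A possible guard before pooling (a sketch, not part of the PR; it reuses the existing _dimension field):

// Sketch: fail fast if any returned vector is shorter than the configured dimension,
// instead of throwing IndexOutOfRangeException deep inside the pooling loops.
if (vectors.Any(v => v == null || v.Length < _dimension))
{
    throw new InvalidOperationException(
        $"Embedding provider returned a vector shorter than the configured dimension ({_dimension}).");
}
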
Type confusion risk

Description: The Azure OpenAI client is instantiated via AzureOpenAIClient and returned as its
OpenAIClient base type; callers that treat it as a plain OpenAI client may exercise the wrong API
surface, potentially leading to failures or unintended requests to external services.
ProviderHelper.cs [44-57]

Referred Code
private static OpenAIClient GetAzureOpenAIClient(LlmModelSetting settings)
{
    if (string.IsNullOrEmpty(settings.Endpoint))
    {
        throw new InvalidOperationException("Azure OpenAI endpoint is required");
    }

    var client = new AzureOpenAIClient(
        new Uri(settings.Endpoint),
        new ApiKeyCredential(settings.ApiKey)
    );

    return client;
}
Excessive error disclosure

Description: Errors from the upstream embedding provider are logged with provider and model names and
then rethrown without controlling retry/backoff; this can spam logs and expose
configuration details while repeatedly hitting external APIs, enabling information
disclosure and denial-of-wallet scenarios.
MMPEmbeddingProvider.cs [82-107]

Referred Code
private async Task<List<float[]>> GetTokenEmbeddingsAsync(List<string> tokens)
{
    try
    {
        // Get the appropriate client based on the underlying provider
        var client = ProviderHelper.GetClient(_underlyingProvider, _model, _serviceProvider);
        var embeddingClient = client.GetEmbeddingClient(_model);

        // Prepare options
        var options = new EmbeddingGenerationOptions
        {
            Dimensions = _dimension > 0 ? _dimension : null
        };

        // Get embeddings for all tokens in batch
        var response = await embeddingClient.GenerateEmbeddingsAsync(tokens, options);
        var embeddings = response.Value;

        return embeddings.Select(e => e.ToFloats().ToArray()).ToList();
    }
    catch (Exception ex)


 ... (clipped 5 lines)
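
One possible mitigation (a sketch only; it assumes an injected ILogger field named _logger and a fixed budget of three attempts, neither of which is dictated by the PR):

// Sketch: bounded retry with exponential backoff; log attempt counts, not configuration details.
var attempt = 0;
while (true)
{
    try
    {
        var response = await embeddingClient.GenerateEmbeddingsAsync(tokens, options);
        return response.Value.Select(e => e.ToFloats().ToArray()).ToList();
    }
    catch (Exception ex) when (++attempt < 3)
    {
        _logger.LogWarning(ex, "Token embedding request failed (attempt {Attempt}); retrying", attempt);
        await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
    }
}
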
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed


Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed


Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
No auditing: The new embedding operations do not log or emit audit events for critical actions, but it
is unclear if these actions are considered auditable in this context.

Referred Code
public async Task<float[]> GetVectorAsync(string text)
{
    if (string.IsNullOrWhiteSpace(text))
    {
        return new float[_dimension];
    }

    var tokens = Tokenize(text).ToList();

    if (tokens.Count == 0)
    {
        return new float[_dimension];
    }

    // Get embeddings for all tokens
    var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);

    // Apply mean-max pooling
    var pooledEmbedding = MeanMaxPooling(tokenEmbeddings);

    return pooledEmbedding;


 ... (clipped 17 lines)
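
If these operations are deemed auditable, a minimal hook could be added after pooling (a sketch; it assumes an injected ILogger field named _logger, which the referred code does not show):

// Sketch: record provider, model, and token count only; never log the input text itself.
_logger.LogInformation(
    "MMP embedding generated: provider={Provider}, model={Model}, tokens={TokenCount}",
    _underlyingProvider, _model, tokens.Count);
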


Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Exception context: Helper methods throw InvalidOperationException without guidance on how callers
should handle or log them, and Azure client creation errors may lack actionable context beyond
what is written to logs.

Referred Code
    if (settings == null)
    {
        throw new InvalidOperationException($"Cannot find settings for provider '{provider}' and model '{model}'");
    }

    // Handle Azure OpenAI separately as it uses AzureOpenAIClient
    if (provider.Equals("azure-openai", StringComparison.OrdinalIgnoreCase))
    {
        return GetAzureOpenAIClient(settings);
    }

    // For OpenAI, DeepSeek, and other OpenAI-compatible providers
    return GetOpenAICompatibleClient(settings);
}

/// <summary>
/// Gets an Azure OpenAI client
/// </summary>
private static OpenAIClient GetAzureOpenAIClient(LlmModelSetting settings)
{
    if (string.IsNullOrEmpty(settings.Endpoint))


 ... (clipped 9 lines)
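
One way to give callers more actionable context (a sketch, not part of the PR) is to wrap client construction and rethrow with the provider name but without credentials or endpoints:

// Sketch: surround Azure client creation with context that is safe to surface.
try
{
    return GetAzureOpenAIClient(settings);
}
catch (Exception ex)
{
    throw new InvalidOperationException(
        $"Failed to create embedding client for provider '{provider}' and model '{model}'.", ex);
}
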


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Input validation: Text inputs are only minimally validated (empty checks), and provider/model
names are used to construct clients without explicit sanitization, so the code implicitly relies
on upstream configuration guarantees.

Referred Code
{
    if (string.IsNullOrWhiteSpace(text))
    {
        return new float[_dimension];
    }

    var tokens = Tokenize(text).ToList();

    if (tokens.Count == 0)
    {
        return new float[_dimension];
    }

    // Get embeddings for all tokens
    var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);

    // Apply mean-max pooling
    var pooledEmbedding = MeanMaxPooling(tokenEmbeddings);

    return pooledEmbedding;
}


 ... (clipped 12 lines)
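
If stricter input handling is wanted, a simple length cap before tokenization is one option (a sketch; the 8,192-character limit is an arbitrary assumption):

// Sketch: bound the amount of text sent to the upstream embedding API.
const int MaxInputChars = 8192;
if (text.Length > MaxInputChars)
{
    text = text[..MaxInputChars];
}
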


Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-merge-pro

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Re-evaluate the word-level tokenization strategy

The current word-level tokenization should be replaced with chunking into
sentences or paragraphs. This aligns better with how embedding models work,
improving embedding quality and performance.

Examples:

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [47-55]
        var tokens = Tokenize(text).ToList();

        if (tokens.Count == 0)
        {
            return new float[_dimension];
        }

        // Get embeddings for all tokens
        var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);
src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [162-166]
    public static IEnumerable<string> Tokenize(string text, string? pattern = null)
    {
        var patternRegex = string.IsNullOrEmpty(pattern) ? WordRegex : new(pattern, RegexOptions.Compiled);
        return patternRegex.Matches(text).Cast<Match>().Select(m => m.Value);
    }

Solution Walkthrough:

Before:

// MMPEmbeddingProvider.cs
public async Task<float[]> GetVectorAsync(string text)
{
    // Tokenizes "This is a sentence." into ["This", "is", "a", "sentence"]
    var tokens = Tokenize(text).ToList(); 

    // Gets 4 separate embeddings, one for each word, in a batch API call
    var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);

    // Pools the 4 word embeddings
    var pooledEmbedding = MeanMaxPooling(tokenEmbeddings);
    return pooledEmbedding;
}

public static IEnumerable<string> Tokenize(string text, string? pattern = null)
{
    // Regex for words: \b\w+\b
    return WordRegex.Matches(text).Cast<Match>().Select(m => m.Value);
}

After:

// MMPEmbeddingProvider.cs
public async Task<float[]> GetVectorAsync(string text)
{
    // Chunks "Sentence one. Sentence two." into ["Sentence one.", "Sentence two."]
    var chunks = ChunkTextIntoSentences(text); 

    // Gets 2 separate embeddings, one for each sentence
    var chunkEmbeddings = await GetChunkEmbeddingsAsync(chunks);

    // Pools the 2 sentence embeddings
    var pooledEmbedding = MeanMaxPooling(chunkEmbeddings);
    return pooledEmbedding;
}

private List<string> ChunkTextIntoSentences(string text)
{
    // New logic to split text into sentences or paragraphs
    // ...
}
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a fundamental design flaw in the tokenization strategy that likely leads to poor embedding quality and inefficient API usage, undermining the PR's goal of improving accuracy.

High
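
If the chunking suggestion above is adopted, a minimal regex-based sentence chunker could look like the following (an illustrative sketch only; a production implementation would likely use a proper sentence segmenter):

// Sketch: split on sentence-ending punctuation followed by whitespace.
// Requires System.Text.RegularExpressions and System.Linq.
private static readonly Regex SentenceRegex = new(@"(?<=[.!?])\s+", RegexOptions.Compiled);

private static List<string> ChunkTextIntoSentences(string text)
{
    return SentenceRegex.Split(text)
        .Select(s => s.Trim())
        .Where(s => s.Length > 0)
        .ToList();
}
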
Possible issue
Fix inconsistent empty embedding handling

In MeanMaxPooling, return a zero-filled array of size _dimension for empty
inputs to match the behavior of GetVectorAsync and prevent potential errors.

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [116-133]

 private float[] MeanMaxPooling(IReadOnlyList<float[]> vectors, double meanWeight = 0.5, double maxWeight = 0.5)
 {
     var numTokens = vectors.Count;
 
     if (numTokens == 0)
-        return [];
+        return new float[_dimension];
 
     var meanPooled = Enumerable.Range(0, _dimension)
         .Select(i => vectors.Average(v => v[i]))
         .ToArray();
     var maxPooled = Enumerable.Range(0, _dimension)
         .Select(i => vectors.Max(v => v[i]))
         .ToArray();
 
     return Enumerable.Range(0, _dimension)
         .Select(i => (float)meanWeight * meanPooled[i] + (float)maxWeight * maxPooled[i])
         .ToArray();
 }
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies an inconsistency in return values for empty inputs, which could lead to runtime errors, and proposes a fix that makes the component more robust and predictable.

Medium
General
Process text embeddings in parallel

Refactor GetVectorsAsync to process texts in parallel using Task.WhenAll for
better performance.

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [66-77]

 public async Task<List<float[]>> GetVectorsAsync(List<string> texts)
 {
-    var results = new List<float[]>();
-
-    foreach (var text in texts)
-    {
-        var embedding = await GetVectorAsync(text);
-        results.Add(embedding);
-    }
-
-    return results;
+    var tasks = texts.Select(text => GetVectorAsync(text));
+    var embeddings = await Task.WhenAll(tasks);
+    return embeddings.ToList();
 }
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a performance bottleneck and proposes a standard parallelization pattern that will significantly improve the method's execution time for multiple texts.

Medium
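
A throttled variant of the parallel approach above would avoid flooding the upstream API (a sketch; the concurrency limit of 4 is an arbitrary assumption):

public async Task<List<float[]>> GetVectorsAsync(List<string> texts)
{
    using var throttler = new SemaphoreSlim(4); // cap concurrent upstream calls
    var tasks = texts.Select(async text =>
    {
        await throttler.WaitAsync();
        try { return await GetVectorAsync(text); }
        finally { throttler.Release(); }
    });
    var embeddings = await Task.WhenAll(tasks);
    return embeddings.ToList();
}
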
Cache compiled regular expressions

Optimize the Tokenize method by caching compiled Regex objects in a
ConcurrentDictionary to avoid expensive re-compilation on each call.

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [162-166]

+private static readonly System.Collections.Concurrent.ConcurrentDictionary<string, Regex> _regexCache = new();
+
+/// <summary>
+/// Tokenizes text into individual words
+/// </summary>
 public static IEnumerable<string> Tokenize(string text, string? pattern = null)
 {
-    var patternRegex = string.IsNullOrEmpty(pattern) ? WordRegex : new(pattern, RegexOptions.Compiled);
+    var patternRegex = string.IsNullOrEmpty(pattern) 
+        ? WordRegex 
+        : _regexCache.GetOrAdd(pattern, p => new Regex(p, RegexOptions.Compiled));
     return patternRegex.Matches(text).Cast<Match>().Select(m => m.Value);
 }
Suggestion importance[1-10]: 6


Why: The suggestion provides a valid performance optimization by caching compiled Regex objects, which is a best practice to avoid repeated compilation overhead in a frequently called method.

Low
Learned
best practice
Validate embeddings before pooling

Add explicit null/empty checks for tokenEmbeddings and ensure each vector has
the expected dimension before pooling to prevent index errors.

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [55-58]

 var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);
+if (tokenEmbeddings == null || tokenEmbeddings.Count == 0 || tokenEmbeddings.Any(v => v == null || v.Length < _dimension))
+{
+    return new float[_dimension];
+}
 
 // Apply mean-max pooling
 var pooledEmbedding = MeanMaxPooling(tokenEmbeddings);
Suggestion importance[1-10]: 6


Why:
Relevant best practice - Guard against nulls, empty collections, and invalid states before access to avoid NREs and out-of-range errors.

Low
