
Conversation

@evan-cao-wb (Contributor) commented on Nov 18, 2025

User description

Add mean/max pooling embedding methods to improve vector query accuracy in the short-text scenario.


PR Type

Enhancement


Description

  • Introduces Mean-Max Pooling (MMP) embedding strategy for improved vector query accuracy

  • Implements multi-provider support (OpenAI, Azure OpenAI, DeepSeek) via ProviderHelper

  • Adds MMPEmbeddingProvider with token-level embedding and pooling capabilities

  • Registers new plugin in solution with dependency injection configuration


Diagram Walkthrough

flowchart LR
  A["Text Input"] --> B["Tokenization"]
  B --> C["Get Token Embeddings"]
  C --> D["ProviderHelper<br/>OpenAI/Azure/DeepSeek"]
  D --> E["Mean-Max Pooling"]
  E --> F["Combined Vector"]
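
For intuition, here is a minimal standalone sketch of the pooling step shown in the diagram (illustrative only; the method name, the 50/50 weights, and the explicit dimension parameter are assumptions, not the plugin's exact code):

// Combine per-token vectors into one vector: a weighted sum of
// element-wise mean pooling and element-wise max pooling.
// Requires System.Linq and System.Collections.Generic.
static float[] MeanMaxPool(IReadOnlyList<float[]> vectors, int dimension,
                           float meanWeight = 0.5f, float maxWeight = 0.5f)
{
    if (vectors.Count == 0) return new float[dimension];

    var pooled = new float[dimension];
    for (var i = 0; i < dimension; i++)
    {
        var mean = vectors.Average(v => v[i]); // average of component i across tokens
        var max = vectors.Max(v => v[i]);      // maximum of component i across tokens
        pooled[i] = meanWeight * mean + maxWeight * max;
    }
    return pooled;
}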

File Walkthrough

Relevant files
Enhancement
MMPEmbeddingPlugin.cs
Plugin registration and dependency injection setup             

src/Plugins/BotSharp.Plugin.MMPEmbedding/MMPEmbeddingPlugin.cs

  • Implements IBotSharpPlugin interface for plugin registration
  • Registers MMPEmbeddingProvider as ITextEmbedding service
  • Provides plugin metadata (Id, Name, Description)
+19/-0   
ProviderHelper.cs
Multi-provider client factory for embedding services         

src/Plugins/BotSharp.Plugin.MMPEmbedding/ProviderHelper.cs

  • Provides factory method to get OpenAI-compatible clients based on
    provider type
  • Handles Azure OpenAI client creation separately with endpoint and
    credentials
  • Supports OpenAI, DeepSeek, and other OpenAI-compatible providers
  • Retrieves provider settings from ILlmProviderService
+70/-0   
MMPEmbeddingProvider.cs
Core MMP embedding provider with pooling logic                     

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs

  • Implements ITextEmbedding interface with mean-max pooling strategy
  • Tokenizes input text using regex word pattern matching
  • Gets embeddings for individual tokens from underlying provider
  • Combines token embeddings using weighted mean and max pooling
  • Supports configurable model, dimension, and underlying provider
+167/-0 
Configuration changes
Using.cs
Global namespace imports for plugin                                           

src/Plugins/BotSharp.Plugin.MMPEmbedding/Using.cs

  • Defines global using statements for common namespaces
  • Includes BotSharp abstraction and ML task dependencies
+10/-0   
BotSharp.Plugin.MMPEmbedding.csproj
Project file for MMP embedding plugin                                       

src/Plugins/BotSharp.Plugin.MMPEmbedding/BotSharp.Plugin.MMPEmbedding.csproj

  • Creates new C# project file for MMPEmbedding plugin
  • References Azure.AI.OpenAI and OpenAI NuGet packages
  • Depends on BotSharp.Core infrastructure project
  • Enables implicit usings and nullable reference types
+18/-0   
BotSharp.sln
Solution file integration for new plugin                                 

BotSharp.sln

  • Adds BotSharp.Plugin.MMPEmbedding project to solution
  • Configures build configurations for Debug and Release
  • Assigns project to Plugins solution folder
+11/-0   

@qodo-merge-pro

PR Compliance Guide 🔍

Below is a summary of compliance checks for this PR:

Security Compliance
Unsafe array indexing

Description: The mean-max pooling assumes every token embedding vector has length equal to the
configured dimension _dimension, but the code never validates or adjusts vector lengths; an
upstream provider that returns shorter vectors will trigger an IndexOutOfRangeException, and one
that returns longer vectors will have its extra components silently dropped.
MMPEmbeddingProvider.cs [116-133]

Referred Code
private float[] MeanMaxPooling(IReadOnlyList<float[]> vectors, double meanWeight = 0.5, double maxWeight = 0.5)
{
    var numTokens = vectors.Count;

    if (numTokens == 0)
        return [];

    var meanPooled = Enumerable.Range(0, _dimension)
        .Select(i => vectors.Average(v => v[i]))
        .ToArray();
    var maxPooled = Enumerable.Range(0, _dimension)
        .Select(i => vectors.Max(v => v[i]))
        .ToArray();

    return Enumerable.Range(0, _dimension)
        .Select(i => (float)meanWeight * meanPooled[i] + (float)maxWeight * maxPooled[i])
        .ToArray();
}
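
A possible guard before pooling (a sketch, not part of the PR; it reuses the existing _dimension field):

// Sketch: fail fast if any returned vector is shorter than the configured dimension,
// instead of throwing IndexOutOfRangeException deep inside the pooling loops.
if (vectors.Any(v => v == null || v.Length < _dimension))
{
    throw new InvalidOperationException(
        $"Embedding provider returned a vector shorter than the configured dimension ({_dimension}).");
}
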
Type confusion risk

Description: The Azure OpenAI client is instantiated via AzureOpenAIClient and returned as its
OpenAIClient base type; callers that treat it as a plain OpenAI client may exercise the wrong API
surface, potentially leading to failures or unintended requests to external services.
ProviderHelper.cs [44-57]

Referred Code
private static OpenAIClient GetAzureOpenAIClient(LlmModelSetting settings)
{
    if (string.IsNullOrEmpty(settings.Endpoint))
    {
        throw new InvalidOperationException("Azure OpenAI endpoint is required");
    }

    var client = new AzureOpenAIClient(
        new Uri(settings.Endpoint),
        new ApiKeyCredential(settings.ApiKey)
    );

    return client;
}
Excessive error disclosure

Description: Errors from the upstream embedding provider are logged with provider and model names and
then rethrown without controlling retry/backoff; this can spam logs and expose
configuration details while repeatedly hitting external APIs, enabling information
disclosure and denial-of-wallet scenarios.
MMPEmbeddingProvider.cs [82-107]

Referred Code
private async Task<List<float[]>> GetTokenEmbeddingsAsync(List<string> tokens)
{
    try
    {
        // Get the appropriate client based on the underlying provider
        var client = ProviderHelper.GetClient(_underlyingProvider, _model, _serviceProvider);
        var embeddingClient = client.GetEmbeddingClient(_model);

        // Prepare options
        var options = new EmbeddingGenerationOptions
        {
            Dimensions = _dimension > 0 ? _dimension : null
        };

        // Get embeddings for all tokens in batch
        var response = await embeddingClient.GenerateEmbeddingsAsync(tokens, options);
        var embeddings = response.Value;

        return embeddings.Select(e => e.ToFloats().ToArray()).ToList();
    }
    catch (Exception ex)


 ... (clipped 5 lines)
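
One possible mitigation (a sketch only; it assumes an injected ILogger field named _logger and a fixed budget of three attempts, neither of which is dictated by the PR):

// Sketch: bounded retry with exponential backoff; log attempt counts, not configuration details.
var attempt = 0;
while (true)
{
    try
    {
        var response = await embeddingClient.GenerateEmbeddingsAsync(tokens, options);
        return response.Value.Select(e => e.ToFloats().ToArray()).ToList();
    }
    catch (Exception ex) when (++attempt < 3)
    {
        _logger.LogWarning(ex, "Token embedding request failed (attempt {Attempt}); retrying", attempt);
        await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
    }
}
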
Ticket Compliance
🎫 No ticket provided
  • Create ticket/issue
Codebase Duplication Compliance
Codebase context is not defined

Follow the guide to enable codebase context checks.

Custom Compliance
🟢
Generic: Meaningful Naming and Self-Documenting Code

Objective: Ensure all identifiers clearly express their purpose and intent, making code
self-documenting

Status: Passed


Generic: Secure Error Handling

Objective: To prevent the leakage of sensitive system information through error messages while
providing sufficient detail for internal debugging.

Status: Passed


Generic: Secure Logging Practices

Objective: To ensure logs are useful for debugging and auditing without exposing sensitive
information like PII, PHI, or cardholder data.

Status: Passed


Generic: Comprehensive Audit Trails

Objective: To create a detailed and reliable record of critical system actions for security analysis
and compliance.

Status:
No auditing: The new embedding operations do not log or emit audit events for critical actions, but it
is unclear if these actions are considered auditable in this context.

Referred Code
public async Task<float[]> GetVectorAsync(string text)
{
    if (string.IsNullOrWhiteSpace(text))
    {
        return new float[_dimension];
    }

    var tokens = Tokenize(text).ToList();

    if (tokens.Count == 0)
    {
        return new float[_dimension];
    }

    // Get embeddings for all tokens
    var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);

    // Apply mean-max pooling
    var pooledEmbedding = MeanMaxPooling(tokenEmbeddings);

    return pooledEmbedding;


 ... (clipped 17 lines)
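
If these operations are deemed auditable, a minimal hook could be added after pooling (a sketch; it assumes an injected ILogger field named _logger, which the referred code does not show):

// Sketch: record provider, model, and token count only; never log the input text itself.
_logger.LogInformation(
    "MMP embedding generated: provider={Provider}, model={Model}, tokens={TokenCount}",
    _underlyingProvider, _model, tokens.Count);
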


Generic: Robust Error Handling and Edge Case Management

Objective: Ensure comprehensive error handling that provides meaningful context and graceful
degradation

Status:
Exception context: Helper methods throw InvalidOperationException without guidance on how callers
should handle or log them, and Azure client creation errors may lack actionable context beyond
what is written to logs.

Referred Code
    if (settings == null)
    {
        throw new InvalidOperationException($"Cannot find settings for provider '{provider}' and model '{model}'");
    }

    // Handle Azure OpenAI separately as it uses AzureOpenAIClient
    if (provider.Equals("azure-openai", StringComparison.OrdinalIgnoreCase))
    {
        return GetAzureOpenAIClient(settings);
    }

    // For OpenAI, DeepSeek, and other OpenAI-compatible providers
    return GetOpenAICompatibleClient(settings);
}

/// <summary>
/// Gets an Azure OpenAI client
/// </summary>
private static OpenAIClient GetAzureOpenAIClient(LlmModelSetting settings)
{
    if (string.IsNullOrEmpty(settings.Endpoint))


 ... (clipped 9 lines)
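
One way to give callers more actionable context (a sketch, not part of the PR) is to wrap client construction and rethrow with the provider name but without credentials or endpoints:

// Sketch: surround Azure client creation with context that is safe to surface.
try
{
    return GetAzureOpenAIClient(settings);
}
catch (Exception ex)
{
    throw new InvalidOperationException(
        $"Failed to create embedding client for provider '{provider}' and model '{model}'.", ex);
}
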


Generic: Security-First Input Validation and Data Handling

Objective: Ensure all data inputs are validated, sanitized, and handled securely to prevent
vulnerabilities

Status:
Input validation: Text inputs are only minimally validated (empty checks), and provider/model
names are used to construct clients without explicit sanitization, so the code implicitly relies
on upstream configuration guarantees.

Referred Code
{
    if (string.IsNullOrWhiteSpace(text))
    {
        return new float[_dimension];
    }

    var tokens = Tokenize(text).ToList();

    if (tokens.Count == 0)
    {
        return new float[_dimension];
    }

    // Get embeddings for all tokens
    var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);

    // Apply mean-max pooling
    var pooledEmbedding = MeanMaxPooling(tokenEmbeddings);

    return pooledEmbedding;
}


 ... (clipped 12 lines)
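
If stricter input handling is wanted, a simple length cap before tokenization is one option (a sketch; the 8,192-character limit is an arbitrary assumption):

// Sketch: bound the amount of text sent to the upstream embedding API.
const int MaxInputChars = 8192;
if (text.Length > MaxInputChars)
{
    text = text[..MaxInputChars];
}
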


Compliance status legend:
🟢 - Fully Compliant
🟡 - Partially Compliant
🔴 - Not Compliant
⚪ - Requires Further Human Verification
🏷️ - Compliance label

@qodo-merge-pro

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
High-level
Re-evaluate the word-level tokenization strategy

The current word-level tokenization should be replaced with chunking into
sentences or paragraphs. This aligns better with how embedding models work,
improving embedding quality and performance.

Examples:

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [47-55]
        var tokens = Tokenize(text).ToList();

        if (tokens.Count == 0)
        {
            return new float[_dimension];
        }

        // Get embeddings for all tokens
        var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);
src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [162-166]
    public static IEnumerable<string> Tokenize(string text, string? pattern = null)
    {
        var patternRegex = string.IsNullOrEmpty(pattern) ? WordRegex : new(pattern, RegexOptions.Compiled);
        return patternRegex.Matches(text).Cast<Match>().Select(m => m.Value);
    }

Solution Walkthrough:

Before:

// MMPEmbeddingProvider.cs
public async Task<float[]> GetVectorAsync(string text)
{
    // Tokenizes "This is a sentence." into ["This", "is", "a", "sentence"]
    var tokens = Tokenize(text).ToList(); 

    // Gets 4 separate embeddings, one for each word, in a batch API call
    var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);

    // Pools the 4 word embeddings
    var pooledEmbedding = MeanMaxPooling(tokenEmbeddings);
    return pooledEmbedding;
}

public static IEnumerable<string> Tokenize(string text, string? pattern = null)
{
    // Regex for words: \b\w+\b
    return WordRegex.Matches(text).Cast<Match>().Select(m => m.Value);
}

After:

// MMPEmbeddingProvider.cs
public async Task<float[]> GetVectorAsync(string text)
{
    // Chunks "Sentence one. Sentence two." into ["Sentence one.", "Sentence two."]
    var chunks = ChunkTextIntoSentences(text); 

    // Gets 2 separate embeddings, one for each sentence
    var chunkEmbeddings = await GetChunkEmbeddingsAsync(chunks);

    // Pools the 2 sentence embeddings
    var pooledEmbedding = MeanMaxPooling(chunkEmbeddings);
    return pooledEmbedding;
}

private List<string> ChunkTextIntoSentences(string text)
{
    // New logic to split text into sentences or paragraphs
    // ...
}
Suggestion importance[1-10]: 9


Why: The suggestion correctly identifies a fundamental design flaw in the tokenization strategy that likely leads to poor embedding quality and inefficient API usage, undermining the PR's goal of improving accuracy.

High
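
If the chunking suggestion above is adopted, a minimal regex-based sentence chunker could look like the following (an illustrative sketch only; a production implementation would likely use a proper sentence segmenter):

// Sketch: split on sentence-ending punctuation followed by whitespace.
// Requires System.Text.RegularExpressions and System.Linq.
private static readonly Regex SentenceRegex = new(@"(?<=[.!?])\s+", RegexOptions.Compiled);

private static List<string> ChunkTextIntoSentences(string text)
{
    return SentenceRegex.Split(text)
        .Select(s => s.Trim())
        .Where(s => s.Length > 0)
        .ToList();
}
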
Possible issue
Fix inconsistent empty embedding handling

In MeanMaxPooling, return a zero-filled array of size _dimension for empty
inputs to match the behavior of GetVectorAsync and prevent potential errors.

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [116-133]

 private float[] MeanMaxPooling(IReadOnlyList<float[]> vectors, double meanWeight = 0.5, double maxWeight = 0.5)
 {
     var numTokens = vectors.Count;
 
     if (numTokens == 0)
-        return [];
+        return new float[_dimension];
 
     var meanPooled = Enumerable.Range(0, _dimension)
         .Select(i => vectors.Average(v => v[i]))
         .ToArray();
     var maxPooled = Enumerable.Range(0, _dimension)
         .Select(i => vectors.Max(v => v[i]))
         .ToArray();
 
     return Enumerable.Range(0, _dimension)
         .Select(i => (float)meanWeight * meanPooled[i] + (float)maxWeight * maxPooled[i])
         .ToArray();
 }
Suggestion importance[1-10]: 8


Why: The suggestion correctly identifies an inconsistency in return values for empty inputs, which could lead to runtime errors, and proposes a fix that makes the component more robust and predictable.

Medium
General
Process text embeddings in parallel

Refactor GetVectorsAsync to process texts in parallel using Task.WhenAll for
better performance.

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [66-77]

 public async Task<List<float[]>> GetVectorsAsync(List<string> texts)
 {
-    var results = new List<float[]>();
-
-    foreach (var text in texts)
-    {
-        var embedding = await GetVectorAsync(text);
-        results.Add(embedding);
-    }
-
-    return results;
+    var tasks = texts.Select(text => GetVectorAsync(text));
+    var embeddings = await Task.WhenAll(tasks);
+    return embeddings.ToList();
 }
Suggestion importance[1-10]: 7


Why: The suggestion correctly identifies a performance bottleneck and proposes a standard parallelization pattern that will significantly improve the method's execution time for multiple texts.

Medium
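
A throttled variant of the parallel approach above would avoid flooding the upstream API (a sketch; the concurrency limit of 4 is an arbitrary assumption):

public async Task<List<float[]>> GetVectorsAsync(List<string> texts)
{
    using var throttler = new SemaphoreSlim(4); // cap concurrent upstream calls
    var tasks = texts.Select(async text =>
    {
        await throttler.WaitAsync();
        try { return await GetVectorAsync(text); }
        finally { throttler.Release(); }
    });
    var embeddings = await Task.WhenAll(tasks);
    return embeddings.ToList();
}
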
Cache compiled regular expressions

Optimize the Tokenize method by caching compiled Regex objects in a
ConcurrentDictionary to avoid expensive re-compilation on each call.

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [162-166]

+private static readonly System.Collections.Concurrent.ConcurrentDictionary<string, Regex> _regexCache = new();
+
+/// <summary>
+/// Tokenizes text into individual words
+/// </summary>
 public static IEnumerable<string> Tokenize(string text, string? pattern = null)
 {
-    var patternRegex = string.IsNullOrEmpty(pattern) ? WordRegex : new(pattern, RegexOptions.Compiled);
+    var patternRegex = string.IsNullOrEmpty(pattern) 
+        ? WordRegex 
+        : _regexCache.GetOrAdd(pattern, p => new Regex(p, RegexOptions.Compiled));
     return patternRegex.Matches(text).Cast<Match>().Select(m => m.Value);
 }
Suggestion importance[1-10]: 6


Why: The suggestion provides a valid performance optimization by caching compiled Regex objects, which is a best practice to avoid repeated compilation overhead in a frequently called method.

Low
Learned
best practice
Validate embeddings before pooling

Add explicit null/empty checks for tokenEmbeddings and ensure each vector has
the expected dimension before pooling to prevent index errors.

src/Plugins/BotSharp.Plugin.MMPEmbedding/Providers/MMPEmbeddingProvider.cs [55-58]

 var tokenEmbeddings = await GetTokenEmbeddingsAsync(tokens);
+if (tokenEmbeddings == null || tokenEmbeddings.Count == 0 || tokenEmbeddings.Any(v => v == null || v.Length < _dimension))
+{
+    return new float[_dimension];
+}
 
 // Apply mean-max pooling
 var pooledEmbedding = MeanMaxPooling(tokenEmbeddings);
Suggestion importance[1-10]: 6


Why:
Relevant best practice - Guard against nulls, empty collections, and invalid states before access to avoid NREs and out-of-range errors.

Low
