Overview:
To improve the effectiveness of the anonymization pipeline, the model must support entity resolution: grouping semantically or textually similar entities so that they are treated as one. This capability ensures consistent anonymization and allows dynamic, user-driven updates across the processed text.
Objective:
Implement a mechanism that identifies and groups similar sub-entities at both prediction and user interaction stages. This aims to improve entity consistency, reduce manual corrections, and enhance the overall user experience when reviewing or editing anonymized data.
Scope:
- This feature applies to all entities detected by the anonymization model.
- Similarity will be determined via text similarity algorithms (e.g., fuzzy matching or vector embeddings).
- The feature includes both automatic grouping during model inference and dynamic re-grouping based on user input.
Requirements:
Automatic Grouping at Prediction Time:
- The model must automatically cluster similar entities based on string similarity or semantic equivalence.
- Example: "Marco Silva", "Sr Silva", and "M. Silva" should be grouped under a unified entity cluster at inference time to ensure consistent anonymization.
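The grouping requirement above can be sketched as follows. This is a minimal illustration rather than the intended implementation: it assumes token-based matching with a hand-written honorific list, and the names `HONORIFICS`, `similar`, and `group_entities` are hypothetical. Semantic (embedding-based) similarity is not covered.

```python
import re
from difflib import SequenceMatcher

# Hypothetical honorific list; a real deployment would need a fuller,
# language-specific one.
HONORIFICS = {"sr", "sra", "dr", "dra", "mr", "mrs", "ms"}


def tokens(mention: str) -> list[str]:
    """Lowercase the mention, strip punctuation, and drop honorifics."""
    words = re.findall(r"\w+", mention.lower())
    return [w for w in words if w not in HONORIFICS]


def token_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """A single-letter token matches on its initial; otherwise compare by ratio."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return SequenceMatcher(None, a, b).ratio() >= threshold


def similar(m1: str, m2: str) -> bool:
    """True when every token of the shorter mention matches a token of the longer."""
    t1, t2 = tokens(m1), tokens(m2)
    if not t1 or not t2:
        return False
    short, long_ = sorted((t1, t2), key=len)
    return all(any(token_match(s, l) for l in long_) for s in short)


def group_entities(mentions: list[str]) -> list[list[str]]:
    """Greedy clustering: a mention joins the first cluster with a similar member."""
    clusters: list[list[str]] = []
    for mention in mentions:
        for cluster in clusters:
            if any(similar(mention, member) for member in cluster):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


print(group_entities(["Marco Silva", "Sr Silva", "M. Silva", "Ana Costa"]))
# [['Marco Silva', 'Sr Silva', 'M. Silva'], ['Ana Costa']]
```

Note that this heuristic would also merge "Sr Silva" with an unrelated "Ana Silva", because the surname is the only token left after the honorific is dropped. Ambiguous cases like this are one reason to let the user confirm groupings.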
Dynamic Entity Update and Propagation:
- Users must be able to create new entities or edit existing ones.
- Upon such updates, the system should re-evaluate the text and apply the changes to all entities deemed similar.
- This requires efficient matching and propagation logic to update entity groupings in real time.
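A possible shape for the propagation step is sketched below. It assumes mentions are held in memory as a simple list and that similarity is a normalized string ratio with a fixed threshold. The `Mention` type, the `propagate_update` function, and the 0.8 threshold are illustrative assumptions, not part of the existing system.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Mention:
    start: int       # character offset of the mention in the document
    text: str        # surface form as detected
    entity_id: str   # identifier of the entity group it belongs to


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so 'Sr. Silva' and 'Sr Silva' compare equal."""
    return " ".join(re.findall(r"\w+", text.lower()))


def propagate_update(
    mentions: list[Mention],
    edited_index: int,
    new_entity_id: str,
    threshold: float = 0.8,
) -> list[int]:
    """Reassign the edited mention and every mention with a similar surface form.

    Returns the indices of the mentions whose entity_id was changed.
    """
    target = normalize(mentions[edited_index].text)
    changed: list[int] = []
    for i, mention in enumerate(mentions):
        if mention.entity_id == new_entity_id:
            continue  # already in the target group
        score = SequenceMatcher(None, target, normalize(mention.text)).ratio()
        if i == edited_index or score >= threshold:
            mention.entity_id = new_entity_id
            changed.append(i)
    return changed


mentions = [
    Mention(0, "Marco Silva", "E1"),
    Mention(40, "Sr Silva", "E2"),
    Mention(90, "Sr. Silva", "E3"),
    Mention(120, "Ana Costa", "E4"),
]
# The user regroups "Sr Silva" (index 1) with "Marco Silva" (entity E1).
print(propagate_update(mentions, edited_index=1, new_entity_id="E1"))
# [1, 2]
```

The scan is linear in the number of mentions, which is likely acceptable for a single document. Larger datasets would probably need an index over normalized forms to stay responsive.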
Example Use Case:
A user detects that “Marco Silva” and “Sr Silva” were treated as distinct entities. After correcting “Sr Silva” to be grouped with “Marco Silva,” the system should identify all other similar instances in the document and apply the update automatically.
Proposal:
Entity Grouping Mechanism:
Use fuzzy regex matching, i.e., pattern matching that tolerates small differences as measured by Levenshtein distance or token-based similarity, to identify and group similar entities during prediction. Regex patterns can be generated dynamically from known entity variants so that they also match variants not yet seen in the text.
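The pattern-generation half of this proposal could look like the sketch below. It assumes a canonical "given names + surname" form and a small honorific list; `HONORIFICS` and `build_pattern` are hypothetical names. The pattern uses only the standard `re` module, so it covers structural variants (titles, initials, omitted given names) but not typos. Edit-distance tolerance would have to come from a separate fuzzy matcher.

```python
import re

# Hypothetical honorific list used to build the optional title prefix.
HONORIFICS = ["Sr", "Sra", "Dr", "Dra"]


def build_pattern(full_name: str) -> re.Pattern:
    """Build a regex that matches common variants of a known full name.

    Each given name may appear in full, as an initial (with or without a
    period), or be omitted; an optional honorific may precede the name.
    """
    *given, surname = full_name.split()
    title = r"(?:(?:%s)\.?\s+)?" % "|".join(HONORIFICS)
    given_part = "".join(
        r"(?:(?:%s|%s\.?)\s+)?" % (re.escape(g), re.escape(g[0])) for g in given
    )
    return re.compile(
        r"\b" + title + given_part + re.escape(surname) + r"\b",
        re.IGNORECASE,
    )


pattern = build_pattern("Marco Silva")
text = "Marco Silva met Sr Silva; later M. Silva and Dr. Silva signed. Ana Costa left."
print([m.group(0) for m in pattern.finditer(text)])
# ['Marco Silva', 'Sr Silva', 'M. Silva', 'Dr. Silva']
```

Because the given name is optional, the pattern matches any standalone occurrence of the surname. A document with several people named Silva would therefore need a disambiguation step before grouping.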
User Interaction Flow:
Implement a new API endpoint to support entity update workflows. This endpoint will:
- Accept user-edited entities as input.
- Return a list of recommended similar entities identified via the fuzzy matching engine.
- Apply user-confirmed updates to all matched entities across the dataset.
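The three endpoint responsibilities above can be illustrated with a framework-agnostic handler. Everything here is an assumption made for the sketch: the request and response shapes, the `handle_entity_update` name, the in-memory `entity_index` dictionary standing in for the dataset, and the 0.8 similarity threshold. HTTP routing, authentication, and persistence are deliberately left out.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class EntityUpdateRequest:
    edited_text: str        # surface form the user edited
    target_entity_id: str   # entity group the user assigned it to
    confirmed: list[str] = field(default_factory=list)  # suggestions the user approved


@dataclass
class EntityUpdateResponse:
    suggestions: list[str]  # similar surface forms still awaiting confirmation
    applied: list[str]      # surface forms reassigned by this call


def handle_entity_update(
    request: EntityUpdateRequest,
    entity_index: dict[str, str],
    threshold: float = 0.8,
) -> EntityUpdateResponse:
    """Apply the user's edit, apply confirmed suggestions, and recommend the rest.

    entity_index maps each surface form to its entity id and is updated in place.
    """
    edited = request.edited_text.lower()
    suggestions = [
        form
        for form, entity_id in entity_index.items()
        if entity_id != request.target_entity_id
        and form != request.edited_text
        and SequenceMatcher(None, edited, form.lower()).ratio() >= threshold
    ]
    applied: list[str] = []
    for form in [request.edited_text, *request.confirmed]:
        if form in entity_index and entity_index[form] != request.target_entity_id:
            entity_index[form] = request.target_entity_id
            applied.append(form)
    pending = [s for s in suggestions if s not in applied]
    return EntityUpdateResponse(suggestions=pending, applied=applied)


index = {"Marco Silva": "E1", "Sr Silva": "E2", "Sr. Silva": "E3", "Ana Costa": "E4"}

# First call: the user's own edit is applied and similar forms are suggested.
first = handle_entity_update(EntityUpdateRequest("Sr Silva", "E1"), index)
print(first.suggestions, first.applied)   # ['Sr. Silva'] ['Sr Silva']

# Second call: the user confirms the suggestion, which is then applied.
second = handle_entity_update(
    EntityUpdateRequest("Sr Silva", "E1", confirmed=["Sr. Silva"]), index
)
print(second.suggestions, second.applied)  # [] ['Sr. Silva']
```

Splitting the flow into a suggest call and a confirm call keeps the user in control of which matches are propagated, at the cost of one extra round trip.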