@yukiharada1228 yukiharada1228 commented Dec 17, 2025

Add metadata-based filtering support for delete operations

Summary

This PR adds support for metadata-based filtering in delete() and adelete() methods, enabling bulk deletion of documents based on metadata criteria rather than just by IDs.

Motivation

Currently, the delete methods only support deletion by document IDs, which is limiting for common use cases:

  • Bulk deletions: Users often need to delete groups of documents based on metadata (e.g., all documents from a specific source, time period, or category)
  • Data lifecycle management: Remove documents based on status, expiration dates, or other metadata flags
  • Cleanup operations: Delete documents matching specific criteria without knowing their IDs

Other vector stores (Chroma, Pinecone, Weaviate) already support metadata-based deletion, and the infrastructure for metadata filtering already exists in this codebase via the _create_filter_clause() method.

Changes

Modified Files

  1. langchain_postgres/v2/async_vectorstore.py

    • Enhanced adelete() method to accept optional filter parameter
    • Supports deletion by IDs, filter, or both (combined with AND logic)
    • Leverages existing _create_filter_clause() for consistent filter syntax
    • Updated docstring with comprehensive examples and important limitations
  2. langchain_postgres/v2/vectorstores.py

    • Updated both adelete() and delete() methods to accept filter parameter
    • Sync wrapper properly passes filter to async implementation
    • Updated docstrings with examples and limitations
  3. Test files

    • Added 7 new test cases covering various filtering scenarios
    • Tests for simple filters, operators, complex filters, combined ID+filter, and edge cases
    • All tests use metadata_columns to ensure proper filtering behavior
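
The combination rule described for adelete() (IDs only, filter only, or both joined with AND) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: build_delete_sql, its parameters, and the :id_N placeholder naming are hypothetical, and the filter clause is assumed to arrive already rendered as a parameterized SQL fragment. Returning None when neither argument is supplied is also an assumption.

```python
from typing import Any, Optional, Sequence


def build_delete_sql(
    table: str,
    id_column: str,
    ids: Optional[Sequence[str]] = None,
    filter_clause: Optional[str] = None,
    filter_params: Optional[dict[str, Any]] = None,
) -> Optional[tuple[str, dict[str, Any]]]:
    """Hypothetical helper: combine an ID list and a filter clause with AND."""
    conditions: list[str] = []
    params: dict[str, Any] = {}

    if ids:
        # One named placeholder per ID keeps the values out of the SQL text.
        placeholders = ", ".join(f":id_{i}" for i in range(len(ids)))
        conditions.append(f'"{id_column}" IN ({placeholders})')
        params.update({f"id_{i}": value for i, value in enumerate(ids)})

    if filter_clause:
        # Parenthesize so an OR inside the filter cannot escape the AND.
        conditions.append(f"({filter_clause})")
        params.update(filter_params or {})

    if not conditions:
        # Assumption: with neither IDs nor a filter there is nothing to delete.
        return None

    return f'DELETE FROM "{table}" WHERE ' + " AND ".join(conditions), params
```

Calling it with ids=["id1", "id2"] and a filter clause of status = :status_0 yields a single DELETE whose WHERE clause requires both the ID match and the filter match, which mirrors the "must match both criteria" behavior described in this PR.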

Usage Examples

Setup: Define metadata columns

# First, create vectorstore with metadata columns
vectorstore = await AsyncPGVectorStore.create(
    engine,
    embedding_service=embeddings_service,
    table_name="my_documents",
    metadata_columns=["source", "category", "year", "status"],  # Define filterable fields
)

Delete by metadata filter only

# Delete all documents from a specific source
await vectorstore.adelete(filter={"source": "documentation"})

# Delete documents with numeric comparisons
await vectorstore.adelete(filter={"year": {"$lt": 2020}})

# Delete with complex filters
await vectorstore.adelete(
    filter={"$and": [{"category": "obsolete"}, {"status": "archived"}]}
)

Delete by IDs only (existing behavior)

await vectorstore.adelete(ids=["id1", "id2", "id3"])

Delete by both IDs and filter (must match both criteria)

# Only deletes documents that match BOTH the ID list AND the filter
await vectorstore.adelete(
    ids=["id1", "id2", "id3"],
    filter={"status": "archived"}
)

Sync methods work identically

# Sync version
vectorstore.delete(filter={"source": "deprecated"})

Filter Syntax

The filter parameter supports the same rich filtering syntax as similarity_search():

  • Equality: {"field": "value"}
  • Comparison operators: {"field": {"$lt": 100}} ($eq, $ne, $lt, $lte, $gt, $gte)
  • List operators: {"field": {"$in": [1, 2, 3]}} ($in, $nin)
  • Text operators: {"field": {"$like": "pattern%"}} ($like, $ilike)
  • Logical operators: {"$and": [...]}, {"$or": [...]}, {"$not": {...}}
  • Existence: {"field": {"$exists": True}}
  • Range: {"field": {"$between": [10, 20]}}
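
To make the syntax above concrete, here is a minimal sketch of how such a filter dict could be turned into a parameterized WHERE fragment. It is not the library's _create_filter_clause(): to_where, COMPARISONS, and the :pN parameter names are hypothetical, only equality, the comparison operators, $and, and $or are covered, and joining multiple top-level keys with AND is an assumption.

```python
from typing import Any

# Comparison operators from the list above, mapped to their SQL spelling.
COMPARISONS = {"$eq": "=", "$ne": "!=", "$lt": "<", "$lte": "<=", "$gt": ">", "$gte": ">="}


def to_where(spec: dict[str, Any], params: dict[str, Any]) -> str:
    """Hypothetical translator: render a filter dict as a parameterized WHERE fragment.

    Values are collected into `params`; field names are assumed to have been
    validated against known column names before this is called.
    """
    parts: list[str] = []
    for key, value in spec.items():
        if key in ("$and", "$or"):
            joiner = " AND " if key == "$and" else " OR "
            parts.append("(" + joiner.join(to_where(sub, params) for sub in value) + ")")
        elif key.startswith("$"):
            raise ValueError(f"operator {key} is not covered by this sketch")
        elif isinstance(value, dict):
            # {"field": {"$lt": 100}} style: one condition per operator.
            for op, operand in value.items():
                if op not in COMPARISONS:
                    raise ValueError(f"operator {op} is not covered by this sketch")
                name = f"p{len(params)}"
                params[name] = operand
                parts.append(f'"{key}" {COMPARISONS[op]} :{name}')
        else:
            # {"field": "value"} style: plain equality.
            name = f"p{len(params)}"
            params[name] = value
            parts.append(f'"{key}" = :{name}')
    return " AND ".join(parts)


params: dict[str, Any] = {}
clause = to_where({"$and": [{"category": "obsolete"}, {"year": {"$lt": 2020}}]}, params)
```

The point of the design is that operand values never appear in the SQL text; they travel separately in params and are bound by the database driver.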

⚠️ Important Limitation

Filters only work on fields defined in metadata_columns, not on fields stored in metadata_json_column.

This is consistent with how similarity_search() filtering works. To use metadata-based deletion, you must define the metadata fields as actual database columns when creating the vectorstore:

# ✅ Correct: Define metadata columns
vectorstore = await AsyncPGVectorStore.create(
    engine,
    embedding_service=embeddings_service,
    table_name="my_table",
    metadata_columns=["source", "category", "year"],  # These fields can be filtered
)

# Now you can filter on these columns
await vectorstore.adelete(filter={"source": "documentation"})

# ❌ Won't work: Fields only in metadata_json_column cannot be filtered
vectorstore = await AsyncPGVectorStore.create(
    engine,
    embedding_service=embeddings_service,
    table_name="my_table",
    # No metadata_columns defined - all metadata goes to JSON column
)

# This will fail - "source" is not a database column
await vectorstore.adelete(filter={"source": "documentation"})

Restricting filters to dedicated columns is a deliberate design choice: it gives better query performance and lets PostgreSQL use its native indexes on those columns.
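
Because a filter on an undeclared field fails only at query time, callers may want to check a filter before issuing a delete. The sketch below is one possible pre-check and is not part of this PR: filter_fields and check_filterable are hypothetical helpers, and the list of column names is assumed to be the same one passed as metadata_columns when the store was created.

```python
from typing import Any, Iterable


def filter_fields(spec: dict[str, Any]) -> set[str]:
    """Collect every metadata field name referenced in a filter dict."""
    fields: set[str] = set()
    for key, value in spec.items():
        if key in ("$and", "$or", "$not"):
            # Logical operators wrap one sub-filter or a list of them.
            subs = value if isinstance(value, list) else [value]
            for sub in subs:
                fields |= filter_fields(sub)
        else:
            fields.add(key)
    return fields


def check_filterable(spec: dict[str, Any], metadata_columns: Iterable[str]) -> None:
    """Raise ValueError if the filter touches a field that is not a declared column."""
    missing = filter_fields(spec) - set(metadata_columns)
    if missing:
        raise ValueError(f"not in metadata_columns: {sorted(missing)}")


columns = ["source", "category", "year"]
check_filterable({"$and": [{"category": "obsolete"}, {"year": {"$lt": 2020}}]}, columns)
```

With these columns, a filter on "status" would raise ValueError before any SQL is sent, which gives a clearer message than a database error about an unknown column.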

Implementation Details

  • Backward compatible: Existing code using adelete(ids=[...]) continues to work unchanged
  • SQL injection safe: Uses parameterized queries via existing _create_filter_clause() method
  • Consistent behavior: Filter syntax matches similarity_search()
  • Performance: Generates efficient SQL DELETE statements with WHERE clauses on indexed columns
  • Type safe: Full type hints and passes mypy strict checking

Test Coverage

Added comprehensive test coverage (all tests passing):

  • test_adelete_with_filter: Basic metadata filter deletion
  • test_adelete_with_filter_and_operator: Deletion with comparison operators
  • test_adelete_with_complex_filter: Complex filters with logical operators
  • test_adelete_with_filter_and_ids: Combined ID and filter deletion
  • test_adelete_with_filter_no_matches: Graceful handling of no matches
  • test_adelete_with_filter (sync): Async method in sync wrapper
  • test_delete_with_filter (sync): Sync method filtering
  • test_adelete: Existing tests continue to pass

All tests follow existing patterns and integrate with the current test suite.

Breaking Changes

None. This is a backward-compatible enhancement:

  • All existing code continues to work unchanged
  • New filter parameter is optional
  • Default behavior (no parameters) remains the same

Checklist

  • Implementation complete for async methods
  • Implementation complete for sync methods
  • Comprehensive test coverage added (7 tests, all passing)
  • Code passes ruff linting
  • Code passes mypy type checking
  • Docstrings updated with examples and limitations
  • Backward compatible with existing code
  • Follows existing codebase patterns

Related Issues

Closes #271

Additional Notes

Why metadata_columns are required for filtering

This implementation reuses the robust _create_filter_clause() method that's already extensively tested for search operations. The method generates SQL WHERE clauses that operate on actual database columns, which provides:

  1. Better performance: Direct column filtering is faster than JSON field extraction
  2. Index support: Metadata columns can be indexed for even better performance
  3. Type safety: Database column types ensure type correctness
  4. Consistency: Same behavior as similarity_search() filtering

This design is consistent with the existing filtering implementation and aligns with how other parts of the codebase handle metadata filtering.

yukiharada1228 and others added 4 commits November 5, 2025 16:46
- Add filter parameter to adelete and delete methods in both async and sync vectorstores
- Support complex filter syntax (operators, etc.) for bulk deletion
- Add comprehensive test cases for filter-based deletion scenarios
- Update documentation with examples for filter-based deletion
… operations

- Add note that filters only work on metadata_columns, not metadata_json_column
- Update tests to use metadata_columns instead of langchain_metadata
- Ensure tests properly test filtering functionality with dedicated metadata columns
@yukiharada1228 yukiharada1228 changed the title from "Support metadata based filtering for delete operations" to "Feature: Support metadata based filtering for delete operations" Dec 17, 2025
@dishaprakash
Collaborator

@yukiharada1228 Thank you for opening this PR!

@dishaprakash dishaprakash left a comment

@yukiharada1228
Contributor Author

yukiharada1228 commented Dec 17, 2025

Updated! I've added documentation for the metadata filter-based deletion feature in the how-to notebook, including:

@yukiharada1228
Contributor Author

Fixed formatting issues discovered when running make lint.

Changes Made

  • Ran ruff format and reformatted 5 files:
    • examples/pg_vectorstore_how_to.ipynb (notebook cells)
    • langchain_postgres/v2/async_vectorstore.py
    • langchain_postgres/v2/vectorstores.py
    • tests/unit_tests/v2/test_async_pg_vectorstore.py
    • tests/unit_tests/v2/test_pg_vectorstore.py

Fix Details

  • Split long lines appropriately
  • Unified formatting of dictionaries and lists
  • Properly formatted function definition arguments

Verification Results

make lint

✅ All checks passed

  • ruff format: 43 files already formatted
  • ruff check: No errors
  • mypy: Success (39 source files)

Commit: b0b5e20

@yukiharada1228
Contributor Author

@averikitsch Thank you very much for the review and approval! 🙏
I appreciate you taking the time to go through the changes.

@yukiharada1228
Contributor Author

@dishaprakash Thank you for the review and guidance on the documentation updates!

I've addressed the feedback by updating the how-to notebook with examples for metadata-based deletion, and the changes have now been approved.
Please let me know if there's anything else you'd like me to adjust.
