
feat: add identity_fields to DataPoint for declarative node deduplication#2125

Open
Vasilije1990 wants to merge 1 commit into dev from feature/identity-fields-datapoint

Conversation


@Vasilije1990 Vasilije1990 commented Feb 8, 2026

Summary

  • Adds identity_fields to MetaData TypedDict and DataPoint.__init__, enabling declarative deterministic UUID5 generation from specified field values
  • Users can now set metadata = {"index_fields": ["name"], "identity_fields": ["name"]} on DataPoint subclasses to automatically deduplicate nodes by name
  • Explicit id overrides still take precedence; models without identity_fields are unaffected (fully backward compatible)
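To make the described behavior concrete, here is a minimal stand-in sketch (illustrative only: the real DataPoint is a Pydantic model, and the key format hashed into the UUID5 is an assumption, not taken from the PR):

```python
from uuid import NAMESPACE_OID, UUID, uuid4, uuid5

class DataPoint:
    """Minimal stand-in for cognee's DataPoint; illustration only."""
    metadata: dict = {}

    def __init__(self, **data):
        if "id" not in data:
            fields = self.metadata.get("identity_fields")
            if fields and all(f in data for f in fields):
                # Deterministic UUID5 keyed on class name + identity field values
                # (the exact key format inside cognee is assumed, not confirmed).
                key = "|".join([type(self).__name__] + [str(data[f]) for f in fields])
                data["id"] = uuid5(NAMESPACE_OID, key)
            else:
                data["id"] = uuid4()  # fallback: random UUID4
        for name, value in data.items():
            setattr(self, name, value)

class Person(DataPoint):
    metadata = {"index_fields": ["name"], "identity_fields": ["name"]}

assert Person(name="Alice").id == Person(name="Alice").id  # deduplicated by name
assert Person(name="Alice").id != Person(name="Bob").id    # distinct values, distinct ids
explicit = Person(id=UUID(int=1), name="Alice")
assert explicit.id == UUID(int=1)                          # explicit id still wins
```

Because the ID is a pure function of the class name and the identity field values, re-ingesting the same entity yields the same node rather than a duplicate.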

Changes

  • cognee/infrastructure/engine/models/DataPoint.py: Added identity_fields to MetaData, modified __init__ to generate deterministic UUID5 when identity_fields is set and no explicit id provided, added _get_identity_fields() and _generate_identity_id() helper methods
  • cognee/tests/unit/infrastructure/engine/test_identity_fields.py: Comprehensive unit tests covering same-value deduplication, cross-type safety, explicit id override, missing field fallback, backward compat, multi-field keys, and string normalization
  • examples/low_level/pipeline.py: Updated example models with identity_fields and simplified ingest_files() to demonstrate declarative deduplication (no more manual dict tracking)

Test plan

  • Run new unit tests: `pytest cognee/tests/unit/infrastructure/engine/test_identity_fields.py -v`
  • Run existing unit tests to verify backward compatibility: `pytest cognee/tests/unit/`
  • Run low-level pipeline example end-to-end: `python examples/low_level/pipeline.py`

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Models now support deterministic ID generation using configurable identity fields, enabling consistent deduplication across instances.
    • Enhanced data ingestion logic with automatic identity-based deduplication.
  • Tests

    • Comprehensive test coverage added for identity field functionality, including edge cases and backward compatibility scenarios.

feat: add identity_fields to DataPoint for declarative node deduplication

Users defining custom DataPoint subclasses can now specify identity_fields
in metadata to auto-generate deterministic UUID5 IDs from field values,
eliminating the need for manual dict-based deduplication. Explicit id
overrides still take precedence, and existing models without identity_fields
remain unaffected (backward compatible).

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: vasilije <[email protected]>

pull-checklist bot commented Feb 8, 2026

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.


coderabbitai bot commented Feb 8, 2026

Walkthrough

This change introduces identity-based UUID generation for DataPoint models using deterministic UUID5 hashing. A new optional identity_fields metadata field specifies which model attributes define uniqueness. New helper methods handle reflection-based identity field retrieval and UUID5 generation. An example pipeline demonstrates the feature.
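For context, `uuid5` in Python's standard library derives the UUID purely from a namespace and a name, so equal inputs always hash to equal IDs across runs and processes (a small sketch, independent of cognee's code):

```python
from uuid import NAMESPACE_OID, uuid4, uuid5

# Name-based UUID5: deterministic, SHA-1 derived
a = uuid5(NAMESPACE_OID, "Company|Acme")
b = uuid5(NAMESPACE_OID, "Company|Acme")
assert a == b and a.version == 5

# Random UUID4: a fresh value on every call
assert uuid4() != uuid4() and uuid4().version == 4
```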

Changes

  • DataPoint Identity Model (cognee/infrastructure/engine/models/DataPoint.py): Adds UUID5 generation logic based on identity fields. Introduces a _get_identity_fields() class method and a _generate_identity_id() static method for deterministic ID generation. Extends the MetaData TypedDict with an optional identity_fields attribute.
  • Identity Fields Test Suite (cognee/tests/unit/infrastructure/engine/test_identity_fields.py): New unit test module validating identity-field behavior: UUID5 consistency for identical values, distinct IDs for different values, explicit-id precedence, fallback to UUID4 for missing fields, backward compatibility, multi-field identity keys, string-normalization effects, and UUID version validation.
  • Pipeline Example Configuration (examples/low_level/pipeline.py): Refactors the ingest_files function to use identity-based deduplication instead of the manual dict-based approach. Adds identity_fields to the Person, Department, Company, and CompanyType metadata, so department and company IDs are deterministic.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 23.08%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title accurately summarizes the main change: adding identity_fields to DataPoint for declarative node deduplication, the core feature across all modified files.
  • Description check ✅ Passed: The PR description covers all template sections with clear information about changes, test plan, and acceptance criteria, though some checklist items are marked incomplete.


coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@cognee/infrastructure/engine/models/DataPoint.py`:
- Around line 51-60: The identity ID is being generated before Pydantic applies
defaults; update DataPoint.__init__ signature to include type hints (e.g., def
__init__(self, **data: Any) -> None) and call super().__init__(**data) first,
then compute the identity by calling self.__class__._get_identity_fields() and,
if present, call
self.__class__._generate_identity_id(self.__class__._get_identity_fields(),
self.model_dump(), self.__class__.__name__) (or equivalent) using the defaulted
values from self.model_dump(); if that returns a non-None id and self.id is not
set, assign it to self.id so deterministic IDs are produced.
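The ordering issue flagged here (computing the identity key from raw **data before defaults exist) can be illustrated with a plain dataclass; Pydantic fills defaults at validation time in the same way:

```python
from dataclasses import asdict, dataclass

@dataclass
class Doc:
    name: str
    version: str = "v1"  # default applied only when the instance is constructed

raw = {"name": "spec"}       # caller omits version
assert "version" not in raw  # hashing raw kwargs would miss the defaulted field
assert asdict(Doc(**raw))["version"] == "v1"  # the built instance does have it
```

Hashing after construction (e.g. from self.model_dump() in Pydantic) makes defaulted values part of the identity key, which is what the suggested fix achieves.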

In `@examples/low_level/pipeline.py`:
- Around line 47-81: Add a docstring to ingest_files explaining its purpose and
inputs, annotate its signature with the correct return type (e.g., ->
List[Company]) and add an explicit type for the local variable all_companies
(e.g., all_companies: List[Company]) so mypy can validate; update imports if
necessary to reference Company/Department/Person types used in the annotations
and keep the implementation unchanged (function name ingest_files, local
variable all_companies).

Comment on lines 51 to +60
```python
def __init__(self, **data):
    if "id" not in data:
        identity_fields = self.__class__._get_identity_fields()
        if identity_fields:
            identity_id = self.__class__._generate_identity_id(
                identity_fields, data, self.__class__.__name__
            )
            if identity_id is not None:
                data["id"] = identity_id
```

This comment was marked as resolved.

Comment on lines 47 to +81
```diff
 def ingest_files(data: List[Data]):
-    people_data_points = {}
-    departments_data_points = {}
-    companies_data_points = {}
+    # With identity_fields, DataPoints with the same name automatically get the same UUID.
+    # No manual dict-based deduplication needed — just create instances freely.
+    all_companies = []

     for data_item in data:
         people = data_item.payload["people"]
         companies = data_item.payload["companies"]

+        # Build departments with their employees
+        dept_employees: Dict[str, List[Person]] = {}
         for person in people:
-            new_person = Person(name=person["name"])
-            people_data_points[person["name"]] = new_person
-
-            if person["department"] not in departments_data_points:
-                departments_data_points[person["department"]] = Department(
-                    name=person["department"], employees=[new_person]
-                )
-            else:
-                departments_data_points[person["department"]].employees.append(new_person)
+            dept_name = person["department"]
+            if dept_name not in dept_employees:
+                dept_employees[dept_name] = []
+            dept_employees[dept_name].append(Person(name=person["name"]))
+        departments = {
+            name: Department(name=name, employees=employees)
+            for name, employees in dept_employees.items()
+        }

-        # Create a single CompanyType node, so we connect all companies to it.
-        companyType = CompanyType()
+        # Create a single CompanyType node (deterministic ID via identity_fields)
+        company_type = CompanyType()

         for company in companies:
-            new_company = Company(name=company["name"], departments=[], is_type=companyType)
-            companies_data_points[company["name"]] = new_company
-
-            for department_name in company["departments"]:
-                if department_name not in departments_data_points:
-                    departments_data_points[department_name] = Department(
-                        name=department_name, employees=[]
-                    )
-
-                new_company.departments.append(departments_data_points[department_name])
-
-    return list(companies_data_points.values())
+            company_departments = [
+                departments.get(dept_name, Department(name=dept_name, employees=[]))
+                for dept_name in company["departments"]
+            ]
+            all_companies.append(
+                Company(name=company["name"], departments=company_departments, is_type=company_type)
+            )
+
+    return all_companies
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```shell
find . -type f -name "pipeline.py" | head -20
```

Repository: topoteretes/cognee
Length of output: 144

🏁 Script executed:

```shell
cat -n examples/low_level/pipeline.py | head -100
```

Repository: topoteretes/cognee
Length of output: 4037

🏁 Script executed:

```shell
rg "^def ingest_files|^class Company|^class Data|^from typing" examples/low_level/pipeline.py -A 2 -B 2
```

Repository: topoteretes/cognee
Length of output: 700


Add return type annotation, variable type annotation, and docstring to ingest_files.

The function lacks a return type annotation, a type hint for the all_companies variable, and a docstring. This violates the project's requirements for type hints (mypy checks enabled) and for documenting function definitions.

✏️ Suggested update

```diff
-def ingest_files(data: List[Data]):
+def ingest_files(data: List[Data]) -> List[Company]:
+    """Build Company instances from raw payloads using identity-based deduplication."""
     # With identity_fields, DataPoints with the same name automatically get the same UUID.
     # No manual dict-based deduplication needed — just create instances freely.
-    all_companies = []
+    all_companies: List[Company] = []
```

