
Commit 2d62e4c

Add Fuzzy and semantic match resolvers for KG builder (#319)
* Add fuzzy and spacy resolvers for KG Builder
* Handle mypy issues
* Ruff
1 parent 42ab19a commit 2d62e4c

12 files changed

+2009
-641
lines changed

CHANGELOG.md

+2
@@ -6,6 +6,8 @@

 - Added support for multi-vector collection in Qdrant driver.
 - Added a `Pipeline.stream` method to stream pipeline progress.
+- Added a new semantic match resolver to the KG Builder for entity resolution based on spaCy embeddings and cosine similarities so that nodes with similar textual properties get merged.
+- Added a new fuzzy match resolver to the KG Builder for entity resolution based on RapidFuzz fuzzy string matching.

 ### Changed


docs/source/api.rst

+11
@@ -108,6 +108,17 @@ SinglePropertyExactMatchResolver
 .. autoclass:: neo4j_graphrag.experimental.components.resolver.SinglePropertyExactMatchResolver
     :members: run

+SpaCySemanticMatchResolver
+==========================
+
+.. autoclass:: neo4j_graphrag.experimental.components.resolver.SpaCySemanticMatchResolver
+    :members: run
+
+FuzzyMatchResolver
+==================
+
+.. autoclass:: neo4j_graphrag.experimental.components.resolver.FuzzyMatchResolver
+    :members: run

 .. _pipeline-section:


docs/source/index.rst

+4-1
@@ -99,7 +99,10 @@ List of extra dependencies:
 - **qdrant**: store vectors in Qdrant
 - **experimental**: experimental features mainly from the Knowledge Graph creation pipelines.
     - Warning: this requires `pygraphviz`. Installation instructions can be found `here <https://pygraphviz.github.io/documentation/stable/install.html>`_.
-
+- nlp:
+    - **spaCy**: load spaCy trained models for NLP pipelines, used by the `SpaCySemanticMatchResolver` component from the Knowledge Graph creation pipelines.
+- fuzzy-matching:
+    - **rapidfuzz**: apply fuzzy matching using string similarity, used by the `FuzzyMatchResolver` component from the Knowledge Graph creation pipelines.

 ********
 Examples

docs/source/user_guide_kg_builder.rst

+16-5
@@ -1028,22 +1028,33 @@ without making assumptions about entity similarity. The Entity Resolver
 is responsible for refining the created knowledge graph by merging entity
 nodes that represent the same real-world object.

-In practice, this package implements a simple resolver that merges nodes
-with the same label and identical "name" property.
+In practice, this package implements three resolvers:
+
+- a simple resolver that merges nodes with the same label and identical "name" property;
+- two similarity-based resolvers that merge nodes with the same label and a similar set of textual properties (by default they use the "name" property):
+
+    - a semantic match resolver, which is based on spaCy embeddings and the cosine similarity of embedding vectors. This resolver is ideal for higher-quality KG resolution using static embeddings.
+    - a fuzzy match resolver, which is based on RapidFuzz, a library for rapid fuzzy string matching using the Levenshtein distance. This resolver offers faster ingestion speeds by using string similarity measures, at the potential cost of resolution precision.

 .. warning::

-    The `SinglePropertyExactMatchResolver` **replaces** the nodes created by the KG writer.
+    - The `SinglePropertyExactMatchResolver`, `SpaCySemanticMatchResolver`, and `FuzzyMatchResolver` **replace** the nodes created by the KG writer.
+
+    - Check the :ref:`installation` section to make sure you have the required dependencies installed when using `SpaCySemanticMatchResolver` or `FuzzyMatchResolver`.


-It can be used like this:
+The resolvers can be used like this:

 .. code:: python

     from neo4j_graphrag.experimental.components.resolver import (
         SinglePropertyExactMatchResolver,
+        # SpaCySemanticMatchResolver,
+        # FuzzyMatchResolver,
     )
-    resolver = SinglePropertyExactMatchResolver(driver)
+    resolver = SinglePropertyExactMatchResolver(driver)  # exact match resolver
+    # resolver = SpaCySemanticMatchResolver(driver)  # semantic match with spaCy
+    # resolver = FuzzyMatchResolver(driver)  # fuzzy match with RapidFuzz
     res = await resolver.run()

 .. warning::
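
Reviewer note on the hunk above: both similarity-based resolvers reduce to "score pairs of nodes with the same label, then merge the ones that score above a threshold". The sketch below illustrates that idea on plain strings using a union-find structure. It is an assumption about the general technique, not the code added by this commit; `group_similar` and `toy_score` are hypothetical names, and how the resolvers actually consolidate pairs into merge groups is not visible in this diff.

```python
from itertools import combinations
from typing import Callable


def group_similar(
    names: list[str],
    score: Callable[[str, str], float],
    threshold: float = 0.8,
) -> list[list[str]]:
    """Group names whose pairwise score reaches the threshold (union-find)."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if score(names[i], names[j]) >= threshold:
            # Attach the root of j's group to the root of i's group.
            parent[find(j)] = find(i)

    groups: dict[int, list[str]] = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    # Only groups with more than one member are candidates for merging.
    return [group for group in groups.values() if len(group) > 1]


def toy_score(a: str, b: str) -> float:
    """Hypothetical scorer: 1.0 when names match ignoring case, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


merge_groups = group_similar(["Neo4j", "neo4j", "spaCy", "NEO4J"], toy_score)
print(merge_groups)
```

Any scorer returning a value in [0, 1] can be passed as `score`, which is why the same grouping step could sit behind either an embedding-based or a string-based similarity.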

examples/README.md

+2-1
@@ -127,8 +127,9 @@ are listed in [the last section of this file](#customize).
   - [Neo4j writer](./customize/build_graph/components/writers/neo4j_writer.py)
   - [Custom](./customize/build_graph/components/writers/custom_writer.py)
 - Entity Resolver:
-  - [SinglePropertyExactMatchResolver](./customize/build_graph/components/resolvers/simple_entity_resolver.py)
+  - [FuzzyMatchResolver with pre-filter](./customize/build_graph/components/resolvers/fuzzy_match_entity_resolver_pre_filter.py)
   - [SinglePropertyExactMatchResolver with pre-filter](./customize/build_graph/components/resolvers/simple_entity_resolver_pre_filter.py)
+  - [SpaCySemanticMatchResolver with pre-filter](./customize/build_graph/components/resolvers/spacy_entity_resolver_pre_filter.py)
   - [Custom resolver](./customize/build_graph/components/resolvers/custom_resolver.py)
 - [Custom component](./customize/build_graph/components/custom_component.py)


examples/customize/build_graph/components/resolvers/fuzzy_match_entity_resolver_pre_filter.py

+36
@@ -0,0 +1,36 @@
+"""The FuzzyMatchResolver merges nodes with the same label
+and similar textual properties (by default using the "name" property) based on RapidFuzz
+for string matching.
+
+If the resolution is intended to be applied only on some nodes, for instance nodes that
+belong to a specific document, a "WHERE" query can be added. The only variable in the
+query scope is "entity".
+
+WARNING: this process is destructive, initial nodes are deleted and replaced
+by the resolved ones, but all relationships are kept.
+See apoc.refactor.mergeNodes documentation for more details.
+"""
+
+from neo4j_graphrag.experimental.components.resolver import (
+    FuzzyMatchResolver,
+)
+from neo4j_graphrag.experimental.components.types import ResolutionStats
+
+import neo4j
+
+
+async def main(driver: neo4j.Driver) -> None:
+    resolver = FuzzyMatchResolver(
+        driver,
+        # let's filter all entities that belong to a certain docId
+        filter_query="WHERE (entity)-[:FROM_CHUNK]->(:Chunk)-[:FROM_DOCUMENT]->(doc:"
+        "Document {id: 'docId'})",
+        # optionally, change the properties used for resolution (default is "name")
+        # resolve_properties=["name", "ssn"],
+        # the similarity threshold (default is 0.8)
+        # similarity_threshold=0.9,
+        # and the neo4j database where data is updated
+        # neo4j_database="neo4j",
+    )
+    res: ResolutionStats = await resolver.run()
+    print(res)
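
Reviewer note on `similarity_threshold`: the user guide describes the fuzzy resolver as string matching based on the Levenshtein distance, with a default threshold of 0.8. The sketch below shows one common way to turn an edit distance into a score in [0, 1] that can be compared against such a threshold. It is a pure-Python stand-in written for illustration and does not call RapidFuzz; which RapidFuzz scorer and normalization the resolver uses is not visible in this diff, so `levenshtein` and `similarity` are hypothetical helpers.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep if equal
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


threshold = 0.8  # same value as the default mentioned in the example above
same_entity = similarity("Neo4j Inc", "Neo4j Inc.") >= threshold
different_entity = similarity("Neo4j", "spaCy") >= threshold
```

With this normalization, a single trailing character on a ten-character name still clears the 0.8 threshold, while short unrelated names do not.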

examples/customize/build_graph/components/resolvers/simple_entity_resolver.py

-25
This file was deleted.

examples/customize/build_graph/components/resolvers/spacy_entity_resolver_pre_filter.py

+37
@@ -0,0 +1,37 @@
+"""The SpaCySemanticMatchResolver merges nodes with the same label
+and similar textual properties (by default using the "name" property) based on spaCy
+embeddings and cosine similarities of embedding vectors.
+
+If the resolution is intended to be applied only on some nodes, for instance nodes that
+belong to a specific document, a "WHERE" query can be added. The only variable in the
+query scope is "entity".
+
+WARNING: this process is destructive, initial nodes are deleted and replaced
+by the resolved ones, but all relationships are kept.
+See apoc.refactor.mergeNodes documentation for more details.
+"""
+
+import neo4j
+from neo4j_graphrag.experimental.components.resolver import (
+    SpaCySemanticMatchResolver,
+)
+from neo4j_graphrag.experimental.components.types import ResolutionStats
+
+
+async def main(driver: neo4j.Driver) -> None:
+    resolver = SpaCySemanticMatchResolver(
+        driver,
+        # let's filter all entities that belong to a certain docId
+        filter_query="WHERE (entity)-[:FROM_CHUNK]->(:Chunk)-[:FROM_DOCUMENT]->(doc:"
+        "Document {id: 'docId'})",
+        # optionally, change the properties used for resolution (default is "name")
+        # resolve_properties=["name", "ssn"],
+        # the similarity threshold (default is 0.8)
+        # similarity_threshold=0.9,
+        # the spaCy trained model (default is "en_core_web_lg")
+        # spacy_model="en_core_web_sm",
+        # and the neo4j database where data is updated
+        # neo4j_database="neo4j",
+    )
+    res: ResolutionStats = await resolver.run()
+    print(res)
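
Reviewer note on the semantic resolver: the docstring says nodes are compared through the cosine similarity of their embedding vectors. The sketch below shows that comparison on hand-written toy vectors so it runs without spaCy. The vectors, the variable names and the zero-vector convention are assumptions made for illustration; in the resolver the vectors would come from the configured spaCy model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings"; real model vectors have many more dimensions.
acme_corp = [0.9, 0.1, 0.3]
acme_inc = [0.8, 0.2, 0.3]
banana = [0.1, 0.9, 0.0]

threshold = 0.8  # same value as the default mentioned in the example above
is_match = cosine_similarity(acme_corp, acme_inc) >= threshold
is_other_match = cosine_similarity(acme_corp, banana) >= threshold
```

Because cosine similarity ignores vector length and only compares direction, two names whose embeddings point the same way score close to 1.0 even if their magnitudes differ.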
