Parallel citations (562 U.S. 42, 131 S. Ct. 704, 178 L. Ed. 2d 587) cluster correctly in SYNC mode but appear as separate clusters in ASYNC mode.
Cluster 1: 562 U.S. 42, 131 S. Ct. 704, 178 L. Ed. 2d 587 (all together)
Cluster 3: 562 U.S. 42 (separate)
Cluster 4: 131 S. Ct. 704 (separate)
Cluster 5: 178 L. Ed. 2d 587 (separate)
BUT: All three have the SAME canonical source from verification:
- Source: Madison County v. Oneida Indian Nation of N. Y., 2011-01-10
- All show ✅ VERIFIED
Sync path:
citation_service.py:process_immediately() (line 203)
↓
UnifiedCitationProcessorV2.process_text(text)
↓
cluster_citations_unified_master() (Phase 5)
↓
_detect_parallel_citations() → Groups by proximity
↓
_group_by_proximity() → Uses start_index/end_index
Async path:
DockerOptimizedProcessor.process_document() (line 466)
↓
UnifiedCitationProcessorV2.process_document_citations(text)
↓
UnifiedCitationProcessorV2.process_text(text) (line 4214)
↓
cluster_citations_unified_master() (Phase 5)
↓
_detect_parallel_citations() → Should group by proximity
↓
_group_by_proximity() → Uses start_index/end_index
SAME CODE PATH! Both use UnifiedCitationProcessorV2.process_text().
The _group_by_proximity() function (unified_clustering_master.py:429) groups citations based on:
- start_index and end_index positions
- Proximity threshold (default 100 characters)
# unified_clustering_master.py:435
sorted_citations = sorted(citations, key=lambda c: getattr(c, 'start_index', 0))
# Lines 448-450:
current_start = getattr(current_citation, 'start_index', 0)
previous_end = getattr(previous_citation, 'end_index', 0)
distance = current_start - previous_end
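To make the distance rule concrete, here is a minimal, self-contained sketch of proximity grouping. It is not the code in unified_clustering_master.py: the `Citation` dataclass, the `group_by_proximity` function, and the `<=` comparison against the threshold are assumptions for illustration. Only the sort key, the `getattr(..., 0)` defaults, and the distance formula follow the excerpt above.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical stand-in for the real citation object.
    text: str
    start_index: int = 0
    end_index: int = 0


def group_by_proximity(citations, proximity_threshold=100):
    """Group citations whose gap to the previous citation is within the threshold."""
    ordered = sorted(citations, key=lambda c: getattr(c, "start_index", 0))
    groups = []
    for citation in ordered:
        if groups:
            previous = groups[-1][-1]
            distance = getattr(citation, "start_index", 0) - getattr(previous, "end_index", 0)
            # Assumption: a gap equal to the threshold still counts as "close".
            if distance <= proximity_threshold:
                groups[-1].append(citation)
                continue
        groups.append([citation])
    return groups


# Citations a few characters apart end up in one group.
close = [
    Citation("562 U.S. 42", 0, 11),
    Citation("131 S. Ct. 704", 13, 27),
    Citation("178 L. Ed. 2d 587", 29, 46),
]
# The same citations with shifted positions end up in separate groups.
spread = [
    Citation("562 U.S. 42", 0, 11),
    Citation("131 S. Ct. 704", 500, 514),
    Citation("178 L. Ed. 2d 587", 1000, 1017),
]
print(len(group_by_proximity(close)), len(group_by_proximity(spread)))
```

The second input mimics what the async symptom would look like if positions drift apart during file processing: the grouping logic is unchanged, only the indices differ.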
Missing start_index/end_index:
- If citations don't have proper indices, they default to 0
- All citations would have the same position (0)
- Distance calculations would be wrong
Incorrect positions:
- File extraction might not preserve correct positions
- Text preprocessing might change positions
Distance too large:
- Padding text between citations may widen the gap
- The gap then exceeds proximity_threshold (100 chars)
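One way to tell these causes apart is to audit the position data just before clustering runs. The sketch below is a hypothetical diagnostic, not part of the existing pipeline: the `audit_positions` helper and the `citation` attribute (holding the citation text) are assumed names, and it presumes the text passed in is the same text the indices were computed against.

```python
from types import SimpleNamespace


def audit_positions(citations, text):
    """Return (label, problem) pairs for citations with unusable position data."""
    problems = []
    for c in citations:
        label = getattr(c, "citation", "<unknown>")  # assumed attribute name
        start = getattr(c, "start_index", None)
        end = getattr(c, "end_index", None)
        if start is None or end is None:
            problems.append((label, "missing index"))
        elif not (0 <= start < end <= len(text)):
            problems.append((label, "index out of range"))
        elif text[start:end] != label:
            problems.append((label, "index does not match text"))
    return problems


text = "See 562 U.S. 42, 131 S. Ct. 704 (2011)."
start = text.index("562 U.S. 42")
sample = [
    # Correct positions.
    SimpleNamespace(citation="562 U.S. 42", start_index=start, end_index=start + 11),
    # No indices at all.
    SimpleNamespace(citation="131 S. Ct. 704"),
    # Indices shifted by three characters, as preprocessing might cause.
    SimpleNamespace(citation="131 S. Ct. 704", start_index=20, end_index=34),
]
for label, problem in audit_positions(sample, text):
    print(f"{label}: {problem}")
```

Running a check like this on the async path would show whether indices are missing, shifted, or simply far apart, which narrows down which of the causes above applies.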
Option A: After proximity grouping fails, group citations that:
- Have the same canonical_name
- Have the same canonical_date
- Are verified
This would catch parallel citations even if proximity fails.
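A minimal sketch of such a fallback is shown below. It is an illustration under assumptions, not the final implementation: the `verified` attribute name, the case-insensitive name comparison, and the standalone `group_by_canonical_data` function are hypothetical. Only `canonical_name` and `canonical_date` come from the analysis above.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_canonical_data(citations):
    """Group verified citations that share the same canonical name and date."""
    buckets = defaultdict(list)
    leftovers = []
    for c in citations:
        name = getattr(c, "canonical_name", None)
        date = getattr(c, "canonical_date", None)
        # Assumption: verification status lives in a boolean `verified` attribute.
        if getattr(c, "verified", False) and name and date:
            buckets[(name.strip().lower(), date)].append(c)
        else:
            leftovers.append(c)
    groups = list(buckets.values())
    # Unverified or incomplete citations stay as singleton groups.
    groups.extend([c] for c in leftovers)
    return groups


name = "Madison County v. Oneida Indian Nation of N. Y."
remaining = [
    SimpleNamespace(citation="562 U.S. 42", canonical_name=name,
                    canonical_date="2011-01-10", verified=True),
    SimpleNamespace(citation="131 S. Ct. 704", canonical_name=name,
                    canonical_date="2011-01-10", verified=True),
    SimpleNamespace(citation="178 L. Ed. 2d 587", canonical_name=name,
                    canonical_date="2011-01-10", verified=True),
    SimpleNamespace(citation="1 F.3d 1", canonical_name=None,
                    canonical_date=None, verified=False),
]
for group in group_by_canonical_data(remaining):
    print([c.citation for c in group])
```

Restricting the merge to verified citations keeps the fallback conservative: two unverified citations are never merged on the strength of a guessed name.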
Option B: Ensure start_index and end_index are correctly set throughout the pipeline.
Option C: Make proximity_threshold larger to account for file processing differences.
Option D: Check if citations have parallel_citations metadata and respect that.
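Respecting parallel_citations metadata could look roughly like the sketch below. The shape of that metadata is not described here, so the sketch assumes each citation carries a `citation` text attribute and an optional `parallel_citations` list of citation strings; both attribute shapes and the `group_by_parallel_metadata` helper are hypothetical. It merges linked citations with a small union-find and assumes citation texts are unique within the input.

```python
from types import SimpleNamespace


def group_by_parallel_metadata(citations):
    """Merge citations that reference each other via parallel_citations."""
    # Each citation text starts as its own group representative.
    parent = {c.citation: c.citation for c in citations}

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for c in citations:
        for other in getattr(c, "parallel_citations", None) or []:
            if other in parent:  # ignore references to citations not in this batch
                parent[find(other)] = find(c.citation)

    groups = {}
    for c in citations:
        groups.setdefault(find(c.citation), []).append(c)
    return list(groups.values())


batch = [
    SimpleNamespace(citation="562 U.S. 42", parallel_citations=["131 S. Ct. 704"]),
    SimpleNamespace(citation="131 S. Ct. 704"),
    SimpleNamespace(citation="178 L. Ed. 2d 587", parallel_citations=["131 S. Ct. 704"]),
    SimpleNamespace(citation="1 F.3d 1"),
]
for group in group_by_parallel_metadata(batch):
    print([c.citation for c in group])
```

Because the links are merged transitively, it is enough for one citation in a set to name another; the metadata does not need to be symmetric.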
- Primary: Add fallback clustering by canonical data (Option A)
- Secondary: Respect existing parallel_citations metadata (Option D)
This makes clustering more robust regardless of position data quality.
- Add _group_by_canonical_data() method to UnifiedClusteringMaster
- Call it after _group_by_proximity() for remaining citations
- Test with async file upload
- Verify sync mode still works
Status: Analysis complete, ready to implement fix
Next: Implement Option A (canonical-based clustering fallback)