Async Clustering Issue Analysis

Problem Statement

Parallel citations (562 U.S. 42, 131 S. Ct. 704, 178 L. Ed. 2d 587) cluster correctly in SYNC mode but appear as separate clusters in ASYNC mode.

Test Results

✅ SYNC Mode (text paste):

Cluster 1: 562 U.S. 42, 131 S. Ct. 704, 178 L. Ed. 2d 587 (all together)

❌ ASYNC Mode (file upload):

Cluster 3: 562 U.S. 42 (separate)
Cluster 4: 131 S. Ct. 704 (separate)
Cluster 5: 178 L. Ed. 2d 587 (separate)

BUT: All three have the SAME canonical source from verification:

  • Source: Madison County v. Oneida Indian Nation of N. Y., 2011-01-10
  • All show ✅ VERIFIED

Code Path Analysis

SYNC Processing:

citation_service.py:process_immediately() (line 203)
  ↓
UnifiedCitationProcessorV2.process_text(text)
  ↓
cluster_citations_unified_master() (Phase 5)
  ↓
_detect_parallel_citations() → Groups by proximity
  ↓
_group_by_proximity() → Uses start_index/end_index

ASYNC Processing:

DockerOptimizedProcessor.process_document() (line 466)
  ↓
UnifiedCitationProcessorV2.process_document_citations(text)
  ↓  
UnifiedCitationProcessorV2.process_text(text) (line 4214)
  ↓
cluster_citations_unified_master() (Phase 5)
  ↓
_detect_parallel_citations() → Should group by proximity
  ↓
_group_by_proximity() → Uses start_index/end_index

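SAME CODE PATH! Both modes end up in UnifiedCitationProcessorV2.process_text(), so the clustering logic itself is identical. The difference most likely lies in what reaches it: the text extracted from the file, or the citation positions derived from that text.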

Root Cause Theory

The _group_by_proximity() function (unified_clustering_master.py:429) groups citations based on:

  1. start_index and end_index positions
  2. Proximity threshold (default 100 characters)

Relevant code:

# unified_clustering_master.py:435
sorted_citations = sorted(citations, key=lambda c: getattr(c, 'start_index', 0))

# unified_clustering_master.py:448-450
current_start = getattr(current_citation, 'start_index', 0)
previous_end = getattr(previous_citation, 'end_index', 0)
distance = current_start - previous_end

Possible Issues:

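  1. Missing start_index/end_index:

    • If a citation lacks an index attribute, getattr() falls back to 0
    • If both indices are missing on every citation, every distance is 0 - 0 = 0, which would merge citations rather than split them
    • If only end_index is missing, distance equals current_start, which exceeds the threshold for any citation past the first 100 characters and would likely split the cluster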
  2. Incorrect positions:

    • File extraction might not preserve correct positions
    • Text preprocessing might change positions
  3. Distance too large:

    • File extraction may insert padding text (extra whitespace, line breaks) between citations
    • The cluster splits once distance > proximity_threshold (100 chars)
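The three failure modes above can be reproduced with a small sketch of the proximity rule. This is not the code from unified_clustering_master.py: the Citation dataclass is a hypothetical stand-in, the "distance <= threshold" comparison is an assumption, and None is treated like a missing attribute (falls back to 0).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Citation:
    """Hypothetical stand-in for the project's citation objects."""
    text: str
    start_index: Optional[int] = None
    end_index: Optional[int] = None


def group_by_proximity(citations: List[Citation], threshold: int = 100) -> List[List[Citation]]:
    """Group neighbouring citations whose gap is at most `threshold` characters.

    Assumption: a missing (None) index behaves like the getattr(..., 0) default.
    """
    ordered = sorted(citations, key=lambda c: c.start_index or 0)
    groups: List[List[Citation]] = []
    for citation in ordered:
        if groups:
            previous = groups[-1][-1]
            # Same arithmetic as the quoted snippet: current start minus previous end.
            distance = (citation.start_index or 0) - (previous.end_index or 0)
            if distance <= threshold:
                groups[-1].append(citation)
                continue
        groups.append([citation])
    return groups


# Positions are made up for illustration.
good = [
    Citation("562 U.S. 42", 1000, 1011),
    Citation("131 S. Ct. 704", 1013, 1027),
    Citation("178 L. Ed. 2d 587", 1029, 1046),
]
# Same citations, but end_index was never populated.
no_end = [Citation(c.text, c.start_index, None) for c in good]
# Same citations with no position data at all.
no_index = [Citation(c.text) for c in good]
```

Under these assumptions, `good` and `no_index` each yield one group, while `no_end` yields three. A missing end_index alone is therefore enough to reproduce the ASYNC symptom, which makes it worth logging the indices in both modes before changing any clustering logic.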

Solution Approaches

Option A: Add Canonical-Based Clustering (RECOMMENDED)

As a fallback after proximity grouping, merge citations that:

  • Have the same canonical_name
  • Have the same canonical_date
  • Are all verified

This would catch parallel citations even if proximity fails.
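A minimal sketch of such a fallback, assuming each citation exposes canonical_name, canonical_date, and a boolean verified flag. The Citation dataclass and the function name are hypothetical, and the lowercase/strip normalization of the name is an added assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Citation:
    """Hypothetical stand-in; field names follow the ones used in this analysis."""
    text: str
    canonical_name: Optional[str] = None
    canonical_date: Optional[str] = None
    verified: bool = False


def group_by_canonical_data(citations: List[Citation]) -> List[List[Citation]]:
    """Merge verified citations that share (canonical_name, canonical_date).

    Citations that are unverified or lack canonical data stay in singleton groups.
    """
    buckets: Dict[Tuple[str, str], List[Citation]] = defaultdict(list)
    singletons: List[List[Citation]] = []
    for citation in citations:
        if citation.verified and citation.canonical_name and citation.canonical_date:
            # Light normalization (an assumption) so trivial differences do not split a case.
            key = (citation.canonical_name.strip().lower(), citation.canonical_date)
            buckets[key].append(citation)
        else:
            singletons.append([citation])
    return list(buckets.values()) + singletons
```

Keying on name and date together, rather than name alone, reduces the chance of merging distinct decisions between the same parties. Two different cases with identical names decided on the same day would still collide, so a stricter key may be needed if that turns out to happen in practice.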

Option B: Fix Position Tracking

Ensure start_index and end_index are correctly set throughout the pipeline.
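One way to locate where positions go wrong is to check, after each pipeline stage, that every citation's indices still point at its own text. The sketch below assumes end_index is exclusive and that each citation stores the verbatim matched string in a text attribute. Both are assumptions about the data model, and the helper name is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Citation:
    """Hypothetical stand-in for the project's citation objects."""
    text: str
    start_index: Optional[int] = None
    end_index: Optional[int] = None


def find_index_mismatches(document: str, citations: List[Citation]) -> List[Citation]:
    """Return citations whose indices are missing, out of range, or point at other text."""
    mismatches: List[Citation] = []
    for citation in citations:
        start, end = citation.start_index, citation.end_index
        if start is None or end is None:
            mismatches.append(citation)
        elif not (0 <= start < end <= len(document)):
            mismatches.append(citation)
        elif document[start:end] != citation.text:
            # Indices exist but refer to a different version of the text.
            mismatches.append(citation)
    return mismatches
```

Running this on the text that process_text() receives in both modes would show whether the ASYNC path loses indices or computes them against a differently preprocessed string.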

Option C: Increase Proximity Threshold

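Make proximity_threshold larger to account for file processing differences. This risks merging unrelated citations that happen to sit close together, and it does not help if the indices themselves are missing or wrong.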

Option D: Use Parallel Citation Metadata

Check if citations have parallel_citations metadata and respect that.
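If the metadata is available, merging on it amounts to computing connected components. The sketch below assumes parallel_citations can be reduced to a mapping from a citation string to the strings it lists as parallels. That shape is an assumption, as is the function name.

```python
from typing import Dict, List


def merge_by_parallel_metadata(parallels: Dict[str, List[str]]) -> List[List[str]]:
    """Cluster citation strings that are linked, directly or transitively, as parallels."""
    parent: Dict[str, str] = {}

    def find(item: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    def union(left: str, right: str) -> None:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    for citation, linked in parallels.items():
        find(citation)  # register citations that list no parallels
        for other in linked:
            union(citation, other)

    clusters: Dict[str, List[str]] = {}
    for citation in list(parent):
        clusters.setdefault(find(citation), []).append(citation)
    return [sorted(members) for members in clusters.values()]
```

Because the links are followed transitively, a chain such as A→B and B→C ends up in one cluster even when A never lists C directly.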

Recommended Fix: Option A + D

  1. Primary: Add fallback clustering by canonical data (Option A)
  2. Secondary: Respect existing parallel_citations metadata (Option D)

This makes clustering more robust regardless of position data quality.


Implementation Plan

  1. Add _group_by_canonical_data() method to UnifiedClusteringMaster
  2. Call it after _group_by_proximity() for remaining citations
  3. Test with async file upload
  4. Verify sync mode still works

Status: Analysis complete, ready to implement fix

Next: Implement Option A (canonical-based clustering fallback)