Async Clustering Issue Analysis

Problem Statement

Parallel citations (562 U.S. 42, 131 S. Ct. 704, 178 L. Ed. 2d 587) cluster correctly in SYNC mode but appear as separate clusters in ASYNC mode.

Test Results

✅ SYNC Mode (text paste):

Cluster 1: 562 U.S. 42, 131 S. Ct. 704, 178 L. Ed. 2d 587 (all together)

❌ ASYNC Mode (file upload):

Cluster 3: 562 U.S. 42 (separate)
Cluster 4: 131 S. Ct. 704 (separate)
Cluster 5: 178 L. Ed. 2d 587 (separate)

BUT: All three have the SAME canonical source from verification:

Source: Madison County v. Oneida Indian Nation of N. Y., 2011-01-10
All show ✅ VERIFIED

Code Path Analysis

SYNC Processing:

citation_service.py:process_immediately() (line 203)
  ↓
UnifiedCitationProcessorV2.process_text(text)
  ↓
cluster_citations_unified_master() (Phase 5)
  ↓
_detect_parallel_citations() → Groups by proximity
  ↓
_group_by_proximity() → Uses start_index/end_index

ASYNC Processing:

DockerOptimizedProcessor.process_document() (line 466)
  ↓
UnifiedCitationProcessorV2.process_document_citations(text)
  ↓  
UnifiedCitationProcessorV2.process_text(text) (line 4214)
  ↓
cluster_citations_unified_master() (Phase 5)
  ↓
_detect_parallel_citations() → Should group by proximity
  ↓
_group_by_proximity() → Uses start_index/end_index

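SAME CODE PATH! Both modes end up in UnifiedCitationProcessorV2.process_text(), so the clustering logic itself is identical. Any difference in output must come from the data that reaches it (the extracted text or the citation positions).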

Root Cause Theory

The _group_by_proximity() function (unified_clustering_master.py:429) groups citations based on:

- start_index and end_index positions
- Proximity threshold (default 100 characters)

# unified_clustering_master.py:435
sorted_citations = sorted(citations, key=lambda c: getattr(c, 'start_index', 0))

# Lines 448-450:
current_start = getattr(current_citation, 'start_index', 0)
previous_end = getattr(previous_citation, 'end_index', 0)
distance = current_start - previous_end
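
To make the grouping behaviour concrete, here is a minimal, self-contained sketch of proximity grouping built around the lines quoted above. The `Citation` dataclass, the function name `group_by_proximity`, and the `distance <= proximity_threshold` comparison are assumptions for illustration; the real `_group_by_proximity()` may differ in its comparison and data types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    """Hypothetical stand-in for the real citation object."""
    text: str
    start_index: int = 0
    end_index: int = 0


def group_by_proximity(citations: List[Citation],
                       proximity_threshold: int = 100) -> List[List[Citation]]:
    """Put a citation in the current group when the gap to the previous one is small."""
    sorted_citations = sorted(citations, key=lambda c: getattr(c, "start_index", 0))
    groups: List[List[Citation]] = []
    for current_citation in sorted_citations:
        if groups:
            previous_citation = groups[-1][-1]
            current_start = getattr(current_citation, "start_index", 0)
            previous_end = getattr(previous_citation, "end_index", 0)
            distance = current_start - previous_end
            # Assumed comparison: the excerpt does not show the actual check.
            if distance <= proximity_threshold:
                groups[-1].append(current_citation)
                continue
        groups.append([current_citation])
    return groups


# Adjacent parallel citations, as in pasted text.
adjacent = [
    Citation("562 U.S. 42", 0, 11),
    Citation("131 S. Ct. 704", 13, 27),
    Citation("178 L. Ed. 2d 587", 29, 46),
]
# The same citations with positions spread far apart.
spread = [
    Citation("562 U.S. 42", 0, 11),
    Citation("131 S. Ct. 704", 200, 214),
    Citation("178 L. Ed. 2d 587", 400, 417),
]
print(len(group_by_proximity(adjacent)), len(group_by_proximity(spread)))
```

Under these assumptions, gaps of 2 characters keep the three citations in one group, while gaps of 186 or more split them into three, which matches the ASYNC symptom.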

Possible Issues:

1. Missing start_index/end_index:
   - If citations don't carry proper indices, getattr() falls back to 0
   - All citations would then appear to sit at the same position (0)
   - Distance calculations would be meaningless (0 - 0 = 0 for every pair)
2. Incorrect positions:
   - File extraction might not preserve correct positions
   - Text preprocessing might shift positions
3. Distance too large:
   - Extracted text may contain padding between the citations
   - The gap then exceeds proximity_threshold (100 chars)
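
One way to tell these three cases apart is to dump the positions and gaps that the clustering step actually sees in ASYNC mode. The helper below is a hedged diagnostic sketch, not part of the codebase: `position_report` and the `citation_text` attribute are hypothetical names, and only `start_index`/`end_index` come from the code quoted above.

```python
from types import SimpleNamespace


def position_report(citations, proximity_threshold=100):
    """Return one row per citation with its indices and the gap to the previous one."""
    # Missing start_index sorts as 0, mirroring the getattr(..., 0) default.
    ordered = sorted(citations, key=lambda c: getattr(c, "start_index", None) or 0)
    rows = []
    previous_end = None
    for c in ordered:
        start = getattr(c, "start_index", None)
        end = getattr(c, "end_index", None)
        has_indices = start is not None and end is not None
        gap = start - previous_end if has_indices and previous_end is not None else None
        rows.append({
            "citation": getattr(c, "citation_text", repr(c)),
            "start_index": start,
            "end_index": end,
            "missing_indices": not has_indices,
            "gap": gap,
            "exceeds_threshold": gap is not None and gap > proximity_threshold,
        })
        if has_indices:
            previous_end = end
    return rows


# Made-up sample: one citation far away, one without indices.
sample = [
    SimpleNamespace(citation_text="562 U.S. 42", start_index=0, end_index=11),
    SimpleNamespace(citation_text="131 S. Ct. 704", start_index=163, end_index=177),
    SimpleNamespace(citation_text="178 L. Ed. 2d 587"),
]
for row in position_report(sample):
    print(row)
```

Rows flagged `missing_indices` point at issue 1, rows flagged `exceeds_threshold` point at issue 3, and small gaps that still fail to cluster suggest the positions themselves are wrong (issue 2).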

Solution Approaches

Option A: Add Canonical-Based Clustering (RECOMMENDED)

If proximity grouping leaves citations unclustered, group the ones that:

- Have the same canonical_name
- Have the same canonical_date
- Are all verified

This would catch parallel citations even when proximity grouping fails.
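
A minimal sketch of such a fallback, assuming each citation exposes `canonical_name`, `canonical_date`, and a boolean `verified` attribute (the last name is a guess, as is the case-insensitive name matching). It is written as a standalone function here rather than as the planned method.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_canonical_data(citations):
    """Group verified citations sharing (canonical_name, canonical_date).

    Citations that are unverified or lack canonical data stay as singletons.
    """
    buckets = defaultdict(list)
    singletons = []
    for c in citations:
        name = getattr(c, "canonical_name", None)
        date = getattr(c, "canonical_date", None)
        if getattr(c, "verified", False) and name and date:
            # Assumption: names are compared case-insensitively after trimming.
            buckets[(name.strip().lower(), date)].append(c)
        else:
            singletons.append([c])
    return list(buckets.values()) + singletons


name = "Madison County v. Oneida Indian Nation of N. Y."
remaining = [
    SimpleNamespace(text="562 U.S. 42", canonical_name=name,
                    canonical_date="2011-01-10", verified=True),
    SimpleNamespace(text="131 S. Ct. 704", canonical_name=name,
                    canonical_date="2011-01-10", verified=True),
    SimpleNamespace(text="178 L. Ed. 2d 587", canonical_name=name,
                    canonical_date="2011-01-10", verified=True),
    SimpleNamespace(text="1 Hypothetical 1", canonical_name=None,
                    canonical_date=None, verified=False),
]
clusters = group_by_canonical_data(remaining)
print([[c.text for c in cluster] for cluster in clusters])
```

Requiring `verified` keeps unverified citations from being merged on the strength of a name that extraction may have guessed wrong.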

Option B: Fix Position Tracking

Ensure start_index and end_index are correctly set throughout the pipeline.
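
One cheap consistency check, sketched under the assumption that each citation stores its matched string in a `citation_text` attribute (a hypothetical name): slice the processed text with the stored indices and compare. Any mismatch means the positions drifted somewhere between extraction and clustering.

```python
from types import SimpleNamespace


def find_position_mismatches(text, citations):
    """Return citations whose stored indices do not point at their own text."""
    mismatches = []
    for c in citations:
        start = getattr(c, "start_index", None)
        end = getattr(c, "end_index", None)
        if start is None or end is None or text[start:end] != c.citation_text:
            mismatches.append(c)
    return mismatches


document = "See 562 U.S. 42, 131 S. Ct. 704 (2011)."
first = "562 U.S. 42"
second = "131 S. Ct. 704"
good = SimpleNamespace(citation_text=first,
                       start_index=document.find(first),
                       end_index=document.find(first) + len(first))
# Indices shifted by 3, as could happen if preprocessing changed the text.
shifted = SimpleNamespace(citation_text=second,
                          start_index=document.find(second) + 3,
                          end_index=document.find(second) + len(second) + 3)
print([c.citation_text for c in find_position_mismatches(document, [good, shifted])])
```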

Option C: Increase Proximity Threshold

Make proximity_threshold larger to account for file processing differences.

Option D: Use Parallel Citation Metadata

Check if citations have parallel_citations metadata and respect that.
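
As a rough illustration, the sketch below merges citations that reference each other through `parallel_citations`, using a small union-find so that chains (A lists B, B lists C) end up in one cluster. It assumes `parallel_citations` is a list of citation strings and that each citation has a `citation_text` attribute; both shapes are guesses about the real data model.

```python
from types import SimpleNamespace


def group_by_parallel_metadata(citations):
    """Cluster citations linked, directly or transitively, by parallel_citations."""
    index_by_text = {c.citation_text: i for i, c in enumerate(citations)}
    parent = list(range(len(citations)))

    def find(i):
        # Walk to the root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    for i, c in enumerate(citations):
        for other in getattr(c, "parallel_citations", None) or []:
            j = index_by_text.get(other)
            if j is not None:  # ignore references to citations not in this batch
                union(i, j)

    groups = {}
    for i, c in enumerate(citations):
        groups.setdefault(find(i), []).append(c)
    return list(groups.values())


batch = [
    SimpleNamespace(citation_text="562 U.S. 42", parallel_citations=["131 S. Ct. 704"]),
    SimpleNamespace(citation_text="131 S. Ct. 704", parallel_citations=["178 L. Ed. 2d 587"]),
    SimpleNamespace(citation_text="178 L. Ed. 2d 587", parallel_citations=[]),
    SimpleNamespace(citation_text="1 Hypothetical 1"),
]
print([[c.citation_text for c in g] for g in group_by_parallel_metadata(batch)])
```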

Recommended Fix: Option A + D

Primary: Add fallback clustering by canonical data (Option A)
Secondary: Respect existing parallel_citations metadata (Option D)

This makes clustering more robust regardless of position data quality.

Implementation Plan

1. Add a _group_by_canonical_data() method to UnifiedClusteringMaster
2. Call it after _group_by_proximity() for the remaining citations
3. Test with async file upload
4. Verify sync mode still works

Status: Analysis complete, ready to implement fix
Next: Implement Option A (canonical-based clustering fallback)
