Skip to content

Latest commit

 

History

History
150 lines (119 loc) · 3.99 KB

File metadata and controls

150 lines (119 loc) · 3.99 KB

Current Architecture - CaseStrainer Extraction System

Last Updated: October 16, 2025
Status: ✅ Consolidated & Optimized


🎯 ACTIVE CODE - Use These

Primary Extraction & Cleaning

src/unified_case_extraction_master.py
  • Status: ✅ ACTIVE - Single source of truth
  • Purpose: ALL case name extraction and cleaning
  • Key Functions:
    • extract_case_name_and_date_unified_master() - Main extraction
    • get_master_extractor() - Get singleton instance
    • UnifiedCaseExtractionMaster._clean_case_name() - Cleaning logic

Citation Finding (Production Pipeline)

src/clean_extraction_pipeline.py
  • Status: ✅ ACTIVE - Production citation finding
  • Purpose: Find citations in documents (uses eyecite + regex)
  • Delegates to: unified_case_extraction_master.py for cleaning
  • Used by: URL uploads, async processing

Production Endpoint

src/citation_extraction_endpoint.py
  • Status: ✅ ACTIVE - Production entry point
  • Purpose: Main API endpoint for citation extraction
  • Uses: clean_extraction_pipeline.py

⚠️ DEPRECATED BUT STILL FUNCTIONAL

Legacy Wrapper (Delegates to Master)

src/unified_case_name_extractor_v2.py
  • Status: ⚠️ DEPRECATED (shows warnings)
  • Purpose: Backwards compatibility wrapper
  • Delegates to: unified_case_extraction_master.py
  • Migration: Replace calls with direct master calls

Legacy Processor (Being Phased Out)

src/unified_citation_processor_v2.py
  • Status: ⚠️ PARTIALLY DEPRECATED
  • Has deprecation notice: Line 50-57
  • Still used by: Some legacy code paths
  • Future: Will be replaced by clean pipeline

❌ DO NOT USE - Deprecated

src/unified_extraction_architecture.py
src/case_name_extraction_core.py (47+ duplicate functions)
src/enhanced_sync_processor.py
src/processors/citation_extractor.py
src/processors/name_year_extractor.py

📊 Code Flow - Current

URL Upload (Async):

User uploads URL
    ↓
progress_manager.py
    ↓
citation_extraction_endpoint.extract_citations_production()
    ↓
clean_extraction_pipeline.CleanExtractionPipeline.extract_citations()
    ↓
clean_extraction_pipeline._clean_eyecite_case_name()
    ↓ DELEGATES TO
unified_case_extraction_master._clean_case_name()
    ✅ SINGLE SOURCE OF TRUTH

Text Paste (Sync):

User pastes text
    ↓
unified_sync_processor.py
    ↓
May use various paths, but ultimately:
    ↓
unified_case_extraction_master.extract_case_name_and_date_unified_master()
    ✅ SINGLE SOURCE OF TRUTH

🔧 Recent Consolidation (Oct 16, 2025)

Before:

  • ❌ Duplicate cleaning logic in 2 files
  • ❌ 80+ lines of code duplicated
  • ❌ Bug fixes needed in multiple places

After:

  • ✅ Single source of truth for cleaning
  • ✅ 51 lines of duplicate code eliminated
  • ✅ Bug fixes apply everywhere automatically

📝 For Developers

Adding a New Fix:

  1. DO: Edit unified_case_extraction_master._clean_case_name()
  2. RESULT: Fix automatically applies to ALL code paths
  3. DON'T: Edit clean_extraction_pipeline._clean_eyecite_case_name() (it delegates)

Deprecating More Code:

  1. Add deprecation warnings
  2. Update DEPRECATION_NOTICE.md
  3. Create delegation to master
  4. Test both code paths
  5. Remove after 2-3 versions

🎯 Future Work

  1. Complete Migration: Fully migrate unified_citation_processor_v2.py to use master
  2. Remove Wrappers: Eventually remove unified_case_name_extractor_v2.py
  3. Single Pipeline: Consolidate sync and async paths to use same pipeline
  4. Archive Old Code: Move deprecated files to src/deprecated/ folder

✅ Quality Metrics

  • Code Duplication: 73 fewer duplicate lines (51 + 22 from imports)
  • Single Source of Truth: 100% (all cleaning goes through master)
  • Test Coverage: All fixes tested with URL and text inputs
  • Documentation: Deprecation notices in place