All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
v2 is a complete rewrite that is not backwards-compatible with v1. The package has moved from `contextifier` to `contextifier_new`.
- Enforced 5-stage pipeline: Convert → Preprocess → Metadata → Content → Postprocess
  - `BaseHandler` enforces execution order, so all handlers follow the same structure
  - Each stage is defined as an ABC: `Converter`, `Preprocessor`, `MetadataExtractor`, `ContentExtractor`, `Postprocessor`
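
As a rough illustration of how a fixed stage order can be enforced, here is a minimal Template Method sketch. Only the `BaseHandler` name, its `process()` method, and the five stage names come from this changelog; the per-stage method names, their signatures, and `ToyTextHandler` are assumptions made for the example. The real package models each stage as its own ABC, which this sketch collapses into abstract methods for brevity.

```python
from abc import ABC, abstractmethod


class BaseHandler(ABC):
    """Sketch only: process() fixes the stage order, subclasses fill in the stages."""

    def process(self, source: str) -> dict:
        # The order below is the one listed in the changelog entry.
        converted = self.convert(source)
        prepared = self.preprocess(converted)
        metadata = self.extract_metadata(prepared)
        content = self.extract_content(prepared)
        return self.postprocess(metadata, content)

    # Hypothetical stage hooks; names and signatures are assumptions.
    @abstractmethod
    def convert(self, source: str) -> str: ...

    @abstractmethod
    def preprocess(self, converted: str) -> str: ...

    @abstractmethod
    def extract_metadata(self, prepared: str) -> dict: ...

    @abstractmethod
    def extract_content(self, prepared: str) -> str: ...

    @abstractmethod
    def postprocess(self, metadata: dict, content: str) -> dict: ...


class ToyTextHandler(BaseHandler):
    """Hypothetical handler that records which stage ran, in order."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    def convert(self, source: str) -> str:
        self.calls.append("convert")
        return source

    def preprocess(self, converted: str) -> str:
        self.calls.append("preprocess")
        return converted.strip()

    def extract_metadata(self, prepared: str) -> dict:
        self.calls.append("metadata")
        return {"length": len(prepared)}

    def extract_content(self, prepared: str) -> str:
        self.calls.append("content")
        return prepared

    def postprocess(self, metadata: dict, content: str) -> dict:
        self.calls.append("postprocess")
        return {"metadata": metadata, "text": content}


handler = ToyTextHandler()
result = handler.process("  hello  ")
```

Because subclasses override only the hooks and never `process()` itself, they cannot reorder or skip a stage.
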
- 14 format handlers: PDF, PDF-Plus, DOCX, DOC, PPTX, PPT, XLSX, XLS, CSV/TSV, HWP, HWPX, RTF, Text, Image
- HandlerRegistry: Automatic extension → handler mapping via `register_defaults()`
- Immutable config system: Frozen dataclass-based `ProcessingConfig`
  - `TagConfig`, `ImageConfig`, `ChartConfig`, `MetadataConfig`, `TableConfig`, `ChunkingConfig`, `OCRConfig`
  - Fluent builder: `config.with_tags()`, `config.with_chunking()`, ...
  - Serialization: `to_dict()` / `from_dict()`
  - Format-specific options: `config.with_format_option("pdf", ...)`
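
The combination of frozen dataclasses and a fluent builder can be sketched as follows. The class names `ProcessingConfig` and `ChunkingConfig` and the methods `with_chunking()`, `to_dict()`, and `from_dict()` are named in this changelog; the fields `chunk_size` and `chunk_overlap`, the keyword-argument signature, and the method bodies are assumptions for illustration, not the package's actual implementation.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass, field, replace


@dataclass(frozen=True)
class ChunkingConfig:
    chunk_size: int = 1000    # hypothetical field
    chunk_overlap: int = 200  # hypothetical field


@dataclass(frozen=True)
class ProcessingConfig:
    chunking: ChunkingConfig = field(default_factory=ChunkingConfig)

    def with_chunking(self, **changes: int) -> ProcessingConfig:
        # Frozen instances cannot be mutated, so return a modified copy.
        return replace(self, chunking=replace(self.chunking, **changes))

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> ProcessingConfig:
        return cls(chunking=ChunkingConfig(**data.get("chunking", {})))


base = ProcessingConfig()
tuned = base.with_chunking(chunk_size=500)
restored = ProcessingConfig.from_dict(tuned.to_dict())
```

Each `with_*` call returns a new object, so a shared base config can be specialized per call site without affecting other users of it.
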
- 4 chunking strategies with automatic selection:
  - `TableChunkingStrategy` (priority 5): spreadsheet-specific
  - `PageChunkingStrategy` (priority 10): page boundary-based
  - `ProtectedChunkingStrategy` (priority 20): HTML table / protected region preservation
  - `PlainChunkingStrategy` (priority 100): recursive splitting fallback
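
One plausible reading of "automatic selection" is that strategies are tried in ascending priority order and the first one that accepts the input wins, with the plain strategy as the catch-all. The sketch below assumes exactly that. The strategy names and priority numbers are taken from the list above; the `can_handle` predicates, the `[Page` marker, and the `select_strategy` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Strategy:
    name: str
    priority: int
    # Hypothetical predicate: (text, extension) -> whether this strategy applies.
    can_handle: Callable[[str, str], bool]


STRATEGIES = [
    Strategy("PlainChunkingStrategy", 100, lambda text, ext: True),
    Strategy("ProtectedChunkingStrategy", 20, lambda text, ext: "<table" in text),
    Strategy("PageChunkingStrategy", 10, lambda text, ext: "[Page" in text),
    Strategy(
        "TableChunkingStrategy",
        5,
        lambda text, ext: ext in {"xlsx", "xls", "csv", "tsv"},
    ),
]


def select_strategy(text: str, extension: str) -> Strategy:
    """Return the applicable strategy with the lowest priority number."""
    for strategy in sorted(STRATEGIES, key=lambda s: s.priority):
        if strategy.can_handle(text, extension):
            return strategy
    raise LookupError("no chunking strategy matched")


chosen = select_strategy("<table><tr><td>1</td></tr></table>", "docx")
```
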
- 5 OCR engines: OpenAI, Anthropic, Google Gemini, AWS Bedrock, vLLM
  - Convenience constructors: `from_api_key()` for each engine
  - Direct LangChain client passthrough
  - Custom prompt support
- 5 shared services (DI pattern):
  - `TagService`: page / slide / sheet tag generation
  - `ImageService`: image saving / tagging / deduplication / storage backends
  - `ChartService`: chart data formatting
  - `TableService`: table HTML / MD / Text rendering
  - `MetadataService`: metadata formatting (Korean / English)
- Unified type system (`types.py`):
  - `FileContext` TypedDict: standard input for all handlers
  - `ExtractionResult`: unified text / metadata / table / image / chart output
  - `DocumentMetadata`, `TableData`, `TableCell`, `ChartData` shared dataclasses
  - `FileCategory`, `OutputFormat`, `NamingStrategy`, `StorageType` enums
- Unified exception hierarchy (`errors.py`):
  - `ContextifierError` base exception tree
  - `FileNotFoundError`, `UnsupportedFormatError`, `HandlerNotFoundError`, etc.
- ChunkResult: `save_to_md()`, `__len__`, `__iter__`, `__getitem__` support
- DOC handler auto-detection: OLE, HTML, DOCX, RTF internal format auto-detection
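
The `ChunkResult` entry implies the result behaves like a sequence of chunks. A minimal sketch of such a container follows; the method names come from the entry above, while the `chunks` field, the `save_to_md()` signature, and the one-file-per-chunk naming scheme are assumptions rather than the package's documented behavior.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ChunkResult:
    chunks: list[str] = field(default_factory=list)  # hypothetical field name

    def __len__(self) -> int:
        return len(self.chunks)

    def __iter__(self):
        return iter(self.chunks)

    def __getitem__(self, index):
        return self.chunks[index]

    def save_to_md(self, directory: str) -> list[Path]:
        """Write one Markdown file per chunk (assumed layout and naming)."""
        out = Path(directory)
        out.mkdir(parents=True, exist_ok=True)
        paths = []
        for number, chunk in enumerate(self.chunks, start=1):
            path = out / f"chunk_{number:04d}.md"
            path.write_text(chunk, encoding="utf-8")
            paths.append(path)
        return paths


result = ChunkResult(["first chunk", "second chunk"])
with tempfile.TemporaryDirectory() as tmp:
    written = [p.name for p in result.save_to_md(tmp)]
```
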
- All legacy v1 code in the `contextifier` package
  - `core/document_processor.py` (monolithic single file)
  - `core/functions/` (utils.py, individual processor modules)
  - `core/processor/` (per-handler files without unified structure)
  - `chunking/` (single chunking.py with all logic)
  - `ocr/ocr_engine/` (per-engine files without consistency)
- Facade pattern: `DocumentProcessor` is the sole public entry point
- Strategy pattern: Automatic chunking strategy selection
- Template Method pattern: `BaseHandler.process()` enforces the 5-stage order
- Dependency Injection: Services are created once and shared across handlers
- Registry pattern: Automatic extension → handler mapping
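
The Registry pattern entry can be illustrated with a small extension-to-handler lookup. `HandlerRegistry`, `register_defaults()`, and `HandlerNotFoundError` are names from this changelog; the `register()` and `get()` methods, the placeholder handler classes, and the specific defaults registered here are assumptions made for the example.

```python
from pathlib import Path


class HandlerNotFoundError(LookupError):
    """Stand-in for the exception of the same name listed under errors.py."""


class PdfHandler:  # placeholder handler, not the real implementation
    pass


class CsvHandler:  # placeholder handler, not the real implementation
    pass


class HandlerRegistry:
    def __init__(self) -> None:
        self._by_extension: dict[str, type] = {}

    def register(self, handler: type, *extensions: str) -> None:
        # Normalize so ".PDF", "pdf", and ".pdf" all map to the same key.
        for ext in extensions:
            self._by_extension[ext.lower().lstrip(".")] = handler

    def register_defaults(self) -> None:
        # Hypothetical defaults; the real list covers all 14 handlers.
        self.register(PdfHandler, "pdf")
        self.register(CsvHandler, "csv", "tsv")

    def get(self, filename: str) -> type:
        ext = Path(filename).suffix.lower().lstrip(".")
        try:
            return self._by_extension[ext]
        except KeyError:
            raise HandlerNotFoundError(f"no handler for {filename!r}") from None


registry = HandlerRegistry()
registry.register_defaults()
handler_cls = registry.get("report.PDF")
```

Keeping the mapping in one place means a facade such as `DocumentProcessor` can dispatch on the file extension alone, without knowing the concrete handler classes.
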
- `requirements.txt` path resolution for packaged installs.
- Initial release of Contextifier v1.
- Support for PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, CSV, TSV, HWP, HWPX, RTF, TXT, Image.
- OCR integration via OpenAI, Anthropic, Gemini, Bedrock.
- Basic text chunking with page/table awareness.
- Metadata extraction for common document formats.