| 1 | +# Contextify v0.2.6 Bug Fix Report |
| 2 | + |
| 3 | +**Date:** 2026-04-07 |
| 4 | +**Based on:** [DEEP_TEST_REPORT_v0.2.6.md](DEEP_TEST_REPORT_v0.2.6.md) |
| 5 | + |
| 6 | +--- |
| 7 | + |
| 8 | +## Summary |
| 9 | + |
| 10 | +| Metric | Before | After | |
| 11 | +|--------|--------|-------| |
| 12 | +| **Custom Integration Tests** | 106/110 (96.4%) | **110/110 (100%)** | |
| 13 | +| **Existing Unit Tests** | 476/476 (100%) | **476/476 (100%)** | |
| 14 | +| **SVG Registration Warning** | Every instantiation | **Eliminated** | |
| 15 | +| **Bugs Fixed** | 0 | **6** | |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## Fixes Applied |
| 20 | + |
| 21 | +### FIX #1: Empty File Handling (BUG #1 - Critical) |
| 22 | + |
| 23 | +**Problem:** Zero-byte TXT and CSV files raised `ConversionError` at the validation stage, before conversion could start. |
| 24 | + |
| 25 | +**Root Cause:** `BaseConverter.validate()`, `TextConverter.validate()`, and `CsvConverter.validate()` all rejected empty `file_data` (length == 0). |
| 26 | + |
| 27 | +**Changes:** |
| 28 | + |
| 29 | +| File | Change | |
| 30 | +|------|--------| |
| 31 | +| `contextifier/pipeline/converter.py` | `BaseConverter.validate()` now always returns `True` (empty-file handling is deferred to `convert()`) | |
| 32 | +| `contextifier/handlers/text/converter.py` | `validate()` returns `True`; `convert()` returns empty `TextConvertedData` for empty files instead of raising | |
| 33 | +| `contextifier/handlers/csv/converter.py` | `validate()` returns `True`; `convert()` returns empty `CsvConvertedData` for empty files instead of raising | |
| 34 | +| `tests/unit/test_security.py` | `test_empty_file_rejected` -> `test_empty_file_returns_empty_text` (expects empty string return) | |
| 35 | + |
| 36 | +**Verification:** |
| 37 | +```python |
| 38 | +proc.extract_text("empty.txt") # Returns "" (was: ConversionError) |
| 39 | +proc.extract_text("empty.csv") # Returns metadata-only text (was: ConversionError) |
| 40 | +``` |
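The pattern behind this fix can be sketched as follows. This is a minimal illustration, not the library's code: the `TextConvertedData` and `ConversionError` classes below are simplified stand-ins, and `convert_text` with its `encoding` parameter is a hypothetical helper named only for this example.

```python
from dataclasses import dataclass


@dataclass
class TextConvertedData:
    """Simplified stand-in for the real converted-data type (assumption)."""
    text: str = ""
    encoding: str = "utf-8"


class ConversionError(Exception):
    """Simplified stand-in for the library's conversion error (assumption)."""


def convert_text(file_data: bytes, encoding: str = "utf-8") -> TextConvertedData:
    """Hypothetical converter: empty input is valid and yields an empty result."""
    # Zero-byte input is not an error; return an empty result instead of raising.
    if len(file_data) == 0:
        return TextConvertedData(text="", encoding=encoding)
    try:
        return TextConvertedData(text=file_data.decode(encoding), encoding=encoding)
    except UnicodeDecodeError as exc:
        # Genuinely undecodable input is still reported as a conversion failure.
        raise ConversionError(f"cannot decode input as {encoding}") from exc
```

Moving the empty check from `validate()` into `convert()` keeps validation permissive while still letting each converter decide what an empty result looks like.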
| 41 | + |
| 42 | +--- |
| 43 | + |
| 44 | +### FIX #2: SVG Extension Double-Registration (BUG #3 & #6 - Medium) |
| 45 | + |
| 46 | +**Problem:** SVG was registered in both `TextHandler` (as XML text) and `ImageFileHandler` (as image), causing a warning on every instantiation. Additionally, `.svg` was missing from `supported_extensions`. |
| 47 | + |
| 48 | +**Root Cause:** `_TEXT_EXTENSIONS` in `text/handler.py` and `IMAGE_EXTENSIONS` in `image/_constants.py` both included `"svg"`. |
| 49 | + |
| 50 | +**Change:** |
| 51 | + |
| 52 | +| File | Change | |
| 53 | +|------|--------| |
| 54 | +| `contextifier/handlers/image/_constants.py` | Removed `"svg"` from `IMAGE_EXTENSIONS` (SVG stays in TextHandler as it's XML-based) | |
| 55 | + |
| 56 | +**Verification:** |
| 57 | +```python |
| 58 | +# No more warning on instantiation |
| 59 | +proc = DocumentProcessor() # Clean, no SVG warning |
| 60 | +proc.is_supported(".svg") # Returns True (was: False) |
| 61 | +``` |
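To illustrate why an overlapping extension triggers a warning, here is a small sketch of an extension registry. The extension sets, the `build_registry` function, and the first-registration-wins policy are assumptions made for this example; the actual registration logic in the library may differ.

```python
import warnings

# Illustrative extension sets (assumption): after the fix, "svg" appears only once.
TEXT_EXTENSIONS = frozenset({"txt", "md", "xml", "svg"})
IMAGE_EXTENSIONS = frozenset({"png", "jpg", "jpeg", "gif"})


def build_registry(handlers):
    """Map each extension to one handler name, warning on duplicates."""
    registry = {}
    for handler_name, extensions in handlers.items():
        for ext in sorted(extensions):
            if ext in registry:
                # A second handler claims an extension that is already registered.
                warnings.warn(
                    f"extension '{ext}' already registered by "
                    f"'{registry[ext]}', ignoring '{handler_name}'"
                )
                continue
            registry[ext] = handler_name
    return registry


registry = build_registry({"text": TEXT_EXTENSIONS, "image": IMAGE_EXTENSIONS})
```

With disjoint sets, no warning is emitted and `registry["svg"]` resolves to the text handler.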
| 62 | + |
| 63 | +--- |
| 64 | + |
| 65 | +### FIX #3: Empty Text Chunking (BUG #4 - Medium) |
| 66 | + |
| 67 | +**Problem:** `TextChunker.chunk("")` returned `[""]` (list with one empty string) instead of `[]` (empty list). |
| 68 | + |
| 69 | +**Root Cause:** The early-return branch in `chunker.py` (lines 111-112) returned `[""]`. |
| 70 | + |
| 71 | +**Changes:** |
| 72 | + |
| 73 | +| File | Change | |
| 74 | +|------|--------| |
| 75 | +| `contextifier/chunking/chunker.py` | `return [""]` -> `return []` for empty/whitespace-only text | |
| 76 | +| `tests/unit/chunking/test_chunker.py` | Updated 2 tests to expect `[]` instead of `[""]` | |
| 77 | + |
| 78 | +**Verification:** |
| 79 | +```python |
| 80 | +chunker.chunk("") # Returns [] (was: [""]) |
| 81 | +chunker.chunk(" \n ") # Returns [] (was: [""]) |
| 82 | +``` |
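The corrected early return can be sketched like this. The `chunk_text` function and its fixed-width splitting are assumptions for illustration only; the real `TextChunker` selects among several strategies.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    """Hypothetical fixed-width chunker showing the empty-input early return."""
    # Empty or whitespace-only input produces no chunks at all.
    if not text or not text.strip():
        return []
    # Otherwise split into consecutive slices of at most chunk_size characters.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Returning `[]` lets callers write `for chunk in chunks:` or `if chunks:` without filtering out a placeholder empty string.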
| 83 | + |
| 84 | +--- |
| 85 | + |
| 86 | +### FIX #4: Error Priority Order (ISSUE #4) |
| 87 | + |
| 88 | +**Problem:** Processing `nonexistent.zzz` raised `FileNotFoundError` instead of `UnsupportedFormatError`. Format validation should run before the file existence check, so that an unsupported extension is reported regardless of whether the file exists. |
| 89 | + |
| 90 | +**Changes:** |
| 91 | + |
| 92 | +| File | Change | |
| 93 | +|------|--------| |
| 94 | +| `contextifier/document_processor.py` | In `extract_text()` and `process()`: moved extension resolution and the format support check **before** the file existence check | |
| 95 | + |
| 96 | +**Verification:** |
| 97 | +```python |
| 98 | +proc.extract_text("nonexistent.zzz") # UnsupportedFormatError (was: FileNotFoundError) |
| 99 | +proc.extract_text("nonexistent.pdf") # FileNotFoundError (unchanged, .pdf is supported) |
| 100 | +``` |
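The check ordering can be sketched as below. The `SUPPORTED` set, the `check_input` helper, and the local `UnsupportedFormatError` class are stand-ins assumed for this example, not the library's actual definitions.

```python
from pathlib import Path


class UnsupportedFormatError(Exception):
    """Simplified stand-in for the library's error type (assumption)."""


# Illustrative subset of supported extensions (assumption).
SUPPORTED = frozenset({"txt", "csv", "pdf"})


def check_input(path: str) -> Path:
    """Hypothetical pre-flight check: format first, then existence."""
    p = Path(path)
    ext = p.suffix.lower().lstrip(".")
    # 1. Reject unsupported formats without touching the filesystem.
    if ext not in SUPPORTED:
        raise UnsupportedFormatError(f"unsupported format: '.{ext}'")
    # 2. Only then verify that the file is actually present.
    if not p.is_file():
        raise FileNotFoundError(path)
    return p
```

Checking the format first means the more actionable error wins: a caller passing an unsupported type learns that no file with that extension would ever be accepted.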
| 101 | + |
| 102 | +--- |
| 103 | + |
| 104 | +### FIX #5: ChunkResult Strategy Field (ISSUE #5) |
| 105 | + |
| 106 | +**Problem:** `ChunkResult` did not expose which chunking strategy was selected, making debugging difficult. |
| 107 | + |
| 108 | +**Changes:** |
| 109 | + |
| 110 | +| File | Change | |
| 111 | +|------|--------| |
| 112 | +| `contextifier/document_processor.py` | Added `strategy: Optional[str] = None` field to `ChunkResult` dataclass | |
| 113 | +| `contextifier/chunking/chunker.py` | Added `last_strategy_name` property to `TextChunker`; tracks strategy used in each `chunk()` call | |
| 114 | +| `contextifier/document_processor.py` | `extract_chunks()` now passes `self._chunker.last_strategy_name` to `ChunkResult.strategy` | |
| 115 | + |
| 116 | +**Verification:** |
| 117 | +```python |
| 118 | +result = proc.extract_chunks("file.txt", chunk_size=500) |
| 119 | +result.strategy # "plain" |
| 120 | + |
| 121 | +result = proc.extract_chunks("file.csv", chunk_size=500) |
| 122 | +result.strategy # "table" |
| 123 | + |
| 124 | +result = proc.extract_chunks("file.pdf", chunk_size=500) |
| 125 | +result.strategy # "protected" or "page" |
| 126 | +``` |
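How the strategy name might flow from the chunker into the result is sketched below. Only the `strategy` field and the `last_strategy_name` property come from the report; the class bodies, the single `"plain"` strategy, the reset to `None` on empty input, and the free-standing `extract_chunks` function are simplifying assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChunkResult:
    """Simplified stand-in; the real dataclass has more fields (assumption)."""
    chunks: List[str] = field(default_factory=list)
    strategy: Optional[str] = None  # name of the strategy that produced the chunks


class TextChunker:
    """Simplified chunker that remembers which strategy it used last."""

    def __init__(self) -> None:
        self._last_strategy_name: Optional[str] = None

    @property
    def last_strategy_name(self) -> Optional[str]:
        return self._last_strategy_name

    def chunk(self, text: str, chunk_size: int = 500) -> List[str]:
        if not text.strip():
            # Assumption of this sketch: no strategy runs for empty input.
            self._last_strategy_name = None
            return []
        # This sketch has a single fixed-width strategy, labeled "plain".
        self._last_strategy_name = "plain"
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def extract_chunks(chunker: TextChunker, text: str, chunk_size: int = 500) -> ChunkResult:
    """Hypothetical wrapper: copy the chunker's last strategy into the result."""
    chunks = chunker.chunk(text, chunk_size=chunk_size)
    return ChunkResult(chunks=chunks, strategy=chunker.last_strategy_name)
```

One consequence of tracking the strategy on the chunker instance is that the value reflects only the most recent `chunk()` call, so it should be read immediately after chunking, as `extract_chunks()` does here.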
| 127 | + |
| 128 | +--- |
| 129 | + |
| 130 | +## Files Modified |
| 131 | + |
| 132 | +| File | Lines Changed | Type | |
| 133 | +|------|--------------|------| |
| 134 | +| `contextifier/pipeline/converter.py` | ~5 | Bug fix | |
| 135 | +| `contextifier/handlers/text/converter.py` | ~10 | Bug fix | |
| 136 | +| `contextifier/handlers/csv/converter.py` | ~10 | Bug fix | |
| 137 | +| `contextifier/handlers/image/_constants.py` | ~3 | Bug fix | |
| 138 | +| `contextifier/chunking/chunker.py` | ~10 | Bug fix + Feature | |
| 139 | +| `contextifier/document_processor.py` | ~25 | Bug fix + Feature | |
| 140 | +| `tests/unit/test_security.py` | ~5 | Test update | |
| 141 | +| `tests/unit/chunking/test_chunker.py` | ~4 | Test update | |
| 142 | + |
| 143 | +**Total: 8 files, ~72 lines changed** |
| 144 | + |
| 145 | +--- |
| 146 | + |
| 147 | +## Test Results After Fixes |
| 148 | + |
| 149 | +### Existing Unit Tests |
| 150 | +``` |
| 151 | +476 passed, 6 warnings in 3.15s |
| 152 | +``` |
| 153 | + |
| 154 | +### Custom Deep Integration Tests |
| 155 | +``` |
| 156 | +110 tests | PASS: 110 | FAIL: 0 | ERROR: 0 | WARN: 0 | SKIP: 0 |
| 157 | +Pass Rate: 100.0% |
| 158 | +``` |
| 159 | + |
| 160 | +### Specific Improvements |
| 161 | + |
| 162 | +| Test | Before | After | |
| 163 | +|------|--------|-------| |
| 164 | +| Edge: Empty TXT file | ERROR | PASS | |
| 165 | +| Edge: Empty CSV file | ERROR | PASS | |
| 166 | +| Chunking: Empty text | PASS (wrong result) | PASS (correct result) | |
| 167 | +| Edge: Unsupported format | PASS (wrong error type) | PASS (correct error type) | |
| 168 | +| SVG registration warning | Printed every time | Eliminated | |
| 169 | + |
| 170 | +--- |
| 171 | + |
| 172 | +## Remaining Known Issues (Not Fixed) |
| 173 | + |
| 174 | +These are minor and not blocking: |
| 175 | + |
| 176 | +1. **XLSX processing speed** (~380-420ms): Inherent openpyxl initialization cost. Consider `read_only` mode for large files. |
| 177 | +2. **PDF processing speed** (~77ms/page): pdfplumber overhead. Consider PyMuPDF for faster processing. |
| 178 | +3. **Windows console encoding**: Korean metadata labels may cause `UnicodeEncodeError` on cp949 consoles. This is a Windows limitation. |
| 179 | +4. **BUG #2 (position metadata)**: Re-assessed as **by-design**. `ChunkResult.chunks` always returns `List[str]` for backwards compatibility. `ChunkResult.chunks_with_metadata` provides `List[Chunk]` with position metadata when `include_position_metadata=True`. The `has_metadata` property indicates availability. |