Commit 477b8b4

Add various test files for deep testing
- Created an Excel file with sample data.
- Added XML file with structured items.
- Introduced YAML configuration file for testing.
- Included text files with different encodings (BOM, EUC-KR, whitespace).
- Added CSV files with semicolon and wide formats.
- Created an RTF file containing a table.
- Added an HTML file for XSS security testing.
1 parent 31e52ee commit 477b8b4

53 files changed

Lines changed: 6019 additions & 31 deletions

_deep_test/DEEP_TEST_REPORT_v0.2.6.md

Lines changed: 550 additions & 0 deletions
Large diffs are not rendered by default.

_deep_test/FIX_REPORT_v0.2.6.md

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
# Contextify v0.2.6 Bug Fix Report

**Date:** 2026-04-07
**Based on:** [DEEP_TEST_REPORT_v0.2.6.md](DEEP_TEST_REPORT_v0.2.6.md)

---

## Summary

| Metric | Before | After |
|--------|--------|-------|
| **Custom Integration Tests** | 106/110 (96.4%) | **110/110 (100%)** |
| **Existing Unit Tests** | 476/476 (100%) | **476/476 (100%)** |
| **SVG Registration Warning** | Every instantiation | **Eliminated** |
| **Bugs Fixed** | 0 | **6** |

---

## Fixes Applied

### FIX #1: Empty File Handling (BUG #1 - Critical)

**Problem:** 0-byte TXT/CSV files caused `ConversionError` crash at validation stage.

**Root Cause:** `BaseConverter.validate()`, `TextConverter.validate()`, and `CsvConverter.validate()` all rejected empty `file_data` (length == 0).

**Changes:**

| File | Change |
|------|--------|
| `contextifier/pipeline/converter.py` | `BaseConverter.validate()` now returns `True` always (empty file handling deferred to `convert()`) |
| `contextifier/handlers/text/converter.py` | `validate()` returns `True`; `convert()` returns empty `TextConvertedData` for empty files instead of raising |
| `contextifier/handlers/csv/converter.py` | `validate()` returns `True`; `convert()` returns empty `CsvConvertedData` for empty files instead of raising |
| `tests/unit/test_security.py` | `test_empty_file_rejected` -> `test_empty_file_returns_empty_text` (expects empty string return) |

**Verification:**
```python
proc.extract_text("empty.txt")  # Returns "" (was: ConversionError)
proc.extract_text("empty.csv")  # Returns metadata-only text (was: ConversionError)
```
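
The pattern behind this fix can be sketched as follows. This is a minimal illustration, not the actual contextifier code: the class names are stand-ins borrowed from the table above, and the `text` and `encoding` fields on the result container are assumptions made for the example. The point it shows is that `validate()` accepts 0-byte input and `convert()` returns an empty result rather than raising.

```python
from dataclasses import dataclass


@dataclass
class TextConvertedData:
    """Hypothetical stand-in for the converted-text container (fields assumed)."""
    text: str = ""
    encoding: str = "utf-8"


class TextConverter:
    """Illustrative converter; not the real contextifier implementation."""

    def validate(self, file_data: bytes) -> bool:
        # Validation no longer rejects 0-byte input; convert() decides what to do.
        return True

    def convert(self, file_data: bytes) -> TextConvertedData:
        if not file_data:
            # Empty input yields an empty result instead of raising an error.
            return TextConvertedData()
        return TextConvertedData(text=file_data.decode("utf-8"))
```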

---

### FIX #2: SVG Extension Double-Registration (BUG #3 & #6 - Medium)

**Problem:** SVG was registered in both `TextHandler` (as XML text) and `ImageFileHandler` (as image), causing a warning on every instantiation. Additionally, `.svg` was missing from `supported_extensions`.

**Root Cause:** `_TEXT_EXTENSIONS` in `text/handler.py` and `IMAGE_EXTENSIONS` in `image/_constants.py` both included `"svg"`.

**Change:**

| File | Change |
|------|--------|
| `contextifier/handlers/image/_constants.py` | Removed `"svg"` from `IMAGE_EXTENSIONS` (SVG stays in TextHandler as it's XML-based) |

**Verification:**
```python
# No more warning on instantiation
proc = DocumentProcessor()  # Clean, no SVG warning
proc.is_supported(".svg")   # Returns True (was: False)
```
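
A regression check for this class of bug could look roughly like the sketch below. The extension sets and the helper `find_duplicate_extensions` are hypothetical and only mirror the situation described above; the real registry may be structured differently. The helper reports every extension claimed by more than one handler, so a test can assert that the result is empty.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List

# Hypothetical extension sets, mirroring the state after the fix.
TEXT_EXTENSIONS: FrozenSet[str] = frozenset({"txt", "md", "xml", "svg"})
IMAGE_EXTENSIONS: FrozenSet[str] = frozenset({"png", "jpg", "gif"})


def find_duplicate_extensions(
    handlers: Dict[str, FrozenSet[str]],
) -> Dict[str, List[str]]:
    """Map each extension claimed by more than one handler to its claimants."""
    owners: Dict[str, List[str]] = defaultdict(list)
    for handler_name, extensions in handlers.items():
        for ext in extensions:
            owners[ext].append(handler_name)
    # Keep only extensions with two or more owners.
    return {ext: sorted(names) for ext, names in owners.items() if len(names) > 1}
```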

---

### FIX #3: Empty Text Chunking (BUG #4 - Medium)

**Problem:** `TextChunker.chunk("")` returned `[""]` (list with one empty string) instead of `[]` (empty list).

**Root Cause:** Early return in `chunker.py` lines 111-112 returned `[""]`.

**Changes:**

| File | Change |
|------|--------|
| `contextifier/chunking/chunker.py` | `return [""]` -> `return []` for empty/whitespace-only text |
| `tests/unit/chunking/test_chunker.py` | Updated 2 tests to expect `[]` instead of `[""]` |

**Verification:**
```python
chunker.chunk("")       # Returns [] (was: [""])
chunker.chunk("  \n  ") # Returns [] (was: [""])
```
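
The early-return behaviour can be illustrated with a toy splitter. `chunk_text` below is a hypothetical fixed-width function standing in for `TextChunker.chunk()`, whose real splitting logic is not shown in this report. Returning `[]` means callers can iterate over the result or test its truthiness without filtering out an empty string first.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    """Toy fixed-width splitter; a stand-in for the real chunker."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not text or not text.strip():
        # Empty or whitespace-only input produces no chunks at all.
        return []
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```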

---

### FIX #4: Error Priority Order (ISSUE #4)

**Problem:** Processing `nonexistent.zzz` raised `FileNotFoundError` instead of `UnsupportedFormatError`. Format validation should happen before the file existence check.

**Changes:**

| File | Change |
|------|--------|
| `contextifier/document_processor.py` | In `extract_text()` and `process()`: moved extension resolution and format support check **before** file existence check |

**Verification:**
```python
proc.extract_text("nonexistent.zzz")  # UnsupportedFormatError (was: FileNotFoundError)
proc.extract_text("nonexistent.pdf")  # FileNotFoundError (unchanged, .pdf is supported)
```
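
The reordered checks amount to something like the sketch below. `check_input`, the local `UnsupportedFormatError` class, and the `SUPPORTED_EXTENSIONS` subset are all assumptions made for illustration; the real logic lives inside `extract_text()` and `process()` and may resolve extensions differently.

```python
import os


class UnsupportedFormatError(Exception):
    """Hypothetical stand-in for the library's exception type."""


# Illustrative subset only; the real list of supported formats is longer.
SUPPORTED_EXTENSIONS = frozenset({"txt", "csv", "pdf"})


def check_input(path: str) -> str:
    """Validate the format first, then the file's existence."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        # An unsupported format is reported even if the file is missing.
        raise UnsupportedFormatError(f"Unsupported format: .{ext}")
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    return ext
```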

---

### FIX #5: ChunkResult Strategy Field (ISSUE #5)

**Problem:** `ChunkResult` did not expose which chunking strategy was selected, making debugging difficult.

**Changes:**

| File | Change |
|------|--------|
| `contextifier/document_processor.py` | Added `strategy: Optional[str] = None` field to `ChunkResult` dataclass |
| `contextifier/chunking/chunker.py` | Added `last_strategy_name` property to `TextChunker`; tracks strategy used in each `chunk()` call |
| `contextifier/document_processor.py` | `extract_chunks()` now passes `self._chunker.last_strategy_name` to `ChunkResult.strategy` |

**Verification:**
```python
result = proc.extract_chunks("file.txt", chunk_size=500)
result.strategy  # "plain"

result = proc.extract_chunks("file.csv", chunk_size=500)
result.strategy  # "table"

result = proc.extract_chunks("file.pdf", chunk_size=500)
result.strategy  # "protected" or "page"
```
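
How the strategy name might travel from the chunker to the result is sketched below. Only the `strategy` field and the `last_strategy_name` property come from the table above; the selection rule (pipe-delimited lines count as a table), the free function `extract_chunks`, and the fixed-width splitting are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import List, Optional


class TextChunker:
    """Illustrative chunker that remembers the last strategy it used."""

    def __init__(self) -> None:
        self._last_strategy_name: Optional[str] = None

    @property
    def last_strategy_name(self) -> Optional[str]:
        return self._last_strategy_name

    def chunk(self, text: str, chunk_size: int = 500) -> List[str]:
        # Toy selection rule (an assumption): pipe-delimited rows mean "table".
        is_table = any(line.startswith("|") for line in text.splitlines())
        self._last_strategy_name = "table" if is_table else "plain"
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


@dataclass
class ChunkResult:
    chunks: List[str]
    strategy: Optional[str] = None  # defaults to None, as in the fix


def extract_chunks(chunker: TextChunker, text: str, chunk_size: int = 500) -> ChunkResult:
    chunks = chunker.chunk(text, chunk_size)
    # Copy the strategy recorded by the most recent chunk() call into the result.
    return ChunkResult(chunks=chunks, strategy=chunker.last_strategy_name)
```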

---

## Files Modified

| File | Lines Changed | Type |
|------|--------------|------|
| `contextifier/pipeline/converter.py` | ~5 | Bug fix |
| `contextifier/handlers/text/converter.py` | ~10 | Bug fix |
| `contextifier/handlers/csv/converter.py` | ~10 | Bug fix |
| `contextifier/handlers/image/_constants.py` | ~3 | Bug fix |
| `contextifier/chunking/chunker.py` | ~10 | Bug fix + Feature |
| `contextifier/document_processor.py` | ~25 | Bug fix + Feature |
| `tests/unit/test_security.py` | ~5 | Test update |
| `tests/unit/chunking/test_chunker.py` | ~4 | Test update |

**Total: 8 files, ~72 lines changed**

---

## Test Results After Fixes

### Existing Unit Tests
```
476 passed, 6 warnings in 3.15s
```

### Custom Deep Integration Tests
```
110 tests | PASS: 110 | FAIL: 0 | ERROR: 0 | WARN: 0 | SKIP: 0
Pass Rate: 100.0%
```

### Specific Improvements

| Test | Before | After |
|------|--------|-------|
| Edge: Empty TXT file | ERROR | PASS |
| Edge: Empty CSV file | ERROR | PASS |
| Chunking: Empty text | PASS (wrong result) | PASS (correct result) |
| Edge: Unsupported format | PASS (wrong error type) | PASS (correct error type) |
| SVG registration warning | Printed every time | Eliminated |

---

## Remaining Known Issues (Not Fixed)

These are minor and not blocking:

1. **XLSX processing speed** (~380-420ms): Inherent openpyxl initialization cost. Consider `read_only` mode for large files.
2. **PDF processing speed** (~77ms/page): pdfplumber overhead. Consider PyMuPDF for faster processing.
3. **Windows console encoding**: Korean metadata labels may cause `UnicodeEncodeError` on cp949 consoles. This is a Windows limitation.
4. **BUG #2 (position metadata)**: Re-assessed as **by-design**. `ChunkResult.chunks` always returns `List[str]` for backwards compatibility. `ChunkResult.chunks_with_metadata` provides `List[Chunk]` with position metadata when `include_position_metadata=True`. The `has_metadata` property indicates availability.
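
The by-design behaviour in item 4 can be pictured with the sketch below. The attribute names `chunks`, `chunks_with_metadata`, and `has_metadata` and the `include_position_metadata` flag come from the description above; the `Chunk` fields (`text`, `start`, `end`) and the `build_result` helper are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    text: str
    start: int  # hypothetical position fields; the real Chunk may differ
    end: int


@dataclass
class ChunkResult:
    chunks: List[str]  # always plain strings, for backwards compatibility
    chunks_with_metadata: Optional[List[Chunk]] = None

    @property
    def has_metadata(self) -> bool:
        return self.chunks_with_metadata is not None


def build_result(text: str, chunk_size: int,
                 include_position_metadata: bool = False) -> ChunkResult:
    """Hypothetical helper: split text and optionally attach positions."""
    spans = [(i, min(i + chunk_size, len(text)))
             for i in range(0, len(text), chunk_size)]
    chunks = [text[s:e] for s, e in spans]
    meta = ([Chunk(text[s:e], s, e) for s, e in spans]
            if include_position_metadata else None)
    return ChunkResult(chunks=chunks, chunks_with_metadata=meta)
```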
