
fix(docx): ignore malformed styles missing type #2190

Open
gingerninja85 wants to merge 1 commit into microsoft:main from gingerninja85:fix/docx-missing-style-type

@gingerninja85

Summary

Fixes a DOCX conversion crash when word/styles.xml contains a malformed w:style entry without the required w:type attribute.

Mammoth indexes style_element.attributes["w:type"] directly, so a style record without that attribute raises a KeyError, which currently surfaces as:

DocxConverter threw KeyError with message: 'w:type'

This patch extends DOCX preprocessing to remove only malformed style definitions missing w:type or w:styleId, allowing body text conversion to continue instead of failing the whole document.
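For readers who want a concrete picture of the filtering step, here is a minimal sketch. It is not the code in this patch: the helper name `drop_malformed_styles` and the use of `xml.etree.ElementTree` are assumptions made for illustration, and the actual change in pre_process.py may be structured differently.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in styles.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def drop_malformed_styles(styles_xml: bytes) -> bytes:
    """Hypothetical helper: drop w:style elements lacking w:type or w:styleId."""
    root = ET.fromstring(styles_xml)
    # findall returns a list, so removing children while looping is safe.
    for style in root.findall(f"{{{W_NS}}}style"):
        has_type = f"{{{W_NS}}}type" in style.attrib
        has_id = f"{{{W_NS}}}styleId" in style.attrib
        if not (has_type and has_id):
            root.remove(style)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Illustrative input: one well-formed style and one missing w:type.
SAMPLE = (
    b'<w:styles xmlns:w="' + W_NS.encode() + b'">'
    b'<w:style w:type="paragraph" w:styleId="Normal"/>'
    b'<w:style w:styleId="Broken"/>'
    b"</w:styles>"
)

cleaned = drop_malformed_styles(SAMPLE)
```

Only the offending style records are dropped; every well-formed w:style element and any other child of w:styles is left in place, which matches the intent of letting body text conversion continue.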

Fixes #2166.

Verification

pytest packages/markitdown/tests/test_module_vectors.py::test_convert_docx_with_style_missing_type -q
# 1 passed

pytest 'packages/markitdown/tests/test_module_vectors.py::test_convert_local[test_vector0]' \
  'packages/markitdown/tests/test_module_vectors.py::test_convert_stream_with_hints[test_vector0]' \
  'packages/markitdown/tests/test_module_vectors.py::test_convert_stream_without_hints[test_vector0]' \
  'packages/markitdown/tests/test_module_vectors.py::test_convert_file_uri[test_vector0]' \
  'packages/markitdown/tests/test_module_vectors.py::test_convert_data_uri[test_vector0]' \
  packages/markitdown/tests/test_module_vectors.py::test_convert_docx_with_style_missing_type -q
# 6 passed

pre-commit run --files packages/markitdown/src/markitdown/converter_utils/docx/pre_process.py packages/markitdown/tests/test_module_vectors.py
# black Passed

python -m compileall -q packages/markitdown/src packages/markitdown/tests
# passed

I also ran hatch test: 332 passed, 4 skipped, and 1 failed. The failure, tests/test_module_misc.py::test_speech_transcription, is unrelated to this change; it occurs because ffprobe is not installed in my local environment.

Development

Successfully merging this pull request may close these issues.

- DocxConverter threw KeyError with message: 'w:type'
