fix: extract <details> and <summary> accordion content #4278

Closed
AlonNaor22 wants to merge 1 commit into Unstructured-IO:main from AlonNaor22:fix/html-accordion-details-summary
Conversation

@AlonNaor22
Summary

  • Fixes bug/Partition_html Function Fails to Extract Accordion Titles #3919: partition_html silently discards all content inside <details> and <summary> HTML elements (accordions/collapsible sections)
  • Root cause: both <details> and <summary> were mapped to RemovedBlock in the HTML parser element class lookup, causing them and all their children to be stripped from output
  • Fix: reclassify <details> as Flow (a block container, like <div>) and <summary> as Flow (a block element whose element type is left to the text classifier)
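
The root cause and fix above can be illustrated with a small standalone sketch. This is not the unstructured parser: `ToyExtractor`, `BEFORE`, `AFTER`, and `extract` are hypothetical names, and the code uses only the standard-library `html.parser` to mimic the idea that a tag mapped to a "removed" category drops itself and all of its children, while a "flow" tag lets its text through.

```python
from html.parser import HTMLParser

# Hypothetical stand-ins for the categories named in this PR; the real
# parser uses element classes, not strings.
REMOVED, FLOW = "RemovedBlock", "Flow"

# Toy tag lookup before and after the change described above.
BEFORE = {"details": REMOVED, "summary": REMOVED, "div": FLOW, "p": FLOW}
AFTER = {**BEFORE, "details": FLOW, "summary": FLOW}


class ToyExtractor(HTMLParser):
    """Collect text, skipping everything nested inside a removed tag.

    Assumes well-formed markup with explicit closing tags; void tags such
    as <br> inside a removed block are not handled in this sketch.
    """

    def __init__(self, lookup):
        super().__init__()
        self.lookup = lookup
        self.removed_depth = 0  # > 0 while inside a removed subtree
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # Once inside a removed subtree, every nested tag deepens it.
        if self.removed_depth or self.lookup.get(tag) == REMOVED:
            self.removed_depth += 1

    def handle_endtag(self, tag):
        if self.removed_depth:
            self.removed_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.removed_depth:
            self.texts.append(text)


def extract(html, lookup):
    parser = ToyExtractor(lookup)
    parser.feed(html)
    parser.close()
    return parser.texts


ACCORDION = "<details><summary>What is it?</summary><p>An answer.</p></details>"
print(extract(ACCORDION, BEFORE))  # []
print(extract(ACCORDION, AFTER))   # ['What is it?', 'An answer.']
```

With the `BEFORE` lookup the whole accordion vanishes because the outer `<details>` is removed along with its children; with `AFTER` both the summary and the body text survive.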

Files changed

  • unstructured/partition/html/parser.py: changed "details": RemovedBlock → Flow and "summary": RemovedBlock → Flow
  • test_unstructured/partition/html/test_partition.py — added 6 tests covering: basic <details>, <summary> extraction, FAQ accordion, nested details, details without summary, summary with inline markup

Test plan

  • <details><p>text</p></details> → content extracted (was empty before)
  • <summary> heading text appears in output
  • Multi-item FAQ accordion produces all questions and answers
  • Nested <details> elements all extracted
  • <details> without <summary> still works
  • <summary> with inline markup (<b>, <em>) preserves text
  • All 6 new tests pass, 100 existing tests still pass (1 pre-existing failure unrelated to this change)
  • Linter and formatter clean (ruff check + ruff format)
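
A quick way to check the behaviour in the test plan by hand is sketched below. It assumes the `unstructured` package is installed and calls `partition_html(text=...)`; the `FAQ_HTML` sample and the `extract_texts` helper are hypothetical and are not the tests added in this PR. What it prints depends on whether the installed build includes this change, so no expected output is shown.

```python
# Hypothetical sample input: one accordion item with inline markup in the summary.
FAQ_HTML = """
<details>
  <summary>What is <b>partitioning</b>?</summary>
  <p>Splitting a document into typed elements.</p>
</details>
"""


def extract_texts(html):
    """Return the text of each partitioned element, or None if unstructured is missing."""
    try:
        # Imported lazily so the sketch still loads without the library.
        from unstructured.partition.html import partition_html
    except ImportError:
        return None
    return [element.text for element in partition_html(text=html)]


if __name__ == "__main__":
    # With the change in this PR, the summary and paragraph text would be
    # expected to appear here; per the description, builds without it
    # drop the accordion content.
    print(extract_texts(FAQ_HTML))
```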

…carding it

The HTML parser mapped <details> and <summary> elements to RemovedBlock,
causing all accordion/collapsible content (e.g. FAQ sections) to be silently
dropped from partition output. Reclassify both as Flow elements so their
text content is properly extracted.

Closes #3919

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AlonNaor22 closed this by deleting the head repository on Mar 8, 2026