fix(html-backend): improve accordion extraction and hidden content ha… #1115

Open
wants to merge 3 commits into base: main

ulan-yisaev

Description

This PR addresses two issues with the HTML backend:

  1. Missing questions in Bootstrap accordion components: The HTML backend did not extract the questions from Bootstrap accordion components. The questions sit in <a> tags inside <div class="panel-title"> elements, so the extracted Q&A content was incomplete.

  2. Unwanted extraction of hidden content: The backend was including text from elements marked as 'hidden', which polluted the extracted content with metadata and invisible elements.
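To make the first issue concrete, here is a minimal sketch of pulling the question text out of accordion markup. It uses only the standard library's `html.parser` rather than the backend's own parser, and the `PanelTitleExtractor` class and the sample markup are hypothetical illustrations, not code from this PR.

```python
from html.parser import HTMLParser


class PanelTitleExtractor(HTMLParser):
    """Collect the text of <a> tags nested in <div class="panel-title">.

    Hypothetical helper for illustration; assumes well-formed markup.
    """

    def __init__(self):
        super().__init__()
        self.questions = []
        self._div_depth = 0    # <div> nesting depth inside a panel-title; 0 = outside
        self._in_link = False  # True while inside an <a> within a panel-title
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Start counting at the panel-title div, then track nested divs.
            if self._div_depth or "panel-title" in classes:
                self._div_depth += 1
        elif tag == "a" and self._div_depth:
            self._in_link = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "a" and self._in_link:
            self._in_link = False
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if text:
                self.questions.append(text)

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)


# Sample accordion markup (made up for this example).
SAMPLE = """
<div class="panel panel-default">
  <div class="panel-heading">
    <div class="panel-title">
      <a data-toggle="collapse" href="#a1">How do I reset my password?</a>
    </div>
  </div>
  <div id="a1" class="panel-collapse collapse">
    <div class="panel-body">Use the reset link on the login page.</div>
  </div>
</div>
"""

parser = PanelTitleExtractor()
parser.feed(SAMPLE)
questions = parser.questions
```

Only link text inside a `panel-title` div is collected; the answer in `panel-body` is ignored by this sketch.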

Changes implemented:

  • Added div to the TAGS_FOR_NODE_ITEMS list to ensure div elements are processed
  • Added specialized handlers for Bootstrap accordion components:
    • handle_panel_title(): Extracts question text from panel titles
    • handle_panel(): Processes entire accordion panels
  • Implemented is_hidden_element() method to detect various types of hidden elements:
    • Elements with classes like "hidden", "d-none", "hide", "invisible", "collapse"
    • Elements with inline styles like "display:none" or "visibility:hidden"
    • Elements with the "hidden" attribute
  • Modified text extraction to skip hidden elements in multiple places:
    • In extract_text_recursively() to skip hidden content during text collection
    • In walk() to skip processing hidden tags entirely
    • In analyze_tag() to prevent processing hidden elements
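The hidden-element checks listed above can be sketched roughly as follows. This is not the PR's `is_hidden_element()` implementation: the function `is_hidden`, the `VisibleText` class, and `visible_text` are hypothetical names, the sketch works on plain attribute dictionaries with the standard library's `html.parser`, and it assumes well-formed HTML in which every non-void tag is closed.

```python
from html.parser import HTMLParser

# Class names and inline styles taken from the list in the description.
HIDDEN_CLASSES = {"hidden", "d-none", "hide", "invisible", "collapse"}
HIDDEN_STYLES = ("display:none", "visibility:hidden")
# Void elements never get an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def is_hidden(attrs):
    """Return True if a tag's attributes mark it as hidden (hypothetical helper)."""
    if "hidden" in attrs:  # boolean `hidden` attribute
        return True
    classes = set((attrs.get("class") or "").split())
    if classes & HIDDEN_CLASSES:  # exact class-token match, not substring
        return True
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return any(rule in style for rule in HIDDEN_STYLES)


class VisibleText(HTMLParser):
    """Collect text while skipping everything under a hidden element."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._stack = []  # one bool per open tag: is it, or an ancestor, hidden?

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent_hidden = bool(self._stack and self._stack[-1])
        self._stack.append(parent_hidden or is_hidden(dict(attrs)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        hidden = bool(self._stack and self._stack[-1])
        if not hidden and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.chunks)
```

A call such as `visible_text('<p>Shown</p><span class="d-none">meta</span>')` keeps the paragraph text and drops the span. Matching whole class tokens rather than substrings keeps a class like `panel-title` from being mistaken for a hidden one.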

Issue resolved by this Pull Request:
Resolves #1112

Checklist:

  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

…ndling

   - Add specialized handlers for Bootstrap accordion components to properly extract
     questions from panel-title elements
   - Implement is_hidden_element() method to detect and skip content with hidden
     classes, styles, and attributes
   - Update walk(), analyze_tag(), and extract_text_recursively() to filter out
     hidden elements
   - Add comprehensive test suite with direct method tests and example HTML files

   This fixes two issues:
   1. Missing questions in accordion components
   2. Unwanted extraction of hidden metadata content

   Tests: tests/test_html_enhanced.py

Signed-off-by: Ulan.Yisaev <[email protected]>

mergify bot commented Mar 4, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
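For anyone who wants to check a title locally before pushing, the pattern above can be tried with Python's `re` module. This is only a sketch: `follows_convention` is a hypothetical helper, and it assumes the bot's `~=` operator behaves like a regex search with this pattern.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_convention(title):
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None
```

Under that assumption, a title beginning with `fix(html-backend):` passes, while one without a type prefix does not.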

Signed-off-by: Ulan.Yisaev <[email protected]>
dolfim-ibm previously approved these changes Mar 5, 2025
Contributor

@dolfim-ibm dolfim-ibm left a comment

@ulan-yisaev Thanks for your contribution. Looks good to me.

2 participants