Add XML file ingestion support #560

Merged
KaifAhmad1 merged 3 commits into semantica-agi:main from Luffy2208:feature/233-xml-ingestion-support
May 19, 2026
Conversation

@Luffy2208
Contributor

Description

Adds dedicated XML file ingestion support for Semantica.

This PR introduces a new XMLIngestor that parses local XML files into structured data rather than treating them as plain text. It extracts the nested element hierarchy, a flat element list, namespaces, attributes, document metadata, and optional validation results.

XML ingestion is also wired into the public ingest API through:

  • ingest_xml()
  • ingest_file(..., method="xml")
  • Unified .xml auto-detection via ingest("file.xml")
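
For readers unfamiliar with what "structured data" means here, the following is a minimal sketch of the kind of extraction described above, written against the standard library's xml.etree.ElementTree only. It is not the XMLIngestor implementation: the function name parse_xml_sketch and the result keys (tree, elements, namespaces, metadata) are hypothetical and chosen purely for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
SAMPLE = (
    b'<catalog xmlns:bk="urn:example:books">'
    b'<bk:book id="b1"><title>XML Basics</title></bk:book>'
    b"</catalog>"
)


def parse_xml_sketch(data: bytes) -> dict:
    """Parse XML bytes into a nested tree, a flat element list, and namespaces."""
    namespaces = {}
    root = None
    # "start-ns" events yield (prefix, uri) pairs; "start" events yield elements.
    for event, payload in ET.iterparse(io.BytesIO(data), events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            namespaces[prefix] = uri
        elif root is None:
            root = payload  # the first "start" event is the document root

    def to_node(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        return {
            "tag": elem.tag,  # namespaced tags appear as "{uri}local"
            "path": path,
            "attributes": dict(elem.attrib),
            "text": (elem.text or "").strip(),
            "children": [to_node(child, path) for child in elem],
        }

    tree = to_node(root, "")

    # Flatten the nested tree in document order, dropping the "children" key.
    elements = []

    def walk(node):
        elements.append({k: v for k, v in node.items() if k != "children"})
        for child in node["children"]:
            walk(child)

    walk(tree)

    return {
        "tree": tree,
        "elements": elements,
        "namespaces": namespaces,
        "metadata": {"root_tag": root.tag, "element_count": len(elements)},
    }


result = parse_xml_sketch(SAMPLE)
```

The nested tree preserves parent/child relationships, while the flat list is convenient for filtering by tag or path; the actual field names on XMLIngestionData may differ.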

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring

Related Issues

Closes #233
Fixes #233


Changes Made

  • Added semantica/ingest/xml_ingestor.py with:

    • XMLIngestor
    • XMLIngestionData
  • Implemented XML parsing with:

    • Namespace support
    • Element hierarchy extraction
    • Flat element extraction
    • Attribute extraction
    • Metadata extraction
  • Added optional:

    • XSD schema validation
    • DTD validation
    • Clear validation reports
  • Added malformed XML handling using:

    • ProcessingError
    • ValidationError
  • Added:

    • ingest_xml() convenience method
    • XML routing in unified ingest()
  • Registered XML ingestion methods in the ingest method registry.

  • Exported from semantica.ingest:

    • XMLIngestor
    • XMLIngestionData
    • ingest_xml
  • Added XML ingestion tests covering:

    • Parsing
    • Namespaces
    • Attributes
    • XSD validation
    • DTD validation
    • Malformed XML handling
    • Directory ingestion
    • Unified dispatch behavior
  • Updated ingest documentation and usage examples for XML ingestion.
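
The malformed-XML handling listed above can be pictured roughly as follows. This is a sketch under assumptions, not the PR's code: the ProcessingError class below is a local stand-in (Semantica's own exception class is not reproduced here), parse_or_raise is a hypothetical helper name, and the standard library parser is used in place of whatever parser the ingestor actually wraps.

```python
import xml.etree.ElementTree as ET


class ProcessingError(Exception):
    """Local stand-in for illustration; not Semantica's ProcessingError."""


def parse_or_raise(text: str, source: str = "<string>") -> ET.Element:
    """Parse XML text, converting parser errors into a domain-level error."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as exc:
        # ParseError.position is a (line, column) tuple.
        line, column = exc.position
        raise ProcessingError(
            f"Malformed XML in {source} at line {line}, column {column}: {exc}"
        ) from exc
```

Wrapping the parser exception keeps callers independent of the underlying XML library, and chaining with "from exc" preserves the original error for debugging.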


Testing

  • Tested locally
  • Added tests for new functionality
  • Package builds successfully (python -m build)

Live testing

[Two screenshots from live testing; images not preserved]

Test Commands

python -m py_compile \
  semantica/ingest/xml_ingestor.py \
  semantica/ingest/methods.py \
  semantica/ingest/__init__.py \
  semantica/ingest/registry.py \
  tests/ingest/test_xml_ingestor.py

python -m pytest \
  tests/ingest/test_xml_ingestor.py \
  tests/ingest/test_file_ingestor.py \
  tests/ingest/test_ingestors.py \
  tests/ingest/test_optional_imports.py \
  -q

@KaifAhmad1 KaifAhmad1 self-requested a review May 19, 2026 08:50
…n keys

- Add test_xml_ingestor_ingests_string to cover the public ingest_string()
  method which had no test coverage
- Document all source_type return keys in the ingest() docstring so callers
  know to use result["xml"] rather than result["data"] for XML sources
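
To make the return-key point concrete, here is a hypothetical dispatcher sketch. It only illustrates the pattern the commit message describes, where the payload key depends on the detected source type; ingest_sketch, the "source_type" key in the returned dict, and the plain-text fallback are all assumptions and do not reflect the real ingest() signature or return shape.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def ingest_sketch(source: str) -> dict:
    """Hypothetical dispatcher: the payload key depends on the source type."""
    suffix = Path(source).suffix.lower()
    if suffix == ".xml":
        # XML payloads live under "xml", not "data".
        return {"source_type": "xml", "xml": ET.parse(source).getroot()}
    # Assumed fallback for this sketch: treat everything else as plain text.
    return {"source_type": "file", "data": Path(source).read_text(encoding="utf-8")}
```

A caller that always reads result["data"] would hit a KeyError for XML sources under this pattern, which is why documenting the per-type keys matters.
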
Contributor

@KaifAhmad1 KaifAhmad1 left a comment

PR #560 — XML File Ingestion Support

Author: @Luffy2208 | Reviewer: @KaifAhmad1 | Closes: #233


What It Does

  • Adds XMLIngestor and XMLIngestionData for parsing local XML files into structured data
  • Extracts nested element tree, flat element list, namespaces, attributes, and document metadata
  • Optional XSD and DTD validation with detailed error reports
  • Wired into the public API via ingest_xml(), ingest_file(..., method="xml"), and ingest("file.xml") auto-detection

Issues Found & Fixed

  • Missing ingest_string test — public method had no coverage; test added in bc6c443
  • ingest() return key undocumented — XML returns result["xml"], not result["data"]; all return keys documented in bc6c443
  • Changelog missing: [Unreleased] entry added in a397fb5

Security

  • resolve_entities=False and no_network=True by default — blocks XXE injection
  • huge_tree and allow_network are explicit opt-ins
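
The option names resolve_entities, no_network, and huge_tree match keyword arguments of lxml's XMLParser, so the defaults above plausibly map onto a parser configured as in the sketch below. This assumes lxml is the underlying library (the PR text does not say so explicitly), parse_untrusted is a hypothetical helper name, and allow_network is presumably a Semantica-level option rather than an lxml one.

```python
try:
    from lxml import etree  # third-party; treated as optional in this sketch
except ImportError:
    etree = None


def parse_untrusted(xml_bytes: bytes):
    """Parse XML with entity resolution and network access disabled."""
    if etree is None:
        raise RuntimeError("lxml is required for this sketch")
    parser = etree.XMLParser(
        resolve_entities=False,  # leave entity references unexpanded
        no_network=True,         # never fetch DTDs or entities over the network
        huge_tree=False,         # keep libxml2's built-in size limits
    )
    return etree.fromstring(xml_bytes, parser)
```

With resolve_entities=False, a document that declares an entity and references it in element content is parsed without substituting the entity's replacement text, which is the property that blunts XXE payloads.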

Tests

  • 8/8 XML tests pass (7 original + 1 added)
  • 18/18 existing ingest tests pass — no regressions

Verdict

Approved. Follows project conventions, secure by default. All review issues resolved on the PR branch before merge.

@KaifAhmad1 KaifAhmad1 merged commit 9823274 into semantica-agi:main May 19, 2026
5 checks passed
Development

Successfully merging this pull request may close these issues.

[FEATURE] XML File Ingestion Support

2 participants