feat: add Docling JSON ingestion #783

Draft
wants to merge 1 commit into
base: main

vagenas
Contributor

@vagenas vagenas commented Jan 21, 2025

Resolves #781.

mergify bot commented Jan 21, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Comment on lines +22 to +23
def is_valid(self) -> bool:
    return True
Contributor

According to the existing implementations, this method should return True only if the content provided via path_or_stream is supported by this backend.
I think the logic in convert should move to the constructor, and is_valid should just report whether the constructor succeeded.
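
A minimal sketch of that suggestion, for illustration only. The class name JsonDocBackendSketch and its internals are hypothetical and are not taken from the PR or from Docling's actual backend classes; it also assumes that a supported payload is a JSON object whose schema_name field equals "DoclingDocument". Parsing happens once in the constructor, and is_valid only reports the outcome.

```python
import json
from io import BytesIO
from pathlib import Path
from typing import Optional, Union


class JsonDocBackendSketch:
    """Hypothetical stand-in for a backend; not Docling's real class."""

    def __init__(self, path_or_stream: Union[Path, BytesIO]) -> None:
        self._doc: Optional[dict] = None
        try:
            if isinstance(path_or_stream, Path):
                raw = path_or_stream.read_bytes()
            else:
                raw = path_or_stream.getvalue()
            data = json.loads(raw)
            # Assumption: supported content is a JSON object tagged with
            # this schema name.
            if isinstance(data, dict) and data.get("schema_name") == "DoclingDocument":
                self._doc = data
        except (OSError, ValueError):
            # Unreadable or malformed input leaves the backend invalid.
            self._doc = None

    def is_valid(self) -> bool:
        # True only if the constructor managed to load supported content.
        return self._doc is not None

    def convert(self) -> dict:
        # No parsing here: just hand back what the constructor loaded.
        if self._doc is None:
            raise RuntimeError("backend holds no valid document")
        return self._doc
```

With this shape, a caller can check is_valid() right after construction and skip convert() for unsupported input, instead of discovering the problem during conversion.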

Comment on lines +342 to +348
elif mime == "application/json":
    if (
        InputFormat.JSON_DOCLING in formats
        and '"schema_name": "DoclingDocument"' in content_str
    ):
        input_format = InputFormat.JSON_DOCLING

Contributor

I agree with keeping the disambiguation here, following the XML pattern.
Just FYI, for a later PR: we have considered adding an abstract method to DeclarativeDocumentBackend that would require each backend to implement this kind of disambiguation, given a fragment of the document content. That would avoid backend-specific logic in this method.
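
One possible shape for that follow-up, sketched under assumptions: the names DeclarativeBackendSketch, matches_fragment, and pick_backend are hypothetical and do not exist in Docling. Only the substring check is taken from the diff above. Each backend declares whether a content fragment belongs to it, and the format-guessing code only iterates over candidates.

```python
from abc import ABC, abstractmethod
from typing import Optional, Sequence, Type


class DeclarativeBackendSketch(ABC):
    """Hypothetical base class; the hook name is an assumption."""

    @classmethod
    @abstractmethod
    def matches_fragment(cls, mime: str, content_str: str) -> bool:
        """Return True if this backend recognizes the content fragment."""


class DoclingJsonBackendSketch(DeclarativeBackendSketch):
    @classmethod
    def matches_fragment(cls, mime: str, content_str: str) -> bool:
        # Same check as in the diff, now owned by the backend itself.
        return (
            mime == "application/json"
            and '"schema_name": "DoclingDocument"' in content_str
        )


def pick_backend(
    mime: str,
    content_str: str,
    candidates: Sequence[Type[DeclarativeBackendSketch]],
) -> Optional[Type[DeclarativeBackendSketch]]:
    # The caller stays generic: no backend-specific conditions here.
    for backend in candidates:
        if backend.matches_fragment(mime, content_str):
            return backend
    return None
```

The first matching candidate wins in this sketch, so the order of the candidates list would decide ties between backends that accept the same MIME type.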

Development

Successfully merging this pull request may close these issues.

Support JSON format with DoclingDocument as InputFormat
2 participants