consolidate advanced chunker notebook #310

Closed

vagenas
Contributor

@vagenas vagenas commented Nov 11, 2024

Main improvements with this PR:

  • Set `chunk.text` directly to the updated text (including any headings and captions)
  • Add typing
  • Switch to list comprehensions where possible
  • Encapsulate all methods within the new chunker implementation
  • Use a dataclass instead of an unmanaged dictionary
  • List dependencies in the setup installation line
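
As a rough illustration of the first, third, and fifth points, here is a minimal sketch using only the standard library. The names `Chunk`, `contextualize`, and `contextualize_all` are hypothetical stand-ins chosen for this example; they are not the notebook's actual classes or docling's API, and the order in which headings and captions are prepended is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical typed record replacing an unmanaged dict (not docling's API)."""

    text: str
    headings: list[str] = field(default_factory=list)
    captions: list[str] = field(default_factory=list)


def contextualize(chunk: Chunk) -> Chunk:
    """Return a new chunk whose text is prefixed with its headings and captions."""
    # Assumed ordering: headings first, then captions, then the body text.
    parts = [p for p in [*chunk.headings, *chunk.captions, chunk.text] if p]
    return Chunk(
        text="\n".join(parts),
        headings=list(chunk.headings),
        captions=list(chunk.captions),
    )


def contextualize_all(chunks: list[Chunk]) -> list[Chunk]:
    """Apply contextualize to every chunk with a list comprehension."""
    return [contextualize(c) for c in chunks]
```

With this shape, downstream code reads `chunk.text` and gets the enriched string directly, and a type checker can flag a misspelled field name that a plain dictionary would silently accept.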

jwm4 and others added 5 commits November 1, 2024 08:41
Signed-off-by: Bill Murdock <[email protected]>
Earlier versions used `doc.name` as the overall title of the document, but the discussion suggested it is probably better to trust `doc_chunk.meta.headings` to carry the title information sooner or later. So I've removed all the special title handling and now rely on the headings alone.

Signed-off-by: Bill Murdock <[email protected]>
Add typing, switch to list comprehensions where possible,
encapsulate all methods within new chunker implementation,
use dataclass instead of unmanaged dictionary,
list dependencies in setup installation line

Signed-off-by: Panos Vagenas <[email protected]>
@vagenas vagenas requested a review from jwm4 November 11, 2024 16:23
Signed-off-by: Panos Vagenas <[email protected]>
Contributor

@PeterStaar-IBM PeterStaar-IBM left a comment

lgtm

@PeterStaar-IBM PeterStaar-IBM marked this pull request as draft November 18, 2024 08:31
Base automatically changed from jwm4-chunking-example-1 to advanced-chunking-example November 19, 2024 22:12
@vagenas
Contributor Author

vagenas commented Nov 19, 2024

Changes directly merged to https://github.com/DS4SD/docling/tree/advanced-chunking-example as part of ce38baf.

@vagenas vagenas closed this Nov 19, 2024
@vagenas vagenas deleted the consolidate-advanced-chunker branch November 19, 2024 22:41
3 participants