Skip to content

Conversation

@maksymbuleshnyi
Copy link
Contributor

@maksymbuleshnyi maksymbuleshnyi commented Oct 14, 2025

Important

Enhance FileReadTool with intelligent file type detection and content extraction using converters and LLMs for various file types.

  • Behavior:
    • FileReadTool in file_tools.py now detects file types and uses converters for text extraction.
    • Supports PDF, DOCX, PPTX, HTML, TXT, IMAGE, and more.
    • For large files, returns chunked content.
    • Uses LLM for image and PDF processing if supported.
  • Converters:
    • Adds detect_file_type() function in file_tools.py for file type detection.
    • Introduces DEFAULT_FILE_TYPE_TO_CONVERTER_CLASS_MAP for mapping file types to converters.
    • LLMImageConverter and LLMPDFConverter in llm_text_extractor.py handle image and PDF text extraction.
  • File Types:
    • New FileType enum and EXTENSION_MAP in file_types.py for file type management.
  • Tests:
    • Updates test_file_tools.py to test new file reading and writing functionalities.

This description was created by Ellipsis for eaeafec. You can customize this summary. It will automatically update as commits are pushed.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Changes requested ❌

Reviewed everything up to 3201378 in 2 minutes and 9 seconds. Click for details.
  • Reviewed 566 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. dynamiq/nodes/converters/llm_text_extractor.py:376
  • Draft comment:
    Ensure run_kwargs is a dict and Python>=3.9 is assumed, as the dict union operator (|) is used.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. dynamiq/utils/file_types.py:22
  • Draft comment:
    Ensure EXTENSION_MAP is kept updated as new file types and converters are added.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is asking the PR author to ensure that EXTENSION_MAP is kept updated, which is a general reminder rather than a specific code suggestion or issue. It doesn't point out a specific problem or suggest a specific improvement.
3. dynamiq/nodes/tools/file_tools.py:223
  • Draft comment:
    Typographical error: The print statement outputs 'starting converton'. Did you mean 'starting conversion' or 'starting converter'? Please correct the spelling.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% Print statements are typically used for debugging and should generally be removed before code is merged. The typo is in a debug print statement that likely shouldn't be in the final code anyway. The comment is technically correct but focuses on the wrong issue - instead of fixing the typo, the print statements should probably be removed entirely or replaced with proper logging. Maybe the print statements are intentionally left in for ongoing debugging purposes? Maybe the typo could cause confusion for other developers? Even if the prints are intentionally left in, fixing a typo in debug prints is extremely low value. The code already has proper logging statements elsewhere. The comment should be removed as it focuses on an unimportant typo in what appears to be temporary debug code.
4. dynamiq/nodes/tools/file_tools.py:233
  • Draft comment:
    Typographical/placeholder text: The print message 'Asdasdasd' appears to be a placeholder or typo. Consider removing or correcting it.
  • Reason this comment was not posted:
    Marked as duplicate.
5. dynamiq/nodes/tools/file_tools.py:364
  • Draft comment:
    Typo/Concatenation: In the log warning message for no extracted text content, the two string literals are concatenated as "no text content extracted from file," + "falling back to raw content" resulting in "file,falling" without a space. Consider adding a leading space (e.g., " falling back to raw content") to ensure proper spacing in the log message.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% While technically correct that there's a missing space, this is an extremely minor issue that only affects a log message. Log messages are for debugging purposes and a missing space doesn't impact functionality. This feels like the kind of nitpicky comment that adds noise without real value. The missing space could make the log message slightly harder to read for developers debugging issues. The fix is trivial and would improve readability. While true that it would be slightly more readable, this is an extremely low-priority cosmetic issue in debug output. The message is still perfectly understandable. This comment should be deleted as it's too minor of an issue to warrant a PR comment. Log message formatting is not important enough to flag in code review.

Workflow ID: wflow_Xh3Fs8yff5m2G6Qj

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed f17cdbe in 1 minute and 3 seconds. Click for details.
  • Reviewed 38 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. dynamiq/nodes/tools/file_tools.py:222
  • Draft comment:
    Removed debug print statements ('print("starting converton ")' and 'print(detected_type)'). Using logging is preferred over print for debugging.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
2. dynamiq/nodes/tools/file_tools.py:240
  • Draft comment:
    Removed debug print of merged kwargs. Consider using logging.debug if internal state info is needed.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. dynamiq/nodes/tools/file_tools.py:338
  • Draft comment:
    Removed debug printing of text_content in execute. Logging should be used instead of print statements for production.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_vzsJwfO8BApS3V5B

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed 1ba617e in 1 minute and 4 seconds. Click for details.
  • Reviewed 82 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 5 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. dynamiq/nodes/tools/file_tools.py:2
  • Draft comment:
    Good: Added 'import os' for path operations.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. dynamiq/nodes/tools/file_tools.py:47
  • Draft comment:
    Using os.path.splitext for file extension extraction is more robust than manual splitting.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
3. dynamiq/nodes/tools/file_tools.py:163
  • Draft comment:
    Removed redundant inline comments in _setup_converters improves code clarity.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
4. dynamiq/nodes/tools/file_tools.py:223
  • Draft comment:
    Cleaned up inline comments in _process_file_with_converter; logic remains clear.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
5. dynamiq/nodes/tools/file_tools.py:318
  • Draft comment:
    Extraneous comments around file processing setup in execute were removed, enhancing readability.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_8xbIqIJMjMEefccL

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed 237c704 in 1 minute and 0 seconds. Click for details.
  • Reviewed 29 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 3 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. tests/unit/nodes/tools/test_file_tools.py:49
  • Draft comment:
    Update assert to compare string directly (no decode) since FileReadTool now returns text.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, explaining a change that was made. It doesn't provide a suggestion, ask for confirmation, or point out a potential issue. It seems to be explaining the rationale behind a change, which is not necessary according to the rules.
2. tests/unit/nodes/tools/test_file_tools.py:140
  • Draft comment:
    Update integration test to check for string output (remove .decode()) for integration file content.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
3. tests/unit/nodes/tools/test_file_tools.py:150
  • Draft comment:
    Update assertion for reading file from a different storage: expect string output instead of decoding bytes.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment seems to be purely informative, as it is just stating what the update is about without providing any actionable feedback or suggestions. It doesn't ask for confirmation or suggest any changes.

Workflow ID: wflow_tpC7qV9wIMbR3DPL

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@github-actions
Copy link

github-actions bot commented Oct 14, 2025

Coverage

Coverage Report •
FileStmtsMissCoverMissing
dynamiq/nodes/converters
   llm_text_extractor.py19211440%14, 66–68, 130, 140, 148–150, 162, 190–192, 194, 202, 226, 228–231, 238–239, 246–247, 249–252, 254–256, 258–259, 261–266, 268, 280, 282, 303, 305, 307–309, 311, 315, 339–340, 345, 347–348, 357, 361, 376, 382, 384–387, 401–402, 404–409, 422, 445–451, 454–455, 470–472, 524, 526–528, 556–558, 560, 568, 591, 593–596, 603–604, 611–612, 614–617, 619–620, 622–627, 629
dynamiq/nodes/tools
   file_tools.py2156967%41, 44–45, 51–52, 59–60, 62–64, 174, 221–222, 245, 247, 250, 252–253, 263, 274–277, 280, 300, 326, 330, 333, 337, 342, 346–347, 353–355, 358, 360, 362, 364–366, 384, 386, 388–389, 391, 393–396, 398, 409, 423, 425, 427–428, 430, 432–435, 437, 448, 510–512, 562–564
dynamiq/utils
   file_types.py200100% 
TOTAL18122581767% 

Tests Skipped Failures Errors Time
955 35 💤 0 ❌ 0 🔥 3m 31s ⏱️

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed f7e4afc in 1 minute and 56 seconds. Click for details.
  • Reviewed 62 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. dynamiq/nodes/tools/file_tools.py:68
  • Draft comment:
    Ensure file type detection yields these new enum values. PPTX files may be detected as FileType.PRESENTATION rather than PPTX_PRESENTATION, so PPTXFileConverter might not be used.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
2. dynamiq/utils/file_types.py:31
  • Draft comment:
    Overlapping extensions for PPTX: 'pptx' appears in both FileType.PRESENTATION and FileType.PPTX_PRESENTATION, which can cause ambiguity during detection.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_WMyL9rUsiafLSDnc

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@vitalii-dynamiq
Copy link
Contributor

@olbychos @tyaroshko @delamainer please review

tyaroshko
tyaroshko previously approved these changes Oct 20, 2025
Copy link
Collaborator

@tyaroshko tyaroshko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

(but we'll probably need to update this PR when we make dynamiq-extra public in order to avoid code duplication for file type mapping and detection)

@maksymbuleshnyi
Copy link
Contributor Author

Image 20 10 2025 at 17 40

Example of trace

olbychos
olbychos previously approved these changes Oct 21, 2025
Copy link
Contributor

@olbychos olbychos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, lgtm; any other important file types worth adding (like excel,csv)?

@maksymbuleshnyi
Copy link
Contributor Author

Overall, lgtm; any other important file types worth adding (like excel,csv)?

This is generally supported because we will fallback to pure content, which works well with csv

Copy link
Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed eaeafec in 37 seconds. Click for details.
  • Reviewed 20 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. dynamiq/nodes/tools/file_tools.py:224
  • Draft comment:
    Removed redundant comment; file.seek(0) clearly resets file pointer.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. dynamiq/nodes/tools/file_tools.py:312
  • Draft comment:
    Removed redundant comment; the code is self-explanatory.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_b0b3SsplkhockqmX

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

Copy link
Collaborator

@acoola acoola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Please make sure that FileReadTool works fine with YAML load/dump, tracing and so on

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants