[bug] markitdown._markitdown.UnsupportedFormatException #222
Comments
I found that whenever the first few characters of the downloaded webpage are whitespace or line breaks, the program crashes, always with the same error message. It might be an issue with the logic for determining the file type.
Fixes microsoft#222

Address issue with `markitdown.convert_stream` crashing on input with leading blank characters or line breaks.

* Modify `convert_stream` function in `src/markitdown/_markitdown.py` to strip leading blank characters or line breaks from the input stream using a new helper function `_strip_leading_blanks`.
* Add a test case in `tests/test_markitdown.py` to verify that `markitdown.convert_stream` handles input with leading blank characters or line breaks correctly.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/markitdown/issues/222?shareId=XXXX-XXXX-XXXX-XXXX).
I am investigating. I think the problem must be lower in the stack, perhaps puremagic or similar. We shouldn't need to trim the input stream. UPDATE: the problem is indeed here: markitdown/src/markitdown/_markitdown.py Line 1596 in 731b39e
I would really rather not change the file (e.g., by trimming it) just so that detection works. By all rights, those spaces might be meaningful in some files. I would like to investigate other approaches to this problem so that any corrections are narrow.
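
One way to keep the correction narrow, in line with the preference above for not trimming the file, is to run detection on a stripped copy of the first bytes and then restore the stream position. The following is only a minimal sketch of that idea: `looks_like_html` and `peek_size` are hypothetical names, not part of markitdown or puremagic, and this is not the change made in the actual fix.

```python
import io


def looks_like_html(stream, peek_size=1024):
    """Hypothetical helper: guess whether a binary stream holds HTML.

    Only a copy of the first bytes is stripped for detection; the stream
    is rewound afterwards, so the converter still sees the original data.
    """
    start = stream.tell()
    try:
        head = stream.read(peek_size)
    finally:
        # Restore the position even if the read fails.
        stream.seek(start)
    # Strip leading ASCII whitespace from the copy, not from the stream.
    probe = head.lstrip().lower()
    return probe.startswith((b"<!doctype html", b"<html"))


data = b"\n\n   <!DOCTYPE html><html><body>hi</body></html>"
stream = io.BytesIO(data)
is_html = looks_like_html(stream)
# The stream content and position are unchanged after detection.
unchanged = stream.read() == data
```

Because the leading whitespace is dropped only from the probe, any spaces that are meaningful to a downstream converter remain in the input.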
Fixed in #260
Here are the steps to reproduce the issue:
OS: Windows
Shell: PowerShell v7.4.6
The error message: