[bug] markitdown._markitdown.UnsupportedFormatException #222

Closed
doggy8088 opened this issue Dec 27, 2024 · 3 comments

doggy8088 commented Dec 27, 2024

Here are the steps to reproduce the issue:

OS: Windows

Shell: PowerShell v7.4.6

pip install markitdown
curl -s https://www.duotify.com -o a.htm
type a.htm | markitdown

The error message:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "C:\Python312\Lib\site-packages\markitdown\__main__.py", line 38, in main
    result = markitdown.convert_stream(sys.stdin.buffer)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\site-packages\markitdown\_markitdown.py", line 1142, in convert_stream
    result = self._convert(temp_path, extensions, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python312\Lib\site-packages\markitdown\_markitdown.py", line 1260, in _convert
    raise UnsupportedFormatException(
markitdown._markitdown.UnsupportedFormatException: Could not convert 'C:\Users\user\AppData\Local\Temp\tmpzn_zxenq' to Markdown. The formats [] are not supported.

doggy8088 (Author) commented

I found that as long as the first few characters of the downloaded webpage are blank or line breaks, the program will crash, always showing the same error message.

It might be an issue with the logic for determining file type.

doggy8088 added a commit to doggy8088/markitdown that referenced this issue Dec 27, 2024
Fixes microsoft#222

Address issue with `markitdown.convert_stream` crashing on input with leading blank characters or line breaks.

* Modify `convert_stream` function in `src/markitdown/_markitdown.py` to strip leading blank characters or line breaks from the input stream using a new helper function `_strip_leading_blanks`.
* Add a test case in `tests/test_markitdown.py` to verify that `markitdown.convert_stream` handles input with leading blank characters or line breaks correctly.

---

For more details, open the [Copilot Workspace session](https://copilot-workspace.githubnext.com/microsoft/markitdown/issues/222?shareId=XXXX-XXXX-XXXX-XXXX).

afourney (Member) commented Jan 3, 2025

I am investigating. I think the problem must be lower in the stack, perhaps puremagic or similar. We shouldn't need to trim the input stream.

UPDATE: the problem is indeed here:

guesses = puremagic.magic_file(path)

The magic_file method does not detect the file as HTML because the content does not begin with an <html> or similar tag.

I would really rather not change the file (e.g., by trimming it) just so detection works. By all rights, those spaces might be meaningful in some files. I would like to investigate other approaches to addressing this problem so that any corrections are narrow.
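
For illustration, here is a minimal sketch of one narrow approach in that spirit: run detection on the untouched bytes first, and only if that yields no guesses, retry on a view of the data that skips leading whitespace. The input itself is never modified. The names `skip_leading_whitespace`, `guess_with_retry`, and `toy_detect` are hypothetical; `toy_detect` is only a stand-in for a signature-based detector such as `puremagic.magic_file`, and this is not necessarily what #260 does.

```python
import io
from typing import Callable, List


def skip_leading_whitespace(stream: io.BufferedIOBase) -> int:
    """Advance a seekable binary stream past leading ASCII whitespace.

    Returns the number of bytes skipped. The underlying data is not changed.
    """
    skipped = 0
    while True:
        byte = stream.read(1)
        if not byte:  # reached end of stream
            break
        if not byte.isspace():
            stream.seek(-1, io.SEEK_CUR)  # step back onto the non-blank byte
            break
        skipped += 1
    return skipped


def guess_with_retry(data: bytes, detect: Callable[[bytes], List[str]]) -> List[str]:
    """Detect on the raw bytes first; retry past leading blanks only if needed."""
    guesses = detect(data)
    if guesses:
        return guesses
    stream = io.BytesIO(data)
    skip_leading_whitespace(stream)
    return detect(stream.read())


def toy_detect(data: bytes) -> List[str]:
    """Hypothetical stand-in for a magic-number detector (assumption, not puremagic)."""
    head = data[:15].lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return [".html"]
    return []


page = b"\n\n   <!DOCTYPE html><html><body>hi</body></html>"
print(toy_detect(page))                    # no match on the raw bytes
print(guess_with_retry(page, toy_detect))  # match after skipping blanks
```

Because the retry only happens when the first pass returns nothing, files that are already detected correctly take exactly the same path as before, and the bytes handed to the converter are left as they were.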

afourney (Member) commented Jan 4, 2025

Fixed in #260

afourney closed this as completed Jan 4, 2025