fix: accept any readable file-like object in convert_to_bytes() #4275

Closed
weiguangli-o wants to merge 1 commit into Unstructured-IO:main from weiguangli-o:codex/unstructured-4097-zipextfile
@weiguangli-o

Closes #4097

Problem

When a user opens a file from inside a zip archive via zipfile.ZipFile.open() and passes the resulting ZipExtFile into partition(), the call crashes with:

ValueError: Invalid file-like object type

This happens because convert_to_bytes() only accepts a hardcoded whitelist of types (BytesIO, BufferedReader, SpooledTemporaryFile, TextIOWrapper). Any other IO[bytes] implementation, including ZipExtFile, falls through to the final raise and is rejected.
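
A minimal sketch of why ZipExtFile misses such a whitelist, assuming the check is isinstance-based (the description above lists the types but not the exact check). The `WHITELIST` tuple is a name made up for this illustration; it is not taken from the library source.

```python
import zipfile
from io import BufferedReader, BytesIO, TextIOWrapper
from tempfile import SpooledTemporaryFile

# Types named in the description above. Assumption: they are tested with isinstance().
WHITELIST = (BytesIO, BufferedReader, SpooledTemporaryFile, TextIOWrapper)

# Build a small zip archive in memory.
buf = BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"Hello from zip!")
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    with zf.open("hello.txt") as f:
        is_zip_ext_file = isinstance(f, zipfile.ZipExtFile)
        in_whitelist = isinstance(f, WHITELIST)
        has_read = callable(getattr(f, "read", None))

print(is_zip_ext_file, in_whitelist, has_read)  # True False True
```

ZipExtFile derives from io.BufferedIOBase rather than from any of the four listed types, so a type check rejects it even though it is perfectly readable.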

Fix

Added a duck-typing fallback before the final raise ValueError(...): if the object has a .read() method, read it; if it also has .seek(), reset the cursor so the caller can re-read the file. This preserves all existing behavior for the whitelisted types while also covering ZipExtFile, GzipFile, tarfile.ExFileObject, and any other standard IO[bytes] implementation.
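
The fallback described above could look roughly like the sketch below. `read_file_like` is a hypothetical stand-in, not the code from this PR, and two details are assumptions: "reset the cursor" is taken to mean `seek(0)`, and a failed seek on a non-seekable stream is silently ignored.

```python
import gzip
from io import BytesIO
from typing import Any


def read_file_like(file: Any) -> bytes:
    """Hypothetical stand-in for the duck-typing branch; not the PR's actual code."""
    read = getattr(file, "read", None)
    if not callable(read):
        # Same error the description quotes for unsupported objects.
        raise ValueError("Invalid file-like object type")

    data = read()

    seek = getattr(file, "seek", None)
    if callable(seek):
        try:
            seek(0)  # assumption: "reset the cursor" means rewinding to the start
        except (OSError, ValueError):
            pass  # assumption: tolerate streams that expose seek() but cannot seek
    return data


# Usage with GzipFile, one of the types the description says is now covered.
raw = BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    gz.write(b"Hello from gzip!")
raw.seek(0)

with gzip.GzipFile(fileobj=raw, mode="rb") as gz:
    first = read_file_like(gz)
    second = gz.read()  # cursor was rewound, so the caller can read again

print(first == second)  # True
```

Checking for `.read()` instead of a concrete type keeps the error path intact: an object without `.read()` still raises ValueError.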

Testing

Four new unit tests added in test_unstructured/partition/common/test_common.py:

| Test | Verifies |
| --- | --- |
| it_reads_a_ZipExtFile | ZipExtFile from a zip archive is accepted and returns correct bytes |
| it_resets_cursor_after_reading_a_ZipExtFile | File cursor is reset so the caller can re-read |
| it_reads_a_generic_IO_bytes_with_read_method | Any object with .read() is accepted (duck-typing) |
| it_raises_on_a_non_readable_object | Non-file objects still raise ValueError |

To verify locally:

import zipfile
from io import BytesIO
from unstructured.partition.common.common import convert_to_bytes

buf = BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", b"Hello from zip!")
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    with zf.open("hello.txt") as f:
        print(convert_to_bytes(f))  # b'Hello from zip!'

convert_to_bytes() only accepted a hardcoded set of file types (BytesIO,
BufferedReader, SpooledTemporaryFile, TextIOWrapper). Any other IO[bytes]
type — such as zipfile.ZipExtFile returned by zipfile.ZipFile.open() —
was rejected with "ValueError: Invalid file-like object type".

Add a duck-typing fallback before raising: if the object has a .read()
method, read it; if it also has .seek(), reset the cursor for the caller.
This fixes the crash when partitioning text files loaded from zip archives
and also covers other standard IO[bytes] types like GzipFile and
tarfile.ExFileObject.

Closes Unstructured-IO#4097
@weiguangli-o
Author

Closing — shifting focus to AI Agent core projects. Thank you for your time!

Successfully merging this pull request may close these issues.

bug/a text file cannot be loaded from a ZipExtFile
