Reading a Parquet file from the dataset revision refs/convert/parquet (generated by datasets-server) with DuckDB fails with the following error:
D select * from 'https://huggingface.co/datasets/Pavithra/sampled-code-parrot-train-100k/resolve/refs%2Fconvert%2Fparquet/Pavithra--sampled-code-parrot-train-100k/parquet-train.parquet' limit 10;
Error: Invalid Input Error: No magic bytes found at end of file 'https://huggingface.co/datasets/Pavithra/sampled-code-parrot-train-100k/resolve/refs%2Fconvert%2Fparquet/Pavithra--sampled-code-parrot-train-100k/parquet-train.parquet'
Not sure if this is an error in the Parquet generation logic. Should we validate that files are not corrupted before they are pushed to the dataset repo?
I was able to reproduce the same corruption error with polars:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andrea/.pyenv/versions/3.9.15/lib/python3.9/site-packages/polars/io/parquet/functions.py", line 123, in read_parquet
    return pl.DataFrame._read_parquet(
  File "/home/andrea/.pyenv/versions/3.9.15/lib/python3.9/site-packages/polars/dataframe/frame.py", line 865, in _read_parquet
    self._df = PyDataFrame.read_parquet(
exceptions.ArrowErrorException: ExternalFormat("File out of specification: The file must end with PAR1")
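Both errors come down to the same check: an unencrypted Parquet file starts and ends with the 4-byte magic `PAR1`. As a rough illustration of the kind of validation asked about here, below is a minimal sketch that tests only this framing on a local copy of the file. The function name `has_parquet_magic` is hypothetical and is not part of datasets-server; passing this check does not prove the file is otherwise valid.

```python
import os

PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration: it checks only the file framing,
    not the footer metadata or the row groups.
    """
    size = os.path.getsize(path)
    # Smallest possible layout: header magic (4 bytes)
    # + footer length (4 bytes) + footer magic (4 bytes).
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A truncated upload or an HTML error page saved under a `.parquet` name would fail this check, which is consistent with the "No magic bytes found at end of file" and "must end with PAR1" messages above.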
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
severo changed the title from "Corrupted Parquet file in refs/convert/parquet" to "Validate source Parquet files before linking in refs/convert/parquet" on Jun 19, 2024
Parquet file:
https://huggingface.co/datasets/Pavithra/sampled-code-parrot-train-100k/resolve/refs%2Fconvert%2Fparquet/Pavithra--sampled-code-parrot-train-100k/parquet-train.parquet