Validate source Parquet files before linking in refs/convert/parquet #1504

Open
AndreaFrancis opened this issue Jul 6, 2023 · 5 comments
Labels: bug (Something isn't working), P2 (Nice to have)

Comments

@AndreaFrancis
Contributor

Reading a Parquet file from the dataset revision refs/convert/parquet (generated by datasets-server) with DuckDB throws the following error:

D select * from 'https://huggingface.co/datasets/Pavithra/sampled-code-parrot-train-100k/resolve/refs%2Fconvert%2Fparquet/Pavithra--sampled-code-parrot-train-100k/parquet-train.parquet' limit 10;
Error: Invalid Input Error: No magic bytes found at end of file 'https://huggingface.co/datasets/Pavithra/sampled-code-parrot-train-100k/resolve/refs%2Fconvert%2Fparquet/Pavithra--sampled-code-parrot-train-100k/parquet-train.parquet'

Parquet file:
https://huggingface.co/datasets/Pavithra/sampled-code-parrot-train-100k/resolve/refs%2Fconvert%2Fparquet/Pavithra--sampled-code-parrot-train-100k/parquet-train.parquet

I'm not sure whether this is an error in the Parquet generation logic. Should we validate that only non-corrupted files are pushed to the dataset repo?
I was able to reproduce the same corruption error with polars:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andrea/.pyenv/versions/3.9.15/lib/python3.9/site-packages/polars/io/parquet/functions.py", line 123, in read_parquet
    return pl.DataFrame._read_parquet(
  File "/home/andrea/.pyenv/versions/3.9.15/lib/python3.9/site-packages/polars/dataframe/frame.py", line 865, in _read_parquet
    self._df = PyDataFrame.read_parquet(
exceptions.ArrowErrorException: ExternalFormat("File out of specification: The file must end with PAR1")
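
Both errors say the same thing: the file does not end with the 4-byte `PAR1` marker that the Parquet format requires at the start and end of every file. A minimal sketch of that check for a local copy of the file, using only the standard library (the function name `has_parquet_magic` is made up for illustration and is not part of datasets-server):

```python
import os

PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file starts and ends with the 4-byte PAR1 marker."""
    size = os.path.getsize(path)
    # Smallest plausible layout: head magic + 4-byte footer length + tail magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Passing this check does not prove the file is readable; it only rules out the truncation-style failure reported above.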

@severo added the bug label on Jul 17, 2023
@severo
Collaborator

severo commented Jul 17, 2023

Do you know if this is an isolated or widespread case?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo added the P2 label on Aug 11, 2023
@severo
Collaborator

severo commented Aug 11, 2023

Should we test them?
-> in the same job runner that created them
-> in production, as a cron job

@AndreaFrancis
Contributor Author

> Do you know if this is an isolated or widespread case?

This is the only case I have seen.

I think we could validate at the job runner level. In this case, the source file itself is invalid: https://huggingface.co/datasets/Pavithra/sampled-code-parrot-train-100k/resolve/main/data/train-00000-of-00001.parquet. So maybe we could validate it before copying it to refs/convert/parquet?
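
A rough sketch of what such a gate could look like, assuming the source file is available on local disk. The names `check_parquet_footer` and `copy_if_valid` are hypothetical, and the real job runner works against the Hub rather than with `shutil`, so this only illustrates the order of operations (validate first, then copy):

```python
import os
import shutil
import struct

MAGIC = b"PAR1"


def check_parquet_footer(path):
    """Return None if the file trailer looks sane, else a short reason string."""
    size = os.path.getsize(path)
    if size < 12:
        return "file too small to be Parquet"
    with open(path, "rb") as f:
        if f.read(4) != MAGIC:
            return "missing PAR1 at start of file"
        # The last 8 bytes are: 4-byte little-endian footer length + PAR1.
        f.seek(-8, os.SEEK_END)
        trailer = f.read(8)
    if trailer[4:] != MAGIC:
        return "missing PAR1 at end of file"
    (footer_len,) = struct.unpack("<I", trailer[:4])
    # The footer must fit between the head magic and the 8-byte trailer.
    if footer_len == 0 or footer_len > size - 12:
        return "footer length out of range"
    return None


def copy_if_valid(src, dst):
    """Copy src to dst only when the structural check passes."""
    reason = check_parquet_footer(src)
    if reason is not None:
        raise ValueError(f"refusing to copy {src}: {reason}")
    shutil.copyfile(src, dst)
    return dst
```

This is only a structural check. A stricter gate would also parse the footer metadata with a Parquet library before accepting the file.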

@severo
Collaborator

severo commented Aug 18, 2023

Yes, good idea.

@severo changed the title from "Corrupted Parquet file in refs/convert/parquet" to "Validate source Parquet files before linking in refs/convert/parquet" on Jun 19, 2024

2 participants