How to show fan-in jobs' results in response ("pending" and "failed" keys) #1472

Open
polinaeterna opened this issue Jul 3, 2023 · 3 comments
Labels: api, P2 (Nice to have), question (Further information is requested)

Comments

@polinaeterna
Contributor

The cache entries of fan-in jobs contain the keys "pending" and "failed". For example, the config-level /parquet response has the following format (only the "parquet_files" key):

{
    "parquet_files": [
        {
            "dataset": "duorc",
            "config": "ParaphraseRC",
            "split": "test",
            "url": "https://huggingface.co/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/duorc-test.parquet",
            "filename": "duorc-test.parquet",
            "size": 6136591
        },
       ... # list of parquet files
    ]
}

whereas the dataset-level response also has the "pending" and "failed" keys:

{
    "parquet_files": [
        {
            "dataset": "duorc",
            "config": "ParaphraseRC",
            "split": "test",
            "url": "https://huggingface.co/datasets/duorc/resolve/refs%2Fconvert%2Fparquet/ParaphraseRC/duorc-test.parquet",
            "filename": "duorc-test.parquet",
            "size": 6136591
        },
       ... # list of parquet files
    ],
    "pending": [],
    "failed": []
}
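
For illustration, here is a minimal sketch of how a client might treat these keys today. It assumes (this is not documented anywhere, which is the point of this issue) that a non-empty "pending" or "failed" list means the aggregated response is incomplete. The helper name `is_partial` is hypothetical, and the sample response is shortened from the example above.

```python
from typing import Any


def is_partial(response: dict[str, Any]) -> bool:
    """Return True if the response reports pending or failed sub-jobs.

    Assumption: a missing key (as in the config-level response) is
    treated the same as an empty list.
    """
    return bool(response.get("pending")) or bool(response.get("failed"))


# Dataset-level response shaped like the example above (file list shortened).
dataset_level = {
    "parquet_files": [
        {
            "dataset": "duorc",
            "config": "ParaphraseRC",
            "split": "test",
            "filename": "duorc-test.parquet",
            "size": 6136591,
        },
    ],
    "pending": [],
    "failed": [],
}

# Config-level response: no "pending" or "failed" keys at all.
config_level = {"parquet_files": dataset_level["parquet_files"]}

print(is_partial(dataset_level), is_partial(config_level))
```

Whatever option is chosen, a client currently has to guess at this interpretation.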

To me, the undocumented "pending" and "failed" keys look too technical and unclear.

What we can do:

  • document what these keys mean
  • don't document them, and for these kinds of endpoints show only examples where all levels are specified (this is currently not the case), i.e. don't show examples that return the "pending" and "failed" fields
  • anything else? @huggingface/datasets-server
@severo added the question and api labels on Jul 3, 2023
@github-actions

github-actions bot commented Aug 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo
Collaborator

severo commented Aug 11, 2023

From the related issue #1299:

Format error: entries in "failed" (and "pending") should not have a "split" field for config-size.
https://datasets-server.huggingface.co/size?dataset=stas/openwebtext-10k

{
  "size": {
    "dataset": {
      "dataset": "stas/openwebtext-10k",
      "num_bytes_original_files": 0,
      "num_bytes_parquet_files": 0,
      "num_bytes_memory": 0,
      "num_rows": 0
    },
    "configs": [],
    "splits": []
  },
  "pending": [],
  "failed": [
    {
      "kind": "config-size",
      "dataset": "stas/openwebtext-10k",
      "config": "plain_text",
      "split": null
    }
  ]
}
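
A rough way to spot this format error on the client side, as a sketch only. It assumes that the prefix of "kind" ("config-" in "config-size") identifies the level of the sub-job; that naming convention is inferred from the example above and is not documented. The function name `entries_with_extra_split` is hypothetical.

```python
def entries_with_extra_split(response: dict) -> list[dict]:
    """Return pending/failed entries that are config-level but carry a "split" key.

    Assumption: a "kind" starting with "config-" denotes a config-level step.
    """
    entries = response.get("pending", []) + response.get("failed", [])
    return [
        entry
        for entry in entries
        if entry.get("kind", "").startswith("config-") and "split" in entry
    ]


# Trimmed copy of the /size response above (JSON null becomes None).
size_response = {
    "pending": [],
    "failed": [
        {
            "kind": "config-size",
            "dataset": "stas/openwebtext-10k",
            "config": "plain_text",
            "split": None,
        }
    ],
}

offending = entries_with_extra_split(size_response)
print(len(offending))
```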

By the way, here we're exposing the underlying steps logic. The API client is more interested in URLs, so we should show something like:

{
  "size": {
    "dataset": {
      "dataset": "stas/openwebtext-10k",
      "num_bytes_original_files": 0,
      "num_bytes_parquet_files": 0,
      "num_bytes_memory": 0,
      "num_rows": 0
    },
    "configs": [],
    "splits": []
  },
  "pending": [],
  "failed": [
    {
      "config": "plain_text",
      "url": "https://datasets-server.huggingface.co/size?dataset=stas/openwebtext-10k&config=plain_text"
    }
  ]
}

That way, getting the error is just a matter of requesting the URL.
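
A minimal sketch of that conversion, to make the proposal concrete. The function name `failed_entry_to_public` and the `endpoint` parameter are hypothetical: how a "kind" such as "config-size" maps to an endpoint such as "size" is left to the caller here, since that mapping is an assumption and not something stated in this thread.

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co"


def failed_entry_to_public(entry: dict, endpoint: str) -> dict:
    """Turn an internal pending/failed entry into the proposed client-facing form.

    Keeps only the levels that are set (config, split) and adds the URL
    a client can request to see the underlying error.
    """
    params = {"dataset": entry["dataset"]}
    for level in ("config", "split"):
        if entry.get(level) is not None:
            params[level] = entry[level]
    # safe="/" keeps the namespace separator in the dataset name readable.
    url = f"{BASE_URL}/{endpoint}?{urlencode(params, safe='/')}"
    public = {key: value for key, value in params.items() if key != "dataset"}
    public["url"] = url
    return public


internal = {
    "kind": "config-size",
    "dataset": "stas/openwebtext-10k",
    "config": "plain_text",
    "split": None,
}
public_entry = failed_entry_to_public(internal, endpoint="size")
print(public_entry)
```

With this shape, the "kind" and the null "split" never reach the client.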

@severo
Collaborator

severo commented Aug 11, 2023

See also #1665
