Raise specific errors (and error_code) instead of UnexpectedError #1443

Open
severo opened this issue Jun 28, 2023 · 23 comments

@severo
Collaborator

severo commented Jun 28, 2023

The following query on the production database gives the number of datasets with at least one cache entry with error_code "UnexpectedError", grouped by the underlying "cause_exception".

For the most common ones (DatasetGenerationError, HfHubHTTPError, OSError, etc.), we would benefit from raising a specific error with its own error code. This would allow us to:

  • retry automatically, if needed
  • show an adequate error message to the users
  • even adapt the way we show the dataset viewer on the Hub

A null _id means the entry has no details.cause_exception. These cache entries should be inspected more closely. See #1123 in particular, which is one of the cases where no cause exception is reported.

db.cachedResponsesBlue.aggregate([
    {$match: {error_code: "UnexpectedError"}},
    {$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
    {$group: {_id: "$_id.cause", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 1964 }
{ _id: null, count: 1388 }
{ _id: 'HfHubHTTPError', count: 1154 }
{ _id: 'OSError', count: 433 }
{ _id: 'FileNotFoundError', count: 242 }
{ _id: 'FileExistsError', count: 198 }
{ _id: 'ValueError', count: 186 }
{ _id: 'TypeError', count: 160 }
{ _id: 'ConnectionError', count: 146 }
{ _id: 'RuntimeError', count: 86 }
{ _id: 'NonMatchingSplitsSizesError', count: 83 }
{ _id: 'FileSystemError', count: 62 }
{ _id: 'ClientResponseError', count: 52 }
{ _id: 'ArrowInvalid', count: 45 }
{ _id: 'ParquetResponseEmptyError', count: 43 }
{ _id: 'RepositoryNotFoundError', count: 41 }
{ _id: 'ManualDownloadError', count: 39 }
{ _id: 'IndexError', count: 28 }
{ _id: 'AttributeError', count: 16 }
{ _id: 'KeyError', count: 15 }
{ _id: 'GatedRepoError', count: 13 }
{ _id: 'NotImplementedError', count: 11 }
{ _id: 'ExpectedMoreSplits', count: 9 }
{ _id: 'PermissionError', count: 8 }
{ _id: 'BadRequestError', count: 7 }
{ _id: 'NonMatchingChecksumError', count: 6 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'UnboundLocalError', count: 3 }
{ _id: 'JSONDecodeError', count: 3 }
{ _id: 'ZeroDivisionError', count: 3 }
{ _id: 'InvalidDocument', count: 3 }
{ _id: 'DoesNotExist', count: 3 }
{ _id: 'EOFError', count: 3 }
{ _id: 'ImportError', count: 3 }
{ _id: 'NotADirectoryError', count: 2 }
{ _id: 'RarCannotExec', count: 2 }
{ _id: 'ReadTimeout', count: 2 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'ExpectedMoreDownloadedFiles', count: 2 }
{ _id: 'InvalidConfigName', count: 2 }
{ _id: 'ModuleNotFoundError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'MissingBeamOptions', count: 2 }
{ _id: 'HTTPError', count: 1 }
{ _id: 'BadZipFile', count: 1 }
{ _id: 'OverflowError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'IsADirectoryError', count: 1 }
{ _id: 'OperationalError', count: 1 }
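
To make the double `$group` in the query above concrete, here is a minimal pure-Python sketch of the same counting logic, run on a few made-up cache entries. The sample documents, the `SomeOtherError` code and the function name `count_datasets_per_cause` are illustrative and are not taken from the production database. The first grouping collapses the entries into distinct (cause, dataset) pairs and the second counts datasets per cause, so a dataset with many failing entries is counted only once.

```python
from collections import Counter


def count_datasets_per_cause(entries):
    """Count distinct datasets per cause_exception among UnexpectedError entries."""
    # $match + first $group: keep one (cause, dataset) pair per dataset.
    # A missing details.cause_exception becomes None, like null in MongoDB.
    pairs = {
        (entry.get("details", {}).get("cause_exception"), entry["dataset"])
        for entry in entries
        if entry.get("error_code") == "UnexpectedError"
    }
    # Second $group: count how many datasets share each cause.
    counts = Counter(cause for cause, _dataset in pairs)
    # $sort: most frequent cause first.
    return counts.most_common()


# Hypothetical sample entries, for illustration only.
sample_entries = [
    {"dataset": "a", "error_code": "UnexpectedError", "details": {"cause_exception": "OSError"}},
    {"dataset": "a", "error_code": "UnexpectedError", "details": {"cause_exception": "OSError"}},
    {"dataset": "b", "error_code": "UnexpectedError", "details": {"cause_exception": "OSError"}},
    {"dataset": "c", "error_code": "UnexpectedError", "details": {}},
    {"dataset": "d", "error_code": "SomeOtherError", "details": {"cause_exception": "ValueError"}},
]
result = count_datasets_per_cause(sample_entries)
print(result)
```

Dataset "a" has two failing entries but contributes a single count to OSError, which is the behavior the two-stage grouping is meant to give.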
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo
Collaborator Author

severo commented Aug 7, 2023

We need to do it to provide better feedback to the user, and to retry when appropriate.
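
As a rough illustration of the idea (not the project's actual implementation), one could map known cause exceptions to dedicated error classes that carry their own `error_code` and a retry hint. All names below (`JobError`, `ExternalServiceError`, `DatasetContentError`, `to_specific_error`, the `retryable` attribute) are hypothetical, the exception types are Python built-ins rather than the library exceptions listed in this issue, and which causes deserve a retry is an assumption made for the example.

```python
class JobError(Exception):
    """Base class: every error carries an error_code and a retry hint."""

    error_code = "UnexpectedError"
    retryable = False

    def __init__(self, message, cause=None):
        super().__init__(message)
        self.cause = cause


class ExternalServiceError(JobError):
    """Transient failure of a remote service: assumed to be worth retrying."""

    error_code = "ExternalServiceError"
    retryable = True


class DatasetContentError(JobError):
    """The dataset itself is broken: retrying is assumed not to help."""

    error_code = "DatasetContentError"
    retryable = False


# Hypothetical mapping from the underlying exception type to a specific error.
SPECIFIC_ERRORS = {
    ConnectionError: ExternalServiceError,
    TimeoutError: ExternalServiceError,
    FileNotFoundError: DatasetContentError,
    UnicodeDecodeError: DatasetContentError,
}


def to_specific_error(exc):
    """Wrap exc in the most specific known error, else fall back to JobError."""
    for exc_type, error_class in SPECIFIC_ERRORS.items():
        if isinstance(exc, exc_type):
            return error_class(str(exc), cause=exc)
    return JobError(str(exc), cause=exc)


try:
    raise ConnectionError("connection reset by peer")
except Exception as err:
    specific = to_specific_error(err)

print(specific.error_code, specific.retryable)
```

With a structure like this, the code that writes the cache entry could store `error_code` directly, and a scheduler could consult `retryable` instead of inspecting the cause exception after the fact.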

@severo severo reopened this Aug 7, 2023
@severo severo added the P2 Nice to have label Aug 7, 2023
@severo severo added P1 Not as needed as P0, but still important/wanted and removed P2 Nice to have labels Aug 11, 2023
@severo
Collaborator Author

severo commented Aug 11, 2023

Copying from #1462

Updated query (excluding errors copied from a parent artifact):

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", kind:"split-duckdb-index", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {cause: "$details.cause_exception"}, count: {$sum: 1}}},{$sort: {count: -1}}])

Of the 128617 records currently in the cache collection, these are the most common causes of UnexpectedError for the split-duckdb-index step:

[
  { _id: { cause: 'HfHubHTTPError' }, count: 4429 },
  { _id: { cause: 'HTTPException' }, count: 2570 },
  { _id: { cause: 'Error' }, count: 54 },
  { _id: { cause: 'BinderException' }, count: 41 },
  { _id: { cause: 'CatalogException' }, count: 38 },
  { _id: { cause: 'ParserException' }, count: 29 },
  { _id: { cause: 'InvalidInputException' }, count: 22 },
  { _id: { cause: 'RuntimeError' }, count: 8 },
  { _id: { cause: 'IOException' }, count: 5 },
  { _id: { cause: 'BadRequestError' }, count: 2 },
  { _id: { cause: 'NotPrimaryError' }, count: 2 },
  { _id: { cause: 'EntryNotFoundError' }, count: 2 }
]

Since this is a new job runner, most of these should be evaluated in case there is a bug in the code.
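
The `"details.copied_from_artifact": {$exists: false}` clause in the query above keeps only the errors raised by the step itself and drops those copied from a parent step. A minimal Python sketch of that predicate on made-up documents follows. The helper name `is_own_unexpected_error` and the sample values, including the shape of `copied_from_artifact`, are illustrative assumptions.

```python
def is_own_unexpected_error(entry, kind=None):
    """True if the entry is an UnexpectedError raised by the step itself."""
    if entry.get("error_code") != "UnexpectedError":
        return False
    if kind is not None and entry.get("kind") != kind:
        return False
    # $exists: false -> the key must be absent, whatever its value would be.
    details = entry.get("details") or {}
    return "copied_from_artifact" not in details


# Hypothetical sample entries, for illustration only.
sample_entries = [
    {
        "kind": "split-duckdb-index",
        "error_code": "UnexpectedError",
        "details": {"cause_exception": "HfHubHTTPError"},
    },
    {
        "kind": "split-duckdb-index",
        "error_code": "UnexpectedError",
        "details": {
            "cause_exception": "OSError",
            "copied_from_artifact": {"kind": "config-parquet-and-info"},
        },
    },
    {
        "kind": "config-parquet-and-info",
        "error_code": "UnexpectedError",
        "details": {"cause_exception": "OSError"},
    },
]
own_errors = [
    entry
    for entry in sample_entries
    if is_own_unexpected_error(entry, kind="split-duckdb-index")
]
print(len(own_errors))
```

Only the first sample entry survives: the second one was copied from a parent step, and the third belongs to another kind.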

@AndreaFrancis AndreaFrancis self-assigned this Sep 11, 2023
@AndreaFrancis
Contributor

AndreaFrancis commented Sep 11, 2023

Updating list:
datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {cause: "$details.cause_exception"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
{ _id: { cause: 'AttributeError' }, count: 9876 },
{ _id: { cause: 'ClientResponseError' }, count: 6034 },
{ _id: { cause: 'DatasetGenerationError' }, count: 5674 },
{ _id: { cause: 'ParserException' }, count: 3058 },
{ _id: { cause: 'TypeError' }, count: 2689 },
{ _id: { cause: 'IOException' }, count: 1961 },
{ _id: { cause: 'InvalidInputException' }, count: 1814 },
{ _id: { cause: 'ZeroDivisionError' }, count: 1693 },
{ _id: { cause: 'FileNotFoundError' }, count: 1687 },
{ _id: { cause: 'HfHubHTTPError' }, count: 1316 },
{ _id: { cause: 'HTTPException' }, count: 1216 },
{ _id: { cause: 'NonMatchingSplitsSizesError' }, count: 1141 },
{ _id: { cause: 'EntryNotFoundError' }, count: 895 },
{ _id: { cause: 'ValueError' }, count: 827 },
{ _id: { cause: 'BinderException' }, count: 789 },
{ _id: { cause: 'KeyError' }, count: 608 },
{ _id: { cause: 'ParquetResponseEmptyError' }, count: 598 },
{ _id: { cause: 'NotImplementedError' }, count: 509 },
{ _id: { cause: 'CachedArtifactNotFoundError' }, count: 457 },
{ _id: { cause: null }, count: 370 },
{ _id: { cause: 'ReadTimeout' }, count: 329 },
{ _id: { cause: 'ConnectionError' }, count: 264 },
{ _id: { cause: 'LocationParseError' }, count: 191 },
{ _id: { cause: 'OSError' }, count: 186 },
{ _id: { cause: 'IndexError' }, count: 155 },
{ _id: { cause: 'AssertionError' }, count: 84 },
{ _id: { cause: 'BadZipFile' }, count: 63 },
{ _id: { cause: 'ArrowInvalid' }, count: 57 },
{ _id: { cause: 'OutOfRangeException' }, count: 53 },
{ _id: { cause: 'CatalogException' }, count: 44 },
{ _id: { cause: 'ModuleNotFoundError' }, count: 41 },
{ _id: { cause: 'RuntimeError' }, count: 39 },
{ _id: { cause: 'LocalEntryNotFoundError' }, count: 26 },
{ _id: { cause: 'UnboundLocalError' }, count: 26 },
{ _id: { cause: 'FileExistsError' }, count: 24 },
{ _id: { cause: 'Error' }, count: 24 },
{ _id: { cause: 'RepositoryNotFoundError' }, count: 21 },
{ _id: { cause: 'InvalidOperation' }, count: 16 },
{ _id: { cause: 'ExpectedMoreSplits' }, count: 15 },
{ _id: { cause: 'ImportError' }, count: 12 },
{ _id: { cause: 'ServerDisconnectedError' }, count: 11 },
{ _id: { cause: 'NameError' }, count: 9 },
{ _id: { cause: 'SyntaxError' }, count: 8 },
{ _id: { cause: 'PermissionError' }, count: 6 },
{ _id: { cause: 'InternalException' }, count: 5 },
{ _id: { cause: 'ChunkedEncodingError' }, count: 5 },
{ _id: { cause: 'InvalidDocument' }, count: 4 },
{ _id: { cause: 'ParserError' }, count: 3 },
{ _id: { cause: 'DoesNotExist' }, count: 3 },
{ _id: { cause: 'ConversionException' }, count: 3 },
{ _id: { cause: 'NonStreamableDatasetError' }, count: 3 },
{ _id: { cause: 'SSLError' }, count: 3 },
{ _id: { cause: 'Exception' }, count: 3 },
{ _id: { cause: 'GatedRepoError' }, count: 3 },
{ _id: { cause: 'JSONDecodeError' }, count: 2 },
{ _id: { cause: 'InvalidConfigName' }, count: 2 },
{ _id: { cause: 'FileSystemError' }, count: 1 },
{ _id: { cause: 'AutoReconnect' }, count: 1 },
{ _id: { cause: 'TypeMismatchException' }, count: 1 },
{ _id: { cause: 'HFValidationError' }, count: 1 },
{ _id: { cause: 'EOFError' }, count: 1 },
{ _id: { cause: 'OperationalError' }, count: 1 },
{ _id: { cause: 'TransactionException' }, count: 1 },
{ _id: { cause: 'NotPrimaryError' }, count: 1 },
{ _id: { cause: 'UnicodeDecodeError' }, count: 1 },
{ _id: { cause: 'OutOfMemoryException' }, count: 1 }
]

@AndreaFrancis
Contributor

AndreaFrancis commented Sep 18, 2023

After some manual cache maintenance (removing obsolete records whose config or split no longer exists), this is the updated list. AttributeError and ClientResponseError decreased the most:

[
  { _id: { cause: 'DatasetGenerationError' }, count: 3791 },
  { _id: { cause: 'TypeError' }, count: 2222 },
  { _id: { cause: 'ParserException' }, count: 2095 },
  { _id: { cause: 'InvalidInputException' }, count: 1750 },
  { _id: { cause: 'FileNotFoundError' }, count: 1393 },
  { _id: { cause: 'ZeroDivisionError' }, count: 1224 },
  { _id: { cause: 'HfHubHTTPError' }, count: 1128 },
  { _id: { cause: 'NonMatchingSplitsSizesError' }, count: 1116 },
  { _id: { cause: 'IOException' }, count: 1035 },
  { _id: { cause: 'CachedArtifactNotFoundError' }, count: 745 },
  { _id: { cause: 'HTTPException' }, count: 526 },
  { _id: { cause: 'NotImplementedError' }, count: 493 },
  { _id: { cause: 'BinderException' }, count: 462 },
  { _id: { cause: 'KeyError' }, count: 454 },
  { _id: { cause: 'ReadTimeout' }, count: 311 },
  { _id: { cause: 'ParquetResponseEmptyError' }, count: 292 },
  { _id: { cause: 'ConnectionError' }, count: 201 },
  { _id: { cause: 'ValueError' }, count: 187 },
  { _id: { cause: 'AttributeError' }, count: 127 },
  { _id: { cause: 'IndexError' }, count: 107 },
  { _id: { cause: 'OSError' }, count: 102 },
  { _id: { cause: 'ClientResponseError' }, count: 94 },
  { _id: { cause: 'EntryNotFoundError' }, count: 92 },
  { _id: { cause: 'AssertionError' }, count: 84 },
  { _id: { cause: 'BadZipFile' }, count: 61 },
  { _id: { cause: 'OutOfRangeException' }, count: 46 },
  { _id: { cause: 'ModuleNotFoundError' }, count: 43 },
  { _id: { cause: 'LocationParseError' }, count: 29 },
  { _id: { cause: 'ArrowInvalid' }, count: 28 },
  { _id: { cause: 'CatalogException' }, count: 26 },
  { _id: { cause: 'LocalEntryNotFoundError' }, count: 19 },
  { _id: { cause: 'Error' }, count: 16 },
  { _id: { cause: 'ServerDisconnectedError' }, count: 9 },
  { _id: { cause: 'SyntaxError' }, count: 8 },
  { _id: { cause: 'InvalidOperation' }, count: 8 },
  { _id: { cause: 'RuntimeError' }, count: 7 },
  { _id: { cause: 'PermissionError' }, count: 6 },
  { _id: { cause: 'UnboundLocalError' }, count: 6 },
  { _id: { cause: 'NameError' }, count: 5 },
  { _id: { cause: 'NonStreamableDatasetError' }, count: 3 },
  { _id: { cause: 'Exception' }, count: 3 },
  { _id: { cause: 'ChunkedEncodingError' }, count: 3 },
  { _id: { cause: 'SSLError' }, count: 3 },
  { _id: { cause: 'ExpectedMoreSplits' }, count: 2 },
  { _id: { cause: 'ConversionException' }, count: 2 },
  { _id: { cause: null }, count: 2 },
  { _id: { cause: 'ParserError' }, count: 2 },
  { _id: { cause: 'RepositoryNotFoundError' }, count: 2 },
  { _id: { cause: 'OperationalError' }, count: 1 },
  { _id: { cause: 'UnicodeDecodeError' }, count: 1 },
  { _id: { cause: 'TransactionException' }, count: 1 },
  { _id: { cause: 'OutOfMemoryException' }, count: 1 },
  { _id: { cause: 'DoesNotExist' }, count: 1 },
  { _id: { cause: 'ImportError' }, count: 1 },
  { _id: { cause: 'HFValidationError' }, count: 1 },
  { _id: { cause: 'JSONDecodeError' }, count: 1 },
  { _id: { cause: 'EOFError' }, count: 1 },
  { _id: { cause: 'TypeMismatchException' }, count: 1 },
  { _id: { cause: 'InternalException' }, count: 1 }
]

@AndreaFrancis
Contributor

AndreaFrancis commented Nov 17, 2023

Update of UnexpectedErrors count by kind:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kindkind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kindkind: 'config-parquet-and-info' }, count: 9117 },
  { _id: { kindkind: 'split-descriptive-statistics' }, count: 6685 },
  { _id: { kindkind: 'split-duckdb-index' }, count: 591 },
  { _id: { kindkind: 'split-first-rows-from-parquet' }, count: 11 }
]

The split-first-rows-from-parquet errors will be fixed by #2126.

@severo
Collaborator Author

severo commented Nov 17, 2023

Interesting that only 4 steps produce all the unexpected errors.

@severo
Collaborator Author

severo commented Nov 22, 2023

For KeyError, see huggingface/huggingface_hub#1853

@severo
Collaborator Author

severo commented Nov 22, 2023

Current state:

db.cachedResponsesBlue.aggregate([
    {$match: {error_code: "UnexpectedError"}},
    {$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
    {$group: {_id: "$_id.cause", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 2767 }
{ _id: 'HfHubHTTPError', count: 795 }
{ _id: 'TypeError', count: 633 }
{ _id: 'ZeroDivisionError', count: 621 }
{ _id: 'IOException', count: 514 }
{ _id: 'ReadTimeout', count: 245 }
{ _id: 'OSError', count: 151 }
{ _id: 'BinderException', count: 127 }
{ _id: 'ConnectionError', count: 119 }
{ _id: 'ValueError', count: 103 }
{ _id: 'ParserException', count: 91 }
{ _id: 'EntryNotFoundError', count: 66 }
{ _id: 'NotImplementedError', count: 66 }
{ _id: 'FileNotFoundError', count: 60 }
{ _id: 'NonMatchingSplitsSizesError', count: 43 }
{ _id: 'BrokenPipeError', count: 39 }
{ _id: 'InvalidInputException', count: 36 }
{ _id: 'IndexError', count: 30 }
{ _id: 'OutOfRangeException', count: 30 }
{ _id: 'HTTPException', count: 21 }
{ _id: 'LocationParseError', count: 17 }
{ _id: 'RuntimeError', count: 15 }
{ _id: 'KeyError', count: 13 }
{ _id: 'BadZipFile', count: 9 }
{ _id: 'Error', count: 7 }
{ _id: 'ExpectedMoreSplits', count: 5 }
{ _id: 'ArrowInvalid', count: 5 }
{ _id: 'ConversionException', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'AttributeError', count: 3 }
{ _id: 'ModuleNotFoundError', count: 3 }
{ _id: 'PermissionError', count: 3 }
{ _id: 'NotPrimaryError', count: 3 }
{ _id: 'ParserError', count: 3 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'LocalEntryNotFoundError', count: 2 }
{ _id: 'RepositoryNotFoundError', count: 2 }
{ _id: 'UnboundLocalError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'TypeMismatchException', count: 2 }
{ _id: 'ClientResponseError', count: 2 }
{ _id: 'JSONDecodeError', count: 1 }
{ _id: 'InvalidConfigName', count: 1 }
{ _id: 'GatedRepoError', count: 1 }
{ _id: 'CachedArtifactNotFoundError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'RarCannotExec', count: 1 }
{ _id: 'OutOfMemoryException', count: 1 }
{ _id: 'ImportError', count: 1 }
{ _id: 'NonStreamableDatasetError', count: 1 }
{ _id: 'OperationalError', count: 1 }
{ _id: 'SyntaxError', count: 1 }
{ _id: 'UnicodeDecodeError', count: 1 }
{ _id: 'EOFError', count: 1 }

@AndreaFrancis
Contributor

Updated list of UnexpectedErrors by kind:

[
  { _id: { kindkind: 'config-parquet-and-info' }, count: 8500 },
  { _id: { kindkind: 'split-descriptive-statistics' }, count: 2628 },
  { _id: { kindkind: 'split-duckdb-index' }, count: 794 }
]

@severo
Collaborator Author

severo commented Feb 6, 2024

Current state:

db.cachedResponsesBlue.aggregate([
    {$match: {error_code: "UnexpectedError"}},
    {$group: {_id: {cause: "$details.cause_exception", dataset: "$dataset"}, count: {$sum: 1}}},
    {$group: {_id: "$_id.cause", count: {$sum: 1}}},
    {$sort: {count: -1}}
])
{ _id: 'DatasetGenerationError', count: 3963 }
{ _id: 'TypeError', count: 958 }
{ _id: 'HfHubHTTPError', count: 778 }
{ _id: 'DatasetGenerationCastError', count: 287 }
{ _id: 'OSError', count: 219 }
{ _id: 'ValueError', count: 182 }
{ _id: 'ReadTimeout', count: 172 }
{ _id: 'ParserException', count: 127 }
{ _id: 'BinderException', count: 108 }
{ _id: 'ConnectionError', count: 103 }
{ _id: 'EntryNotFoundError', count: 77 }
{ _id: 'InvalidInputException', count: 76 }
{ _id: 'IOException', count: 72 }
{ _id: 'NotImplementedError', count: 69 }
{ _id: 'FileNotFoundError', count: 59 }
{ _id: 'ComputeError', count: 57 }
{ _id: 'NonMatchingSplitsSizesError', count: 50 }
{ _id: 'ColumnNotFoundError', count: 46 }
{ _id: 'RuntimeError', count: 34 }
{ _id: 'IndexError', count: 25 }
{ _id: 'ConversionException', count: 23 }
{ _id: 'HTTPException', count: 20 }
{ _id: 'ZeroDivisionError', count: 19 }
{ _id: 'LocationParseError', count: 15 }
{ _id: 'KeyError', count: 12 }
{ _id: 'BadZipFile', count: 11 }
{ _id: 'ArrowInvalid', count: 10 }
{ _id: 'ExpectedMoreSplits', count: 8 }
{ _id: 'ParserError', count: 8 }
{ _id: 'Error', count: 8 }
{ _id: 'InvalidOperationError', count: 7 }
{ _id: 'SchemaError', count: 5 }
{ _id: 'ReadError', count: 5 }
{ _id: 'AssertionError', count: 4 }
{ _id: 'ArrowCapacityError', count: 4 }
{ _id: 'NameError', count: 4 }
{ _id: 'PermissionError', count: 3 }
{ _id: 'AttributeError', count: 3 }
{ _id: 'JSONDecodeError', count: 3 }
{ _id: 'DuplicateError', count: 2 }
{ _id: 'TypeMismatchException', count: 2 }
{ _id: 'RarCannotExec', count: 2 }
{ _id: 'UnboundLocalError', count: 2 }
{ _id: 'Exception', count: 2 }
{ _id: 'TransactionException', count: 2 }
{ _id: 'ChunkedEncodingError', count: 2 }
{ _id: 'UnicodeDecodeError', count: 2 }
{ _id: 'ClientResponseError', count: 2 }
{ _id: 'ModuleNotFoundError', count: 2 }
{ _id: 'InvalidConfigName', count: 1 }
{ _id: 'OperationalError', count: 1 }
{ _id: 'GatedRepoError', count: 1 }
{ _id: 'CachedArtifactNotFoundError', count: 1 }
{ _id: 'HFValidationError', count: 1 }
{ _id: 'ImportError', count: 1 }
{ _id: 'OutOfRangeException', count: 1 }
{ _id: 'NonStreamableDatasetError', count: 1 }
{ _id: 'NotPrimaryError', count: 1 }
{ _id: 'RepositoryNotFoundError', count: 1 }
{ _id: 'LocalEntryNotFoundError', count: 1 }
db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kindkind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
{ _id: { kindkind: 'config-parquet-and-info' }, count: 9338 }
{ _id: { kindkind: 'split-descriptive-statistics' }, count: 2868 }
{ _id: { kindkind: 'split-duckdb-index' }, count: 847 }
{ _id: { kindkind: 'split-first-rows-from-parquet' }, count: 2 }

@severo
Collaborator Author

severo commented Feb 6, 2024

I would bet that most errors occur for datasets with a script. I propose recreating all of these datasets. In most cases, this will produce a DatasetWithScriptNotSupportedError instead of some weird-looking error.

Number of unique datasets:

db.cachedResponsesBlue.aggregate([
    { $match: { error_code: "UnexpectedError" } },
    { $group: { _id: null, uniqueValues: { $addToSet: "$dataset" } } },
    { $project: { _id: 0, uniqueValues: 1 } },
    { $unwind: "$uniqueValues" },
    { $group: { _id: null, count: { $sum: 1 } } },
    { $project: { _id: 0, count: 1 } }
]);
{ count: 7484 }
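
The pipeline above boils down to counting the distinct `dataset` values among the UnexpectedError entries. A small Python sketch of the same computation on made-up entries (the function name and the sample data are illustrative):

```python
def count_unique_datasets(entries):
    """Number of distinct datasets with at least one UnexpectedError entry."""
    # $match + $addToSet: collect each dataset name once.
    unique = {
        entry["dataset"]
        for entry in entries
        if entry.get("error_code") == "UnexpectedError"
    }
    # $unwind + $group with $sum: count the collected names.
    return len(unique)


# Hypothetical sample entries, for illustration only.
sample_entries = [
    {"dataset": "a", "kind": "config-parquet-and-info", "error_code": "UnexpectedError"},
    {"dataset": "a", "kind": "split-duckdb-index", "error_code": "UnexpectedError"},
    {"dataset": "b", "kind": "split-duckdb-index", "error_code": "UnexpectedError"},
    {"dataset": "c", "kind": "split-duckdb-index", "error_code": "SomeOtherError"},
]
unique_count = count_unique_datasets(sample_entries)
print(unique_count)
```

In mongosh, `db.cachedResponsesBlue.distinct("dataset", {error_code: "UnexpectedError"}).length` should give the same number for a result of this size, although that variant was not run here.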

I'm recreating the datasets one by one, with:

DATASETS=(...)
for dataset in "${DATASETS[@]}"; do
  curl -H "Authorization: Bearer $HF_TOKEN" -X POST "https://datasets-server.huggingface.co/admin/recreate-dataset?dataset=$dataset&priority=low"
done

I scaled the admin service from 2 to 4; let's see if that improves anything.

The requests are being processed at a rate of roughly 1 per second, so we should hopefully be done in about two hours.
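
The duration estimate follows from the dataset count reported above: 7484 requests at about 1 request per second take roughly two hours. The sketch below reproduces that arithmetic and also shows how the request URL could be built with the dataset name percent-encoded. The function names and the example dataset name are illustrative, and whether the admin endpoint needs the encoding is an assumption.

```python
from urllib.parse import urlencode

ADMIN_URL = "https://datasets-server.huggingface.co/admin/recreate-dataset"


def recreate_url(dataset, priority="low"):
    """Build the recreate-dataset URL with encoded query parameters."""
    query = urlencode({"dataset": dataset, "priority": priority})
    return f"{ADMIN_URL}?{query}"


def estimated_hours(num_datasets, requests_per_second=1.0):
    """Time needed to send one request per dataset at the given rate."""
    return num_datasets / requests_per_second / 3600


# "user/dataset" is a placeholder name, not a real dataset from the list.
url = recreate_url("user/dataset")
hours = estimated_hours(7484)
print(url)
print(f"{hours:.2f} hours")
```

At exactly 1 request per second the run takes just over two hours. The sketch only builds the URL and sends nothing.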

@severo
Collaborator Author

severo commented Feb 8, 2024

Today:

Number of datasets, by step and cause exception:
db.cachedResponsesBlue.aggregate([
  { $match: { error_code: "UnexpectedError", "details.copied_from_artifact": { $exists: false } } },
  {
    $group: {
      _id: { kind: "$kind", cause: "$details.cause_exception", dataset: "$dataset" },
      count: { $sum: 1 },
    },
  },
  { $group: { _id: { kind: "$_id.kind", cause: "$_id.cause" }, count: { $sum: 1 } } },
  { $sort: { "_id.kind": 1, count: -1 } },
  { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count", cause: "$_id.cause" } } 
]);
{ kind: 'config-parquet-and-info', num_datasets: 2486, cause: 'DatasetGenerationError' }
{ kind: 'config-parquet-and-info', num_datasets: 1226, cause: 'DatasetGenerationCastError' }
{ kind: 'config-parquet-and-info', num_datasets: 575, cause: 'OSError' }
{ kind: 'config-parquet-and-info', num_datasets: 64, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 32, cause: 'NotImplementedError' }
{ kind: 'config-parquet-and-info', num_datasets: 30, cause: 'NonMatchingSplitsSizesError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ZeroDivisionError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'RuntimeError' }
{ kind: 'config-parquet-and-info', num_datasets: 14, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 11, cause: 'HfHubHTTPError' }
{ kind: 'config-parquet-and-info', num_datasets: 8, cause: 'ParserError' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'BadZipFile' }
{ kind: 'config-parquet-and-info', num_datasets: 6, cause: 'ReadError' }
{ kind: 'config-parquet-and-info', num_datasets: 5, cause: 'ArrowCapacityError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'IndexError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'ExpectedMoreSplits' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'RarCannotExec' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'JSONDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'AttributeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ModuleNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'FileNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'UnicodeDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ImportError' }
{ kind: 'split-descriptive-statistics', num_datasets: 935, cause: 'TypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 56, cause: 'ValueError' }
{ kind: 'split-descriptive-statistics', num_datasets: 35, cause: 'ColumnNotFoundError' }
{ kind: 'split-descriptive-statistics', num_datasets: 12, cause: 'ComputeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 5, cause: 'InvalidOperationError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'SchemaError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'DuplicateError' }
{ kind: 'split-duckdb-index', num_datasets: 123, cause: 'InvalidInputException' }
{ kind: 'split-duckdb-index', num_datasets: 109, cause: 'ParserException' }
{ kind: 'split-duckdb-index', num_datasets: 49, cause: 'IOException' }
{ kind: 'split-duckdb-index', num_datasets: 6, cause: 'ConversionException' }
{ kind: 'split-duckdb-index', num_datasets: 5, cause: 'Error' }
{ kind: 'split-duckdb-index', num_datasets: 2, cause: 'TypeMismatchException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'TransactionException' }

@AndreaFrancis
Contributor

Today:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])

[
  { _id: { kind: 'config-parquet-and-info' }, count: 6215 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 2173 },
  { _id: { kind: 'split-duckdb-index' }, count: 2034 },
  { _id: { kind: 'split-duckdb-index-010' }, count: 777 },
  { _id: { kind: 'split-first-rows' }, count: 1 }
]

@AndreaFrancis
Contributor

Today:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}]) 
[
  { _id: { kind: 'config-parquet-and-info' }, count: 7373 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 3808 },
  { _id: { kind: 'split-duckdb-index' }, count: 3285 },
  { _id: { kind: 'split-first-rows' }, count: 206 }
]

@AndreaFrancis
Contributor

Today:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kind: 'config-parquet-and-info' }, count: 6668 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 3667 },
  { _id: { kind: 'split-duckdb-index' }, count: 2941 },
  { _id: { kind: 'dataset-loading-tags' }, count: 1539 },
  { _id: { kind: 'split-first-rows' }, count: 30 }
]

@severo
Collaborator Author

severo commented May 14, 2024

The last PR (#2796) has a big impact!

72K -> 20K entries

[Screenshots, 2024-05-14 at 08:47:29 and 08:47:35]

Replaced with 36K DatasetGenerationError and 12K DatasetGenerationCastError

[Screenshots, 2024-05-14 at 08:49:38 and 08:49:44]

@severo
Collaborator Author

severo commented May 14, 2024

Today:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
{ _id: { kind: 'split-duckdb-index' }, count: 2871 }
{ _id: { kind: 'dataset-compatible-libraries' }, count: 2546 }
{ _id: { kind: 'split-descriptive-statistics' }, count: 1683 }
{ _id: { kind: 'config-parquet-and-info' }, count: 1407 }
{ _id: { kind: 'split-first-rows' }, count: 68 }
{ _id: { kind: 'split-image-url-columns' }, count: 2 }

@AndreaFrancis
Contributor

After refreshing some records:

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kind: 'split-duckdb-index' }, count: 1380 },
  { _id: { kind: 'config-parquet-and-info' }, count: 1171 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 676 },
  { _id: { kind: 'dataset-compatible-libraries' }, count: 619 },
  { _id: { kind: 'split-first-rows' }, count: 68 },
  { _id: { kind: 'split-image-url-columns' }, count: 2 }
]

@AndreaFrancis
Contributor

Today (Almost half of yesterday's):

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.aggregate([{$match: {error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}}},{$group: {_id: {kind: "$kind"}, count: {$sum: 1}}},{$sort: {count: -1}}])
[
  { _id: { kind: 'split-duckdb-index' }, count: 1236 },
  { _id: { kind: 'config-parquet-and-info' }, count: 588 },
  { _id: { kind: 'split-descriptive-statistics' }, count: 301 },
  { _id: { kind: 'dataset-compatible-libraries' }, count: 209 },
  { _id: { kind: 'split-first-rows' }, count: 68 },
  { _id: { kind: 'split-image-url-columns' }, count: 2 }
]

Atlas atlas-x5jgb3-shard-0 [primary] datasets_server_cache> db.cachedResponsesBlue.countDocuments({error_code: "UnexpectedError", "details.copied_from_artifact":{$exists:false}})
2405

@severo
Collaborator Author

severo commented Jul 30, 2024

Today:

db.cachedResponsesBlue.aggregate([
  { $match: { error_code: "UnexpectedError", "details.copied_from_artifact": { $exists: false } } },
  {
    $group: {
      _id: { kind: "$kind", cause: "$details.cause_exception", dataset: "$dataset" },
      count: { $sum: 1 },
    },
  },
  { $group: { _id: { kind: "$_id.kind", cause: "$_id.cause" }, count: { $sum: 1 } } },
  { $sort: { count: -1, "_id.kind": 1 } },
  { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count", cause: "$_id.cause" } } 
]);

{ kind: 'dataset-compatible-libraries', num_datasets: 1507, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 288, cause: 'ParserException' }
{ kind: 'split-duckdb-index', num_datasets: 262, cause: 'HfHubHTTPError' }
{ kind: 'config-parquet-and-info', num_datasets: 203, cause: 'ValueError' }
{ kind: 'split-duckdb-index', num_datasets: 181, cause: 'UnidentifiedImageError' }
{ kind: 'dataset-filetypes', num_datasets: 160, cause: 'BadZipFile' }
{ kind: 'split-descriptive-statistics', num_datasets: 157, cause: 'ReadTimeout' }
{ kind: 'config-parquet-and-info', num_datasets: 148, cause: 'PermissionError' }
{ kind: 'split-duckdb-index', num_datasets: 144, cause: 'BinderException' }
{ kind: 'dataset-filetypes', num_datasets: 140, cause: 'ValueError' }
{ kind: 'split-duckdb-index', num_datasets: 134, cause: 'ReadTimeout' }
{ kind: 'split-descriptive-statistics', num_datasets: 121, cause: 'ValueError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 96, cause: 'UnicodeDecodeError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 93, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 77, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 73, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 69, cause: 'ReadTimeout' }
{ kind: 'config-parquet-and-info', num_datasets: 65, cause: 'RuntimeError' }
{ kind: 'config-parquet-and-info', num_datasets: 52, cause: 'ReadError' }
{ kind: 'split-first-rows', num_datasets: 52, cause: 'ServerDisconnectedError' }
{ kind: 'split-duckdb-index', num_datasets: 50, cause: 'SchemaError' }
{ kind: 'split-duckdb-index', num_datasets: 49, cause: 'ComputeError' }
{ kind: 'split-duckdb-index', num_datasets: 48, cause: 'InvalidInputException' }
{ kind: 'config-parquet-and-info', num_datasets: 44, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 42, cause: 'ColumnNotFoundError' }
{ kind: 'split-descriptive-statistics', num_datasets: 40, cause: 'ColumnNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 40, cause: 'TypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 35, cause: 'ConnectionError' }
{ kind: 'split-duckdb-index', num_datasets: 32, cause: 'EntryNotFoundError' }
{ kind: 'dataset-filetypes', num_datasets: 31, cause: 'TypeError' }
{ kind: 'split-first-rows', num_datasets: 28, cause: 'ClientResponseError' }
{ kind: 'config-parquet-and-info', num_datasets: 25, cause: 'NonMatchingSplitsSizesError' }
{ kind: 'config-parquet-and-info', num_datasets: 24, cause: 'ArrowTypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 24, cause: 'EntryNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 24, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 21, cause: 'FileNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 19, cause: 'KeyError' }
{ kind: 'dataset-filetypes', num_datasets: 19, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 19, cause: 'DecompressionBombError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ZeroDivisionError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'DatasetGenerationError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'BadZipFile' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'IndexError' }
{ kind: 'split-descriptive-statistics', num_datasets: 14, cause: 'ComputeError' }
{ kind: 'config-parquet-and-info', num_datasets: 13, cause: 'ParserError' }
{ kind: 'config-parquet-and-info', num_datasets: 13, cause: 'NotImplementedError' }
{ kind: 'config-parquet-and-info', num_datasets: 11, cause: 'ArrowCapacityError' }
{ kind: 'dataset-filetypes', num_datasets: 11, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 10, cause: 'IOException' }
{ kind: 'split-first-rows', num_datasets: 10, cause: 'AttributeError' }
{ kind: 'split-first-rows', num_datasets: 9, cause: 'OSError' }
{ kind: 'split-duckdb-index', num_datasets: 8, cause: 'KeyError' }
{ kind: 'split-duckdb-index', num_datasets: 8, cause: 'ArrowInvalid' }
{ kind: 'split-first-rows', num_datasets: 8, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'OSError' }
{ kind: 'split-first-rows', num_datasets: 7, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 6, cause: 'JSONDecodeError' }
{ kind: 'split-duckdb-index', num_datasets: 6, cause: 'InternalException' }
{ kind: 'split-image-url-columns', num_datasets: 6, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 5, cause: 'HTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 5, cause: 'ConversionException' }
{ kind: 'config-parquet-and-info', num_datasets: 4, cause: 'DatasetGenerationCastError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'InvalidOperationError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'HfHubHTTPError' }
{ kind: 'split-duckdb-index', num_datasets: 4, cause: 'TypeMismatchException' }
{ kind: 'split-first-rows', num_datasets: 4, cause: 'FSTimeoutError' }
{ kind: 'config-parquet-and-info', num_datasets: 3, cause: 'UnpicklingError' }
{ kind: 'config-parquet-and-info', num_datasets: 3, cause: 'ExpectedMoreSplits' }
{ kind: 'split-duckdb-index', num_datasets: 3, cause: 'Error' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'UnicodeDecodeError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 2, cause: 'ValueError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'DuplicateError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'SchemaError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'KeyError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ImportError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ChunkedEncodingError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'IsADirectoryError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'EmptyDataError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'EOFError' }
{ kind: 'dataset-compatible-libraries', num_datasets: 1, cause: 'EmptyDatasetError' }
{ kind: 'dataset-filetypes', num_datasets: 1, cause: 'ConnectionError' }
{ kind: 'split-descriptive-statistics', num_datasets: 1, cause: 'RuntimeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 1, cause: 'TypeError' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'error' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'TransactionException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'FileNotFoundError' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'OutOfMemoryException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'RuntimeError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'ClientConnectorError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'UnicodeDecodeError' }
{ kind: 'split-first-rows', num_datasets: 1, cause: 'ClientPayloadError' }
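
The same cause often appears under several job kinds, so it can help to regroup the listing by cause before deciding which specific errors to introduce first. The sketch below is only an illustration: it copies a handful of rows from the output above (not the full result set), and the helper name `group_by_cause` is hypothetical. Summing `num_datasets` across kinds may count a dataset more than once if it fails in several steps, so the totals are an upper bound.

```python
from collections import defaultdict

# A few rows copied from the listing above (not the full result set).
rows = [
    {"kind": "split-duckdb-index", "num_datasets": 42, "cause": "ColumnNotFoundError"},
    {"kind": "split-descriptive-statistics", "num_datasets": 40, "cause": "ColumnNotFoundError"},
    {"kind": "split-descriptive-statistics", "num_datasets": 35, "cause": "ConnectionError"},
    {"kind": "split-duckdb-index", "num_datasets": 24, "cause": "ConnectionError"},
    {"kind": "config-parquet-and-info", "num_datasets": 18, "cause": "ConnectionError"},
    {"kind": "dataset-filetypes", "num_datasets": 1, "cause": "ConnectionError"},
]


def group_by_cause(rows):
    """Return {cause: {"total": int, "kinds": [kind, ...]}} for the given rows.

    "total" sums num_datasets across kinds, so a dataset failing in two
    kinds is counted twice (upper bound, not a distinct count).
    """
    grouped = defaultdict(lambda: {"total": 0, "kinds": []})
    for row in rows:
        entry = grouped[row["cause"]]
        entry["total"] += row["num_datasets"]
        entry["kinds"].append(row["kind"])
    return dict(grouped)


summary = group_by_cause(rows)
# Print causes from most to least frequent.
for cause, entry in sorted(summary.items(), key=lambda item: -item[1]["total"]):
    print(cause, entry["total"], entry["kinds"])
```

A cause that shows up in many kinds (such as `ConnectionError` here) is a candidate for one shared error code, while a cause confined to a single kind may be better handled inside that job.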

@severo
Collaborator Author

severo commented Aug 1, 2024

Note that we currently have 14K UnexpectedError entries, which is about 0.1% of the total cache entries, so this is not that crucial. I'll reduce the priority.

It may be more important to replace ConfigNamesError with the underlying error (100K entries), and to make DatasetGenerationError more explicit (50K entries) to help users debug their data files.
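
As a rough illustration of what "replacing a generic error with the underlying one" could look like, here is a minimal sketch. Everything in it is an assumption made for the example: the `JobError` class, the error code strings, the `retryable` flag and the `to_specific_error` helper are hypothetical and are not the project's actual API. Python built-in exceptions stand in for the library exceptions seen in the listing.

```python
class JobError(Exception):
    """Hypothetical error type carrying an error code and the cause's name."""

    def __init__(self, message, code, cause=None, retryable=False):
        super().__init__(message)
        self.code = code
        # Store the class name of the underlying exception, if any.
        self.cause_exception = type(cause).__name__ if cause is not None else None
        self.retryable = retryable


# Hypothetical mapping from an underlying exception type to (code, retryable).
# Checked in insertion order, so put more specific types first.
SPECIFIC_ERRORS = {
    ConnectionError: ("ExternalConnectionError", True),
    FileNotFoundError: ("DataFileNotFoundError", False),
}


def to_specific_error(exc):
    """Map a caught exception to a specific JobError, else fall back to UnexpectedError."""
    for exc_type, (code, retryable) in SPECIFIC_ERRORS.items():
        if isinstance(exc, exc_type):
            return JobError(str(exc), code, cause=exc, retryable=retryable)
    return JobError(str(exc), "UnexpectedError", cause=exc)


try:
    raise ConnectionError("connection reset by peer")
except Exception as err:
    specific = to_specific_error(err)
    print(specific.code, specific.cause_exception, specific.retryable)
```

Keeping the fallback to `UnexpectedError` means unmapped causes still surface in the kind of query shown earlier, so the mapping can be extended incrementally.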

@severo severo added P2 Nice to have and removed P1 Not as needed as P0, but still important/wanted labels Aug 1, 2024
@severo
Collaborator Author

severo commented Aug 1, 2024

I created #3010 and #3011
