Raise specific errors (and error_code) instead of UnexpectedError #1443
Comments
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
We need to do it to provide better feedback to the user, and to retry when appropriate. |
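A minimal sketch of the "retry when appropriate" part, assuming each cache entry carries a specific `error_code`. The set of retryable codes, the function name and the `max_failed_runs` limit are hypothetical and only illustrate the idea; they are not the project's actual policy.

```python
# Hypothetical: error codes we assume are transient and worth retrying.
RETRYABLE_ERROR_CODES = frozenset({"ConnectionError", "HfHubHTTPError"})


def should_retry(error_code: str, failed_runs: int, max_failed_runs: int = 3) -> bool:
    """Return True if a job that failed with `error_code` should be retried.

    A generic "UnexpectedError" gives no hint about the cause, so it is not
    retried here; a specific code lets us decide.
    """
    if failed_runs >= max_failed_runs:
        return False
    return error_code in RETRYABLE_ERROR_CODES
```

With a catch-all `UnexpectedError` this decision cannot be made, which is one reason to prefer specific codes.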
Copying from #1462
|
Updating list: |
After doing some cache maintenance actions manually (removing obsolete records whose config or split no longer exists), this is the updated list; mostly AttributeError and ClientResponseError were reduced:
|
Update of UnexpectedErrors count by kind:
For split-first-rows-from-parquet it will be fixed with #2126 |
interesting that only 4 steps produce all the unexpected errors |
For |
Current state:
|
Updated list of UnexpectedErrors by kind:
|
Current state:
|
I would bet that most errors occur for datasets with a script. I propose to recreate all of these datasets... In most cases, it will create a DatasetWithScriptNotSupportedError error instead of some weird-looking error. Number of unique datasets:
I'm recreating the datasets one by one, with:
Scaled the admin service from 2 to 4, let's see if it improves something. They are processing at a rate of 1 request per second (approximate value). So: hopefully in two hours we should be done |
Today: number of datasets, by step and cause exception

db.cachedResponsesBlue.aggregate([
  { $match: { error_code: "UnexpectedError", "details.copied_from_artifact": { $exists: false } } },
  { $group: { _id: { kind: "$kind", cause: "$details.cause_exception", dataset: "$dataset" }, count: { $sum: 1 } } },
  { $group: { _id: { kind: "$_id.kind", cause: "$_id.cause" }, count: { $sum: 1 } } },
  { $sort: { "_id.kind": 1, count: -1 } },
  { $project: { _id: 0, kind: "$_id.kind", num_datasets: "$count", cause: "$_id.cause" } }
]);

{ kind: 'config-parquet-and-info', num_datasets: 2486, cause: 'DatasetGenerationError' }
{ kind: 'config-parquet-and-info', num_datasets: 1226, cause: 'DatasetGenerationCastError' }
{ kind: 'config-parquet-and-info', num_datasets: 575, cause: 'OSError' }
{ kind: 'config-parquet-and-info', num_datasets: 64, cause: 'ValueError' }
{ kind: 'config-parquet-and-info', num_datasets: 32, cause: 'NotImplementedError' }
{ kind: 'config-parquet-and-info', num_datasets: 30, cause: 'NonMatchingSplitsSizesError' }
{ kind: 'config-parquet-and-info', num_datasets: 18, cause: 'ZeroDivisionError' }
{ kind: 'config-parquet-and-info', num_datasets: 15, cause: 'RuntimeError' }
{ kind: 'config-parquet-and-info', num_datasets: 14, cause: 'ArrowInvalid' }
{ kind: 'config-parquet-and-info', num_datasets: 11, cause: 'HfHubHTTPError' }
{ kind: 'config-parquet-and-info', num_datasets: 8, cause: 'ParserError' }
{ kind: 'config-parquet-and-info', num_datasets: 7, cause: 'BadZipFile' }
{ kind: 'config-parquet-and-info', num_datasets: 6, cause: 'ReadError' }
{ kind: 'config-parquet-and-info', num_datasets: 5, cause: 'ArrowCapacityError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'TypeError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'IndexError' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'ExpectedMoreSplits' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'RarCannotExec' }
{ kind: 'config-parquet-and-info', num_datasets: 2, cause: 'JSONDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'AttributeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ModuleNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'FileNotFoundError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'UnicodeDecodeError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ConnectionError' }
{ kind: 'config-parquet-and-info', num_datasets: 1, cause: 'ImportError' }
{ kind: 'split-descriptive-statistics', num_datasets: 935, cause: 'TypeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 56, cause: 'ValueError' }
{ kind: 'split-descriptive-statistics', num_datasets: 35, cause: 'ColumnNotFoundError' }
{ kind: 'split-descriptive-statistics', num_datasets: 12, cause: 'ComputeError' }
{ kind: 'split-descriptive-statistics', num_datasets: 5, cause: 'InvalidOperationError' }
{ kind: 'split-descriptive-statistics', num_datasets: 4, cause: 'SchemaError' }
{ kind: 'split-descriptive-statistics', num_datasets: 2, cause: 'DuplicateError' }
{ kind: 'split-duckdb-index', num_datasets: 123, cause: 'InvalidInputException' }
{ kind: 'split-duckdb-index', num_datasets: 109, cause: 'ParserException' }
{ kind: 'split-duckdb-index', num_datasets: 49, cause: 'IOException' }
{ kind: 'split-duckdb-index', num_datasets: 6, cause: 'ConversionException' }
{ kind: 'split-duckdb-index', num_datasets: 5, cause: 'Error' }
{ kind: 'split-duckdb-index', num_datasets: 2, cause: 'TypeMismatchException' }
{ kind: 'split-duckdb-index', num_datasets: 1, cause: 'TransactionException' }
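For readers less familiar with the aggregation pipeline, here is a rough in-memory Python equivalent of the two-stage grouping used in the query above. It assumes cache entries are plain dicts with `kind`, `dataset`, `error_code` and an optional `details` dict; the function name is hypothetical. The first stage collapses entries to distinct (kind, cause, dataset) triples so that a dataset with many failing configs or splits is counted once; the second counts datasets per (kind, cause).

```python
from collections import Counter


def count_datasets_by_kind_and_cause(entries):
    """Count distinct datasets per (kind, cause_exception) among UnexpectedError entries."""
    # Stage 1 ($match + first $group): keep one triple per (kind, cause, dataset),
    # skipping entries whose error was copied from another artifact.
    triples = set()
    for entry in entries:
        details = entry.get("details") or {}
        if entry.get("error_code") != "UnexpectedError":
            continue
        if "copied_from_artifact" in details:
            continue
        triples.add((entry["kind"], details.get("cause_exception"), entry["dataset"]))

    # Stage 2 (second $group): number of datasets per (kind, cause).
    counts = Counter((kind, cause) for kind, cause, _dataset in triples)

    # $sort: kind ascending, then count descending; $project: flatten the keys.
    ordered = sorted(counts.items(), key=lambda item: (item[0][0], -item[1]))
    return [
        {"kind": kind, "num_datasets": n, "cause": cause}
        for (kind, cause), n in ordered
    ]
```

Entries without `details.cause_exception` end up under `cause: None`, mirroring the `null` group in MongoDB.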
Today:
|
Today:
|
Today:
|
The last PR (#2796) has a big impact! 72K -> 20K entries. Replaced with 36K DatasetGenerationError and 12K DatasetGenerationCastError |
Today:
|
After refreshing some records:
|
Today (Almost half of yesterday's):
|
Today:
|
Note that we currently have 14K UnexpectedError entries, which is about 0.1% of the total cache entries. So: not that crucial either. I'll reduce the priority. Maybe more important is to replace |
The following query on the production database gives the number of datasets with at least one cache entry with error_code "UnexpectedError", grouped by the underlying "cause_exception".
For the most common ones (DatasetGenerationError, HfHubHTTPError, OSError, etc.) we would benefit from raising a specific error with its error code. It would allow us to:

null means it has no details.cause_exception. These cache entries should be inspected more closely. See #1123 in particular, which is one of the cases where no cause exception is reported.
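The proposal to raise a specific error with its own error code, instead of a catch-all UnexpectedError, could look roughly like the sketch below. All class names, the `SPECIFIC_ERRORS` mapping and the `to_cacheable_error` helper are hypothetical and chosen for illustration; they are not the project's actual exception hierarchy.

```python
from typing import Optional


class CacheableError(Exception):
    """Hypothetical base class: carries the error_code stored with the cache entry."""

    def __init__(self, message: str, error_code: str, cause: Optional[BaseException] = None):
        super().__init__(message)
        self.error_code = error_code
        # Mirrors details.cause_exception: None when no cause was recorded.
        self.cause_exception = type(cause).__name__ if cause is not None else None


class UnexpectedError(CacheableError):
    """Fallback for causes nobody has classified yet."""

    def __init__(self, message: str, cause: Optional[BaseException] = None):
        super().__init__(message, "UnexpectedError", cause)


class FileSystemError(CacheableError):
    """Hypothetical specific error raised for OSError and its subclasses."""

    def __init__(self, message: str, cause: Optional[BaseException] = None):
        super().__init__(message, "FileSystemError", cause)


# Hypothetical mapping from a cause exception type to a specific error class.
SPECIFIC_ERRORS = {OSError: FileSystemError}


def to_cacheable_error(exc: BaseException) -> CacheableError:
    """Wrap `exc` in the most specific known error, else in UnexpectedError."""
    for exc_type, error_cls in SPECIFIC_ERRORS.items():
        if isinstance(exc, exc_type):
            return error_cls(str(exc), cause=exc)
    return UnexpectedError(str(exc), cause=exc)
```

Each time a frequent cause is identified in the statistics, one entry is added to the mapping, and the corresponding cache entries stop being reported as UnexpectedError.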