Split batching of remove calls into 10 MB or 100 request chunks #495

Closed

alxmrs opened this issue Aug 31, 2022 · 6 comments · Fixed by #496

Comments

@alxmrs (Contributor) commented Aug 31, 2022

The GCS API supports batching requests to minimize the number of simultaneous client connections. This could help solve the issue related to #493.

https://cloud.google.com/storage/docs/batch#overview
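
For reference, the batch endpoint takes a single multipart/mixed POST whose parts are individual HTTP subrequests. The sketch below shows roughly how a batched delete body could be assembled; the boundary string, bucket/object names, and the build_batch_delete_body helper are illustrative, not gcsfs's actual request-building code.

    import urllib.parse

    BOUNDARY = "batch_boundary_123"

    def build_batch_delete_body(bucket, object_names):
        # Each part is one embedded DELETE subrequest, per the batch docs above.
        parts = []
        for i, name in enumerate(object_names, start=1):
            quoted = urllib.parse.quote(name, safe="")
            parts.append(
                f"--{BOUNDARY}\r\n"
                "Content-Type: application/http\r\n"
                f"Content-ID: <item{i}>\r\n"
                "\r\n"
                f"DELETE /storage/v1/b/{bucket}/o/{quoted} HTTP/1.1\r\n"
                "\r\n"
            )
        parts.append(f"--{BOUNDARY}--\r\n")
        return "".join(parts)

    # The body would be POSTed to https://storage.googleapis.com/batch/storage/v1
    # with Content-Type: multipart/mixed; boundary=<BOUNDARY> (auth omitted).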

@alxmrs (Contributor, Author) commented Aug 31, 2022

Let me restate @dmaring's recommendation about batching (the initial version of this issue was mistitled).

It looks like there is batching of requests for certain calls, like rm! See:

body = "".join(

However, according to the docs linked above, some batched requests will be too big: if the payload is larger than 10 MB or there are more than 100 calls (as happens when working with certain geospatial datasets), the batch will fail.

The fix is to chunk batch calls so they stay within these limits.
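
A minimal sketch of that chunking idea (this is not the actual patch in #496): split the per-object subrequest bodies so no batch exceeds 100 entries or roughly 10 MB of payload, then send each chunk as its own batch request.

    MAX_BATCH_COUNT = 100         # GCS limit of 100 calls per batch request
    MAX_BATCH_BYTES = 10 * 2**20  # ~10 MB limit on total batch payload

    def chunk_subrequests(subrequest_bodies):
        # Yield lists of subrequest bodies that respect both batch limits.
        chunk, chunk_bytes = [], 0
        for body in subrequest_bodies:
            size = len(body.encode("utf-8"))
            if chunk and (
                len(chunk) >= MAX_BATCH_COUNT
                or chunk_bytes + size > MAX_BATCH_BYTES
            ):
                yield chunk
                chunk, chunk_bytes = [], 0
            chunk.append(body)
            chunk_bytes += size
        if chunk:
            yield chunk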

@alxmrs alxmrs changed the title Automatically batch requests Split batching of remove calls into 10 MB or 100 request chunks Aug 31, 2022
@martindurant (Member) commented:
The fix here ...

Is there a linked PR?

@alxmrs (Contributor, Author) commented Aug 31, 2022

:) Not yet! I'm about to write one. Maybe that was a poor choice of words; I meant to describe the approach for a fix.

This is the next thing I am working on today.

@dmaring commented Aug 31, 2022

A few other things to consider with large delete batch requests:

  1. GCP Cloud Storage buckets autoscale based on usage. The initial IO capacity of a storage bucket is 1000 object write requests per second, which includes deleting objects.

"Consequently, if the request rate on your bucket increases faster than Cloud Storage can perform this redistribution, you may run into temporary limits, specifically higher latency and error rates."

https://cloud.google.com/storage/docs/request-rate#auto-scaling

  2. The batch API will return 200 even if subrequests in the batch request fail. However, the 200 response includes the individual subrequest responses, which you can check if desired (see the sketch after this list).

https://cloud.google.com/storage/docs/batch#batch-example-response

  3. Hotspotting can occur if file names are sequential. If possible, avoid sequential file names and use a hash as a prefix, as shown in the following link.

https://cloud.google.com/storage/docs/request-rate#naming-convention
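
On point 2 above, a crude illustration of scanning the embedded subresponses for failures after a 200 batch response; the regex-based parsing and the failed_subrequest_statuses name are only for sketching the idea.

    import re

    STATUS_RE = re.compile(r"^HTTP/[\d.]+ (\d{3})", re.MULTILINE)

    def failed_subrequest_statuses(batch_response_text):
        # Each part of the multipart response carries its own status line,
        # e.g. "HTTP/1.1 204 No Content"; collect any non-2xx codes.
        codes = [int(c) for c in STATUS_RE.findall(batch_response_text)]
        return [c for c in codes if not 200 <= c < 300]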

@slevang (Contributor) commented Jan 15, 2023

I think we could do better here. Still running into this issue when I have multiple workers rm-ing files on a bucket that can't scale up quickly enough. The traceback is:

    self.fs.rm(path, recursive=True)
  File "/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py", line 113, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py", line 98, in sync
    raise return_result
  File "/opt/conda/lib/python3.9/site-packages/fsspec/asyn.py", line 53, in _runner
    result[0] = await coro
  File "/opt/conda/lib/python3.9/site-packages/gcsfs/core.py", line 1069, in _rm
    raise exs[0]
  File "/opt/conda/lib/python3.9/site-packages/gcsfs/core.py", line 1034, in _rm_files
    raise OSError(errors)
OSError: ['We encountered an internal error. Please try again.']

This looks like the 503 here:
https://cloud.google.com/storage/docs/json_api/v1/status-codes#503_Service_Unavailable

But if so, I can't figure out why it isn't being caught as an HttpError in validate_response.

Relates to #493, #496, #406

@martindurant (Member) commented:
Yes, I agree that looks like something that should be backoff-retriable. I can't immediately see why this isn't hit by the existing HttpError catch.
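
For illustration, the kind of backoff-and-retry wrapper being discussed might look like the sketch below; retry_on_503 and the send_batch placeholder are hypothetical names, not part of gcsfs.

    import asyncio
    import random

    async def retry_on_503(send_batch, retries=5, base_delay=1.0):
        # Retry a coroutine that fails with the transient "internal error"
        # seen in the traceback above, backing off exponentially (with jitter)
        # so the bucket has time to autoscale.
        for attempt in range(retries):
            try:
                return await send_batch()
            except OSError:
                if attempt == retries - 1:
                    raise
                await asyncio.sleep(base_delay * 2**attempt + random.random())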
