-
Notifications
You must be signed in to change notification settings - Fork 357
Support empty folder deletion for HNS bucket with Orbax #1286
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Left some comment
Dockerfile
Outdated
@@ -83,14 +83,15 @@ ENTRYPOINT ["/opt/apache/beam/boot"] | |||
|
|||
FROM base AS tpu | |||
|
|||
ARG EXTRAS= | |||
ARG EXTRAS=orbax |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it, made orbax dependency change in pyproject.toml
Dockerfile
Outdated
|
||
ENV UV_FIND_LINKS=https://storage.googleapis.com/jax-releases/libtpu_releases.html | ||
# Ensure we install the TPU version, even if building locally. | ||
# Jax will fallback to CPU when run on a machine without TPU. | ||
RUN uv pip install --prerelease=allow .[core,tpu] && uv cache clean | ||
RUN if [ -n "$EXTRAS" ]; then uv pip install .[$EXTRAS] && uv cache clean; fi | ||
COPY . . | ||
RUN pip install -U "orbax-checkpoint @ git+https://github.com/google/orbax.git@refs/pull/2074/head#subdirectory=checkpoint" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest moving this change to pyproject.toml
,
# Orbax checkpointing.
orbax = [
"humanize==4.10.0",
"orbax-checkpoint@git+https://github.com/google/orbax.git@refs/pull/2074/head#subdirectory=checkpoint",
]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't use something from an unmerged PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this PR to point to the HEAD of the Orbax repository as the orbax PR is now merged. Should we merge this now, or should we wait for the next official Orbax release and use that version instead ?
Also, moved the change to pyproject.yaml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this change is needed because it relies on the enable_hns_rmtree flag, which was recently added to the Orbax library.
@@ -242,6 +242,7 @@ def save_fn_with_summaries(step: int, last_saved_step: Optional[int]) -> bool: | |||
should_save_fn=save_fn_with_summaries, | |||
enable_background_delete=True, | |||
async_options=ocp.options.AsyncOptions(timeout_secs=cfg.async_timeout_secs), | |||
enable_hns_rmtree=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add some test results in the PR summary to help with reviewers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the details. Kindly review and I'm happy to provide any additional details as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the GCS is not HNS? Is there any regression in terms of functionality and performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for non HNS bucket, there is existing path.rmtree()
in orbax which works to delete objects in non HNS. But that does not work for HNS bucket as HNS bucket have true folder structure and path.rmtree()
only removes objects, leaving all the empty parent folders intact. So, orbax is now manually deleting the empty folders recursively.
_rmtree function in Orbax https://github.com/google/orbax/blob/1769e61f1a380f975d7094f6b8c6ecff035ac5db/checkpoint/orbax/checkpoint/_src/path/deleter.py#L140
And Axlearn uses enable_background_delete=True in CheckpointManagerOptions CheckpointManagerOptionshttps://github.com/apple/axlearn/blob/89991e862f2889641dd705040106f1706cd8db5c/axlearn/common/checkpointer_orbax.py#L243C16 which makes the deletion in background and should not cause regression in performance
pyproject.toml
Outdated
@@ -195,4 +196,4 @@ junit_family="xunit2" | |||
|
|||
[tool.isort] | |||
line_length = 100 | |||
profile = "black" | |||
profile = "black" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix extra line -- these changes will be rejected by pre-commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should revert the change here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Dockerfile
Outdated
@@ -112,4 +113,4 @@ COPY . . | |||
# Final target spec. # | |||
################################################################################ | |||
|
|||
FROM ${TARGET} AS final | |||
FROM ${TARGET} AS final |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove extra line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should revert the change in this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Dockerfile
Outdated
@@ -112,4 +113,4 @@ COPY . . | |||
# Final target spec. # | |||
################################################################################ | |||
|
|||
FROM ${TARGET} AS final | |||
FROM ${TARGET} AS final |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should revert the change in this file
pyproject.toml
Outdated
@@ -195,4 +196,4 @@ junit_family="xunit2" | |||
|
|||
[tool.isort] | |||
line_length = 100 | |||
profile = "black" | |||
profile = "black" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should revert the change here
@@ -120,7 +121,7 @@ vertexai_tensorboard = [ | |||
"setuptools==65.7.0", | |||
# Pin version to fix Tensorboard uploader TypeError: can only concatenate str (not "NoneType") to str | |||
# https://github.com/googleapis/python-aiplatform/commit/4f982ab254b05fe44a9d2ed959fca2793961b56c | |||
"google-cloud-aiplatform[tensorboard]==1.61.0", | |||
"google-cloud-aiplatform[tensorboard]==1.92.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a required change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as we updated the version of google-cloud-storage to recognize buckets with hierarchical namespace enabled, google-cloud-aiplatform is also updated correspondingly.
@@ -242,6 +242,7 @@ def save_fn_with_summaries(step: int, last_saved_step: Optional[int]) -> bool: | |||
should_save_fn=save_fn_with_summaries, | |||
enable_background_delete=True, | |||
async_options=ocp.options.AsyncOptions(timeout_secs=cfg.async_timeout_secs), | |||
enable_hns_rmtree=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the GCS is not HNS? Is there any regression in terms of functionality and performance?
pyproject.toml
Outdated
@@ -154,7 +155,7 @@ mmau = [ | |||
# Orbax checkpointing. | |||
orbax = [ | |||
"humanize==4.10.0", | |||
"orbax-checkpoint==0.11.15", | |||
"orbax-checkpoint @ git+https://github.com/google/orbax.git#subdirectory=checkpoint", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When will the release be available? yes it is preferred to use a released version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked with Orbax team it will be available likely tomorrow. I will update the release version here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using the latest release version now.
Orbax made a change to support empty folders deletion which require enable_hns_rmtree flag to be True in CheckpointManagerOptions.