Support empty folder deletion for HNS bucket with Orbax #1286

Open
wants to merge 7 commits into
base: main
@deepikarajani24 deepikarajani24 commented Jul 7, 2025

Orbax added support for deleting empty folders, which requires setting the enable_hns_rmtree flag to True in CheckpointManagerOptions.

  • Tested regular Orbax checkpointing on an HNS bucket with keep_last_n set to 3 and confirmed that only the latest 3 checkpoint folders remained and the rest were deleted. Without this change, all checkpoint folders remained and none were deleted.
  • This change should also improve listing performance on HNS buckets, which slows down when empty folders are present.
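The retention behaviour verified above can be sketched in plain Python. This is only an illustration of the expected outcome with keep_last_n set to 3, not AXLearn's or Orbax's implementation; the function name split_by_retention and the step numbers are hypothetical.

```python
from typing import Iterable, List, Tuple


def split_by_retention(steps: Iterable[int], keep_last_n: int) -> Tuple[List[int], List[int]]:
    """Returns (kept, deleted) steps when only the newest keep_last_n steps survive."""
    if keep_last_n < 0:
        raise ValueError("keep_last_n must be non-negative")
    ordered = sorted(set(steps))
    # Everything before the cutoff index is older than the newest keep_last_n steps.
    cutoff = max(len(ordered) - keep_last_n, 0)
    return ordered[cutoff:], ordered[:cutoff]


# Hypothetical checkpoint steps; with keep_last_n=3 the two oldest are removed.
kept, deleted = split_by_retention([100, 200, 300, 400, 500], keep_last_n=3)
print(kept)     # [300, 400, 500]
print(deleted)  # [100, 200]
```

On an HNS bucket, the folders for the deleted steps should disappear as well, which is what the test above checked.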

@deepikarajani24 deepikarajani24 requested review from ruomingp, markblee and a team as code owners July 7, 2025 03:51
Contributor

@Ethanlm Ethanlm left a comment

Thanks. Left some comments.

Dockerfile Outdated
@@ -83,14 +83,15 @@ ENTRYPOINT ["/opt/apache/beam/boot"]

FROM base AS tpu

-ARG EXTRAS=
+ARG EXTRAS=orbax
Contributor

Why do we need to change this?

Author

Removed it and made the Orbax dependency change in pyproject.toml instead.

Dockerfile Outdated

ENV UV_FIND_LINKS=https://storage.googleapis.com/jax-releases/libtpu_releases.html
# Ensure we install the TPU version, even if building locally.
# Jax will fallback to CPU when run on a machine without TPU.
RUN uv pip install --prerelease=allow .[core,tpu] && uv cache clean
RUN if [ -n "$EXTRAS" ]; then uv pip install .[$EXTRAS] && uv cache clean; fi
COPY . .
RUN pip install -U "orbax-checkpoint @ git+https://github.com/google/orbax.git@refs/pull/2074/head#subdirectory=checkpoint"
Contributor

Why is this needed?

Contributor

Suggest moving this change to pyproject.toml:

# Orbax checkpointing.
orbax = [
    "humanize==4.10.0",
    "orbax-checkpoint@git+https://github.com/google/orbax.git@refs/pull/2074/head#subdirectory=checkpoint",
]

Contributor

We can't use something from an unmerged PR.

Author

I've updated this PR to point to the HEAD of the Orbax repository, as the Orbax PR is now merged. Should we merge this now, or wait for the next official Orbax release and use that version instead?

Also, moved the change to pyproject.toml.

Author

And this change is needed because it relies on the enable_hns_rmtree flag, which was recently added to the Orbax library.
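Because the flag only exists in recent Orbax versions, one way to see the dependency is a guard that passes only the options a given class declares. The sketch below assumes the options class is a dataclass; OldOptions and NewOptions are hypothetical stand-ins rather than Orbax classes, and supported_kwargs is an illustrative helper, not part of any library.

```python
import dataclasses
from typing import Any, Dict, Type


def supported_kwargs(options_cls: Type[Any], **kwargs: Any) -> Dict[str, Any]:
    """Keeps only the keyword arguments that options_cls declares as dataclass fields."""
    field_names = {f.name for f in dataclasses.fields(options_cls)}
    return {k: v for k, v in kwargs.items() if k in field_names}


# Hypothetical stand-in for an options class that predates the flag.
@dataclasses.dataclass
class OldOptions:
    enable_background_delete: bool = False


# Hypothetical stand-in for an options class that includes the flag.
@dataclasses.dataclass
class NewOptions:
    enable_background_delete: bool = False
    enable_hns_rmtree: bool = False


wanted = dict(enable_background_delete=True, enable_hns_rmtree=True)
print(OldOptions(**supported_kwargs(OldOptions, **wanted)))
print(NewOptions(**supported_kwargs(NewOptions, **wanted)))
```

A guard like this silently drops the flag on older versions, so pinning a release that contains the flag, as this PR does, is the simpler and more predictable choice.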

@@ -242,6 +242,7 @@ def save_fn_with_summaries(step: int, last_saved_step: Optional[int]) -> bool:
should_save_fn=save_fn_with_summaries,
enable_background_delete=True,
async_options=ocp.options.AsyncOptions(timeout_secs=cfg.async_timeout_secs),
+enable_hns_rmtree=True,
Contributor

Can you please add some test results to the PR summary to help reviewers?

Author

I have added the details. Please review; I'm happy to provide any additional details as needed.

Contributor

What happens if the GCS bucket is not HNS? Is there any regression in terms of functionality or performance?

Author

For a non-HNS bucket, the existing path.rmtree() in Orbax is enough to delete the objects. That does not work for an HNS bucket, because HNS buckets have a true folder structure and path.rmtree() only removes objects, leaving all the empty parent folders intact. So Orbax now explicitly deletes the empty folders recursively.

_rmtree function in Orbax https://github.com/google/orbax/blob/1769e61f1a380f975d7094f6b8c6ecff035ac5db/checkpoint/orbax/checkpoint/_src/path/deleter.py#L140

Axlearn also sets enable_background_delete=True in CheckpointManagerOptions (https://github.com/apple/axlearn/blob/89991e862f2889641dd705040106f1706cd8db5c/axlearn/common/checkpointer_orbax.py#L243C16), so the deletion runs in the background and should not cause a performance regression.
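The difference between removing objects and removing folders can be illustrated on a local filesystem, which also has real directories. This is a minimal sketch of the idea, assuming a local temporary directory as a stand-in for an HNS bucket; it is not Orbax's _rmtree implementation, and the helper names and the step_00000100 path are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import Tuple


def delete_objects_only(root: Path) -> None:
    """Removes every file under root but leaves directories, like an object-only delete."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            os.remove(os.path.join(dirpath, name))


def remove_empty_dirs(root: Path) -> None:
    """Removes empty directories bottom-up, finishing with root itself."""
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        # Re-check the live directory contents, since children were just removed.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)


def demo() -> Tuple[bool, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        step_dir = Path(tmp) / "step_00000100"
        (step_dir / "state" / "shard_0").mkdir(parents=True)
        (step_dir / "state" / "shard_0" / "data.bin").write_bytes(b"x")

        delete_objects_only(step_dir)
        exists_after_object_delete = step_dir.exists()  # Empty folders are still there.

        remove_empty_dirs(step_dir)
        exists_after_folder_cleanup = step_dir.exists()  # Folder tree is gone.
        return exists_after_object_delete, exists_after_folder_cleanup


print(demo())  # (True, False)
```

The bottom-up order matters: a parent folder only becomes empty, and therefore removable, after its children have been removed.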

pyproject.toml Outdated
@@ -195,4 +196,4 @@ junit_family="xunit2"

[tool.isort]
line_length = 100
-profile = "black"
+profile = "black"
Contributor

Fix extra line -- these changes will be rejected by pre-commit

Contributor

We should revert the change here

Author

Done

Dockerfile Outdated
@@ -112,4 +113,4 @@ COPY . .
# Final target spec. #
################################################################################

-FROM ${TARGET} AS final
+FROM ${TARGET} AS final
Contributor

Remove extra line

Contributor

We should revert the change in this file

Author

Done

@deepikarajani24 deepikarajani24 changed the title from "Support empty folder deletion for HNS bucket with Orbax" to "Support empty folder deletion for HNS bucket with Orbax which required enabling enable_hns_rmtree flag in CheckpointManagerOptions Tested on HNS bucket with orbax regular checkpoint with axlearn, and confirmed the empty checkpoint folders were being deleted. Screenshot of bucket https://screenshot.googleplex.com/B2sN4YfRVcjik2Q where only the last 3 checkpoints are kept and older checkpoints were deleted." Jul 8, 2025
@deepikarajani24 deepikarajani24 changed the title to "Support empty folder deletion for HNS bucket with Orbax which required enabling enable_hns_rmtree flag in CheckpointManagerOptions - test https://screenshot.googleplex.com/B2sN4YfRVcjik2Q older checkpoints were deleted" Jul 8, 2025
@deepikarajani24 deepikarajani24 changed the title to "Support empty folder deletion for HNS bucket with Orbax which required enabling enable_hns_rmtree flag in CheckpointManagerOptions - test confirmed only keep_last_n checkpoints were left and older ckpts were deleted, without this change all checkpoint folders are kept" Jul 8, 2025
@deepikarajani24 deepikarajani24 changed the title back to "Support empty folder deletion for HNS bucket with Orbax" Jul 8, 2025
pyproject.toml Outdated
@@ -120,7 +121,7 @@ vertexai_tensorboard = [
"setuptools==65.7.0",
# Pin version to fix Tensorboard uploader TypeError: can only concatenate str (not "NoneType") to str
# https://github.com/googleapis/python-aiplatform/commit/4f982ab254b05fe44a9d2ed959fca2793961b56c
-"google-cloud-aiplatform[tensorboard]==1.61.0",
+"google-cloud-aiplatform[tensorboard]==1.92.0",
Contributor

Is this a required change?

Author

Yes. We updated the version of google-cloud-storage so that it recognizes buckets with hierarchical namespace enabled, and google-cloud-aiplatform was updated correspondingly.

pyproject.toml Outdated
@@ -154,7 +155,7 @@ mmau = [
# Orbax checkpointing.
orbax = [
"humanize==4.10.0",
-"orbax-checkpoint==0.11.15",
+"orbax-checkpoint @ git+https://github.com/google/orbax.git#subdirectory=checkpoint",
Contributor

When will the release be available? Yes, using a released version is preferred.

Author

I checked with the Orbax team; the release will likely be available tomorrow. I will update to the release version here.

Author

Using the latest release version now.

3 participants