Conversation

@AndreaBozzo AndreaBozzo commented Nov 27, 2025

Reduces integration test CI time by using a pre-built Docker image from Docker Hub.

Approach

Per maintainer suggestion, the runner image is now pre-built and published to Docker Hub instead of building it in CI:

Changes

  • Remove prepare-integration-build job
  • Pull pre-built image directly from Docker Hub
  • Use GitHub Actions cache for Rust/Python build artifacts
  • Revert ci_run.sh to original simple version
  • Restore networks section in compose.yaml

After merge

The maintainer can republish the image under the official Hudi Docker Hub.

Fixes #270

…ache#270)

Implement Phase 1 of CI optimization to skip integration tests on docs/chore PRs.

Changes:
- Enable automatic PR labeling by removing 'if: false' from pr.yml
- Add 'demo' and 'chore' labels to labeler.yml for better categorization
- Add 'check-integration-needed' job to detect if integration tests should run
- Update 'integration-tests' job to depend on check and skip for docs/chore-only PRs

Expected savings: 15-20 minutes on ~30% of PRs (docs/chore only)

Safety: Fail-safe design - if labeler fails or labels unclear, tests still run

Related to apache#270
Implement Phase 2 of CI optimization to build Python bindings once and reuse across all integration test apps.

Changes:
- Add 'prepare-integration-build' job that builds runner image with pre-compiled bindings
- Save built image as artifact (~500MB-1GB, 1-day retention)
- Update 'integration-tests' job to download and load cached runner image
- Modify ci_run.sh to support --use-cached-image flag
- Skip 'make setup develop' in datafusion/rust/python apps when using cached image
- Remove Dockerfile hash from cache key (no longer rebuilding image per matrix item)

Expected savings: 6-9 minutes on all code PRs (eliminates 3 redundant 3-4 min builds)

Trade-offs:
- Adds ~1 minute for artifact upload/download
- Net savings: 5-8 minutes per workflow run

Related to apache#270
Implement Phase 3 of CI optimization to cache Docker build layers.

Changes:
- Add Docker Buildx layer cache to prepare-integration-build job
- Use docker buildx build with cache-from/cache-to for runner image
- Rotate cache directory to prevent unbounded growth
- Cache key based on Dockerfile hash for proper invalidation
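The cache wiring described above could be sketched roughly as follows; the image tag, cache paths, and hash input are illustrative assumptions, not the PR's exact values, and the buildx command is only printed so the sketch stays side-effect-free:

```shell
#!/bin/sh
# Sketch only: tag, cache paths, and key derivation are assumed values.
CACHE_DIR=/tmp/.buildx-cache
CACHE_DIR_NEW=/tmp/.buildx-cache-new

# Key the cache on the Dockerfile contents so edits invalidate it.
DOCKERFILE_HASH=$(sha256sum Dockerfile 2>/dev/null | cut -c1-12)
CACHE_KEY="buildx-${DOCKERFILE_HASH:-nohash}"

# Print the buildx command instead of running it (no Docker daemon needed here).
BUILD_CMD="docker buildx build --cache-from type=local,src=$CACHE_DIR --cache-to type=local,dest=$CACHE_DIR_NEW,mode=max -t hudi-demo-runner:ci ."
echo "$CACHE_KEY"
echo "$BUILD_CMD"
```

Writing `--cache-to` into a separate directory is what makes the rotation step below possible without ever corrupting the live cache.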

Expected savings: 30-60 seconds on subsequent runs with cache hit

This completes the three-phase CI optimization for issue apache#270:
- Phase 1: Label-based conditional execution (15-20 min savings on docs PRs)
- Phase 2: Eliminate redundant builds (6-9 min savings on code PRs)
- Phase 3: Docker layer caching (30-60 sec savings on repeat runs)

Total expected savings: ~11 min per PR on average (40% reduction)

Related to apache#270
- Fix label matching logic using jq instead of grep anchors
- Add atomic cache rotation to prevent data loss
- Add error handling for binding build failures
- Improve logging and validation in ci_run.sh
- Update cache key to include Rust toolchain version
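The jq-based label matching mentioned above could look roughly like this; the label names and the skip condition are assumptions based on the description, and the fail-safe default is that any missing tool or error falls through to running the tests:

```shell
#!/bin/sh
# Hypothetical sketch: skip integration tests only when every label is docs/chore.
# Label names are assumed; any error falls through to running the tests (fail-safe).
LABELS_JSON='["documentation", "chore"]'   # in CI this would come from the GitHub API

RESULT=run
if command -v jq >/dev/null 2>&1; then
  if echo "$LABELS_JSON" | jq -e 'length > 0 and all(.[]; . == "documentation" or . == "chore")' >/dev/null 2>&1; then
    RESULT=skip
  fi
fi
echo "integration tests: $RESULT"
```

Unlike grep anchors, `all(.[]; ...)` matches whole JSON strings, so a label like "chores-misc" cannot accidentally satisfy the check.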

AndreaBozzo commented Nov 27, 2025

We should keep an eye on the actual performance impact once/if this is merged.

The Docker image artifact is going to be fairly large (probably 1-2GB compressed), so the transfer time between jobs might eat into some of the time savings we're expecting.

Would be good to track:

  • Actual artifact size and upload/download times
  • Real-world time savings per PR
  • Cache hit rates in practice

If the artifact transfer ends up taking 2-3+ minutes, we might want to consider alternatives like pushing to GHCR instead of using GitHub artifacts.

Open to any feedback or changes.


codecov bot commented Nov 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.65%. Comparing base (065fd31) to head (1418911).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #487   +/-   ##
=======================================
  Coverage   85.65%   85.65%           
=======================================
  Files          65       65           
  Lines        3897     3897           
=======================================
  Hits         3338     3338           
  Misses        559      559           

☔ View full report in Codecov by Sentry.

This commit addresses critical failures in PR apache#487's CI workflows:

1. Fixed UV PATH issue in prepare-integration-build
   - Added explicit PATH export in docker compose exec command
   - UV binary is installed to /root/.local/bin/ but wasn't in PATH
   - Prevented "uv is not installed" error during binding compilation

2. Fixed buildx cache rotation on cold start
   - Added existence check before moving cache directory
   - Prevented "cannot stat '/tmp/.buildx-cache'" error
   - Made cache rotation atomic to avoid data loss

3. Fixed labeler permissions issue
   - Created separate pr-labeler.yml workflow with pull_request_target
   - Disabled labeler in pr.yml temporarily
   - Allows labeling to work on fork PRs while maintaining security

All changes maintain the PR's core optimization goals:
- Conditional test execution based on labels
- Docker layer caching for faster builds
- Binding pre-compilation to eliminate redundant builds
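A minimal sketch of the first two fixes, assuming uv's default install location and the /tmp cache paths mentioned above; the guard around the `mv` is what makes the cold start safe:

```shell
#!/bin/sh
# 1) uv installs to /root/.local/bin by default, which a plain
#    `docker compose exec` PATH may not include; export it explicitly.
export PATH="/root/.local/bin:$PATH"

# 2) Rotate the buildx cache only when a fresh cache was actually produced,
#    so the first (cold) run does not fail on a missing directory.
CACHE_DIR="/tmp/.buildx-cache"
CACHE_DIR_NEW="/tmp/.buildx-cache-new"

if [ -d "$CACHE_DIR_NEW" ]; then
  rm -rf "$CACHE_DIR"
  mv "$CACHE_DIR_NEW" "$CACHE_DIR"
fi
echo "cache rotation done"
```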
@AndreaBozzo
Author

Fixed the CI failures - three issues were blocking the workflow:

  1. UV PATH not found - docker compose exec wasn't inheriting the container's PATH, so the uv binary wasn't accessible. Added explicit PATH export.

  2. Cache rotation failing on first run - The buildx cache didn't exist yet, causing mv to fail. Added a check before moving.

  3. Labeler permissions - Fork PRs can't use pull_request trigger with write permissions. Created a separate pr-labeler.yml workflow using pull_request_target (which is safe because it only checks out the base branch, not the PR code).

Testing the labeler workflow on this PR to make sure it works correctly; this might take a few tries xd

- Use explicit .cargo/registry and .cargo/git paths in artifact upload
- Update pr-labeler.yml to actions/checkout@v6 for consistency
- Remove unused demo_network from compose.yaml

AndreaBozzo commented Dec 7, 2025

Good morning @xushiyan

Did a bit of cleanup this morning; build caching 'should' now work (hedging because GH Actions can behave strangely at times). Thank you for your time.

Note on PR Labeler: The pr-labeler.yml workflow uses pull_request_target for write permissions, but since this file doesn't exist in main yet, GitHub won't execute it. Auto-labeling will only work after this PR is merged, or after a small, separate labels-only commit lands on main.


xushiyan commented Dec 7, 2025

> Good morning @xushiyan
>
> Did a bit of cleanup this morning; build caching 'should' now work (hedging because GH Actions can behave strangely at times). Thank you for your time.
>
> Note on PR Labeler: The pr-labeler.yml workflow uses pull_request_target for write permissions, but since this file doesn't exist in main yet, GitHub won't execute it. Auto-labeling will only work after this PR is merged, or after a small, separate labels-only commit lands on main.

Thanks. We don't need to make the labeler work right now, as there are some policy restrictions. We can just focus on reducing the CI time in this PR. Is there a way you can try running the CI in your own fork?

- Fix uv PATH: install to /usr/local/bin instead of /root/.local/bin (hopefully this sticks this time) so it works when container runs as HOST_UID:HOST_GID
- Remove pr-labeler.yml workflow (policy restriction)
- Remove check-integration-needed job and label-based skip logic
- Revert labeler.yml additions (demo, chore labels)
@AndreaBozzo
Author

>> Good morning @xushiyan
>> Did a bit of cleanup this morning; build caching 'should' now work (hedging because GH Actions can behave strangely at times). Thank you for your time.
>> Note on PR Labeler: The pr-labeler.yml workflow uses pull_request_target for write permissions, but since this file doesn't exist in main yet, GitHub won't execute it. Auto-labeling will only work after this PR is merged, or after a small, separate labels-only commit lands on main.
>
> Thanks. We don't need to make the labeler work right now, as there are some policy restrictions. We can just focus on reducing the CI time in this PR. Is there a way you can try running the CI in your own fork?

Unfortunately, even in 2025, trial & error is the fastest way to work with GitHub Actions.

[screenshot: uvhudi]

uv now seems to be working as a non-root user locally; I also reverted the pr-labeler changes.

@AndreaBozzo
Author

By the way, thanks for your patience @xushiyan, I appreciate it; logs are crucial for this change.

The container runs as HOST_UID:HOST_GID but Python and venv were
installed by root during image build. This caused "Permission denied"
when accessing /opt/.venv/bin/python3.

Changes:
- Set UV_PYTHON_INSTALL_DIR=/opt/python to install Python in shared location
- Add chmod to make Python and venv readable by any user
@AndreaBozzo AndreaBozzo force-pushed the optimize-ci-integration-tests-270 branch from 0d84e83 to 43db981 on December 21, 2025 17:12
@AndreaBozzo
Author

Last 2 commits: this should help if the problem persists.

Root cause: The Docker container runs as HOST_UID:HOST_GID (non-root), but Python and venv were installed by root during image build.

Changes:

  1. UV_PYTHON_INSTALL_DIR=/opt/python - Install Python in shared location instead of /root/.local/share/
  2. chmod -R a+rX /opt/python - Make Python readable by any user
  3. chmod -R a+rwX /opt/.venv - Make venv writable (needed for pip install to create new directories)
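The three steps above can be sketched as follows; to stay runnable without root, the sketch stages the layout under /tmp instead of the real /opt paths used in the image:

```shell
#!/bin/sh
# Sketch of the permission fix; the real image uses /opt/python and /opt/.venv,
# staged here under /tmp so the sketch runs without root.
ROOT=/tmp/img-sketch
export UV_PYTHON_INSTALL_DIR="$ROOT/opt/python"   # shared location, not /root/.local/share

mkdir -p "$ROOT/opt/python" "$ROOT/opt/.venv"

# Python install: readable and traversable by any UID.
chmod -R a+rX "$ROOT/opt/python"
# venv: also writable, since `pip install` creates new directories under it.
chmod -R a+rwX "$ROOT/opt/.venv"
echo "permissions set"
```

The distinction between `a+rX` and `a+rwX` mirrors the list above: the interpreter only needs to be read and executed, while the venv must accept writes from whatever HOST_UID the container runs as.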

- Build minio client (mc) image even with --use-cached-build flag,
  as only runner image is saved as artifact
- Add .cargo/ and .uv-cache/ to gitignore (created by Docker tests)

AndreaBozzo commented Dec 21, 2025

Some progress after the last push; I also added 2 cache folders to .gitignore that seemed to be missing.

The --use-cached-build flag was passing --no-build, which skipped building the mc image (only the runner image was cached as an artifact).
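In ci_run.sh terms the fix amounts to something like the following sketch (service names assumed); the compose commands are printed rather than executed so no Docker daemon is required:

```shell
#!/bin/sh
# Hypothetical sketch of the --use-cached-build handling in ci_run.sh.
ci_up() {
  if [ "$1" = "--use-cached-build" ]; then
    # The runner image was loaded from the artifact, but mc was not:
    # build mc explicitly, then start everything without rebuilding.
    echo "docker compose build mc"
    echo "docker compose up -d --no-build"
  else
    echo "docker compose up -d --build"
  fi
}

CMDS=$(ci_up --use-cached-build)
echo "$CMDS"
```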

we 'should' be close

@xushiyan
Member

Hey @AndreaBozzo, thanks for the changes, but I'm not sure about the gain: it seems that the main branch https://github.com/apache/hudi-rs/commits/main/ without the prepare-integration-build job took a shorter time.

@AndreaBozzo
Author

Hi @xushiyan, thanks for the feedback! You're right to question the gains.

We hit several issues (permission errors, missing mc image) that required multiple fix iterations. The CI runs you saw were failing/suboptimal.

Intended gains (once working):

  • Build Python bindings once → reuse across 4 integration tests (currently each rebuilds from scratch)
  • ~6-9 min savings on code PRs (3 redundant make setup develop eliminated)

Current overhead concerns:

  • Artifact upload/download (~1-2GB runner image)
  • Still need to build mc in each job

Options:

  1. Simplify - Drop the artifact approach, just use GitHub Actions cache for cargo/target dirs
  2. Abandon - Keep main branch approach if complexity > benefit

I'm perfectly fine with either option; the change was meant to be beneficial, not detrimental, of course.


@xushiyan xushiyan left a comment


Thanks @AndreaBozzo. I have some questions and suggestions.


# Make Python installation and venv accessible to any user (needed for HOST_UID:HOST_GID mapping)
# Python needs read+execute, venv needs read+write+execute for pip install
RUN chmod -R a+rX /opt/python && chmod -R a+rwX /opt/.venv
Member

Can you pre-build this Docker image and publish it to Docker Hub so it can be pulled directly during CI? Once everything passes, I can re-publish the same image under Hudi's official Docker Hub.

Author

I'm going to try to do that before Christmas; that's actually a neat solution.

Member

sounds good

- Remove prepare-integration-build job
- Pull andreab92/hudi-demo-runner:latest in integration-tests
- Revert ci_run.sh to original version
- Restore networks section in compose.yaml

AndreaBozzo commented Dec 29, 2025

Hi @xushiyan, sorry for the delay; the holidays got in the way. I did my best to get this done before Christmas but couldn't make it.

Changes in this commit

Simplified the CI approach as you suggested:

  1. Pre-built Docker image published: andreab92/hudi-demo-runner:latest

    • Tags: rust-1.87-py3.13, latest
    • SHA: a4dd53edc8d11499efb6634a7152742bdc60404e1789dcb4c74df8263ea2cd96
  2. Removed prepare-integration-build job

  3. Integration tests now:

    • Pull image directly from Docker Hub (~30s)
    • Use GitHub Actions cache for Rust/Python build artifacts
    • Run in parallel immediately
  4. Reverted compose.yaml network section and ci_run.sh to original versions
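The pull-plus-cache setup could look roughly like this in the workflow file; the cache paths and key are assumptions for illustration, not the PR's exact values:

```yaml
# Hypothetical sketch of the simplified integration-tests steps.
steps:
  - uses: actions/checkout@v6
  - name: Pull pre-built runner image
    run: docker pull andreab92/hudi-demo-runner:latest
  - name: Cache Rust/Python build artifacts   # paths and key are assumed
    uses: actions/cache@v4
    with:
      path: |
        .cargo
        .uv-cache
      key: integration-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
```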

After merge

You can republish the image under Hudi's official Docker Hub:

docker pull andreab92/hudi-demo-runner:latest
docker tag andreab92/hudi-demo-runner:latest apache/hudi-demo-runner:latest
docker push apache/hudi-demo-runner:latest

Then update CI to use apache/hudi-demo-runner:latest.

Let me know if anything needs adjustment!

@AndreaBozzo AndreaBozzo requested a review from xushiyan December 29, 2025 22:23

xushiyan commented Jan 2, 2026

@AndreaBozzo I'll investigate later; it's easier for me to trigger the CI and see how it works. Meanwhile, can you rebase and resolve the conflict?

@AndreaBozzo
Author

> @AndreaBozzo I'll investigate later; it's easier for me to trigger the CI and see how it works. Meanwhile, can you rebase and resolve the conflict?

Updated the branch. Don't worry, I'm in no rush, and thanks again!


Successfully merging this pull request may close these issues.

Improve integration test CI time
