ci: optimize integration test CI time #487
base: main
Conversation
…ache#270) Implement Phase 1 of CI optimization to skip integration tests on docs/chore PRs.

Changes:
- Enable automatic PR labeling by removing 'if: false' from pr.yml
- Add 'demo' and 'chore' labels to labeler.yml for better categorization
- Add 'check-integration-needed' job to detect whether integration tests should run
- Update 'integration-tests' job to depend on the check and skip for docs/chore-only PRs

Expected savings: 15-20 minutes on ~30% of PRs (docs/chore only)
Safety: fail-safe design - if the labeler fails or labels are unclear, tests still run

Related to apache#270
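For context, a minimal sketch of what such a check step might run, already using the jq-based matching this thread settles on later; the label names and output variable are assumptions, not the PR's exact script:

```bash
#!/usr/bin/env bash
# Hypothetical label check; 'documentation' and 'chore' are assumed label names.
labels=$(jq -r '.pull_request.labels[].name' "$GITHUB_EVENT_PATH")

# Fail-safe: only skip when at least one label exists AND every label is
# docs/chore; on empty or unexpected labels, integration tests still run.
if [ -n "$labels" ] && ! echo "$labels" | grep -qvE '^(documentation|chore)$'; then
  echo "skip-integration=true" >> "$GITHUB_OUTPUT"
else
  echo "skip-integration=false" >> "$GITHUB_OUTPUT"
fi
```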
Implement Phase 2 of CI optimization to build Python bindings once and reuse them across all integration test apps.

Changes:
- Add 'prepare-integration-build' job that builds the runner image with pre-compiled bindings
- Save the built image as an artifact (~500MB-1GB, 1-day retention)
- Update 'integration-tests' job to download and load the cached runner image
- Modify ci_run.sh to support a --use-cached-image flag
- Skip 'make setup develop' in the datafusion/rust/python apps when using the cached image
- Remove the Dockerfile hash from the cache key (no longer rebuilding the image per matrix item)

Expected savings: 6-9 minutes on all code PRs (eliminates 3 redundant 3-4 min builds)

Trade-offs:
- Adds ~1 minute for artifact upload/download
- Net savings: 5-8 minutes per workflow run

Related to apache#270
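A minimal sketch of the image hand-off between jobs; the image tag, file name, and script path are assumptions, while the --use-cached-image flag is from the commit message above:

```bash
# In prepare-integration-build: build once, then export the image so it can
# be uploaded as a workflow artifact.
docker build -t hudi-demo-runner:ci demo/
docker save hudi-demo-runner:ci | gzip > runner-image.tar.gz
# ...upload runner-image.tar.gz with actions/upload-artifact...

# In each integration-tests matrix job: download the artifact, load the image
# instead of rebuilding it, and tell ci_run.sh to reuse it.
docker load < runner-image.tar.gz
./demo/ci_run.sh --use-cached-image
```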
Implement Phase 3 of CI optimization to cache Docker build layers.

Changes:
- Add a Docker Buildx layer cache to the prepare-integration-build job
- Use docker buildx build with cache-from/cache-to for the runner image
- Rotate the cache directory to prevent unbounded growth
- Base the cache key on the Dockerfile hash for proper invalidation

Expected savings: 30-60 seconds on subsequent runs with a cache hit

This completes the three-phase CI optimization for issue apache#270:
- Phase 1: Label-based conditional execution (15-20 min savings on docs PRs)
- Phase 2: Eliminate redundant builds (6-9 min savings on code PRs)
- Phase 3: Docker layer caching (30-60 sec savings on repeat runs)

Total expected savings: ~11 min per PR on average (40% reduction)

Related to apache#270
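The cache-from/cache-to pattern looks roughly like this; cache paths and the image tag are assumptions for illustration. Writing to a fresh directory and swapping it in afterwards is the usual workaround for buildx's local cache growing without bound:

```bash
# Build using the previous run's layer cache and write the new cache to a
# separate directory.
docker buildx build \
  --cache-from "type=local,src=/tmp/.buildx-cache" \
  --cache-to "type=local,dest=/tmp/.buildx-cache-new,mode=max" \
  --load -t hudi-demo-runner:ci demo/

# Swap the fresh cache in so the directory does not accumulate stale layers.
rm -rf /tmp/.buildx-cache
mv /tmp/.buildx-cache-new /tmp/.buildx-cache
```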
- Fix label matching logic using jq instead of grep anchors
- Add atomic cache rotation to prevent data loss
- Add error handling for binding build failures
- Improve logging and validation in ci_run.sh
- Update the cache key to include the Rust toolchain version
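A sketch of what an atomic-ish rotation can look like, guarding against the cold-start case where no cache directory exists yet; directory names are assumptions:

```bash
# Only rotate when buildx actually produced a new cache (cold starts have none).
if [ -d /tmp/.buildx-cache-new ]; then
  # Move the old cache aside before deleting, so a failure mid-rotation
  # never leaves us with no cache at all.
  if [ -d /tmp/.buildx-cache ]; then
    mv /tmp/.buildx-cache /tmp/.buildx-cache-old
  fi
  mv /tmp/.buildx-cache-new /tmp/.buildx-cache
  rm -rf /tmp/.buildx-cache-old
fi
```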
We should keep an eye on the actual performance impact once/if this is merged. The Docker image artifact is going to be fairly large (probably 1-2GB compressed), so the transfer time between jobs might eat into some of the time savings we're expecting. Would be good to track the actual upload/download times.
If the artifact transfer ends up taking 2-3+ minutes, we might want to consider alternatives like pushing to GHCR instead of using GitHub artifacts. Open to any feedback or changes.
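For reference, the GHCR alternative would look roughly like this; OWNER and the tag are hypothetical placeholders, not names from this PR:

```bash
# Push the runner image to GitHub Container Registry instead of shipping it
# as a workflow artifact.
echo "$GITHUB_TOKEN" | docker login ghcr.io -u "$GITHUB_ACTOR" --password-stdin
docker tag hudi-demo-runner:ci ghcr.io/OWNER/hudi-demo-runner:pr-cache
docker push ghcr.io/OWNER/hudi-demo-runner:pr-cache

# Downstream jobs then pull instead of downloading an artifact.
docker pull ghcr.io/OWNER/hudi-demo-runner:pr-cache
```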
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@           Coverage Diff           @@
##             main     #487   +/-  ##
=======================================
  Coverage   85.65%   85.65%
=======================================
  Files          65       65
  Lines        3897     3897
=======================================
  Hits         3338     3338
  Misses        559      559
```

☔ View full report in Codecov by Sentry.
This commit addresses critical failures in PR apache#487's CI workflows:

1. Fixed UV PATH issue in prepare-integration-build
   - Added explicit PATH export in the docker compose exec command
   - The uv binary is installed to /root/.local/bin/ but wasn't on PATH
   - Prevented the "uv is not installed" error during binding compilation

2. Fixed buildx cache rotation on cold start
   - Added an existence check before moving the cache directory
   - Prevented the "cannot stat '/tmp/.buildx-cache'" error
   - Made cache rotation atomic to avoid data loss

3. Fixed labeler permissions issue
   - Created a separate pr-labeler.yml workflow with pull_request_target
   - Disabled the labeler in pr.yml temporarily
   - Allows labeling to work on fork PRs while maintaining security

All changes maintain the PR's core optimization goals:
- Conditional test execution based on labels
- Docker layer caching for faster builds
- Binding pre-compilation to eliminate redundant builds
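A sketch of the PATH fix from item 1, assuming a compose service named runner; the service name and the command being run are illustrative, not the PR's exact invocation:

```bash
# uv's installer puts the binary in /root/.local/bin, which is not on the
# default PATH inside the container; export it explicitly before calling uv.
docker compose exec -T runner bash -c \
  'export PATH="/root/.local/bin:$PATH" && uv --version && make setup develop'
```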
Fixed the CI failures - three issues were blocking the workflow (see the commit message above).
Testing the labeler workflow on this PR to make sure it works correctly; this might take a few tries xd
- Use explicit .cargo/registry and .cargo/git paths in the artifact upload
- Update pr-labeler.yml to actions/checkout@v6 for consistency
- Remove unused demo_network from compose.yaml
Good morning @xushiyan. Did a bit of cleanup this morning: build caching 'should' now work and the conditional skip applies - GH Actions do behave strangely sometimes. Thank you for your time.

Note on the PR labeler: the pr-labeler.yml workflow uses pull_request_target for write permissions, but since this file doesn't exist on main yet, GitHub won't execute it. Auto-labeling will only work after this PR is merged, or after a small, separate labels-only commit lands on main.
Thanks. We don't need to make the labeler work right now as there are some policy restrictions. We can just focus on reducing the CI time in this PR. Is there a way you can try running the CI in your own fork?
- Fix uv PATH: install to /usr/local/bin instead of /root/.local/bin (hopefully this sticks this time) so it works when the container runs as HOST_UID:HOST_GID
- Remove the pr-labeler.yml workflow (policy restriction)
- Remove the check-integration-needed job and label-based skip logic
- Revert the labeler.yml additions (demo, chore labels)
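A sketch of what moving uv to /usr/local/bin can look like during the image build; the install URL is uv's standard installer, but the exact steps here are an assumption, not this PR's diff:

```bash
# Install uv, then move it from the root user's ~/.local/bin to a directory
# that is on PATH for every UID the container might run as.
curl -LsSf https://astral.sh/uv/install.sh | sh
mv /root/.local/bin/uv /usr/local/bin/uv
uv --version  # sanity check: resolvable regardless of the runtime user
```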
Unfortunately, even in 2025, trial & error is the fastest way to work with GitHub Actions.
uv now seems to work as a non-root user locally; also reverted the pr-labeler changes.
By the way, thanks for your patience @xushiyan, I appreciate it. Logs are crucial for this change.
The container runs as HOST_UID:HOST_GID, but Python and the venv were installed by root during the image build. This caused "Permission denied" when accessing /opt/.venv/bin/python3.

Changes:
- Set UV_PYTHON_INSTALL_DIR=/opt/python to install Python in a shared location
- Add chmod to make Python and the venv readable by any user
force-pushed from 0d84e83 to 43db981
The last 2 commits should help if the problem persists. Root cause: the Docker container runs as HOST_UID:HOST_GID, but Python and the venv were installed by root during the image build, so non-root users hit "Permission denied". The changes are in the commit message above.
- Build the minio client (mc) image even with the --use-cached-build flag, as only the runner image is saved as an artifact
- Add .cargo/ and .uv-cache/ to .gitignore (created by Docker tests)
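A sketch of the branch this implies in ci_run.sh; the flag name is from the commit message, while the compose service names and artifact file name are assumptions:

```bash
# When reusing the cached runner image, still build the mc (minio client)
# image, since only the runner image travels through the artifact.
if [ "$USE_CACHED_BUILD" = "true" ]; then
  docker load < runner-image.tar.gz   # runner image from the workflow artifact
  docker compose build mc             # mc is not in the artifact; build it here
else
  docker compose build                # cold path: build everything
fi
```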
Some progress after the last push; I also added 2 cache folders to .gitignore that seemed to be missing. We 'should' be close now.
Hey @AndreaBozzo, thanks for the changes, but I'm not sure about the gain: it seems that the main branch (https://github.com/apache/hudi-rs/commits/main/), without the prepare-integration-build step, took a shorter time.
Hi @xushiyan, thanks for the feedback! You're right to question the gains. We hit several issues (permission errors, a missing mc image) that required multiple fix iterations, so the CI runs you saw were failing or suboptimal.

Intended gains (once working): build the Python bindings once instead of three times, plus Docker layer caching on repeat runs.

Current overhead concerns: the artifact upload/download of the runner image may eat into those savings.

Options: keep iterating on this approach, or simplify/revert the artifact hand-off if the overhead outweighs the benefit.

I'm perfectly fine with either; the change was meant to be beneficial, not detrimental, of course.
xushiyan left a comment:
Thanks @AndreaBozzo. I have some questions and suggestions.
Review comment on these Dockerfile lines:

```dockerfile
# Make Python installation and venv accessible to any user (needed for HOST_UID:HOST_GID mapping)
# Python needs read+execute, venv needs read+write+execute for pip install
RUN chmod -R a+rX /opt/python && chmod -R a+rwX /opt/.venv
```
Can you pre-build this Docker image and publish it on Docker Hub so it can be pulled directly during CI? Once everything passes, I can re-publish the same image under Hudi's official Docker Hub.
I'm going to try to do that before Christmas; that's actually a neat solution.
sounds good
- Remove the prepare-integration-build job
- Pull andreab92/hudi-demo-runner:latest in integration-tests
- Revert ci_run.sh to the original version
- Restore the networks section in compose.yaml
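With the prepared image on Docker Hub, the integration-tests job reduces to roughly this; the local tag and script path are assumptions:

```bash
# Pull the pre-built runner image instead of building it in CI; only the
# pull replaces the old build step, everything downstream stays the same.
docker pull andreab92/hudi-demo-runner:latest
docker tag andreab92/hudi-demo-runner:latest hudi-demo-runner:ci  # hypothetical local tag
./demo/ci_run.sh
```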
Hi @xushiyan, sorry for the delay, the holidays got in the way; I did my best to get this done before Christmas but couldn't make it.

Changes in this commit

Simplified the CI approach as you suggested: the runner image is now pulled from Docker Hub instead of being built in CI.
After merge

You can republish the image under Hudi's official Docker Hub:

```bash
docker pull andreab92/hudi-demo-runner:latest
docker tag andreab92/hudi-demo-runner:latest apache/hudi-demo-runner:latest
docker push apache/hudi-demo-runner:latest
```

Then update CI to use apache/hudi-demo-runner:latest. Let me know if anything needs adjustment!
@AndreaBozzo I'll investigate later; it's easier for me to trigger CI and see how it works. Meanwhile, can you rebase and resolve the conflict?
Updated the branch. Don't worry, I'm in no rush, and thanks again!

Reduces integration test CI time by using a pre-built Docker image from Docker Hub.
Approach
Per the maintainer's suggestion, the runner image is now pre-built and published to Docker Hub instead of being built in CI.
Changes
- Remove the prepare-integration-build job and pull andreab92/hudi-demo-runner:latest in integration-tests
- Revert ci_run.sh to the original version and restore the networks section in compose.yaml
After merge
The maintainer can republish the image under the official Hudi Docker Hub (see the docker tag/push commands above).
Fixes #270