Conversation

@AndreaBozzo AndreaBozzo commented Nov 27, 2025

Reduces integration test CI time by using a pre-built Docker image from Docker Hub.

Approach

Per maintainer suggestion, the runner image is now pre-built and published to Docker Hub instead of building it in CI:

Changes

  • Remove prepare-integration-build job
  • Pull pre-built image directly from Docker Hub
  • Use GitHub Actions cache for Rust/Python build artifacts
  • Revert ci_run.sh to original simple version
  • Restore networks section in compose.yaml

After merge

The maintainer can republish the image under the official Hudi Docker Hub.

Fixes #270

…ache#270)

Implement Phase 1 of CI optimization to skip integration tests on docs/chore PRs.

Changes:
- Enable automatic PR labeling by removing 'if: false' from pr.yml
- Add 'demo' and 'chore' labels to labeler.yml for better categorization
- Add 'check-integration-needed' job to detect if integration tests should run
- Update 'integration-tests' job to depend on check and skip for docs/chore-only PRs

Expected savings: 15-20 minutes on ~30% of PRs (docs/chore only)

Safety: Fail-safe design - if labeler fails or labels unclear, tests still run

Related to apache#270
Implement Phase 2 of CI optimization to build Python bindings once and reuse across all integration test apps.

Changes:
- Add 'prepare-integration-build' job that builds runner image with pre-compiled bindings
- Save built image as artifact (~500MB-1GB, 1-day retention)
- Update 'integration-tests' job to download and load cached runner image
- Modify ci_run.sh to support --use-cached-image flag
- Skip 'make setup develop' in datafusion/rust/python apps when using cached image
- Remove Dockerfile hash from cache key (no longer rebuilding image per matrix item)

Expected savings: 6-9 minutes on all code PRs (eliminates 3 redundant 3-4 min builds)

Trade-offs:
- Adds ~1 minute for artifact upload/download
- Net savings: 5-8 minutes per workflow run

Related to apache#270
Implement Phase 3 of CI optimization to cache Docker build layers.

Changes:
- Add Docker Buildx layer cache to prepare-integration-build job
- Use docker buildx build with cache-from/cache-to for runner image
- Rotate cache directory to prevent unbounded growth
- Cache key based on Dockerfile hash for proper invalidation
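The cache wiring described above could be sketched roughly as follows; the image tag, cache paths, and hash input are illustrative assumptions, not the PR's exact values, and the buildx command is only printed so the sketch stays side-effect-free:

```shell
#!/bin/sh
# Sketch only: tag, cache paths, and key derivation are assumed values.
CACHE_DIR=/tmp/.buildx-cache
CACHE_DIR_NEW=/tmp/.buildx-cache-new

# Key the cache on the Dockerfile contents so edits invalidate it.
DOCKERFILE_HASH=$(sha256sum Dockerfile 2>/dev/null | cut -c1-12)
CACHE_KEY="buildx-${DOCKERFILE_HASH:-nohash}"

# Print the buildx command instead of running it (no Docker daemon needed here).
BUILD_CMD="docker buildx build --cache-from type=local,src=$CACHE_DIR --cache-to type=local,dest=$CACHE_DIR_NEW,mode=max -t hudi-demo-runner:ci ."
echo "$CACHE_KEY"
echo "$BUILD_CMD"
```

Writing `--cache-to` into a separate directory is what makes the rotation step below possible without ever corrupting the live cache.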

Expected savings: 30-60 seconds on subsequent runs with cache hit

This completes the three-phase CI optimization for issue apache#270:
- Phase 1: Label-based conditional execution (15-20 min savings on docs PRs)
- Phase 2: Eliminate redundant builds (6-9 min savings on code PRs)
- Phase 3: Docker layer caching (30-60 sec savings on repeat runs)

Total expected savings: ~11 min per PR on average (40% reduction)

Related to apache#270
- Fix label matching logic using jq instead of grep anchors
- Add atomic cache rotation to prevent data loss
- Add error handling for binding build failures
- Improve logging and validation in ci_run.sh
- Update cache key to include Rust toolchain version
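The jq-based label matching mentioned above could look roughly like this; the label names and the skip condition are assumptions based on the description, and the fail-safe default is that any missing tool or error falls through to running the tests:

```shell
#!/bin/sh
# Hypothetical sketch: skip integration tests only when every label is docs/chore.
# Label names are assumed; any error falls through to running the tests (fail-safe).
LABELS_JSON='["documentation", "chore"]'   # in CI this would come from the GitHub API

RESULT=run
if command -v jq >/dev/null 2>&1; then
  if echo "$LABELS_JSON" | jq -e 'length > 0 and all(.[]; . == "documentation" or . == "chore")' >/dev/null 2>&1; then
    RESULT=skip
  fi
fi
echo "integration tests: $RESULT"
```

Unlike grep anchors, `all(.[]; ...)` matches whole JSON strings, so a label like "chores-misc" cannot accidentally satisfy the check.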

AndreaBozzo commented Nov 27, 2025

We should keep an eye on the actual performance impact once/if this is merged.

The Docker image artifact is going to be fairly large (probably 1-2GB compressed), so the transfer time between jobs might eat into some of the time savings we're expecting.

Would be good to track:

  • Actual artifact size and upload/download times
  • Real-world time savings per PR
  • Cache hit rates in practice

If the artifact transfer ends up taking 2-3+ minutes, we might want to consider alternatives like pushing to GHCR instead of using GitHub artifacts.

Open to any feedback or changes.


codecov bot commented Nov 28, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.65%. Comparing base (065fd31) to head (1418911).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #487   +/-   ##
=======================================
  Coverage   85.65%   85.65%           
=======================================
  Files          65       65           
  Lines        3897     3897           
=======================================
  Hits         3338     3338           
  Misses        559      559           

☔ View full report in Codecov by Sentry.

This commit addresses critical failures in PR apache#487's CI workflows:

1. Fixed UV PATH issue in prepare-integration-build
   - Added explicit PATH export in docker compose exec command
   - UV binary is installed to /root/.local/bin/ but wasn't in PATH
   - Prevented "uv is not installed" error during binding compilation

2. Fixed buildx cache rotation on cold start
   - Added existence check before moving cache directory
   - Prevented "cannot stat '/tmp/.buildx-cache'" error
   - Made cache rotation atomic to avoid data loss

3. Fixed labeler permissions issue
   - Created separate pr-labeler.yml workflow with pull_request_target
   - Disabled labeler in pr.yml temporarily
   - Allows labeling to work on fork PRs while maintaining security

All changes maintain the PR's core optimization goals:
- Conditional test execution based on labels
- Docker layer caching for faster builds
- Binding pre-compilation to eliminate redundant builds
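A minimal sketch of the first two fixes, assuming uv's default install location and the /tmp cache paths mentioned above; the guard around the `mv` is what makes the cold start safe:

```shell
#!/bin/sh
# 1) uv installs to /root/.local/bin by default, which a plain
#    `docker compose exec` PATH may not include; export it explicitly.
export PATH="/root/.local/bin:$PATH"

# 2) Rotate the buildx cache only when a fresh cache was actually produced,
#    so the first (cold) run does not fail on a missing directory.
CACHE_DIR="/tmp/.buildx-cache"
CACHE_DIR_NEW="/tmp/.buildx-cache-new"

if [ -d "$CACHE_DIR_NEW" ]; then
  rm -rf "$CACHE_DIR"
  mv "$CACHE_DIR_NEW" "$CACHE_DIR"
fi
echo "cache rotation done"
```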
@AndreaBozzo
Author

Fixed the CI failures - three issues were blocking the workflow:

  1. UV PATH not found - docker compose exec wasn't inheriting the container's PATH, so the uv binary wasn't accessible. Added explicit PATH export.

  2. Cache rotation failing on first run - The buildx cache didn't exist yet, causing mv to fail. Added a check before moving.

  3. Labeler permissions - Fork PRs can't use pull_request trigger with write permissions. Created a separate pr-labeler.yml workflow using pull_request_target (which is safe because it only checks out the base branch, not the PR code).

Testing the labeler workflow on this PR to make sure it works correctly; this might take a few tries xd

- Use explicit .cargo/registry and .cargo/git paths in artifact upload
- Update pr-labeler.yml to actions/checkout@v6 for consistency
- Remove unused demo_network from compose.yaml

AndreaBozzo commented Dec 7, 2025

Good morning @xushiyan

Did a bit of cleanup this morning; build caching 'should' now work (hedging because GH Actions can behave strangely at times). Thank you for your time.

Note on PR Labeler: The pr-labeler.yml workflow uses pull_request_target for write permissions, but since this file doesn't exist in main yet, GitHub won't execute it. Auto-labeling will only work after this PR is merged, or after a small, separate labels-only commit lands on main.


xushiyan commented Dec 7, 2025

> Good morning @xushiyan
>
> Did a bit of cleanup this morning; build caching 'should' now work (hedging because GH Actions can behave strangely at times). Thank you for your time.
>
> Note on PR Labeler: The pr-labeler.yml workflow uses pull_request_target for write permissions, but since this file doesn't exist in main yet, GitHub won't execute it. Auto-labeling will only work after this PR is merged, or after a small, separate labels-only commit lands on main.

Thanks. We don't need to make the labeler work right now, as there are some policy restrictions. We can just focus on reducing the CI time in this PR. Is there a way you can try running the CI in your own fork?

- Fix uv PATH: install to /usr/local/bin instead of /root/.local/bin (hopefully this sticks this time) so it works when container runs as HOST_UID:HOST_GID
- Remove pr-labeler.yml workflow (policy restriction)
- Remove check-integration-needed job and label-based skip logic
- Revert labeler.yml additions (demo, chore labels)
@AndreaBozzo
Author

>> Good morning @xushiyan
>> Did a bit of cleanup this morning; build caching 'should' now work (hedging because GH Actions can behave strangely at times). Thank you for your time.
>> Note on PR Labeler: The pr-labeler.yml workflow uses pull_request_target for write permissions, but since this file doesn't exist in main yet, GitHub won't execute it. Auto-labeling will only work after this PR is merged, or after a small, separate labels-only commit lands on main.
>
> Thanks. We don't need to make the labeler work right now, as there are some policy restrictions. We can just focus on reducing the CI time in this PR. Is there a way you can try running the CI in your own fork?

Unfortunately, even in 2025, trial & error is the fastest way to work with GitHub Actions.

[screenshot: uvhudi]

uv now seems to be working as a non-root user locally; I also reverted the pr-labeler changes.

@AndreaBozzo
Author

By the way, thanks for your patience @xushiyan, I appreciate it; logs are crucial for this change.

The container runs as HOST_UID:HOST_GID but Python and venv were
installed by root during image build. This caused "Permission denied"
when accessing /opt/.venv/bin/python3.

Changes:
- Set UV_PYTHON_INSTALL_DIR=/opt/python to install Python in shared location
- Add chmod to make Python and venv readable by any user
@AndreaBozzo AndreaBozzo force-pushed the optimize-ci-integration-tests-270 branch from 0d84e83 to 43db981 on December 21, 2025 17:12
@AndreaBozzo
Author

Last 2 commits: this should help if the problem persists.

Root cause: The Docker container runs as HOST_UID:HOST_GID (non-root), but Python and venv were installed by root during image build.

Changes:

  1. UV_PYTHON_INSTALL_DIR=/opt/python - Install Python in shared location instead of /root/.local/share/
  2. chmod -R a+rX /opt/python - Make Python readable by any user
  3. chmod -R a+rwX /opt/.venv - Make venv writable (needed for pip install to create new directories)
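The three steps above can be sketched as follows; to stay runnable without root, the sketch stages the layout under /tmp instead of the real /opt paths used in the image:

```shell
#!/bin/sh
# Sketch of the permission fix; the real image uses /opt/python and /opt/.venv,
# staged here under /tmp so the sketch runs without root.
ROOT=/tmp/img-sketch
export UV_PYTHON_INSTALL_DIR="$ROOT/opt/python"   # shared location, not /root/.local/share

mkdir -p "$ROOT/opt/python" "$ROOT/opt/.venv"

# Python install: readable and traversable by any UID.
chmod -R a+rX "$ROOT/opt/python"
# venv: also writable, since `pip install` creates new directories under it.
chmod -R a+rwX "$ROOT/opt/.venv"
echo "permissions set"
```

The distinction between `a+rX` and `a+rwX` mirrors the list above: the interpreter only needs to be read and executed, while the venv must accept writes from whatever HOST_UID the container runs as.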

- Build minio client (mc) image even with --use-cached-build flag,
  as only runner image is saved as artifact
- Add .cargo/ and .uv-cache/ to gitignore (created by Docker tests)

AndreaBozzo commented Dec 21, 2025

Some progress after the last push; I also added 2 cache folders to .gitignore that seemed to be missing.

The --use-cached-build flag was passing --no-build, which skipped building the mc image (only the runner image was cached as an artifact).
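In ci_run.sh terms the fix amounts to something like the following sketch (service names assumed); the compose commands are printed rather than executed so no Docker daemon is required:

```shell
#!/bin/sh
# Hypothetical sketch of the --use-cached-build handling in ci_run.sh.
ci_up() {
  if [ "$1" = "--use-cached-build" ]; then
    # The runner image was loaded from the artifact, but mc was not:
    # build mc explicitly, then start everything without rebuilding.
    echo "docker compose build mc"
    echo "docker compose up -d --no-build"
  else
    echo "docker compose up -d --build"
  fi
}

CMDS=$(ci_up --use-cached-build)
echo "$CMDS"
```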

we 'should' be close

@xushiyan
Member

Hey @AndreaBozzo, thanks for the changes, but I'm not sure about the gain: it seems that the main branch https://github.com/apache/hudi-rs/commits/main/ without the prepare-integration-build job took a shorter time.

@AndreaBozzo
Author

Hi @xushiyan, thanks for the feedback! You're right to question the gains.

We hit several issues (permission errors, missing mc image) that required multiple fix iterations. The CI runs you saw were failing/suboptimal.

Intended gains (once working):

  • Build Python bindings once → reuse across 4 integration tests (currently each rebuilds from scratch)
  • ~6-9 min savings on code PRs (3 redundant make setup develop eliminated)

Current overhead concerns:

  • Artifact upload/download (~1-2GB runner image)
  • Still need to build mc in each job

Options:

  1. Simplify - Drop the artifact approach, just use GitHub Actions cache for cargo/target dirs
  2. Abandon - Keep main branch approach if complexity > benefit

I'm perfectly fine with either option; the change was meant to be beneficial, not detrimental, of course.


@xushiyan xushiyan left a comment


Thanks @AndreaBozzo. I have some questions and suggestions.


# Make Python installation and venv accessible to any user (needed for HOST_UID:HOST_GID mapping)
# Python needs read+execute, venv needs read+write+execute for pip install
RUN chmod -R a+rX /opt/python && chmod -R a+rwX /opt/.venv
Member

Can you pre-build this Docker image and publish it to Docker Hub so it can be pulled directly during CI? Once everything passes, I can re-publish the same image under Hudi's official Docker Hub.

Author

I'm going to try to do that before Christmas; that's actually a neat solution.

Member

sounds good

- Remove prepare-integration-build job
- Pull andreab92/hudi-demo-runner:latest in integration-tests
- Revert ci_run.sh to original version
- Restore networks section in compose.yaml

AndreaBozzo commented Dec 29, 2025

Hi @xushiyan, sorry for the delay; the holidays got in the way. I did my best to get this done before Christmas but couldn't make it.

Changes in this commit

Simplified the CI approach as you suggested:

  1. Pre-built Docker image published: andreab92/hudi-demo-runner:latest

    • Tags: rust-1.87-py3.13, latest
    • SHA: a4dd53edc8d11499efb6634a7152742bdc60404e1789dcb4c74df8263ea2cd96
  2. Removed prepare-integration-build job

  3. Integration tests now:

    • Pull image directly from Docker Hub (~30s)
    • Use GitHub Actions cache for Rust/Python build artifacts
    • Run in parallel immediately
  4. Reverted compose.yaml network section and ci_run.sh to original versions
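The pull-plus-cache setup could look roughly like this in the workflow file; the cache paths and key are assumptions for illustration, not the PR's exact values:

```yaml
# Hypothetical sketch of the simplified integration-tests steps.
steps:
  - uses: actions/checkout@v6
  - name: Pull pre-built runner image
    run: docker pull andreab92/hudi-demo-runner:latest
  - name: Cache Rust/Python build artifacts   # paths and key are assumed
    uses: actions/cache@v4
    with:
      path: |
        .cargo
        .uv-cache
      key: integration-${{ runner.os }}-${{ hashFiles('**/Cargo.lock') }}
```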

After merge

You can republish the image under Hudi's official Docker Hub:

docker pull andreab92/hudi-demo-runner:latest
docker tag andreab92/hudi-demo-runner:latest apache/hudi-demo-runner:latest
docker push apache/hudi-demo-runner:latest

Then update CI to use apache/hudi-demo-runner:latest.

Let me know if anything needs adjustment!

@AndreaBozzo AndreaBozzo requested a review from xushiyan December 29, 2025 22:23

xushiyan commented Jan 2, 2026

@AndreaBozzo I'll investigate later; it's easier for me to trigger the CI and see how it works. Meanwhile, can you rebase and resolve the conflict?

@AndreaBozzo
Author

> @AndreaBozzo I'll investigate later; it's easier for me to trigger the CI and see how it works. Meanwhile, can you rebase and resolve the conflict?

Updated the branch. Don't worry, I'm in no rush, and thanks again!


Successfully merging this pull request may close these issues.

Improve integration test CI time
