Skip to content

feature: add Dockerfile.maximal to optimize CI (#1252)#1339

Closed
tirthpatel90 wants to merge 16 commits intooraios:mainfrom
tirthpatel90:feature/optimize-docker-ci
Closed

feature: add Dockerfile.maximal to optimize CI (#1252)#1339
tirthpatel90 wants to merge 16 commits intooraios:mainfrom
tirthpatel90:feature/optimize-docker-ci

Conversation

@tirthpatel90
Copy link
Copy Markdown

This PR introduces Dockerfile.maximal to resolve #1252. It successfully ports the heaviest dependencies (R, Julia, Rust, Go, C++, Node, Ruby, etc.) into a native Docker image, bypassing OS security restrictions and memory limits.
Note: For stability and to avoid GitHub Actions OOM limits, extremely niche toolchains (like Swift, Haskell, Lean4) are deferred to a Phase 2 update. This Phase 1 image immediately resolves the primary CI compilation bottlenecks.

@MischaPanch
Copy link
Copy Markdown
Contributor

Hey @tirthpatel90 , thanks for the PR. This just adds a new dockerfile, I don't see how this is accelerating CI. The dockerfile won't be built into an image or used anywhere in CI, or am I missing something?

@tirthpatel90
Copy link
Copy Markdown
Author

Hi @MischaPanch, you are absolutely right!

My main focus initially was to successfully build this massive Dockerfile.maximal without hitting GitHub's 6-hour timeout or OOM limits. (Note: I intentionally included the heaviest bottlenecks like R, Julia, Rust, Node, Go, C++, etc., but deferred a few remaining ones like Swift, Haskell, and OCaml to ensure the build remains stable within runner limits for now).

Now that the core image builds perfectly, I'm ready to wire it up to the CI!

To actually accelerate the CI, what is your preferred strategy? Should we add a workflow to publish this image to the GitHub Container Registry (ghcr.io), and then modify the main test workflows to pull and run inside it?

Let me know how you'd like to handle the image hosting, and I'll push the necessary CI workflow changes to this PR right away!

@MischaPanch
Copy link
Copy Markdown
Contributor

Yes, let's build it, push it to the GH container registry, adjust the pytest workflow to use it and see how much it accelerates. Would it be possible for you to test this out in your fork and link the actions here? I suppose you can't write to the right place in the container registry from the actions triggered in a PR, right?

@MischaPanch
Copy link
Copy Markdown
Contributor

I would prefer to review the dockerfile after things are running and acceleration is visible :). That's why I'm asking

@MischaPanch
Copy link
Copy Markdown
Contributor

Btw, setting up OCaml eats up a lot of time in CI, if it's possible to include it here, that would be great. If not, we might just disable it at some point

@tirthpatel90
Copy link
Copy Markdown
Author

Sounds like a plan, @MischaPanch!

I'll add OCaml to the maximal image, set up a workflow on my fork to build and publish it to my personal GHCR, and then adjust the pytest.yml to run inside that container.

I'll ping you here with the action run links showing the acceleration metrics once it's running smoothly on my end. Working on it now!

Comment thread .github/workflows/test-maximal.yml Fixed
@tirthpatel90
Copy link
Copy Markdown
Author

Hi @MischaPanch,

I have successfully published the maximal image to GHCR and wired it up to a test workflow on my fork.

The good news: The environment works perfectly and immediately breezes through ~65% of the test suite (including C++, Go, Java, Rust, etc.) without any setup overhead!

However, the test execution hangs/slows down significantly after the 65% mark. Based on the logs, I suspect two reasons for this:

  1. Missing Toolchains: The test suite might be automatically downloading and building the deferred niche toolchains (Haskell, Swift, Lean 4, etc.) on the fly because they aren't baked into this Phase 1 image.
  2. Julia JIT Precompilation: Since we disabled Julia package precompilation in the Docker build to prevent OOM crashes, it might be running that heavy compilation during the test itself.

Does the test suite automatically attempt to install missing language servers during execution? If so, is there an environment variable or a pytest flag I should pass to strictly skip tests for toolchains not present in the image? This would allow us to accurately benchmark the speed of the pre-installed languages!

@tirthpatel90
Copy link
Copy Markdown
Author

Closing this PR in favor of the parallelized CI architecture discussed in #1362.

The monolithic maximal image successfully proved that we can drastically cut down test times by pre-baking dependencies. However, moving forward, we will pivot to segmenting the tests into dynamic matrix batches and running them in parallel using a leaner base image. This will be more scalable and maintainable.

Thanks for the feedback, everyone! I'll be opening a new PR for the parallel matrix workflows soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Accelerate CI, extend docker build

3 participants