Add testing DSS with Canonical Kubernetes (New) #1793

Draft: wants to merge 13 commits into main

motjuste
Contributor

@motjuste motjuste commented Mar 14, 2025

Description

dss now has provisional support for running on Canonical Kubernetes instead of microk8s. This support is currently available in the dss snap from the channel 1.1/edge. To enable testing against it, this PR makes the following changes:

  • The install-deps script can now install Canonical Kubernetes instead of microk8s, depending on the arguments it is given. It still installs microk8s by default.
  • install-deps now also installs the helm snap, which is used to enable NVIDIA GPU support in Canonical Kubernetes as well as in microk8s.
  • No change is needed to support Intel GPUs, but this PR depends on "Simplify enabling Intel GPU for DSS (New)" (#1789) for the latest, simplified method of enabling Intel GPUs.
    • Please note that DSS currently has issue 209, which causes notebook creation with Intel GPU support to fail on Canonical Kubernetes.
  • A job has been added that uses helm to enable NVIDIA GPU support in Kubernetes, roughly following this guide. It works for both microk8s and Canonical Kubernetes.
  • The names (strictly, the ids) of the jobs have been updated.
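
The install-deps behaviour described in the list above could be sketched roughly as follows. This is only an illustration, not the actual script: the function name `plan_snap_installs`, its parameters, and the default microk8s channel are assumptions made for the example; only the snap names (`k8s`, `microk8s`, `helm`) are real.

```python
from typing import List, Optional


def plan_snap_installs(
    k8s_channel: Optional[str] = None,
    microk8s_channel: str = "1.28/stable",  # hypothetical default channel
) -> List[List[str]]:
    """Return the `snap install` commands for the selected Kubernetes flavour.

    Canonical Kubernetes (the `k8s` snap) is chosen only when a channel is
    given explicitly; otherwise microk8s is installed, as before.
    """
    if k8s_channel:
        cluster = ["snap", "install", "k8s", "--classic",
                   f"--channel={k8s_channel}"]
    else:
        cluster = ["snap", "install", "microk8s", "--classic",
                   f"--channel={microk8s_channel}"]
    # helm is installed in both cases, since it is used to enable
    # NVIDIA GPU support on either cluster.
    helm = ["snap", "install", "helm", "--classic"]
    return [cluster, helm]


if __name__ == "__main__":
    # The channel below is only an example value.
    for command in plan_snap_installs(k8s_channel="1.32-classic/stable"):
        print(" ".join(command))
```

The sketch only builds the commands, so the selection logic can be checked without touching the machine; actually running them would need root privileges.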

The provider-snap's minor version has been bumped to mark the version in which support for Canonical Kubernetes was added.

TODOs

  • Create gpu_setup.py that combines setting up Intel and NVIDIA GPUs in Kubernetes.
  • Update the GH workflow and job-spec.yaml for testing DSS on Testflinger:
    • Build the checkbox-dss snap separately and pass it to the Testflinger jobs, so that each job does not spend time repeating the build.
    • Add Canonical Kubernetes to the matrix of jobs.

Resolved issues

Documentation

Updated the README for this provider. No changes to main Checkbox documentation.

Tests

No additions to the unit tests. I have checked this manually, but I will wait for #1789 to be merged before updating this PR and running the full automated tests.

motjuste added 11 commits March 14, 2025 15:33
We will need helm when installing Canonical k8s to enable the NVIDIA GPU
operator in it.

Canonical k8s (and helm) will only be installed if the explicit argument
for the channel to use is provided.  Otherwise, the old default
behaviour of installing microk8s is maintained.

We use helm to add the relevant chart from NVIDIA and install it.
We also re-use the existing script to verify the rollout.

This job needs to run after either one of the two jobs above it that enable
NVIDIA GPU support in the k8s cluster succeeds (one is for microk8s, the
other is for Canonical k8s).  We can't list those jobs in `depends` because
then both of them would have to succeed, which is impossible because only one
of microk8s or Canonical k8s will be available.

The trick we use here is that the job now `depends` on `dss/initialize`,
which must succeed for the whole test plan to run anyway, and
requires that an NVIDIA GPU is present.  This is similar to the `depends`
of the two jobs for microk8s and Canonical k8s.  We then have to
be careful that this job is placed in the test plan so that it ONLY runs
after those two jobs.  The difference is that this job will not be
skipped if either of the two jobs enabling NVIDIA GPU support fails.
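
The reasoning above can be illustrated with a small model of `depends`, assuming (as the commit message states) that a job runs only when every job it depends on has succeeded. The helper `can_run` and all job ids other than `dss/initialize` are hypothetical and exist only for this sketch.

```python
from typing import Dict, Iterable


def can_run(depends: Iterable[str], outcomes: Dict[str, str]) -> bool:
    """Model of `depends`: a job runs only if every listed job passed."""
    return all(outcomes.get(job_id) == "pass" for job_id in depends)


# Outcomes on a machine that only has microk8s installed.  The two
# "enable_nvidia" ids are made up for this example.
outcomes = {
    "dss/initialize": "pass",
    "microk8s/enable_nvidia": "pass",
    "k8s/enable_nvidia": "fail",  # Canonical k8s is not available here
}

# Depending on both GPU jobs can never be satisfied on a single machine.
both = can_run(["microk8s/enable_nvidia", "k8s/enable_nvidia"], outcomes)

# Depending only on dss/initialize lets the job run regardless.
init_only = can_run(["dss/initialize"], outcomes)

print(both, init_only)  # prints: False True
```

The trade-off is the one noted in the commit message: with only `dss/initialize` in `depends`, the ordering in the test plan is what keeps this job after the GPU-enabling jobs, and it will still run if they fail.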
We now have an addition to the `install-deps` script. It also marks
the point from which we started supporting Canonical K8s.

The "worker" daemonset that was being verified may have a version number
in its name, which we cannot predict.
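
One way to cope with an unpredictable daemonset name is to match it by pattern instead of by exact name. The sketch below assumes the names come from `kubectl get daemonsets -o name`, which prints one `daemonset.apps/<name>` entry per line; the function name, the sample names, and the glob pattern are hypothetical and are not taken from the provider's verification script.

```python
from fnmatch import fnmatchcase
from typing import List


def matching_daemonsets(kubectl_output: str, pattern: str) -> List[str]:
    """Return daemonset names matching a glob `pattern`.

    `kubectl_output` is expected to be the text printed by
    `kubectl get daemonsets -o name`.
    """
    names = []
    for line in kubectl_output.splitlines():
        # Drop the "daemonset.apps/" resource prefix if it is present.
        name = line.strip().rpartition("/")[2]
        if name and fnmatchcase(name, pattern):
            names.append(name)
    return names


if __name__ == "__main__":
    # Made-up sample output, for illustration only.
    sample = (
        "daemonset.apps/gpu-operator-node-feature-discovery-worker-v2\n"
        "daemonset.apps/nvidia-device-plugin-daemonset\n"
    )
    print(matching_daemonsets(sample, "*node-feature-discovery-worker*"))
```

Each matched name could then be passed to `kubectl rollout status daemonset/<name>` to wait for that daemonset's rollout.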

codecov bot commented Mar 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.83%. Comparing base (5baa4d0) to head (612a608).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1793   +/-   ##
=======================================
  Coverage   49.83%   49.83%           
=======================================
  Files         377      377           
  Lines       40719    40719           
  Branches     6851     6851           
=======================================
  Hits        20294    20294           
  Misses      19700    19700           
  Partials      725      725           
Flag Coverage Δ
provider-dss 100.00% <ø> (ø)

The NVIDIA GPU operator can be enabled in both microk8s and
Canonical k8s using helm, so we remove the code that
tried to handle whether microk8s or Canonical k8s was installed
and just use the unified helm approach.

Helm now becomes a hard requirement.
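
As a rough sketch of what the unified helm approach might run, the commands below follow NVIDIA's published GPU Operator Helm instructions. Whether the job in this PR uses exactly these flags is an assumption, as is the `gpu-operator` namespace; the function name is hypothetical. Pointing helm at the right kubeconfig for microk8s or Canonical k8s is not covered here.

```python
import shlex
from typing import List

# Helm repository that hosts the NVIDIA GPU Operator chart.
NVIDIA_REPO_URL = "https://helm.ngc.nvidia.com/nvidia"


def gpu_operator_commands(namespace: str = "gpu-operator") -> List[List[str]]:
    """Build the helm commands that add the NVIDIA repo and install the chart."""
    return [
        ["helm", "repo", "add", "nvidia", NVIDIA_REPO_URL],
        ["helm", "repo", "update"],
        [
            "helm", "install", "--wait", "--generate-name",
            "--namespace", namespace, "--create-namespace",
            "nvidia/gpu-operator",
        ],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; a real job would
    # execute each one, for example with subprocess.run(cmd, check=True).
    for cmd in gpu_operator_commands():
        print(shlex.join(cmd))
```

Because the same commands apply to either cluster, no branching on the installed Kubernetes flavour is needed.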
@fernando79513
Collaborator

As discussed in:
