
Conversation

@rhdedgar rhdedgar commented Sep 3, 2025

With this PR, users no longer need to create separate ConfigMaps for CA bundle and run.yaml configuration data.

  • No longer relies on FieldSelectors / Field Indexers for Custom Resources, improving support for older versions of Kubernetes (<1.31) and OpenShift (<4.17)
  • Improved performance on clusters that have many ConfigMaps, especially on older clusters.

Closes #134

Closes #135

Closes RHAIENG-662


rhdedgar commented Sep 4, 2025

After rebasing from main and force-pushing the rebased branch, it now says that I requested a review from everyone.

Is that a bug or a feature?

@derekhiggins

Before taking an in-depth look, I've noticed that this is a breaking change: existing users will need to migrate their configurations from ConfigMaps. Do we want to support both methods for a short deprecation period?


@leseb leseb left a comment


I understand the rationale for this change and the downside of watching the CM but this feels like a bit of a regression. Now users have to edit the CR with the rather tedious PEM content. Some clusters (like OpenShift) automatically generate the CA bundles in a ConfigMap, which makes it much more convenient to just reference them directly.

// ConfigMapName is the name of the ConfigMap containing user configuration
ConfigMapName string `json:"configMapName"`
// ConfigMapNamespace is the namespace of the ConfigMap (defaults to the same namespace as the CR)
// CustomConfig contains arbitrary text data that represents a user-provided run.yamlconfiguration file

Suggested change
// CustomConfig contains arbitrary text data that represents a user-provided run.yamlconfiguration file
// CustomConfig contains arbitrary text data that represents a user-provided run.yaml configuration file


@rhuss rhuss left a comment


@rhdedgar thanks for the PR! I think it goes in the right direction, but the original intention for the configuration override with a custom run.yaml was:

Add a subschema of run.yaml directly in the resource file. It should be reflected in the CRD's schema. It was not meant to include the run.yaml as a single opaque string, but as a more typed approach that helps users create the fields. This is also a means to protect users from fast schema changes in run.yaml, which currently still happen quite a lot. By keeping a sub-schema we can easily add an additional mapping layer for newer versions of the run.yaml schema, so that we keep the operator config stable over a longer time.

For a full overwrite, I would still keep the configMap option, but maybe only locally in the same namespace and picking it up only during startup (no watch needed).

Just moving the full run.yaml into the CR makes it even harder to maintain (updating inlined YAML inside a string is even more error-prone with the formatting, especially when it has been changed by some k8s mangling).

Not sure if this makes sense, I would love to hear some more opinions on this (and especially what subschema from run.yaml should we allow).


rhdedgar commented Sep 9, 2025

I understand the rationale for this change and the downside of watching the CM but this feels like a bit of a regression. Now users have to edit the CR with the rather tedious PEM content. Some clusters (like OpenShift) automatically generate the CA bundles in a ConfigMap, which makes it much more convenient to just reference them directly.

Hi, in this case, the operator is still creating and managing an ephemeral ConfigMap to hold the run.yaml and CA Bundle data, so that ConfigMap can be annotated with service.beta.openshift.io/inject-cabundle: "true" to pull in the cluster certificates and combine the two sources via the init container.
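As a sketch, the operator-managed ConfigMap described above might look like the following (the resource name and data key are illustrative; the annotation is OpenShift's standard service CA injection annotation):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-stack-operator-config   # illustrative name
  annotations:
    # OpenShift's service CA operator injects the cluster CA bundle
    # into this ConfigMap under the key service-ca.crt
    service.beta.openshift.io/inject-cabundle: "true"
data:
  run.yaml: |
    # user-provided run.yaml content, combined with the
    # injected CA bundle by the init container
```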

I intended to keep that logic separate from this PR, and re-introduce it in the midstream repo to keep this upstream repo free of RHOAI-specific or OpenShift-specific features (as briefly discussed with @rhuss).

@rhdedgar thanks for the PR! I think it goes in the right direction, but the original intention for the configuration override with a custom run.yaml was:

Add a subschema of run.yaml directly in the resource file. It should be reflected in the CRD's schema. It was not meant to include the run.yaml as a single opaque string, but as a more typed approach that helps users create the fields. This is also a means to protect users from fast schema changes in run.yaml, which currently still happen quite a lot. By keeping a sub-schema we can easily add an additional mapping layer for newer versions of the run.yaml schema, so that we keep the operator config stable over a longer time.

For a full overwrite, I would still keep the configMap option, but maybe only locally in the same namespace and picking it up only during startup (no watch needed).

Just moving the full run.yaml into the CR makes it even harder to maintain (updating inlined YAML inside a string is even more error-prone with the formatting, especially when it has been changed by some k8s mangling).

Not sure if this makes sense, I would love to hear some more opinions on this (and especially what subschema from run.yaml should we allow).

Hi, there was some discussion in #117 around keeping the option to provide a custom UserConfig, which has an arbitrary schema not defined by the operator. I originally intended for this PR to not overlap with the sub-schema feature outlined in #117, and instead provide a new way of providing the run.yaml data that's currently provided by a ConfigMap. It can be removed if the sub-schema method is the only one that should be supported though.

I'm a little concerned about keeping the existing ConfigMap method, while only removing the watch for the ConfigMap. That would make it so that a user has to manually ensure the existing pod is deleted and re-created to pull in the new run.yaml data. I think it will get confusing if someone sees a config that doesn't match what the container is actually using, so maybe there's another way to avoid that path.
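One common pattern for keeping pods in sync with externally stored config (not part of this PR; the annotation key and function names are hypothetical) is to stamp a hash of the run.yaml content into the pod template's annotations during reconciliation. Any config change then alters the Deployment spec itself, which triggers a normal rolling restart without manual pod deletion. A minimal sketch:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// configHash returns a short, deterministic fingerprint of the run.yaml
// content. Stamping it on the pod template means a config change modifies
// the DeploymentSpec, so the Deployment controller rolls out new pods.
func configHash(runYAML string) string {
	sum := sha256.Sum256([]byte(runYAML))
	return fmt.Sprintf("%x", sum[:8]) // first 8 bytes -> 16 hex chars
}

func main() {
	// Hypothetical annotation key; a reconciler would set this on
	// deployment.Spec.Template.Annotations before creating/updating it.
	annotations := map[string]string{
		"llamastack.example.com/config-hash": configHash("version: 2\nproviders: []\n"),
	}
	fmt.Println(annotations["llamastack.example.com/config-hash"])
}
```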

@VaishnaviHire

Instead of introducing CRD changes as part of this fix, should this PR include updates to the watches? I agree with @rhdedgar: the run.yaml ConfigMap migration to the CRD should be covered under #117.

It should be reflected in the CRD's schema. It was not meant to include the run.yaml as a single opaque string, but as a more typed approach that helps users create the fields.

@rhuss +1 on this.

Given #117, we will eventually only have to watch the ca-bundle ConfigMap. We should keep the ca-bundle ConfigMap reference so as to support auto-generated/existing cert ConfigMaps.

For this PR, instead of removing the watch completely, can we have opinionated watches?

  • Watch for the run.yaml ConfigMap only in the LlamaStackDistribution CR namespace?
  • For the ca-bundle, since it's common across CR instances, watch for the ca-bundle ConfigMap in the operator namespace?


rhuss commented Sep 10, 2025

Yeah, for the configuration I think we should allow references only to ConfigMaps in the same namespace as the CR. This should fix the performance issue without being too restrictive.

For the caBundle, I'm unsure how it works in a multi-tenant setup, where the operator runs completely outside any tenant and different tenants have different CAs they trust. (but then, I'm not deep in how the ca bundle is currently used, so there might be an easy solution, too)

@VaishnaviHire

For the caBundle, I'm unsure how it works in a multi-tenant setup, where the operator runs completely outside any tenant and different tenants have different CAs they trust. (but then, I'm not deep in how the ca bundle is currently used, so there might be an easy solution, too)

We currently reference the ca-bundle ConfigMap in our CR. For multi-tenancy, I think we can just watch the tenant namespace(s), and require users to have the ca-bundle ConfigMap within those namespaces?


rhuss commented Sep 10, 2025

@rhdedgar I fully agree that we should keep the same semantics about restarting, regardless of how the configuration is stored.

For the external, fully-overridable run.yaml case, what do you think about the idea that the ConfigMaps themselves should be immutable (ConfigMaps can be created with the immutable flag set), and then requiring the user to update the reference field in the LLSD pointing to the ConfigMap? (probably something like customConfig: run-yaml-config-asdff1234, changing the suffix for every update). This is what kustomize does, too.

The benefit of this approach is that the reconciled Deployment resource differs from the previous one (the ConfigMap name changes), which triggers a restart of the deployment. If you only change the ConfigMap but not the DeploymentSpec, usually nothing happens. Of course, any ConfigMap mounted into a Pod reflects the new content, but the process within the Pod's container needs to detect that (via a file watch) and do a hot-reload (without restart). That would be the best solution anyway, but LLS is not capable of doing this.

Using immutable ConfigMaps also has the benefit of providing a history and allowing rollbacks, too.

tl;dr - I probably would favor such a solution:

  • Keep the ConfigMap reference and don't include a full run.yaml in the CR (that really is a bad and fragile user experience, especially when run.yaml can get large). Additionally, you need to perform special escaping, such as $ -> $$, due to the special YAML semantics. It's really not trivial.
  • Encourage people who use an external configMap to use either kustomize to manage their LLSDs or recommend immutable configmaps to trigger restarts.
  • Focus on "Proposal: Add run.yaml directly into LLSD as a sub-schema" (#117) in addition, to allow a smoother and more stable user experience for simpler use cases, like adding additional models or tools.
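The immutable-ConfigMap approach sketched above might look like this (the suffix is the example from the comment; the field name customConfig and other names are illustrative):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: run-yaml-config-asdff1234   # suffix changes on every update
immutable: true                     # field GA since Kubernetes 1.21
data:
  run.yaml: |
    # full run.yaml override
---
# The LlamaStackDistribution would then reference the current revision;
# updating this field changes the DeploymentSpec and triggers a rollout:
# spec:
#   customConfig: run-yaml-config-asdff1234
```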

@rhdedgar

Closing this PR, as I was able to obtain a support exception for the older versions of OpenShift.

@rhdedgar rhdedgar closed this Sep 16, 2025
Successfully merging this pull request may close these issues.

Switch from ConfigMaps to CRs for CA bundles and custom run.yaml data