CORS-2062: Allow Customers to BYO DNS #1400
Conversation
8. When the LoadBalancer Service successfully creates the Ingress LB in the
underlying cloud, it updates the `Hostname` (in AWS) or `IP` (Azure and GCP)
fields within the [LoadBalancerIngress struct](https://github.com/kubernetes/api/blob/cdff1d4efea5d7ddc52c4085f82748c5f3e5cc8e/core/v1/types.go#L4554). These fields represent the hostname
or the IP address of the Ingress LB.
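For context, a minimal sketch (not part of the proposal) of how a controller such as runtimecfg could read those fields off the Ingress LoadBalancer Service; the `openshift-ingress/router-default` name is an assumption about the default Ingress Service:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Assumed name of the Ingress LoadBalancer Service; adjust for your cluster.
	svc, err := client.CoreV1().Services("openshift-ingress").Get(
		context.TODO(), "router-default", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Once the cloud provider provisions the LB, it fills in either Hostname
	// (AWS) or IP (Azure/GCP) on each LoadBalancerIngress entry.
	for _, ing := range svc.Status.LoadBalancer.Ingress {
		fmt.Printf("hostname=%q ip=%q\n", ing.Hostname, ing.IP)
	}
}
```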
Perhaps this should be combined with bullet point 10?
10. runtimecfg monitors changes to the `IP` associated with the Ingress LB. When
the `Hostname` is set, it needs to be resolved to its IP addresses before they
can be added to the Corefile associated with the CoreDNS pod.
This still seems like a pain point. We have talked about this in the past, but to rehash: it seems that CoreDNS will only allow us to create A records with IP addresses.
Is it realistic (how painful) for runtimecfg to resolve the LB DNS name to the IP addresses? Do we know whether those underlying IP addresses can change?
There are 2 ways we can solve this:
- runtimecfg watches for changes to the IP address that the Ingress LB Hostname resolves to. When there is a change, it updates the CoreDNS Corefile with the new IP.
- Add CNAME records in CoreDNS for our DNS entries.

I am not looking to pursue option 2 at this time because for Azure & GCP, the IP address of the LB is available and not its Hostname. By choosing option 1, we can have a common implementation for all 3 platforms.
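A rough sketch of option 1, assuming a hypothetical `updateCorefile` helper and a placeholder LB hostname; the real runtimecfg implementation would differ:

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"strings"
	"time"
)

// updateCorefile is a hypothetical stand-in for regenerating the CoreDNS
// Corefile with the given A-record targets and reloading the pod.
func updateCorefile(ips []string) {
	fmt.Println("would rewrite Corefile with:", strings.Join(ips, ", "))
}

// watchLBHostname periodically re-resolves the Ingress LB hostname and only
// touches the Corefile when the resolved IP set actually changes.
func watchLBHostname(hostname string, interval time.Duration) {
	var last string
	for {
		if addrs, err := net.LookupIP(hostname); err == nil {
			ips := make([]string, 0, len(addrs))
			for _, a := range addrs {
				ips = append(ips, a.String())
			}
			sort.Strings(ips)
			if cur := strings.Join(ips, ","); cur != last {
				updateCorefile(ips)
				last = cur
			}
		}
		time.Sleep(interval)
	}
}

func main() {
	// Placeholder hostname for illustration only.
	watchLBHostname("my-ingress-lb.example.elb.amazonaws.com", 30*time.Second)
}
```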
Hive 👀 here.
IIUC hive will not be impacted or need to add support for this as long as managed DNS is not used for such clusters because the main touchpoint is the install-config. Is that right?
```
<snip>
```
Setting `userConfiguredDNS` to `Enabled` and not specifying `apiLBName` and
You mean `userConfiguredDNSLB.userDNS` rather than `userConfiguredDNS` here, right?
Correct. Fixed it now
/unassign
We will not use the [Kubernetes External DNS provider](https://github.com/openshift/external-dns)
and/or the [OpenShift External DNS Operator](https://github.com/openshift/external-dns-operator)
to program dns entries across varied DNS providers. The External DNS Operator
relies on being able to program the customer's DNS solution. The customer DNS
solutions in consideration here do not allow OpenShift to configure them.
> The customer DNS solutions in consideration here do not allow OpenShift to configure them.

Is this the primary issue with the External DNS Operator, and is this really the case? I have not heard that users don't want OpenShift to program DNS, but rather that our IPI installs are tied to a single platform while users' DNS is frequently hosted on a separate platform.
I don't think we want to intentionally exclude the External DNS Operator. I think we want to include it as a part of the solution as much as possible (but that might not be very much).
I don't think this solution needs to worry about the set of DNS solutions the External DNS Operator can configure. If the DNS solution the customer wants to use is supported by the External DNS Operator and they are OK with OpenShift configuring it, they can still use it.
This solution is useful for customers that want to use a DNS solution that is currently not supported by the External DNS Operator and/or who don't want OpenShift to configure their DNS. @makentenza can add details about the customers we are targeting with this solution.
If we're able to use the External DNS Operator it could provide some substantial advantages. Consider if we moved API load balancers to be provisioned/managed by Services created on bootstrap and ran the External DNS Operator on the bootstrap node. This would allow installer/cluster-provisioned LBs (customers don't need to BYO) and DNS records for the subset of DNS providers supported by the External DNS Operator. We would need to determine some form of support for the DNS providers not included in the External DNS Operator, but that could be either of the approaches we have already discussed (BYO LBs or post-configured DNS).
Abhinav did a write-up of SLB-managed API load balancers, which might provide some more context about the approach or challenges. I wonder if using the External DNS Operator would mitigate some (or most) of the issues he mentions.
I thought the original requirements for the feature said that OpenShift is not given credentials or access to manage the external DNS server at all, based on infrastructure security requirements in the settings where this feature is needed.
Looking at the feature in Jira, it says:
While cloud-based DNS services provide convenient hostname management, there's a number of regulatory and operational constraints customers face prohibiting the use of those DNS hosting services on public cloud providers.
So, yeah, maybe that does knock out the option of using the external DNS operator.
Please refer to https://issues.redhat.com/browse/OCPSTRAT-261 which states:
Out of Scope
The Installer to manage (create/update/delete) the required DNS records to deploy OpenShift on these cloud providers or integrate with those DNS services in any way
Solving the use case where there is no programmatic access to DNS is not optional. So we cannot require the external DNS operator.
I guess that for users who are using the external DNS operator anyway, it would be nice if they didn't also have to manually create their own load balancer. But that's not an explicit goal of this work.
Potentially we could have an API design that would leave room for this if it ever actually comes up - e.g. if we had a DNSProvider field where the options are (some variant of) UserManaged vs. OpenShiftManaged then we could add an ExternalDNSOperatorManaged option if needed in the future. (Please don't use any of these names 🙂)
I agree we can't require programmable DNS.
Suggested change:

We will not use the [Kubernetes External DNS provider](https://github.com/openshift/external-dns)
and/or the [OpenShift External DNS Operator](https://github.com/openshift/external-dns-operator)
to program dns entries across varied DNS providers. Because programmable DNS is precluded in some
environments due to technical limitations or organizational policy, we cannot require programmable DNS.
API-Int. The customer would also be responsible for removing entries from
their custom DNS.

#### Variation [optional]
@patrickdillon @zaneb, @Miciah and @bennerv, I have updated this section to include information about `spec.loadBalancerIP` and the annotations that have been added by AWS, Azure and GCP that we could potentially use towards the design goal of pre-configuring the DNS entry for *.apps before cluster installation.
Could you please take a look and provide comments? If these annotations seem like a better option than running the CoreDNS pod, I will update the main "Proposal" to reflect that.
@sdodson Thanks for the ping. After my first pass at this, I have some concerns from a managed services perspective:
It seems a preferable solution here would be one where the cluster can provision and update cloud load balancer objects, and either run a local DNS pod, configure host files, or use some other method in order to allow the cluster to still internally resolve its required API DNS entries while not relying on external DNS functioning. This would allow for cloud DNS configuration as a post-install operation, and is more in line with what ARO does today. Is this an option we can more fully consider? Happy to discuss further.
Thanks for the feedback @cblecker!
One of the obstacles here is that the installer, in addition to the cluster, needs to resolve DNS for the API. @cblecker, in managed services does the installer have the same DNS view as the cluster? I don't know much about Private Link or how it's used in managed services, but perhaps it solves this problem?
@cblecker thanks for taking a look.
While considering this aspect of the design, I was told that there is a move towards BYO infrastructure so this is not really a concern.
I don't see why the cluster cannot update the LBs. The users are required to create them but the installer would configure them and the same LB reference would be passed into Machine objects.
We don't want to design a solution that won't work for ARO. The platform specific configuration bits will be added one platform at a time but we want a cohesive design across AWS, Azure and GCP. So, could you please point out which specific section of the design is not OK for ARO or is it the design as a whole?
The previous iteration of the design for Custom DNS in #1276 has many aspects of what you are asking for here. That design met with an untimely death and so is not in its final polished state. Feel free to take a look.
@cblecker could you also PTAL at the "Variations" section? It has some new information about annotations and
I like the idea of having the user pre-allocate an IP address rather than a whole load balancer. The platform-specific APIs are annoying, but we can work with them. I'm curious about the particulars of how the user would create or reserve the custom IP address on Azure or GCP.
the install-config. The Installer will make it available to the ingress operator
via the Ingress Controller CRD as in the above case.
Rather than modifying the IngressController API, it might make sense to add a field to the cluster ingress config, or maybe it makes the most sense to have both: `IngressController.spec.endpointPublishingStrategy.loadBalancer.providerParameters.aws.networkLoadBalancer.eipAllocations` and `Ingress.spec.loadBalancer.platform.aws.networkLoadBalancer.eipAllocations`, using the new ingress config field to set a default value for the "default" IngressController's new field.
(My first thought was to add an `IngressIPs` field to `AWSPlatformStatus` à la `BareMetalPlatformStatus`, `OpenStackPlatformStatus`, etc., but it makes sense to put the new config under the ingress config as it really is specifically ingress-related config, and the ingress config already has the `spec.aws` union, which could be extended with NLB-specific config.)
utilize this feature for their Ingress LB pre-creation. Again, the concern here
is the deprecation comment in the documentation.

2. [`service.beta.kubernetes.io/azure-load-balancer-ipv4` and `service.beta.kubernetes.io/azure-load-balancer-ipv6`](https://learn.microsoft.com/en-us/azure/aks/load-balancer-standard#specify-the-load-balancer-ip-address)
I wouldn't rely on the AKS documentation as there could be AKS-only annotations, but the annotation is defined in cloud-provider-azure (which you pointed me to: https://github.com/kubernetes-sigs/cloud-provider-azure/blob/0cfeb043809bdf796233f47dbffb8608d3a716e4/pkg/consts/consts.go#L205), so it should be available for use by OpenShift.
2. [`service.beta.kubernetes.io/azure-load-balancer-ipv4` and `service.beta.kubernetes.io/azure-load-balancer-ipv6`](https://learn.microsoft.com/en-us/azure/aks/load-balancer-standard#specify-the-load-balancer-ip-address)

Setting this annotation means that the Azure LB would not have to be created
by the user before cluster installation. The custom DNS solution can be
configured before cluster installation such that *.apps resolves to these
IPv4 (and IPv6 for dual stack) addresses.
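For illustration only, a sketch of a LoadBalancer Service carrying that annotation; the IP value and object names are placeholders, not the operator's actual output:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "router-default",
			Namespace: "openshift-ingress",
			Annotations: map[string]string{
				// Pre-allocated public IP the user created and already
				// published in their own DNS for *.apps (placeholder value).
				"service.beta.kubernetes.io/azure-load-balancer-ipv4": "20.0.0.10",
			},
		},
		Spec: corev1.ServiceSpec{
			Type: corev1.ServiceTypeLoadBalancer,
			Ports: []corev1.ServicePort{
				{Name: "https", Port: 443},
			},
		},
	}
	out, _ := yaml.Marshal(svc)
	fmt.Println(string(out))
}
```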
The Azure documentation says, "All custom IP addresses must be created and managed by the user." It doesn't say exactly how to create a custom IP address though. Is it as easy as creating an EIP on AWS?
2. [`kubernetes.io/ingress.global-static-ip-name`](https://cloud.google.com/kubernetes-engine/docs/concepts/ingress-xlb#static_ip_addresses_for_https_load_balancers)

This annotation was added to [`Ingress-gce`](https://github.com/kubernetes/ingress-gce).
It seems similar to the 2nd option discussed for AWS and Azure where a
pre-determined static IP is set aside for the external Ingress LB.
I don't see the `kubernetes.io/ingress.global-static-ip-name` annotation defined in https://github.com/openshift/kubernetes/blob/release-4.14/staging/src/k8s.io/legacy-cloud-providers/gce/gce_annotations.go, so I expect it isn't available for use by OpenShift.
Given the different approaches taken by different platforms, it seems that if
we wanted a semantically consistent way to handle the pre-configuration of
DNS for *.apps, we should use the platform-specific annotation that allows a
pre-defined static IP to be specified for the Ingress LB.
Agreed, internally we would use the platform-specific mechanism, and for the customer, it makes sense to me to define a consistent API that could be used on each of these platforms, for example on the IngressController API:

- `spec.endpointPublishingStrategy.loadBalancer.providerParameters.aws.networkLoadBalancer.eipAllocations`, which would get translated into the `service.beta.kubernetes.io/aws-load-balancer-eip-allocations` annotation.
- `spec.endpointPublishingStrategy.loadBalancer.providerParameters.azure.loadBalancerIPAddress`, which would get translated into the `service.beta.kubernetes.io/azure-load-balancer-ipv4` or `service.beta.kubernetes.io/azure-load-balancer-ipv6` annotation.
- `spec.endpointPublishingStrategy.loadBalancer.providerParameters.gcp.loadBalancerIPAddress`, which would get translated into the Service `spec.loadBalancerIP` field.

Or possibly similar on the cluster ingress config API.
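A hedged sketch of the translation being described, using the AWS case; the `setAWSEIPAnnotation` helper and the field plumbing are hypothetical, only the annotation key is the existing cloud-provider one:

```go
package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setAWSEIPAnnotation copies a hypothetical eipAllocations parameter from the
// IngressController API onto the LoadBalancer Service as the annotation the
// AWS cloud provider already understands.
func setAWSEIPAnnotation(svc *corev1.Service, eipAllocations []string) {
	if svc.Annotations == nil {
		svc.Annotations = map[string]string{}
	}
	svc.Annotations["service.beta.kubernetes.io/aws-load-balancer-eip-allocations"] =
		strings.Join(eipAllocations, ",")
}

func main() {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "router-default", Namespace: "openshift-ingress"},
		Spec:       corev1.ServiceSpec{Type: corev1.ServiceTypeLoadBalancer},
	}
	// Placeholder EIP allocation ID for illustration.
	setAWSEIPAnnotation(svc, []string{"eipalloc-0123456789abcdef0"})
	fmt.Println(svc.Annotations)
}
```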
Yes, that is the idea and putting it all together like this makes it very clear. Adding it back to the document.
One thing that I think might help this discussion would be to have a diagram that shows the dependencies that the various pieces of infrastructure have on one another. e.g. (disclaimer: probably inaccurate):

```mermaid
flowchart BT;
awslb[AWS LoadBalancer]
awseip[AWS ElasticIP]
dns[AWS Route53]
awssubnet[AWS Subnet]
awslb-->awseip
awslb-->awssubnet
awseip-->awssubnet
dns-->awslb
API-->dns
style dns stroke:#f55,stroke-width:3px
```

```mermaid
flowchart BT;
awslb[AWS LoadBalancer]
awseip[AWS ElasticIP]
dns[AWS Route53]
awssubnet[AWS Subnet]
awslb-->awseip
awslb-->awssubnet
awseip-->awssubnet
dns-->awslb
awslb-->Service
Service-->Ingress
style dns stroke:#f55,stroke-width:3px
```

I think this would help in discussing the different options where we need to break the graph edges for the user to provide input, given that the DNS often resides in the middle of the graph. The mermaid source for the charts above is shown in the fenced blocks (see https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams#creating-mermaid-diagrams).
## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation History`.
What is the current status of this work?
I'm getting the impression that the API part is implemented but the Ingress part is not, and that's the part we are continuing to discuss?
```go
const (
	// DNSUserAndClusterProvided indicates that the user configures DNS for API and API-Int.
	// The cluster handles some of its in-cluster DNS needs without user intervention.
	DNSUserAndClusterProvided DNSProviderType = "UserAndClusterProvided"
```
I'd be inclined to make the distinction between the needs of the cluster and the needs of an external user in the key name rather than in the value, e.g. `publicDNSProvider: UserManaged`/`publicDNSProvider: OpenShiftDefaultClusterManaged` instead of `provider: UserAndClusterProvided`.
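A purely illustrative sketch of the two shapes being contrasted; type and field names here are placeholders, per the comment's own disclaimer:

```go
package config

// DNSProviderType is the enum type used in the proposal's snippet above.
type DNSProviderType string

// Shape in the proposal: a single value encodes both who manages public DNS
// and what the cluster does for its own in-cluster needs.
type DNSSpecProposed struct {
	Provider DNSProviderType `json:"provider"` // e.g. "UserAndClusterProvided"
}

// Shape suggested in this comment: the key name scopes the setting to public
// DNS, and the value only says who manages it.
type DNSSpecSuggested struct {
	PublicDNSProvider string `json:"publicDNSProvider"` // "UserManaged" or "OpenShiftDefaultClusterManaged"
}
```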
configured with the LB DNS Names of the LBs associated with each of these
services. Hence the LBs have to be created before they can be used to configure
the DNS solution. Also important to note is that these LB DNS Names cannot
be generated before the LBs are created.
IIUC in the case of Ingress the reason here is that the LB name is based on the UID of the Service CR that manages it, at least in cloud-provider-aws.
What's the constraint for the API?
When it is indicated to the OpenShift installer via an input parameter that the
custom DNS solution is in use, OpenShift does not configure the cloud DNS
solution [Route53 for AWS](https://aws.amazon.com/route53/) for any of its
services. In the case of API and API-Int services, the customer is responsible
One of the Goals states:
"Continue using the default cloud LB service for API, API-Int and Ingress."
I read that as meaning that the installer would continue to create the load balancers, so maybe there's a mismatch here. Is the goal out of date or am I reading wrong?
LoadBalancer service to create and configure the LBs for the *.apps service.
OpenShift will start an in-cluster DNS pod to provide DNS resolution for *.apps
during cluster installation. After the cluster is installed, the user will have
to update their custom DNS solution with an entry for the *.apps service.
Further enquiry suggests this is not assisted-specific.
Ingress is typically one of the very last things to come up, because it runs on worker nodes. I'm not sure that anything depends on it, except for the check in the installer to see if the console is available - that definitely does.
Is it possible that the answer for Ingress could be as simple as disabling that check when installing with user-managed DNS?
enabled via install-config. Since API and API-Int resolution are prerequisites
for successful cluster installation, the Installer performs additional
validations to make sure they are configured correctly. The installer provides
a warning but does not stop installation when API or API-Int resolution fails.
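A minimal sketch of the kind of resolution check described here, using a hypothetical cluster domain; the actual installer validation code is not shown in this PR:

```go
package main

import (
	"fmt"
	"net"
)

// warnIfUnresolvable mirrors the behavior described above: a resolution
// failure produces a warning rather than aborting the install.
func warnIfUnresolvable(host string) {
	if _, err := net.LookupHost(host); err != nil {
		fmt.Printf("WARNING: %s does not resolve yet: %v\n", host, err)
	}
}

func main() {
	// Hypothetical cluster domain used only for illustration.
	base := "mycluster.example.com"
	warnIfUnresolvable("api." + base)
	warnIfUnresolvable("api-int." + base)
}
```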
I guess it would be more accurate to say that we can validate anything we want from the bootstrap but since the installer doesn't have a way of checking it, it's only useful for debugging after the fact.
fields within the [LoadBalancerIngress struct](https://github.com/kubernetes/api/blob/cdff1d4efea5d7ddc52c4085f82748c5f3e5cc8e/core/v1/types.go#L4554). These fields represent the hostname
or the IP address of the Ingress LB.

9. OpenShift needs to provide an in-cluster DNS solution for *.apps resolution
It has always been my impression that this is already a built-in part of k8s: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/ - this allows pods within the cluster to obtain the IP address (on the cluster network) for particular services.
IIUC the implementation for it in OpenShift is managed by the DNS Operator. You can see the CoreDNS pods running in the `openshift-dns` namespace.
The wildcard *.apps external DNS is only needed for stuff outside the cluster to get traffic to the correct Ingress point.
Have we tried installing a cluster without adding a static CoreDNS pod? What happens?
That's true of services, but not for Ingresses. I did a quick check to see what happens if we delete the *.apps record in an AWS cluster. Both the authentication and console operators entered Available=False state.

```
authentication   4.14.0-0.okd-scos-2023-08-07-185858   False   False   False   14s   OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.sdodson.redacted.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.sdodson.redacted.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
console          4.14.0-0.okd-scos-2023-08-07-185858   False   False   False   17s   RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.sdodson.redacted.openshift.com): Get "https://console-openshift-console.apps.sdodson.redacted.openshift.com": dial tcp: lookup console-openshift-console.apps.sdodson.redacted.openshift.com on 172.30.0.10:53: no such host
```
+1 I should have added this earlier. There is some test code in the Installer: openshift/installer#7403. e2e tests show the auth, console and Ingress operators degrading.
Enhancement proposal for https://issues.redhat.com/browse/CORS-1874
@sadasu: all tests passed!
@openshift-bot: Closed this PR.
(automated message) This pull request is closed with lifecycle/rotten. The associated Jira ticket, CORS-1874, has status "In Progress". Should the PR be reopened, updated, and merged? If not, removing the lifecycle/rotten label will tell this bot to ignore it in the future.
Closed in favor of #1468
API changes: openshift/api#1397