Separate control plane and data plane; support multiple Gateways #3318


Draft · wants to merge 25 commits into main
Conversation

sjberman (Collaborator)

To quickly and effectively understand the Gateway API, its implementation, and how it aligns with NGINX as a data plane, we originally chose a simplified but rigid deployment pattern. To improve our security posture and installation flexibility, the control plane and data plane are being separated into semi-autonomous, distributed components. This also allows a single control plane to support multiple Gateways.

A general summary of the changes being made:

  • control plane and data plane are now in separate Deployments
  • installing NGF just installs the control plane
  • when a Gateway resource is created, the control plane provisions an nginx data plane deployment and service
  • the NginxProxy CRD resource can now be set at the Gateway level, and has been enhanced to include all Deployment/Service infrastructure-related fields, such as replicas, loadBalancerIP, serviceType, etc. (see the example sketch after this list)
    • these fields can be configured globally at installation time in the helm chart, or set on an individual basis per Gateway
    • updating these fields directly on a provisioned nginx Deployment or Service will not take effect
    • this does not apply to the control plane Deployment
  • labels/annotations for the NGINX deployment or service can be set in the Gateway's Infrastructure section
  • the NGINX pod uses the NGINX agent (currently an unofficial, unreleased version) to update NGINX configuration
  • control plane communicates with the NGINX agent over a secure gRPC connection, using self-signed certs by default, created at installation time. Cert-manager can be used instead.
  • multiple Gateways are now supported
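
A rough, hypothetical sketch of the new model: a Gateway that references a per-Gateway NginxProxy via spec.infrastructure.parametersRef, with infrastructure labels, alongside an NginxProxy carrying the infrastructure fields mentioned above. The resource names, the nested field layout under the NginxProxy spec, and the label/IP values are assumptions for illustration and may not match the final CRD schema exactly.

```yaml
# Sketch only: names and the exact NginxProxy field layout are illustrative.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  namespace: default
spec:
  gatewayClassName: nginx
  infrastructure:
    labels:
      app.kubernetes.io/part-of: my-app   # applied to the provisioned NGINX Deployment/Service
    parametersRef:
      group: gateway.nginx.org
      kind: NginxProxy
      name: my-gateway-proxy-config
  listeners:
  - name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.nginx.org/v1alpha2
kind: NginxProxy
metadata:
  name: my-gateway-proxy-config
  namespace: default
spec:
  kubernetes:                 # illustrative grouping of the infrastructure fields
    deployment:
      replicas: 2
    service:
      type: LoadBalancer
      loadBalancerIP: 203.0.113.10
```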

Note: a few implementation steps remain to finish this feature, but this PR includes all of the main functionality and passing test pipelines. It is not 100% stable yet and is subject to a few more breaking changes before release.

Design: https://github.com/nginx/nginx-gateway-fabric/tree/main/docs/proposals/control-data-plane-split
Epic: #1508

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

BREAKING CHANGES:

<link to upgrade documentation and anything else that may be relevant>

The following changes are breaking and require users to fully uninstall NGINX Gateway Fabric (including the NGINX Gateway Fabric CRDs) before re-installing the new version. Gateway API resources (such as Gateway, HTTPRoute, etc.) are unaffected and can be left in place.

- Control plane and data plane have been separated into different Deployments.
   - the control plane will provision an NGINX data plane Deployment and Service when a Gateway object is created.
- NginxProxy CRD resource is now namespace-scoped (was cluster-scoped).
- NginxProxy resource controls infrastructure fields for the NGINX Deployment and Service, such as replicas, loadBalancerIP, serviceType, etc. Users who want to set or update these fields must do so either at installation time through the helm chart (which sets them globally), or per Gateway. Updating these fields directly on a provisioned nginx Deployment or Service will not take effect.
   - this does not apply to the NGINX Gateway Fabric control plane Deployment.
- Helm values structure has changed slightly to better support the separate Deployments (see the hypothetical values sketch below).
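
A hedged sketch of what the reworked nginx section of the Helm values might look like. The key names below are assumptions and should be checked against the chart's values.yaml; only nginx.debug and nginx.service.loadBalancerClass are explicitly referenced elsewhere in this PR.

```yaml
# values.yaml sketch (assumed structure; verify against the shipped chart)
nginx:
  replicas: 1
  debug: false                 # referenced later in this PR as nginx.debug
  service:
    type: LoadBalancer
    loadBalancerClass: ""      # referenced later as nginx.service.loadBalancerClass
    loadBalancerIP: ""
```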

FEATURES:
- Support for creating and deploying multiple Gateways.
- NginxProxy resource can now also be attached to a Gateway; for the Gateway it's attached to, its settings override any settings attached at the GatewayClass level.

sjberman and others added 22 commits April 23, 2025 07:37
Removing the nginx runtime manager and deployment container since nginx will live in its own pod managed by agent. Temporarily saving the nginx deployment and service for future use.

Updated the control plane liveness probe to return true once it's processed all resources, instead of after it's written config to nginx (since nginx may not be started yet in the future architecture).
Updating the nginx docker containers to build and include agent. Once agent is officially released, we can use the published binary instead of building.

Added a temporary nginx deployment to the helm chart to deploy a standalone nginx pod.

Added the basic gRPC server and agent API implementation to allow for the agent pod to connect to the control plane without errors.
Added the following:
- middleware to extract IP address of agent and store it in the grpc context
- link the agent's hostname to its IP address when connecting and track it
- use this linkage to pause the Subscription until the agent registers itself, then proceed

This logic is subject to change as we enhance this (like tracking auth token instead of IP address).
Problem: When the control plane and data planes are split, the user will need the ability to specify data plane settings on a per-Gateway basis. To allow this, we need to support NginxProxy at the Gateway level in addition to the GatewayClass level. In practice, this means a user can reference an NginxProxy resource via the
spec.infrastructure.parametersRef field on the Gateway resource. We still want to support referencing an NginxProxy at the GatewayClass level. If a Gateway and its GatewayClass reference distinct NginxProxy resources, the settings must be merged. Settings specified on a Gateway NginxProxy must override those set on the GatewayClass NginxProxy.

Solution: To support NginxProxy at the Gateway level several changes were made to the API.
As a result, the API is now at version v1alpha2.

Breaking Changes:
* Change the scope of the CRD to Namespaced. The parametersRef.namespace field on the GatewayClass is now required.
* Make DisableHTTP2 and Telemetry.Exporter.Endpoint optional.

New fields:
* Telemetry.DisabledFeatures: allows users to explicitly disable telemetry features. It is a list with one supported entry: DisableTracing. More features may be added in future releases.

Other changes:
* Remove the listType=Map kubebuilder annotation from the RewriteClientIP.TrustedAddresses field. This listType is incorrect since TrustedAddresses can have duplicate keys.

The graph now stores NginxProxies that are referenced by the winning GatewayClass and Gateway. This will need to be updated once we support multiple Gateways. The graph is also responsible for merging the NginxProxies when necessary. The result of this is stored on the graph's Gateway object in the field EffectiveNginxProxy. The EffectiveNginxProxy on the Gateway is used to build the NGINX configuration.
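
To make the scope change and the new DisabledFeatures field above concrete, here is a hedged sketch: a GatewayClass-level parametersRef that must now include a namespace (since NginxProxy is namespace-scoped), and an NginxProxy that disables tracing via the new list. The controllerName value and field casing are assumptions, not taken from this PR.

```yaml
# Sketch only: controllerName and field casing are assumed.
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: nginx
spec:
  controllerName: gateway.nginx.org/nginx-gateway-controller
  parametersRef:
    group: gateway.nginx.org
    kind: NginxProxy
    name: nginx-proxy-config
    namespace: nginx-gateway    # now required: NginxProxy is namespace-scoped
---
apiVersion: gateway.nginx.org/v1alpha2
kind: NginxProxy
metadata:
  name: nginx-proxy-config
  namespace: nginx-gateway
spec:
  telemetry:
    disabledFeatures:
    - DisableTracing            # the single supported entry described above
```
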
This commit adds functionality to send nginx configuration to the agent. It also adds support for the single nginx Deployment to be scaled, and send configuration to all replicas. This requires tracking all Subscriptions for a particular deployment, and receiving all responses from those replicas to determine the status to write to the Gateway.
Problem: The NGINX Plus API conf file was empty when running NGINX OSS, which caused an error when applying config. This also revealed an issue where we received multiple messages from the agent, causing some channel blocking.

Solution: Don't send the empty NGINX Plus API conf file when not running NGINX Plus. Ignore responses from the agent about rollbacks, so we only ever process a single response as expected.
Add leader election so that data plane pods only connect to the leader NGF pod. If the control plane is scaled, only the leader is marked Ready and the backups are Unready, so the data plane doesn't connect to them.

Problem: We want the NGF control plane to fail-over to another pod when the control plane pod goes down.

Solution: Only the leader pod is marked as ready by Kubernetes, and all data plane pods connect to the leader pod.
This commit updates the control plane to deploy an NGINX data plane when a valid Gateway resource is created. When the Gateway is deleted or becomes invalid, the data plane is removed. The NginxProxy resource has been updated with numerous configuration options related to the k8s deployment and service configs, which the control plane will apply to the NGINX resources when set. The control plane fully owns the NGINX deployment resources, so users who want to change any configuration must do so using the NginxProxy resource.

This does not yet support NGINX Plus or NGINX debug mode; those will be added in follow-up pull requests. This also adds some basic DaemonSet fields, but does not yet support deploying a DaemonSet. That will also be added soon.
* Add back runnables change and call to nginx provisioner enable

---------

Co-authored-by: Benjamin Jee <[email protected]>
…3147)

Support nginx debug mode when provisioning the Data Plane.

Problem: We want to have the option to provision nginx instances in debug mode.

Solution: Add a debug field to the NginxProxy CRD. Users can also enable debug mode when installing through Helm by setting the nginx.debug flag.
Continuation from the previous commit to add support for provisioning with NGINX Plus. This adds support for duplicating any NGINX Plus or docker registry secrets into the Gateway namespace.

Added unit tests.
With the new deployment model, the provisioner mode for conformance tests is no longer needed. This code is removed, and at a later date the conformance tests will be updated to work with the new model. Renamed the "static-mode" to "controller".

Also removed some unneeded metrics collection.
Problem: When a user updates or deletes their docker registry or NGINX Plus secrets, those changes need to be propagated to all duplicate secrets that we've provisioned for the Gateway resources.

Solution: If updated, update the provisioned secret. If deleted, delete the provisioned secret.
Update functional tests for the control plane data plane split.

Problem: The functional tests do not pass with the current architecture.

Solution: Add updates to functional tests.
Problem: We want to ensure that the connection between the control plane and data plane is authenticated and secure.

Solution:

1. Configure the agent to send its Kubernetes ServiceAccount token in the request. The control plane validates this token using the TokenReview API to ensure the agent is authenticated.
2. Configure TLS certificates for both the control and data planes. By default, a Job will run when installing NGF that creates self-signed certificates in the nginx-gateway namespace. The server Secret is mounted to the control plane, and the control plane copies the client Secret when deploying nginx resources. This Secret is mounted to the agent.

The control plane will reset the agent connection if it detects that its own certs have changed.

For production environments, we recommend that users configure TLS using cert-manager instead, for better security and certificate rotation.
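
As a hedged example of the cert-manager alternative, a Certificate like the following could be used to issue the server certificate for the control plane's gRPC endpoint. The Secret name, DNS names, durations, and issuer are placeholders and must match whatever the chart expects.

```yaml
# Sketch only: names, DNS names, and issuer are placeholders.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nginx-gateway-server-cert
  namespace: nginx-gateway
spec:
  secretName: server-tls            # mounted by the control plane in place of the self-signed Secret
  duration: 2160h                   # 90 days
  renewBefore: 360h                 # rotate well before expiry
  dnsNames:
  - ngf-nginx-gateway-fabric.nginx-gateway.svc
  issuerRef:
    name: my-ca-issuer
    kind: Issuer
```
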
Problem: The data plane container was not properly handling the kill signal when the Pod was Terminated.

Solution: Update the entrypoint to catch the proper signals.
Problem: Now that we have additional pods in the new architecture, we need the proper SecurityContextConstraints for running in OpenShift.

Solution: Create an SCC for the cert-generator and an SCC for nginx data plane pods on startup. A Role and RoleBinding are created when deploying nginx to link to the SCC.
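
The Role/RoleBinding linkage described above typically grants the data plane ServiceAccount permission to "use" the SCC; a minimal sketch follows, with the SCC, Role, and ServiceAccount names assumed for illustration.

```yaml
# Sketch only: SCC, Role, and ServiceAccount names are illustrative.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: nginx-scc
  namespace: default
rules:
- apiGroups: ["security.openshift.io"]
  resources: ["securitycontextconstraints"]
  resourceNames: ["nginx-scc"]      # the SCC created for nginx data plane pods
  verbs: ["use"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: nginx-scc
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: nginx-scc
subjects:
- kind: ServiceAccount
  name: my-gateway-nginx            # ServiceAccount of the provisioned nginx Deployment (assumed name)
  namespace: default
```
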
Problem: Users want to be able to configure multiple Gateways with a single installation of NGF.

Solution: Support the ability to create multiple Gateways. Routes and policies can be attached to multiple Gateways.

Also fixed conformance tests.

---------

Co-authored-by: Saylor Berman <[email protected]>
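
To illustrate the multiple-Gateway support described in this commit, an HTTPRoute can attach to more than one Gateway via parentRefs; the Gateway, Service, and hostname values below are placeholders.

```yaml
# Sketch only: resource names are illustrative.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
  namespace: default
spec:
  parentRefs:
  - name: gateway-internal
  - name: gateway-external
  hostnames:
  - app.example.com
  rules:
  - backendRefs:
    - name: app-svc
      port: 80
```
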
Update non-functional tests for the control plane data plane split.

Problem: The non-functional tests do not work for the control plane data plane split changes.

Solution: Update non-functional tests.

Testing: Scale, Reconfiguration, Performance, and Longevity tests work. The Upgrade test doesn't work, but that is expected since the CP/DP split is a breaking change in NGF, so you can't easily upgrade with zero downtime.

---------

Co-authored-by: Saylor Berman <[email protected]>
@github-actions github-actions bot added documentation Improvements or additions to documentation dependencies Pull requests that update a dependency file change Pull requests that introduce a change helm-chart Relates to helm chart labels Apr 23, 2025

codecov bot commented Apr 23, 2025

Codecov Report

Attention: Patch coverage is 78.19876% with 351 lines in your changes missing coverage. Please review.

Project coverage is 86.77%. Comparing base (1768129) to head (5084be8).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/mode/static/manager.go | 3.77% | 102 Missing ⚠️ |
| internal/mode/static/nginx/agent/command.go | 82.50% | 53 Missing and 10 partials ⚠️ |
| cmd/gateway/commands.go | 71.73% | 50 Missing and 2 partials ⚠️ |
| cmd/gateway/certs.go | 75.67% | 25 Missing and 11 partials ⚠️ |
| internal/mode/static/handler.go | 82.27% | 21 Missing and 7 partials ⚠️ |
| internal/framework/controller/predicate/secret.go | 56.75% | 12 Missing and 4 partials ⚠️ |
| internal/framework/file/file.go | 77.58% | 12 Missing and 1 partial ⚠️ |
| ...static/nginx/agent/grpc/filewatcher/filewatcher.go | 84.48% | 7 Missing and 2 partials ⚠️ |
| internal/framework/controller/resource.go | 0.00% | 7 Missing ⚠️ |
| ...ernal/framework/controller/predicate/annotation.go | 76.00% | 4 Missing and 2 partials ⚠️ |
| ... and 8 more | | |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3318      +/-   ##
==========================================
+ Coverage   86.20%   86.77%   +0.56%     
==========================================
  Files         116      128      +12     
  Lines       11928    14712    +2784     
  Branches       62       62              
==========================================
+ Hits        10283    12766    +2483     
- Misses       1580     1806     +226     
- Partials       65      140      +75     

☔ View full report in Codecov by Sentry.

bjee19 and others added 2 commits April 28, 2025 11:12
…ervice (#3319)

Add ability to set loadBalancerClass for load balancer Service

Problem: We would like the ability to specify the loadBalancerClass field on a load balancer Service.

Solution: Add ability to set loadBalancerClass for load balancer Service.

Testing: Manually tested that deploying NGF with the nginx.service.loadBalancerClass Helm flag would correctly set the field. Also tested that modifying the NginxProxy resource would set the loadBalancerClass when the service was re-created (the field can only be set upon creation).
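
For illustration, a hedged sketch of setting this field per-Gateway through the NginxProxy resource rather than the Helm flag; the nested service block and the example class value service.k8s.aws/nlb are assumptions.

```yaml
# Sketch only: field layout and the loadBalancerClass value are illustrative.
apiVersion: gateway.nginx.org/v1alpha2
kind: NginxProxy
metadata:
  name: nginx-proxy-config
  namespace: default
spec:
  kubernetes:
    service:
      type: LoadBalancer
      loadBalancerClass: service.k8s.aws/nlb   # only applied when the Service is (re)created
```
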
Problem: All config update events resulted in sending configuration to every Gateway, even if the change was irrelevant.

Solution: Compare the new config with the old config to determine whether a ConfigApply is necessary. Simplified the change processor and handler so they no longer have to determine this.