Support for GCP control plane, single/dual cluster #208
Open
nelsonjr wants to merge 1 commit into `main` from `nelson/nav-gcp-2`
# Self-Hosted Union on GCP using Terraform

This step-by-step tutorial deploys a self-hosted Union (both control plane
and data plane) on GCP using Union's reference Terraform modules.

> The customer is free to use any other infrastructure mechanism, be that
> their own Terraform modules or other means, as long as their choice of
> system produces the same output as the Terraform modules.
>
> This is critical because the instructions herein assume those resources
> exist and are configured as such. You are welcome to perform all tasks
> manually by following the full manual step-by-step instructions in the
> respective [Control Plane GCP][manual-cp-gcp] and [Data Plane
> GCP][manual-dp-gcp] guides.
## Resources Needed

- **VPC**: networking details to run Union
- **GKE**: where Union will run. It can be the same cluster or distinct
  clusters.
- **Cloud SQL (Postgres)**: a database for the CP to store job and run
  information.
- **Workload Identities**: used to allow the GKE cluster to assume IAM roles
  and acquire privileges to perform operations, e.g., accessing GCS
- **IAM service accounts**: the accounts used to perform privileged
  operations, e.g., writing state to GCS
- **ScyllaDB**: a high-performance NoSQL store for dynamic state
## Deploying Infrastructure Resources

To deploy the infrastructure resources, we will use the Terraform modules you
received. They are:

- `infra`: Creates all control plane and data plane infrastructure resources
- `infra_ext`: An adapter that plugs into existing infrastructure without
  creating it. This is used to make your existing VPC and GKE compatible with
  the modules below.
- `controlplane`: Creates all the control plane resources
- `dataplane`: Creates all the data plane resources
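As an illustration, a root module wiring these together might look like the
sketch below. The module source paths and most arguments are hypothetical and
must be adapted to the modules you received; only `dedicated_dataplane_cluster`
and the `project_id`/`union_org` locals are named in this guide.

```hcl
# Hypothetical root main.tf; source paths and most argument names are
# illustrative, not the modules' actual interface.
locals {
  project_id = "my-gcp-project"
  union_org  = "my-org"
}

module "infra" {
  source     = "./modules/infra"
  project_id = local.project_id
  region     = "us-central1"

  # true creates (or references) a second GKE cluster for the data plane;
  # false shares one cluster between CP and DP.
  dedicated_dataplane_cluster = false
}

module "controlplane" {
  source     = "./modules/controlplane"
  project_id = local.project_id
  union_org  = local.union_org
}

module "dataplane" {
  source     = "./modules/dataplane"
  project_id = local.project_id
}
```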
### Resources created by the Terraform modules

- **VPC**: networking details to run Union
- **GKE**: where Union will run. It can be the same cluster or distinct
  clusters, depending on the value of the variable
  `dedicated_dataplane_cluster`. If `true`, two clusters will be created (or
  referenced); if `false`, the same cluster is shared between CP and DP.
- **Cloud SQL (Postgres)**: a database for the CP to store job and run
  information.
- **Workload Identities**: used to allow the GKE cluster to assume IAM roles
  and acquire privileges to perform operations, e.g., accessing GCS
- **IAM service accounts**: the accounts used to perform privileged
  operations, e.g., writing state to GCS
### Deployment Instructions

1. Unpack the Terraform modules.
2. Choose one of the examples, either `create-infra` or
   `already-created-infra`, depending on whether you want the module to
   create the VPC and GKE or not.
   - Update the values within with your specific project details.
   - To share the same cluster between CP and DP, set
     `dedicated_dataplane_cluster = false`; otherwise set
     `dedicated_dataplane_cluster = true`.
3. `terraform init` to pull the required providers.
4. `terraform plan` and review the objects to be created.
5. `terraform apply` to make it so.
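Steps 3 through 5 can be sketched as the following shell session; the path to
the example directory is hypothetical and depends on how you unpacked the
modules.

```shell
# Run from inside the example you chose (path is illustrative)
cd terraform/examples/create-infra

terraform init                 # pull the required providers
terraform plan -out=union.plan # review the objects to be created
terraform apply union.plan     # create the resources
```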
At the end of this run you will have:

- VPC created (optional)
- GKE created (optional)
- Postgres in Cloud SQL
- Postgres account information (loaded as a secret in GKE)
- Self-signed certificates for the Control Plane (loaded as a secret in GKE)
- Workload identities for the Data Plane backend
## Deploying Union

### Deploying Control Plane

As you used the Terraform module (or performed your own steps that produced
the same objects), we will skip all the manual steps listed in the [Control
Plane GCP][manual-cp-gcp] page and move straight to `helm install`.
#### Gathering Infra Details

> We will eventually make Terraform output the values file. For now, please
> find these values in the output of the Terraform execution (or run
> `terraform output`).

You will mostly concentrate on the `global` section of the values file:
| Value | Description | Source |
| --- | --- | --- |
| `GCP_REGION` | The region the Control Plane is installed in | `main.tf > module > infra > region` |
| `DB_HOST` | The IP address of the Postgres database | `terraform output controlplane > db > host` |
| `BUCKET_NAME` | Bucket used for system functions | `terraform output controlplane > gcs > flyte > id` |
| `ARTIFACTS_BUCKET_NAME` | Bucket to store artifacts | `terraform output controlplane > gcs > artifacts > id` |
| `ARTIFACT_IAM_ROLE_ARN` | The IAM role used to access artifacts | `terraform output controlplane > service_accounts > artifacts > email` |
| `FLYTEADMIN_IAM_ROLE_ARN` | The IAM role used to access system storage | `terraform output controlplane > service_accounts > flyte > email` |
| `UNION_ORG` | The name of the Union organization | `main.tf > locals > union_org` |
| `GOOGLE_PROJECT_ID` | The name of the GCP project | `main.tf > locals > project_id` |
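The `terraform output` lookups in the table can be scripted. A sketch,
assuming the control plane values are exposed as a single `controlplane`
output object as the table suggests; the exact `jq` paths depend on the
modules you received.

```shell
# Read individual values out of the "controlplane" Terraform output.
# Requires jq; paths below are assumptions based on the table above.
terraform output -json controlplane | jq -r '.db.host'
terraform output -json controlplane | jq -r '.gcs.flyte.id'
terraform output -json controlplane | jq -r '.service_accounts.artifacts.email'
```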
We will ignore the following value for now and come back to it after we
install the Data Plane:

| Value | Description | Source |
| --- | --- | --- |
| `DATAPLANE_ENDPOINT` | The ingress endpoint for the data plane | Data Plane `EXTERNAL_IP` for its ingress service |

> If you are using DNS entries for the ingress endpoints, and you know the
> Data Plane ingress DNS in advance, you can specify it now and skip updating
> this later.
#### CP Deployment Instructions

0. Ensure your current cluster context points to where you want to install
   the Control Plane, i.e., execute `gcloud container clusters
   get-credentials` or whichever mechanism you use to make your cluster the
   default.
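For example, assuming a regional cluster (the cluster, region, and project
names below are placeholders):

```shell
# Point kubectl at the control plane cluster; all names are placeholders
gcloud container clusters get-credentials my-cp-cluster \
  --region us-central1 \
  --project my-gcp-project

# Verify which context is now active
kubectl config current-context
```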
1. Unpack the Helm charts.

2. Make a copy of the [GCP CP Self-Hosted values.yaml][gcp-cp-values].
   - If you received a reference values file from Union personnel, use that
     instead.

3. Update the `values.yaml` from step 2 with your project- and
   environment-specific information.

4. Load the registry access secret into the cluster:

       kubectl create secret docker-registry union-registry-secret \
         --docker-server=registry.unionai.cloud \
         --docker-username='<username>' \
         --docker-password='<password>' \
         -n union-cp
5. `helm install` the Control Plane chart:

       cd charts/controlplane
       helm upgrade --install unionai-controlplane . \
         --namespace union-cp \
         --create-namespace \
         --values your-values.yaml \
         --timeout 15m

6. Wait for a little bit: get a coffee, maybe take a short walk?
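Instead of polling by hand, you can block until the pods are ready; a sketch,
with the timeout matched to the Helm timeout above:

```shell
# Wait for all control plane pods to become Ready (up to 15 minutes)
kubectl wait pod --all \
  --for=condition=Ready \
  -n union-cp \
  --timeout=15m
```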
#### CP Deployment Verification

Confirm all services are running:

    kubectl get pod -n union-cp

You should get something like this:

    NAME                                             READY   STATUS    RESTARTS   AGE
    authorizer-6f8f655467-l44mt                      1/1     Running   0          33h
    cacheservice-6979466f8c-6rn5g                    1/1     Running   0          33h
    cluster-c5597448-prvbq                           1/1     Running   0          33h
    controlplane-nginx-controller-857b568794-twfkr   1/1     Running   0          33h
    dataproxy-6d484b7c45-rbgw9                       1/1     Running   0          11h
    executions-5d4fb97788-wgg6w                      1/1     Running   0          33h
    flyteadmin-74fbf5bbd9-x6lb5                      1/1     Running   0          33h
    flyteconsole-d6859d494-wnrxl                     1/1     Running   0          33h
    queue-78f8fb75f4-22qp8                           1/1     Running   0          33h
    run-scheduler-54667b6d96-w5z8p                   1/1     Running   0          33h
    scylla-dc1-rack1-0                               4/4     Running   0          33h
    scylla-dc1-rack1-1                               4/4     Running   0          33h
    scylla-dc1-rack1-2                               4/4     Running   0          33h
    unionconsole-55d946668-nlf7x                     1/1     Running   0          33h
    usage-5ddf757d6d-cjlr8                           1/1     Running   0          33h

> Note: the `queue` service crash-loops until the Scylla rack pods are up.
At this point the control plane setup is complete.

### Deploying Data Plane

The process to deploy the data plane is very similar to that of the Control
Plane.
#### DP Gathering Infra Details

| Value | Description | Source |
| --- | --- | --- |
| `CLUSTER_NAME` | Name of the Data Plane cluster | `terraform output dataplane > union > cluster_name` |
| `ORG_NAME` | Union organization | `terraform output dataplane > union > org` |
| `METADATA_BUCKET` | System bucket | `terraform output dataplane > gcs > metadata > name` |
| `FAST_REGISTRATION_BUCKET` | Fast registration bucket (can be the same as the metadata bucket) | `terraform output dataplane > gcs > fast_registration > name` |
| `GCP_REGION` | The region the Data Plane is installed in | `main.tf > module > infra > region` |
| `GOOGLE_PROJECT_ID` | The name of the GCP project | `main.tf > locals > project_id` |
| `BACKEND_IAM_ROLE_ARN` | The role backend services will run as | `terraform output dataplane > gcp > service_accounts > backend > email` |
| `WORKER_IAM_ROLE_ARN` | The role workers will run as | `terraform output dataplane > gcp > service_accounts > worker > email` |
| `CONTROLPLANE_INTRA_CLUSTER_HOST` | Control Plane ingress host | On the CP GKE, `get svc controlplane-nginx-controller` and pick `EXTERNAL_IP` |
| `QUEUE_SERVICE_HOST` | Queue service host | On the CP GKE, `get svc queue` and pick `EXTERNAL_IP` |
| `FLYTEADMIN_ENDPOINT` | FlyteAdmin endpoint | On the CP GKE, `get svc flyteadmin` and pick `EXTERNAL_IP` |
| `CACHESERVICE_ENDPOINT` | Cache service endpoint | On the CP GKE, `get svc cacheservice` and pick `EXTERNAL_IP` |
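The last four values can be read directly from the Control Plane cluster. A
sketch, assuming the CP services live in the `union-cp` namespace used during
the CP install:

```shell
# With your context pointing at the CP GKE cluster, grab the EXTERNAL_IP
# of each control plane service the data plane needs to reach.
for svc in controlplane-nginx-controller queue flyteadmin cacheservice; do
  ip=$(kubectl get svc "$svc" -n union-cp \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  echo "$svc: $ip"
done
```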
#### DP Deployment Instructions

0. Ensure your current cluster context points to where you want to install
   the Data Plane. _If you are not sharing the same cluster, note this should
   point to the **data plane** cluster now._
1. Make a copy of the [GCP DP Self-Hosted values.yaml][gcp-dp-values].
   - If you received a reference values file from Union personnel, use that
     instead.

2. Update the `values.yaml` from step 1 with your project- and
   environment-specific information.

3. `helm install` the Data Plane chart:

       cd charts/dataplane
       helm upgrade --install unionai-dataplane . \
         --namespace union \
         --create-namespace \
         --values your-values.yaml \
         --timeout 10m \
         --wait

   > It is important to deploy in the `union` namespace, so do not change the
   > `--namespace union` argument. The Workload Identity IAM bindings are
   > configured for that namespace, and changing it only here will make
   > things fail.

4. Wait for a little bit: time for another coffee or walk?
#### DP Deployment Verification

Confirm all services are running:

    kubectl get pod -n union

You should see something like this:

    NAME                                          READY   STATUS    RESTARTS   AGE
    dataplane-nginx-controller-859754bb66-zxjgs   1/1     Running   0          12h
    executor-6b9fbfb46d-bczbs                     1/1     Running   0          12h
    flytepropeller-54b98486b4-59qnw               1/1     Running   0          12h
    flytepropeller-webhook-6fc47cd8fd-rf7nt       1/1     Running   0          12h
    prometheus-operator-5cff9b5487-rb7nn          1/1     Running   0          12h
    prometheus-union-operator-prometheus-0        2/2     Running   0          12h
    syncresources-56d976c8-7s28f                  1/1     Running   0          12h
    union-operator-d8746c9f9-6c6lz                1/1     Running   0          12h
    union-operator-proxy-5fd674b9dd-jp8vb         1/1     Running   0          12h
    unionai-dataplane-fluentbit-572gt             1/1     Running   0          12h
    unionai-dataplane-fluentbit-n4tbd             1/1     Running   0          12h
    unionai-dataplane-fluentbit-qhknp             1/1     Running   0          12h
#### Binding DP to CP

> This step is only needed if you are not using DNS entries for the Data
> Plane ingress, or if you are but could not predict the entry before
> installing the Control Plane.

The Control Plane needs to reach out to the Data Plane to send work to it.
Therefore, we need to "teach" the Control Plane where to find the Data Plane.
That is accomplished by the Helm value `DATAPLANE_ENDPOINT` in the Control
Plane Helm chart.
1. Find the IP address (or DNS entry) for the ingress of the Data Plane, and
   pick the `EXTERNAL_IP`:

       kubectl get svc dataplane-nginx-controller

2. Update the variable in your `values.yaml` for the Control Plane.
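The `EXTERNAL_IP` can also be captured in one step with a jsonpath query; a
sketch, using the `union` namespace the data plane chart was installed into:

```shell
# Capture the data plane ingress IP for use as DATAPLANE_ENDPOINT
DATAPLANE_ENDPOINT=$(kubectl get svc dataplane-nginx-controller -n union \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "$DATAPLANE_ENDPOINT"
```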
3. Make sure you switch your Kubernetes context back to the Control Plane
   GKE cluster.

4. Run an upgrade of the Helm chart to propagate the value (by running the
   same command used to install the Control Plane):

       cd charts/controlplane
       helm upgrade --install unionai-controlplane . \
         --namespace union-cp \
         --create-namespace \
         --values your-values.yaml \
         --timeout 15m

Once this completes, you are done! Both the control plane and data plane are
successfully set up.
[manual-cp-gcp]: https://github.com/unionai/helm-charts/blob/nelson/nav-gcp/charts/controlplane/SELFHOSTED_INTRA_CLUSTER_GCP.md
[manual-dp-gcp]: https://github.com/unionai/helm-charts/blob/nelson/nav-gcp/charts/dataplane/SELFHOSTED_INTRA_CLUSTER_GCP.md
[gcp-cp-values]: https://github.com/unionai/helm-charts/blob/nelson/nav-gcp/charts/controlplane/values.gcp.selfhosted-intracluster.yaml
[gcp-dp-values]: https://github.com/unionai/helm-charts/blob/nelson/nav-gcp/charts/dataplane/values.gcp.selfhosted-intracluster.yaml
**Contributor** commented:

Could we also mention that cert-manager (or similar functionality) must be
available on the cluster for creating certs? It is also worth noting the racy
behavior where the ScyllaDB webhook takes a long time to come up, which
requires the user to run this command multiple times. Maybe add a note to
check the logs of the Scylla webhook as well.

Longer term we can probably add some delay in bringing up the services until
Scylla is up (if it is enabled in the values file), or add a separate Scylla
installation step, so that services come up with that dependency available.

Also, in general, an infra/dependency verification tool could be helpful
before running this step, to make sure all dependencies can be met. Another
option is for each service to check that its dependencies are available
before coming up. Whichever option we choose, it may be good to mention the
dependency here.