feat(23570): Add controller for workspace backup #1530
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Allda. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 42dd45c to dffd7e6.
@Allda : Really appreciate you taking the time to contribute this in such a short time. 🎉 Could you please also fill out the “Is it tested? How?” section in the PR template? It’ll help reviewers and future contributors verify the change more easily. Thanks again for your effort! 🙌
I tested this PR and it seems to work.

    config:
      workspace:
        backupCronJob:
          enable: true
          schedule: "*/3 * * * *"
Force-pushed from 0bc74b1 to 8427ba5.
/retest
Codecov Report

❌ Patch coverage report — additional details and impacted files:

    @@            Coverage Diff             @@
    ##             main    #1530      +/-   ##
    ==========================================
    + Coverage   34.09%   35.30%   +1.21%
    ==========================================
      Files         160      161       +1
      Lines       13348    13802     +454
    ==========================================
    + Hits         4551     4873     +322
    - Misses       8487     8599     +112
    - Partials      310      330      +20

☔ View full report in Codecov by Sentry.
ibuziuk left a comment:
@Allda : Thank you! I can confirm that this approach works without any explicit configuration. I can see the backup image being pushed to the configured registry. However, I see the image has a different format. I tested on CRC with OpenShift 4.20.1.
Yes, an ORAS artifact is not a "real" image, so it has a different config. The images produced by ORAS should have the following manifest:
@tolusha for some reason GitHub is not letting me comment on your comment: #1530 (comment). IMHO, I would prefer if we didn't add an annotation to the DevWorkspaces, to avoid potentially sending a high number of requests to the apiserver.
- Moving extraArgs to Oras config section
- Unify default values
- Change UBI base image
- Use constant for the PVC name

Signed-off-by: Ales Raszka <[email protected]>
Instead of using a global secret for the whole cluster, the controller searches for a namespace-specific secret and uses it if available. If not found, it falls back to the global secret.

Signed-off-by: Ales Raszka <[email protected]>
In case the user uses the internal registry in OCP, the image repository path aligns with the OCP namespace and an image stream named after the workspace is created.

Signed-off-by: Ales Raszka <[email protected]>
The internal OCP registry is supported by default, without the need to provide any registry auth secret. The backup image is pushed to the same namespace where the workspace is running. The token is auto-generated and mounted from the SA definition.

Signed-off-by: Ales Raszka <[email protected]>
/retest
With this comment I would like to give you an overview of the feature, how to set it up, and show you the results. The current implementation focused on supporting the built-in OCP registry as well as any other OCI-compliant registry.

First, let's start with the default OCP registry. For this variant the user doesn't need to provide any auth secrets. In the next cronjob iteration the workspace will be backed up with the Job. Logs from the Job:

The image is then available in the internal registry.

The second option is to use any generic OCI-compliant registry with custom authentication. Here is the example of the config with a custom quay.io registry. The user needs to provide an access token either in the workspace namespace or in the operator namespace. Here is the log of the backup job:
@Allda could you please clarify if the backup is expected to work with both per-user & per-workspace PVC strategies? Also, please consider contributing documentation (could be a separate PR) - https://github.com/devfile/devworkspace-operator/blob/main/docs/dwo-configuration.md |
Yes, the controller supports both types of volumes and is based on the volume provisioner logic that was already available in the operator. I'll create a separate PR with documentation.
@Allda : Thanks for providing the steps. I followed these steps on CRC.

External Registry Backup Scenario

DWOC Configuration:

    config:
      workspace:
        backupCronJob:
          enable: true
          registry:
            authSecret: dockerhub-push-secret
            path: docker.io/rohankanojia
          schedule: '* * * * *'

When I stop the DevWorkspace, I see job pod logs. When inspecting individual pod logs, it seems it's trying to push to the wrong URL (maybe DockerHub doesn't support it):

OpenShift Internal Registry Backup Scenario

❌ I wasn't able to get it working on CRC. When I stop the DevWorkspace, I see job pod logs. I see the error below in the job pod logs; I can see the ImageStream gets created, but the tag is empty:
    sa := &corev1.ServiceAccount{
        ObjectMeta: metav1.ObjectMeta{Name: saName, Namespace: workspace.Namespace, Labels: map[string]string{
            constants.DevWorkspaceIDLabel:          workspace.Status.DevWorkspaceId,
            constants.DevWorkspaceWatchSecretLabel: "true",
        return err
    }

    if _, err := controllerutil.CreateOrUpdate(ctx, r.Client, sa, func() error { return nil }); err != nil {
Please use SyncObjectWithCluster [1]. NotInSyncError is ok, it is always thrown on first creation [2].

[1] func SyncObjectWithCluster(specObj crclient.Object, api ClusterAPI) (crclient.Object, error)
[2] func createObjectGeneric(specObj crclient.Object, api ClusterAPI) error
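For context, a rough sketch of how this suggestion could look for the ServiceAccount snippet above. The import paths, the `sync.ClusterAPI` wiring, and the `NotInSyncError` type assertion are assumptions based on the signatures quoted in [1] and [2], not the PR's actual code.

```go
package controllers

import (
	dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/devfile/devworkspace-operator/pkg/constants"
	// Assumed location of SyncObjectWithCluster / NotInSyncError in this repository.
	"github.com/devfile/devworkspace-operator/pkg/provision/sync"
)

// syncJobRunnerSA is a hypothetical helper: it syncs the backup job's
// ServiceAccount and reports whether the reconciler should simply requeue.
func syncJobRunnerSA(saName string, workspace *dw.DevWorkspace, clusterAPI sync.ClusterAPI) (bool, error) {
	specSA := &corev1.ServiceAccount{
		ObjectMeta: metav1.ObjectMeta{
			Name:      saName,
			Namespace: workspace.Namespace,
			Labels: map[string]string{
				constants.DevWorkspaceIDLabel:          workspace.Status.DevWorkspaceId,
				constants.DevWorkspaceWatchSecretLabel: "true",
			},
		},
	}

	// SyncObjectWithCluster creates or updates the object in the cluster.
	if _, err := sync.SyncObjectWithCluster(specSA, clusterAPI); err != nil {
		if _, notInSync := err.(*sync.NotInSyncError); notInSync {
			// Expected on first creation: requeue and the object should be in
			// sync on a later reconcile.
			return true, nil
		}
		return false, err
	}
	return false, nil
}
```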
    log.Info("Backup Job created for DevWorkspace", "id", dwID)
    }

    origConfig := client.MergeFrom(dwOperatorConfig.DeepCopy())

Could you explain why we need to make a deep copy?
    }
    dwOperatorConfig.Status.LastBackupTime = &metav1.Time{Time: metav1.Now().Time}

    err = r.NonCachingClient.Status().Patch(ctx, dwOperatorConfig, origConfig)

I think we don't need to use a NonCachingClient for patching DWOC.
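For context, a minimal sketch of the patch pattern under discussion, using the regular cached client as suggested; the helper name and package alias are illustrative, not the PR's code, and the Status.LastBackupTime field is the one added in this PR. The deep copy matters because client.MergeFrom keeps the passed object as the patch base, so the base must be a snapshot taken before the status is mutated; otherwise the computed diff would be empty.

```go
package controllers

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed location of the DevWorkspaceOperatorConfig type.
	controllerv1alpha1 "github.com/devfile/devworkspace-operator/apis/controller/v1alpha1"
)

// updateLastBackupTime records the backup timestamp on the DWOC status.
func updateLastBackupTime(ctx context.Context, c client.Client, dwoc *controllerv1alpha1.DevWorkspaceOperatorConfig) error {
	// Snapshot the object before mutating it; MergeFrom uses this snapshot as
	// the base when computing the patch.
	origConfig := client.MergeFrom(dwoc.DeepCopy())

	dwoc.Status.LastBackupTime = &metav1.Time{Time: metav1.Now().Time}

	// Patch only the status subresource with the diff against the snapshot.
	return c.Status().Patch(ctx, dwoc, origConfig)
}
```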
    // wasStoppedSinceLastBackup checks if the DevWorkspace was stopped since the last backup time.
    func (r *BackupCronJobReconciler) wasStoppedSinceLastBackup(workspace *dw.DevWorkspace, lastBackupTime *metav1.Time, log logr.Logger) bool {
        if workspace.Status.Phase != dw.DevWorkspaceStatusStopped {

Yes, backup should only be for stopped workspaces.
    if workspace.Status.Phase != dw.DevWorkspaceStatusStopped {
        return false
    }
    log.Info("DevWorkspace is currently stopped, checking if it was stopped since last backup", "namespace", workspace.Namespace, "name", workspace.Name)

I am a bit concerned about the extra printing. It will look like spam in the case of hundreds of workspaces.
Good point, if 100s of workspaces are started and stopped in a day, there can be thousands of backup-related logs out of the box. For now, I think we should set https://github.com/Allda/devworkspace-operator/blob/6af5fb855a0512cff8f90c9a7d128c17269eb587/main.go#L195 to:

    Log: ctrl.Log.WithName("controllers").WithName("BackupCronJob").V(1)

There's no built-in functionality for changing the log level for DWO: https://issues.redhat.com/browse/WTO-296, so maybe it can be set manually for now (either in the operand's deployment, or in the CSV) as an env var.
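To illustrate the suggestion above: demoting the controller's logger to V(1) means its Info() calls are only emitted when the operator's verbosity is raised. A tiny sketch, assuming the reconciler exposes a Log field like the other controllers (names here are illustrative):

```go
package main

import (
	"github.com/go-logr/logr"
	ctrl "sigs.k8s.io/controller-runtime"
)

// backupCronJobReconciler stands in for the PR's reconciler; only the Log field
// wiring is relevant here.
type backupCronJobReconciler struct {
	Log logr.Logger
}

func newBackupReconciler() *backupCronJobReconciler {
	return &backupCronJobReconciler{
		// V(1) lowers the priority of every Info() call made through this
		// logger, so per-workspace messages stay hidden at the default verbosity.
		Log: ctrl.Log.WithName("controllers").WithName("BackupCronJob").V(1),
	}
}
```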
    // in the namespace where the operator is running in.
    // as the DevWorkspaceOperatorConfig.
    // +kubebuilder:validation:Optional
    AuthSecret string `json:"authSecret,omitempty"`

@dkwon17 We can add a requirement for all secrets to have controller.devfile.io/watch-secret=true; in this case there is no need to use the NonCachingClient.

+1 to do that. @Allda could we add this line, or something similar, in the description (line 86)?

    The secret must have the controller.devfile.io/watch-secret=true label set.

and use the regular client in the handleRegistryAuthSecret function?
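A minimal sketch of what that could look like; the handleRegistryAuthSecret name comes from the comment above, while the signature and wiring are illustrative assumptions, not the PR's code.

```go
package controllers

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	crclient "sigs.k8s.io/controller-runtime/pkg/client"
)

// handleRegistryAuthSecret resolves the registry auth secret using the regular
// cached client. This only works reliably if the secret carries the
// controller.devfile.io/watch-secret=true label, since the operator's cache is
// expected to watch only labelled secrets.
func handleRegistryAuthSecret(ctx context.Context, c crclient.Client, name, namespace string) (*corev1.Secret, error) {
	secret := &corev1.Secret{}
	if err := c.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, secret); err != nil {
		return nil, fmt.Errorf("getting registry auth secret %s/%s: %w", namespace, name, err)
	}
	if secret.Labels["controller.devfile.io/watch-secret"] != "true" {
		return nil, fmt.Errorf("secret %s/%s must have the controller.devfile.io/watch-secret=true label", namespace, name)
	}
	return secret, nil
}
```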
    // to the OpenShift internal registry.
    func (r *BackupCronJobReconciler) ensureImagePushRoleBinding(ctx context.Context, saName string, workspace *dw.DevWorkspace) error {
        // Create ClusterRoleBinding for system:image-builder role
        clusterRoleBinding := &rbacv1.ClusterRoleBinding{

Could you confirm that a RoleBinding doesn't suit and we need specifically a ClusterRoleBinding?
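To illustrate the namespaced alternative being asked about: a RoleBinding can reference the system:image-builder ClusterRole, which grants push access to the internal registry only within that namespace. A sketch (the helper and object names are illustrative, not the PR's code):

```go
package controllers

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// namespacedImagePushBinding grants the job's ServiceAccount the
// system:image-builder role only within the workspace namespace, instead of
// cluster-wide via a ClusterRoleBinding.
func namespacedImagePushBinding(saName, namespace string) *rbacv1.RoleBinding {
	return &rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{
			Name:      saName + "-image-builder",
			Namespace: namespace,
		},
		Subjects: []rbacv1.Subject{{
			Kind:      rbacv1.ServiceAccountKind,
			Name:      saName,
			Namespace: namespace,
		}},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "ClusterRole",
			Name:     "system:image-builder",
		},
	}
}
```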
        },
    }

    if _, err := controllerutil.CreateOrUpdate(ctx, r.Client, clusterRoleBinding, func() error { return nil }); err != nil {

Please use SyncObjectWithCluster.
@rohanKanojia for the error you're facing, I was able to work around it by having this in my config:
    Spec: corev1.PodSpec{
        ServiceAccountName: JobRunnerSAName + "-" + workspace.Status.DevWorkspaceId,
        RestartPolicy:      corev1.RestartPolicyNever,
        Containers: []corev1.Container{
For the backup job, I noticed that if there's a failure, the job is retried a maximum of 6 times, potentially leaving many pods. For example, here are the pods that were in my namespace while testing this feature and trying to back up a workspace (not all at once; I tried multiple times throughout the day):

For the cronjob implementation, is it possible to set successfulJobsHistoryLimit to 0 and failedJobsHistoryLimit to only 1?
https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/#jobs-history-limits

I would like to limit the number of pods, because I am a bit concerned we might overwhelm etcd if we have 1000s of devworkspaces and many failed pods that remain on the cluster. cc @ibuziuk
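A minimal sketch of where these limits live in the Kubernetes API, assuming the backups end up driven by a CronJob object (if the controller creates Jobs directly, only the BackoffLimit part applies). The values mirror the suggestion above; the object and image names are illustrative.

```go
package controllers

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// backupCronJob shows the history and retry limits discussed above.
func backupCronJob(namespace, schedule string) *batchv1.CronJob {
	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Name: "workspace-backup", Namespace: namespace},
		Spec: batchv1.CronJobSpec{
			Schedule:                   schedule,
			SuccessfulJobsHistoryLimit: int32Ptr(0), // drop pods of successful backups immediately
			FailedJobsHistoryLimit:     int32Ptr(1), // keep only the most recent failed run for debugging
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					BackoffLimit: int32Ptr(1), // avoid the default of 6 retries and the leftover pods they create
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy: corev1.RestartPolicyNever,
							Containers: []corev1.Container{
								{Name: "backup", Image: "registry.example.com/backup:latest"},
							},
						},
					},
				},
			},
		},
	}
}
```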
What does this PR do?

A new backup controller orchestrates a backup process for the workspace PVC. A new configuration option is added to DevWorkspaceOperatorConfig that enables running a regular cronjob responsible for the backup mechanism. The job executes the following steps:

The last step is currently not fully implemented, as it requires running buildah inside the container; it will be delivered as a separate feature.

What issues does this PR fix or reference?

Issue: eclipse-che/che#23570

Is it tested? How?

The feature has been tested locally and using integration tests. The following configuration should be added to the config to enable this feature:

After the config is added, stop any workspace and wait until a backup job is created. The job creates a backup and pushes the image to the registry.
PR Checklist
/test v8-devworkspace-operator-e2e, v8-che-happy-path (to trigger)
- v8-devworkspace-operator-e2e: DevWorkspace e2e test
- v8-che-happy-path: Happy path for verification integration with Che