MGMT-20153: Fix ostree race between node-image-pull and installing to rootfs #1076
@carbonin: This pull request references MGMT-20153 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
Related to openshift/installer#9570
Force-pushed from a3bddf7 to f14fb38.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #1076 +/- ##
==========================================
- Coverage 55.67% 55.59% -0.08%
==========================================
Files 15 15
Lines 3393 3421 +28
==========================================
+ Hits 1889 1902 +13
- Misses 1311 1317 +6
- Partials 193 202 +9
/test edge-e2e-ai-operator-ztp
/retest
src/ops/ops.go
Outdated
var ostreeOutputRegex = regexp.MustCompile(`Imported: (\w+)`)
func refFilePath(refName string) string {
	return path.Join(ostreeRepo, "refs/heads", refName)
}

func (o *ops) importOSTreeCommit(liveLogger io.Writer) (string, error) {
Maybe at some point it's worth thinking about an ostree object?
🤷 I'm not sure what it gains us right now.
Also the not-so-long-term plan I think is to rip all this out and move to bootc so I don't know how much we really need to change.
I can put this stuff in a separate file if that makes it a bit more organized
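The `ostreeOutputRegex` pattern in the snippet above can be exercised on its own. This is a minimal sketch, assuming the import command prints a line of the form `Imported: <checksum>`; the helper name `parseImportedCommit` and the sample output are hypothetical and not taken from the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// Same pattern as in the reviewed snippet.
var ostreeOutputRegex = regexp.MustCompile(`Imported: (\w+)`)

// parseImportedCommit is a hypothetical helper. It returns the first
// captured checksum, or an error when no "Imported:" line is present.
func parseImportedCommit(output string) (string, error) {
	m := ostreeOutputRegex.FindStringSubmatch(output)
	if m == nil {
		return "", fmt.Errorf("no imported commit found in output %q", output)
	}
	return m[1], nil
}

func main() {
	// The sample output below is made up for illustration.
	commit, err := parseImportedCommit("Fetching layers...\nImported: 0123abcd\n")
	fmt.Println(commit, err)
}
```

Returning an error on a missing match, rather than an empty string, keeps a failed import from being passed on as a commit to deploy.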
src/ops/ops.go
Outdated
@@ -174,6 +200,20 @@ func (o *ops) WriteImageToExistingRoot(liveLogger io.Writer, ignitionPath string
		return errors.Wrapf(err, "failed to remount boot: %s", out)
	}

	// Remove the node image reference if it still exists, but also ensure it isn't garbage collected just yet
	if o.FileExists(refFilePath(nodeImageOSTreeRefName)) {
Let's move this part to a function.
Force-pushed from f14fb38 to 71f250c.
I only skimmed this but FYI a lot of this I would consider internal implementation details subject to change, especially the ostree ref structures. I'd like to have a good understanding of exactly what this is trying to do because it will need to be supported directly in bootc - and for bootc we have plans to switch to a new composefs-native store and stop doing the ostree-container flow, so directly invoking
src/ops/ops.go
Outdated
}

-func ostreeArgs(commit string, installerArgs []string) []string {
+func ostreeArgs(ref string, installerArgs []string) []string {
	ostreeArgs := []string{"admin", "deploy",
We had a live chat on this and I think you can drop a lot of the low-level ostree commands if you switch to using `ostree container image deploy` instead.
Does this command have something similar to the `--karg-append` and `--karg-delete` that I'm using below? I see `--karg`, but not the others.
I don't see something in the --help
`--karg` is the same as `--karg-append`. You're right that it doesn't have `--karg-delete` right now; it would be easy to add, but can you elaborate on why it's needed?
We include almost no default kernel arguments so I am uncertain when one would need to delete any.
Filed it as bootc-dev/bootc#1229
In the short term it'd probably work to do the deploy, and then remove any unwanted kargs as a post-install step instead.
> can you elaborate on why it's needed?

Adding and deleting kargs is an API we've exposed through assisted because they were options for coreos-installer. I'm just trying to keep parity here.
I'm (effectively) running this in my last attempt:
ostree container image deploy --stateroot install --sysroot /sysroot --image <rhcos-content-image-from-release>
Then the next boot fails.
I booted a live ISO and mounted the boot and sysroot partitions and I see the old boot entry in the boot partition's entry, but I see the new one in the sysroot `/boot/loader/entries` directory ... This doesn't seem right. Should I have maybe specified `/` for `--sysroot` in the ostree command?
`rpm-ostree kargs --os=install --append/--delete ...` should work to edit the kargs post-deployment. (Aside: opened coreos/rpm-ostree#5345 so you can also type `--stateroot` for consistency.)
Yeah, try with `--sysroot=/`.
> Yeah, try with `--sysroot=/`

This seems to have helped. Doing some more testing then will update the PR.

> `rpm-ostree kargs --os=install --append/--delete ...` should work to edit the kargs post-deployment

Nice, thanks. I'll try this out too.
The `--stateroot` term is more often used now in libostree, so let's match that here too. Noticed this while reviewing openshift/assisted-installer#1076.
Let's work on getting openshift/installer#9570 tested and merged so we can clean up the ref manipulations here a bit.
I'll take a look at this after I get a reasonable general approach working. Once I have something that works and that folks are happy with, I'll test pulling out my addition to remove the node image ref in favor of the installer change. I suspect if we want to test with 4.19 ec releases right after this is merged though I'll need to at least keep the
`ostree container unencapsulate` panics if the image it is attempting to import is already present on the system. This will be the case on the bootstrap node if the node overlay service runs before installing to disk. Instead, use container image deploy for installing to a stateroot.
This process only takes about a minute and when installing to the existing rootfs in multi-node cases it is required that the node image is downloaded before we attempt to use the same image to install. Previously this race condition between layering the node image and installing to the rootfs would cause the ostree commands to fail.
This requires calling `ostree admin finalize-staged` after install and again after editing kargs, but this is currently the only way to handle karg editing outside of unencapsulate.
This makes the logic in `WriteImageToExistingRoot` easier to understand.
Force-pushed from 71f250c to 4505593.
This requires #1076. Without it the cleanup service fails on the installed node.
/retest
Looks sane overall. One thing that might be worth mentioning in the first commit message is that with the switch to the layered node image soon, the unencapsulate flow will just not work anyway so it's good that we're switching away from that.
src/ops/ops.go
Outdated
	lastArgIndex := len(installerArgs) - 1
	for i, arg := range installerArgs {
		if arg == "--append-karg" && i < lastArgIndex {
			commandArgs = append(commandArgs, "--append", installerArgs[i+1])
			continue
		}
		if arg == "--delete-karg" && i < lastArgIndex {
			commandArgs = append(commandArgs, "--delete-if-present", installerArgs[i+1])
			continue
		}
	}
Feels like we should error out if `--append-karg` or `--delete-karg` is last rather than do our best to work with it.
I think we have a validation around this in assisted (where the user adds these). And if we don't I'd rather it fail there than here.
@@ -130,6 +130,19 @@ func (o *ops) SystemctlAction(action string, args ...string) error {
	return errors.Wrapf(err, "Failed executing systemctl %s %s", action, args)
}

func (o *ops) remountFilesystems(liveLogger io.Writer) error {
Hmm, why do we need this BTW? ostree and rpm-ostree should both know to remount within a mount namespace to do their operations.
At the very least we still need boot because we're putting networking files and the ignition there. I never considered that I might be able to remove sysroot.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: carbonin, jlebon The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest
/hold until the assisted-service PR is merged.
@carbonin: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/unhold
[ART PR BUILD NOTIFIER] Distgit: ose-agent-installer-orchestrator
[ART PR BUILD NOTIFIER] Distgit: ose-agent-installer-csr-approver
`ostree container unencapsulate` panics if the image passed to it is already present in the repo. Use `ostree container image pull` instead. Additionally, attempting to pull the same OS image as the `node-image-pull` service at the same time can also lead to errors writing to the repo. So make `startBootstrap` run synchronously with the rest of `InstallNode` to ensure we are done applying the node image before attempting to pull the image for install.

Resolves https://issues.redhat.com/browse/MGMT-20153