Skip to content

MGMT-20153: Fix ostree race between node-image-pull and installing to rootfs #1076

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 2, 2025

Conversation

carbonin
Copy link
Member

@carbonin carbonin commented Mar 24, 2025

ostree container unencapsulate panics if the image passed to it is already present in the repo. Use ostree container image pull instead. Additionally attempting to pull the same os image as the node-image-pull service at the same time also can lead to errors writing to the repo. So make startBootstrap run synchronously with the rest of InstallNode to ensure we are done applying the node image before attempting to pull the image for install.

Resolves https://issues.redhat.com/browse/MGMT-20153

@openshift-ci-robot
Copy link

openshift-ci-robot commented Mar 24, 2025

@carbonin: This pull request references MGMT-20153 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

ostree container unencapsulate panics if the image passed to it is already present in the repo. Use ostree container image pull instead. Additionally attempting to pull the same os image as the node-image-pull service at the same time also can lead to errors writing to the repo. So make startBootstrap run synchronously with the rest of InstallNode to ensure we are don applying the node image before attempting to pull the image for install.

Resolves https://issues.redhat.com/browse/MGMT-20153

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 24, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2025
Copy link

openshift-ci bot commented Mar 24, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 24, 2025
@carbonin
Copy link
Member Author

Related to openshift/installer#9570

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 24, 2025
@carbonin carbonin force-pushed the ostree-image-pull branch from a3bddf7 to f14fb38 Compare March 24, 2025 20:06
@openshift-ci openshift-ci bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 24, 2025
@carbonin carbonin marked this pull request as ready for review March 24, 2025 20:13
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2025
Copy link

codecov bot commented Mar 24, 2025

Codecov Report

Attention: Patch coverage is 61.11111% with 28 lines in your changes missing coverage. Please review.

Project coverage is 55.59%. Comparing base (1076581) to head (4505593).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
src/ops/ops.go 59.42% 16 Missing and 12 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1076      +/-   ##
==========================================
- Coverage   55.67%   55.59%   -0.08%     
==========================================
  Files          15       15              
  Lines        3393     3421      +28     
==========================================
+ Hits         1889     1902      +13     
- Misses       1311     1317       +6     
- Partials      193      202       +9     
Files with missing lines Coverage Δ
src/installer/installer.go 68.37% <100.00%> (-0.21%) ⬇️
src/ops/ops.go 45.62% <59.42%> (+0.37%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@carbonin
Copy link
Member Author

/test edge-e2e-ai-operator-ztp

@carbonin
Copy link
Member Author

/retest

src/ops/ops.go Outdated
var ostreeOutputRegex = regexp.MustCompile(`Imported: (\w+)`)
func refFilePath(refName string) string {
return path.Join(ostreeRepo, "refs/heads", refName)
}

func (o *ops) importOSTreeCommit(liveLogger io.Writer) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe at some point worth thinking about ostree object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 I'm not sure what it gains us right now.
Also the not-so-long-term plan I think is to rip all this out and move to bootc so I don't know how much we really need to change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can put this stuff in a separate file if that makes it a bit more organized

src/ops/ops.go Outdated
@@ -174,6 +200,20 @@ func (o *ops) WriteImageToExistingRoot(liveLogger io.Writer, ignitionPath string
return errors.Wrapf(err, "failed to remount boot: %s", out)
}

// Remove the node image reference if it still exists, but also ensure it isn't garbage collected just yet
if o.FileExists(refFilePath(nodeImageOSTreeRefName)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets move this part to function

@openshift-ci-robot
Copy link

openshift-ci-robot commented Mar 25, 2025

@carbonin: This pull request references MGMT-20153 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

ostree container unencapsulate panics if the image passed to it is already present in the repo. Use ostree container image pull instead. Additionally attempting to pull the same os image as the node-image-pull service at the same time also can lead to errors writing to the repo. So make startBootstrap run synchronously with the rest of InstallNode to ensure we are done applying the node image before attempting to pull the image for install.

Resolves https://issues.redhat.com/browse/MGMT-20153

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@carbonin carbonin force-pushed the ostree-image-pull branch from f14fb38 to 71f250c Compare March 25, 2025 15:58
@cgwalters
Copy link
Member

I only skimmed this but FYI a lot of this I would consider internal implementation details subject to change, especically the ostree ref structures.

I'd like to have a good understanding of exactly what this is trying to do because it will need to be supported directly in bootc - and for bootc we have plans to switch to a new composefs-native store and stop doing the ostree-container flow, so directly invoking ostree will break when that happens.

src/ops/ops.go Outdated
}

func ostreeArgs(commit string, installerArgs []string) []string {
func ostreeArgs(ref string, installerArgs []string) []string {
ostreeArgs := []string{"admin", "deploy",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a live chat on this and I think you can drop a lot of the low-level ostree commands if you switch to using ostree container image deploy instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this command have something similar to the --karg-append and --karg-delete that I'm using below?

I see --karg, but not the others.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see something in the --help

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--karg is the same as --karg-append, now you're right it doesn't have --karg-delete right now; would be easy to add but can you elaborate on why it's needed?

We include almost no default kernel arguments so I am uncertain when one would need to delete any.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed it as bootc-dev/bootc#1229

In the short term it'd probably work to do the deploy, and then remove any unwanted kargs as a post-install step instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate on why it's needed?

Adding and deleting kargs is an API we've exposed through assisted because they were options for coreos-installer. I'm just trying to keep parity here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm (effectively) running this in my last attempt:

ostree container image deploy --stateroot install --sysroot /sysroot --image <rhcos-content-image-from-release>

Then the next boot fails.

I booted a live ISO and mounted the boot and sysroot partitions and I see the old boot entry in the boot partition's entry, but I see the new one in the sysroot /boot/loader/entries directory ... This doesn't seem right. Should I have maybe specified / for --sysroot in the ostree command?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rpm-ostree kargs --os=install --append/--delete ... should work to edit the kargs post-deployment. (Aside: opened coreos/rpm-ostree#5345 so you can also type --stateroot for consistency.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, try with --sysroot=/.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, try with --sysroot=/

This seems to have helped. Doing some more testing then will update the PR.

rpm-ostree kargs --os=install --append/--delete ... should work to edit the kargs post-deployment

Nice, thanks. I'll try this out too.

jlebon added a commit to jlebon/rpm-ostree that referenced this pull request Mar 26, 2025
jlebon added a commit to jlebon/rpm-ostree that referenced this pull request Mar 26, 2025
The `--stateroot` term is more often used now in libostree, so let's
match that here too.

Noticed this while reviewing
openshift/assisted-installer#1076.
jlebon added a commit to jlebon/rpm-ostree that referenced this pull request Mar 26, 2025
The `--stateroot` term is more often used now in libostree, so let's
match that here too.

Noticed this while reviewing
openshift/assisted-installer#1076.
jlebon added a commit to jlebon/rpm-ostree that referenced this pull request Mar 26, 2025
The `--stateroot` term is more often used now in libostree, so let's
match that here too.

Noticed this while reviewing
openshift/assisted-installer#1076.
@jlebon
Copy link
Member

jlebon commented Mar 26, 2025

Let's work on getting openshift/installer#9570 tested and merged so we can clean up the ref manipulations here a bit.

@carbonin
Copy link
Member Author

Let's work on getting openshift/installer#9570 tested and merged so we can clean up the ref manipulations here a bit.

I'll take a look at this after I get a reasonable general approach working. Once I have something that works and folks are happy with I'll test pulling out my addition to remove the node image ref in favor of the installer change.

I suspect if we want to test with 4.19 ec releases right after this is merged though I'll need to at least keep the touch in there then remove it after we get a build with your changes (or just after GA to ensure we're not using releases without the change)

`ostree container unencapsulate` panics if the image it is attempting to
import is already present on the system. This will be the case on the
bootstrap node if the node overlay service runs before installing to
disk. Instead, use container image deploy for installing to a stateroot.
This process only takes about a minute and when installing to the
existing rootfs in multi-node cases it is required that the node image
is downloaded before we attempt to use the same image to install.

Previously this race condition between layering the node image and
installing to the rootfs would cause the ostree commands to fail.
This requires calling `ostree admin finalize-staged` after install and
again after editing kargs, but this is currently the only way to handle
karg editing outside of unencapsulate
This makes the logic in WriteImageToExistingRoot easier to understand
@carbonin
Copy link
Member Author

This requires #1076. Without it the cleanup service fails on the installed node.

@carbonin
Copy link
Member Author

carbonin commented Apr 1, 2025

/retest

Copy link
Member

@jlebon jlebon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sane overall. One thing that might be worth mentioning in the first commit message is that with the switch to the layered node image soon, the unencapsulate flow will just not work anyway so it's good that we're switching away from that.

src/ops/ops.go Outdated
Comment on lines 157 to 167
lastArgIndex := len(installerArgs) - 1
for i, arg := range installerArgs {
if arg == "--append-karg" && i < lastArgIndex {
commandArgs = append(commandArgs, "--append", installerArgs[i+1])
continue
}
if arg == "--delete-karg" && i < lastArgIndex {
commandArgs = append(commandArgs, "--delete-if-present", installerArgs[i+1])
continue
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like we should error out if --append-karg or --delete-karg is last rather than do our best to work with it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a validation around this in assisted (where the user adds these). And if we don't I'd rather it fail there than here.

@@ -130,6 +130,19 @@ func (o *ops) SystemctlAction(action string, args ...string) error {
return errors.Wrapf(err, "Failed executing systemctl %s %s", action, args)
}

func (o *ops) remountFilesystems(liveLogger io.Writer) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, why do we need this BTW? ostree and rpm-ostree should both know to remount within a mount namespace to do their operations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least we still need boot because we're putting networking files and the ignition there. I never considered that I might be able to remove sysroot

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 1, 2025
Copy link

openshift-ci bot commented Apr 1, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: carbonin, jlebon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@carbonin
Copy link
Member Author

carbonin commented Apr 1, 2025

/retest

@carbonin
Copy link
Member Author

carbonin commented Apr 1, 2025

/hold

until the assisted-service PR is merged.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2025
Copy link

openshift-ci bot commented Apr 1, 2025

@carbonin: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@carbonin
Copy link
Member Author

carbonin commented Apr 2, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 2, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 0e954b1 into openshift:master Apr 2, 2025
20 checks passed
@carbonin carbonin deleted the ostree-image-pull branch April 2, 2025 12:45
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-orchestrator
This PR has been included in build ose-agent-installer-orchestrator-container-v4.19.0-202504021544.p0.g0e954b1.assembly.stream.el9.
All builds following this will include this PR.

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

Distgit: ose-agent-installer-csr-approver
This PR has been included in build ose-agent-installer-csr-approver-container-v4.20.0-202504021544.p0.g0e954b1.assembly.stream.el9.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants