@fangge1212
Contributor

Validate that AWS instance types and AMIs are compatible with AMD SEV-SNP confidential computing to fail fast during install-config validation instead of during cluster deployment.

  • Validate instance types have "amd-sev-snp" feature support
  • Validate AMIs have UEFI or UEFI-preferred boot mode
  • Support both custom and default RHCOS AMI validation
  • Add AMI metadata fetching and caching infrastructure
  • Add 13 test cases for SEV-SNP validation scenarios

@openshift-ci openshift-ci bot requested review from mtulio and tthvo November 25, 2025 06:25
@fangge1212 fangge1212 force-pushed the aws_amdsevsnp_validation branch from fcf95c6 to 6c827f8 Compare November 25, 2025 06:27
@tthvo
Member

tthvo commented Nov 25, 2025

/retitle NO-JIRA: aws: validate instance type and AMI compatibility for SEV-SNP

@openshift-ci openshift-ci bot changed the title from "aws: Validate instance type and AMI compatibility for SEV-SNP" to "NO-JIRA: aws: validate instance type and AMI compatibility for SEV-SNP" Nov 25, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 25, 2025
@openshift-ci-robot
Contributor

@fangge1212: This pull request explicitly references no jira issue.

@tthvo
Member

tthvo commented Nov 25, 2025

/label platform/aws

Member

@tthvo tthvo left a comment

Thank you for helping with this! I had a few comments, but overall it's great!

Comment on lines 26 to 27
cctx, cancel := context.WithTimeout(ctx, 1*time.Minute)
defer cancel()
Member

Suggested change
- cctx, cancel := context.WithTimeout(ctx, 1*time.Minute)
- defer cancel()

I think we should use the parent context without setting a timeout 👇

In the CI environment, we frequently run into timeouts caused by AWS API throttling (for example, OCPBUGS-65938). So, let's stay away from setting a low constant timeout.

vpcSubnets SubnetGroups
vpc VPC
instanceTypes map[string]InstanceType
images map[string]*ImageInfo
Member

Suggested change
- images map[string]*ImageInfo
+ images map[string]ImageInfo

nit: Since the image info is read-only, we can use values instead of pointers. It's also consistent with the instanceTypes field 😁

Comment on lines 463 to 468
if pool.CPUOptions != nil && pool.CPUOptions.ConfidentialCompute != nil {
	if err := validateAMIBootMode(ctx, meta, fldPath, platform, pool, arch); err != nil {
		allErrs = append(allErrs, err)
	}
}

Member

💡 This is great! Though, I have a few refactor suggestions below 👇, which look a bit simpler:

  1. We can define a func validateCPUOptions so that we can extend the validations if other configurations are added.
func validateCPUOptions(ctx context.Context, meta *Metadata, fldPath *field.Path, pool *awstypes.MachinePool) field.ErrorList {
	allErrs := field.ErrorList{}
	cpuOpts := pool.CPUOptions

	// Early return if no CPU options specified
	if cpuOpts == nil {
		return allErrs
	}

	// See the requirements for SEV-SNP support: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sev-snp.html#snp-requirements
	if cpuOpts.ConfidentialCompute != nil && *cpuOpts.ConfidentialCompute == awstypes.ConfidentialComputePolicySEVSNP {
		// Validate AMI boot mode for SEV-SNP
		allErrs = append(allErrs, validateAMIBootMode(ctx, meta, fldPath, pool)...)

		// Validate instance type for SEV-SNP
		allErrs = append(allErrs, validateInstanceTypeForSEVSNP(ctx, meta, fldPath, pool)...)
	}

	return allErrs
}

func validateInstanceTypeForSEVSNP(ctx context.Context, meta *Metadata, fldPath *field.Path, pool *awstypes.MachinePool) field.ErrorList {
	allErrs := field.ErrorList{}

	// Warn if using default instance type
	if pool.InstanceType == "" {
		logrus.Warn("AMD SEV-SNP confidential computing is enabled but no instance type is specified. The default instance type may not support amd-sev-snp")
		return allErrs
	}

	// Fetch instance types metadata
	instanceTypes, err := meta.InstanceTypes(ctx)
	if err != nil {
		return append(allErrs, field.InternalError(fldPath, err))
	}

	// Validate the specified instance type supports SEV-SNP
	// If the instance type is not found, it's already caught in validateMachinePool
	typeMeta, ok := instanceTypes[pool.InstanceType]
	if !ok {
		return allErrs
	}

	if !slices.Contains(typeMeta.Features, ec2.SupportedAdditionalProcessorFeatureAmdSevSnp) {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("type"), pool.InstanceType, "specified instance type in the specified region doesn't support amd-sev-snp"))
	}

	return allErrs
}

func validateAMIBootMode(ctx context.Context, meta *Metadata, fldPath *field.Path, pool *awstypes.MachinePool) field.ErrorList {
	allErrs := field.ErrorList{}

	amiID := pool.AMIID
	if amiID == "" {
		// Warn when using default AMI with SEV-SNP
		logrus.Warn("AMD SEV-SNP confidential computing is enabled but no custom AMI is specified. The default RHCOS AMI may not have UEFI boot mode enabled.")
		return allErrs
	}

	// Get image metadata
	imageInfo, err := meta.Images(ctx, amiID)
	if err != nil {
		return append(allErrs, field.Invalid(fldPath.Child("amiID"), amiID, fmt.Sprintf("unable to retrieve AMI metadata: %v", err)))
	}

	// Check if boot mode supports UEFI
	if imageInfo.BootMode != ec2.BootModeValuesUefi && imageInfo.BootMode != ec2.BootModeValuesUefiPreferred {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("amiID"), amiID, fmt.Sprintf("AMI boot mode must be 'uefi' or 'uefi-preferred' when using AMD SEV-SNP confidential computing, got '%s'", imageInfo.BootMode)))
	}

	return allErrs
}
  2. Then we use it in validateMachinePool as a top-level validation by checking pool.CPUOptions != nil, which is consistent with other validations.
func validateMachinePool(...) field.ErrorList {
	// ...output-omitted...

	if pool.CPUOptions != nil {
		allErrs = append(allErrs, validateCPUOptions(ctx, meta, fldPath, pool)...)
	}

	if len(pool.AdditionalSecurityGroupIDs) > 0 {
		allErrs = append(allErrs, validateSecurityGroupIDs(ctx, meta, fldPath.Child("additionalSecurityGroupIDs"), platform, pool)...)
	}

	// ...output-omitted...
}

Member

@fangge1212 You'll notice I suggest giving a warning message when .amiID or instanceType is empty, instead of defaulting.

Here are my reasons:

  1. Unfortunately, we don't have a unified place to set these defaults. The AMI defaulting logic might change in the future, so duplicating it in another place would make maintenance harder.
  2. The machine pool instance type is optional, and the default (e.g. m6i.xlarge) might not support AMD SEV-SNP at all. To validate it, we would need to duplicate the instance type defaulting logic too.

So, unless anyone complains, I'd say we go this way for now and can easily harden the validation after we unify the logic. WDYT?

Member

I think there might be a better way forward: the installer can figure out which AMI and instance type (if possible) to use when confidential compute AMD SEV-SNP is enabled 💡 🤓

But that seems like over-engineering unless anyone specifically requests it...

Contributor Author

I think what you said makes sense. I've updated the code to address all your points. Please review again when you get a chance.

Member

Thanks 👍

instanceTypes: validInstanceTypes(),
},
{
name: "valid UEFI AMI with enabling SEV-SNP on control plane",
Member

This case has the same settings as the test case "valid instance type with enabling SEV-SNP on control plane", right? We can combine the 2 cases into:

valid instance type with enabling SEV-SNP and UEFI AMI on control plane

The test cases for the compute pool also have duplicates 👀
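
A rough sketch of what the merged table could look like. Everything here is hypothetical and simplified: `poolCase`, `sevSNPCompatible`, and `cases` are illustrative names, and the check works on plain strings rather than the installer's install-config and metadata types. The point is that one "valid" row exercises both the instance type feature and the AMI boot mode, while each failure keeps its own row.

```go
package main

import (
	"errors"
	"fmt"
	"slices"
)

// poolCase is a hypothetical, flattened view of one machine pool under test.
type poolCase struct {
	name      string
	features  []string // processor features reported for the instance type
	bootMode  string   // boot mode reported for the AMI
	expectErr bool
}

// sevSNPCompatible is a simplified stand-in for the real validation: the
// instance type must list amd-sev-snp and the AMI must boot via UEFI.
func sevSNPCompatible(features []string, bootMode string) error {
	if !slices.Contains(features, "amd-sev-snp") {
		return errors.New("instance type doesn't support amd-sev-snp")
	}
	if bootMode != "uefi" && bootMode != "uefi-preferred" {
		return fmt.Errorf("AMI boot mode must be 'uefi' or 'uefi-preferred', got %q", bootMode)
	}
	return nil
}

// cases returns the table; the first row is the combined "valid" case.
func cases() []poolCase {
	return []poolCase{
		{
			name:     "valid instance type with enabling SEV-SNP and UEFI AMI on control plane",
			features: []string{"amd-sev-snp"},
			bootMode: "uefi",
		},
		{
			name:      "instance type without SEV-SNP support on control plane",
			features:  nil,
			bootMode:  "uefi",
			expectErr: true,
		},
		{
			name:      "legacy BIOS AMI with enabling SEV-SNP on control plane",
			features:  []string{"amd-sev-snp"},
			bootMode:  "legacy-bios",
			expectErr: true,
		},
	}
}

func main() {
	for _, tc := range cases() {
		err := sevSNPCompatible(tc.features, tc.bootMode)
		fmt.Printf("%s: as expected=%t\n", tc.name, (err != nil) == tc.expectErr)
	}
}
```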

@tthvo
Member

tthvo commented Nov 26, 2025

ci/prow/golint is currently not happy. I think we need to fix those and we can run ./hack/go-lint.sh to verify it... 😞

@tthvo
Member

tthvo commented Nov 26, 2025

/cc @patrickdillon

@openshift-ci openshift-ci bot requested a review from patrickdillon November 26, 2025 02:30
@fangge1212 fangge1212 force-pushed the aws_amdsevsnp_validation branch 2 times, most recently from 8d6bac6 to 242c9f8 Compare November 27, 2025 12:41
@fangge1212
Contributor Author

The lint failures seem to originate from other places, not from this PR.

Member

@tthvo tthvo left a comment

@fangge1212 Right, I agree! These issues are long-standing remnants: golint only checks source files with new changes, and this PR happens to touch that file.

Though, could I trouble you to apply a quick fix for those? I'd like to avoid overriding the golint job if possible. Thank you 🙏

diff --git a/pkg/asset/installconfig/aws/validation_test.go b/pkg/asset/installconfig/aws/validation_test.go
index 4e22616b60..392b49ba7d 100644
--- a/pkg/asset/installconfig/aws/validation_test.go
+++ b/pkg/asset/installconfig/aws/validation_test.go
@@ -1454,10 +1454,8 @@ func TestValidate(t *testing.T) {
                        err := Validate(context.TODO(), meta, test.installConfig)
                        if test.expectErr == "" {
                                assert.NoError(t, err)
-                       } else {
-                               if assert.Error(t, err) {
-                                       assert.Regexp(t, test.expectErr, err.Error())
-                               }
+                       } else if assert.Error(t, err) {
+                               assert.Regexp(t, test.expectErr, err.Error())
                        }
                })
        }
@@ -1585,10 +1583,8 @@ func TestValidateForProvisioning(t *testing.T) {
                        err := ValidateForProvisioning(route53Client, ic, meta)
                        if test.expectedErr == "" {
                                assert.NoError(t, err)
-                       } else {
-                               if assert.Error(t, err) {
-                                       assert.Regexp(t, test.expectedErr, err.Error())
-                               }
+                       } else if assert.Error(t, err) {
+                               assert.Regexp(t, test.expectedErr, err.Error())
                        }
                })
        }
@@ -1628,7 +1624,6 @@ func TestGetSubDomainDNSRecords(t *testing.T) {
        route53Client := mock.NewMockAPI(mockCtrl)
 
        for _, test := range cases {
-
                t.Run(test.name, func(t *testing.T) {
                        ic := icBuild.build(icBuild.withBaseDomain(test.baseDomain))
                        if test.expectedErr != "" {
@@ -1653,10 +1648,8 @@ func TestGetSubDomainDNSRecords(t *testing.T) {
                        _, err := route53Client.GetSubDomainDNSRecords(&validDomainOutput, ic, nil)
                        if test.expectedErr == "" {
                                assert.NoError(t, err)
-                       } else {
-                               if assert.Error(t, err) {
-                                       assert.Regexp(t, test.expectedErr, err.Error())
-                               }
+                       } else if assert.Error(t, err) {
+                               assert.Regexp(t, test.expectedErr, err.Error())
                        }
                })
        }

@fangge1212
Contributor Author

Ok, done

Member

@tthvo tthvo left a comment

/approve

Local testing looks good to me 👍 I noticed that a SEV-SNP-capable instance type (e.g. m6a.xlarge) is also available in regions that don't support SEV-SNP (e.g. us-east-1). This means someone might use the right AMI and instance type, but in an unsupported region.

Though, I think keeping a static region list is not a good idea either, since AWS can roll out support to new regions. As long as we call out the constraint in the docs, I think it's fine for now.

@yalzhang could you double check and add the verified label?

@openshift-ci
Contributor

openshift-ci bot commented Nov 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tthvo

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 28, 2025
Validate that AWS instance types and AMIs are compatible with AMD SEV-SNP
confidential computing to fail fast during install-config validation
instead of during cluster deployment.

- Validate instance types have "amd-sev-snp" feature support
- Validate AMIs have UEFI or UEFI-preferred boot mode

Signed-off-by: Fangge Jin <[email protected]>
Assisted-by: Claude Code

Simplify nested if-else statements to use else-if and remove
unnecessary blank lines in validation_test.go.

Signed-off-by: Fangge Jin <[email protected]>
@fangge1212 fangge1212 force-pushed the aws_amdsevsnp_validation branch from 58e904e to f2fbd64 Compare December 2, 2025 03:45
@tthvo
Member

tthvo commented Dec 2, 2025

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Dec 2, 2025

@fangge1212: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-aws-ovn-heterogeneous | f2fbd64 | link | false | /test e2e-aws-ovn-heterogeneous

@yalzhang
yalzhang commented Dec 3, 2025

/verified by yalzhang

@tthvo
Member

tthvo commented Dec 3, 2025

/verified by yalzhang

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Dec 3, 2025
@openshift-ci-robot
Contributor

@tthvo: This PR has been marked as verified by yalzhang.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • platform/aws
  • verified: Signifies that the PR passed pre-merge verification criteria
