NO-JIRA: aws: validate instance type and AMI compatibility for SEV-SNP #10126

fangge1212 · 2025-11-25T06:24:12Z

Validate that AWS instance types and AMIs are compatible with AMD SEV-SNP confidential computing to fail fast during install-config validation instead of during cluster deployment.

- Validate instance types have "amd-sev-snp" feature support
- Validate AMIs have UEFI or UEFI-preferred boot mode
- Support both custom and default RHCOS AMI validation
- Add AMI metadata fetching and caching infrastructure
- Add 13 test cases for SEV-SNP validation scenarios

tthvo · 2025-11-25T18:32:50Z

/retitle NO-JIRA: aws: validate instance type and AMI compatibility for SEV-SNP

openshift-ci-robot · 2025-11-25T18:32:57Z

@fangge1212: This pull request explicitly references no jira issue.

tthvo · 2025-11-25T18:34:22Z

/label platform/aws

tthvo

Thank you for helping with this! I had a few comments, but overall it's great!

tthvo · 2025-11-25T20:53:00Z

pkg/asset/installconfig/aws/images.go

+	cctx, cancel := context.WithTimeout(ctx, 1*time.Minute)
+	defer cancel()


Suggested change

-	cctx, cancel := context.WithTimeout(ctx, 1*time.Minute)
-	defer cancel()

I think we should use the parent context without setting a timeout 👇

In the CI environment, we frequently hit timeouts caused by AWS API throttling (for example, OCPBUGS-65938), so let's avoid setting a low constant timeout.

tthvo · 2025-11-25T20:57:33Z

pkg/asset/installconfig/aws/metadata.go

 	vpcSubnets        SubnetGroups
 	vpc               VPC
 	instanceTypes     map[string]InstanceType
+	images            map[string]*ImageInfo


Suggested change

-	images            map[string]*ImageInfo
+	images            map[string]ImageInfo

nit: Since the image info is read-only, we can use values instead of pointers. It's also consistent with instanceTypes field 😁
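To illustrate the value-versus-pointer point, here is a minimal sketch of a cache keyed by AMI ID that stores entries by value. The `ImageInfo` fields, the `imageCache` type, and the `lookup` callback are assumptions made for this example and are not the PR's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// ImageInfo is a simplified, hypothetical stand-in for the AMI metadata.
type ImageInfo struct {
	ID       string
	BootMode string
}

// imageCache stores ImageInfo by value, so callers receive copies and
// cannot modify cached entries by accident.
type imageCache struct {
	mu     sync.Mutex
	images map[string]ImageInfo
	lookup func(id string) (ImageInfo, error) // hypothetical metadata fetcher
	calls  int                                // number of lookups actually performed
}

// Get returns the cached entry for id, fetching and caching it on a miss.
func (c *imageCache) Get(id string) (ImageInfo, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if info, ok := c.images[id]; ok {
		return info, nil
	}

	c.calls++
	info, err := c.lookup(id)
	if err != nil {
		return ImageInfo{}, fmt.Errorf("fetching image %s: %w", id, err)
	}

	if c.images == nil {
		c.images = map[string]ImageInfo{}
	}
	c.images[id] = info
	return info, nil
}
```

Because `Get` returns a copy, a caller that changes `BootMode` on its result leaves the cached entry untouched, which matches the read-only intent described in the comment.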

tthvo · 2025-11-26T01:37:52Z

pkg/asset/installconfig/aws/validation.go

+	if pool.CPUOptions != nil && pool.CPUOptions.ConfidentialCompute != nil {
+		if err := validateAMIBootMode(ctx, meta, fldPath, platform, pool, arch); err != nil {
+			allErrs = append(allErrs, err)
+		}
+	}
+


💡 This is great! Though, I have a refactoring suggestion below 👇 that looks a bit simpler:

We can define a func validateCPUOptions so that we can extend the validations if other configurations are added.

func validateCPUOptions(ctx context.Context, meta *Metadata, fldPath *field.Path, pool *awstypes.MachinePool) field.ErrorList {
	allErrs := field.ErrorList{}
	cpuOpts := pool.CPUOptions

	// Early return if no CPU options specified
	if cpuOpts == nil {
		return allErrs
	}

	// See requirements sev-snp support: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sev-snp.html#snp-requirements
	if cpuOpts.ConfidentialCompute != nil && *cpuOpts.ConfidentialCompute == awstypes.ConfidentialComputePolicySEVSNP {
		// Validate AMI boot mode for SEV-SNP
		allErrs = append(allErrs, validateAMIBootMode(ctx, meta, fldPath, pool)...)

		// Validate instance type for SEV-SNP
		allErrs = append(allErrs, validateInstanceTypeForSEVSNP(ctx, meta, fldPath, pool)...)
	}

	return allErrs
}

func validateInstanceTypeForSEVSNP(ctx context.Context, meta *Metadata, fldPath *field.Path, pool *awstypes.MachinePool) field.ErrorList {
	allErrs := field.ErrorList{}

	// Warn if using default instance type
	if pool.InstanceType == "" {
		logrus.Warn("AMD SEV-SNP confidential computing is enabled but no instance type is specified. The default instance type may not support amd-sev-snp")
		return allErrs
	}

	// Fetch instance types metadata
	instanceTypes, err := meta.InstanceTypes(ctx)
	if err != nil {
		return append(allErrs, field.InternalError(fldPath, err))
	}

	// Validate the specified instance type supports SEV-SNP
	// If the instance type is not found, it's already caught in validateMachinePool
	typeMeta, ok := instanceTypes[pool.InstanceType]
	if !ok {
		return allErrs
	}

	if !slices.Contains(typeMeta.Features, ec2.SupportedAdditionalProcessorFeatureAmdSevSnp) {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("type"), pool.InstanceType, "specified instance type in the specified region doesn't support amd-sev-snp"))
	}

	return allErrs
}

func validateAMIBootMode(ctx context.Context, meta *Metadata, fldPath *field.Path, pool *awstypes.MachinePool) field.ErrorList {
	allErrs := field.ErrorList{}

	amiID := pool.AMIID
	if amiID == "" {
		// Warn when using default AMI with SEV-SNP
		logrus.Warn("AMD SEV-SNP confidential computing is enabled but no custom AMI is specified. The default RHCOS AMI may not have UEFI boot mode enabled.")
		return allErrs
	}

	// Get image metadata
	imageInfo, err := meta.Images(ctx, amiID)
	if err != nil {
		return append(allErrs, field.Invalid(fldPath.Child("amiID"), amiID, fmt.Sprintf("unable to retrieve AMI metadata: %v", err)))
	}

	// Check if boot mode supports UEFI
	if imageInfo.BootMode != ec2.BootModeValuesUefi && imageInfo.BootMode != ec2.BootModeValuesUefiPreferred {
		allErrs = append(allErrs, field.Invalid(fldPath.Child("amiID"), amiID, fmt.Sprintf("AMI boot mode must be 'uefi' or 'uefi-preferred' when using AMD SEV-SNP confidential computing, got '%s'", imageInfo.BootMode)))
	}

	return allErrs
}

Then we use it in validateMachinePool as a top-level validation by checking pool.CPUOptions != nil, which is consistent with other validations.

func validateMachinePool(...) field.ErrorList {
	// ...output-omitted...
	if pool.CPUOptions != nil {
		allErrs = append(allErrs, validateCPUOptions(ctx, meta, fldPath, pool)...)
	}

	if len(pool.AdditionalSecurityGroupIDs) > 0 {
		allErrs = append(allErrs, validateSecurityGroupIDs(ctx, meta, fldPath.Child("additionalSecurityGroupIDs"), platform, pool)...)
	}
	// ...output-omitted...
}
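For readers who want to try the decision logic outside the installer, the following is a simplified, self-contained sketch. The `pool` and `catalog` types and the `validateSEVSNP` function are hypothetical stand-ins rather than the installer's API; only the rules (require the "amd-sev-snp" feature, require a `uefi` or `uefi-preferred` boot mode, skip checks for empty fields) follow the suggestion above:

```go
package main

import (
	"fmt"
	"slices"
)

// pool is a hypothetical, trimmed-down machine pool.
type pool struct {
	InstanceType string
	AMIID        string
	SEVSNP       bool // true when AMD SEV-SNP confidential compute is requested
}

// catalog is a hypothetical in-memory replacement for the AWS metadata lookups.
type catalog struct {
	features  map[string][]string // instance type -> supported processor features
	bootModes map[string]string   // AMI ID -> boot mode
}

// validateSEVSNP returns one error per incompatibility. Empty instance type
// or AMI ID is skipped here, mirroring the warn-instead-of-default approach.
func validateSEVSNP(p pool, c catalog) []error {
	var errs []error
	if !p.SEVSNP {
		return errs
	}

	if p.InstanceType != "" {
		// Unknown instance types are assumed to be reported by another check.
		if feats, ok := c.features[p.InstanceType]; ok && !slices.Contains(feats, "amd-sev-snp") {
			errs = append(errs, fmt.Errorf("instance type %q does not support amd-sev-snp", p.InstanceType))
		}
	}

	if p.AMIID != "" {
		mode, ok := c.bootModes[p.AMIID]
		switch {
		case !ok:
			errs = append(errs, fmt.Errorf("unable to retrieve metadata for AMI %q", p.AMIID))
		case mode != "uefi" && mode != "uefi-preferred":
			errs = append(errs, fmt.Errorf("AMI %q has boot mode %q, want uefi or uefi-preferred", p.AMIID, mode))
		}
	}

	return errs
}
```

With a catalog whose sample data marks one instance type as lacking the feature and one AMI as `legacy-bios`, a pool that uses both yields two errors, while a pool that leaves both fields empty yields none.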

@fangge1212 You'll notice I suggest giving a warning message when .amiID or instanceType is empty, instead of defaulting.

Here are my reasons:

Unfortunately, we don't have a unified place to set these defaults. The AMI defaulting logic might change in the future, so duplicating it in another place would make it harder to maintain.

The machine pool instance type is optional, and the default (e.g. m6i.xlarge) might not support AMD SEV-SNP at all. So we would need to duplicate that defaulting logic too.

So, unless anyone complains, I'd say we go this way for now and can easily harden the validation after we unify the logic. WDYT?

I think there might be a better way forward: the installer can figure out which AMI and instance type (if possible) to use when confidential compute AMD SEV-SNP is enabled 💡 🤓

But that seems like over-engineering unless someone specifically requests it...

I think what you said makes sense. I've updated the code to address all your points. Please review again when you get a chance.

Thanks 👍

tthvo · 2025-11-26T01:59:11Z

pkg/asset/installconfig/aws/validation_test.go

+			instanceTypes: validInstanceTypes(),
+		},
+		{
+			name: "valid UEFI AMI with enabling SEV-SNP on control plane",


This case has the same settings as the test case "valid instance type with enabling SEV-SNP on control plane", right? We can combine the two cases into:

valid instance type with enabling SEV-SNP and UEFI AMI on control plane

The test cases for the compute pool also have duplicates 👀

tthvo · 2025-11-26T02:19:55Z

ci/prow/golint is currently not happy. I think we need to fix those and we can run ./hack/go-lint.sh to verify it... 😞

tthvo · 2025-11-26T02:30:18Z

/cc @patrickdillon

fangge1212 · 2025-11-27T12:46:40Z

The lint failures seem to originate from other places, not from this PR.

tthvo

@fangge1212 Right, agreed! These issues are long-standing remnants: golint only considers source files with "new changes", and this PR happens to touch that file.

Though, could I trouble you to apply a quick fix for those? I'd like to avoid overriding the golint job if possible. Thank you 🙏

diff --git a/pkg/asset/installconfig/aws/validation_test.go b/pkg/asset/installconfig/aws/validation_test.go
index 4e22616b60..392b49ba7d 100644
--- a/pkg/asset/installconfig/aws/validation_test.go
+++ b/pkg/asset/installconfig/aws/validation_test.go
@@ -1454,10 +1454,8 @@ func TestValidate(t *testing.T) {
                        err := Validate(context.TODO(), meta, test.installConfig)
                        if test.expectErr == "" {
                                assert.NoError(t, err)
-                       } else {
-                               if assert.Error(t, err) {
-                                       assert.Regexp(t, test.expectErr, err.Error())
-                               }
+                       } else if assert.Error(t, err) {
+                               assert.Regexp(t, test.expectErr, err.Error())
                        }
                })
        }
@@ -1585,10 +1583,8 @@ func TestValidateForProvisioning(t *testing.T) {
                        err := ValidateForProvisioning(route53Client, ic, meta)
                        if test.expectedErr == "" {
                                assert.NoError(t, err)
-                       } else {
-                               if assert.Error(t, err) {
-                                       assert.Regexp(t, test.expectedErr, err.Error())
-                               }
+                       } else if assert.Error(t, err) {
+                               assert.Regexp(t, test.expectedErr, err.Error())
                        }
                })
        }
@@ -1628,7 +1624,6 @@ func TestGetSubDomainDNSRecords(t *testing.T) {
        route53Client := mock.NewMockAPI(mockCtrl)
 
        for _, test := range cases {
-
                t.Run(test.name, func(t *testing.T) {
                        ic := icBuild.build(icBuild.withBaseDomain(test.baseDomain))
                        if test.expectedErr != "" {
@@ -1653,10 +1648,8 @@ func TestGetSubDomainDNSRecords(t *testing.T) {
                        _, err := route53Client.GetSubDomainDNSRecords(&validDomainOutput, ic, nil)
                        if test.expectedErr == "" {
                                assert.NoError(t, err)
-                       } else {
-                               if assert.Error(t, err) {
-                                       assert.Regexp(t, test.expectedErr, err.Error())
-                               }
+                       } else if assert.Error(t, err) {
+                               assert.Regexp(t, test.expectedErr, err.Error())
                        }
                })
        }

fangge1212 · 2025-11-28T01:59:45Z

Ok, done

tthvo

/approve

Local testing looks good to me 👍 I notice that an SEV-SNP-capable instance type (e.g. m6a.xlarge) is also available in an unsupported region (e.g. us-east-1). This means someone might use the right AMI and instance type, but in an invalid region.

Though, I think keeping a static region list is not a good idea either, since AWS can roll out support to new regions. As long as we call out the constraints in the docs, I think it's fine for now.

@yalzhang could you double check and add the verified label?

openshift-ci · 2025-11-28T03:18:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tthvo

Needs approval from an approver in each of these files:

~~pkg/asset/installconfig/aws/OWNERS~~ [tthvo]


Validate that AWS instance types and AMIs are compatible with AMD SEV-SNP confidential computing to fail fast during install-config validation instead of during cluster deployment.
- Validate instance types have "amd-sev-snp" feature support
- Validate AMIs have UEFI or UEFI-preferred boot mode
Signed-off-by: Fangge Jin <[email protected]>
Assisted-by: Claude Code

Simplify nested if-else statements to use else-if and remove unnecessary blank lines in validation_test.go.
Signed-off-by: Fangge Jin <[email protected]>

tthvo · 2025-12-02T18:46:35Z

/retest-required

openshift-ci · 2025-12-02T21:53:10Z

@fangge1212: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-ovn-heterogeneous | `f2fbd64` | link | false | `/test e2e-aws-ovn-heterogeneous` |

yalzhang · 2025-12-03T03:09:36Z

/verified by yalzhang

tthvo · 2025-12-03T04:51:37Z

/verified by yalzhang

openshift-ci-robot · 2025-12-03T04:51:50Z

@tthvo: This PR has been marked as verified by yalzhang.

openshift-ci bot requested review from mtulio and tthvo November 25, 2025 06:25

fangge1212 force-pushed the aws_amdsevsnp_validation branch from fcf95c6 to 6c827f8 Compare November 25, 2025 06:27

openshift-ci bot changed the title ~~aws: Validate instance type and AMI compatibility for SEV-SNP~~ NO-JIRA: aws: validate instance type and AMI compatibility for SEV-SNP Nov 25, 2025

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Nov 25, 2025

openshift-ci bot added the platform/aws label Nov 25, 2025

tthvo reviewed Nov 26, 2025

openshift-ci bot requested a review from patrickdillon November 26, 2025 02:30

fangge1212 force-pushed the aws_amdsevsnp_validation branch 2 times, most recently from 8d6bac6 to 242c9f8 Compare November 27, 2025 12:41

tthvo reviewed Nov 27, 2025

tthvo reviewed Nov 28, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 28, 2025

fangge1212 added 2 commits December 1, 2025 22:43

aws: Fix lint failures in validation tests

f2fbd64

fangge1212 force-pushed the aws_amdsevsnp_validation branch from 58e904e to f2fbd64 Compare December 2, 2025 03:45

openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Dec 3, 2025
