-
Notifications
You must be signed in to change notification settings - Fork 1.5k
NO-JIRA: aws: validate instance type and AMI compatibility for SEV-SNP #10126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
fcf95c6 to
6c827f8
Compare
|
/retitle NO-JIRA: aws: validate instance type and AMI compatibility for SEV-SNP |
|
@fangge1212: This pull request explicitly references no jira issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/label platform/aws |
tthvo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for helping with this! I had a few comments, but overall it's great!
| cctx, cancel := context.WithTimeout(ctx, 1*time.Minute) | ||
| defer cancel() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| cctx, cancel := context.WithTimeout(ctx, 1*time.Minute) | |
| defer cancel() |
I think we should use the parent context without setting a timeout 👇
In CI environment, we frequently run into timeout problems due to AWS API throttling, for example, OCPBUGS-65938. So, let's stay away from setting a low constant timeout.
| vpcSubnets SubnetGroups | ||
| vpc VPC | ||
| instanceTypes map[string]InstanceType | ||
| images map[string]*ImageInfo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| images map[string]*ImageInfo | |
| images map[string]ImageInfo |
nit: Since the image info is read-only, we can use values instead of pointers. It's also consistent with instanceTypes field 😁
| if pool.CPUOptions != nil && pool.CPUOptions.ConfidentialCompute != nil { | ||
| if err := validateAMIBootMode(ctx, meta, fldPath, platform, pool, arch); err != nil { | ||
| allErrs = append(allErrs, err) | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 This is great! Though, I have a refactor suggestions below 👇, which looks a bit simpler:
- We can define a func
validateCPUOptionsso that we can extend the validations if other configurations are added.
func validateCPUOptions(ctx context.Context, meta *Metadata, fldPath *field.Path, pool *awstypes.MachinePool) field.ErrorList {
allErrs := field.ErrorList{}
cpuOpts := pool.CPUOptions
// Early return if no CPU options specified
if cpuOpts == nil {
return allErrs
}
// See requirements sev-snp support: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/sev-snp.html#snp-requirements
if cpuOpts.ConfidentialCompute != nil && *cpuOpts.ConfidentialCompute == awstypes.ConfidentialComputePolicySEVSNP {
// Validate AMI boot mode for SEV-SNP
allErrs = append(allErrs, validateAMIBootMode(ctx, meta, fldPath, pool)...)
// Validate instance type for SEV-SNP
allErrs = append(allErrs, validateInstanceTypeForSEVSNP(ctx, meta, fldPath, pool)...)
}
return allErrs
}
func validateInstanceTypeForSEVSNP(ctx context.Context, meta *Metadata, fldPath *field.Path, pool *awstypes.MachinePool) field.ErrorList {
allErrs := field.ErrorList{}
// Warn if using default instance type
if pool.InstanceType == "" {
logrus.Warn("AMD SEV-SNP confidential computing is enabled but no instance type is specified. The default instance type may not support amd-sev-snp")
return allErrs
}
// Fetch instance types metadata
instanceTypes, err := meta.InstanceTypes(ctx)
if err != nil {
return append(allErrs, field.InternalError(fldPath, err))
}
// Validate the specified instance type supports SEV-SNP
// If the instance type is not found, it's already caught in validateMachinePool
typeMeta, ok := instanceTypes[pool.InstanceType]
if !ok {
return allErrs
}
if !slices.Contains(typeMeta.Features, ec2.SupportedAdditionalProcessorFeatureAmdSevSnp) {
allErrs = append(allErrs, field.Invalid(fldPath.Child("type"), pool.InstanceType, "specified instance type in the specified region doesn't support amd-sev-snp"))
}
return allErrs
}
func validateAMIBootMode(ctx context.Context, meta *Metadata, fldPath *field.Path, pool *awstypes.MachinePool) field.ErrorList {
allErrs := field.ErrorList{}
amiID := pool.AMIID
if amiID == "" {
// Warn when using default AMI with SEV-SNP
logrus.Warn("AMD SEV-SNP confidential computing is enabled but no custom AMI is specified. The default RHCOS AMI may not have UEFI boot mode enabled.")
return allErrs
}
// Get image metadata
imageInfo, err := meta.Images(ctx, amiID)
if err != nil {
return append(allErrs, field.Invalid(fldPath.Child("amiID"), amiID, fmt.Sprintf("unable to retrieve AMI metadata: %v", err)))
}
// Check if boot mode supports UEFI
if imageInfo.BootMode != ec2.BootModeValuesUefi && imageInfo.BootMode != ec2.BootModeValuesUefiPreferred {
allErrs = append(allErrs, field.Invalid(fldPath.Child("amiID"), amiID, fmt.Sprintf("AMI boot mode must be 'uefi' or 'uefi-preferred' when using AMD SEV-SNP confidential computing, got '%s'", imageInfo.BootMode)))
}
return allErrs
}- Then we use it in
validateMachinePoolas a top-level validation by checkingpool.CPUOptions != nil, which is consistent with other validations.
func validateMachinePool(...) field.ErrorList {
// ...output-omitted...
if pool.CPUOptions != nil {
allErrs = append(allErrs, validateCPUOptions(ctx, meta, fldPath, pool)...)
}
if len(pool.AdditionalSecurityGroupIDs) > 0 {
allErrs = append(allErrs, validateSecurityGroupIDs(ctx, meta, fldPath.Child("additionalSecurityGroupIDs"), platform, pool)...)
}
// ...output-omitted...
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fangge1212 You'll notice I suggest giving a warning message when .amiID or instanceType is empty, instead of defaulting.
Here are my reasons:
- Unfortunately, we don't have a unified place to set these defaults. This AMI defaulting logic might change in the future; thus, it makes it harder for maintenance if done in another place.
- The machinepool instance type is optional. The default (e.g.
m6i.xlarge) might not supportAMD SEV-SNPat all. Thus, we need to duplicate the defaulting logic too.
So, unless anyone complains, I'd say we go this way for now and can easily harden the validation after we unify the logic. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there might be a better way forward: the installer can figure out which AMI and instance type (if possible) to use when confidential compute AMD SEV-SNP is enabled 💡 🤓
But that seems over-engineering somehow unless anyone specifically requests it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you said makes sense. I've updated the code to address all your points. Please review again
when you get a chance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks 👍
| instanceTypes: validInstanceTypes(), | ||
| }, | ||
| { | ||
| name: "valid UEFI AMI with enabling SEV-SNP on control plane", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case has the same setting as test case valid instance type with enabling SEV-SNP on control plane, right? We can combine the 2 cases into:
valid instance type with enabling SEV-SNP and UEFI AMI on control plane
Test cases for compute pool also has duplicates 👀
|
ci/prow/golint is currently not happy. I think we need to fix those and we can run |
|
/cc @patrickdillon |
8d6bac6 to
242c9f8
Compare
The lint failures seem to origin from other places, not from this pr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ci/prow/golint is currently not happy. I think we need to fix those and we can run
./hack/go-lint.shto verify it... 😞The lint failures seem to origin from other places, not from this pr
@fangge1212 Right, I agreed! These issues are long-standing remnants as golint only considers source files with "new changes". This PR happens to hit that file.
Though, would I trouble you to apply a quick fix for those? I'd like to avoid overriding golint job if possible. Thank you 🙏
diff --git a/pkg/asset/installconfig/aws/validation_test.go b/pkg/asset/installconfig/aws/validation_test.go
index 4e22616b60..392b49ba7d 100644
--- a/pkg/asset/installconfig/aws/validation_test.go
+++ b/pkg/asset/installconfig/aws/validation_test.go
@@ -1454,10 +1454,8 @@ func TestValidate(t *testing.T) {
err := Validate(context.TODO(), meta, test.installConfig)
if test.expectErr == "" {
assert.NoError(t, err)
- } else {
- if assert.Error(t, err) {
- assert.Regexp(t, test.expectErr, err.Error())
- }
+ } else if assert.Error(t, err) {
+ assert.Regexp(t, test.expectErr, err.Error())
}
})
}
@@ -1585,10 +1583,8 @@ func TestValidateForProvisioning(t *testing.T) {
err := ValidateForProvisioning(route53Client, ic, meta)
if test.expectedErr == "" {
assert.NoError(t, err)
- } else {
- if assert.Error(t, err) {
- assert.Regexp(t, test.expectedErr, err.Error())
- }
+ } else if assert.Error(t, err) {
+ assert.Regexp(t, test.expectedErr, err.Error())
}
})
}
@@ -1628,7 +1624,6 @@ func TestGetSubDomainDNSRecords(t *testing.T) {
route53Client := mock.NewMockAPI(mockCtrl)
for _, test := range cases {
-
t.Run(test.name, func(t *testing.T) {
ic := icBuild.build(icBuild.withBaseDomain(test.baseDomain))
if test.expectedErr != "" {
@@ -1653,10 +1648,8 @@ func TestGetSubDomainDNSRecords(t *testing.T) {
_, err := route53Client.GetSubDomainDNSRecords(&validDomainOutput, ic, nil)
if test.expectedErr == "" {
assert.NoError(t, err)
- } else {
- if assert.Error(t, err) {
- assert.Regexp(t, test.expectedErr, err.Error())
- }
+ } else if assert.Error(t, err) {
+ assert.Regexp(t, test.expectedErr, err.Error())
}
})
}
Ok, done |
tthvo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
Local testing looks good to me 👍 I notice that sevsnp-supported instance type (e.g. m6a.xlarge) is also available in unsupported region (e.g. us-east-1). This means someone might use the right ami + instance type, but invalid region.
Though, I think keeping a static region list is not a good idea either since AWS can roll out new support regions. As long as we call out the constraints in docs, I think it's fine for now.
@yalzhang could you double check and add the verified label?
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: tthvo The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Validate that AWS instance types and AMIs are compatible with AMD SEV-SNP confidential computing to fail fast during install-config validation instead of during cluster deployment. - Validate instance types have "amd-sev-snp" feature support - Validate AMIs have UEFI or UEFI-preferred boot mode Signed-off-by: Fangge Jin <[email protected]> Assisted-by: Claude Code
Simplify nested if-else statements to use else-if and remove unnecessary blank lines in validation_test.go. Signed-off-by: Fangge Jin <[email protected]>
58e904e to
f2fbd64
Compare
|
/retest-required |
|
@fangge1212: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/verified by yalzhang |
1 similar comment
|
/verified by yalzhang |
|
@tthvo: This PR has been marked as verified by In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Validate that AWS instance types and AMIs are compatible with AMD SEV-SNP confidential computing to fail fast during install-config validation instead of during cluster deployment.