@alicefr commented Oct 27, 2025

In confidential clusters that use raw TPM support, we need to register the attestation key (AK) with Trustee so that the attestation quote can be verified. The key is trusted on first use (TOFU), and the registration endpoint passed via Ignition allows the operator to retrieve the key and configure Trustee accordingly.

AK registration happens before any merge/replace directives are processed because, as a second step, we want to protect the Ignition config behind an attestation phase. This way, the key is registered before anything is fetched.

We assume that only one AK per system is used and registered for attestation.
This PR includes the commit from #2145 because it was tested together with the Clevis pin for Trustee.

Example:

  "attestation": {
    "attestation_key": {
      "registration": {
        "url": "http://192.168.122.1:5000"
      }
    }
  }
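A minimal Go sketch of how the example above could be decoded. The Registration fields follow the struct quoted in the review further down, using the corrected `certificate` key that the review suggests; the enclosing type names (Attestation, AttestationKey, Config) and the helper registrationURL are hypothetical, and Ignition's real schema types may be organized differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Registration follows the struct quoted in the review, using the corrected
// "certificate" JSON key suggested there.
type Registration struct {
	Url         *string `json:"url,omitempty"`
	Certificate *string `json:"certificate,omitempty"`
}

// AttestationKey, Attestation and Config are hypothetical type names; only
// the JSON keys come from the example config.
type AttestationKey struct {
	Registration Registration `json:"registration"`
}

type Attestation struct {
	AttestationKey AttestationKey `json:"attestation_key"`
}

type Config struct {
	Attestation *Attestation `json:"attestation,omitempty"`
}

// registrationURL returns the configured registration URL, or "" when the
// config does not request AK registration.
func registrationURL(raw []byte) (string, error) {
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return "", err
	}
	if cfg.Attestation == nil || cfg.Attestation.AttestationKey.Registration.Url == nil {
		return "", nil
	}
	return *cfg.Attestation.AttestationKey.Registration.Url, nil
}

func main() {
	example := []byte(`{"attestation": {"attestation_key": {"registration": {"url": "http://192.168.122.1:5000"}}}}`)
	u, err := registrationURL(example)
	if err != nil {
		panic(err)
	}
	fmt.Println(u) // http://192.168.122.1:5000
}
```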

Design document including the AK registration for the confidential cluster operator

To support a new Clevis pin, either each pin has to be added to the
hardcoded list of pins, or Ignition has to allow any name for the pin.
This is required to enable the Clevis Trustee pin used for confidential
clusters.

Signed-off-by: Alice Frosi <[email protected]>

The attestation section includes the fields necessary to attest the
machine, for example in confidential clusters. Registering the
attestation key makes it possible to extract the AK from the TPM and
register it before it is used to sign the TPM quote.

Signed-off-by: Alice Frosi <[email protected]>

Generate and persist the AK in the TPM if key registration is requested
in the config. The public AK is then registered at the URL specified in
the config.

The AK is generated only once: if /var/tpm/ak.pub exists, this step has
already been performed and it is skipped in subsequent stages.

Registering the AK requires networking; hence, during the fetch-offline
stage, Ignition signals that networking is needed.

The retry mechanism attempts the registration multiple times so that
the network configuration, which happens in parallel during the fetch
stage, has time to complete.

Signed-off-by: Alice Frosi <[email protected]>

@gemini-code-assist bot left a comment

Code Review

This pull request introduces attestation key generation and registration functionality for confidential clusters using raw TPMs. It includes changes to config translation, clevis validation, schema definitions, and the addition of attestation handling logic. The code introduces a new Attestation type to the configuration schema, and implements the logic to generate, persist, and register the attestation key with a remote service. The review focuses on error handling, code clarity, and potential security considerations.


type Registration struct {
Url *string `json:"url,omitempty"`
Certificate *string `json:"certificat,omitempty"`

high

The field name certificat in the Registration struct has a typo. It should be certificate to match the intended meaning and avoid confusion.

Suggested change
Certificate *string `json:"certificat,omitempty"`
Certificate *string `json:"certificate,omitempty"`

Comment on lines +36 to 37
if util.NilOrEmpty(cu.Pin) {
r.AddOnError(c.Append("pin"), errors.ErrClevisPinRequired)

medium

If cu.Pin is nil or empty, the validation logic adds an error indicating that a Clevis pin is required. However, the code does not validate the value of cu.Pin against a list of known pins, potentially leading to runtime errors if an invalid pin is provided. It's better to validate the pin against known values to provide early feedback to the user.

Consider adding a validation step to ensure that the pin value is one of the supported types (e.g., "tpm2", "tang", "sss").

Comment on lines +139 to +140
if _, err := os.Stat(AKRegisterd); err == nil {
return nil

medium

The code checks if the AK was already generated by checking for the existence of the AKRegisterd file. However, the comment says "Check if AK was already generated", which is misleading. The file indicates that the AK was registered, not generated. This could cause confusion when debugging.

Comment on lines +155 to +163
if !util.NilOrEmpty(registration.Certificate) {
tlsConfig, err := createTLSConfig(*registration.Certificate)
if err != nil {
return fmt.Errorf("failed to create TLS config: %w", err)
}

client.Transport = &http.Transport{
TLSClientConfig: tlsConfig,
}

medium

The code creates a TLS configuration only if registration.Certificate is not nil or empty. However, it does not handle the case where the certificate string is invalid or malformed. This could lead to a crash or unexpected behavior at runtime.

Consider adding error handling to check if the certificate is valid and return an error if it is not.

Comment on lines +185 to +186
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return fmt.Errorf("registration failed with status code: %d", resp.StatusCode)

medium

The code checks if the response status code is outside the range of 200-299. However, it does not provide any context or information about the specific status code. This makes it difficult to debug registration failures.

Consider logging the status code and response body to provide more context for debugging.

Comment on lines +190 to +192
if err := os.WriteFile(AKRegisterd, []byte{}, 0644); err != nil {
return fmt.Errorf("failed to create AK registered file: %w", err)
}

medium

The code creates an empty file AKRegisterd to indicate successful registration. However, it does not handle potential errors during file creation. This could lead to a false positive if the file creation fails.

Consider adding error handling to check if the file creation was successful and return an error if it was not.

Comment on lines +217 to +234
func createTLSConfig(certPEM string) (*tls.Config, error) {
block, _ := pem.Decode([]byte(certPEM))
if block == nil {
return nil, fmt.Errorf("failed to decode PEM certificate")
}

cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
return nil, fmt.Errorf("failed to parse certificate: %w", err)
}

certPool := x509.NewCertPool()
certPool.AddCert(cert)

return &tls.Config{
RootCAs: certPool,
}, nil
}

medium

The createTLSConfig function decodes and parses the certificate but doesn't handle potential errors during PEM decoding or certificate parsing. This could lead to unexpected behavior if the certificate is invalid.

Comment on lines +319 to +327
if err := attestation.HandleAttestation(e.Logger, &cfg, e.PlatformConfig.Name(), e.NeedNet); err != nil {
if err == resource.ErrNeedNet && stageName == "fetch-offline" {
err = e.signalNeedNet()
if err != nil {
e.Logger.Crit("failed to signal neednet: %v", err)
}
return cfg, resource.ErrNeedNet
}
return types.Config{}, err

medium

The code calls HandleAttestation and checks for resource.ErrNeedNet specifically in the fetch-offline stage. If this error occurs in other stages, it will not be handled in the same way, potentially leading to unexpected behavior or failures. It's important to ensure that network-related errors are handled consistently across all stages.

Consider handling resource.ErrNeedNet in a more generic way, or ensuring that signalNeedNet is called in all relevant stages.

@alicefr marked this pull request as draft October 27, 2025 15:36

@prestist (Collaborator)

👋 hey @alicefr!
I took a look through the changes. Thank you for working on this! This is a great start 🥇.

First, I wanted to ask: how is "protect the ignition config with an attestation phase" achieved? I see the key generation and the registration, but not the decryption of the config that I expected.

Second, I am a bit worried about the impact this could have on all distros, since we are assuming these constants:

const (
	TPMDir      = "/var/tpm"
	AKPath      = "/var/tpm/ak.pub"
	AKCtxPath   = "/var/tpm/ak.ctx"
	AKRegisterd = "/var/tpm/ak.registerd"
	AKHandle    = "0x81010002"
	EKHandle    = "0x81010001"
)

Would the TPM directory always be mounted at /var/tpm? And are the AK/EK handles standard? If not, it might make sense to expose these as config options rather than constants and let the sugar layer (Butane) carry that burden, so that Ignition does not behave differently on some distros.
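Whether a given handle is already in use on a particular distro cannot be checked statically, but one property of the constants can: TPM 2.0 persistent object handles have 0x81 as their most significant byte (range 0x81000000 to 0x81FFFFFF). A minimal sketch that parses the handle strings and checks that range; parsePersistentHandle is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strconv"
)

// tpmHTPersistent is the TPM 2.0 handle type (most significant byte) for
// persistent objects, so persistent handles lie in 0x81000000-0x81FFFFFF.
const tpmHTPersistent = 0x81

// parsePersistentHandle parses a handle string such as the AKHandle/EKHandle
// constants and checks that it falls in the persistent-object range.
func parsePersistentHandle(s string) (uint32, error) {
	v, err := strconv.ParseUint(s, 0, 32) // base 0 accepts the 0x prefix
	if err != nil {
		return 0, fmt.Errorf("invalid handle %q: %w", s, err)
	}
	if v>>24 != tpmHTPersistent {
		return 0, fmt.Errorf("handle %q is not a persistent handle", s)
	}
	return uint32(v), nil
}

func main() {
	for _, h := range []string{"0x81010001", "0x81010002", "0x80000001"} {
		v, err := parsePersistentHandle(h)
		fmt.Printf("%s -> %#x, %v\n", h, v, err)
	}
}
```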

@alicefr (Author) commented Oct 28, 2025

> I wanted to ask: how is "protect the ignition config with an attestation phase" achieved? I see the key generation and the registration, but not the decryption of the config that I expected.

Right now, our operator releases the Ignition configuration containing the Clevis pin configuration via a merge directive, without any protection. The idea is to release it only after a successful attestation. An attestation produces an attestation token that can be used to release resources; in this case, the resource would be the Ignition configuration for the Clevis pin. That's why I generate the AK after fetching the provisioner config but before processing any merge directives.

The flow will be:

  1. Generate and register the AK
  2. Pass attestation
  3. Fetch the Ignition merge directive with the Clevis configuration and the UUID of the guest
  4. Pass attestation a second time, including the UUID in the measurement
  5. Retrieve the LUKS key for disk encryption

> Would the TPM directory always be mounted at /var/tpm? And are the AK/EK handles standard? If not, it might make sense to expose these as config options rather than constants and let the sugar layer (Butane) carry that burden, so that Ignition does not behave differently on some distros.

I tried to keep the configuration changes as minimal as possible, but if you think they should be configurable, I can add them.
About the handle: it is the default one that the Trustee attester expects (tpm-attestation-key-ak-setup).
Ignition creates the directory /var/tpm if it doesn't exist, so I think this is part of Ignition's internal logic: it is where the key files are stored and where Ignition reads the key from for the registration.

@prestist (Collaborator) commented Oct 28, 2025

I'm sorry if this is a silly question and I'm just missing it. I can see steps 1 and 2 in the changes; could you help me by pointing me to where steps 3, 4, and 5 are handled?

> Ignition creates the directory /var/tpm if it doesn't exist, so I think this is part of Ignition's internal logic: it is where the key files are stored and where Ignition reads the key from for the registration.

Okay, yes! If that's the case, we can definitely reduce the amount of config. I was under the mistaken impression that the location could change, but since Ignition itself manages the folder, it does not matter. Let's keep the constants.

@alicefr (Author) commented Oct 29, 2025

> I'm sorry if this is a silly question and I'm just missing it. I can see steps 1 and 2 in the changes; could you help me by pointing me to where steps 3, 4, and 5 are handled?

We haven't implemented that yet; sorry if that wasn't clear. It might be better if I write a design document for the two-phase attestation, so we can reference it here and the entire flow becomes clearer.

@alicefr (Author) commented Oct 29, 2025

@travier @prestist What worries me more is that the information passed when registering the AK isn't sufficient. Right now, I'm only passing the platform, which isn't enough to retrieve the endorsement key (EK) of the VM instance. According to the Google Cloud documentation, the EK can be fetched with a GET request:

GET /compute/v1/projects/[PROJECT_ID]/zones/[ZONE]/instances/[INSTANCE_NAME]/getShieldedInstanceIdentity

However, this requires the project ID, zone, and instance name. AFAIU, this information is available via Afterburn, but Afterburn runs too late: after the point where we register the key.

@alicefr (Author) commented Oct 29, 2025

@prestist This PR, confidential-clusters/cocl-operator#68, drafts the idea of having two attestation phases to protect the Ignition fetch directive containing the Clevis pin.
