[WIP] nvmeof/add tls psk#6280
Conversation
change "metadata" string (it is kms ID tpye) to be as const. Signed-off-by: gadi-didi <gadi.didi@ibm.com>
There was a problem hiding this comment.
Pull request overview
Adds TLS-PSK (Pre-Shared Key) support to the NVMe-oF CSI driver so that initiator↔gateway traffic can be encrypted. The flow mirrors the existing DH-CHAP plumbing: the controller generates/stores a PSK in a KMS-backed DEKStore on ControllerPublishVolume, configures the gateway with the PSK and a secure listener, and the node retrieves the same PSK from KMS in NodeStageVolume to nvme connect --tls --tls-key …. PR is explicitly marked WIP and untested.
Changes:
- New
internal/nvmeof/tls_psk.gowithGetOrCreateTLSPSKKey/RemoveTLSPSKKeyandnvme gen-tls-keyinvocation. - Gateway RPC API (
CreateSubsystem,AddHost,CreateListener) extended to carry secure/PSK arguments; newtlsPskModestorage class parameter is threaded through volume context, RBD metadata, and node/controller security helpers. - Controller/node security helpers (
setupTLSPSKKey,cleanupTLSPSKKey,setupTLSPSKAuth) added, mirroring the DH-CHAP pattern;RBDMetadataKMSconstant introduced and used as default when KMS ID is omitted.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/nvmeof/tls_psk.go | New TLS-PSK generation/storage/retrieval helpers. |
| internal/nvmeof/nvmeof.go | Adds securedListener/pskKey params to gateway RPC calls. |
| internal/nvmeof/nvmeof_initiator.go | Adds --tls --tls-key to nvme connect when PSK is set. |
| internal/nvmeof/volume.go | Adds TlsPskMode to security config; defaults KMS ID. |
| internal/nvmeof/volume_test.go | Switches literal "metadata" to RBDMetadataKMS constant. |
| internal/nvmeof/crypto.go | Adds RBDMetadataKMS constant. |
| internal/nvmeof/controller/security.go | Adds setupTLSPSKKey and cleanupTLSPSKKey. |
| internal/nvmeof/controller/controllerserver.go | Wires PSK setup/cleanup and secure listener into publish/unpublish. |
| internal/nvmeof/nodeserver/security.go | Adds setupTLSPSKAuth to fetch PSK and pass to ConnectRequest. |
| internal/nvmeof/nodeserver/nodeserver.go | Plumbs tlsPskMode through volume context and connect flow. |
| internal/nvmeof/tests/nvmeof_test.go | Updates real-gateway test to new gateway RPC signatures. |
| // GetOrCreateTLSPSKKey retrieves existing PSK or creates new one if not found. | ||
| // This function follows the same pattern as GetOrCreateDHCHAPHostKey. | ||
| func GetOrCreateTLSPSKKey( | ||
| ctx context.Context, | ||
| skm SecurityKeyManager, | ||
| nodeID, | ||
| subsystemNQN string, | ||
| ) (string, error) { | ||
| // Try to get existing key | ||
| pskKey, err := getTLSPSKKey(ctx, skm, nodeID, subsystemNQN) | ||
| if err == nil { | ||
| // Key exists, return it | ||
| return pskKey, nil | ||
| } | ||
|
|
||
| // TODO - when another KMS implementation is added, we should check what kind of error it returns | ||
| // when key is not found and check for that here instead of ErrKeyNotFound.!!! | ||
| // for now , since we have only RBD DEKStore, we can check for ErrKeyNotFound which | ||
| // is returned by RBD DEKStore when key is not found. | ||
|
|
||
| // Only create if truly not found - not on any other error. | ||
| if !errors.Is(err, ErrKeyNotFound) { | ||
| // Real error (KMS down, network issue etc) - don't generate new key | ||
| return "", fmt.Errorf("failed to check existing TLS-PSK key: %w", err) | ||
| } | ||
|
|
||
| // Key doesn't exist, generate new one | ||
| pskKey, err = generateAndStoreTLSPSKKey(ctx, skm, nodeID, subsystemNQN) | ||
| if err != nil { | ||
| return "", fmt.Errorf("failed to generate TLS-PSK key: %w", err) | ||
| } | ||
|
|
||
| return pskKey, nil | ||
| } | ||
|
|
||
| // RemoveTLSPSKKey removes the TLS-PSK key for the given node and subsystem connection. | ||
| func RemoveTLSPSKKey( | ||
| ctx context.Context, | ||
| skm SecurityKeyManager, | ||
| nodeID string, | ||
| subsystemNQN string, | ||
| ) error { | ||
| keyID := buildTLSPSKKeyID(nodeID, subsystemNQN) | ||
|
|
||
| return skm.RemoveKey(ctx, keyID) | ||
| } |
1ac640a to
b378211
Compare
b378211 to
c310656
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
internal/nvmeof/volume.go:123
- This default-KMS block runs for any
tlsPskModevalue other than ""/"none". If validation isn’t added, a typo intlsPskModecan silently switchAuthenticationKMSIDto metadata and still diverge from the node’s behavior. Consider defaulting the KMS only whentlsPskMode == TLSPSKEnabled(and rejecting unknown values).
if v.Security.TlsPskMode != TLSPSKEmpty &&
v.Security.TlsPskMode != TLSPSKNone &&
v.Security.AuthenticationKMSID == "" {
v.Security.AuthenticationKMSID = RBDMetadataKMS
}
| if v.Security.TlsPskMode != TLSPSKEmpty && | ||
| v.Security.TlsPskMode != TLSPSKNone && | ||
| v.Security.AuthenticationKMSID == "" { | ||
| v.Security.AuthenticationKMSID = RBDMetadataKMS | ||
| } |
c310656 to
e32c2b6
Compare
| // GetOrCreateTLSPSKKey retrieves existing PSK or creates new one if not found. | ||
| // This function follows the same pattern as GetOrCreateDHCHAPHostKey. | ||
| func GetOrCreateTLSPSKKey( | ||
| ctx context.Context, | ||
| skm SecurityKeyManager, | ||
| nodeID, | ||
| subsystemNQN string, | ||
| ) (string, error) { | ||
| // Try to get existing key | ||
| pskKey, err := getTLSPSKKey(ctx, skm, nodeID, subsystemNQN) | ||
| if err == nil { | ||
| // Key exists, return it | ||
| return pskKey, nil | ||
| } | ||
|
|
||
| // TODO - when another KMS implementation is added, we should check what kind of error it returns | ||
| // when key is not found and check for that here instead of ErrKeyNotFound.!!! | ||
| // for now , since we have only RBD DEKStore, we can check for ErrKeyNotFound which | ||
| // is returned by RBD DEKStore when key is not found. | ||
|
|
||
| // Only create if truly not found - not on any other error. | ||
| if !errors.Is(err, ErrKeyNotFound) { | ||
| // Real error (KMS down, network issue etc) - don't generate new key | ||
| return "", fmt.Errorf("failed to check existing TLS-PSK key: %w", err) | ||
| } | ||
|
|
||
| // Key doesn't exist, generate new one | ||
| pskKey, err = generateAndStoreTLSPSKKey(ctx, skm, nodeID, subsystemNQN) | ||
| if err != nil { | ||
| return "", fmt.Errorf("failed to generate TLS-PSK key: %w", err) | ||
| } | ||
|
|
||
| return pskKey, nil | ||
| } |
Implement TLS-PSK (Pre-Shared Key) key lifecycle management for NVMe-oF secure connections. Keys are generated per host-subsystem pair, encrypted via KMS, and stored in DEKStore (RBD metadata - for testing, or external KMS like Vault). Key operations: `GetOrCreateTLSPSKKey()` for idempotent retrieval, `RemoveTLSPSKKey()` for cleanup, and `generateTLSPSKKey()` using nvme gen-tls-key command. Follows same SecurityKeyManager pattern as DH-CHAP authentication. Signed-off-by: gadi-didi <gadi.didi@ibm.com>
This adds TLS-PSK (Transport Layer Security with Pre-Shared Key) to provide encrypted transport for NVMe-oF volumes. TLS-PSK combines authentication and encryption, unlike DH-CHAP which only authenticates. internal/nvmeof/volume.go: - Add TlsPskMode field to NVMeoFSecurityConfig struct - Parse tlsPskMode from StorageClass parameters - Set default authenticationKMSID to RBD metadata when tlsPskMode is enabled but no KMS specified (for testing; production should use Vault) internal/nvmeof/controller/security.go: - Add setupTLSPSKKey() to retrieve or generate PSK during ControllerPublishVolume - Add cleanupTLSPSKKey() to remove PSK from KMS after host is unpublished - Both functions mirror the DH-CHAP pattern: initialize SecurityKeyManager, setup DEKStore if needed, and call GetOrCreateTLSPSKKey() or RemoveTLSPSKKey() internal/nvmeof/controller/controllerserver.go: - Pass securedListener flag to CreateSubsystem() and CreateListener() when TLS-PSK is enabled to configure gateway for TLS - Call setupTLSPSKKey() in publishResources() and pass PSK to AddHost() - Call cleanupTLSPSKKey() in unpublishResources() after host is removed - Add vcTLSPSKMode constant for volume context metadata key - Store and retrieve TlsPskMode in volume context and RBD metadata internal/nvmeof/nvmeof.go: - Add securedListener parameter to CreateSubsystem() to set SecureListeners flag in gateway request when using auto-listener network mask - Add securedListener parameter to CreateListener() to set Secure flag for manual listener creation - Add pskKey parameter to AddHost() and set Psk field in gateway request when PSK is provided internal/nvmeof/tests/nvmeof_test.go: - Update test calls to pass new parameters: securedListener=false for CreateSubsystem() and CreateListener(), empty PSK for `AddHost()` Signed-off-by: gadi-didi <gadi.didi@ibm.com>
e32c2b6 to
2ab191e
Compare
This completes the node server implementation for TLS-PSK secure transport, enabling nodes to retrieve PSK keys and establish encrypted NVMe-oF connections using nvme-cli. internal/nvmeof/nodeserver/nodeserver.go: - Add TlsPskMode field to NvmeConnectionInfo struct to track TLS-PSK mode alongside DH-CHAP mode - Add nvmeoftlsPskMode constant for volume context key parsing - Extract tlsPskMode from volume context in getNvmeConnection() and normalize empty/"none" values to empty string - Call setupTLSPSKAuth() in connectToSubsystem() when TlsPskMode is enabled, before resolving listeners internal/nvmeof/nodeserver/security.go: - Add setupTLSPSKAuth() function to retrieve TLS-PSK key on node side - Call GetOrCreateTLSPSKKey() to fetch PSK for this node-subsystem pair from KMS/DEKStore - Set PSKKey field in ConnectRequest for nvme-cli command internal/nvmeof/nvmeof_initiator.go: - Add PSKKey field to ConnectRequest struct for passing PSK to connection - Append --tls and --tls-key flags to nvme connect command when PSKKey is provided Signed-off-by: gadi-didi <gadi.didi@ibm.com>
2ab191e to
ae7ff12
Compare
Add TLS-PSK support for NVMe-oF volumes
This PR adds TLS-PSK (Pre-Shared Key) to encrypt NVMe-oF connections between nodes and Ceph gateways.
It is very similar implementation like DH-CHAP feature. (Both use KMS and dek-store for encrypting and saving the key inside).
DH-CHAP is for authentication, TLS-PSK for secure communication.
What changed
Added
tlsPskModeStorageClass parameter that enables TLS encryption for volumes. When enabled:Controller generates PSK when publishing volume to a node
nvme gen-tls-keyAddHost()callsecure=trueflagNode retrieves PSK when staging volume
nvme connect --tls --tls-key <PSK>Controller cleanup when unpublishing
Key design choices
nvmeof-tls-psk-<nodeID>-<subsystemHash>where subsystemHash is first 16 chars of SHA-256(subsystemNQN)Usage
Testing
I did not test it yet.
fix issue : #5717
UPDATE
TLS-PSK requires
tlshd(TLS Handshake Daemon) to be installed and running on each node where the CSI node plugin is deployed. The Linux kernel's NVMe-TCP driver cannot perform TLS handshakes directly in kernel space, so it delegates this totlshd, a user-space daemon from the ktls-utils package. Whennvme connect --tlsis issued, the kernel sends a handshake request via netlink socket to tlshd, which uses the GnuTLS library to perform the TLS 1.3 negotiation with the gateway. The resulting session keys are passed back to the kernel's kTLS layer, which then handles all subsequent I/O encryption transparently. The PSK key must be pre-inserted into the kernel's .nvme keyring (vianvme check-tls-key --insert --identity 0) before connecting, and tlshd must be configured withkeyrings=.nvmein/etc/tlshd.confso it can locate the key during the handshake. Without tlshd running, all TLS-PSK connection attempts will fail with a handshake error regardless of whether the key is correctly configured on both sides.We need to figure out how to handle it.
Checklist:
guidelines in the developer
guide.
Request
notes
updated with breaking and/or notable changes for the next major release.
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
/retest ci/centos/<job-name>: retest the<job-name>after unrelatedfailure (please report the failure too!)