Skip to content

[WIP] nvmeof/add tls psk#6280

Draft
gadididi wants to merge 4 commits into
ceph:develfrom
gadididi:nvmeof/add_tls-psk
Draft

[WIP] nvmeof/add tls psk#6280
gadididi wants to merge 4 commits into
ceph:develfrom
gadididi:nvmeof/add_tls-psk

Conversation

@gadididi
Copy link
Copy Markdown
Contributor

@gadididi gadididi commented May 18, 2026

Add TLS-PSK support for NVMe-oF volumes

This PR adds TLS-PSK (Pre-Shared Key) to encrypt NVMe-oF connections between nodes and Ceph gateways.
It is very similar implementation like DH-CHAP feature. (Both use KMS and dek-store for encrypting and saving the key inside).
DH-CHAP is for authentication, TLS-PSK for secure communication.

What changed

Added tlsPskMode StorageClass parameter that enables TLS encryption for volumes. When enabled:

  1. Controller generates PSK when publishing volume to a node

    • Creates 256-bit random key using nvme gen-tls-key
    • Encrypts and stores key in KMS/DEKStore (same pattern as DH-CHAP)
    • Configures gateway with PSK via AddHost() call
    • Creates listeners with secure=true flag
  2. Node retrieves PSK when staging volume

    • Fetches same key from KMS using node ID + subsystem NQN
    • Connects with nvme connect --tls --tls-key <PSK>
  3. Controller cleanup when unpublishing

    • Removes PSK from KMS after host is removed from subsystem

Key design choices

  • PSK scope: Per host-subsystem pair). Multiple volumes on same node connecting to same subsystem share one key.
  • Storage: Uses existing DEKStore pattern - RBD metadata for testing, Vault/external KMS for production.
  • Key retrieval: Node independently fetches key from KMS, controller doesn't pass it through volume context.
  • Key format: nvmeof-tls-psk-<nodeID>-<subsystemHash> where subsystemHash is first 16 chars of SHA-256(subsystemNQN)

Usage

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-nvmeof-tls
provisioner: nvmeof.csi.ceph.com
parameters:
  tlsPskMode: "enabled"
  authenticationKMSID: "metadata"  # or vault/aws-kms/etc

Testing

I did not test it yet.

fix issue : #5717

UPDATE

TLS-PSK requires tlshd (TLS Handshake Daemon) to be installed and running on each node where the CSI node plugin is deployed. The Linux kernel's NVMe-TCP driver cannot perform TLS handshakes directly in kernel space, so it delegates this to tlshd, a user-space daemon from the ktls-utils package. When nvme connect --tls is issued, the kernel sends a handshake request via netlink socket to tlshd, which uses the GnuTLS library to perform the TLS 1.3 negotiation with the gateway. The resulting session keys are passed back to the kernel's kTLS layer, which then handles all subsequent I/O encryption transparently. The PSK key must be pre-inserted into the kernel's .nvme keyring (via nvme check-tls-key --insert --identity 0) before connecting, and tlshd must be configured with keyrings=.nvme in /etc/tlshd.conf so it can locate the key during the handshake. Without tlshd running, all TLS-PSK connection attempts will fail with a handshake error regardless of whether the key is correctly configured on both sides.

We need to figure out how to handle it.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow
    guidelines in the developer
    guide
    .
  • Reviewed the developer guide on Submitting a Pull
    Request
  • Pending release
    notes

    updated with breaking and/or notable changes for the next major release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Show available bot commands

These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:

  • /retest ci/centos/<job-name>: retest the <job-name> after unrelated
    failure (please report the failure too!)

change "metadata" string (it is kms ID tpye)
to be as const.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
@gadididi gadididi requested a review from Copilot May 18, 2026 13:24
@gadididi gadididi self-assigned this May 18, 2026
@gadididi gadididi added the component/nvme-of Issues and PRs related to NVMe-oF. label May 18, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds TLS-PSK (Pre-Shared Key) support to the NVMe-oF CSI driver so that initiator↔gateway traffic can be encrypted. The flow mirrors the existing DH-CHAP plumbing: the controller generates/stores a PSK in a KMS-backed DEKStore on ControllerPublishVolume, configures the gateway with the PSK and a secure listener, and the node retrieves the same PSK from KMS in NodeStageVolume to nvme connect --tls --tls-key …. PR is explicitly marked WIP and untested.

Changes:

  • New internal/nvmeof/tls_psk.go with GetOrCreateTLSPSKKey/RemoveTLSPSKKey and nvme gen-tls-key invocation.
  • Gateway RPC API (CreateSubsystem, AddHost, CreateListener) extended to carry secure/PSK arguments; new tlsPskMode storage class parameter is threaded through volume context, RBD metadata, and node/controller security helpers.
  • Controller/node security helpers (setupTLSPSKKey, cleanupTLSPSKKey, setupTLSPSKAuth) added, mirroring the DH-CHAP pattern; RBDMetadataKMS constant introduced and used as default when KMS ID is omitted.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
internal/nvmeof/tls_psk.go New TLS-PSK generation/storage/retrieval helpers.
internal/nvmeof/nvmeof.go Adds securedListener/pskKey params to gateway RPC calls.
internal/nvmeof/nvmeof_initiator.go Adds --tls --tls-key to nvme connect when PSK is set.
internal/nvmeof/volume.go Adds TlsPskMode to security config; defaults KMS ID.
internal/nvmeof/volume_test.go Switches literal "metadata" to RBDMetadataKMS constant.
internal/nvmeof/crypto.go Adds RBDMetadataKMS constant.
internal/nvmeof/controller/security.go Adds setupTLSPSKKey and cleanupTLSPSKKey.
internal/nvmeof/controller/controllerserver.go Wires PSK setup/cleanup and secure listener into publish/unpublish.
internal/nvmeof/nodeserver/security.go Adds setupTLSPSKAuth to fetch PSK and pass to ConnectRequest.
internal/nvmeof/nodeserver/nodeserver.go Plumbs tlsPskMode through volume context and connect flow.
internal/nvmeof/tests/nvmeof_test.go Updates real-gateway test to new gateway RPC signatures.

Comment thread internal/nvmeof/tls_psk.go Outdated
Comment thread internal/nvmeof/tls_psk.go
Comment thread internal/nvmeof/controller/security.go
Comment thread internal/nvmeof/nvmeof_initiator.go Outdated
Comment thread internal/nvmeof/nvmeof.go
Comment thread internal/nvmeof/nodeserver/security.go
Comment thread internal/nvmeof/tests/nvmeof_test.go Outdated
Comment on lines +79 to +124
// GetOrCreateTLSPSKKey retrieves existing PSK or creates new one if not found.
// This function follows the same pattern as GetOrCreateDHCHAPHostKey.
func GetOrCreateTLSPSKKey(
ctx context.Context,
skm SecurityKeyManager,
nodeID,
subsystemNQN string,
) (string, error) {
// Try to get existing key
pskKey, err := getTLSPSKKey(ctx, skm, nodeID, subsystemNQN)
if err == nil {
// Key exists, return it
return pskKey, nil
}

// TODO - when another KMS implementation is added, we should check what kind of error it returns
// when key is not found and check for that here instead of ErrKeyNotFound.!!!
// for now , since we have only RBD DEKStore, we can check for ErrKeyNotFound which
// is returned by RBD DEKStore when key is not found.

// Only create if truly not found - not on any other error.
if !errors.Is(err, ErrKeyNotFound) {
// Real error (KMS down, network issue etc) - don't generate new key
return "", fmt.Errorf("failed to check existing TLS-PSK key: %w", err)
}

// Key doesn't exist, generate new one
pskKey, err = generateAndStoreTLSPSKKey(ctx, skm, nodeID, subsystemNQN)
if err != nil {
return "", fmt.Errorf("failed to generate TLS-PSK key: %w", err)
}

return pskKey, nil
}

// RemoveTLSPSKKey removes the TLS-PSK key for the given node and subsystem connection.
func RemoveTLSPSKKey(
ctx context.Context,
skm SecurityKeyManager,
nodeID string,
subsystemNQN string,
) error {
keyID := buildTLSPSKKeyID(nodeID, subsystemNQN)

return skm.RemoveKey(ctx, keyID)
}
Comment thread internal/nvmeof/controller/controllerserver.go
Comment thread internal/nvmeof/tls_psk.go Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (1)

internal/nvmeof/volume.go:123

  • This default-KMS block runs for any tlsPskMode value other than ""/"none". If validation isn’t added, a typo in tlsPskMode can silently switch AuthenticationKMSID to metadata and still diverge from the node’s behavior. Consider defaulting the KMS only when tlsPskMode == TLSPSKEnabled (and rejecting unknown values).
	if v.Security.TlsPskMode != TLSPSKEmpty &&
		v.Security.TlsPskMode != TLSPSKNone &&
		v.Security.AuthenticationKMSID == "" {
		v.Security.AuthenticationKMSID = RBDMetadataKMS
	}

Comment thread internal/nvmeof/controller/security.go Outdated
Comment thread internal/nvmeof/volume.go
Comment thread internal/nvmeof/nodeserver/security.go Outdated
Comment thread internal/nvmeof/tls_psk.go Outdated
Comment thread internal/nvmeof/volume.go
Comment on lines +119 to 123
if v.Security.TlsPskMode != TLSPSKEmpty &&
v.Security.TlsPskMode != TLSPSKNone &&
v.Security.AuthenticationKMSID == "" {
v.Security.AuthenticationKMSID = RBDMetadataKMS
}
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Comment on lines +76 to +109
// GetOrCreateTLSPSKKey retrieves existing PSK or creates new one if not found.
// This function follows the same pattern as GetOrCreateDHCHAPHostKey.
func GetOrCreateTLSPSKKey(
ctx context.Context,
skm SecurityKeyManager,
nodeID,
subsystemNQN string,
) (string, error) {
// Try to get existing key
pskKey, err := getTLSPSKKey(ctx, skm, nodeID, subsystemNQN)
if err == nil {
// Key exists, return it
return pskKey, nil
}

// TODO - when another KMS implementation is added, we should check what kind of error it returns
// when key is not found and check for that here instead of ErrKeyNotFound.!!!
// for now , since we have only RBD DEKStore, we can check for ErrKeyNotFound which
// is returned by RBD DEKStore when key is not found.

// Only create if truly not found - not on any other error.
if !errors.Is(err, ErrKeyNotFound) {
// Real error (KMS down, network issue etc) - don't generate new key
return "", fmt.Errorf("failed to check existing TLS-PSK key: %w", err)
}

// Key doesn't exist, generate new one
pskKey, err = generateAndStoreTLSPSKKey(ctx, skm, nodeID, subsystemNQN)
if err != nil {
return "", fmt.Errorf("failed to generate TLS-PSK key: %w", err)
}

return pskKey, nil
}
Comment thread internal/nvmeof/tls_psk.go Outdated
Comment thread internal/nvmeof/volume.go Outdated
Comment thread internal/nvmeof/controller/controllerserver.go Outdated
gadididi added 2 commits May 19, 2026 15:52
Implement TLS-PSK (Pre-Shared Key) key lifecycle management for
NVMe-oF secure connections. Keys are generated per host-subsystem
pair, encrypted via KMS, and stored in DEKStore
(RBD metadata - for testing, or external KMS like Vault).

Key operations: `GetOrCreateTLSPSKKey()` for idempotent retrieval,
`RemoveTLSPSKKey()` for cleanup, and `generateTLSPSKKey()` using
nvme gen-tls-key command. Follows same SecurityKeyManager pattern
as DH-CHAP authentication.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
This adds TLS-PSK (Transport Layer Security with Pre-Shared Key) to
provide encrypted transport for NVMe-oF volumes. TLS-PSK combines
authentication and encryption, unlike DH-CHAP which only authenticates.

internal/nvmeof/volume.go:
- Add TlsPskMode field to NVMeoFSecurityConfig struct
- Parse tlsPskMode from StorageClass parameters
- Set default authenticationKMSID to RBD metadata when tlsPskMode is
  enabled but no KMS specified (for testing; production should use Vault)

internal/nvmeof/controller/security.go:
- Add setupTLSPSKKey() to retrieve or generate PSK during ControllerPublishVolume
- Add cleanupTLSPSKKey() to remove PSK from KMS after host is unpublished
- Both functions mirror the DH-CHAP pattern: initialize SecurityKeyManager,
  setup DEKStore if needed, and call GetOrCreateTLSPSKKey() or RemoveTLSPSKKey()

internal/nvmeof/controller/controllerserver.go:
- Pass securedListener flag to CreateSubsystem() and CreateListener() when
  TLS-PSK is enabled to configure gateway for TLS
- Call setupTLSPSKKey() in publishResources() and pass PSK to AddHost()
- Call cleanupTLSPSKKey() in unpublishResources() after host is removed
- Add vcTLSPSKMode constant for volume context metadata key
- Store and retrieve TlsPskMode in volume context and RBD metadata

internal/nvmeof/nvmeof.go:
- Add securedListener parameter to CreateSubsystem() to set SecureListeners
  flag in gateway request when using auto-listener network mask
- Add securedListener parameter to CreateListener() to set Secure flag
  for manual listener creation
- Add pskKey parameter to AddHost() and set Psk field in gateway request
  when PSK is provided

internal/nvmeof/tests/nvmeof_test.go:
- Update test calls to pass new parameters: securedListener=false for
  CreateSubsystem() and CreateListener(), empty PSK for `AddHost()`

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comment thread internal/nvmeof/volume.go
Comment thread internal/nvmeof/volume_test.go
Comment thread internal/nvmeof/tls_psk.go
This completes the node server implementation for TLS-PSK secure
transport, enabling nodes to retrieve PSK keys and establish encrypted
NVMe-oF connections using nvme-cli.

internal/nvmeof/nodeserver/nodeserver.go:
- Add TlsPskMode field to NvmeConnectionInfo struct to track TLS-PSK
  mode alongside DH-CHAP mode
- Add nvmeoftlsPskMode constant for volume context key parsing
- Extract tlsPskMode from volume context in getNvmeConnection() and
  normalize empty/"none" values to empty string
- Call setupTLSPSKAuth() in connectToSubsystem() when TlsPskMode is
  enabled, before resolving listeners

internal/nvmeof/nodeserver/security.go:
- Add setupTLSPSKAuth() function to retrieve TLS-PSK key on node side
- Call GetOrCreateTLSPSKKey() to fetch PSK for this node-subsystem pair
  from KMS/DEKStore
- Set PSKKey field in ConnectRequest for nvme-cli command

internal/nvmeof/nvmeof_initiator.go:
- Add PSKKey field to ConnectRequest struct for passing PSK to connection
- Append --tls and --tls-key flags to nvme connect command when PSKKey
  is provided

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
@gadididi gadididi force-pushed the nvmeof/add_tls-psk branch from 2ab191e to ae7ff12 Compare May 25, 2026 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/nvme-of Issues and PRs related to NVMe-oF.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants