Netdev List
 help / color / mirror / Atom feed
From: Jakub Kicinski <kuba@kernel.org>
To: rjethwani@purestorage.com
Cc: Jakub Kicinski <kuba@kernel.org>,
	netdev@vger.kernel.org, saeedm@nvidia.com, tariqt@nvidia.com,
	mbloch@nvidia.com, borisp@nvidia.com, john.fastabend@gmail.com,
	sd@queasysnail.net, davem@davemloft.net, pabeni@redhat.com,
	edumazet@google.com, leon@kernel.org
Subject: Re: [PATCH v14 6/9] tls: device: add TX KeyUpdate support
Date: Mon, 25 May 2026 14:16:08 -0700	[thread overview]
Message-ID: <20260525211608.2508449-1-kuba@kernel.org> (raw)
In-Reply-To: <20260515212715.3151307-7-rjethwani@purestorage.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
tls: device: add TX KeyUpdate support

This adds TX-side TLS 1.3 KeyUpdate support for HW-offloaded sockets by
installing a temporary SW context with the new key, deferring the HW
reinstall until all old-key records have been ACKed, and falling back to
permanent SW mode if the HW reinstall fails.

Thanks for the patch.  A few questions inline.

> diff --git a/include/net/tls.h b/include/net/tls.h
> index 2512a3799b21..c1085873ee01 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h

[ ... ]

> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> index 7a98d2f6cbd3..c435b3450872 100644
> --- a/net/tls/tls_device.c
> +++ b/net/tls/tls_device.c

[ ... ]

> @@ -222,6 +246,19 @@ static void tls_tcp_clean_acked(struct sock *sk, u32 acked_seq)
>  	}
>  
>  	ctx->unacked_record_sn += deleted_records;
> +
> +	/* Once all old-key HW records are ACKed, set REKEY_READY to
> +	 * let sendmsg know it can finish the rekey and switch back
> +	 * to HW offload.
> +	 */
> +	if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags) &&
> +	    !test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags)) {
> +		u32 boundary_seq = READ_ONCE(tls_ctx->rekey.boundary_seq);
> +
> +		if (!before(acked_seq, boundary_seq))
> +			set_bit(TLS_TX_REKEY_READY, &tls_ctx->flags);
> +	}
> +
>  	spin_unlock_irqrestore(&ctx->lock, flags);
>  }

[Medium]
Is the ordering between the boundary_seq publish and the PENDING bit
publish in tls_device_start_rekey() strong enough for this read site?

In tls_device_start_rekey() the sequence is:

	WRITE_ONCE(ctx->rekey.boundary_seq, tcp_sk(sk)->write_seq);
	...
	set_bit(TLS_TX_REKEY_PENDING, &ctx->flags);

set_bit() does not document any implicit memory barrier
(Documentation/atomic_bitops.txt).  On weakly-ordered architectures, can
this reader observe PENDING=1 with a stale boundary_seq, set REKEY_READY
prematurely, and let tls_device_complete_rekey() run before the old-key
records are actually ACKed?

Would an smp_wmb() (or smp_mb__before_atomic()) between the write of
boundary_seq and the set_bit, or moving both writes under
offload_ctx->lock that this reader already holds, be appropriate?

> @@ -253,6 +290,14 @@ void tls_device_free_resources_tx(struct sock *sk)
>  	struct tls_context *tls_ctx = tls_get_ctx(sk);
>  
>  	tls_free_partial_record(sk, tls_ctx);
> +
> +	if (unlikely(tls_ctx->rekey.sw_ctx))
> +		tls_sw_release_resources_tx(sk);
> +
> +	if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags)) {
> +		TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYOK);
> +		TLS_DEC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> +	}
>  }

[Medium]
Should a rekey that was still PENDING at socket teardown count as
TlsTxRekeyOk?  The rekey never completed, but the counter is bumped as
if it had succeeded.  Should this be a separate "aborted" counter, or
just omit the OK increment in this path?

> @@ -624,6 +672,19 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
>  			goto out;
>  	}
>  
> +	/* Old-key records all ACKed; switch back to HW. */
> +	if (test_bit(TLS_TX_REKEY_READY, &tls_ctx->flags))
> +		tls_device_complete_rekey(sk, tls_ctx, true);

[Medium]
Should the return value of tls_device_complete_rekey() be checked here?

tls_device_complete_rekey() can return non-zero from its early
tls_sw_drain_tx() path:

	rc = tls_sw_drain_tx(sk, ctx);
	if (rc)
		return rc;

In that early-return case none of REKEY_PENDING, REKEY_READY, or
REKEY_FAILED is updated.  The caller falls through to
tls_sw_sendmsg_locked(), and on the next sendmsg the same READY bit is
still set, so complete_rekey is retried, drain returns -EAGAIN again,
and so on.  Is there a way out of this state?  Should persistent drain
failure transition to FAILED, or at least bump a counter so it is
observable?

> +
> +	/* Use SW path if rekey is in progress (PENDING) or if HW rekey
> +	 * failed (FAILED).
> +	 */
> +	if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags) ||
> +	    test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags)) {
> +		rc = tls_sw_sendmsg_locked(sk, msg, size);
> +		goto out;
> +	}
> +
>  	rc = tls_push_data(sk, &msg->msg_iter, size, msg->msg_flags,
>  			   record_type);

[High]
What happens to a HW open_record across this transition?

If userspace previously sent with MSG_MORE, offload_ctx->open_record
holds buffered plaintext and tls_ctx->pending_open_record_frags is true.
A subsequent setsockopt KeyUpdate then routes future sendmsg calls into
tls_sw_sendmsg_locked(), which builds into sw_ctx->open_rec (initially
NULL) and never touches the HW open_record.  At socket close,
tls_device_sk_destruct() does:

	if (ctx->open_record)
		destroy_record(ctx->open_record);

so the buffered plaintext is freed without ever being transmitted.

Also, since pending_open_record_frags stays true, would
tls_sw_drain_tx() see tls_is_pending_open_record(ctx) as true and return
-EAGAIN, blocking tls_device_complete_rekey() until something on the SW
side coincidentally clears it?

The tcp_write_collapse_fence() in tls_device_start_rekey() only handles
records already in the TCP write queue; it does not appear to flush this
HW open_record.

> @@ -1103,6 +1164,260 @@ static struct tls_offload_context_tx *alloc_offload_ctx_tx(struct tls_context *c
[ ... ]
> +static int tls_device_start_rekey(struct sock *sk,
> +				  struct tls_context *ctx,
> +				  struct tls_offload_context_tx *offload_ctx,
> +				  struct tls_crypto_info *new_crypto_info)
> +{
> +	bool rekey_pending = test_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> +	bool rekey_failed = test_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
[ ... ]
> +	if (rekey_pending || rekey_failed) {
> +		rc = crypto_aead_setkey(offload_ctx->rekey.sw.aead_send,
> +					key, cipher_desc->key);
> +		if (rc)
> +			return rc;

[High]
Can there be in-flight async crypto requests on this aead_send when
crypto_aead_setkey() is called here?

While a rekey is PENDING, sendmsg has been encrypting via
offload_ctx->rekey.sw.aead_send through tls_sw_sendmsg_locked().  Async
backends can return -EINPROGRESS and complete later.  Other call sites
that mutate AEAD state (tls_sw_release_resources_tx, tls_sw_drain_tx,
tls_set_sw_offload) call tls_encrypt_async_wait() first.  Should the
same wait happen here before re-keying the same tfm, so in-flight
encryptions cannot pick up K2 instead of the K1 they were submitted
with?

[ ... ]
> +	} else {
> +		rc = tls_device_init_rekey_sw(sk, ctx, offload_ctx,
> +					      new_crypto_info);
[ ... ]
> +		WRITE_ONCE(ctx->rekey.boundary_seq, tcp_sk(sk)->write_seq);
> +
> +		/* Prevent a partial record straddling the SW/HW boundary. */
> +		tcp_write_collapse_fence(sk);
> +
> +		ctx->rekey.sw_ctx = &offload_ctx->rekey.sw;
> +		ctx->rekey.cipher_ctx = &offload_ctx->rekey.tx;
> +
> +		set_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
[ ... ]
> +	unsafe_memcpy(&offload_ctx->rekey.crypto_send.info, new_crypto_info,
> +		      cipher_desc->crypto_info,
> +		      /* checked in do_tls_setsockopt_conf */);
> +	memzero_explicit(new_crypto_info, cipher_desc->crypto_info);
> +
> +	return 0;
> +}

[Medium]
new_crypto_info is zeroized after the copy, but is the staging area
itself ever scrubbed?

offload_ctx->rekey.crypto_send.info, offload_ctx->rekey.tx.iv, and
offload_ctx->rekey.tx.rec_seq receive the same key bytes via
unsafe_memcpy/memcpy.  tls_device_complete_rekey() does not scrub them
on success, and tls_device_free_resources_tx() / tls_device_tx_del_task()
kfree the offload_ctx without a memzero_explicit().  Should those sites
zero the staging area to limit the window in which key bytes are
recoverable from slab memory after rotation?

> +static int tls_device_complete_rekey(struct sock *sk, struct tls_context *ctx,
> +				     bool deferred)
> +{
[ ... ]
> +	if (!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
> +		netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> +						TLS_OFFLOAD_CTX_DIR_TX);
> +		set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
> +	}
> +
> +	memcpy(crypto_info_rec_seq(&offload_ctx->rekey.crypto_send.info, cipher_desc),
> +	       offload_ctx->rekey.tx.rec_seq, cipher_desc->rec_seq);
> +
> +	rc = tls_device_dev_add_tx(sk, netdev, &offload_ctx->rekey.crypto_send.info,
> +				   tcp_sk(sk)->write_seq);
> +
> +release_lock:
> +	up_read(&device_offload_lock);
> +
> +	spin_lock_irqsave(&offload_ctx->lock, flags);
> +	memcpy(&rcd_sn, offload_ctx->rekey.tx.rec_seq, sizeof(rcd_sn));
> +	offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
> +	spin_unlock_irqrestore(&offload_ctx->lock, flags);
> +
> +	memcpy(ctx->tx.iv, offload_ctx->rekey.tx.iv,
> +	       cipher_desc->salt + cipher_desc->iv);
> +	memcpy(ctx->tx.rec_seq, offload_ctx->rekey.tx.rec_seq,
> +	       cipher_desc->rec_seq);
> +	unsafe_memcpy(&ctx->crypto_send.info,
> +		      &offload_ctx->rekey.crypto_send.info,
> +		      cipher_desc->crypto_info,
> +		      /* checked during rekey setup */);
> +
> +	if (rc)
> +		goto rekey_fail;
> +
> +	clear_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
> +
> +	key = crypto_info_key(&offload_ctx->rekey.crypto_send.info, cipher_desc);
> +	rc = crypto_aead_setkey(offload_ctx->aead_send, key, cipher_desc->key);
> +	if (rc)
> +		goto rekey_fail;

[High]
Can these unconditional copies of the new key material into ctx->tx.iv,
ctx->tx.rec_seq, and ctx->crypto_send.info before the rc check leave
inconsistent state on failure?

If tls_device_dev_add_tx() failed, control reaches rekey_fail with:

  - HW context removed (TLS_TX_DEV_CLOSED set)
  - ctx->crypto_send.info advertising the NEW key
  - ctx->tx.iv and ctx->tx.rec_seq holding the NEW salt/IV/rec_seq
  - offload_ctx->aead_send still holding the OLD key (the setkey for it
    is reached only after the rc check)

Userspace getsockopt(SOL_TLS, TLS_TX) would then read NEW key bytes that
no kernel AEAD ever absorbed.  Any subsequent fallback that builds IVs
from ctx->tx and encrypts via offload_ctx->aead_send would mix NEW salt
with the OLD key.

Could the memcpy block move below the rc check?

[High]
On the success path, is there a window where tls_device_down() sees
stale TLS_TX_DEV_CLOSED?

The success path here:

	if (!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
		netdev->tlsdev_ops->tls_dev_del(...);
		set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
	}
	...
	rc = tls_device_dev_add_tx(...);   /* installs NEW HW ctx */

release_lock:
	up_read(&device_offload_lock);
	...
	if (rc) goto rekey_fail;
	clear_bit(TLS_TX_DEV_CLOSED, &ctx->flags);

Between dropping device_offload_lock and clear_bit(), can
tls_device_down() take down_write(&device_offload_lock) and run:

	if (ctx->tx_conf == TLS_HW &&
	    !test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
		netdev->tlsdev_ops->tls_dev_del(...);
		set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
	}

It sees TLS_TX_DEV_CLOSED still set, skips tls_dev_del() for the
freshly-installed NEW HW context, and proceeds to clear ctx->netdev.
Would the new NIC-side TLS state then be leaked until the netdev is
unregistered?

Should TLS_TX_DEV_CLOSED be cleared while still holding
device_offload_lock?

> +
> +	/* Start marker: the NIC passes through everything before
> +	 * write_seq unencrypted (already SW-encrypted during rekey),
> +	 * same as during initial offload setup.
> +	 */
> +	tls_device_commit_start_marker(sk, offload_ctx, start_marker_record);
[ ... ]
> +rekey_fail:
> +	kfree(start_marker_record);
> +	set_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
> +	clear_bit(TLS_TX_REKEY_READY, &ctx->flags);
> +	clear_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> +	if (deferred)
> +		TLS_DEC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> +	TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYFALLBACK);
> +
> +	return 0;
> +}

[Medium]
After REKEY_FAILED, the connection is structurally encrypting in SW even
though ctx->tx_conf stays TLS_HW.  TLSCURRTXDEVICE is never decremented
and TLSCURRTXSW is never incremented in this path.  Should the
current-mode gauges in /proc/net/tls_stat reflect the actual encryption
mode for these connections?

> +static int tls_set_device_offload_rekey(struct sock *sk,
> +					struct tls_context *ctx,
> +					struct net_device *netdev,
> +					struct tls_crypto_info *new_crypto_info)
> +{
> +	struct tls_offload_context_tx *offload_ctx = tls_offload_ctx_tx(ctx);
> +	bool rekey_pending = test_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> +	bool rekey_failed = test_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
> +	bool defer = true;
> +	int rc;
> +
> +	if (!rekey_pending && !rekey_failed)
> +		defer = tls_has_unacked_records(offload_ctx);
> +
> +	down_read(&device_offload_lock);
> +
> +	rc = tls_device_start_rekey(sk, ctx, offload_ctx, new_crypto_info);
> +	if (rc) {
> +		up_read(&device_offload_lock);
> +		return rc;
> +	}
> +
> +	up_read(&device_offload_lock);
> +
> +	if (defer) {
> +		if (!rekey_pending)
> +			TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> +		else
> +			TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYOK);
> +		return 0;
> +	}
> +
> +	return tls_device_complete_rekey(sk, ctx, false);
> +}

[Medium]
On the rekey_pending path here, TLSTXREKEYOK is incremented before any
HW reinstall actually happens.  Combined with the increment in
tls_device_complete_rekey() on success and the increment in
tls_device_free_resources_tx() when PENDING is set at close, can a
single logical successful rekey produce more than one TLSTXREKEYOK
increment?

[Medium]
tls_has_unacked_records() inspects offload_ctx->records_list only.  Is
that sufficient to decide whether to defer?  pending_open_record_frags
or offload_ctx->open_record can still hold pre-rekey state, in which
case the !defer path falls into tls_device_complete_rekey(deferred=false),
tls_sw_drain_tx() returns -EAGAIN, setsockopt returns the error to
userspace via err_crypto_info, and none of the state transitions from
tls_device_start_rekey() (PENDING bit, sw_ctx, cipher_ctx, boundary_seq,
sk_validate_xmit_skb) are rolled back.  Should the failure path either
roll those back or transition to a defined recovery state?

[ ... ]

> diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
> index 1110f7ac6bcb..64ac4ef4012b 100644
> --- a/net/tls/tls_device_fallback.c
> +++ b/net/tls/tls_device_fallback.c
> @@ -435,6 +435,30 @@ struct sk_buff *tls_validate_xmit_skb_sw(struct sock *sk,
>  	return tls_sw_fallback(sk, skb);
>  }
>  
> +struct sk_buff *tls_validate_xmit_skb_rekey(struct sock *sk,
> +					    struct net_device *dev,
> +					    struct sk_buff *skb)
> +{
> +	struct tls_context *tls_ctx = tls_get_ctx(sk);
> +	u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> +	u32 boundary_seq;
> +
> +	if (test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags))
> +		return skb;
> +
> +	/* If this packet is at or after the rekey boundary, it's already
> +	 * SW-encrypted with the new key, pass through unchanged
> +	 */
> +	boundary_seq = READ_ONCE(tls_ctx->rekey.boundary_seq);
> +	if (!before(tcp_seq, boundary_seq))
> +		return skb;
> +
> +	/* Packet before boundary means retransmit of old data,
> +	 * use SW fallback with the old key
> +	 */
> +	return tls_sw_fallback(sk, skb);
> +}

[Critical]
Can this leak plaintext on the wire after REKEY_FAILED?

The first branch returns the skb unchanged whenever
TLS_TX_REKEY_FAILED is set, regardless of where tcp_seq is relative to
boundary_seq.

After tls_device_complete_rekey() has called tls_dev_del() and then
landed in rekey_fail, there is no NIC TLS context for the socket.
SKBs from before the boundary are still sitting in the socket write
queue with plaintext payload, because HW-offload semantics expected the
NIC to encrypt them at TX time.  When TCP retransmits one of those old
skbs:

	tcp_seq < boundary_seq, but TLS_TX_REKEY_FAILED is set
	-> early return skb unchanged
	-> NIC has no TLS context for this flow
	-> driver transmits the skb as-is

Should the FAILED branch still route pre-boundary packets through
tls_sw_fallback() with the old key, the same way the non-failed branch
does, so that pre-boundary retransmits are encrypted?

> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index fd04857fa0ab..2548ad2b2219 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -371,6 +371,8 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
>  
>  	if (ctx->tx_conf == TLS_SW)
>  		tls_sw_cancel_work_tx(ctx);
> +	else if (ctx->tx_conf == TLS_HW && ctx->rekey.sw_ctx)
> +		tls_sw_cancel_work_tx(ctx);
>  
>  	lock_sock(sk);

[High]
Is reading ctx->rekey.sw_ctx before lock_sock(sk) safe?

ctx->rekey.sw_ctx is published by tls_device_start_rekey() under
lock_sock.  Here close reads it before taking lock_sock, so a concurrent
setsockopt rekey can publish sw_ctx after this read but before close
acquires the socket lock.  Close then proceeds without canceling the
delayed tx_work, while later cleanup
(tls_device_free_resources_tx -> tls_sw_release_resources_tx) frees
aead_send.

If a tx_work_handler scheduled by the rekey path runs after that, can it
touch the freed delayed_work or aead_send?

Should the read be moved under lock_sock, or should
tls_sw_release_resources_tx() internally cancel the delayed work?

[ ... ]
-- 
pw-bot: cr

  reply	other threads:[~2026-05-25 21:16 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-15 21:27 [PATCH net-next v14 0/9] tls: Add TLS 1.3 hardware offload support Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 1/9] net: tls: reject TLS 1.3 offload in chcr_ktls and nfp drivers Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 2/9] net/mlx5e: add TLS 1.3 hardware offload support Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 3/9] tls: " Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 4/9] tls: split tls_set_sw_offload into init and finalize stages Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 5/9] tls: prep helpers and refactors for HW offload KeyUpdate Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 6/9] tls: device: add TX KeyUpdate support Rishikesh Jethwani
2026-05-25 21:16   ` Jakub Kicinski [this message]
2026-05-15 21:27 ` [PATCH v14 7/9] tls: device: add RX " Rishikesh Jethwani
2026-05-25 21:16   ` Jakub Kicinski
2026-05-15 21:27 ` [PATCH v14 8/9] tls: device: add tracepoints for RX KeyUpdate path Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 9/9] selftests: net: add TLS hardware offload test Rishikesh Jethwani
2026-05-25 21:16   ` Jakub Kicinski
2026-05-17 22:21 ` [PATCH net-next v14 0/9] tls: Add TLS 1.3 hardware offload support Sabrina Dubroca

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260525211608.2508449-1-kuba@kernel.org \
    --to=kuba@kernel.org \
    --cc=borisp@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=leon@kernel.org \
    --cc=mbloch@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=rjethwani@purestorage.com \
    --cc=saeedm@nvidia.com \
    --cc=sd@queasysnail.net \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox