From: Jakub Kicinski <kuba@kernel.org>
To: rjethwani@purestorage.com
Cc: Jakub Kicinski <kuba@kernel.org>,
netdev@vger.kernel.org, saeedm@nvidia.com, tariqt@nvidia.com,
mbloch@nvidia.com, borisp@nvidia.com, john.fastabend@gmail.com,
sd@queasysnail.net, davem@davemloft.net, pabeni@redhat.com,
edumazet@google.com, leon@kernel.org
Subject: Re: [PATCH v14 6/9] tls: device: add TX KeyUpdate support
Date: Mon, 25 May 2026 14:16:08 -0700 [thread overview]
Message-ID: <20260525211608.2508449-1-kuba@kernel.org> (raw)
In-Reply-To: <20260515212715.3151307-7-rjethwani@purestorage.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
tls: device: add TX KeyUpdate support
This adds TX-side TLS 1.3 KeyUpdate support for HW-offloaded sockets by
installing a temporary SW context with the new key, deferring the HW
reinstall until all old-key records have been ACKed, and falling back to
permanent SW mode if the HW reinstall fails.
Thanks for the patch. A few questions inline.
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 2512a3799b21..c1085873ee01 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
[ ... ]
> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> index 7a98d2f6cbd3..c435b3450872 100644
> --- a/net/tls/tls_device.c
> +++ b/net/tls/tls_device.c
[ ... ]
> @@ -222,6 +246,19 @@ static void tls_tcp_clean_acked(struct sock *sk, u32 acked_seq)
> }
>
> ctx->unacked_record_sn += deleted_records;
> +
> + /* Once all old-key HW records are ACKed, set REKEY_READY to
> + * let sendmsg know it can finish the rekey and switch back
> + * to HW offload.
> + */
> + if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags) &&
> + !test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags)) {
> + u32 boundary_seq = READ_ONCE(tls_ctx->rekey.boundary_seq);
> +
> + if (!before(acked_seq, boundary_seq))
> + set_bit(TLS_TX_REKEY_READY, &tls_ctx->flags);
> + }
> +
> spin_unlock_irqrestore(&ctx->lock, flags);
> }
[Medium]
Is the ordering between the boundary_seq publish and the PENDING bit
publish in tls_device_start_rekey() strong enough for this read site?
In tls_device_start_rekey() the sequence is:
WRITE_ONCE(ctx->rekey.boundary_seq, tcp_sk(sk)->write_seq);
...
set_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
set_bit() does not document any implicit memory barrier
(Documentation/atomic_bitops.txt). On weakly-ordered architectures, can
this reader observe PENDING=1 with a stale boundary_seq, set REKEY_READY
prematurely, and let tls_device_complete_rekey() run before the old-key
records are actually ACKed?
Would an smp_wmb() (or smp_mb__before_atomic()) between the write of
boundary_seq and the set_bit, or moving both writes under
offload_ctx->lock that this reader already holds, be appropriate?
> @@ -253,6 +290,14 @@ void tls_device_free_resources_tx(struct sock *sk)
> struct tls_context *tls_ctx = tls_get_ctx(sk);
>
> tls_free_partial_record(sk, tls_ctx);
> +
> + if (unlikely(tls_ctx->rekey.sw_ctx))
> + tls_sw_release_resources_tx(sk);
> +
> + if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags)) {
> + TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYOK);
> + TLS_DEC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> + }
> }
[Medium]
Should a rekey that was still PENDING at socket teardown count as
TlsTxRekeyOk? The rekey never completed, but the counter is bumped as
if it had succeeded. Should this be a separate "aborted" counter, or
just omit the OK increment in this path?
> @@ -624,6 +672,19 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> goto out;
> }
>
> + /* Old-key records all ACKed; switch back to HW. */
> + if (test_bit(TLS_TX_REKEY_READY, &tls_ctx->flags))
> + tls_device_complete_rekey(sk, tls_ctx, true);
[Medium]
Should the return value of tls_device_complete_rekey() be checked here?
tls_device_complete_rekey() can return non-zero from its early
tls_sw_drain_tx() path:
rc = tls_sw_drain_tx(sk, ctx);
if (rc)
return rc;
In that early-return case none of REKEY_PENDING, REKEY_READY, or
REKEY_FAILED is updated. The caller falls through to
tls_sw_sendmsg_locked(), and on the next sendmsg the same READY bit is
still set, so complete_rekey is retried, drain returns -EAGAIN again,
and so on. Is there a way out of this state? Should persistent drain
failure transition to FAILED, or at least bump a counter so it is
observable?
> +
> + /* Use SW path if rekey is in progress (PENDING) or if HW rekey
> + * failed (FAILED).
> + */
> + if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags) ||
> + test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags)) {
> + rc = tls_sw_sendmsg_locked(sk, msg, size);
> + goto out;
> + }
> +
> rc = tls_push_data(sk, &msg->msg_iter, size, msg->msg_flags,
> record_type);
[High]
What happens to a HW open_record across this transition?
If userspace previously sent with MSG_MORE, offload_ctx->open_record
holds buffered plaintext and tls_ctx->pending_open_record_frags is true.
A subsequent setsockopt KeyUpdate then routes future sendmsg calls into
tls_sw_sendmsg_locked(), which builds into sw_ctx->open_rec (initially
NULL) and never touches the HW open_record. At socket close,
tls_device_sk_destruct() does:
if (ctx->open_record)
destroy_record(ctx->open_record);
so the buffered plaintext is freed without ever being transmitted.
Also, since pending_open_record_frags stays true, would
tls_sw_drain_tx() see tls_is_pending_open_record(ctx) as true and return
-EAGAIN, blocking tls_device_complete_rekey() until something on the SW
side coincidentally clears it?
The tcp_write_collapse_fence() in tls_device_start_rekey() only handles
records already in the TCP write queue; it does not appear to flush this
HW open_record.
> @@ -1103,6 +1164,260 @@ static struct tls_offload_context_tx *alloc_offload_ctx_tx(struct tls_context *c
[ ... ]
> +static int tls_device_start_rekey(struct sock *sk,
> + struct tls_context *ctx,
> + struct tls_offload_context_tx *offload_ctx,
> + struct tls_crypto_info *new_crypto_info)
> +{
> + bool rekey_pending = test_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> + bool rekey_failed = test_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
[ ... ]
> + if (rekey_pending || rekey_failed) {
> + rc = crypto_aead_setkey(offload_ctx->rekey.sw.aead_send,
> + key, cipher_desc->key);
> + if (rc)
> + return rc;
[High]
Can there be in-flight async crypto requests on this aead_send when
crypto_aead_setkey() is called here?
While a rekey is PENDING, sendmsg has been encrypting via
offload_ctx->rekey.sw.aead_send through tls_sw_sendmsg_locked(). Async
backends can return -EINPROGRESS and complete later. Other call sites
that mutate AEAD state (tls_sw_release_resources_tx, tls_sw_drain_tx,
tls_set_sw_offload) call tls_encrypt_async_wait() first. Should the
same wait happen here before re-keying the same tfm, so in-flight
encryptions cannot pick up K2 instead of the K1 they were submitted
with?
[ ... ]
> + } else {
> + rc = tls_device_init_rekey_sw(sk, ctx, offload_ctx,
> + new_crypto_info);
[ ... ]
> + WRITE_ONCE(ctx->rekey.boundary_seq, tcp_sk(sk)->write_seq);
> +
> + /* Prevent a partial record straddling the SW/HW boundary. */
> + tcp_write_collapse_fence(sk);
> +
> + ctx->rekey.sw_ctx = &offload_ctx->rekey.sw;
> + ctx->rekey.cipher_ctx = &offload_ctx->rekey.tx;
> +
> + set_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
[ ... ]
> + unsafe_memcpy(&offload_ctx->rekey.crypto_send.info, new_crypto_info,
> + cipher_desc->crypto_info,
> + /* checked in do_tls_setsockopt_conf */);
> + memzero_explicit(new_crypto_info, cipher_desc->crypto_info);
> +
> + return 0;
> +}
[Medium]
new_crypto_info is zeroized after the copy, but is the staging area
itself ever scrubbed?
offload_ctx->rekey.crypto_send.info, offload_ctx->rekey.tx.iv, and
offload_ctx->rekey.tx.rec_seq receive the same key bytes via
unsafe_memcpy/memcpy. tls_device_complete_rekey() does not scrub them
on success, and tls_device_free_resources_tx() / tls_device_tx_del_task()
kfree the offload_ctx without a memzero_explicit(). Should those sites
zero the staging area to limit the window in which key bytes are
recoverable from slab memory after rotation?
> +static int tls_device_complete_rekey(struct sock *sk, struct tls_context *ctx,
> + bool deferred)
> +{
[ ... ]
> + if (!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
> + netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> + TLS_OFFLOAD_CTX_DIR_TX);
> + set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
> + }
> +
> + memcpy(crypto_info_rec_seq(&offload_ctx->rekey.crypto_send.info, cipher_desc),
> + offload_ctx->rekey.tx.rec_seq, cipher_desc->rec_seq);
> +
> + rc = tls_device_dev_add_tx(sk, netdev, &offload_ctx->rekey.crypto_send.info,
> + tcp_sk(sk)->write_seq);
> +
> +release_lock:
> + up_read(&device_offload_lock);
> +
> + spin_lock_irqsave(&offload_ctx->lock, flags);
> + memcpy(&rcd_sn, offload_ctx->rekey.tx.rec_seq, sizeof(rcd_sn));
> + offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
> + spin_unlock_irqrestore(&offload_ctx->lock, flags);
> +
> + memcpy(ctx->tx.iv, offload_ctx->rekey.tx.iv,
> + cipher_desc->salt + cipher_desc->iv);
> + memcpy(ctx->tx.rec_seq, offload_ctx->rekey.tx.rec_seq,
> + cipher_desc->rec_seq);
> + unsafe_memcpy(&ctx->crypto_send.info,
> + &offload_ctx->rekey.crypto_send.info,
> + cipher_desc->crypto_info,
> + /* checked during rekey setup */);
> +
> + if (rc)
> + goto rekey_fail;
> +
> + clear_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
> +
> + key = crypto_info_key(&offload_ctx->rekey.crypto_send.info, cipher_desc);
> + rc = crypto_aead_setkey(offload_ctx->aead_send, key, cipher_desc->key);
> + if (rc)
> + goto rekey_fail;
[High]
Can these unconditional copies of the new key material into ctx->tx.iv,
ctx->tx.rec_seq, and ctx->crypto_send.info before the rc check leave
inconsistent state on failure?
If tls_device_dev_add_tx() failed, control reaches rekey_fail with:
- HW context removed (TLS_TX_DEV_CLOSED set)
- ctx->crypto_send.info advertising the NEW key
- ctx->tx.iv and ctx->tx.rec_seq holding the NEW salt/IV/rec_seq
- offload_ctx->aead_send still holding the OLD key (the setkey for it
is reached only after the rc check)
Userspace getsockopt(SOL_TLS, TLS_TX) would then read NEW key bytes that
no kernel AEAD ever absorbed. Any subsequent fallback that builds IVs
from ctx->tx and encrypts via offload_ctx->aead_send would mix NEW salt
with the OLD key.
Could the memcpy block move below the rc check?
[High]
On the success path, is there a window where tls_device_down() sees
stale TLS_TX_DEV_CLOSED?
The success path here:
if (!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
netdev->tlsdev_ops->tls_dev_del(...);
set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
}
...
rc = tls_device_dev_add_tx(...); /* installs NEW HW ctx */
release_lock:
up_read(&device_offload_lock);
...
if (rc) goto rekey_fail;
clear_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
Between dropping device_offload_lock and clear_bit(), can
tls_device_down() take down_write(&device_offload_lock) and run:
if (ctx->tx_conf == TLS_HW &&
!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
netdev->tlsdev_ops->tls_dev_del(...);
set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
}
It sees TLS_TX_DEV_CLOSED still set, skips tls_dev_del() for the
freshly-installed NEW HW context, and proceeds to clear ctx->netdev.
Would the new NIC-side TLS state then be leaked until the netdev is
unregistered?
Should TLS_TX_DEV_CLOSED be cleared while still holding
device_offload_lock?
> +
> + /* Start marker: the NIC passes through everything before
> + * write_seq unencrypted (already SW-encrypted during rekey),
> + * same as during initial offload setup.
> + */
> + tls_device_commit_start_marker(sk, offload_ctx, start_marker_record);
[ ... ]
> +rekey_fail:
> + kfree(start_marker_record);
> + set_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
> + clear_bit(TLS_TX_REKEY_READY, &ctx->flags);
> + clear_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> + if (deferred)
> + TLS_DEC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> + TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYFALLBACK);
> +
> + return 0;
> +}
[Medium]
After REKEY_FAILED, the connection is structurally encrypting in SW even
though ctx->tx_conf stays TLS_HW. TLSCURRTXDEVICE is never decremented
and TLSCURRTXSW is never incremented in this path. Should the
current-mode gauges in /proc/net/tls_stat reflect the actual encryption
mode for these connections?
> +static int tls_set_device_offload_rekey(struct sock *sk,
> + struct tls_context *ctx,
> + struct net_device *netdev,
> + struct tls_crypto_info *new_crypto_info)
> +{
> + struct tls_offload_context_tx *offload_ctx = tls_offload_ctx_tx(ctx);
> + bool rekey_pending = test_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> + bool rekey_failed = test_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
> + bool defer = true;
> + int rc;
> +
> + if (!rekey_pending && !rekey_failed)
> + defer = tls_has_unacked_records(offload_ctx);
> +
> + down_read(&device_offload_lock);
> +
> + rc = tls_device_start_rekey(sk, ctx, offload_ctx, new_crypto_info);
> + if (rc) {
> + up_read(&device_offload_lock);
> + return rc;
> + }
> +
> + up_read(&device_offload_lock);
> +
> + if (defer) {
> + if (!rekey_pending)
> + TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> + else
> + TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYOK);
> + return 0;
> + }
> +
> + return tls_device_complete_rekey(sk, ctx, false);
> +}
[Medium]
On the rekey_pending path here, TLSTXREKEYOK is incremented before any
HW reinstall actually happens. Combined with the increment in
tls_device_complete_rekey() on success and the increment in
tls_device_free_resources_tx() when PENDING is set at close, can a
single logical successful rekey produce more than one TLSTXREKEYOK
increment?
[Medium]
tls_has_unacked_records() inspects offload_ctx->records_list only. Is
that sufficient to decide whether to defer? pending_open_record_frags
or offload_ctx->open_record can still hold pre-rekey state, in which
case the !defer path falls into tls_device_complete_rekey(deferred=false),
tls_sw_drain_tx() returns -EAGAIN, setsockopt returns the error to
userspace via err_crypto_info, and none of the state transitions from
tls_device_start_rekey() (PENDING bit, sw_ctx, cipher_ctx, boundary_seq,
sk_validate_xmit_skb) are rolled back. Should the failure path either
roll those back or transition to a defined recovery state?
[ ... ]
> diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
> index 1110f7ac6bcb..64ac4ef4012b 100644
> --- a/net/tls/tls_device_fallback.c
> +++ b/net/tls/tls_device_fallback.c
> @@ -435,6 +435,30 @@ struct sk_buff *tls_validate_xmit_skb_sw(struct sock *sk,
> return tls_sw_fallback(sk, skb);
> }
>
> +struct sk_buff *tls_validate_xmit_skb_rekey(struct sock *sk,
> + struct net_device *dev,
> + struct sk_buff *skb)
> +{
> + struct tls_context *tls_ctx = tls_get_ctx(sk);
> + u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> + u32 boundary_seq;
> +
> + if (test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags))
> + return skb;
> +
> + /* If this packet is at or after the rekey boundary, it's already
> + * SW-encrypted with the new key, pass through unchanged
> + */
> + boundary_seq = READ_ONCE(tls_ctx->rekey.boundary_seq);
> + if (!before(tcp_seq, boundary_seq))
> + return skb;
> +
> + /* Packet before boundary means retransmit of old data,
> + * use SW fallback with the old key
> + */
> + return tls_sw_fallback(sk, skb);
> +}
[Critical]
Can this leak plaintext on the wire after REKEY_FAILED?
The first branch returns the skb unchanged whenever
TLS_TX_REKEY_FAILED is set, regardless of where tcp_seq is relative to
boundary_seq.
After tls_device_complete_rekey() has called tls_dev_del() and then
landed in rekey_fail, there is no NIC TLS context for the socket.
SKBs from before the boundary are still sitting in the socket write
queue with plaintext payload, because HW-offload semantics expected the
NIC to encrypt them at TX time. When TCP retransmits one of those old
skbs:
tcp_seq < boundary_seq, but TLS_TX_REKEY_FAILED is set
-> early return skb unchanged
-> NIC has no TLS context for this flow
-> driver transmits the skb as-is
Should the FAILED branch still route pre-boundary packets through
tls_sw_fallback() with the old key, the same way the non-failed branch
does, so that pre-boundary retransmits are encrypted?
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index fd04857fa0ab..2548ad2b2219 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -371,6 +371,8 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
>
> if (ctx->tx_conf == TLS_SW)
> tls_sw_cancel_work_tx(ctx);
> + else if (ctx->tx_conf == TLS_HW && ctx->rekey.sw_ctx)
> + tls_sw_cancel_work_tx(ctx);
>
> lock_sock(sk);
[High]
Is reading ctx->rekey.sw_ctx before lock_sock(sk) safe?
ctx->rekey.sw_ctx is published by tls_device_start_rekey() under
lock_sock. Here close reads it before taking lock_sock, so a concurrent
setsockopt rekey can publish sw_ctx after this read but before close
acquires the socket lock. Close then proceeds without canceling the
delayed tx_work, while later cleanup
(tls_device_free_resources_tx -> tls_sw_release_resources_tx) frees
aead_send.
If a tx_work_handler scheduled by the rekey path runs after that, can it
touch the freed delayed_work or aead_send?
Should the read be moved under lock_sock, or should
tls_sw_release_resources_tx() internally cancel the delayed work?
[ ... ]
--
pw-bot: cr
next prev parent reply other threads:[~2026-05-25 21:16 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-15 21:27 [PATCH net-next v14 0/9] tls: Add TLS 1.3 hardware offload support Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 1/9] net: tls: reject TLS 1.3 offload in chcr_ktls and nfp drivers Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 2/9] net/mlx5e: add TLS 1.3 hardware offload support Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 3/9] tls: " Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 4/9] tls: split tls_set_sw_offload into init and finalize stages Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 5/9] tls: prep helpers and refactors for HW offload KeyUpdate Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 6/9] tls: device: add TX KeyUpdate support Rishikesh Jethwani
2026-05-25 21:16 ` Jakub Kicinski [this message]
2026-05-15 21:27 ` [PATCH v14 7/9] tls: device: add RX " Rishikesh Jethwani
2026-05-25 21:16 ` Jakub Kicinski
2026-05-15 21:27 ` [PATCH v14 8/9] tls: device: add tracepoints for RX KeyUpdate path Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 9/9] selftests: net: add TLS hardware offload test Rishikesh Jethwani
2026-05-25 21:16 ` Jakub Kicinski
2026-05-17 22:21 ` [PATCH net-next v14 0/9] tls: Add TLS 1.3 hardware offload support Sabrina Dubroca
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260525211608.2508449-1-kuba@kernel.org \
--to=kuba@kernel.org \
--cc=borisp@nvidia.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=john.fastabend@gmail.com \
--cc=leon@kernel.org \
--cc=mbloch@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=rjethwani@purestorage.com \
--cc=saeedm@nvidia.com \
--cc=sd@queasysnail.net \
--cc=tariqt@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.