From: Jakub Kicinski <kuba@kernel.org>
To: rjethwani@purestorage.com
Cc: Jakub Kicinski <kuba@kernel.org>,
netdev@vger.kernel.org, saeedm@nvidia.com, tariqt@nvidia.com,
mbloch@nvidia.com, borisp@nvidia.com, john.fastabend@gmail.com,
sd@queasysnail.net, davem@davemloft.net, pabeni@redhat.com,
edumazet@google.com, leon@kernel.org
Subject: Re: [PATCH v14 6/9] tls: device: add TX KeyUpdate support
Date: Mon, 25 May 2026 14:16:08 -0700 [thread overview]
Message-ID: <20260525211608.2508449-1-kuba@kernel.org> (raw)
In-Reply-To: <20260515212715.3151307-7-rjethwani@purestorage.com>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
tls: device: add TX KeyUpdate support
This adds TX-side TLS 1.3 KeyUpdate support for HW-offloaded sockets by
installing a temporary SW context with the new key, deferring the HW
reinstall until all old-key records have been ACKed, and falling back to
permanent SW mode if the HW reinstall fails.
Thanks for the patch. A few questions inline.
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 2512a3799b21..c1085873ee01 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
[ ... ]
> diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
> index 7a98d2f6cbd3..c435b3450872 100644
> --- a/net/tls/tls_device.c
> +++ b/net/tls/tls_device.c
[ ... ]
> @@ -222,6 +246,19 @@ static void tls_tcp_clean_acked(struct sock *sk, u32 acked_seq)
> }
>
> ctx->unacked_record_sn += deleted_records;
> +
> + /* Once all old-key HW records are ACKed, set REKEY_READY to
> + * let sendmsg know it can finish the rekey and switch back
> + * to HW offload.
> + */
> + if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags) &&
> + !test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags)) {
> + u32 boundary_seq = READ_ONCE(tls_ctx->rekey.boundary_seq);
> +
> + if (!before(acked_seq, boundary_seq))
> + set_bit(TLS_TX_REKEY_READY, &tls_ctx->flags);
> + }
> +
> spin_unlock_irqrestore(&ctx->lock, flags);
> }
[Medium]
Is the ordering between the boundary_seq publish and the PENDING bit
publish in tls_device_start_rekey() strong enough for this read site?
In tls_device_start_rekey() the sequence is:
WRITE_ONCE(ctx->rekey.boundary_seq, tcp_sk(sk)->write_seq);
...
set_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
set_bit() does not document any implicit memory barrier
(Documentation/atomic_bitops.txt). On weakly-ordered architectures, can
this reader observe PENDING=1 with a stale boundary_seq, set REKEY_READY
prematurely, and let tls_device_complete_rekey() run before the old-key
records are actually ACKed?
Would an smp_wmb() (or smp_mb__before_atomic()) between the write of
boundary_seq and the set_bit, or moving both writes under
offload_ctx->lock that this reader already holds, be appropriate?
> @@ -253,6 +290,14 @@ void tls_device_free_resources_tx(struct sock *sk)
> struct tls_context *tls_ctx = tls_get_ctx(sk);
>
> tls_free_partial_record(sk, tls_ctx);
> +
> + if (unlikely(tls_ctx->rekey.sw_ctx))
> + tls_sw_release_resources_tx(sk);
> +
> + if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags)) {
> + TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYOK);
> + TLS_DEC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> + }
> }
[Medium]
Should a rekey that was still PENDING at socket teardown count as
TlsTxRekeyOk? The rekey never completed, but the counter is bumped as
if it had succeeded. Should this be a separate "aborted" counter, or
just omit the OK increment in this path?
> @@ -624,6 +672,19 @@ int tls_device_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> goto out;
> }
>
> + /* Old-key records all ACKed; switch back to HW. */
> + if (test_bit(TLS_TX_REKEY_READY, &tls_ctx->flags))
> + tls_device_complete_rekey(sk, tls_ctx, true);
[Medium]
Should the return value of tls_device_complete_rekey() be checked here?
tls_device_complete_rekey() can return non-zero from its early
tls_sw_drain_tx() path:
rc = tls_sw_drain_tx(sk, ctx);
if (rc)
return rc;
In that early-return case none of REKEY_PENDING, REKEY_READY, or
REKEY_FAILED is updated. The caller falls through to
tls_sw_sendmsg_locked(), and on the next sendmsg the same READY bit is
still set, so complete_rekey is retried, drain returns -EAGAIN again,
and so on. Is there a way out of this state? Should persistent drain
failure transition to FAILED, or at least bump a counter so it is
observable?
> +
> + /* Use SW path if rekey is in progress (PENDING) or if HW rekey
> + * failed (FAILED).
> + */
> + if (test_bit(TLS_TX_REKEY_PENDING, &tls_ctx->flags) ||
> + test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags)) {
> + rc = tls_sw_sendmsg_locked(sk, msg, size);
> + goto out;
> + }
> +
> rc = tls_push_data(sk, &msg->msg_iter, size, msg->msg_flags,
> record_type);
[High]
What happens to a HW open_record across this transition?
If userspace previously sent with MSG_MORE, offload_ctx->open_record
holds buffered plaintext and tls_ctx->pending_open_record_frags is true.
A subsequent setsockopt KeyUpdate then routes future sendmsg calls into
tls_sw_sendmsg_locked(), which builds into sw_ctx->open_rec (initially
NULL) and never touches the HW open_record. At socket close,
tls_device_sk_destruct() does:
if (ctx->open_record)
destroy_record(ctx->open_record);
so the buffered plaintext is freed without ever being transmitted.
Also, since pending_open_record_frags stays true, would
tls_sw_drain_tx() see tls_is_pending_open_record(ctx) as true and return
-EAGAIN, blocking tls_device_complete_rekey() until something on the SW
side coincidentally clears it?
The tcp_write_collapse_fence() in tls_device_start_rekey() only handles
records already in the TCP write queue; it does not appear to flush this
HW open_record.
> @@ -1103,6 +1164,260 @@ static struct tls_offload_context_tx *alloc_offload_ctx_tx(struct tls_context *c
[ ... ]
> +static int tls_device_start_rekey(struct sock *sk,
> + struct tls_context *ctx,
> + struct tls_offload_context_tx *offload_ctx,
> + struct tls_crypto_info *new_crypto_info)
> +{
> + bool rekey_pending = test_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> + bool rekey_failed = test_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
[ ... ]
> + if (rekey_pending || rekey_failed) {
> + rc = crypto_aead_setkey(offload_ctx->rekey.sw.aead_send,
> + key, cipher_desc->key);
> + if (rc)
> + return rc;
[High]
Can there be in-flight async crypto requests on this aead_send when
crypto_aead_setkey() is called here?
While a rekey is PENDING, sendmsg has been encrypting via
offload_ctx->rekey.sw.aead_send through tls_sw_sendmsg_locked(). Async
backends can return -EINPROGRESS and complete later. Other call sites
that mutate AEAD state (tls_sw_release_resources_tx, tls_sw_drain_tx,
tls_set_sw_offload) call tls_encrypt_async_wait() first. Should the
same wait happen here before re-keying the same tfm, so in-flight
encryptions cannot pick up K2 instead of the K1 they were submitted
with?
[ ... ]
> + } else {
> + rc = tls_device_init_rekey_sw(sk, ctx, offload_ctx,
> + new_crypto_info);
[ ... ]
> + WRITE_ONCE(ctx->rekey.boundary_seq, tcp_sk(sk)->write_seq);
> +
> + /* Prevent a partial record straddling the SW/HW boundary. */
> + tcp_write_collapse_fence(sk);
> +
> + ctx->rekey.sw_ctx = &offload_ctx->rekey.sw;
> + ctx->rekey.cipher_ctx = &offload_ctx->rekey.tx;
> +
> + set_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
[ ... ]
> + unsafe_memcpy(&offload_ctx->rekey.crypto_send.info, new_crypto_info,
> + cipher_desc->crypto_info,
> + /* checked in do_tls_setsockopt_conf */);
> + memzero_explicit(new_crypto_info, cipher_desc->crypto_info);
> +
> + return 0;
> +}
[Medium]
new_crypto_info is zeroized after the copy, but is the staging area
itself ever scrubbed?
offload_ctx->rekey.crypto_send.info, offload_ctx->rekey.tx.iv, and
offload_ctx->rekey.tx.rec_seq receive the same key bytes via
unsafe_memcpy/memcpy. tls_device_complete_rekey() does not scrub them
on success, and tls_device_free_resources_tx() / tls_device_tx_del_task()
kfree the offload_ctx without a memzero_explicit(). Should those sites
zero the staging area to limit the window in which key bytes are
recoverable from slab memory after rotation?
> +static int tls_device_complete_rekey(struct sock *sk, struct tls_context *ctx,
> + bool deferred)
> +{
[ ... ]
> + if (!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
> + netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
> + TLS_OFFLOAD_CTX_DIR_TX);
> + set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
> + }
> +
> + memcpy(crypto_info_rec_seq(&offload_ctx->rekey.crypto_send.info, cipher_desc),
> + offload_ctx->rekey.tx.rec_seq, cipher_desc->rec_seq);
> +
> + rc = tls_device_dev_add_tx(sk, netdev, &offload_ctx->rekey.crypto_send.info,
> + tcp_sk(sk)->write_seq);
> +
> +release_lock:
> + up_read(&device_offload_lock);
> +
> + spin_lock_irqsave(&offload_ctx->lock, flags);
> + memcpy(&rcd_sn, offload_ctx->rekey.tx.rec_seq, sizeof(rcd_sn));
> + offload_ctx->unacked_record_sn = be64_to_cpu(rcd_sn) - 1;
> + spin_unlock_irqrestore(&offload_ctx->lock, flags);
> +
> + memcpy(ctx->tx.iv, offload_ctx->rekey.tx.iv,
> + cipher_desc->salt + cipher_desc->iv);
> + memcpy(ctx->tx.rec_seq, offload_ctx->rekey.tx.rec_seq,
> + cipher_desc->rec_seq);
> + unsafe_memcpy(&ctx->crypto_send.info,
> + &offload_ctx->rekey.crypto_send.info,
> + cipher_desc->crypto_info,
> + /* checked during rekey setup */);
> +
> + if (rc)
> + goto rekey_fail;
> +
> + clear_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
> +
> + key = crypto_info_key(&offload_ctx->rekey.crypto_send.info, cipher_desc);
> + rc = crypto_aead_setkey(offload_ctx->aead_send, key, cipher_desc->key);
> + if (rc)
> + goto rekey_fail;
[High]
Can these unconditional copies of the new key material into ctx->tx.iv,
ctx->tx.rec_seq, and ctx->crypto_send.info before the rc check leave
inconsistent state on failure?
If tls_device_dev_add_tx() failed, control reaches rekey_fail with:
- HW context removed (TLS_TX_DEV_CLOSED set)
- ctx->crypto_send.info advertising the NEW key
- ctx->tx.iv and ctx->tx.rec_seq holding the NEW salt/IV/rec_seq
- offload_ctx->aead_send still holding the OLD key (the setkey for it
is reached only after the rc check)
Userspace getsockopt(SOL_TLS, TLS_TX) would then read NEW key bytes that
no kernel AEAD ever absorbed. Any subsequent fallback that builds IVs
from ctx->tx and encrypts via offload_ctx->aead_send would mix NEW salt
with the OLD key.
Could the memcpy block move below the rc check?
[High]
On the success path, is there a window where tls_device_down() sees
stale TLS_TX_DEV_CLOSED?
The success path here:
if (!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
netdev->tlsdev_ops->tls_dev_del(...);
set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
}
...
rc = tls_device_dev_add_tx(...); /* installs NEW HW ctx */
release_lock:
up_read(&device_offload_lock);
...
if (rc) goto rekey_fail;
clear_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
Between dropping device_offload_lock and clear_bit(), can
tls_device_down() take down_write(&device_offload_lock) and run:
if (ctx->tx_conf == TLS_HW &&
!test_bit(TLS_TX_DEV_CLOSED, &ctx->flags)) {
netdev->tlsdev_ops->tls_dev_del(...);
set_bit(TLS_TX_DEV_CLOSED, &ctx->flags);
}
It sees TLS_TX_DEV_CLOSED still set, skips tls_dev_del() for the
freshly-installed NEW HW context, and proceeds to clear ctx->netdev.
Would the new NIC-side TLS state then be leaked until the netdev is
unregistered?
Should TLS_TX_DEV_CLOSED be cleared while still holding
device_offload_lock?
> +
> + /* Start marker: the NIC passes through everything before
> + * write_seq unencrypted (already SW-encrypted during rekey),
> + * same as during initial offload setup.
> + */
> + tls_device_commit_start_marker(sk, offload_ctx, start_marker_record);
[ ... ]
> +rekey_fail:
> + kfree(start_marker_record);
> + set_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
> + clear_bit(TLS_TX_REKEY_READY, &ctx->flags);
> + clear_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> + if (deferred)
> + TLS_DEC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> + TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYFALLBACK);
> +
> + return 0;
> +}
[Medium]
After REKEY_FAILED, the connection is structurally encrypting in SW even
though ctx->tx_conf stays TLS_HW. TLSCURRTXDEVICE is never decremented
and TLSCURRTXSW is never incremented in this path. Should the
current-mode gauges in /proc/net/tls_stat reflect the actual encryption
mode for these connections?
> +static int tls_set_device_offload_rekey(struct sock *sk,
> + struct tls_context *ctx,
> + struct net_device *netdev,
> + struct tls_crypto_info *new_crypto_info)
> +{
> + struct tls_offload_context_tx *offload_ctx = tls_offload_ctx_tx(ctx);
> + bool rekey_pending = test_bit(TLS_TX_REKEY_PENDING, &ctx->flags);
> + bool rekey_failed = test_bit(TLS_TX_REKEY_FAILED, &ctx->flags);
> + bool defer = true;
> + int rc;
> +
> + if (!rekey_pending && !rekey_failed)
> + defer = tls_has_unacked_records(offload_ctx);
> +
> + down_read(&device_offload_lock);
> +
> + rc = tls_device_start_rekey(sk, ctx, offload_ctx, new_crypto_info);
> + if (rc) {
> + up_read(&device_offload_lock);
> + return rc;
> + }
> +
> + up_read(&device_offload_lock);
> +
> + if (defer) {
> + if (!rekey_pending)
> + TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYINPROGRESS);
> + else
> + TLS_INC_STATS(sock_net(sk), LINUX_MIB_TLSTXREKEYOK);
> + return 0;
> + }
> +
> + return tls_device_complete_rekey(sk, ctx, false);
> +}
[Medium]
On the rekey_pending path here, TLSTXREKEYOK is incremented before any
HW reinstall actually happens. Combined with the increment in
tls_device_complete_rekey() on success and the increment in
tls_device_free_resources_tx() when PENDING is set at close, can a
single logical successful rekey produce more than one TLSTXREKEYOK
increment?
[Medium]
tls_has_unacked_records() inspects offload_ctx->records_list only. Is
that sufficient to decide whether to defer? pending_open_record_frags
or offload_ctx->open_record can still hold pre-rekey state, in which
case the !defer path falls into tls_device_complete_rekey(deferred=false),
tls_sw_drain_tx() returns -EAGAIN, setsockopt returns the error to
userspace via err_crypto_info, and none of the state transitions from
tls_device_start_rekey() (PENDING bit, sw_ctx, cipher_ctx, boundary_seq,
sk_validate_xmit_skb) are rolled back. Should the failure path either
roll those back or transition to a defined recovery state?
[ ... ]
> diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
> index 1110f7ac6bcb..64ac4ef4012b 100644
> --- a/net/tls/tls_device_fallback.c
> +++ b/net/tls/tls_device_fallback.c
> @@ -435,6 +435,30 @@ struct sk_buff *tls_validate_xmit_skb_sw(struct sock *sk,
> return tls_sw_fallback(sk, skb);
> }
>
> +struct sk_buff *tls_validate_xmit_skb_rekey(struct sock *sk,
> + struct net_device *dev,
> + struct sk_buff *skb)
> +{
> + struct tls_context *tls_ctx = tls_get_ctx(sk);
> + u32 tcp_seq = ntohl(tcp_hdr(skb)->seq);
> + u32 boundary_seq;
> +
> + if (test_bit(TLS_TX_REKEY_FAILED, &tls_ctx->flags))
> + return skb;
> +
> + /* If this packet is at or after the rekey boundary, it's already
> + * SW-encrypted with the new key, pass through unchanged
> + */
> + boundary_seq = READ_ONCE(tls_ctx->rekey.boundary_seq);
> + if (!before(tcp_seq, boundary_seq))
> + return skb;
> +
> + /* Packet before boundary means retransmit of old data,
> + * use SW fallback with the old key
> + */
> + return tls_sw_fallback(sk, skb);
> +}
[Critical]
Can this leak plaintext on the wire after REKEY_FAILED?
The first branch returns the skb unchanged whenever
TLS_TX_REKEY_FAILED is set, regardless of where tcp_seq is relative to
boundary_seq.
After tls_device_complete_rekey() has called tls_dev_del() and then
landed in rekey_fail, there is no NIC TLS context for the socket.
SKBs from before the boundary are still sitting in the socket write
queue with plaintext payload, because HW-offload semantics expected the
NIC to encrypt them at TX time. When TCP retransmits one of those old
skbs:
tcp_seq < boundary_seq, but TLS_TX_REKEY_FAILED is set
-> early return skb unchanged
-> NIC has no TLS context for this flow
-> driver transmits the skb as-is
Should the FAILED branch still route pre-boundary packets through
tls_sw_fallback() with the old key, the same way the non-failed branch
does, so that pre-boundary retransmits are encrypted?
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index fd04857fa0ab..2548ad2b2219 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -371,6 +371,8 @@ static void tls_sk_proto_close(struct sock *sk, long timeout)
>
> if (ctx->tx_conf == TLS_SW)
> tls_sw_cancel_work_tx(ctx);
> + else if (ctx->tx_conf == TLS_HW && ctx->rekey.sw_ctx)
> + tls_sw_cancel_work_tx(ctx);
>
> lock_sock(sk);
[High]
Is reading ctx->rekey.sw_ctx before lock_sock(sk) safe?
ctx->rekey.sw_ctx is published by tls_device_start_rekey() under
lock_sock. Here close reads it before taking lock_sock, so a concurrent
setsockopt rekey can publish sw_ctx after this read but before close
acquires the socket lock. Close then proceeds without canceling the
delayed tx_work, while later cleanup
(tls_device_free_resources_tx -> tls_sw_release_resources_tx) frees
aead_send.
If a tx_work_handler scheduled by the rekey path runs after that, can it
touch the freed delayed_work or aead_send?
Should the read be moved under lock_sock, or should
tls_sw_release_resources_tx() internally cancel the delayed work?
[ ... ]
--
pw-bot: cr
next prev parent reply other threads:[~2026-05-25 21:16 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-15 21:27 [PATCH net-next v14 0/9] tls: Add TLS 1.3 hardware offload support Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 1/9] net: tls: reject TLS 1.3 offload in chcr_ktls and nfp drivers Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 2/9] net/mlx5e: add TLS 1.3 hardware offload support Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 3/9] tls: " Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 4/9] tls: split tls_set_sw_offload into init and finalize stages Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 5/9] tls: prep helpers and refactors for HW offload KeyUpdate Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 6/9] tls: device: add TX KeyUpdate support Rishikesh Jethwani
2026-05-25 21:16 ` Jakub Kicinski [this message]
2026-05-15 21:27 ` [PATCH v14 7/9] tls: device: add RX " Rishikesh Jethwani
2026-05-25 21:16 ` Jakub Kicinski
2026-05-15 21:27 ` [PATCH v14 8/9] tls: device: add tracepoints for RX KeyUpdate path Rishikesh Jethwani
2026-05-15 21:27 ` [PATCH v14 9/9] selftests: net: add TLS hardware offload test Rishikesh Jethwani
2026-05-25 21:16 ` Jakub Kicinski
2026-05-17 22:21 ` [PATCH net-next v14 0/9] tls: Add TLS 1.3 hardware offload support Sabrina Dubroca
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260525211608.2508449-1-kuba@kernel.org \
--to=kuba@kernel.org \
--cc=borisp@nvidia.com \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=john.fastabend@gmail.com \
--cc=leon@kernel.org \
--cc=mbloch@nvidia.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=rjethwani@purestorage.com \
--cc=saeedm@nvidia.com \
--cc=sd@queasysnail.net \
--cc=tariqt@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox