* [PATCH v2 0/8] TLS read_sock performance scalability
@ 2026-03-11 0:19 Chuck Lever
2026-03-11 0:19 ` [PATCH v2 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
` (8 more replies)
0 siblings, 9 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-11 0:19 UTC (permalink / raw)
To: john.fastabend, kuba, sd; +Cc: netdev, kernel-tls-handshake, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
I'd like to encourage the in-kernel kTLS consumers (NFS and
NVMe/TCP) to converge on read_sock. When I suggested
this to Hannes, he reported a number of nagging performance
scalability issues with read_sock. This series is an attempt to run
these issues down and get them fixed before we convert the above
sock_recvmsg consumers over to read_sock.
While I assemble performance data, let's nail down the preferred
code structure.
Base commit: 05e059510edf ("Merge branch 'eth-fbnic-add-fbnic-self-tests'")
---
Changes since v1:
- Add C11 reference
- Extend data_ready reduction to recvmsg and splice
- Restructure read_sock and recvmsg using shared helpers
Chuck Lever (8):
tls: Factor tls_decrypt_async_drain() from recvmsg
tls: Factor tls_rx_decrypt_record() helper
tls: Fix dangling skb pointer in tls_sw_read_sock()
tls: Factor tls_strp_msg_release() from tls_strp_msg_done()
tls: Suppress spurious saved_data_ready on all receive paths
tls: Flush backlog before tls_rx_rec_wait in read_sock
tls: Restructure tls_sw_read_sock() into submit/deliver phases
tls: Enable batch async decryption in read_sock
net/tls/tls.h | 3 +-
net/tls/tls_strp.c | 34 ++++++--
net/tls/tls_sw.c | 212 ++++++++++++++++++++++++++++++++-------------
3 files changed, 183 insertions(+), 66 deletions(-)
--
2.53.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg
2026-03-11 0:19 [PATCH v2 0/8] TLS read_sock performance scalability Chuck Lever
@ 2026-03-11 0:19 ` Chuck Lever
2026-03-11 17:24 ` Hannes Reinecke
2026-03-11 0:19 ` [PATCH v2 2/8] tls: Factor tls_rx_decrypt_record() helper Chuck Lever
` (7 subsequent siblings)
8 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2026-03-11 0:19 UTC (permalink / raw)
To: john.fastabend, kuba, sd; +Cc: netdev, kernel-tls-handshake, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
The recvmsg path pairs tls_decrypt_async_wait() with
__skb_queue_purge(&ctx->async_hold). Bundling the two into
tls_decrypt_async_drain() gives later patches a single call for
async teardown.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index a656ce235758..cedcc82669db 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -249,6 +249,18 @@ static int tls_decrypt_async_wait(struct tls_sw_context_rx *ctx)
return ctx->async_wait.err;
}
+/* Collect all pending async AEAD completions and release the
+ * skbs held for them. Returns the crypto error if any
+ * operation failed, zero otherwise.
+ */
+static int tls_decrypt_async_drain(struct tls_sw_context_rx *ctx)
+{
+ int ret = tls_decrypt_async_wait(ctx);
+
+ __skb_queue_purge(&ctx->async_hold);
+ return ret;
+}
+
static int tls_do_decryption(struct sock *sk,
struct scatterlist *sgin,
struct scatterlist *sgout,
@@ -2223,8 +2235,7 @@ int tls_sw_recvmsg(struct sock *sk,
int ret;
/* Wait for all previously submitted records to be decrypted */
- ret = tls_decrypt_async_wait(ctx);
- __skb_queue_purge(&ctx->async_hold);
+ ret = tls_decrypt_async_drain(ctx);
if (ret) {
if (err >= 0 || err == -EINPROGRESS)
--
2.53.0
* [PATCH v2 2/8] tls: Factor tls_rx_decrypt_record() helper
2026-03-11 0:19 [PATCH v2 0/8] TLS read_sock performance scalability Chuck Lever
2026-03-11 0:19 ` [PATCH v2 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
@ 2026-03-11 0:19 ` Chuck Lever
2026-03-11 17:25 ` Hannes Reinecke
2026-03-11 0:19 ` [PATCH v2 3/8] tls: Fix dangling skb pointer in tls_sw_read_sock() Chuck Lever
` (6 subsequent siblings)
8 siblings, 1 reply; 12+ messages in thread
From: Chuck Lever @ 2026-03-11 0:19 UTC (permalink / raw)
To: john.fastabend, kuba, sd; +Cc: netdev, kernel-tls-handshake, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
recvmsg, read_sock, and splice_read each open-code the
same sequence: zero-initialize the decrypt arguments, call
tls_rx_one_record(), and abort the connection on failure.
Extract tls_rx_decrypt_record() so each receive path shares
a single decrypt-and-abort primitive. Each call site still
initializes darg.inargs separately, since recvmsg sets zc
and async between the memset and the decrypt call.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index cedcc82669db..81e0e8aaa6f9 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1832,6 +1832,17 @@ static int tls_rx_one_record(struct sock *sk, struct msghdr *msg,
return tls_check_pending_rekey(sk, tls_ctx, darg->skb);
}
+/* Decrypt one record and abort the connection on failure. */
+static int tls_rx_decrypt_record(struct sock *sk, struct msghdr *msg,
+ struct tls_decrypt_arg *darg)
+{
+ int err = tls_rx_one_record(sk, msg, darg);
+
+ if (err < 0)
+ tls_err_abort(sk, -EBADMSG);
+ return err;
+}
+
int decrypt_skb(struct sock *sk, struct scatterlist *sgout)
{
struct tls_decrypt_arg darg = { .zc = true, };
@@ -2132,11 +2143,9 @@ int tls_sw_recvmsg(struct sock *sk,
else
darg.async = false;
- err = tls_rx_one_record(sk, msg, &darg);
- if (err < 0) {
- tls_err_abort(sk, -EBADMSG);
+ err = tls_rx_decrypt_record(sk, msg, &darg);
+ if (err < 0)
goto recv_end;
- }
async |= darg.async;
@@ -2294,11 +2303,9 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
memset(&darg.inargs, 0, sizeof(darg.inargs));
- err = tls_rx_one_record(sk, NULL, &darg);
- if (err < 0) {
- tls_err_abort(sk, -EBADMSG);
+ err = tls_rx_decrypt_record(sk, NULL, &darg);
+ if (err < 0)
goto splice_read_end;
- }
tls_rx_rec_done(ctx);
skb = darg.skb;
@@ -2380,11 +2387,9 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
memset(&darg.inargs, 0, sizeof(darg.inargs));
- err = tls_rx_one_record(sk, NULL, &darg);
- if (err < 0) {
- tls_err_abort(sk, -EBADMSG);
+ err = tls_rx_decrypt_record(sk, NULL, &darg);
+ if (err < 0)
goto read_sock_end;
- }
released = tls_read_flush_backlog(sk, prot, INT_MAX,
0, decrypted,
--
2.53.0
* [PATCH v2 3/8] tls: Fix dangling skb pointer in tls_sw_read_sock()
2026-03-11 0:19 [PATCH v2 0/8] TLS read_sock performance scalability Chuck Lever
2026-03-11 0:19 ` [PATCH v2 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
2026-03-11 0:19 ` [PATCH v2 2/8] tls: Factor tls_rx_decrypt_record() helper Chuck Lever
@ 2026-03-11 0:19 ` Chuck Lever
2026-03-11 0:19 ` [PATCH v2 4/8] tls: Factor tls_strp_msg_release() from tls_strp_msg_done() Chuck Lever
` (5 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-11 0:19 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke,
Alistair Francis
From: Chuck Lever <chuck.lever@oracle.com>
Per ISO/IEC 9899:2011 section 6.2.4p2, a pointer value becomes
indeterminate when the object it points to reaches the end of its
lifetime; Annex J.2 classifies the use of such a value as undefined
behavior. In tls_sw_read_sock(), consume_skb(skb) in the
fully-consumed path frees the skb, but the "do { } while (skb)"
loop condition then evaluates that freed pointer. Although the
value is never dereferenced -- the loop either continues and
overwrites skb, or exits -- any future change that adds a
dereference between consume_skb() and the loop condition would
produce a silent use-after-free.
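The hazard is easy to reproduce outside the kernel. Below is a
minimal userspace sketch, not the kernel code: free() stands in
for consume_skb() and a record list stands in for the skb loop.
It shows the shape the fix adopts, clearing the pointer at the
free site and making loop exit explicit so no control expression
ever evaluates an indeterminate value:

```c
#include <stdlib.h>

struct rec {
	struct rec *next;
	int len;
};

/* The buggy shape was:
 *
 *	do {
 *		...
 *		free(r);	// r now indeterminate
 *	} while (r);		// UB: evaluates a freed pointer
 *
 * The fixed shape below NULLs the pointer immediately after the
 * object is freed and decides loop exit from a value saved while
 * the object was still live.
 */
static int consume_all(struct rec *r)
{
	int total = 0;

	if (!r)
		return 0;
	for (;;) {
		struct rec *next = r->next;

		total += r->len;
		free(r);
		r = NULL;	/* defuse the dangling pointer */
		if (!next)
			break;
		r = next;
	}
	return total;
}
```

Here the saved next pointer plays the role that re-reading
ctx->rx_list plays in tls_sw_read_sock(): the loop never needs
to look at the freed object again.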
Fixes: 662fbcec32f4 ("net/tls: implement ->read_sock()")
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 81e0e8aaa6f9..e5d0447cbba6 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2373,7 +2373,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
goto read_sock_end;
decrypted = 0;
- do {
+ for (;;) {
if (!skb_queue_empty(&ctx->rx_list)) {
skb = __skb_dequeue(&ctx->rx_list);
rxm = strp_msg(skb);
@@ -2422,10 +2422,11 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
goto read_sock_requeue;
} else {
consume_skb(skb);
+ skb = NULL;
if (!desc->count)
- skb = NULL;
+ break;
}
- } while (skb);
+ }
read_sock_end:
tls_rx_reader_release(sk, ctx);
--
2.53.0
* [PATCH v2 4/8] tls: Factor tls_strp_msg_release() from tls_strp_msg_done()
2026-03-11 0:19 [PATCH v2 0/8] TLS read_sock performance scalability Chuck Lever
` (2 preceding siblings ...)
2026-03-11 0:19 ` [PATCH v2 3/8] tls: Fix dangling skb pointer in tls_sw_read_sock() Chuck Lever
@ 2026-03-11 0:19 ` Chuck Lever
2026-03-11 0:19 ` [PATCH v2 5/8] tls: Suppress spurious saved_data_ready on all receive paths Chuck Lever
` (4 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-11 0:19 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke,
Alistair Francis
From: Chuck Lever <chuck.lever@oracle.com>
tls_strp_msg_done() conflates releasing the current record with
checking for the next one via tls_strp_check_rcv(). Batch
processing requires releasing a record without immediately
triggering that check, so the release step is separated into
tls_strp_msg_release(). tls_strp_msg_done() is preserved as a
wrapper for existing callers.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls.h | 1 +
net/tls/tls_strp.c | 15 ++++++++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/net/tls/tls.h b/net/tls/tls.h
index e8f81a006520..a97f1acef31d 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -193,6 +193,7 @@ int tls_strp_init(struct tls_strparser *strp, struct sock *sk);
void tls_strp_data_ready(struct tls_strparser *strp);
void tls_strp_check_rcv(struct tls_strparser *strp);
+void tls_strp_msg_release(struct tls_strparser *strp);
void tls_strp_msg_done(struct tls_strparser *strp);
int tls_rx_msg_size(struct tls_strparser *strp, struct sk_buff *skb);
diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
index 98e12f0ff57e..a7648ebde162 100644
--- a/net/tls/tls_strp.c
+++ b/net/tls/tls_strp.c
@@ -581,7 +581,16 @@ static void tls_strp_work(struct work_struct *w)
release_sock(strp->sk);
}
-void tls_strp_msg_done(struct tls_strparser *strp)
+/**
+ * tls_strp_msg_release - release the current strparser message
+ * @strp: TLS stream parser instance
+ *
+ * Release the current record without triggering a check for the
+ * next record. Callers must invoke tls_strp_check_rcv() before
+ * releasing the socket lock, or queued data will stall until
+ * the next tls_strp_data_ready() event.
+ */
+void tls_strp_msg_release(struct tls_strparser *strp)
{
WARN_ON(!strp->stm.full_len);
@@ -592,7 +601,11 @@ void tls_strp_msg_done(struct tls_strparser *strp)
WRITE_ONCE(strp->msg_ready, 0);
memset(&strp->stm, 0, sizeof(strp->stm));
+}
+void tls_strp_msg_done(struct tls_strparser *strp)
+{
+ tls_strp_msg_release(strp);
tls_strp_check_rcv(strp);
}
--
2.53.0
* [PATCH v2 5/8] tls: Suppress spurious saved_data_ready on all receive paths
2026-03-11 0:19 [PATCH v2 0/8] TLS read_sock performance scalability Chuck Lever
` (3 preceding siblings ...)
2026-03-11 0:19 ` [PATCH v2 4/8] tls: Factor tls_strp_msg_release() from tls_strp_msg_done() Chuck Lever
@ 2026-03-11 0:19 ` Chuck Lever
2026-03-11 0:19 ` [PATCH v2 6/8] tls: Flush backlog before tls_rx_rec_wait in read_sock Chuck Lever
` (3 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-11 0:19 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Alistair Francis,
Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
Each record release via tls_strp_msg_done() triggers
tls_strp_check_rcv(), which calls tls_rx_msg_ready() and
fires saved_data_ready(). During a multi-record receive,
the first N-1 wakeups are pure overhead: the caller is
already running and will pick up subsequent records on
the next loop iteration. The same waste occurs on the
recvmsg and splice_read paths.
Replace tls_strp_msg_done() with tls_strp_msg_release() in
all three receive paths (read_sock, recvmsg, splice_read),
deferring the tls_strp_check_rcv() call to each path's
exit point. Factor tls_rx_msg_ready() out of
tls_strp_read_sock() so that parsing a record no longer
fires the callback directly, and introduce
tls_strp_check_rcv_quiet() for use in tls_rx_rec_wait(),
which parses queued data without notifying.
With no remaining callers, tls_strp_msg_done() is removed, and
its wrapper tls_rx_rec_done() is renamed tls_rx_rec_release()
to match.
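The wakeup accounting can be modeled in a few lines of userspace
C. This is a hypothetical sketch, not kernel code: parse() stands
in for tls_strp_read_sock(), notify() for tls_rx_msg_ready() and
saved_data_ready(), and a receive call that consumes every queued
record fires at most one notification instead of one per record:

```c
static int pending;	/* records queued on the socket */
static int wakeups;	/* saved_data_ready() invocations */

static int parse(void)
{
	return pending > 0;	/* is a full record available? */
}

static void notify(void)
{
	wakeups++;
}

/* Models tls_strp_check_rcv_quiet(): parse without notifying */
static void check_rcv_quiet(int *msg_ready)
{
	if (!*msg_ready && parse())
		*msg_ready = 1;
}

/* Models tls_strp_check_rcv(): parse, notify if a record is ready */
static void check_rcv(int *msg_ready)
{
	check_rcv_quiet(msg_ready);
	if (*msg_ready)
		notify();
}

/* A receive path: consume every queued record, then run the
 * single deferred check at the exit point.
 */
static int recv_all(void)
{
	int msg_ready = 0, copied = 0;

	for (;;) {
		check_rcv_quiet(&msg_ready);	/* was check_rcv() */
		if (!msg_ready)
			break;
		pending--;			/* deliver one record */
		msg_ready = 0;
		copied++;
	}
	check_rcv(&msg_ready);			/* exit-path check */
	return copied;
}
```

In the pre-patch shape the loop body notified on every release, so
consuming four records fired four wakeups; with the quiet variant
plus the exit-path check, a reader that drains the queue wakes no
one, and one that leaves a record behind wakes the consumer once.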
Acked-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls.h | 2 +-
net/tls/tls_strp.c | 27 +++++++++++++++++++--------
net/tls/tls_sw.c | 20 +++++++++++++-------
3 files changed, 33 insertions(+), 16 deletions(-)
diff --git a/net/tls/tls.h b/net/tls/tls.h
index a97f1acef31d..0ab3b83c3724 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -193,8 +193,8 @@ int tls_strp_init(struct tls_strparser *strp, struct sock *sk);
void tls_strp_data_ready(struct tls_strparser *strp);
void tls_strp_check_rcv(struct tls_strparser *strp);
+void tls_strp_check_rcv_quiet(struct tls_strparser *strp);
void tls_strp_msg_release(struct tls_strparser *strp);
-void tls_strp_msg_done(struct tls_strparser *strp);
int tls_rx_msg_size(struct tls_strparser *strp, struct sk_buff *skb);
void tls_rx_msg_ready(struct tls_strparser *strp);
diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
index a7648ebde162..6cf274380da2 100644
--- a/net/tls/tls_strp.c
+++ b/net/tls/tls_strp.c
@@ -368,7 +368,6 @@ static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb,
desc->count = 0;
WRITE_ONCE(strp->msg_ready, 1);
- tls_rx_msg_ready(strp);
}
return ret;
@@ -539,11 +538,27 @@ static int tls_strp_read_sock(struct tls_strparser *strp)
return tls_strp_read_copy(strp, false);
WRITE_ONCE(strp->msg_ready, 1);
- tls_rx_msg_ready(strp);
return 0;
}
+/**
+ * tls_strp_check_rcv_quiet - parse without consumer notification
+ * @strp: TLS stream parser instance
+ *
+ * Parse queued data without firing the consumer notification. A subsequent
+ * tls_strp_check_rcv() is required before the socket lock is released;
+ * otherwise queued data stalls until the next tls_strp_data_ready() event.
+ */
+void tls_strp_check_rcv_quiet(struct tls_strparser *strp)
+{
+ if (unlikely(strp->stopped) || strp->msg_ready)
+ return;
+
+ if (tls_strp_read_sock(strp) == -ENOMEM)
+ queue_work(tls_strp_wq, &strp->work);
+}
+
void tls_strp_check_rcv(struct tls_strparser *strp)
{
if (unlikely(strp->stopped) || strp->msg_ready)
@@ -551,6 +566,8 @@ void tls_strp_check_rcv(struct tls_strparser *strp)
if (tls_strp_read_sock(strp) == -ENOMEM)
queue_work(tls_strp_wq, &strp->work);
+ else if (strp->msg_ready)
+ tls_rx_msg_ready(strp);
}
/* Lower sock lock held */
@@ -603,12 +620,6 @@ void tls_strp_msg_release(struct tls_strparser *strp)
memset(&strp->stm, 0, sizeof(strp->stm));
}
-void tls_strp_msg_done(struct tls_strparser *strp)
-{
- tls_strp_msg_release(strp);
- tls_strp_check_rcv(strp);
-}
-
void tls_strp_stop(struct tls_strparser *strp)
{
strp->stopped = 1;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index e5d0447cbba6..006e0a955b3f 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1384,7 +1384,10 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
return ret;
if (!skb_queue_empty(&sk->sk_receive_queue)) {
- tls_strp_check_rcv(&ctx->strp);
+ /* tls_strp_check_rcv() is called at each receive
+ * path's exit before the socket lock is released.
+ */
+ tls_strp_check_rcv_quiet(&ctx->strp);
if (tls_strp_msg_ready(ctx))
break;
}
@@ -1876,9 +1879,9 @@ static int tls_record_content_type(struct msghdr *msg, struct tls_msg *tlm,
return 1;
}
-static void tls_rx_rec_done(struct tls_sw_context_rx *ctx)
+static void tls_rx_rec_release(struct tls_sw_context_rx *ctx)
{
- tls_strp_msg_done(&ctx->strp);
+ tls_strp_msg_release(&ctx->strp);
}
/* This function traverses the rx_list in tls receive context to copies the
@@ -2159,7 +2162,7 @@ int tls_sw_recvmsg(struct sock *sk,
err = tls_record_content_type(msg, tls_msg(darg.skb), &control);
if (err <= 0) {
DEBUG_NET_WARN_ON_ONCE(darg.zc);
- tls_rx_rec_done(ctx);
+ tls_rx_rec_release(ctx);
put_on_rx_list_err:
__skb_queue_tail(&ctx->rx_list, darg.skb);
goto recv_end;
@@ -2173,7 +2176,7 @@ int tls_sw_recvmsg(struct sock *sk,
/* TLS 1.3 may have updated the length by more than overhead */
rxm = strp_msg(darg.skb);
chunk = rxm->full_len;
- tls_rx_rec_done(ctx);
+ tls_rx_rec_release(ctx);
if (!darg.zc) {
bool partially_consumed = chunk > len;
@@ -2267,6 +2270,7 @@ int tls_sw_recvmsg(struct sock *sk,
copied += decrypted;
end:
+ tls_strp_check_rcv(&ctx->strp);
tls_rx_reader_unlock(sk, ctx);
if (psock)
sk_psock_put(sk, psock);
@@ -2307,7 +2311,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
if (err < 0)
goto splice_read_end;
- tls_rx_rec_done(ctx);
+ tls_rx_rec_release(ctx);
skb = darg.skb;
}
@@ -2334,6 +2338,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
consume_skb(skb);
splice_read_end:
+ tls_strp_check_rcv(&ctx->strp);
tls_rx_reader_unlock(sk, ctx);
return copied ? : err;
@@ -2399,7 +2404,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
tlm = tls_msg(skb);
decrypted += rxm->full_len;
- tls_rx_rec_done(ctx);
+ tls_rx_rec_release(ctx);
}
/* read_sock does not support reading control messages */
@@ -2429,6 +2434,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
}
read_sock_end:
+ tls_strp_check_rcv(&ctx->strp);
tls_rx_reader_release(sk, ctx);
return copied ? : err;
--
2.53.0
* [PATCH v2 6/8] tls: Flush backlog before tls_rx_rec_wait in read_sock
2026-03-11 0:19 [PATCH v2 0/8] TLS read_sock performance scalability Chuck Lever
` (4 preceding siblings ...)
2026-03-11 0:19 ` [PATCH v2 5/8] tls: Suppress spurious saved_data_ready on all receive paths Chuck Lever
@ 2026-03-11 0:19 ` Chuck Lever
2026-03-11 0:19 ` [PATCH v2 7/8] tls: Restructure tls_sw_read_sock() into submit/deliver phases Chuck Lever
` (2 subsequent siblings)
8 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-11 0:19 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Alistair Francis,
Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
While lock_sock is held during read_sock, incoming TCP segments
land on sk->sk_backlog rather than sk->sk_receive_queue.
tls_rx_rec_wait() inspects only sk_receive_queue, so backlog
data remains invisible until release_sock() drains it, forcing
an extra workqueue cycle for records that arrive during
decryption.
Calling sk_flush_backlog() before tls_rx_rec_wait() moves
backlog data into sk_receive_queue, where tls_strp_check_rcv()
can parse it immediately. The existing tls_read_flush_backlog()
call after decryption is retained for TCP window management.
Acked-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 006e0a955b3f..644a65ff9964 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2386,6 +2386,11 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
} else {
struct tls_decrypt_arg darg;
+ /* Drain backlog so segments that arrived while the
+ * lock was held appear on sk_receive_queue before
+ * tls_rx_rec_wait waits for a new record.
+ */
+ sk_flush_backlog(sk);
err = tls_rx_rec_wait(sk, NULL, true, released);
if (err <= 0)
goto read_sock_end;
--
2.53.0
* [PATCH v2 7/8] tls: Restructure tls_sw_read_sock() into submit/deliver phases
2026-03-11 0:19 [PATCH v2 0/8] TLS read_sock performance scalability Chuck Lever
` (5 preceding siblings ...)
2026-03-11 0:19 ` [PATCH v2 6/8] tls: Flush backlog before tls_rx_rec_wait in read_sock Chuck Lever
@ 2026-03-11 0:19 ` Chuck Lever
2026-03-11 0:19 ` [PATCH v2 8/8] tls: Enable batch async decryption in read_sock Chuck Lever
2026-03-11 20:42 ` [PATCH v2 0/8] TLS read_sock performance scalability Jakub Kicinski
8 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-11 0:19 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
Pipelining multiple AEAD operations requires separating decryption
from delivery so that several records can be submitted before any
are passed to the read_actor callback. The main loop in
tls_sw_read_sock() is split into two explicit phases: a submit
phase that decrypts one record onto ctx->rx_list, and a deliver
phase that drains rx_list and passes each cleartext skb to the
read_actor callback.
With a single record per submit phase, behavior is identical to the
previous code. A subsequent patch will extend the submit phase to
pipeline multiple AEAD operations.
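The control flow introduced here can be sketched in plain C. This
is an illustrative userspace model under stated assumptions, not
the kernel code: staging an element stands in for decrypting a
record onto ctx->rx_list, and the actor callback plays the role
of read_actor, including its ability to stop delivery early:

```c
#include <string.h>

#define NREC 4	/* staging capacity; one is used per pass here */

typedef int (*actor_t)(const char *rec, void *arg);

static int read_records(const char **src, int nsrc,
			actor_t actor, void *arg)
{
	const char *staged[NREC];
	int nstaged = 0, next = 0, delivered = 0;

	for (;;) {
		/* Phase 1: submit -- stage one record, as the
		 * single-record submit phase decrypts one record
		 * onto rx_list.
		 */
		if (nstaged == 0) {
			if (next == nsrc)
				break;		/* no more input */
			staged[nstaged++] = src[next++];
		}

		/* Phase 2: deliver -- drain everything staged */
		while (nstaged > 0) {
			const char *rec = staged[--nstaged];

			if (actor(rec, arg) <= 0)
				return delivered; /* actor stopped */
			delivered++;
		}
	}
	return delivered;
}

/* Example actor: append each record to a caller-provided buffer. */
static int concat_actor(const char *rec, void *arg)
{
	strcat(arg, rec);	/* arg points to a char buffer */
	return 1;
}
```

With capacity one the two phases strictly alternate, matching the
claim that behavior is unchanged; raising the per-pass staging
count is exactly the extension the next patch makes.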
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 79 +++++++++++++++++++++++++-----------------------
1 file changed, 41 insertions(+), 38 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 644a65ff9964..535c856d64e0 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2353,8 +2353,8 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
struct tls_prot_info *prot = &tls_ctx->prot_info;
- struct strp_msg *rxm = NULL;
struct sk_buff *skb = NULL;
+ struct strp_msg *rxm;
struct sk_psock *psock;
size_t flushed_at = 0;
bool released = true;
@@ -2379,17 +2379,15 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
decrypted = 0;
for (;;) {
- if (!skb_queue_empty(&ctx->rx_list)) {
- skb = __skb_dequeue(&ctx->rx_list);
- rxm = strp_msg(skb);
- tlm = tls_msg(skb);
- } else {
- struct tls_decrypt_arg darg;
+ struct tls_decrypt_arg darg;
- /* Drain backlog so segments that arrived while the
- * lock was held appear on sk_receive_queue before
- * tls_rx_rec_wait waits for a new record.
- */
+ /* Phase 1: Submit -- decrypt one record onto rx_list.
+ * Flush the backlog first so that segments that
+ * arrived while the lock was held appear on
+ * sk_receive_queue before tls_rx_rec_wait waits
+ * for a new record.
+ */
+ if (skb_queue_empty(&ctx->rx_list)) {
sk_flush_backlog(sk);
err = tls_rx_rec_wait(sk, NULL, true, released);
if (err <= 0)
@@ -2404,38 +2402,43 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
released = tls_read_flush_backlog(sk, prot, INT_MAX,
0, decrypted,
&flushed_at);
- skb = darg.skb;
+ decrypted += strp_msg(darg.skb)->full_len;
+ tls_rx_rec_release(ctx);
+ __skb_queue_tail(&ctx->rx_list, darg.skb);
+ }
+
+ /* Phase 2: Deliver -- drain rx_list to read_actor */
+ while ((skb = __skb_dequeue(&ctx->rx_list)) != NULL) {
rxm = strp_msg(skb);
tlm = tls_msg(skb);
- decrypted += rxm->full_len;
- tls_rx_rec_release(ctx);
- }
-
- /* read_sock does not support reading control messages */
- if (tlm->control != TLS_RECORD_TYPE_DATA) {
- err = -EINVAL;
- goto read_sock_requeue;
- }
-
- used = read_actor(desc, skb, rxm->offset, rxm->full_len);
- if (used <= 0) {
- if (!copied)
- err = used;
- goto read_sock_requeue;
- }
- copied += used;
- if (used < rxm->full_len) {
- rxm->offset += used;
- rxm->full_len -= used;
- if (!desc->count)
+ /* read_sock does not support reading control messages */
+ if (tlm->control != TLS_RECORD_TYPE_DATA) {
+ err = -EINVAL;
goto read_sock_requeue;
- } else {
- consume_skb(skb);
- skb = NULL;
- if (!desc->count)
- break;
+ }
+
+ used = read_actor(desc, skb, rxm->offset,
+ rxm->full_len);
+ if (used <= 0) {
+ if (!copied)
+ err = used;
+ goto read_sock_requeue;
+ }
+ copied += used;
+ if (used < rxm->full_len) {
+ rxm->offset += used;
+ rxm->full_len -= used;
+ if (!desc->count)
+ goto read_sock_requeue;
+ } else {
+ consume_skb(skb);
+ skb = NULL;
+ }
}
+ /* Drain all of rx_list before honoring !desc->count */
+ if (!desc->count)
+ break;
}
read_sock_end:
--
2.53.0
* [PATCH v2 8/8] tls: Enable batch async decryption in read_sock
2026-03-11 0:19 [PATCH v2 0/8] TLS read_sock performance scalability Chuck Lever
` (6 preceding siblings ...)
2026-03-11 0:19 ` [PATCH v2 7/8] tls: Restructure tls_sw_read_sock() into submit/deliver phases Chuck Lever
@ 2026-03-11 0:19 ` Chuck Lever
2026-03-11 20:42 ` [PATCH v2 0/8] TLS read_sock performance scalability Jakub Kicinski
8 siblings, 0 replies; 12+ messages in thread
From: Chuck Lever @ 2026-03-11 0:19 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
tls_sw_read_sock() decrypts one TLS record at a time, blocking until
each AEAD operation completes before proceeding. Hardware async
crypto engines depend on pipelining multiple operations to achieve
full throughput, and the one-at-a-time model prevents that. Kernel
consumers such as NVMe/TCP and NFSD (when using TLS) are therefore
unable to benefit from hardware offload.
When ctx->async_capable is true, the submit phase now loops up to
TLS_READ_SOCK_BATCH (16) records. The first record waits via
tls_rx_rec_wait(); subsequent iterations use tls_strp_msg_ready()
and tls_strp_check_rcv() to collect records already queued on the
socket without blocking. Each record is submitted with darg.async
set, and all resulting skbs are appended to rx_list.
After the submit loop, a single tls_decrypt_async_drain() collects
all pending AEAD completions before the deliver phase passes
cleartext records to the consumer. The batch bound of 16 limits
concurrent memory consumption to 16 cleartext skbs plus their AEAD
contexts. If async_capable is false, the loop exits after one
record and the async wait is skipped, preserving prior behavior.
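The scalability effect can be quantified with a toy model. In
this hypothetical sketch, submitting a record is free and each
drain models one tls_decrypt_async_drain() round trip, while
serial_waits() models the prior behavior of blocking once per
record:

```c
#define BATCH 16	/* mirrors TLS_READ_SOCK_BATCH */

/* Number of completion waits needed to process nrecs records
 * when up to BATCH async submissions share a single drain.
 */
static int batched_waits(int nrecs)
{
	int waits = 0;

	while (nrecs > 0) {
		int n = nrecs < BATCH ? nrecs : BATCH;

		nrecs -= n;	/* submit n AEAD requests */
		waits++;	/* one tls_decrypt_async_drain() */
	}
	return waits;
}

/* Pre-patch behavior: every record completes before the next
 * AEAD operation is issued.
 */
static int serial_waits(int nrecs)
{
	return nrecs;
}
```

For a 32-record burst this is 2 waits instead of 32, which is
where a hardware crypto pipeline recovers its throughput.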
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 95 +++++++++++++++++++++++++++++++++++++++---------
1 file changed, 78 insertions(+), 17 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 535c856d64e0..0e2b7d285d06 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -261,6 +261,12 @@ static int tls_decrypt_async_drain(struct tls_sw_context_rx *ctx)
return ret;
}
+/* Submit an AEAD decrypt request. On success with darg->async set,
+ * the caller must not touch aead_req; the completion handler frees
+ * it. Every error return clears darg->async and guarantees no
+ * in-flight AEAD operation remains -- callers rely on this to
+ * safely free aead_req and to skip async drain on error paths.
+ */
static int tls_do_decryption(struct sock *sk,
struct scatterlist *sgin,
struct scatterlist *sgout,
@@ -2347,6 +2353,13 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
goto splice_read_end;
}
+/* Bound on concurrent async AEAD submissions per read_sock
+ * call. Chosen to fill typical hardware crypto pipelines
+ * without excessive memory consumption (each in-flight record
+ * holds one cleartext skb plus its AEAD request context).
+ */
+#define TLS_READ_SOCK_BATCH 16
+
int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
sk_read_actor_t read_actor)
{
@@ -2358,6 +2371,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
struct sk_psock *psock;
size_t flushed_at = 0;
bool released = true;
+ bool async = false;
struct tls_msg *tlm;
ssize_t copied = 0;
ssize_t decrypted;
@@ -2380,31 +2394,68 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
decrypted = 0;
for (;;) {
struct tls_decrypt_arg darg;
+ int nr_async = 0;
- /* Phase 1: Submit -- decrypt one record onto rx_list.
+ /* Phase 1: Submit -- decrypt records onto rx_list.
* Flush the backlog first so that segments that
* arrived while the lock was held appear on
* sk_receive_queue before tls_rx_rec_wait waits
* for a new record.
*/
if (skb_queue_empty(&ctx->rx_list)) {
- sk_flush_backlog(sk);
- err = tls_rx_rec_wait(sk, NULL, true, released);
- if (err <= 0)
+ while (nr_async < TLS_READ_SOCK_BATCH) {
+ if (nr_async == 0) {
+ sk_flush_backlog(sk);
+ err = tls_rx_rec_wait(sk, NULL,
+ true,
+ released);
+ if (err <= 0)
+ goto read_sock_end;
+ } else {
+ if (!tls_strp_msg_ready(ctx)) {
+ tls_strp_check_rcv_quiet(&ctx->strp);
+ if (!tls_strp_msg_ready(ctx))
+ break;
+ }
+ if (!tls_strp_msg_load(&ctx->strp,
+ released))
+ break;
+ }
+
+ memset(&darg.inargs, 0, sizeof(darg.inargs));
+ darg.async = ctx->async_capable;
+
+ err = tls_rx_decrypt_record(sk, NULL,
+ &darg);
+ if (err < 0)
+ goto read_sock_end;
+
+ async |= darg.async;
+ released = tls_read_flush_backlog(sk, prot,
+ INT_MAX,
+ 0,
+ decrypted,
+ &flushed_at);
+ decrypted += strp_msg(darg.skb)->full_len;
+ tls_rx_rec_release(ctx);
+ __skb_queue_tail(&ctx->rx_list, darg.skb);
+ nr_async++;
+
+ if (!ctx->async_capable)
+ break;
+ }
+ }
+
+ /* Async wait -- collect pending AEAD completions */
+ if (async) {
+ int ret = tls_decrypt_async_drain(ctx);
+
+ async = false;
+ if (ret) {
+ __skb_queue_purge(&ctx->rx_list);
+ err = ret;
goto read_sock_end;
-
- memset(&darg.inargs, 0, sizeof(darg.inargs));
-
- err = tls_rx_decrypt_record(sk, NULL, &darg);
- if (err < 0)
- goto read_sock_end;
-
- released = tls_read_flush_backlog(sk, prot, INT_MAX,
- 0, decrypted,
- &flushed_at);
- decrypted += strp_msg(darg.skb)->full_len;
- tls_rx_rec_release(ctx);
- __skb_queue_tail(&ctx->rx_list, darg.skb);
+ }
}
/* Phase 2: Deliver -- drain rx_list to read_actor */
@@ -2442,6 +2493,16 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
}
read_sock_end:
+ if (async) {
+ int ret = tls_decrypt_async_drain(ctx);
+
+ __skb_queue_purge(&ctx->rx_list);
+ /* Preserve the error that triggered early exit;
+ * a crypto drain error is secondary.
+ */
+ if (ret && !err)
+ err = ret;
+ }
tls_strp_check_rcv(&ctx->strp);
tls_rx_reader_release(sk, ctx);
return copied ? : err;
--
2.53.0
* Re: [PATCH v2 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg
2026-03-11 0:19 ` [PATCH v2 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
@ 2026-03-11 17:24 ` Hannes Reinecke
0 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2026-03-11 17:24 UTC (permalink / raw)
To: Chuck Lever, john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever
On 3/11/26 01:19, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> The recvmsg path pairs tls_decrypt_async_wait() with
> __skb_queue_purge(&ctx->async_hold). Bundling the two into
> tls_decrypt_async_drain() gives later patches a single call for
> async teardown.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/tls/tls_sw.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2 2/8] tls: Factor tls_rx_decrypt_record() helper
2026-03-11 0:19 ` [PATCH v2 2/8] tls: Factor tls_rx_decrypt_record() helper Chuck Lever
@ 2026-03-11 17:25 ` Hannes Reinecke
0 siblings, 0 replies; 12+ messages in thread
From: Hannes Reinecke @ 2026-03-11 17:25 UTC (permalink / raw)
To: Chuck Lever, john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever
On 3/11/26 01:19, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> recvmsg, read_sock, and splice_read each open-code the
> same sequence: zero-initialize the decrypt arguments, call
> tls_rx_one_record(), and abort the connection on failure.
>
> Extract tls_rx_decrypt_record() so each receive path shares
> a single decrypt-and-abort primitive. Each call site still
> initializes darg.inargs separately, since recvmsg sets zc
> and async between the memset and the decrypt call.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/tls/tls_sw.c | 29 +++++++++++++++++------------
> 1 file changed, 17 insertions(+), 12 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH v2 0/8] TLS read_sock performance scalability
2026-03-11 0:19 [PATCH v2 0/8] TLS read_sock performance scalability Chuck Lever
` (7 preceding siblings ...)
2026-03-11 0:19 ` [PATCH v2 8/8] tls: Enable batch async decryption in read_sock Chuck Lever
@ 2026-03-11 20:42 ` Jakub Kicinski
8 siblings, 0 replies; 12+ messages in thread
From: Jakub Kicinski @ 2026-03-11 20:42 UTC (permalink / raw)
To: Chuck Lever; +Cc: john.fastabend, sd, netdev, kernel-tls-handshake, Chuck Lever
On Tue, 10 Mar 2026 20:19:44 -0400 Chuck Lever wrote:
> I'd like to encourage in-kernel kTLS consumers (i.e., NFS and
> NVMe/TCP) to coalesce on the use of read_sock. When I suggested
> this to Hannes, he reported a number of nagging performance
> scalability issues with read_sock. This series is an attempt to run
> these issues down and get them fixed before we convert the above
> sock_recvmsg consumers over to read_sock.
>
> While I assemble performance data, let's nail down the preferred
> code structure.
Could you check if the tls selftest passes?
It appears to be failing in our CI since this got posted.