* [PATCH net-next v4 0/8] TLS read_sock performance scalability
@ 2026-03-17 15:04 Chuck Lever
2026-03-17 15:04 ` [PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
` (7 more replies)
0 siblings, 8 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-17 15:04 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke,
Alistair Francis
I'd like to encourage in-kernel kTLS consumers (i.e., NFS and
NVMe/TCP) to converge on read_sock. When I suggested this to
Hannes, he reported a number of nagging performance scalability
issues with read_sock. This series attempts to run those issues
down and fix them before we convert the above sock_recvmsg
consumers to read_sock.
While I assemble performance data, let's nail down the preferred
code structure.
To: john.fastabend@gmail.com
To: kuba@kernel.org
To: sd@queasysnail.net
Cc: netdev@vger.kernel.org
Cc: kernel-tls-handshake@lists.linux.dev
Base commit: 05e059510edf ("Merge branch 'eth-fbnic-add-fbnic-self-tests'")
---
Changes since v3:
- Clarify why tls_decrypt_async_drain() is separate from _wait()
- Fold tls_err_abort() into tls_rx_one_record(), drop tls_rx_decrypt_record()
- Move backlog flush into tls_rx_rec_wait() so all RX paths benefit
Changes since v2:
- Fix short read self tests
Changes since v1:
- Add C11 reference
- Extend data_ready reduction to recvmsg and splice
- Restructure read_sock and recvmsg using shared helpers
---
Chuck Lever (8):
tls: Factor tls_decrypt_async_drain() from recvmsg
tls: Abort the connection on decrypt failure
tls: Fix dangling skb pointer in tls_sw_read_sock()
tls: Factor tls_strp_msg_release() from tls_strp_msg_done()
tls: Suppress spurious saved_data_ready on all receive paths
tls: Flush backlog before waiting for a new record
tls: Restructure tls_sw_read_sock() into submit/deliver phases
tls: Enable batch async decryption in read_sock
net/tls/tls.h | 3 +-
net/tls/tls_strp.c | 34 +++++++--
net/tls/tls_sw.c | 198 +++++++++++++++++++++++++++++++++++++----------------
3 files changed, 169 insertions(+), 66 deletions(-)
---
base-commit: 5446b8691eb8278f10deca92048fad84ffd1e4d5
change-id: 20260317-tls-read-sock-a0022c9df265
Best regards,
--
Chuck Lever
* [PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg
2026-03-17 15:04 [PATCH net-next v4 0/8] TLS read_sock performance scalability Chuck Lever
@ 2026-03-17 15:04 ` Chuck Lever
2026-03-17 19:55 ` Breno Leitao
2026-03-19 17:21 ` Sabrina Dubroca
2026-03-17 15:04 ` [PATCH net-next v4 2/8] tls: Abort the connection on decrypt failure Chuck Lever
` (6 subsequent siblings)
7 siblings, 2 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-17 15:04 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke,
Alistair Francis
From: Chuck Lever <chuck.lever@oracle.com>
The recvmsg path pairs tls_decrypt_async_wait() with
__skb_queue_purge(&ctx->async_hold). Bundling the two into
tls_decrypt_async_drain() gives later patches a single call
for async teardown.
The purge is kept separate from tls_decrypt_async_wait()
because other callers (the -EBUSY fallback in
tls_do_decryption and the tls_strp_msg_hold error path)
need to synchronize without discarding held skbs that
are still awaiting delivery.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index a656ce2357589672bcef24343fef0aa83606cf41..09ccfe82af1a6c38978327e941de34818b5da7a8 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -249,6 +249,18 @@ static int tls_decrypt_async_wait(struct tls_sw_context_rx *ctx)
return ctx->async_wait.err;
}
+/* Collect all pending async AEAD completions and release the
+ * skbs held for them. Returns the crypto error if any
+ * operation failed, zero otherwise.
+ */
+static int tls_decrypt_async_drain(struct tls_sw_context_rx *ctx)
+{
+ int ret = tls_decrypt_async_wait(ctx);
+
+ __skb_queue_purge(&ctx->async_hold);
+ return ret;
+}
+
static int tls_do_decryption(struct sock *sk,
struct scatterlist *sgin,
struct scatterlist *sgout,
@@ -2222,9 +2234,8 @@ int tls_sw_recvmsg(struct sock *sk,
if (async) {
int ret;
- /* Wait for all previously submitted records to be decrypted */
- ret = tls_decrypt_async_wait(ctx);
- __skb_queue_purge(&ctx->async_hold);
+ /* Drain all pending async decryptions and their held skbs */
+ ret = tls_decrypt_async_drain(ctx);
if (ret) {
if (err >= 0 || err == -EINPROGRESS)
--
2.53.0
* [PATCH net-next v4 2/8] tls: Abort the connection on decrypt failure
2026-03-17 15:04 [PATCH net-next v4 0/8] TLS read_sock performance scalability Chuck Lever
2026-03-17 15:04 ` [PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
@ 2026-03-17 15:04 ` Chuck Lever
2026-03-23 10:22 ` Sabrina Dubroca
2026-03-17 15:04 ` [PATCH net-next v4 3/8] tls: Fix dangling skb pointer in tls_sw_read_sock() Chuck Lever
` (5 subsequent siblings)
7 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2026-03-17 15:04 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
recvmsg, read_sock, and splice_read each open-code a
tls_err_abort() call after tls_rx_one_record() fails.
Move the abort into tls_rx_one_record() so each receive
path shares a single decrypt-and-abort sequence.
Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 09ccfe82af1a6c38978327e941de34818b5da7a8..bdbdaf40b3384298c80082c3acabcdb9a2becfc8 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1821,8 +1821,10 @@ static int tls_rx_one_record(struct sock *sk, struct msghdr *msg,
err = tls_decrypt_device(sk, msg, tls_ctx, darg);
if (!err)
err = tls_decrypt_sw(sk, tls_ctx, msg, darg);
- if (err < 0)
+ if (err < 0) {
+ tls_err_abort(sk, -EBADMSG);
return err;
+ }
rxm = strp_msg(darg->skb);
rxm->offset += prot->prepend_size;
@@ -2133,10 +2135,8 @@ int tls_sw_recvmsg(struct sock *sk,
darg.async = false;
err = tls_rx_one_record(sk, msg, &darg);
- if (err < 0) {
- tls_err_abort(sk, -EBADMSG);
+ if (err < 0)
goto recv_end;
- }
async |= darg.async;
@@ -2295,10 +2295,8 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
memset(&darg.inargs, 0, sizeof(darg.inargs));
err = tls_rx_one_record(sk, NULL, &darg);
- if (err < 0) {
- tls_err_abort(sk, -EBADMSG);
+ if (err < 0)
goto splice_read_end;
- }
tls_rx_rec_done(ctx);
skb = darg.skb;
@@ -2381,10 +2379,8 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
memset(&darg.inargs, 0, sizeof(darg.inargs));
err = tls_rx_one_record(sk, NULL, &darg);
- if (err < 0) {
- tls_err_abort(sk, -EBADMSG);
+ if (err < 0)
goto read_sock_end;
- }
released = tls_read_flush_backlog(sk, prot, INT_MAX,
0, decrypted,
--
2.53.0
* [PATCH net-next v4 3/8] tls: Fix dangling skb pointer in tls_sw_read_sock()
2026-03-17 15:04 [PATCH net-next v4 0/8] TLS read_sock performance scalability Chuck Lever
2026-03-17 15:04 ` [PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
2026-03-17 15:04 ` [PATCH net-next v4 2/8] tls: Abort the connection on decrypt failure Chuck Lever
@ 2026-03-17 15:04 ` Chuck Lever
2026-03-17 15:04 ` [PATCH net-next v4 4/8] tls: Factor tls_strp_msg_release() from tls_strp_msg_done() Chuck Lever
` (4 subsequent siblings)
7 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-17 15:04 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke,
Alistair Francis
From: Chuck Lever <chuck.lever@oracle.com>
Per ISO/IEC 9899:2011 section 6.2.4p2, a pointer value becomes
indeterminate when the object it points to reaches the end of its
lifetime; Annex J.2 classifies the use of such a value as undefined
behavior. In tls_sw_read_sock(), consume_skb(skb) in the
fully-consumed path frees the skb, but the "do { } while (skb)"
loop condition then evaluates that freed pointer. Although the
value is never dereferenced -- the loop either continues and
overwrites skb, or exits -- any future change that adds a
dereference between consume_skb() and the loop condition would
produce a silent use-after-free.
Fixes: 662fbcec32f4 ("net/tls: implement ->read_sock()")
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index bdbdaf40b3384298c80082c3acabcdb9a2becfc8..07f4a3d1a6f854acc7762608cc7741b3de95c195 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2364,7 +2364,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
goto read_sock_end;
decrypted = 0;
- do {
+ for (;;) {
if (!skb_queue_empty(&ctx->rx_list)) {
skb = __skb_dequeue(&ctx->rx_list);
rxm = strp_msg(skb);
@@ -2413,10 +2413,11 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
goto read_sock_requeue;
} else {
consume_skb(skb);
+ skb = NULL;
if (!desc->count)
- skb = NULL;
+ break;
}
- } while (skb);
+ }
read_sock_end:
tls_rx_reader_release(sk, ctx);
--
2.53.0
* [PATCH net-next v4 4/8] tls: Factor tls_strp_msg_release() from tls_strp_msg_done()
2026-03-17 15:04 [PATCH net-next v4 0/8] TLS read_sock performance scalability Chuck Lever
` (2 preceding siblings ...)
2026-03-17 15:04 ` [PATCH net-next v4 3/8] tls: Fix dangling skb pointer in tls_sw_read_sock() Chuck Lever
@ 2026-03-17 15:04 ` Chuck Lever
2026-03-17 15:04 ` [PATCH net-next v4 5/8] tls: Suppress spurious saved_data_ready on all receive paths Chuck Lever
` (3 subsequent siblings)
7 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-17 15:04 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke,
Alistair Francis
From: Chuck Lever <chuck.lever@oracle.com>
tls_strp_msg_done() conflates releasing the current record with
checking for the next one via tls_strp_check_rcv(). Batch
processing requires releasing a record without immediately
triggering that check, so the release step is separated into
tls_strp_msg_release(). tls_strp_msg_done() is preserved as a
wrapper for existing callers.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls.h | 1 +
net/tls/tls_strp.c | 15 ++++++++++++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/net/tls/tls.h b/net/tls/tls.h
index e8f81a006520027110962655f7b69a5ca58d7fb6..a97f1acef31d3a4ad1442c20c6914a9f5198a1e0 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -193,6 +193,7 @@ int tls_strp_init(struct tls_strparser *strp, struct sock *sk);
void tls_strp_data_ready(struct tls_strparser *strp);
void tls_strp_check_rcv(struct tls_strparser *strp);
+void tls_strp_msg_release(struct tls_strparser *strp);
void tls_strp_msg_done(struct tls_strparser *strp);
int tls_rx_msg_size(struct tls_strparser *strp, struct sk_buff *skb);
diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
index 98e12f0ff57e5158c7518385aa0579528520d63f..a7648ebde162b1bd944f53d44c89cd7da4cef082 100644
--- a/net/tls/tls_strp.c
+++ b/net/tls/tls_strp.c
@@ -581,7 +581,16 @@ static void tls_strp_work(struct work_struct *w)
release_sock(strp->sk);
}
-void tls_strp_msg_done(struct tls_strparser *strp)
+/**
+ * tls_strp_msg_release - release the current strparser message
+ * @strp: TLS stream parser instance
+ *
+ * Release the current record without triggering a check for the
+ * next record. Callers must invoke tls_strp_check_rcv() before
+ * releasing the socket lock, or queued data will stall until
+ * the next tls_strp_data_ready() event.
+ */
+void tls_strp_msg_release(struct tls_strparser *strp)
{
WARN_ON(!strp->stm.full_len);
@@ -592,7 +601,11 @@ void tls_strp_msg_done(struct tls_strparser *strp)
WRITE_ONCE(strp->msg_ready, 0);
memset(&strp->stm, 0, sizeof(strp->stm));
+}
+void tls_strp_msg_done(struct tls_strparser *strp)
+{
+ tls_strp_msg_release(strp);
tls_strp_check_rcv(strp);
}
--
2.53.0
* [PATCH net-next v4 5/8] tls: Suppress spurious saved_data_ready on all receive paths
2026-03-17 15:04 [PATCH net-next v4 0/8] TLS read_sock performance scalability Chuck Lever
` (3 preceding siblings ...)
2026-03-17 15:04 ` [PATCH net-next v4 4/8] tls: Factor tls_strp_msg_release() from tls_strp_msg_done() Chuck Lever
@ 2026-03-17 15:04 ` Chuck Lever
2026-03-23 10:32 ` Sabrina Dubroca
2026-03-17 15:04 ` [PATCH net-next v4 6/8] tls: Flush backlog before waiting for a new record Chuck Lever
` (2 subsequent siblings)
7 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2026-03-17 15:04 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Alistair Francis,
Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
Each record release via tls_strp_msg_done() triggers
tls_strp_check_rcv(), which calls tls_rx_msg_ready() and
fires saved_data_ready(). During a multi-record receive,
the first N-1 wakeups are pure overhead: the caller is
already running and will pick up subsequent records on
the next loop iteration. The same waste occurs on the
recvmsg and splice_read paths.
Replace tls_strp_msg_done() with tls_strp_msg_release() in
all three receive paths (read_sock, recvmsg, splice_read),
deferring the tls_strp_check_rcv() call to each path's
exit point. Factor tls_rx_msg_ready() out of
tls_strp_read_sock() so that parsing a record no longer
fires the callback directly, and introduce
tls_strp_check_rcv_quiet() for use in tls_rx_rec_wait(),
which parses queued data without notifying.
With no remaining callers, tls_strp_msg_done() is removed; its
wrapper tls_rx_rec_done() is renamed tls_rx_rec_release() to
match the new semantics.
Acked-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls.h | 2 +-
net/tls/tls_strp.c | 27 +++++++++++++++++++--------
net/tls/tls_sw.c | 21 ++++++++++++++-------
3 files changed, 34 insertions(+), 16 deletions(-)
diff --git a/net/tls/tls.h b/net/tls/tls.h
index a97f1acef31d3a4ad1442c20c6914a9f5198a1e0..0ab3b83c37243ab393367ac80836b9fd8fec7f82 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -193,8 +193,8 @@ int tls_strp_init(struct tls_strparser *strp, struct sock *sk);
void tls_strp_data_ready(struct tls_strparser *strp);
void tls_strp_check_rcv(struct tls_strparser *strp);
+void tls_strp_check_rcv_quiet(struct tls_strparser *strp);
void tls_strp_msg_release(struct tls_strparser *strp);
-void tls_strp_msg_done(struct tls_strparser *strp);
int tls_rx_msg_size(struct tls_strparser *strp, struct sk_buff *skb);
void tls_rx_msg_ready(struct tls_strparser *strp);
diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
index a7648ebde162b1bd944f53d44c89cd7da4cef082..6cf274380da2785b58206cf0e70cf4fee033a223 100644
--- a/net/tls/tls_strp.c
+++ b/net/tls/tls_strp.c
@@ -368,7 +368,6 @@ static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb,
desc->count = 0;
WRITE_ONCE(strp->msg_ready, 1);
- tls_rx_msg_ready(strp);
}
return ret;
@@ -539,11 +538,27 @@ static int tls_strp_read_sock(struct tls_strparser *strp)
return tls_strp_read_copy(strp, false);
WRITE_ONCE(strp->msg_ready, 1);
- tls_rx_msg_ready(strp);
return 0;
}
+/**
+ * tls_strp_check_rcv_quiet - parse without consumer notification
+ * @strp: TLS stream parser instance
+ *
+ * Parse queued data without firing the consumer notification. A subsequent
+ * tls_strp_check_rcv() is required before the socket lock is released;
+ * otherwise queued data stalls until the next tls_strp_data_ready() event.
+ */
+void tls_strp_check_rcv_quiet(struct tls_strparser *strp)
+{
+ if (unlikely(strp->stopped) || strp->msg_ready)
+ return;
+
+ if (tls_strp_read_sock(strp) == -ENOMEM)
+ queue_work(tls_strp_wq, &strp->work);
+}
+
void tls_strp_check_rcv(struct tls_strparser *strp)
{
if (unlikely(strp->stopped) || strp->msg_ready)
@@ -551,6 +566,8 @@ void tls_strp_check_rcv(struct tls_strparser *strp)
if (tls_strp_read_sock(strp) == -ENOMEM)
queue_work(tls_strp_wq, &strp->work);
+ else if (strp->msg_ready)
+ tls_rx_msg_ready(strp);
}
/* Lower sock lock held */
@@ -603,12 +620,6 @@ void tls_strp_msg_release(struct tls_strparser *strp)
memset(&strp->stm, 0, sizeof(strp->stm));
}
-void tls_strp_msg_done(struct tls_strparser *strp)
-{
- tls_strp_msg_release(strp);
- tls_strp_check_rcv(strp);
-}
-
void tls_strp_stop(struct tls_strparser *strp)
{
strp->stopped = 1;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 07f4a3d1a6f854acc7762608cc7741b3de95c195..381a723b6cacc669e333752af34f051f296d6f52 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1384,7 +1384,10 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
return ret;
if (!skb_queue_empty(&sk->sk_receive_queue)) {
- tls_strp_check_rcv(&ctx->strp);
+ /* tls_strp_check_rcv() is called at each receive
+ * path's exit before the socket lock is released.
+ */
+ tls_strp_check_rcv_quiet(&ctx->strp);
if (tls_strp_msg_ready(ctx))
break;
}
@@ -1867,9 +1870,9 @@ static int tls_record_content_type(struct msghdr *msg, struct tls_msg *tlm,
return 1;
}
-static void tls_rx_rec_done(struct tls_sw_context_rx *ctx)
+static void tls_rx_rec_release(struct tls_sw_context_rx *ctx)
{
- tls_strp_msg_done(&ctx->strp);
+ tls_strp_msg_release(&ctx->strp);
}
/* This function traverses the rx_list in tls receive context to copies the
@@ -2150,7 +2153,7 @@ int tls_sw_recvmsg(struct sock *sk,
err = tls_record_content_type(msg, tls_msg(darg.skb), &control);
if (err <= 0) {
DEBUG_NET_WARN_ON_ONCE(darg.zc);
- tls_rx_rec_done(ctx);
+ tls_rx_rec_release(ctx);
put_on_rx_list_err:
__skb_queue_tail(&ctx->rx_list, darg.skb);
goto recv_end;
@@ -2164,7 +2167,8 @@ int tls_sw_recvmsg(struct sock *sk,
/* TLS 1.3 may have updated the length by more than overhead */
rxm = strp_msg(darg.skb);
chunk = rxm->full_len;
- tls_rx_rec_done(ctx);
+ tls_rx_rec_release(ctx);
+ tls_strp_check_rcv_quiet(&ctx->strp);
if (!darg.zc) {
bool partially_consumed = chunk > len;
@@ -2258,6 +2262,7 @@ int tls_sw_recvmsg(struct sock *sk,
copied += decrypted;
end:
+ tls_strp_check_rcv(&ctx->strp);
tls_rx_reader_unlock(sk, ctx);
if (psock)
sk_psock_put(sk, psock);
@@ -2298,7 +2303,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
if (err < 0)
goto splice_read_end;
- tls_rx_rec_done(ctx);
+ tls_rx_rec_release(ctx);
skb = darg.skb;
}
@@ -2325,6 +2330,7 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
consume_skb(skb);
splice_read_end:
+ tls_strp_check_rcv(&ctx->strp);
tls_rx_reader_unlock(sk, ctx);
return copied ? : err;
@@ -2390,7 +2396,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
tlm = tls_msg(skb);
decrypted += rxm->full_len;
- tls_rx_rec_done(ctx);
+ tls_rx_rec_release(ctx);
}
/* read_sock does not support reading control messages */
@@ -2420,6 +2426,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
}
read_sock_end:
+ tls_strp_check_rcv(&ctx->strp);
tls_rx_reader_release(sk, ctx);
return copied ? : err;
--
2.53.0
* [PATCH net-next v4 6/8] tls: Flush backlog before waiting for a new record
2026-03-17 15:04 [PATCH net-next v4 0/8] TLS read_sock performance scalability Chuck Lever
` (4 preceding siblings ...)
2026-03-17 15:04 ` [PATCH net-next v4 5/8] tls: Suppress spurious saved_data_ready on all receive paths Chuck Lever
@ 2026-03-17 15:04 ` Chuck Lever
2026-03-17 15:04 ` [PATCH net-next v4 7/8] tls: Restructure tls_sw_read_sock() into submit/deliver phases Chuck Lever
2026-03-17 15:04 ` [PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock Chuck Lever
7 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-17 15:04 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
While lock_sock is held, incoming TCP segments land on
sk->sk_backlog rather than sk->sk_receive_queue.
tls_rx_rec_wait() inspects only sk_receive_queue, so
backlog data remains invisible. For non-blocking callers
(read_sock, and recvmsg or splice_read with MSG_DONTWAIT)
this causes a spurious -EAGAIN. For blocking callers it
forces an unnecessary sleep/wakeup cycle.
Flush the backlog inside tls_rx_rec_wait() before checking
sk_receive_queue so the strparser can parse newly-arrived
segments immediately.
Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 381a723b6cacc669e333752af34f051f296d6f52..5b154afbd7ac2ddd51b46d8d6bef0a7a41f0a841 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1383,6 +1383,7 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
if (ret < 0)
return ret;
+ sk_flush_backlog(sk);
if (!skb_queue_empty(&sk->sk_receive_queue)) {
/* tls_strp_check_rcv() is called at each receive
* path's exit before the socket lock is released.
--
2.53.0
* [PATCH net-next v4 7/8] tls: Restructure tls_sw_read_sock() into submit/deliver phases
2026-03-17 15:04 [PATCH net-next v4 0/8] TLS read_sock performance scalability Chuck Lever
` (5 preceding siblings ...)
2026-03-17 15:04 ` [PATCH net-next v4 6/8] tls: Flush backlog before waiting for a new record Chuck Lever
@ 2026-03-17 15:04 ` Chuck Lever
2026-03-23 11:31 ` Sabrina Dubroca
2026-03-17 15:04 ` [PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock Chuck Lever
7 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2026-03-17 15:04 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
Pipelining multiple AEAD operations requires separating decryption
from delivery so that several records can be submitted before any
are passed to the read_actor callback. The main loop in
tls_sw_read_sock() is split into two explicit phases: a submit
phase that decrypts one record onto ctx->rx_list, and a deliver
phase that drains rx_list and passes each cleartext skb to the
read_actor callback.
With a single record per submit phase, behavior is identical to the
previous code. A subsequent patch will extend the submit phase to
pipeline multiple AEAD operations.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 70 +++++++++++++++++++++++++++++---------------------------
1 file changed, 36 insertions(+), 34 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 5b154afbd7ac2ddd51b46d8d6bef0a7a41f0a841..5ae7e0c026e4437fe442c3a77b0a6d9623816ce1 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2346,8 +2346,8 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
struct tls_prot_info *prot = &tls_ctx->prot_info;
- struct strp_msg *rxm = NULL;
struct sk_buff *skb = NULL;
+ struct strp_msg *rxm;
struct sk_psock *psock;
size_t flushed_at = 0;
bool released = true;
@@ -2372,13 +2372,10 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
decrypted = 0;
for (;;) {
- if (!skb_queue_empty(&ctx->rx_list)) {
- skb = __skb_dequeue(&ctx->rx_list);
- rxm = strp_msg(skb);
- tlm = tls_msg(skb);
- } else {
- struct tls_decrypt_arg darg;
+ struct tls_decrypt_arg darg;
+ /* Phase 1: Submit -- decrypt one record onto rx_list. */
+ if (skb_queue_empty(&ctx->rx_list)) {
err = tls_rx_rec_wait(sk, NULL, true, released);
if (err <= 0)
goto read_sock_end;
@@ -2392,38 +2389,43 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
released = tls_read_flush_backlog(sk, prot, INT_MAX,
0, decrypted,
&flushed_at);
- skb = darg.skb;
+ decrypted += strp_msg(darg.skb)->full_len;
+ tls_rx_rec_release(ctx);
+ __skb_queue_tail(&ctx->rx_list, darg.skb);
+ }
+
+ /* Phase 2: Deliver -- drain rx_list to read_actor */
+ while ((skb = __skb_dequeue(&ctx->rx_list)) != NULL) {
rxm = strp_msg(skb);
tlm = tls_msg(skb);
- decrypted += rxm->full_len;
- tls_rx_rec_release(ctx);
- }
-
- /* read_sock does not support reading control messages */
- if (tlm->control != TLS_RECORD_TYPE_DATA) {
- err = -EINVAL;
- goto read_sock_requeue;
- }
-
- used = read_actor(desc, skb, rxm->offset, rxm->full_len);
- if (used <= 0) {
- if (!copied)
- err = used;
- goto read_sock_requeue;
- }
- copied += used;
- if (used < rxm->full_len) {
- rxm->offset += used;
- rxm->full_len -= used;
- if (!desc->count)
+ /* read_sock does not support reading control messages */
+ if (tlm->control != TLS_RECORD_TYPE_DATA) {
+ err = -EINVAL;
goto read_sock_requeue;
- } else {
- consume_skb(skb);
- skb = NULL;
- if (!desc->count)
- break;
+ }
+
+ used = read_actor(desc, skb, rxm->offset,
+ rxm->full_len);
+ if (used <= 0) {
+ if (!copied)
+ err = used;
+ goto read_sock_requeue;
+ }
+ copied += used;
+ if (used < rxm->full_len) {
+ rxm->offset += used;
+ rxm->full_len -= used;
+ if (!desc->count)
+ goto read_sock_requeue;
+ } else {
+ consume_skb(skb);
+ skb = NULL;
+ }
}
+ /* Drain all of rx_list before honoring !desc->count */
+ if (!desc->count)
+ break;
}
read_sock_end:
--
2.53.0
* [PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-17 15:04 [PATCH net-next v4 0/8] TLS read_sock performance scalability Chuck Lever
` (6 preceding siblings ...)
2026-03-17 15:04 ` [PATCH net-next v4 7/8] tls: Restructure tls_sw_read_sock() into submit/deliver phases Chuck Lever
@ 2026-03-17 15:04 ` Chuck Lever
2026-03-23 14:14 ` Sabrina Dubroca
2026-03-23 21:28 ` Chuck Lever
7 siblings, 2 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-17 15:04 UTC (permalink / raw)
To: john.fastabend, kuba, sd
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
tls_sw_read_sock() decrypts one TLS record at a time, blocking until
each AEAD operation completes before proceeding. Hardware async
crypto engines depend on pipelining multiple operations to achieve
full throughput, and the one-at-a-time model prevents that. Kernel
consumers such as NVMe-TCP and NFSD (when using TLS) are therefore
unable to benefit from hardware offload.
When ctx->async_capable is true, the submit phase now loops up to
TLS_READ_SOCK_BATCH (16) records. The first record waits via
tls_rx_rec_wait(); subsequent iterations use tls_strp_msg_ready()
and tls_strp_check_rcv() to collect records already queued on the
socket without blocking. Each record is submitted with darg.async
set, and all resulting skbs are appended to rx_list.
After the submit loop, a single tls_decrypt_async_drain() collects
all pending AEAD completions before the deliver phase passes
cleartext records to the consumer. The batch bound of 16 limits
concurrent memory consumption to 16 cleartext skbs plus their AEAD
contexts. If async_capable is false, the loop exits after one
record and the async wait is skipped, preserving prior behavior.
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/tls/tls_sw.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 76 insertions(+), 16 deletions(-)
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 5ae7e0c026e4437fe442c3a77b0a6d9623816ce1..bc500ba7ce81eb33763c37a8b73473c42dc66044 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -261,6 +261,12 @@ static int tls_decrypt_async_drain(struct tls_sw_context_rx *ctx)
return ret;
}
+/* Submit an AEAD decrypt request. On success with darg->async set,
+ * the caller must not touch aead_req; the completion handler frees
+ * it. Every error return clears darg->async and guarantees no
+ * in-flight AEAD operation remains -- callers rely on this to
+ * safely free aead_req and to skip async drain on error paths.
+ */
static int tls_do_decryption(struct sock *sk,
struct scatterlist *sgin,
struct scatterlist *sgout,
@@ -2340,6 +2346,13 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
goto splice_read_end;
}
+/* Bound on concurrent async AEAD submissions per read_sock
+ * call. Chosen to fill typical hardware crypto pipelines
+ * without excessive memory consumption (each in-flight record
+ * holds one cleartext skb plus its AEAD request context).
+ */
+#define TLS_READ_SOCK_BATCH 16
+
int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
sk_read_actor_t read_actor)
{
@@ -2351,6 +2364,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
struct sk_psock *psock;
size_t flushed_at = 0;
bool released = true;
+ bool async = false;
struct tls_msg *tlm;
ssize_t copied = 0;
ssize_t decrypted;
@@ -2373,25 +2387,61 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
decrypted = 0;
for (;;) {
struct tls_decrypt_arg darg;
+ int nr_async = 0;
- /* Phase 1: Submit -- decrypt one record onto rx_list. */
+ /* Phase 1: Submit -- decrypt records onto rx_list. */
if (skb_queue_empty(&ctx->rx_list)) {
- err = tls_rx_rec_wait(sk, NULL, true, released);
- if (err <= 0)
+ while (nr_async < TLS_READ_SOCK_BATCH) {
+ if (nr_async == 0) {
+ err = tls_rx_rec_wait(sk, NULL,
+ true,
+ released);
+ if (err <= 0)
+ goto read_sock_end;
+ } else {
+ if (!tls_strp_msg_ready(ctx)) {
+ tls_strp_check_rcv_quiet(&ctx->strp);
+ if (!tls_strp_msg_ready(ctx))
+ break;
+ }
+ if (!tls_strp_msg_load(&ctx->strp,
+ released))
+ break;
+ }
+
+ memset(&darg.inargs, 0, sizeof(darg.inargs));
+ darg.async = ctx->async_capable;
+
+ err = tls_rx_one_record(sk, NULL, &darg);
+ if (err < 0)
+ goto read_sock_end;
+
+ async |= darg.async;
+ released = tls_read_flush_backlog(sk, prot,
+ INT_MAX,
+ 0,
+ decrypted,
+ &flushed_at);
+ decrypted += strp_msg(darg.skb)->full_len;
+ tls_rx_rec_release(ctx);
+ __skb_queue_tail(&ctx->rx_list, darg.skb);
+ nr_async++;
+
+ if (!ctx->async_capable)
+ break;
+ }
+ }
+
+ /* Async wait -- collect pending AEAD completions */
+ if (async) {
+ int ret = tls_decrypt_async_drain(ctx);
+
+ async = false;
+ if (ret) {
+ __skb_queue_purge(&ctx->rx_list);
+ err = ret;
goto read_sock_end;
-
- memset(&darg.inargs, 0, sizeof(darg.inargs));
-
- err = tls_rx_one_record(sk, NULL, &darg);
- if (err < 0)
- goto read_sock_end;
-
- released = tls_read_flush_backlog(sk, prot, INT_MAX,
- 0, decrypted,
- &flushed_at);
- decrypted += strp_msg(darg.skb)->full_len;
- tls_rx_rec_release(ctx);
- __skb_queue_tail(&ctx->rx_list, darg.skb);
+ }
}
/* Phase 2: Deliver -- drain rx_list to read_actor */
@@ -2429,6 +2479,16 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
}
read_sock_end:
+ if (async) {
+ int ret = tls_decrypt_async_drain(ctx);
+
+ __skb_queue_purge(&ctx->rx_list);
+ /* Preserve the error that triggered early exit;
+ * a crypto drain error is secondary.
+ */
+ if (ret && !err)
+ err = ret;
+ }
tls_strp_check_rcv(&ctx->strp);
tls_rx_reader_release(sk, ctx);
return copied ? : err;
--
2.53.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg
2026-03-17 15:04 ` [PATCH PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
@ 2026-03-17 19:55 ` Breno Leitao
2026-03-19 17:21 ` Sabrina Dubroca
1 sibling, 0 replies; 25+ messages in thread
From: Breno Leitao @ 2026-03-17 19:55 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, kuba, sd, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke, Alistair Francis
On Tue, Mar 17, 2026 at 11:04:14AM -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> The recvmsg path pairs tls_decrypt_async_wait() with
> __skb_queue_purge(&ctx->async_hold). Bundling the two into
> tls_decrypt_async_drain() gives later patches a single call
> for async teardown.
>
> The purge is kept separate from tls_decrypt_async_wait()
> because other callers (the -EBUSY fallback in
> tls_do_decryption and the tls_strp_msg_hold error path)
> need to synchronize without discarding held skbs that
> are still awaiting delivery.
>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Breno Leitao <leitao@debian.org>
* Re: [PATCH PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg
2026-03-17 15:04 ` [PATCH PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
2026-03-17 19:55 ` Breno Leitao
@ 2026-03-19 17:21 ` Sabrina Dubroca
2026-03-20 1:03 ` Chuck Lever
1 sibling, 1 reply; 25+ messages in thread
From: Sabrina Dubroca @ 2026-03-19 17:21 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, kuba, netdev, kernel-tls-handshake, Chuck Lever,
Hannes Reinecke, Alistair Francis
2026-03-17, 11:04:14 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> The recvmsg path pairs tls_decrypt_async_wait() with
> __skb_queue_purge(&ctx->async_hold). Bundling the two into
> tls_decrypt_async_drain() gives later patches a single call
> for async teardown.
>
> The purge is kept separate from tls_decrypt_async_wait()
> because other callers (the -EBUSY fallback in
> tls_do_decryption and the tls_strp_msg_hold error path)
> need to synchronize without discarding held skbs that
> are still awaiting delivery.
Sorry, but why do we need to keep the encrypted skbs around in the
async_hold queue once async decryption has completed? I guess I'm
missing something.
--
Sabrina
* Re: [PATCH PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg
2026-03-19 17:21 ` Sabrina Dubroca
@ 2026-03-20 1:03 ` Chuck Lever
0 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-20 1:03 UTC (permalink / raw)
To: Sabrina Dubroca
Cc: john.fastabend, Jakub Kicinski, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke, Alistair Francis
On Thu, Mar 19, 2026, at 1:21 PM, Sabrina Dubroca wrote:
> 2026-03-17, 11:04:14 -0400, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> The recvmsg path pairs tls_decrypt_async_wait() with
>> __skb_queue_purge(&ctx->async_hold). Bundling the two into
>> tls_decrypt_async_drain() gives later patches a single call
>> for async teardown.
>>
>> The purge is kept separate from tls_decrypt_async_wait()
>> because other callers (the -EBUSY fallback in
>> tls_do_decryption and the tls_strp_msg_hold error path)
>> need to synchronize without discarding held skbs that
>> are still awaiting delivery.
>
> Sorry, but why do we need to keep the encrypted skbs around in the
> async_hold queue once async decryption has completed? I guess I'm
> missing something.
Ah. We don’t need to keep the encrypted skbs. v5 will address that.
--
Chuck Lever
* Re: [PATCH PATCH net-next v4 2/8] tls: Abort the connection on decrypt failure
2026-03-17 15:04 ` [PATCH PATCH net-next v4 2/8] tls: Abort the connection on decrypt failure Chuck Lever
@ 2026-03-23 10:22 ` Sabrina Dubroca
0 siblings, 0 replies; 25+ messages in thread
From: Sabrina Dubroca @ 2026-03-23 10:22 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, kuba, netdev, kernel-tls-handshake, Chuck Lever,
Hannes Reinecke
2026-03-17, 11:04:15 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> recvmsg, read_sock, and splice_read each open-code a
> tls_err_abort() call after tls_rx_one_record() fails.
> Move the abort into tls_rx_one_record() so each receive
> path shares a single decrypt-and-abort sequence.
>
> Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/tls/tls_sw.c | 16 ++++++----------
> 1 file changed, 6 insertions(+), 10 deletions(-)
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
This introduces a minor change in the unexpected case
tls_check_pending_rekey() fails (because we won't call tls_err_abort
anymore in this case), but I think that's ok.
--
Sabrina
* Re: [PATCH PATCH net-next v4 5/8] tls: Suppress spurious saved_data_ready on all receive paths
2026-03-17 15:04 ` [PATCH PATCH net-next v4 5/8] tls: Suppress spurious saved_data_ready on all receive paths Chuck Lever
@ 2026-03-23 10:32 ` Sabrina Dubroca
0 siblings, 0 replies; 25+ messages in thread
From: Sabrina Dubroca @ 2026-03-23 10:32 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, kuba, netdev, kernel-tls-handshake, Chuck Lever,
Alistair Francis, Hannes Reinecke
2026-03-17, 11:04:18 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Each record release via tls_strp_msg_done() triggers
> tls_strp_check_rcv(), which calls tls_rx_msg_ready() and
> fires saved_data_ready(). During a multi-record receive,
> the first N-1 wakeups are pure overhead: the caller is
> already running and will pick up subsequent records on
> the next loop iteration. The same waste occurs on the
> recvmsg and splice_read paths.
nit: splice_read is less of a problem since it doesn't loop over
records?
[...]
> +void tls_strp_check_rcv_quiet(struct tls_strparser *strp)
> +{
> + if (unlikely(strp->stopped) || strp->msg_ready)
> + return;
> +
> + if (tls_strp_read_sock(strp) == -ENOMEM)
> + queue_work(tls_strp_wq, &strp->work);
> +}
c/p of tls_strp_check_rcv isn't nice. Add a 'bool wake_up' argument
instead? [but see the comment about recvmsg]
> void tls_strp_check_rcv(struct tls_strparser *strp)
> {
> if (unlikely(strp->stopped) || strp->msg_ready)
> @@ -551,6 +566,8 @@ void tls_strp_check_rcv(struct tls_strparser *strp)
>
> if (tls_strp_read_sock(strp) == -ENOMEM)
> queue_work(tls_strp_wq, &strp->work);
> + else if (strp->msg_ready)
> + tls_rx_msg_ready(strp);
Since that's now the only caller of tls_rx_msg_ready, and all that
does is call saved_data_ready, inline it here?
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 07f4a3d1a6f854acc7762608cc7741b3de95c195..381a723b6cacc669e333752af34f051f296d6f52 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -1384,7 +1384,10 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
> return ret;
>
> if (!skb_queue_empty(&sk->sk_receive_queue)) {
> - tls_strp_check_rcv(&ctx->strp);
> + /* tls_strp_check_rcv() is called at each receive
> + * path's exit before the socket lock is released.
> + */
I'm not convinced this comment will make sense to someone reading the
code outside of reviewing this series.
> + tls_strp_check_rcv_quiet(&ctx->strp);
> if (tls_strp_msg_ready(ctx))
> break;
> }
> @@ -1867,9 +1870,9 @@ static int tls_record_content_type(struct msghdr *msg, struct tls_msg *tlm,
> return 1;
> }
>
> -static void tls_rx_rec_done(struct tls_sw_context_rx *ctx)
> +static void tls_rx_rec_release(struct tls_sw_context_rx *ctx)
> {
> - tls_strp_msg_done(&ctx->strp);
> + tls_strp_msg_release(&ctx->strp);
> }
>
> /* This function traverses the rx_list in tls receive context to copies the
> @@ -2150,7 +2153,7 @@ int tls_sw_recvmsg(struct sock *sk,
> err = tls_record_content_type(msg, tls_msg(darg.skb), &control);
> if (err <= 0) {
> DEBUG_NET_WARN_ON_ONCE(darg.zc);
> - tls_rx_rec_done(ctx);
> + tls_rx_rec_release(ctx);
> put_on_rx_list_err:
> __skb_queue_tail(&ctx->rx_list, darg.skb);
> goto recv_end;
> @@ -2164,7 +2167,8 @@ int tls_sw_recvmsg(struct sock *sk,
> /* TLS 1.3 may have updated the length by more than overhead */
> rxm = strp_msg(darg.skb);
> chunk = rxm->full_len;
> - tls_rx_rec_done(ctx);
> + tls_rx_rec_release(ctx);
> + tls_strp_check_rcv_quiet(&ctx->strp);
This one worries me: if tls_strp_check_rcv_quiet() sets msg_ready=1
without calling saved_data_ready. If we break out of the loop after
this, the final tls_strp_check_rcv() just before returning from
tls_sw_recvmsg() will do:
void tls_strp_check_rcv(struct tls_strparser *strp, bool wake_up)
{
if (unlikely(strp->stopped) || strp->msg_ready)
return;
[...]
and not call saved_data_ready?
--
Sabrina
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH PATCH net-next v4 7/8] tls: Restructure tls_sw_read_sock() into submit/deliver phases
2026-03-17 15:04 ` [PATCH PATCH net-next v4 7/8] tls: Restructure tls_sw_read_sock() into submit/deliver phases Chuck Lever
@ 2026-03-23 11:31 ` Sabrina Dubroca
0 siblings, 0 replies; 25+ messages in thread
From: Sabrina Dubroca @ 2026-03-23 11:31 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, kuba, netdev, kernel-tls-handshake, Chuck Lever,
Hannes Reinecke
2026-03-17, 11:04:20 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Pipelining multiple AEAD operations requires separating decryption
> from delivery so that several records can be submitted before any
> are passed to the read_actor callback. The main loop in
> tls_sw_read_sock() is split into two explicit phases: a submit
> phase that decrypts one record onto ctx->rx_list, and a deliver
> phase that drains rx_list and passes each cleartext skb to the
> read_actor callback.
>
> With a single record per submit phase, behavior is identical to the
> previous code. A subsequent patch will extend the submit phase to
> pipeline multiple AEAD operations.
>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/tls/tls_sw.c | 70 +++++++++++++++++++++++++++++---------------------------
> 1 file changed, 36 insertions(+), 34 deletions(-)
>
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 5b154afbd7ac2ddd51b46d8d6bef0a7a41f0a841..5ae7e0c026e4437fe442c3a77b0a6d9623816ce1 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -2346,8 +2346,8 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
> struct tls_context *tls_ctx = tls_get_ctx(sk);
> struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
> struct tls_prot_info *prot = &tls_ctx->prot_info;
> - struct strp_msg *rxm = NULL;
> struct sk_buff *skb = NULL;
> + struct strp_msg *rxm;
nit: networking tries to follow the "reverse xmas tree" ordering (a
bit broken in ktls because of context structs).
[or just leave this alone because setting it to NULL doesn't hurt?]
[...]
> + /* Phase 2: Deliver -- drain rx_list to read_actor */
> + while ((skb = __skb_dequeue(&ctx->rx_list)) != NULL) {
> rxm = strp_msg(skb);
> tlm = tls_msg(skb);
> - decrypted += rxm->full_len;
>
[...]
> + copied += used;
> + if (used < rxm->full_len) {
> + rxm->offset += used;
> + rxm->full_len -= used;
> + if (!desc->count)
> + goto read_sock_requeue;
> + } else {
> + consume_skb(skb);
> + skb = NULL;
> + }
> }
> + /* Drain all of rx_list before honoring !desc->count */
> + if (!desc->count)
> + break;
I'm not really familiar with the read_sock users, why is it ok to
ignore desc->count reaching 0 while we're in the rx_list loop?
> }
--
Sabrina
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-17 15:04 ` [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock Chuck Lever
@ 2026-03-23 14:14 ` Sabrina Dubroca
2026-03-23 15:04 ` Chuck Lever
2026-03-23 15:53 ` Chuck Lever
2026-03-23 21:28 ` Chuck Lever
1 sibling, 2 replies; 25+ messages in thread
From: Sabrina Dubroca @ 2026-03-23 14:14 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, kuba, netdev, kernel-tls-handshake, Chuck Lever,
Hannes Reinecke
2026-03-17, 11:04:21 -0400, Chuck Lever wrote:
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 5ae7e0c026e4437fe442c3a77b0a6d9623816ce1..bc500ba7ce81eb33763c37a8b73473c42dc66044 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -261,6 +261,12 @@ static int tls_decrypt_async_drain(struct tls_sw_context_rx *ctx)
> return ret;
> }
>
> +/* Submit an AEAD decrypt request. On success with darg->async set,
> + * the caller must not touch aead_req; the completion handler frees
> + * it. Every error return clears darg->async and guarantees no
> + * in-flight AEAD operation remains -- callers rely on this to
-EBUSY (which is not a "real" error) will result in calling
tls_decrypt_async_wait, but not other errors?
> + * safely free aead_req and to skip async drain on error paths.
> + */
> static int tls_do_decryption(struct sock *sk,
> struct scatterlist *sgin,
> struct scatterlist *sgout,
> @@ -2340,6 +2346,13 @@ ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
> goto splice_read_end;
> }
>
> +/* Bound on concurrent async AEAD submissions per read_sock
> + * call. Chosen to fill typical hardware crypto pipelines
> + * without excessive memory consumption (each in-flight record
> + * holds one cleartext skb plus its AEAD request context).
> + */
> +#define TLS_READ_SOCK_BATCH 16
I suspect that at some point, we'll have a request to make this
configurable (maybe system-wide, maybe by socket?).
> int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
> sk_read_actor_t read_actor)
> {
> @@ -2351,6 +2364,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
> struct sk_psock *psock;
> size_t flushed_at = 0;
> bool released = true;
> + bool async = false;
nit: reverse xmas tree(-ish)
> struct tls_msg *tlm;
> ssize_t copied = 0;
> ssize_t decrypted;
> @@ -2373,25 +2387,61 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
> decrypted = 0;
> for (;;) {
> struct tls_decrypt_arg darg;
> + int nr_async = 0;
>
> - /* Phase 1: Submit -- decrypt one record onto rx_list. */
> + /* Phase 1: Submit -- decrypt records onto rx_list. */
> if (skb_queue_empty(&ctx->rx_list)) {
> - err = tls_rx_rec_wait(sk, NULL, true, released);
> - if (err <= 0)
> + while (nr_async < TLS_READ_SOCK_BATCH) {
> + if (nr_async == 0) {
> + err = tls_rx_rec_wait(sk, NULL,
> + true,
> + released);
> + if (err <= 0)
> + goto read_sock_end;
> + } else {
> + if (!tls_strp_msg_ready(ctx)) {
> + tls_strp_check_rcv_quiet(&ctx->strp);
> + if (!tls_strp_msg_ready(ctx))
> + break;
This (and tls_rx_rec_wait) looks like tls_strp_check_rcv_quiet should
return the value of msg_ready.
This is also not very different from tls_rx_rec_wait(nonblock=true),
why are you separating the nr_async>0 case and open-coding the core
operations from tls_rx_rec_wait()?
> + }
> + if (!tls_strp_msg_load(&ctx->strp,
> + released))
> + break;
> + }
> +
> + memset(&darg.inargs, 0, sizeof(darg.inargs));
> + darg.async = ctx->async_capable;
tls_sw_recvmsg also has:
if (tlm->control == TLS_RECORD_TYPE_DATA && !bpf_strp_enabled)
before setting darg.async. bpf_strp_enabled isn't relevant since
tls_sw_read_sock aborts when there's a psock, but I think record type
matters here.
> +
> + err = tls_rx_one_record(sk, NULL, &darg);
> + if (err < 0)
> + goto read_sock_end;
> +
> + async |= darg.async;
> + released = tls_read_flush_backlog(sk, prot,
> + INT_MAX,
> + 0,
> + decrypted,
> + &flushed_at);
The level of indentation for phase 1 is getting really unreasonable.
> + decrypted += strp_msg(darg.skb)->full_len;
> + tls_rx_rec_release(ctx);
> + __skb_queue_tail(&ctx->rx_list, darg.skb);
> + nr_async++;
> +
> + if (!ctx->async_capable)
Do we want to break out here (stop the current decrypt batch and move
to phase 2) if we've already had to process all pending decrypts
(tls_decrypt_async_wait has been called)?
> + break;
> + }
> + }
--
Sabrina
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-23 14:14 ` Sabrina Dubroca
@ 2026-03-23 15:04 ` Chuck Lever
2026-03-23 23:08 ` Sabrina Dubroca
2026-03-23 15:53 ` Chuck Lever
1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2026-03-23 15:04 UTC (permalink / raw)
To: Sabrina Dubroca
Cc: john.fastabend, Jakub Kicinski, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke
On Mon, Mar 23, 2026, at 10:14 AM, Sabrina Dubroca wrote:
> 2026-03-17, 11:04:21 -0400, Chuck Lever wrote:
>> +/* Bound on concurrent async AEAD submissions per read_sock
>> + * call. Chosen to fill typical hardware crypto pipelines
>> + * without excessive memory consumption (each in-flight record
>> + * holds one cleartext skb plus its AEAD request context).
>> + */
>> +#define TLS_READ_SOCK_BATCH 16
>
> I suspect that at some point, we'll have a request to make this
> configurable (maybe system-wide, maybe by socket?).
I appreciate your careful and close review. The series has
improved significantly.
I will admit that the current value (16) is arbitrary. I agree
that someone might want to modify this value. At this point,
however, the constant is straightforward and it is still quite
easy to promote to a tunable later if that proves to be needed.
The right interface for this depends on kTLS consumer needs
that aren't clear (to me) yet. But let me know if you have a
preferred API mechanism or a specific use case in mind, or if
there is a netdev policy that should guide the introduction
of a suitable API for this purpose.
--
Chuck Lever
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-23 14:14 ` Sabrina Dubroca
2026-03-23 15:04 ` Chuck Lever
@ 2026-03-23 15:53 ` Chuck Lever
1 sibling, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-23 15:53 UTC (permalink / raw)
To: Sabrina Dubroca
Cc: john.fastabend, Jakub Kicinski, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke
On Mon, Mar 23, 2026, at 10:14 AM, Sabrina Dubroca wrote:
> 2026-03-17, 11:04:21 -0400, Chuck Lever wrote:
>> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
>> index 5ae7e0c026e4437fe442c3a77b0a6d9623816ce1..bc500ba7ce81eb33763c37a8b73473c42dc66044 100644
>> --- a/net/tls/tls_sw.c
>> +++ b/net/tls/tls_sw.c
>> @@ -2373,25 +2387,61 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
>> decrypted = 0;
>> for (;;) {
>> struct tls_decrypt_arg darg;
>> + int nr_async = 0;
>>
>> - /* Phase 1: Submit -- decrypt one record onto rx_list. */
>> + /* Phase 1: Submit -- decrypt records onto rx_list. */
>> if (skb_queue_empty(&ctx->rx_list)) {
>> - err = tls_rx_rec_wait(sk, NULL, true, released);
>> - if (err <= 0)
>> + while (nr_async < TLS_READ_SOCK_BATCH) {
>> + if (nr_async == 0) {
>> + err = tls_rx_rec_wait(sk, NULL,
>> + true,
>> + released);
>> + if (err <= 0)
>> + goto read_sock_end;
>> + } else {
>> + if (!tls_strp_msg_ready(ctx)) {
>> + tls_strp_check_rcv_quiet(&ctx->strp);
>> + if (!tls_strp_msg_ready(ctx))
>> + break;
>
> This (and tls_rx_rec_wait) looks like tls_strp_check_rcv_quiet should
> return the value of msg_ready.
>
> This is also not very different from tls_rx_rec_wait(nonblock=true),
> why are you separating the nr_async>0 case and open-coding the core
> operations from tls_rx_rec_wait()?
tls_rx_rec_wait() is designed for the blocking first-record case
with full error/shutdown handling. The batch loop needs a
lightweight non-blocking poll with different semantics: "is another
record already available? If not, stop." Reusing tls_rx_rec_wait()
would require either adding more parameters to skip irrelevant
checks, or having the caller distinguish -EAGAIN (no record) from
real errors -- both of which would be more complex than the
current code.
My calculus was that a small amount of code duplication is more
readable than a single complicated helper that obfuscates both
flows.
--
Chuck Lever
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-17 15:04 ` [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock Chuck Lever
2026-03-23 14:14 ` Sabrina Dubroca
@ 2026-03-23 21:28 ` Chuck Lever
2026-03-23 21:41 ` Jakub Kicinski
2026-03-23 22:48 ` Sabrina Dubroca
1 sibling, 2 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-23 21:28 UTC (permalink / raw)
To: john.fastabend, Jakub Kicinski, Sabrina Dubroca
Cc: netdev, kernel-tls-handshake, Chuck Lever, Hannes Reinecke
On Tue, Mar 17, 2026, at 11:04 AM, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> tls_sw_read_sock() decrypts one TLS record at a time, blocking until
> each AEAD operation completes before proceeding. Hardware async
> crypto engines depend on pipelining multiple operations to achieve
> full throughput, and the one-at-a-time model prevents that. Kernel
> consumers such as NVMe-TCP and NFSD (when using TLS) are therefore
> unable to benefit from hardware offload.
>
> When ctx->async_capable is true, the submit phase now loops up to
> TLS_READ_SOCK_BATCH (16) records.
It appears that async_capable is always false for TLSv1.3. Since
TLSv1.3 is a hard requirement for both NVMe/TCP and RPC-with-TLS,
patch 8/8 is moot for us. For the moment, I'm going to drop this
one from the series. Once Alistair's KeyUpdate work is merged, we
can revisit.
--
Chuck Lever
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-23 21:28 ` Chuck Lever
@ 2026-03-23 21:41 ` Jakub Kicinski
2026-03-23 22:48 ` Sabrina Dubroca
1 sibling, 0 replies; 25+ messages in thread
From: Jakub Kicinski @ 2026-03-23 21:41 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, Sabrina Dubroca, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke
On Mon, 23 Mar 2026 17:28:27 -0400 Chuck Lever wrote:
> On Tue, Mar 17, 2026, at 11:04 AM, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> >
> > tls_sw_read_sock() decrypts one TLS record at a time, blocking until
> > each AEAD operation completes before proceeding. Hardware async
> > crypto engines depend on pipelining multiple operations to achieve
> > full throughput, and the one-at-a-time model prevents that. Kernel
> > consumers such as NVMe-TCP and NFSD (when using TLS) are therefore
> > unable to benefit from hardware offload.
> >
> > When ctx->async_capable is true, the submit phase now loops up to
> > TLS_READ_SOCK_BATCH (16) records.
>
> It appears that async_capable is always false for TLSv1.3. Since
> TLSv1.3 is a hard requirement for both NVMe/TCP and RPC-with-TLS,
> patch 8/8 is moot for us. For the moment, I'm going to drop this
> one from the series. Once Alistair's KeyUpdate work is merged, we
> can revisit.
Excellent, I was about to ask you for the actual use case..
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-23 21:28 ` Chuck Lever
2026-03-23 21:41 ` Jakub Kicinski
@ 2026-03-23 22:48 ` Sabrina Dubroca
2026-03-24 12:44 ` Chuck Lever
1 sibling, 1 reply; 25+ messages in thread
From: Sabrina Dubroca @ 2026-03-23 22:48 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, Jakub Kicinski, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke
2026-03-23, 17:28:27 -0400, Chuck Lever wrote:
>
> On Tue, Mar 17, 2026, at 11:04 AM, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> >
> > tls_sw_read_sock() decrypts one TLS record at a time, blocking until
> > each AEAD operation completes before proceeding. Hardware async
> > crypto engines depend on pipelining multiple operations to achieve
> > full throughput, and the one-at-a-time model prevents that. Kernel
> > consumers such as NVMe-TCP and NFSD (when using TLS) are therefore
> > unable to benefit from hardware offload.
> >
> > When ctx->async_capable is true, the submit phase now loops up to
> > TLS_READ_SOCK_BATCH (16) records.
>
> It appears that async_capable is always false for TLSv1.3. Since
> TLSv1.3 is a hard requirement for both NVMe/TCP and RPC-with-TLS,
> patch 8/8 is moot for us. For the moment, I'm going to drop this
> one from the series.
Then 7/8 is also not useful, and the series boils down to a few small
improvements (tls_decrypt_async_drain, spurious wakeups, checking the
backlog), which are not limited to read_sock. [nothing wrong with
that, it's just a different focus from what you started with]
> Once Alistair's KeyUpdate work is merged, we can revisit.
Are you planning to add support for async crypto with TLS1.3?
--
Sabrina
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-23 15:04 ` Chuck Lever
@ 2026-03-23 23:08 ` Sabrina Dubroca
2026-03-24 13:17 ` Chuck Lever
0 siblings, 1 reply; 25+ messages in thread
From: Sabrina Dubroca @ 2026-03-23 23:08 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, Jakub Kicinski, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke
2026-03-23, 11:04:16 -0400, Chuck Lever wrote:
>
> On Mon, Mar 23, 2026, at 10:14 AM, Sabrina Dubroca wrote:
> > 2026-03-17, 11:04:21 -0400, Chuck Lever wrote:
> >> +/* Bound on concurrent async AEAD submissions per read_sock
> >> + * call. Chosen to fill typical hardware crypto pipelines
> >> + * without excessive memory consumption (each in-flight record
> >> + * holds one cleartext skb plus its AEAD request context).
> >> + */
> >> +#define TLS_READ_SOCK_BATCH 16
> >
> > I suspect that at some point, we'll have a request to make this
> > configurable (maybe system-wide, maybe by socket?).
>
> I appreciate your careful and close review. The series has
> improved significantly.
>
> I will admit that the current value (16) is arbitrary. I agree
> that someone might want to modify this value. At this point,
> however, the constant is straightforward and it is still quite
> easy to promote to a tunable later if that proves to be needed.
Agreed.
> The right interface for this depends on kTLS consumer needs
> that aren't clear (to me) yet.
In this case (read_sock), the kTLS consumer is NVMe/TCP etc, and
specifically users of those features with crypto acceleration
cards. I'm not familiar with either.
> But let me know if you have a
> preferred API mechanism or a specific use case in mind, or if
> there is a netdev policy that should guide the introduction
> of a suitable API for this purpose.
Nothing specific, I just thought I'd mention it since I was replying
to the patch anyway. I think at this stage "it seems easy to promote
to a tunable later" is enough consideration (just to avoid getting
trapped in some API (or lack thereof) and unable to change it, but I
agree that it shouldn't be a problem here).
--
Sabrina
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-23 22:48 ` Sabrina Dubroca
@ 2026-03-24 12:44 ` Chuck Lever
0 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2026-03-24 12:44 UTC (permalink / raw)
To: Sabrina Dubroca
Cc: john.fastabend, Jakub Kicinski, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke
On 3/23/26 6:48 PM, Sabrina Dubroca wrote:
> 2026-03-23, 17:28:27 -0400, Chuck Lever wrote:
>>
>> On Tue, Mar 17, 2026, at 11:04 AM, Chuck Lever wrote:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> tls_sw_read_sock() decrypts one TLS record at a time, blocking until
>>> each AEAD operation completes before proceeding. Hardware async
>>> crypto engines depend on pipelining multiple operations to achieve
>>> full throughput, and the one-at-a-time model prevents that. Kernel
>>> consumers such as NVMe-TCP and NFSD (when using TLS) are therefore
>>> unable to benefit from hardware offload.
>>>
>>> When ctx->async_capable is true, the submit phase now loops up to
>>> TLS_READ_SOCK_BATCH (16) records.
>>
>> It appears that async_capable is always false for TLSv1.3. Since
>> TLSv1.3 is a hard requirement for both NVMe/TCP and RPC-with-TLS,
>> patch 8/8 is moot for us. For the moment, I'm going to drop this
>> one from the series.
>
> Then 7/8 is also not useful, and the series boils down to a few small
> improvements (tls_decrypt_async_drain, spurious wakeups, checking the
> backlog), which are not limited to read_sock. [nothing wrong with
> that, it's just a different focus from what you started with]
I think that's accurate. I can adjust the cover letter for v5.
>> Once Alistair's KeyUpdate work is merged, we can revisit.
>
> Are you planning to add support for async crypto with TLS1.3?
Async crypto is a prerequisite for batching decryption with
TLS v1.3. At the moment I'm not planning to add that support,
but we should discuss it once KeyUpdate is merged.
--
Chuck Lever
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-23 23:08 ` Sabrina Dubroca
@ 2026-03-24 13:17 ` Chuck Lever
2026-03-24 22:58 ` Sabrina Dubroca
0 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2026-03-24 13:17 UTC (permalink / raw)
To: Sabrina Dubroca
Cc: john.fastabend, Jakub Kicinski, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke
On 3/23/26 7:08 PM, Sabrina Dubroca wrote:
> 2026-03-23, 11:04:16 -0400, Chuck Lever wrote:
>>
>> On Mon, Mar 23, 2026, at 10:14 AM, Sabrina Dubroca wrote:
>>> 2026-03-17, 11:04:21 -0400, Chuck Lever wrote:
>>>> +/* Bound on concurrent async AEAD submissions per read_sock
>>>> + * call. Chosen to fill typical hardware crypto pipelines
>>>> + * without excessive memory consumption (each in-flight record
>>>> + * holds one cleartext skb plus its AEAD request context).
>>>> + */
>>>> +#define TLS_READ_SOCK_BATCH 16
>>>
>>> I suspect that at some point, we'll have a request to make this
>>> configurable (maybe system-wide, maybe by socket?).
>>
>> I appreciate your careful and close review. The series has
>> improved significantly.
>>
>> I will admit that the current value (16) is arbitrary. I agree
>> that someone might want to modify this value. At this point,
>> however, the constant is straightforward and it is still quite
>> easy to promote to a tunable later if that proves to be needed.
>
> Agreed.
>
>> The right interface for this depends on kTLS consumer needs
>> that aren't clear (to me) yet.
>
> In this case (read_sock), the kTLS consumer is NVMe/TCP etc, and
> specifically users of those features with crypto acceleration
> cards. I'm not familiar with either.
>
>> But let me know if you have a
>> preferred API mechanism or a specific use case in mind, or if
>> there is a netdev policy that should guide the introduction
>> of a suitable API for this purpose.
>
> Nothing specific, I just thought I'd mention it since I was replying
> to the patch anyway. I think at this stage "it seems easy to promote
> to a tunable later" is enough consideration (just to avoid getting
> trapped in some API (or lack thereof) and unable to change it, but I
> agree that it shouldn't be a problem here).
I looked into this a little more yesterday before noticing that
async crypto was disabled for TLS 1.3.
The hardware crypto engines do not surface a concurrency limit to
their consumers, but their ring sizes are typically much larger than
the batch limit of 16 I've chosen here. We can't rely on them to
tell kTLS where to set that limit.
However, the loop is structured to continue until the batch limit
is hit or submission returns -EBUSY. Either way, 16 seems to be a
safe but fairly conservative setting; for the larger rings it might
cause a decrypt pipeline bubble.
So the true purpose of the low limit is to constrain the amount of
memory tls_read_sock sets aside for decryption in progress. That
works out to about 256KB per socket (16 records times the 16KB
maximum TLS record payload). Yes, that could be made larger or made
contingent upon TCP socket buffer sizes, for example.
--
Chuck Lever
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock
2026-03-24 13:17 ` Chuck Lever
@ 2026-03-24 22:58 ` Sabrina Dubroca
0 siblings, 0 replies; 25+ messages in thread
From: Sabrina Dubroca @ 2026-03-24 22:58 UTC (permalink / raw)
To: Chuck Lever
Cc: john.fastabend, Jakub Kicinski, netdev, kernel-tls-handshake,
Chuck Lever, Hannes Reinecke
2026-03-24, 09:17:57 -0400, Chuck Lever wrote:
> On 3/23/26 7:08 PM, Sabrina Dubroca wrote:
> > 2026-03-23, 11:04:16 -0400, Chuck Lever wrote:
> >>
> >> On Mon, Mar 23, 2026, at 10:14 AM, Sabrina Dubroca wrote:
> >>> 2026-03-17, 11:04:21 -0400, Chuck Lever wrote:
> >>>> +/* Bound on concurrent async AEAD submissions per read_sock
> >>>> + * call. Chosen to fill typical hardware crypto pipelines
> >>>> + * without excessive memory consumption (each in-flight record
> >>>> + * holds one cleartext skb plus its AEAD request context).
> >>>> + */
> >>>> +#define TLS_READ_SOCK_BATCH 16
> >>>
> >>> I suspect that at some point, we'll have a request to make this
> >>> configurable (maybe system-wide, maybe by socket?).
> >>
> >> I appreciate your careful and close review. The series has
> >> improved significantly.
> >>
> >> I will admit that the current value (16) is arbitrary. I agree
> >> that someone might want to modify this value. At this point,
> >> however, the constant is straightforward and it is still quite
> >> easy to promote to a tunable later if that proves to be needed.
> >
> > Agreed.
> >
> >> The right interface for this depends on kTLS consumer needs
> >> that aren't clear (to me) yet.
> >
> > In this case (read_sock), the kTLS consumer is NVMe/TCP etc, and
> > specifically users of those features with crypto acceleration
> > cards. I'm not familiar with either.
> >
> >> But let me know if you have a
> >> preferred API mechanism or a specific use case in mind, or if
> >> there is a netdev policy that should guide the introduction
> >> of a suitable API for this purpose.
> >
> > Nothing specific, I just thought I'd mention it since I was replying
> > to the patch anyway. I think at this stage "it seems easy to promote
> > to a tunable later" is enough consideration (just to avoid getting
> > trapped in some API (or lack thereof) and unable to change it, but I
> > agree that it shouldn't be a problem here).
>
> I looked into this a little more yesterday before noticing that
> async crypto was disabled for TLS 1.3.
>
> The hardware crypto engines do not surface a concurrency limit to
> their consumers, but their ring sizes are typically much larger than
> the batch limit of 16 I've chosen here. We can't rely on them to
> tell kTLS where to set that limit.
>
> However, the loop is structured to continue until the batch limit
> is hit or submission returns -EBUSY. Either way, 16 seems to be a
> safe but fairly conservative setting; for the larger rings it might
> cause a decrypt pipeline bubble.
>
> So the true purpose of the low limit is to constrain the amount of
> memory tls_read_sock sets aside for decryption in progress. That
> works out to about 256KB per socket (16 records times the 16KB
> maximum TLS record payload). Yes, that could be made larger or made
> contingent upon TCP socket buffer sizes, for example.
If you end up resending those changes in the future (after/together
with async support for 1.3), it would be nice to record some of this
in the commit message, to justify the limit you've chosen.
About async for 1.3: I think it would be quite some work to implement,
since we don't know the record type until after we decrypt the record.
If we get a rekey, we'll have already submitted a bunch of other
records that will fail decryption, so we'll need to roll back and
resubmit everything after the rekey, once userspace has given us the
new key. I'm not sure whether other non-DATA records would also be
that problematic (or maybe worse?).
Also, async for 1.2 has been a pain (see the many fixes in the past
years). 1.3 looks complex.
I don't have access to HW that would benefit from it; do you have
benchmarks comparing with-async vs without-async? (Obviously for
non-NVMe/non-NFS use cases, if you test that at all.)
--
Sabrina
^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread, other threads:[~2026-03-24 22:58 UTC | newest]
Thread overview: 25+ messages
2026-03-17 15:04 [PATCH net-next v4 0/8] TLS read_sock performance scalability Chuck Lever
2026-03-17 15:04 ` [PATCH PATCH net-next v4 1/8] tls: Factor tls_decrypt_async_drain() from recvmsg Chuck Lever
2026-03-17 19:55 ` Breno Leitao
2026-03-19 17:21 ` Sabrina Dubroca
2026-03-20 1:03 ` Chuck Lever
2026-03-17 15:04 ` [PATCH PATCH net-next v4 2/8] tls: Abort the connection on decrypt failure Chuck Lever
2026-03-23 10:22 ` Sabrina Dubroca
2026-03-17 15:04 ` [PATCH PATCH net-next v4 3/8] tls: Fix dangling skb pointer in tls_sw_read_sock() Chuck Lever
2026-03-17 15:04 ` [PATCH PATCH net-next v4 4/8] tls: Factor tls_strp_msg_release() from tls_strp_msg_done() Chuck Lever
2026-03-17 15:04 ` [PATCH PATCH net-next v4 5/8] tls: Suppress spurious saved_data_ready on all receive paths Chuck Lever
2026-03-23 10:32 ` Sabrina Dubroca
2026-03-17 15:04 ` [PATCH PATCH net-next v4 6/8] tls: Flush backlog before waiting for a new record Chuck Lever
2026-03-17 15:04 ` [PATCH PATCH net-next v4 7/8] tls: Restructure tls_sw_read_sock() into submit/deliver phases Chuck Lever
2026-03-23 11:31 ` Sabrina Dubroca
2026-03-17 15:04 ` [PATCH PATCH net-next v4 8/8] tls: Enable batch async decryption in read_sock Chuck Lever
2026-03-23 14:14 ` Sabrina Dubroca
2026-03-23 15:04 ` Chuck Lever
2026-03-23 23:08 ` Sabrina Dubroca
2026-03-24 13:17 ` Chuck Lever
2026-03-24 22:58 ` Sabrina Dubroca
2026-03-23 15:53 ` Chuck Lever
2026-03-23 21:28 ` Chuck Lever
2026-03-23 21:41 ` Jakub Kicinski
2026-03-23 22:48 ` Sabrina Dubroca
2026-03-24 12:44 ` Chuck Lever