Netdev List
* [PATCH net-next v10 0/7] tls: receive-path fixes and clean-ups
@ 2026-05-11 23:25 Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 1/7] tls: Move decrypt-failure abort into tls_rx_one_record() Chuck Lever
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: Chuck Lever @ 2026-05-11 23:25 UTC (permalink / raw)
  To: John Fastabend, Jakub Kicinski, Sabrina Dubroca
  Cc: Eric Dumazet, Simon Horman, Paolo Abeni, netdev,
	kernel-tls-handshake, Chuck Lever, Hannes Reinecke, Sagi Grimberg,
	Alistair Francis

I'd like to encourage the in-kernel kTLS consumers (NFSD, NVMe/TCP)
to converge on read_sock. While auditing read_sock for that purpose,
Hannes flagged a few rough edges in the receive paths.
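
For reference, the calling pattern I have in mind looks roughly like
the sketch below; the demo_* names and the trivial actor body are
illustrative only, not lifted from NFSD or NVMe/TCP:

	/* Illustrative consumer; demo_* are not real kernel symbols.
	 * The actor is invoked for each record the TLS layer presents.
	 */
	static int demo_recv_actor(read_descriptor_t *desc, struct sk_buff *skb,
				   unsigned int offset, size_t len)
	{
		/* parse up to @len bytes of plaintext starting at @offset */
		return len;	/* bytes consumed; may be less than @len */
	}

	static void demo_drain(struct socket *sock, void *priv)
	{
		read_descriptor_t rd_desc = {
			.arg.data	= priv,
			.count		= 1,	/* keep reading while records remain */
		};

		lock_sock(sock->sk);
		sock->ops->read_sock(sock->sk, &rd_desc, demo_recv_actor);
		release_sock(sock->sk);
	}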

This series is a set of clean-ups, not a performance series. Async
batch decryption and its submit/deliver scaffolding were dropped
during previous review: async_capable is always false for TLS 1.3,
the version NFSD and NVMe/TCP both require, so async-related
improvements were unreachable for the in-kernel consumers this
work targets.

A subsequent series will introduce infrastructure to support
KeyUpdate for in-kernel kTLS consumers, which need to handle TLS
Alert messages that trigger a tlshd upcall.

---
Changes since v9:
- Recast cover letter: this is a clean-up series, not a
  performance series (Jakub, Sabrina)
- Rephrase subject to describe the refactor (Jakub)
- Split 2/5 into two patches separating the do/while
  loop-structure cleanup from the partial-consume fix (Sabrina)
- Continue the loop after a partial consume to match
  __tcp_read_sock() semantics, instead of exiting (Jakub)
- Drop kdoc on the internal function and rename
  tls_strp_msg_release() to tls_strp_msg_consume() (Jakub)
- Drop kdoc on tls_strp_check_rcv() and un-wrap the "Defer
  notification" comment (Jakub)
- Rename tls_strp_check_rcv() parameter wake to announce, and
  tls_rx_msg_ready() to tls_rx_msg_maybe_announce() (Jakub)
- Drop tls_rx_handoff(); fold the per-record path back into
  tls_rx_rec_done() and fire the deferred announce from
  tls_rx_reader_release() (Jakub)
- New patch: Preserve sk_err across recvmsg() when data has
  been copied, so a connection abort during sk_flush_backlog()
  surfaces on the next read instead of vanishing when the
  caller returns the bytes already accumulated

Changes since v8:
- Address review comments from sashiko
  - Patch 2: Requeue partially consumed skb to prevent leak
  - Patch 5: Re-check sk_err so RST during flush surfaces as
    -ECONNRESET instead of EOF
- Address review comments from gpt-5.5
  - Patch 4: Restore msg_ready early-return in tls_strp_check_rcv()
    so the queued strp_work doesn't double-wake the consumer
  - Patch 4: Add tls_strparser msg_announced bit so the recvmsg
    exit-point handoff doesn't re-fire saved_data_ready() for a
    record BH or the worker already announced (rx_list-only drain
    path)

Changes since v7:
- Rebased on net-next (v7.1-rc1)

Changes since v6:
- Rebased on net-next; patch 1/6 from v5 was merged upstream

Changes since v5:
- Patch 6: Set released = true when sk_flush_backlog() returns
  true, so tls_strp_msg_load() knows the socket lock was
  released (Sabrina)
- Patch 6: Drop Fixes tag; submit bug fix separately via net
  if warranted (Sabrina)
- Patch 6: Note redundant flush on cold path in commit message
  (Sabrina)

Changes since v4:
- Drop batch async decryption and submit/deliver restructure:
  async_capable is always false for TLS 1.3, so the new code
  was unreachable for NFS and NVMe/TCP
- Purge async_hold directly in tls_decrypt_async_wait() and drop
  the tls_decrypt_async_drain() wrapper
- Merge tls_strp_check_rcv_quiet() into tls_strp_check_rcv() with
  a bool wake parameter; fix lost wakeup on the recvmsg exit path

Changes since v3:
- Clarify why tls_decrypt_async_drain() is separate from _wait()
- Fold tls_err_abort() into tls_rx_one_record(), drop tls_rx_decrypt_record()
- Move backlog flush into tls_rx_rec_wait() so all RX paths benefit

Changes since v2:
- Fix short read self tests

Changes since v1:
- Add C11 reference
- Extend data_ready reduction to recvmsg and splice
- Restructure read_sock and recvmsg using shared helpers

---
Chuck Lever (7):
      tls: Move decrypt-failure abort into tls_rx_one_record()
      tls: Avoid evaluating freed skb in tls_sw_read_sock() loop
      tls: Re-present partially-consumed records in tls_sw_read_sock()
      tls: Factor tls_strp_msg_consume() from tls_strp_msg_done()
      tls: Suppress spurious saved_data_ready on all receive paths
      tls: Flush backlog before waiting for a new record
      tls: Preserve sk_err across recvmsg() when data has been copied

 include/net/tls.h  |   5 +++
 net/tls/tls.h      |   6 ++--
 net/tls/tls_main.c |   2 +-
 net/tls/tls_strp.c |  26 +++++++++-----
 net/tls/tls_sw.c   | 103 ++++++++++++++++++++++++++++++++++++++++-------------
 5 files changed, 105 insertions(+), 37 deletions(-)
---
base-commit: 63751099502d10f0aa6bb35273e56c5800cc4e3a
change-id: 20260317-tls-read-sock-a0022c9df265

Best regards,
--  
Chuck Lever <chuck.lever@oracle.com>


* [PATCH net-next v10 1/7] tls: Move decrypt-failure abort into tls_rx_one_record()
  2026-05-11 23:25 [PATCH net-next v10 0/7] tls: receive-path fixes and clean-ups Chuck Lever
@ 2026-05-11 23:25 ` Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 2/7] tls: Avoid evaluating freed skb in tls_sw_read_sock() loop Chuck Lever
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-05-11 23:25 UTC (permalink / raw)
  To: John Fastabend, Jakub Kicinski, Sabrina Dubroca
  Cc: Eric Dumazet, Simon Horman, Paolo Abeni, netdev,
	kernel-tls-handshake, Chuck Lever, Hannes Reinecke

From: Chuck Lever <chuck.lever@oracle.com>

Three receive paths -- recvmsg, read_sock, and splice_read --
each follow tls_rx_one_record() with the same tls_err_abort()
call. Consolidate the abort into tls_rx_one_record() so the
decrypt-and-abort sequence lives in one place.

A tls_check_pending_rekey() failure after successful
decryption no longer triggers tls_err_abort(). That path
fires only when skb_copy_bits() fails on a valid skb,
which is not a realistic scenario.

Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/tls/tls_sw.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 2590e855f6a5..f607ccccb232 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1809,6 +1809,9 @@ static int tls_check_pending_rekey(struct sock *sk, struct tls_context *ctx,
 	return 0;
 }
 
+/* On decrypt failure the connection is aborted (sk_err set) before
+ * returning a negative errno.
+ */
 static int tls_rx_one_record(struct sock *sk, struct msghdr *msg,
 			     struct tls_decrypt_arg *darg)
 {
@@ -1820,8 +1823,10 @@ static int tls_rx_one_record(struct sock *sk, struct msghdr *msg,
 	err = tls_decrypt_device(sk, msg, tls_ctx, darg);
 	if (!err)
 		err = tls_decrypt_sw(sk, tls_ctx, msg, darg);
-	if (err < 0)
+	if (err < 0) {
+		tls_err_abort(sk, -EBADMSG);
 		return err;
+	}
 
 	rxm = strp_msg(darg->skb);
 	rxm->offset += prot->prepend_size;
@@ -2132,10 +2137,8 @@ int tls_sw_recvmsg(struct sock *sk,
 			darg.async = false;
 
 		err = tls_rx_one_record(sk, msg, &darg);
-		if (err < 0) {
-			tls_err_abort(sk, -EBADMSG);
+		if (err < 0)
 			goto recv_end;
-		}
 
 		async |= darg.async;
 
@@ -2294,10 +2297,8 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 		memset(&darg.inargs, 0, sizeof(darg.inargs));
 
 		err = tls_rx_one_record(sk, NULL, &darg);
-		if (err < 0) {
-			tls_err_abort(sk, -EBADMSG);
+		if (err < 0)
 			goto splice_read_end;
-		}
 
 		tls_rx_rec_done(ctx);
 		skb = darg.skb;
@@ -2380,10 +2381,8 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
 			memset(&darg.inargs, 0, sizeof(darg.inargs));
 
 			err = tls_rx_one_record(sk, NULL, &darg);
-			if (err < 0) {
-				tls_err_abort(sk, -EBADMSG);
+			if (err < 0)
 				goto read_sock_end;
-			}
 
 			released = tls_read_flush_backlog(sk, prot, INT_MAX,
 							  0, decrypted,

-- 
2.54.0


* [PATCH net-next v10 2/7] tls: Avoid evaluating freed skb in tls_sw_read_sock() loop
  2026-05-11 23:25 [PATCH net-next v10 0/7] tls: receive-path fixes and clean-ups Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 1/7] tls: Move decrypt-failure abort into tls_rx_one_record() Chuck Lever
@ 2026-05-11 23:25 ` Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 3/7] tls: Re-present partially-consumed records in tls_sw_read_sock() Chuck Lever
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-05-11 23:25 UTC (permalink / raw)
  To: John Fastabend, Jakub Kicinski, Sabrina Dubroca
  Cc: Eric Dumazet, Simon Horman, Paolo Abeni, netdev,
	kernel-tls-handshake, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

tls_sw_read_sock() ends its receive loop with while (skb), but
the else branch in the loop body calls consume_skb(skb) before
the predicate is re-evaluated. A pointer becomes indeterminate
when the object it points to reaches the end of its lifetime
(C11 6.2.4p2), and using an indeterminate value is undefined
behavior (Annex J.2). The pointer is not dereferenced today --
evaluating the predicate either ends the loop or skb is
overwritten at the top of the next iteration -- but any future
change that adds a dereference between consume_skb() and the
predicate would silently introduce a use-after-free.
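
In schematic form the hazard is the pattern below (an illustrative
sketch, not the exact code; fetch_next_record() is a stand-in for
the real dequeue logic):

	do {
		skb = fetch_next_record();	/* illustrative helper */
		...
		consume_skb(skb);	/* skb's lifetime ends here */
		if (!desc->count)
			skb = NULL;
	} while (skb);			/* may evaluate the freed pointer */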

Replace the do/while form with an explicit for(;;) loop so
termination happens through a break statement rather than
predicate evaluation of a freed pointer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/tls/tls_sw.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index f607ccccb232..559bef05fee4 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2366,7 +2366,7 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
 		goto read_sock_end;
 
 	decrypted = 0;
-	do {
+	for (;;) {
 		if (!skb_queue_empty(&ctx->rx_list)) {
 			skb = __skb_dequeue(&ctx->rx_list);
 			rxm = strp_msg(skb);
@@ -2416,9 +2416,9 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
 		} else {
 			consume_skb(skb);
 			if (!desc->count)
-				skb = NULL;
+				break;
 		}
-	} while (skb);
+	}
 
 read_sock_end:
 	tls_rx_reader_release(sk, ctx);

-- 
2.54.0


* [PATCH net-next v10 3/7] tls: Re-present partially-consumed records in tls_sw_read_sock()
  2026-05-11 23:25 [PATCH net-next v10 0/7] tls: receive-path fixes and clean-ups Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 1/7] tls: Move decrypt-failure abort into tls_rx_one_record() Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 2/7] tls: Avoid evaluating freed skb in tls_sw_read_sock() loop Chuck Lever
@ 2026-05-11 23:25 ` Chuck Lever
  2026-05-12 12:52   ` Sabrina Dubroca
  2026-05-11 23:25 ` [PATCH net-next v10 4/7] tls: Factor tls_strp_msg_consume() from tls_strp_msg_done() Chuck Lever
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 9+ messages in thread
From: Chuck Lever @ 2026-05-11 23:25 UTC (permalink / raw)
  To: John Fastabend, Jakub Kicinski, Sabrina Dubroca
  Cc: Eric Dumazet, Simon Horman, Paolo Abeni, netdev,
	kernel-tls-handshake, Chuck Lever, Sagi Grimberg

From: Chuck Lever <chuck.lever@oracle.com>

When read_actor() accepts only part of a record but desc->count
is still non-zero, the receive loop currently falls through to
the next iteration without freeing or requeuing the partially
consumed skb. The next iteration overwrites skb, leaking the
remainder of the current record and silently dropping stream
data.

__tcp_read_sock() handles the same case by leaving the unread
bytes available for the next iteration to re-present, though
its mechanism (sequence-number re-lookup) differs from the TLS
path's explicit queue management. Adopt the same loop-level
behavior here: update rxm->offset and rxm->full_len, requeue
the skb to the head of rx_list, and continue. The next
iteration pops the same skb and re-presents the unread bytes
to read_actor().

Fixes: 662fbcec32f4 ("net/tls: implement ->read_sock()")
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/tls/tls_sw.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 559bef05fee4..40cb0a92d88a 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -2411,13 +2411,13 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
 		if (used < rxm->full_len) {
 			rxm->offset += used;
 			rxm->full_len -= used;
-			if (!desc->count)
-				goto read_sock_requeue;
-		} else {
-			consume_skb(skb);
-			if (!desc->count)
-				break;
+			__skb_queue_head(&ctx->rx_list, skb);
+			skb = NULL;
+			continue;
 		}
+		consume_skb(skb);
+		if (!desc->count)
+			break;
 	}
 
 read_sock_end:

-- 
2.54.0


* [PATCH net-next v10 4/7] tls: Factor tls_strp_msg_consume() from tls_strp_msg_done()
  2026-05-11 23:25 [PATCH net-next v10 0/7] tls: receive-path fixes and clean-ups Chuck Lever
                   ` (2 preceding siblings ...)
  2026-05-11 23:25 ` [PATCH net-next v10 3/7] tls: Re-present partially-consumed records in tls_sw_read_sock() Chuck Lever
@ 2026-05-11 23:25 ` Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 5/7] tls: Suppress spurious saved_data_ready on all receive paths Chuck Lever
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-05-11 23:25 UTC (permalink / raw)
  To: John Fastabend, Jakub Kicinski, Sabrina Dubroca
  Cc: Eric Dumazet, Simon Horman, Paolo Abeni, netdev,
	kernel-tls-handshake, Chuck Lever, Hannes Reinecke,
	Alistair Francis

From: Chuck Lever <chuck.lever@oracle.com>

tls_strp_msg_done() conflates releasing the current record with
checking for the next one via tls_strp_check_rcv(). A subsequent
patch needs to release a record without immediately triggering
that check, so the release step is separated into
tls_strp_msg_consume(). tls_strp_msg_done() is preserved as a
wrapper for existing callers.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/tls/tls.h      |  1 +
 net/tls/tls_strp.c | 11 ++++++++++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls.h b/net/tls/tls.h
index 12f44cb649c9..cb0091e03f41 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -194,6 +194,7 @@ int tls_strp_init(struct tls_strparser *strp, struct sock *sk);
 void tls_strp_data_ready(struct tls_strparser *strp);
 
 void tls_strp_check_rcv(struct tls_strparser *strp);
+void tls_strp_msg_consume(struct tls_strparser *strp);
 void tls_strp_msg_done(struct tls_strparser *strp);
 
 int tls_rx_msg_size(struct tls_strparser *strp, struct sk_buff *skb);
diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
index c72e88317627..e7aaee6efe6e 100644
--- a/net/tls/tls_strp.c
+++ b/net/tls/tls_strp.c
@@ -581,7 +581,12 @@ static void tls_strp_work(struct work_struct *w)
 	release_sock(strp->sk);
 }
 
-void tls_strp_msg_done(struct tls_strparser *strp)
+/* Release the current record without triggering a check for the
+ * next record. Callers must invoke tls_strp_check_rcv() before
+ * releasing the socket lock, or queued data will stall until the
+ * next tls_strp_data_ready() event.
+ */
+void tls_strp_msg_consume(struct tls_strparser *strp)
 {
 	WARN_ON(!strp->stm.full_len);
 
@@ -592,7 +597,11 @@ void tls_strp_msg_done(struct tls_strparser *strp)
 
 	WRITE_ONCE(strp->msg_ready, 0);
 	memset(&strp->stm, 0, sizeof(strp->stm));
+}
 
+void tls_strp_msg_done(struct tls_strparser *strp)
+{
+	tls_strp_msg_consume(strp);
 	tls_strp_check_rcv(strp);
 }
 

-- 
2.54.0


* [PATCH net-next v10 5/7] tls: Suppress spurious saved_data_ready on all receive paths
  2026-05-11 23:25 [PATCH net-next v10 0/7] tls: receive-path fixes and clean-ups Chuck Lever
                   ` (3 preceding siblings ...)
  2026-05-11 23:25 ` [PATCH net-next v10 4/7] tls: Factor tls_strp_msg_consume() from tls_strp_msg_done() Chuck Lever
@ 2026-05-11 23:25 ` Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 6/7] tls: Flush backlog before waiting for a new record Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 7/7] tls: Preserve sk_err across recvmsg() when data has been copied Chuck Lever
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-05-11 23:25 UTC (permalink / raw)
  To: John Fastabend, Jakub Kicinski, Sabrina Dubroca
  Cc: Eric Dumazet, Simon Horman, Paolo Abeni, netdev,
	kernel-tls-handshake, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Each record release via tls_strp_msg_done() triggers
tls_strp_check_rcv(), which calls tls_rx_msg_ready() and
fires saved_data_ready(). During a multi-record receive, the
first N-1 wakeups are pure overhead: the caller is already
running and will pick up subsequent records on the next loop
iteration. The recvmsg and splice_read paths share this waste.

Suppress per-record notifications and emit a single one on
reader exit. tls_rx_rec_done() releases the current record
and parses the next without announcing; tls_strp_check_rcv()
gains a bool announce parameter so callers can request the
quiet form. tls_rx_reader_release() fires the deferred
announce on exit through tls_rx_msg_maybe_announce(), an
idempotent helper that calls saved_data_ready() only when a
record is parsed and has not yet been announced.

To keep the final notification idempotent against records that
the BH or the worker has already announced, tls_strparser gains
a msg_announced bit. tls_rx_msg_maybe_announce() sets the bit
when firing saved_data_ready(); the bit is cleared whenever
the parsed record is wiped, by tls_strp_msg_consume() on
consumption or by tls_strp_msg_load() when the lower socket
loses bytes from under the parse. A second call for the same
parsed record -- as when recvmsg() satisfies the request from
ctx->rx_list without touching the strparser -- becomes a
no-op.

With no remaining callers, tls_strp_msg_done() is removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/net/tls.h  |  5 +++++
 net/tls/tls.h      |  5 ++---
 net/tls/tls_main.c |  2 +-
 net/tls/tls_strp.c | 23 ++++++++++++-----------
 net/tls/tls_sw.c   | 27 ++++++++++++++++++++++++---
 5 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index ebd2550280ae..3811943288b3 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -111,11 +111,16 @@ struct tls_sw_context_tx {
 struct tls_strparser {
 	struct sock *sk;
 
+	/* Bitfield word and msg_ready are serialized by the lower
+	 * socket lock; BH and worker contexts both acquire it.
+	 */
 	u32 mark : 8;
 	u32 stopped : 1;
 	u32 copy_mode : 1;
 	u32 mixed_decrypted : 1;
 
+	u32 msg_announced : 1;
+
 	bool msg_ready;
 
 	struct strp_msg stm;
diff --git a/net/tls/tls.h b/net/tls/tls.h
index cb0091e03f41..60a37bdaaa25 100644
--- a/net/tls/tls.h
+++ b/net/tls/tls.h
@@ -193,12 +193,11 @@ void tls_strp_stop(struct tls_strparser *strp);
 int tls_strp_init(struct tls_strparser *strp, struct sock *sk);
 void tls_strp_data_ready(struct tls_strparser *strp);
 
-void tls_strp_check_rcv(struct tls_strparser *strp);
+void tls_strp_check_rcv(struct tls_strparser *strp, bool announce);
 void tls_strp_msg_consume(struct tls_strparser *strp);
-void tls_strp_msg_done(struct tls_strparser *strp);
 
 int tls_rx_msg_size(struct tls_strparser *strp, struct sk_buff *skb);
-void tls_rx_msg_ready(struct tls_strparser *strp);
+void tls_rx_msg_maybe_announce(struct tls_strparser *strp);
 
 bool tls_strp_msg_load(struct tls_strparser *strp, bool force_refresh);
 int tls_strp_msg_cow(struct tls_sw_context_rx *ctx);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index fd39acf41a61..c10a3fd7fc17 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -769,7 +769,7 @@ static int do_tls_setsockopt_conf(struct sock *sk, sockptr_t optval,
 	} else {
 		struct tls_sw_context_rx *rx_ctx = tls_sw_ctx_rx(ctx);
 
-		tls_strp_check_rcv(&rx_ctx->strp);
+		tls_strp_check_rcv(&rx_ctx->strp, true);
 	}
 	return 0;
 
diff --git a/net/tls/tls_strp.c b/net/tls/tls_strp.c
index e7aaee6efe6e..61b10c697ecc 100644
--- a/net/tls/tls_strp.c
+++ b/net/tls/tls_strp.c
@@ -368,7 +368,6 @@ static int tls_strp_copyin(read_descriptor_t *desc, struct sk_buff *in_skb,
 		desc->count = 0;
 
 		WRITE_ONCE(strp->msg_ready, 1);
-		tls_rx_msg_ready(strp);
 	}
 
 	return ret;
@@ -492,6 +491,7 @@ bool tls_strp_msg_load(struct tls_strparser *strp, bool force_refresh)
 	if (!strp->copy_mode && force_refresh) {
 		if (unlikely(tcp_inq(strp->sk) < strp->stm.full_len)) {
 			WRITE_ONCE(strp->msg_ready, 0);
+			strp->msg_announced = 0;
 			memset(&strp->stm, 0, sizeof(strp->stm));
 			return false;
 		}
@@ -539,18 +539,24 @@ static int tls_strp_read_sock(struct tls_strparser *strp)
 		return tls_strp_read_copy(strp, false);
 
 	WRITE_ONCE(strp->msg_ready, 1);
-	tls_rx_msg_ready(strp);
 
 	return 0;
 }
 
-void tls_strp_check_rcv(struct tls_strparser *strp)
+/* Parse queued data. When @announce is true and parsing produces a
+ * newly-ready record, fire the consumer notification. Callers that
+ * need to notify a waiter about a record parsed by another path
+ * should invoke tls_rx_msg_maybe_announce() directly.
+ */
+void tls_strp_check_rcv(struct tls_strparser *strp, bool announce)
 {
 	if (unlikely(strp->stopped) || strp->msg_ready)
 		return;
 
 	if (tls_strp_read_sock(strp) == -ENOMEM)
 		queue_work(tls_strp_wq, &strp->work);
+	else if (announce && strp->msg_ready)
+		tls_rx_msg_maybe_announce(strp);
 }
 
 /* Lower sock lock held */
@@ -568,7 +574,7 @@ void tls_strp_data_ready(struct tls_strparser *strp)
 		return;
 	}
 
-	tls_strp_check_rcv(strp);
+	tls_strp_check_rcv(strp, true);
 }
 
 static void tls_strp_work(struct work_struct *w)
@@ -577,7 +583,7 @@ static void tls_strp_work(struct work_struct *w)
 		container_of(w, struct tls_strparser, work);
 
 	lock_sock(strp->sk);
-	tls_strp_check_rcv(strp);
+	tls_strp_check_rcv(strp, true);
 	release_sock(strp->sk);
 }
 
@@ -596,15 +602,10 @@ void tls_strp_msg_consume(struct tls_strparser *strp)
 		tls_strp_flush_anchor_copy(strp);
 
 	WRITE_ONCE(strp->msg_ready, 0);
+	strp->msg_announced = 0;
 	memset(&strp->stm, 0, sizeof(strp->stm));
 }
 
-void tls_strp_msg_done(struct tls_strparser *strp)
-{
-	tls_strp_msg_consume(strp);
-	tls_strp_check_rcv(strp);
-}
-
 void tls_strp_stop(struct tls_strparser *strp)
 {
 	strp->stopped = 1;
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 40cb0a92d88a..88d07199d5aa 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1383,7 +1383,10 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
 			return ret;
 
 		if (!skb_queue_empty(&sk->sk_receive_queue)) {
-			tls_strp_check_rcv(&ctx->strp);
+			/* Defer notification to the exit point; this thread
+			 * will consume the record directly.
+			 */
+			tls_strp_check_rcv(&ctx->strp, false);
 			if (tls_strp_msg_ready(ctx))
 				break;
 		}
@@ -1869,9 +1872,13 @@ static int tls_record_content_type(struct msghdr *msg, struct tls_msg *tlm,
 	return 1;
 }
 
+/* The deferred announce is fired once on reader exit by
+ * tls_rx_reader_release().
+ */
 static void tls_rx_rec_done(struct tls_sw_context_rx *ctx)
 {
-	tls_strp_msg_done(&ctx->strp);
+	tls_strp_msg_consume(&ctx->strp);
+	tls_strp_check_rcv(&ctx->strp, false);
 }
 
 /* This function traverses the rx_list in tls receive context to copies the
@@ -2026,6 +2033,12 @@ static int tls_rx_reader_lock(struct sock *sk, struct tls_sw_context_rx *ctx,
 
 static void tls_rx_reader_release(struct sock *sk, struct tls_sw_context_rx *ctx)
 {
+	/* Fire any deferred announce once per reader so that a record
+	 * parsed but not yet announced becomes visible to the next
+	 * reader. The call is idempotent through msg_announced.
+	 */
+	tls_rx_msg_maybe_announce(&ctx->strp);
+
 	if (unlikely(ctx->reader_contended)) {
 		if (wq_has_sleeper(&ctx->wq))
 			wake_up(&ctx->wq);
@@ -2505,10 +2518,18 @@ int tls_rx_msg_size(struct tls_strparser *strp, struct sk_buff *skb)
 	return ret;
 }
 
-void tls_rx_msg_ready(struct tls_strparser *strp)
+/* Fire saved_data_ready() at most once per parsed record. The
+ * msg_announced bit is cleared by tls_strp_msg_consume() when the
+ * record is consumed, arming the next announcement.
+ */
+void tls_rx_msg_maybe_announce(struct tls_strparser *strp)
 {
 	struct tls_sw_context_rx *ctx;
 
+	if (!READ_ONCE(strp->msg_ready) || strp->msg_announced)
+		return;
+	strp->msg_announced = 1;
+
 	ctx = container_of(strp, struct tls_sw_context_rx, strp);
 	ctx->saved_data_ready(strp->sk);
 }

-- 
2.54.0


* [PATCH net-next v10 6/7] tls: Flush backlog before waiting for a new record
  2026-05-11 23:25 [PATCH net-next v10 0/7] tls: receive-path fixes and clean-ups Chuck Lever
                   ` (4 preceding siblings ...)
  2026-05-11 23:25 ` [PATCH net-next v10 5/7] tls: Suppress spurious saved_data_ready on all receive paths Chuck Lever
@ 2026-05-11 23:25 ` Chuck Lever
  2026-05-11 23:25 ` [PATCH net-next v10 7/7] tls: Preserve sk_err across recvmsg() when data has been copied Chuck Lever
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-05-11 23:25 UTC (permalink / raw)
  To: John Fastabend, Jakub Kicinski, Sabrina Dubroca
  Cc: Eric Dumazet, Simon Horman, Paolo Abeni, netdev,
	kernel-tls-handshake, Chuck Lever, Hannes Reinecke

From: Chuck Lever <chuck.lever@oracle.com>

While lock_sock is held, incoming TCP segments land on
sk->sk_backlog rather than sk->sk_receive_queue.
tls_rx_rec_wait() inspects only sk_receive_queue, so backlog
data remains invisible. For non-blocking callers (read_sock,
and recvmsg or splice_read with MSG_DONTWAIT) this causes a
spurious -EAGAIN. For blocking callers it forces an
unnecessary sleep/wakeup cycle.

Flush the backlog inside tls_rx_rec_wait() before checking
sk_receive_queue so the strparser can parse newly-arrived
segments immediately. On the next loop iteration
tls_read_flush_backlog() may redundantly flush, but this
path is cold and the cost is negligible.

Backlog processing can run tcp_reset(), which calls
tcp_done_with_error() to set sk->sk_err = ECONNRESET and then
tcp_done() to set sk->sk_shutdown = SHUTDOWN_MASK. The pre-existing
top-of-loop sk_err check has already run before the flush, so the
freshly-set error would be masked: the sk_shutdown test that follows
would return 0 (EOF). Re-check sk_err immediately before the
sk_shutdown test so a connection abort surfaces as -ECONNRESET
rather than a clean EOF.

Suggested-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/tls/tls_sw.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 88d07199d5aa..2b7093d27eb6 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1382,6 +1382,8 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
 		if (ret < 0)
 			return ret;
 
+		if (sk_flush_backlog(sk))
+			released = true;
 		if (!skb_queue_empty(&sk->sk_receive_queue)) {
 			/* Defer notification to the exit point; this thread
 			 * will consume the record directly.
@@ -1391,6 +1393,13 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
 				break;
 		}
 
+		/* sk_flush_backlog() can run tcp_reset(), which sets
+		 * sk_err and then sk_shutdown via tcp_done(). Recheck
+		 * sk_err here so a connection abort surfaces as the
+		 * actual error rather than a clean EOF.
+		 */
+		if (sk->sk_err)
+			return sock_error(sk);
 		if (sk->sk_shutdown & RCV_SHUTDOWN)
 			return 0;
 

-- 
2.54.0


* [PATCH net-next v10 7/7] tls: Preserve sk_err across recvmsg() when data has been copied
  2026-05-11 23:25 [PATCH net-next v10 0/7] tls: receive-path fixes and clean-ups Chuck Lever
                   ` (5 preceding siblings ...)
  2026-05-11 23:25 ` [PATCH net-next v10 6/7] tls: Flush backlog before waiting for a new record Chuck Lever
@ 2026-05-11 23:25 ` Chuck Lever
  6 siblings, 0 replies; 9+ messages in thread
From: Chuck Lever @ 2026-05-11 23:25 UTC (permalink / raw)
  To: John Fastabend, Jakub Kicinski, Sabrina Dubroca
  Cc: Eric Dumazet, Simon Horman, Paolo Abeni, netdev,
	kernel-tls-handshake, Chuck Lever

From: Chuck Lever <chuck.lever@oracle.com>

Both sk_err checks in tls_rx_rec_wait() consume the error via
sock_error(), which clears sk_err atomically. When the caller
(tls_sw_recvmsg, tls_sw_splice_read, or tls_sw_read_sock) already
has bytes copied out for its consumer, it returns those bytes and
discards the error from this call. sk_err is now zero on the
socket, so the next read syscall observes only RCV_SHUTDOWN and
reports a clean EOF instead of the actual error (typically
-ECONNRESET).

The race was reachable before this series via tls_read_flush_backlog()
when its periodic sk_flush_backlog() triggered tcp_reset() in the
middle of a multi-record read. The earlier patch in this series that
flushes the backlog inside tls_rx_rec_wait() widens the window: the
flush now runs on every iteration of every wait, not only when the
periodic threshold fires.

Have tls_rx_rec_wait() report sk_err without clearing it, using
READ_ONCE() to keep the read explicit. Each caller's return path
consumes sk_err only when no data is being returned and the error
about to surface matches the pending sk_err. This mirrors the
tcp_recvmsg() preserve-and-surface pattern, and also handles
tls_rx_one_record()'s decrypt-abort path: it raises sk_err to
EBADMSG via tls_err_abort() before returning a different errno
(-EFAULT from tls_setup_from_iter() on zero-copy receive, -ENOMEM
from decrypt setup). The gate keeps the actual error on this read
and lets the EBADMSG surface on the next read, matching pre-series
behavior.

Fixes: c46b01839f7a ("tls: rx: periodically flush socket backlog")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/tls/tls_sw.c | 34 +++++++++++++++++++++++++++++-----
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 2b7093d27eb6..6a0ac2ccde56 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1376,8 +1376,14 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
 		if (!sk_psock_queue_empty(psock))
 			return 0;
 
+		/* Report sk_err without clearing it. The caller may
+		 * discard the error return from this function in favor
+		 * of bytes already copied; leaving sk_err set ensures
+		 * the next read syscall surfaces the error instead of
+		 * a spurious EOF.
+		 */
 		if (sk->sk_err)
-			return sock_error(sk);
+			return -READ_ONCE(sk->sk_err);
 
 		if (ret < 0)
 			return ret;
@@ -1399,7 +1405,7 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
 		 * actual error rather than a clean EOF.
 		 */
 		if (sk->sk_err)
-			return sock_error(sk);
+			return -READ_ONCE(sk->sk_err);
 		if (sk->sk_shutdown & RCV_SHUTDOWN)
 			return 0;
 
@@ -1430,6 +1436,18 @@ tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock, bool nonblock,
 	return 1;
 }
 
+/* Clear sk_err only when it matches the err about to be returned.
+ * tls_rx_one_record() can raise sk_err to EBADMSG via tls_err_abort()
+ * while returning a different errno; preserving sk_err in that case
+ * lets the EBADMSG surface on the next read.
+ */
+static int tls_sw_consume_matching_sk_err(struct sock *sk, int err)
+{
+	if (err < 0 && -err == READ_ONCE(sk->sk_err))
+		return sock_error(sk);
+	return err;
+}
+
 static int tls_setup_from_iter(struct iov_iter *from,
 			       int length, int *pages_used,
 			       struct scatterlist *to,
@@ -2285,7 +2303,9 @@ int tls_sw_recvmsg(struct sock *sk,
 	tls_rx_reader_unlock(sk, ctx);
 	if (psock)
 		sk_psock_put(sk, psock);
-	return copied ? : err;
+	if (copied)
+		return copied;
+	return tls_sw_consume_matching_sk_err(sk, err);
 }
 
 ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
@@ -2350,7 +2370,9 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t *ppos,
 
 splice_read_end:
 	tls_rx_reader_unlock(sk, ctx);
-	return copied ? : err;
+	if (copied)
+		return copied;
+	return tls_sw_consume_matching_sk_err(sk, err);
 
 splice_requeue:
 	__skb_queue_head(&ctx->rx_list, skb);
@@ -2444,7 +2466,9 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
 
 read_sock_end:
 	tls_rx_reader_release(sk, ctx);
-	return copied ? : err;
+	if (copied)
+		return copied;
+	return tls_sw_consume_matching_sk_err(sk, err);
 
 read_sock_requeue:
 	__skb_queue_head(&ctx->rx_list, skb);

-- 
2.54.0


* Re: [PATCH net-next v10 3/7] tls: Re-present partially-consumed records in tls_sw_read_sock()
  2026-05-11 23:25 ` [PATCH net-next v10 3/7] tls: Re-present partially-consumed records in tls_sw_read_sock() Chuck Lever
@ 2026-05-12 12:52   ` Sabrina Dubroca
  0 siblings, 0 replies; 9+ messages in thread
From: Sabrina Dubroca @ 2026-05-12 12:52 UTC (permalink / raw)
  To: Chuck Lever
  Cc: John Fastabend, Jakub Kicinski, Eric Dumazet, Simon Horman,
	Paolo Abeni, netdev, kernel-tls-handshake, Chuck Lever,
	Sagi Grimberg

2026-05-11, 19:25:54 -0400, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> When read_actor() accepts only part of a record but desc->count
> is still non-zero, the receive loop currently falls through to
> the next iteration without freeing or requeuing the partially
> consumed skb. The next iteration overwrites skb, leaking the
> remainder of the current record and silently dropping stream
> data.
> 
> __tcp_read_sock() handles the same case by leaving the unread
> bytes available for the next iteration to re-present, though
> its mechanism (sequence-number re-lookup) differs from the TLS
> path's explicit queue management. Adopt the same loop-level
> behavior here: update rxm->offset and rxm->full_len, requeue
> the skb to the head of rx_list, and continue. The next
> iteration pops the same skb and re-presents the unread bytes
> to read_actor().
> 
> Fixes: 662fbcec32f4 ("net/tls: implement ->read_sock()")

Fixes typically go through "net", not "net-next".

> Cc: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/tls/tls_sw.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
> index 559bef05fee4..40cb0a92d88a 100644
> --- a/net/tls/tls_sw.c
> +++ b/net/tls/tls_sw.c
> @@ -2411,13 +2411,13 @@ int tls_sw_read_sock(struct sock *sk, read_descriptor_t *desc,
>  		if (used < rxm->full_len) {
>  			rxm->offset += used;
>  			rxm->full_len -= used;
> -			if (!desc->count)
> -				goto read_sock_requeue;
> -		} else {
> -			consume_skb(skb);
> -			if (!desc->count)
> -				break;
> +			__skb_queue_head(&ctx->rx_list, skb);
> +			skb = NULL;

Here you're NULLing a "stolen" skb...

> +			continue;

It doesn't make much sense to me to loop again if !desc->count; we'll
just dequeue the same skb again and go back to read_actor(). Do we
expect read_actor to do more work if we call it again once desc->count
has reached 0?

>  		}
> +		consume_skb(skb);

...but not here.
(which I don't find particularly useful in either case)

> +		if (!desc->count)
> +			break;

Maybe then change the loop to

    while (desc->count)


We'd end up with

	while (desc->count) {
		// ...
		if (used < rxm->full_len) {
			rxm->offset += used;
			rxm->full_len -= used;
			__skb_queue_head(&ctx->rx_list, skb);
		} else {
			consume_skb(skb);
		}
	}

-- 
Sabrina

Thread overview: 9+ messages
2026-05-11 23:25 [PATCH net-next v10 0/7] tls: receive-path fixes and clean-ups Chuck Lever
2026-05-11 23:25 ` [PATCH net-next v10 1/7] tls: Move decrypt-failure abort into tls_rx_one_record() Chuck Lever
2026-05-11 23:25 ` [PATCH net-next v10 2/7] tls: Avoid evaluating freed skb in tls_sw_read_sock() loop Chuck Lever
2026-05-11 23:25 ` [PATCH net-next v10 3/7] tls: Re-present partially-consumed records in tls_sw_read_sock() Chuck Lever
2026-05-12 12:52   ` Sabrina Dubroca
2026-05-11 23:25 ` [PATCH net-next v10 4/7] tls: Factor tls_strp_msg_consume() from tls_strp_msg_done() Chuck Lever
2026-05-11 23:25 ` [PATCH net-next v10 5/7] tls: Suppress spurious saved_data_ready on all receive paths Chuck Lever
2026-05-11 23:25 ` [PATCH net-next v10 6/7] tls: Flush backlog before waiting for a new record Chuck Lever
2026-05-11 23:25 ` [PATCH net-next v10 7/7] tls: Preserve sk_err across recvmsg() when data has been copied Chuck Lever
