* [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference
@ 2026-05-25 16:51 Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock Chuck Lever
` (8 more replies)
0 siblings, 9 replies; 13+ messages in thread
From: Chuck Lever @ 2026-05-25 16:51 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever, Hannes Reinecke, Alistair Francis
handshake_nl_accept_doit() has accumulated four follow-on fixes
since 3b3009ea8abb ("net/handshake: Create a NETLINK service for
handling handshake requests"): 7ea9c1ec66bc, 7798b59409c3,
fe67b063f687, and dabac51b8102. Each was a local refcount or
NULL-check correction; none moved where the file reference is
owned, and the same code keeps producing the same class of bug.
Reworking the ownership is what breaks the pattern.
For the duration of a request, sock->file has no single owner.
Submit publishes the request without taking a file reference;
accept_doit acquires one inside the handler, after the request
has already left the pending list. The consumer can drop its
own reference at any time, including the moment between
handshake_req_next() popping the request and accept_doit
reaching get_file(). The submit-side sock_hold() pins only
struct sock; struct socket and sock->file remain under the
consumer's control via the file descriptor.
This series places the file reference under unambiguous
ownership. handshake_req_submit() pins it on the request and
completion or cancel drops it (patches 4-5); the submit-side
sock_hold() then becomes redundant, and dropping it also closes
a publish-before-pin race the late sock_hold itself opened
(patch 6). The handshake_complete() API and its consumers move
to a uniform negative-errno sign convention (patch 3), with the
matching sign correction in nvme-tcp (patch 2). Patch 1
hardens hn_lock for BH context, the netns-exit drain fix
builds on the new file-pin infrastructure (patch 8), and new
KUnit file-count assertions verify the refcount contract
(patch 7).
Three things in this restructuring want a careful look. In
handshake_complete(), the fput() of the request's file
reference has to come after hp_done() -- fput() can transitively
run handshake_sk_destruct() and free the request, so the patch
stashes hr_file in a local first. handshake_sk_destruct()
itself is kept on purpose: it owns rhashtable removal and
kfree, and remains the backstop if a consumer path bypasses
handshake_complete() entirely. Third, handshake_req_next() now
returns its request with an extra get_file() held under
hn_lock; accept_doit must consume that reference (FD_PREPARE on
success, explicit fput on the fdf.err path), and any future
caller has to honor the same contract.
---
Changes in v3:
- Gate submit error path fput with test_and_set_bit to prevent double fput from concurrent cancel
- Convert hn->hn_lock to spin_lock_bh (cancel reachable from softirq via nvmet-tcp)
- Pin file references during netns drain to prevent use-after-free on local list
- Series re-ordered to improve bisectability
- Link to v2: https://patch.msgid.link/20260521-handshake-file-pin-v2-0-b9dadc472840@oracle.com
Changes in v2:
- Patch 3: move HANDSHAKE_A_DONE_STATUS upper bound into netlink policy (Jakub)
- Link to v1: https://patch.msgid.link/20260518-handshake-file-pin-v1-0-4bbcb7e62fda@oracle.com
---
Chuck Lever (8):
net/handshake: Use spin_lock_bh for hn_lock
nvme-tcp: store negative errno in queue->tls_err
net/handshake: Pass negative errno through handshake_complete()
net/handshake: Take a long-lived file reference at submit
net/handshake: hand off the pinned file reference to accept_doit
net/handshake: Close the submit-side sock_hold race
net/handshake: Verify file-reference balance in submit paths
net/handshake: Drain pending requests at net namespace exit
Documentation/netlink/specs/handshake.yaml | 8 +++
drivers/nvme/host/tcp.c | 2 +-
net/handshake/genl.c | 3 +-
net/handshake/genl.h | 1 +
net/handshake/handshake-test.c | 38 +++++++++++++-
net/handshake/handshake.h | 6 ++-
net/handshake/netlink.c | 29 +++++------
net/handshake/request.c | 81 ++++++++++++++++++++++--------
net/handshake/tlshd.c | 6 ++-
9 files changed, 129 insertions(+), 45 deletions(-)
---
base-commit: 87a1e0fe7776da7ab411be332b4be58ac8840d10
change-id: 20260516-handshake-file-pin-21390f5f3513
Best regards,
--
Chuck Lever <chuck.lever@oracle.com>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
@ 2026-05-25 16:51 ` Chuck Lever
2026-05-27 8:59 ` Hannes Reinecke
2026-05-25 16:51 ` [PATCH net v3 2/8] nvme-tcp: store negative errno in queue->tls_err Chuck Lever
` (7 subsequent siblings)
8 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2026-05-25 16:51 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
nvmet_tcp_state_change(), a socket callback that runs in BH context,
can reach handshake_req_cancel() via nvmet_tcp_schedule_release_queue()
and tls_handshake_cancel(). handshake_req_cancel() acquires
hn->hn_lock with plain spin_lock(). If a process-context thread on
the same CPU holds hn->hn_lock when a softirq invokes the cancel path,
the lock attempt deadlocks. This is the only caller that invokes
tls_handshake_cancel() from BH context; every other consumer calls it
from process context.
Deferring the cancel to process context in the NVMe target is not
straightforward: nvmet_tcp_schedule_release_queue() must call
tls_handshake_cancel() atomically with its state transition to
DISCONNECTING. If the cancel were deferred, the handshake completion
callback could fire in the window before the cancel runs, observe the
unexpected state, and return without dropping its kref on the queue.
Reworking that interlock is considerably more invasive than hardening
the handshake lock. Convert all hn->hn_lock acquisitions from
spin_lock/spin_unlock to spin_lock_bh/spin_unlock_bh so the lock is
never taken with softirqs enabled.
Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/handshake/netlink.c | 4 ++--
net/handshake/request.c | 14 +++++++-------
net/handshake/tlshd.c | 2 ++
3 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
index b989456fc4c5..97114ec8027a 100644
--- a/net/handshake/netlink.c
+++ b/net/handshake/netlink.c
@@ -202,10 +202,10 @@ static void __net_exit handshake_net_exit(struct net *net)
* accepted and are in progress will be destroyed when
* the socket is closed.
*/
- spin_lock(&hn->hn_lock);
+ spin_lock_bh(&hn->hn_lock);
set_bit(HANDSHAKE_F_NET_DRAINING, &hn->hn_flags);
list_splice_init(&requests, &hn->hn_requests);
- spin_unlock(&hn->hn_lock);
+ spin_unlock_bh(&hn->hn_lock);
while (!list_empty(&requests)) {
req = list_first_entry(&requests, struct handshake_req, hr_list);
diff --git a/net/handshake/request.c b/net/handshake/request.c
index 2829adbeb149..5d4a17f902d2 100644
--- a/net/handshake/request.c
+++ b/net/handshake/request.c
@@ -167,12 +167,12 @@ static bool remove_pending(struct handshake_net *hn, struct handshake_req *req)
{
bool ret = false;
- spin_lock(&hn->hn_lock);
+ spin_lock_bh(&hn->hn_lock);
if (!list_empty(&req->hr_list)) {
__remove_pending_locked(hn, req);
ret = true;
}
- spin_unlock(&hn->hn_lock);
+ spin_unlock_bh(&hn->hn_lock);
return ret;
}
@@ -182,7 +182,7 @@ struct handshake_req *handshake_req_next(struct handshake_net *hn, int class)
struct handshake_req *req, *pos;
req = NULL;
- spin_lock(&hn->hn_lock);
+ spin_lock_bh(&hn->hn_lock);
list_for_each_entry(pos, &hn->hn_requests, hr_list) {
if (pos->hr_proto->hp_handler_class != class)
continue;
@@ -190,7 +190,7 @@ struct handshake_req *handshake_req_next(struct handshake_net *hn, int class)
req = pos;
break;
}
- spin_unlock(&hn->hn_lock);
+ spin_unlock_bh(&hn->hn_lock);
return req;
}
@@ -249,7 +249,7 @@ int handshake_req_submit(struct socket *sock, struct handshake_req *req,
if (READ_ONCE(hn->hn_pending) >= hn->hn_pending_max)
goto out_err;
- spin_lock(&hn->hn_lock);
+ spin_lock_bh(&hn->hn_lock);
ret = -EOPNOTSUPP;
if (test_bit(HANDSHAKE_F_NET_DRAINING, &hn->hn_flags))
goto out_unlock;
@@ -258,7 +258,7 @@ int handshake_req_submit(struct socket *sock, struct handshake_req *req,
goto out_unlock;
if (!__add_pending_locked(hn, req))
goto out_unlock;
- spin_unlock(&hn->hn_lock);
+ spin_unlock_bh(&hn->hn_lock);
ret = handshake_genl_notify(net, req->hr_proto, flags);
if (ret) {
@@ -274,7 +274,7 @@ int handshake_req_submit(struct socket *sock, struct handshake_req *req,
return 0;
out_unlock:
- spin_unlock(&hn->hn_lock);
+ spin_unlock_bh(&hn->hn_lock);
out_err:
/* Restore original destructor so socket teardown still runs on failure */
req->hr_sk->sk_destruct = req->hr_odestruct;
diff --git a/net/handshake/tlshd.c b/net/handshake/tlshd.c
index 8f9532a15f43..af294c6cc717 100644
--- a/net/handshake/tlshd.c
+++ b/net/handshake/tlshd.c
@@ -425,6 +425,8 @@ EXPORT_SYMBOL(tls_server_hello_psk);
* Request cancellation races with request completion. To determine
* who won, callers examine the return value from this function.
*
+ * Context: May be called from process or softirq context.
+ *
* Return values:
* %true - Uncompleted handshake request was canceled
* %false - Handshake request already completed or not found
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH net v3 2/8] nvme-tcp: store negative errno in queue->tls_err
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock Chuck Lever
@ 2026-05-25 16:51 ` Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 3/8] net/handshake: Pass negative errno through handshake_complete() Chuck Lever
` (6 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2026-05-25 16:51 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever, Hannes Reinecke, Alistair Francis
From: Chuck Lever <chuck.lever@oracle.com>
nvme_tcp_tls_done() assigns queue->tls_err in three branches. The
ENOKEY lookup failure and the EOPNOTSUPP initializer both store
negative errnos. The third branch, reached when the handshake
layer reports a non-zero status, stores -status.
The handshake layer delivers status to the consumer callback as a
negative errno; the other in-tree consumers --
xs_tls_handshake_done() and the nvmet target callback -- treat
their status argument that way. The extra negation in
nvme_tcp_tls_done() flips the sign, leaving tls_err as a positive
value (for instance, +EIO), which nvme_tcp_start_tls() then
returns to its caller.
Drop the extra negation so queue->tls_err uniformly carries a
negative errno on failure.
Fixes: be8e82caa685 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
---
drivers/nvme/host/tcp.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 15d36d6a728e..68a1d7640494 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1702,7 +1702,7 @@ static void nvme_tcp_tls_done(void *data, int status, key_serial_t pskid)
qid, pskid, status);
if (status) {
- queue->tls_err = -status;
+ queue->tls_err = status;
goto out_complete;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH net v3 3/8] net/handshake: Pass negative errno through handshake_complete()
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 2/8] nvme-tcp: store negative errno in queue->tls_err Chuck Lever
@ 2026-05-25 16:51 ` Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 4/8] net/handshake: Take a long-lived file reference at submit Chuck Lever
` (5 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2026-05-25 16:51 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
handshake_complete() declares status as unsigned int and
tls_handshake_done() negates that value (-status) before handing
it to the TLS consumer. Consumers match on negative errno
constants -- xs_tls_handshake_done() has
switch (status) {
case 0:
case -EACCES:
case -ETIMEDOUT:
lower_transport->xprt_err = status;
break;
default:
lower_transport->xprt_err = -EACCES;
}
so the API as designed expects callers to pass positive errno
values that the tlshd shim then negates.
Three internal callers in handshake_nl_accept_doit(), the
net-exit drain, and a kunit test follow kernel convention and
pass negative errnos -- -EIO, -ETIMEDOUT, -ETIMEDOUT. The
implicit conversion to unsigned int turns -ETIMEDOUT into
0xFFFFFF92; the subsequent -status in tls_handshake_done()
wraps back to 110, the consumer's switch falls through, and
the xprt reports -EACCES on what should be -ETIMEDOUT or -EIO.
Fix the API rather than the call sites. The natural kernel
convention is negative errno in, negative errno out. Change
handshake_complete() and hp_done to take int status, drop the
negation in tls_handshake_done(), and negate once in
handshake_nl_done_doit() where status arrives from the wire
as an unsigned netlink attribute. The three internal callers
were already correct under that convention and need no change.
At the same wire boundary, declare MAX_ERRNO as the netlink
policy upper bound for HANDSHAKE_A_DONE_STATUS. Attribute
validation rejects out-of-range values before
handshake_nl_done_doit() runs, and negating a bounded u32 there
stays within int range -- closing the UBSAN-visible signed-
integer overflow that an unconstrained u32 would invoke.
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
Documentation/netlink/specs/handshake.yaml | 8 ++++++++
net/handshake/genl.c | 3 ++-
net/handshake/genl.h | 1 +
net/handshake/handshake-test.c | 2 +-
net/handshake/handshake.h | 4 ++--
net/handshake/netlink.c | 2 +-
net/handshake/request.c | 2 +-
net/handshake/tlshd.c | 4 ++--
8 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/Documentation/netlink/specs/handshake.yaml b/Documentation/netlink/specs/handshake.yaml
index 95c3fade7a8d..1024297b3851 100644
--- a/Documentation/netlink/specs/handshake.yaml
+++ b/Documentation/netlink/specs/handshake.yaml
@@ -12,6 +12,12 @@ protocol: genetlink
doc: Netlink protocol to request a transport layer security handshake.
definitions:
+ -
+ type: const
+ name: max-errno
+ value: 4095
+ header: linux/err.h
+ scope: kernel
-
type: enum
name: handler-class
@@ -80,6 +86,8 @@ attribute-sets:
-
name: status
type: u32
+ checks:
+ max: max-errno
-
name: sockfd
type: s32
diff --git a/net/handshake/genl.c b/net/handshake/genl.c
index 870612609491..4b20cd9cdd0e 100644
--- a/net/handshake/genl.c
+++ b/net/handshake/genl.c
@@ -10,6 +10,7 @@
#include "genl.h"
#include <uapi/linux/handshake.h>
+#include <linux/err.h>
/* HANDSHAKE_CMD_ACCEPT - do */
static const struct nla_policy handshake_accept_nl_policy[HANDSHAKE_A_ACCEPT_HANDLER_CLASS + 1] = {
@@ -18,7 +19,7 @@ static const struct nla_policy handshake_accept_nl_policy[HANDSHAKE_A_ACCEPT_HAN
/* HANDSHAKE_CMD_DONE - do */
static const struct nla_policy handshake_done_nl_policy[HANDSHAKE_A_DONE_REMOTE_AUTH + 1] = {
- [HANDSHAKE_A_DONE_STATUS] = { .type = NLA_U32, },
+ [HANDSHAKE_A_DONE_STATUS] = NLA_POLICY_MAX(NLA_U32, MAX_ERRNO),
[HANDSHAKE_A_DONE_SOCKFD] = { .type = NLA_S32, },
[HANDSHAKE_A_DONE_REMOTE_AUTH] = { .type = NLA_U32, },
};
diff --git a/net/handshake/genl.h b/net/handshake/genl.h
index 8d3e18672daf..46b65f131669 100644
--- a/net/handshake/genl.h
+++ b/net/handshake/genl.h
@@ -11,6 +11,7 @@
#include <net/genetlink.h>
#include <uapi/linux/handshake.h>
+#include <linux/err.h>
int handshake_nl_accept_doit(struct sk_buff *skb, struct genl_info *info);
int handshake_nl_done_doit(struct sk_buff *skb, struct genl_info *info);
diff --git a/net/handshake/handshake-test.c b/net/handshake/handshake-test.c
index 55442b2f518a..df3948e807a0 100644
--- a/net/handshake/handshake-test.c
+++ b/net/handshake/handshake-test.c
@@ -25,7 +25,7 @@ static int test_accept_func(struct handshake_req *req, struct genl_info *info,
return 0;
}
-static void test_done_func(struct handshake_req *req, unsigned int status,
+static void test_done_func(struct handshake_req *req, int status,
struct genl_info *info)
{
}
diff --git a/net/handshake/handshake.h b/net/handshake/handshake.h
index a48163765a7a..2289b0e274f4 100644
--- a/net/handshake/handshake.h
+++ b/net/handshake/handshake.h
@@ -57,7 +57,7 @@ struct handshake_proto {
int (*hp_accept)(struct handshake_req *req,
struct genl_info *info, int fd);
void (*hp_done)(struct handshake_req *req,
- unsigned int status,
+ int status,
struct genl_info *info);
void (*hp_destroy)(struct handshake_req *req);
};
@@ -86,7 +86,7 @@ struct handshake_req *handshake_req_hash_lookup(struct sock *sk);
struct handshake_req *handshake_req_next(struct handshake_net *hn, int class);
int handshake_req_submit(struct socket *sock, struct handshake_req *req,
gfp_t flags);
-void handshake_complete(struct handshake_req *req, unsigned int status,
+void handshake_complete(struct handshake_req *req, int status,
struct genl_info *info);
bool handshake_req_cancel(struct sock *sk);
diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
index 97114ec8027a..039344979de9 100644
--- a/net/handshake/netlink.c
+++ b/net/handshake/netlink.c
@@ -160,7 +160,7 @@ int handshake_nl_done_doit(struct sk_buff *skb, struct genl_info *info)
status = -EIO;
if (info->attrs[HANDSHAKE_A_DONE_STATUS])
- status = nla_get_u32(info->attrs[HANDSHAKE_A_DONE_STATUS]);
+ status = -(int)nla_get_u32(info->attrs[HANDSHAKE_A_DONE_STATUS]);
handshake_complete(req, status, info);
sockfd_put(sock);
diff --git a/net/handshake/request.c b/net/handshake/request.c
index 5d4a17f902d2..97f9f8239949 100644
--- a/net/handshake/request.c
+++ b/net/handshake/request.c
@@ -284,7 +284,7 @@ int handshake_req_submit(struct socket *sock, struct handshake_req *req,
}
EXPORT_SYMBOL(handshake_req_submit);
-void handshake_complete(struct handshake_req *req, unsigned int status,
+void handshake_complete(struct handshake_req *req, int status,
struct genl_info *info)
{
struct sock *sk = req->hr_sk;
diff --git a/net/handshake/tlshd.c b/net/handshake/tlshd.c
index af294c6cc717..7567150c2a4f 100644
--- a/net/handshake/tlshd.c
+++ b/net/handshake/tlshd.c
@@ -93,7 +93,7 @@ static void tls_handshake_remote_peerids(struct tls_handshake_req *treq,
*
*/
static void tls_handshake_done(struct handshake_req *req,
- unsigned int status, struct genl_info *info)
+ int status, struct genl_info *info)
{
struct tls_handshake_req *treq = handshake_req_private(req);
@@ -104,7 +104,7 @@ static void tls_handshake_done(struct handshake_req *req,
if (!status)
set_bit(HANDSHAKE_F_REQ_SESSION, &req->hr_flags);
- treq->th_consumer_done(treq->th_consumer_data, -status,
+ treq->th_consumer_done(treq->th_consumer_data, status,
treq->th_peerid[0]);
}
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH net v3 4/8] net/handshake: Take a long-lived file reference at submit
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
` (2 preceding siblings ...)
2026-05-25 16:51 ` [PATCH net v3 3/8] net/handshake: Pass negative errno through handshake_complete() Chuck Lever
@ 2026-05-25 16:51 ` Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 5/8] net/handshake: hand off the pinned file reference to accept_doit Chuck Lever
` (4 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2026-05-25 16:51 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever
From: Chuck Lever <chuck.lever@oracle.com>
handshake_nl_accept_doit() needs the file pointer backing
req->hr_sk->sk_socket to survive the window between
handshake_req_next() and the subsequent FD_PREPARE() and get_file().
The submit-side sock_hold() does not provide that. sk_refcnt keeps
struct sock alive, but struct socket is owned by sock->file: when
the consumer fputs the last file reference, sock_release() tears
the socket down regardless of any sock_hold.
Add an hr_file pointer to struct handshake_req and acquire an
explicit reference on sock->file during handshake_req_submit().
handshake_complete() and handshake_req_cancel() release the
reference on the completion-bit-winning path.
The submit error path must also release the file reference, but
after rhashtable insertion a concurrent handshake_req_cancel() can
discover the request and race the error path. Gate the error-path
cleanup -- sk_destruct restoration, fput, and request destruction
-- with test_and_set_bit(HANDSHAKE_F_REQ_COMPLETED), the same
serialization handshake_complete() and handshake_req_cancel()
already use. When cancel has already claimed ownership, the submit
error path returns without touching the request; socket teardown
handles final destruction.
The accept-side dereferences are not yet retargeted; that change
comes in the next patch.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/handshake/handshake.h | 2 ++
net/handshake/netlink.c | 6 ------
net/handshake/request.c | 42 +++++++++++++++++++++++++++++++++++-------
3 files changed, 37 insertions(+), 13 deletions(-)
diff --git a/net/handshake/handshake.h b/net/handshake/handshake.h
index 2289b0e274f4..da61cadd1ad3 100644
--- a/net/handshake/handshake.h
+++ b/net/handshake/handshake.h
@@ -24,6 +24,7 @@ enum hn_flags_bits {
HANDSHAKE_F_NET_DRAINING,
};
+struct file;
struct handshake_proto;
/* One handshake request */
@@ -32,6 +33,7 @@ struct handshake_req {
struct rhash_head hr_rhash;
unsigned long hr_flags;
const struct handshake_proto *hr_proto;
+ struct file *hr_file;
struct sock *hr_sk;
void (*hr_odestruct)(struct sock *sk);
diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
index 039344979de9..1a5821eb7184 100644
--- a/net/handshake/netlink.c
+++ b/net/handshake/netlink.c
@@ -210,12 +210,6 @@ static void __net_exit handshake_net_exit(struct net *net)
while (!list_empty(&requests)) {
req = list_first_entry(&requests, struct handshake_req, hr_list);
list_del(&req->hr_list);
-
- /*
- * Requests on this list have not yet been
- * accepted, so they do not have an fd to put.
- */
-
handshake_complete(req, -ETIMEDOUT, NULL);
}
}
diff --git a/net/handshake/request.c b/net/handshake/request.c
index 97f9f8239949..da064511ab86 100644
--- a/net/handshake/request.c
+++ b/net/handshake/request.c
@@ -13,6 +13,7 @@
#include <linux/module.h>
#include <linux/skbuff.h>
#include <linux/inet.h>
+#include <linux/file.h>
#include <linux/rhashtable.h>
#include <net/sock.h>
@@ -215,9 +216,16 @@ EXPORT_SYMBOL_IF_KUNIT(handshake_req_next);
* A zero return value from handshake_req_submit() means that
* exactly one subsequent completion callback is guaranteed.
*
- * A negative return value from handshake_req_submit() means that
- * no completion callback will be done and that @req has been
- * destroyed.
+ * A negative return value from handshake_req_submit() guarantees that
+ * no completion callback will occur and that @req is no longer owned by
+ * the caller. If cancellation wins the completion race after the request
+ * has been published, final destruction is deferred until socket teardown.
+ *
+ * The caller must hold a reference on @sock->file for the duration
+ * of this call. Once the request is published to the accept side, a
+ * concurrent completion or cancellation may release the request's pin on
+ * @sock->file; the caller's reference is what keeps @sock->sk valid until
+ * handshake_req_submit() returns.
*/
int handshake_req_submit(struct socket *sock, struct handshake_req *req,
gfp_t flags)
@@ -236,6 +244,14 @@ int handshake_req_submit(struct socket *sock, struct handshake_req *req,
kfree(req);
return -EINVAL;
}
+
+ /*
+ * Pin sock->file for the lifetime of the request so the
+ * accept side does not race a consumer that releases the
+ * socket while a handshake is pending.
+ */
+ req->hr_file = get_file(sock->file);
+
req->hr_odestruct = req->hr_sk->sk_destruct;
req->hr_sk->sk_destruct = handshake_sk_destruct;
@@ -267,7 +283,11 @@ int handshake_req_submit(struct socket *sock, struct handshake_req *req,
goto out_err;
}
- /* Prevent socket release while a handshake request is pending */
+ /*
+ * Pin struct sock so sk_destruct does not run until the
+ * handshake completion path releases it; struct socket is
+ * held separately via hr_file above.
+ */
sock_hold(req->hr_sk);
trace_handshake_submit(net, req, req->hr_sk);
@@ -276,10 +296,13 @@ int handshake_req_submit(struct socket *sock, struct handshake_req *req,
out_unlock:
spin_unlock_bh(&hn->hn_lock);
out_err:
- /* Restore original destructor so socket teardown still runs on failure */
- req->hr_sk->sk_destruct = req->hr_odestruct;
trace_handshake_submit_err(net, req, req->hr_sk, ret);
- handshake_req_destroy(req);
+ if (!test_and_set_bit(HANDSHAKE_F_REQ_COMPLETED, &req->hr_flags)) {
+ /* Restore original destructor so socket teardown still runs. */
+ req->hr_sk->sk_destruct = req->hr_odestruct;
+ fput(req->hr_file);
+ handshake_req_destroy(req);
+ }
return ret;
}
EXPORT_SYMBOL(handshake_req_submit);
@@ -291,11 +314,15 @@ void handshake_complete(struct handshake_req *req, int status,
struct net *net = sock_net(sk);
if (!test_and_set_bit(HANDSHAKE_F_REQ_COMPLETED, &req->hr_flags)) {
+ struct file *file = req->hr_file;
+
trace_handshake_complete(net, req, sk, status);
req->hr_proto->hp_done(req, status, info);
/* Handshake request is no longer pending */
sock_put(sk);
+
+ fput(file);
}
}
EXPORT_SYMBOL_IF_KUNIT(handshake_complete);
@@ -344,6 +371,7 @@ bool handshake_req_cancel(struct sock *sk)
/* Handshake request is no longer pending */
sock_put(sk);
+ fput(req->hr_file);
return true;
}
EXPORT_SYMBOL(handshake_req_cancel);
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH net v3 5/8] net/handshake: hand off the pinned file reference to accept_doit
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
` (3 preceding siblings ...)
2026-05-25 16:51 ` [PATCH net v3 4/8] net/handshake: Take a long-lived file reference at submit Chuck Lever
@ 2026-05-25 16:51 ` Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 6/8] net/handshake: Close the submit-side sock_hold race Chuck Lever
` (3 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2026-05-25 16:51 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
handshake_req_next() removes the request from the per-net
pending list and drops hn_lock before handshake_nl_accept_doit()
reads req->hr_sk->sk_socket and dereferences sock->file (once in
FD_PREPARE() and again in get_file()). In that window a
consumer running tls_handshake_cancel() followed by sockfd_put()
(svc_sock_free) or __fput_sync() (xs_reset_transport) releases
sock->file. sock_release() then runs sock_orphan(), zeroing
sk_socket, and frees the struct socket. The accept-side code
either reads NULL through sk_socket or chases freed memory.
The submit-side sock_hold() does not prevent this. sk_refcnt
protects struct sock, but struct socket and sock->file are
independently refcounted via the file descriptor the consumer
owns. Pinning sk leaves sock and sock->file unprotected.
Retarget the accept-side dereferences at req->hr_file, which was
pinned at submit time, instead of req->hr_sk->sk_socket->file.
Pinning on its own is not sufficient: a consumer that cancels
between handshake_req_next() returning and accept_doit reaching
FD_PREPARE() takes the !remove_pending() branch in
handshake_req_cancel() and drops hr_file before the accept side
takes its own reference. Hand off an additional file reference
inside handshake_req_next(), under hn_lock, so the accept side
operates on a reference that no concurrent handshake_req_cancel()
can revoke. FD_PREPARE() consumes that handed-off reference,
either by transferring it to the new fd in fd_publish() or by
dropping it in the cleanup destructor on error; the explicit
get_file() that previously balanced FD_PREPARE() is therefore
redundant and goes away.
Update handshake_req_cancel_test2 and _test3 to simulate the
FD_PREPARE() consumption with an fput() so the kunit file-count
assertions stay balanced.
Reported-by: Chris Mason <clm@meta.com>
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
net/handshake/handshake-test.c | 8 ++++++++
net/handshake/netlink.c | 7 ++-----
net/handshake/request.c | 18 ++++++++++++++++++
3 files changed, 28 insertions(+), 5 deletions(-)
diff --git a/net/handshake/handshake-test.c b/net/handshake/handshake-test.c
index df3948e807a0..9cc7a95f4120 100644
--- a/net/handshake/handshake-test.c
+++ b/net/handshake/handshake-test.c
@@ -375,6 +375,10 @@ static void handshake_req_cancel_test2(struct kunit *test)
/* Pretend to accept this request */
next = handshake_req_next(hn, HANDSHAKE_HANDLER_CLASS_TLSHD);
KUNIT_ASSERT_PTR_EQ(test, req, next);
+ /* Simulate FD_PREPARE() consuming the file reference handed
+ * off by handshake_req_next(); see handshake_nl_accept_doit().
+ */
+ fput(filp);
/* Act */
result = handshake_req_cancel(sock->sk);
@@ -417,6 +421,10 @@ static void handshake_req_cancel_test3(struct kunit *test)
/* Pretend to accept this request */
next = handshake_req_next(hn, HANDSHAKE_HANDLER_CLASS_TLSHD);
KUNIT_ASSERT_PTR_EQ(test, req, next);
+ /* Simulate FD_PREPARE() consuming the file reference handed
+ * off by handshake_req_next(); see handshake_nl_accept_doit().
+ */
+ fput(filp);
/* Pretend to complete this request */
handshake_complete(next, -ETIMEDOUT, NULL);
diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
index 1a5821eb7184..21d6cbd52fcd 100644
--- a/net/handshake/netlink.c
+++ b/net/handshake/netlink.c
@@ -92,7 +92,6 @@ int handshake_nl_accept_doit(struct sk_buff *skb, struct genl_info *info)
struct net *net = sock_net(skb->sk);
struct handshake_net *hn = handshake_pernet(net);
struct handshake_req *req = NULL;
- struct socket *sock;
int class, err;
err = -EOPNOTSUPP;
@@ -107,15 +106,13 @@ int handshake_nl_accept_doit(struct sk_buff *skb, struct genl_info *info)
err = -EAGAIN;
req = handshake_req_next(hn, class);
if (req) {
- sock = req->hr_sk->sk_socket;
-
- FD_PREPARE(fdf, O_CLOEXEC, sock->file);
+ FD_PREPARE(fdf, O_CLOEXEC, req->hr_file);
if (fdf.err) {
+ fput(req->hr_file); /* drop ref from handshake_req_next() */
err = fdf.err;
goto out_complete;
}
- get_file(sock->file); /* FD_PREPARE() consumes a reference. */
err = req->hr_proto->hp_accept(req, info, fd_prepare_fd(fdf));
if (err)
goto out_complete; /* Automatic cleanup handles fput */
diff --git a/net/handshake/request.c b/net/handshake/request.c
index da064511ab86..e2d7ee7ce6e0 100644
--- a/net/handshake/request.c
+++ b/net/handshake/request.c
@@ -178,6 +178,17 @@ static bool remove_pending(struct handshake_net *hn, struct handshake_req *req)
return ret;
}
+/**
+ * handshake_req_next - Return the next queued handshake request
+ * @hn: per-net handshake state
+ * @class: handler class to match
+ *
+ * On a non-NULL return, the caller owns an extra reference
+ * on @req->hr_file. FD_PREPARE() consumes it on success; on
+ * the FD_PREPARE() failure path the caller must fput() it.
+ *
+ * Return: pointer to a removed handshake_req, or NULL.
+ */
struct handshake_req *handshake_req_next(struct handshake_net *hn, int class)
{
struct handshake_req *req, *pos;
@@ -188,6 +199,13 @@ struct handshake_req *handshake_req_next(struct handshake_net *hn, int class)
if (pos->hr_proto->hp_handler_class != class)
continue;
__remove_pending_locked(hn, pos);
+ /* Hand off a file reference to the accept side under
+ * hn_lock. A concurrent handshake_req_cancel() can drop
+ * hr_file before accept reaches FD_PREPARE(); this extra
+ * reference keeps the file alive until FD_PREPARE() takes
+ * ownership.
+ */
+ get_file(pos->hr_file);
req = pos;
break;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH net v3 6/8] net/handshake: Close the submit-side sock_hold race
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
` (4 preceding siblings ...)
2026-05-25 16:51 ` [PATCH net v3 5/8] net/handshake: hand off the pinned file reference to accept_doit Chuck Lever
@ 2026-05-25 16:51 ` Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 7/8] net/handshake: Verify file-reference balance in submit paths Chuck Lever
` (2 subsequent siblings)
8 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2026-05-25 16:51 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
handshake_req_submit() publishes the request via
handshake_req_hash_add() and __add_pending_locked(), drops
hn_lock, and calls handshake_genl_notify() (which can sleep)
before taking sock_hold() on req->hr_sk. A fast tlshd ACCEPT
followed by DONE can drive handshake_complete()'s sock_put()
into the window between the spin_unlock and the late
sock_hold(); on a system where the consumer's fd held the
only sk reference, the late sock_hold() then operates on an
sk whose refcount has reached zero.
The preceding two patches install an explicit file reference
on struct handshake_req. That file pins sock->file, which
pins the embedded struct socket, which defers inet_release()'s
sock_put(). As long as hr_file is held, sk cannot reach refcount
zero from the consumer side, and the submit-side sock_hold()
with its matching sock_put() calls in handshake_complete() and
handshake_req_cancel() is now redundant.
Drop all three. The file reference already keeps each request's
socket alive, and the lifetime story is contained in a single
get_file()/fput() pair.
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
net/handshake/request.c | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/net/handshake/request.c b/net/handshake/request.c
index e2d7ee7ce6e0..bd3d9467ab91 100644
--- a/net/handshake/request.c
+++ b/net/handshake/request.c
@@ -301,13 +301,6 @@ int handshake_req_submit(struct socket *sock, struct handshake_req *req,
goto out_err;
}
- /*
- * Pin struct sock so sk_destruct does not run until the
- * handshake completion path releases it; struct socket is
- * held separately via hr_file above.
- */
- sock_hold(req->hr_sk);
-
trace_handshake_submit(net, req, req->hr_sk);
return 0;
@@ -337,9 +330,6 @@ void handshake_complete(struct handshake_req *req, int status,
trace_handshake_complete(net, req, sk, status);
req->hr_proto->hp_done(req, status, info);
- /* Handshake request is no longer pending */
- sock_put(sk);
-
fput(file);
}
}
@@ -387,8 +377,6 @@ bool handshake_req_cancel(struct sock *sk)
out_true:
trace_handshake_cancel(net, req, sk);
- /* Handshake request is no longer pending */
- sock_put(sk);
fput(req->hr_file);
return true;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH net v3 7/8] net/handshake: Verify file-reference balance in submit paths
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
` (5 preceding siblings ...)
2026-05-25 16:51 ` [PATCH net v3 6/8] net/handshake: Close the submit-side sock_hold race Chuck Lever
@ 2026-05-25 16:51 ` Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 8/8] net/handshake: Drain pending requests at net namespace exit Chuck Lever
2026-05-28 11:50 ` [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference patchwork-bot+netdevbpf
8 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2026-05-25 16:51 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
The new file-reference contract on struct handshake_req is silently
breakable: a missing get_file() at submit or a missing fput() on an
error path leaves the file leaked but does not crash the test, so
the existing absence-of-crash checks pass either way.
Snapshot file_count(filp) before each handshake_req_submit() in
the submit-success, EAGAIN, EBUSY, and cancel tests, and assert
the expected balance after submit and again after cancel. The
already-completed cancel test also asserts the post-complete
balance, which pins down that handshake_complete() drops the
reference and that the subsequent cancel does not double-fput.
The destroy test gets the same treatment before __fput_sync(),
which double-checks that cancel's fput() ran and the only
remaining reference is the one sock_alloc_file() established.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
net/handshake/handshake-test.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/net/handshake/handshake-test.c b/net/handshake/handshake-test.c
index 9cc7a95f4120..3dd507470d5f 100644
--- a/net/handshake/handshake-test.c
+++ b/net/handshake/handshake-test.c
@@ -208,6 +208,7 @@ static void handshake_req_submit_test3(struct kunit *test)
static void handshake_req_submit_test4(struct kunit *test)
{
struct handshake_req *req, *result;
+ unsigned long fcount_before;
struct socket *sock;
struct file *filp;
int err;
@@ -224,8 +225,10 @@ static void handshake_req_submit_test4(struct kunit *test)
KUNIT_ASSERT_NOT_NULL(test, sock->sk);
sock->file = filp;
+ fcount_before = file_count(filp);
err = handshake_req_submit(sock, req, GFP_KERNEL);
KUNIT_ASSERT_EQ(test, err, 0);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before + 1);
/* Act */
result = handshake_req_hash_lookup(sock->sk);
@@ -235,11 +238,13 @@ static void handshake_req_submit_test4(struct kunit *test)
KUNIT_EXPECT_PTR_EQ(test, req, result);
handshake_req_cancel(sock->sk);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before);
fput(filp);
}
static void handshake_req_submit_test5(struct kunit *test)
{
+ unsigned long fcount_before;
struct handshake_req *req;
struct handshake_net *hn;
struct socket *sock;
@@ -265,12 +270,14 @@ static void handshake_req_submit_test5(struct kunit *test)
saved = hn->hn_pending;
hn->hn_pending = hn->hn_pending_max + 1;
+ fcount_before = file_count(filp);
/* Act */
err = handshake_req_submit(sock, req, GFP_KERNEL);
/* Assert */
KUNIT_EXPECT_EQ(test, err, -EAGAIN);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before);
fput(filp);
hn->hn_pending = saved;
@@ -279,6 +286,7 @@ static void handshake_req_submit_test5(struct kunit *test)
static void handshake_req_submit_test6(struct kunit *test)
{
struct handshake_req *req1, *req2;
+ unsigned long fcount_before;
struct socket *sock;
struct file *filp;
int err;
@@ -296,21 +304,26 @@ static void handshake_req_submit_test6(struct kunit *test)
KUNIT_ASSERT_NOT_ERR_OR_NULL(test, filp);
KUNIT_ASSERT_NOT_NULL(test, sock->sk);
sock->file = filp;
+ fcount_before = file_count(filp);
/* Act */
err = handshake_req_submit(sock, req1, GFP_KERNEL);
KUNIT_ASSERT_EQ(test, err, 0);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before + 1);
err = handshake_req_submit(sock, req2, GFP_KERNEL);
/* Assert */
KUNIT_EXPECT_EQ(test, err, -EBUSY);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before + 1);
handshake_req_cancel(sock->sk);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before);
fput(filp);
}
static void handshake_req_cancel_test1(struct kunit *test)
{
+ unsigned long fcount_before;
struct handshake_req *req;
struct socket *sock;
struct file *filp;
@@ -329,8 +342,10 @@ static void handshake_req_cancel_test1(struct kunit *test)
KUNIT_ASSERT_NOT_ERR_OR_NULL(test, filp);
sock->file = filp;
+ fcount_before = file_count(filp);
err = handshake_req_submit(sock, req, GFP_KERNEL);
KUNIT_ASSERT_EQ(test, err, 0);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before + 1);
/* NB: handshake_req hasn't been accepted */
@@ -339,12 +354,14 @@ static void handshake_req_cancel_test1(struct kunit *test)
/* Assert */
KUNIT_EXPECT_TRUE(test, result);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before);
fput(filp);
}
static void handshake_req_cancel_test2(struct kunit *test)
{
+ unsigned long fcount_before;
struct handshake_req *req, *next;
struct handshake_net *hn;
struct socket *sock;
@@ -365,8 +382,10 @@ static void handshake_req_cancel_test2(struct kunit *test)
KUNIT_ASSERT_NOT_ERR_OR_NULL(test, filp);
sock->file = filp;
+ fcount_before = file_count(filp);
err = handshake_req_submit(sock, req, GFP_KERNEL);
KUNIT_ASSERT_EQ(test, err, 0);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before + 1);
net = sock_net(sock->sk);
hn = handshake_pernet(net);
@@ -385,12 +404,14 @@ static void handshake_req_cancel_test2(struct kunit *test)
/* Assert */
KUNIT_EXPECT_TRUE(test, result);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before);
fput(filp);
}
static void handshake_req_cancel_test3(struct kunit *test)
{
+ unsigned long fcount_before;
struct handshake_req *req, *next;
struct handshake_net *hn;
struct socket *sock;
@@ -411,8 +432,10 @@ static void handshake_req_cancel_test3(struct kunit *test)
KUNIT_ASSERT_NOT_ERR_OR_NULL(test, filp);
sock->file = filp;
+ fcount_before = file_count(filp);
err = handshake_req_submit(sock, req, GFP_KERNEL);
KUNIT_ASSERT_EQ(test, err, 0);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before + 1);
net = sock_net(sock->sk);
hn = handshake_pernet(net);
@@ -428,12 +451,14 @@ static void handshake_req_cancel_test3(struct kunit *test)
/* Pretend to complete this request */
handshake_complete(next, -ETIMEDOUT, NULL);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before);
/* Act */
result = handshake_req_cancel(sock->sk);
/* Assert */
KUNIT_EXPECT_FALSE(test, result);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before);
fput(filp);
}
@@ -454,6 +479,7 @@ static struct handshake_proto handshake_req_alloc_proto_destroy = {
static void handshake_req_destroy_test1(struct kunit *test)
{
+ unsigned long fcount_before;
struct handshake_req *req;
struct socket *sock;
struct file *filp;
@@ -473,10 +499,12 @@ static void handshake_req_destroy_test1(struct kunit *test)
KUNIT_ASSERT_NOT_ERR_OR_NULL(test, filp);
sock->file = filp;
+ fcount_before = file_count(filp);
err = handshake_req_submit(sock, req, GFP_KERNEL);
KUNIT_ASSERT_EQ(test, err, 0);
handshake_req_cancel(sock->sk);
+ KUNIT_EXPECT_EQ(test, file_count(filp), fcount_before);
/* Act */
/* Ensure the close/release/put process has run to
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH net v3 8/8] net/handshake: Drain pending requests at net namespace exit
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
` (6 preceding siblings ...)
2026-05-25 16:51 ` [PATCH net v3 7/8] net/handshake: Verify file-reference balance in submit paths Chuck Lever
@ 2026-05-25 16:51 ` Chuck Lever
2026-05-28 11:50 ` [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference patchwork-bot+netdevbpf
8 siblings, 0 replies; 13+ messages in thread
From: Chuck Lever @ 2026-05-25 16:51 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever, Hannes Reinecke
From: Chuck Lever <chuck.lever@oracle.com>
The arguments to list_splice_init() in handshake_net_exit() are
reversed. The call moves the local empty "requests" list onto
hn->hn_requests, leaving the local list empty, so the subsequent
drain loop runs zero iterations. Pending handshake requests that
had not yet been accepted are not torn down when the net namespace
is destroyed; each one keeps a reference on a socket file and on
the handshake_req allocation.
Pass the source and destination in the documented order
(list_splice_init(list, head) moves list onto head) so the pending
list is transferred to the local scratch list and drained through
handshake_complete().
Fixing the splice direction exposes a list-corruption race. After
the splice each req->hr_list still has non-empty link pointers,
threading the stack-local scratch list rather than hn_requests.
A concurrent handshake_req_cancel() -- for example, from sunrpc's
TLS timeout on a kernel socket whose netns reference was not
taken -- finds the request through the rhashtable, calls
remove_pending(), and sees !list_empty(&req->hr_list).
__remove_pending_locked() then list_del_init()s an entry off the
scratch list while the drain iterates, corrupting it. The same
call arriving after the drain loop has run list_del() on an
entry hits LIST_POISON instead.
Have remove_pending() check HANDSHAKE_F_NET_DRAINING under
hn_lock and report not-found when drain is in progress. The
drain has already taken ownership; handshake_complete()'s existing
test_and_set on HANDSHAKE_F_REQ_COMPLETED still arbitrates
between drain and cancel for who calls the consumer's hp_done. Use
list_del_init() rather than list_del() in the drain so req->hr_list
does not carry LIST_POISON after drain releases the entry.
The DRAINING guard in remove_pending() makes cancel return false,
but cancel still falls through to test_and_set_bit on
HANDSHAKE_F_REQ_COMPLETED and drops the request's hr_file reference.
Without another pin, if that is the last reference, sk_destruct frees
the request while it is still linked on the drain loop's local list.
Pin each request's hr_file under hn_lock before releasing the list,
and drop that drain pin after the loop finishes with the request.
Fixes: 3b3009ea8abb ("net/handshake: Create a NETLINK service for handling handshake requests")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Hannes Reinecke <hare@kernel.org>
---
net/handshake/netlink.c | 10 ++++++++--
net/handshake/request.c | 5 ++++-
2 files changed, 12 insertions(+), 3 deletions(-)
diff --git a/net/handshake/netlink.c b/net/handshake/netlink.c
index 21d6cbd52fcd..3fd4fef9bab1 100644
--- a/net/handshake/netlink.c
+++ b/net/handshake/netlink.c
@@ -201,13 +201,19 @@ static void __net_exit handshake_net_exit(struct net *net)
*/
spin_lock_bh(&hn->hn_lock);
set_bit(HANDSHAKE_F_NET_DRAINING, &hn->hn_flags);
- list_splice_init(&requests, &hn->hn_requests);
+ list_splice_init(&hn->hn_requests, &requests);
+ list_for_each_entry(req, &requests, hr_list)
+ get_file(req->hr_file);
spin_unlock_bh(&hn->hn_lock);
while (!list_empty(&requests)) {
+ struct file *file;
+
req = list_first_entry(&requests, struct handshake_req, hr_list);
- list_del(&req->hr_list);
+ file = req->hr_file;
+ list_del_init(&req->hr_list);
handshake_complete(req, -ETIMEDOUT, NULL);
+ fput(file);
}
}
diff --git a/net/handshake/request.c b/net/handshake/request.c
index bd3d9467ab91..cd30d54d0501 100644
--- a/net/handshake/request.c
+++ b/net/handshake/request.c
@@ -163,13 +163,16 @@ static void __remove_pending_locked(struct handshake_net *hn,
* otherwise %false.
*
* If @req was on a pending list, it has not yet been accepted.
+ * Returns %false when the net namespace is draining; the drain
+ * loop has taken ownership of the pending list.
*/
static bool remove_pending(struct handshake_net *hn, struct handshake_req *req)
{
bool ret = false;
spin_lock_bh(&hn->hn_lock);
- if (!list_empty(&req->hr_list)) {
+ if (!test_bit(HANDSHAKE_F_NET_DRAINING, &hn->hn_flags) &&
+ !list_empty(&req->hr_list)) {
__remove_pending_locked(hn, req);
ret = true;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock
2026-05-25 16:51 ` [PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock Chuck Lever
@ 2026-05-27 8:59 ` Hannes Reinecke
2026-05-27 14:24 ` Chuck Lever
0 siblings, 1 reply; 13+ messages in thread
From: Hannes Reinecke @ 2026-05-27 8:59 UTC (permalink / raw)
To: Chuck Lever, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever
On 5/25/26 18:51, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> nvmet_tcp_state_change(), a socket callback that runs in BH context,
> can reach handshake_req_cancel() via nvmet_tcp_schedule_release_queue()
> and tls_handshake_cancel(). handshake_req_cancel() acquires
> hn->hn_lock with plain spin_lock(). If a process-context thread on
> the same CPU holds hn->hn_lock when a softirq invokes the cancel path,
> the lock attempt deadlocks. This is the only caller that invokes
> tls_handshake_cancel() from BH context; every other consumer calls it
> from process context.
>
> Deferring the cancel to process context in the NVMe target is not
> straightforward: nvmet_tcp_schedule_release_queue() must call
> tls_handshake_cancel() atomically with its state transition to
> DISCONNECTING. If the cancel were deferred, the handshake completion
> callback could fire in the window before the cancel runs, observe the
> unexpected state, and return without dropping its kref on the queue.
> Reworking that interlock is considerably more invasive than hardening
> the handshake lock. Convert all hn->hn_lock acquisitions from
> spin_lock/spin_unlock to spin_lock_bh/spin_unlock_bh so the lock is
> never taken with softirqs enabled.
>
> Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
> net/handshake/netlink.c | 4 ++--
> net/handshake/request.c | 14 +++++++-------
> net/handshake/tlshd.c | 2 ++
> 3 files changed, 11 insertions(+), 9 deletions(-)
>
... and there is always the question whather we should be calling
tls_handshake_cancel() in the first place.
We already call 'tls_handshake_cancel()' from the handshake timeout
handler, and this instance of 'tls_handshake_cancel()' is called
from a socket callback (ie when the soecket is closed or something).
I would have expected that the handshake code cleans up outstanding
requests on socket close already; if so, we can delete the cancel
here.
Hmm?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock
2026-05-27 8:59 ` Hannes Reinecke
@ 2026-05-27 14:24 ` Chuck Lever
2026-05-27 17:32 ` Hannes Reinecke
0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2026-05-27 14:24 UTC (permalink / raw)
To: Hannes Reinecke, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever
On Wed, May 27, 2026, at 4:59 AM, Hannes Reinecke wrote:
> On 5/25/26 18:51, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> nvmet_tcp_state_change(), a socket callback that runs in BH context,
>> can reach handshake_req_cancel() via nvmet_tcp_schedule_release_queue()
>> and tls_handshake_cancel(). handshake_req_cancel() acquires
>> hn->hn_lock with plain spin_lock(). If a process-context thread on
>> the same CPU holds hn->hn_lock when a softirq invokes the cancel path,
>> the lock attempt deadlocks. This is the only caller that invokes
>> tls_handshake_cancel() from BH context; every other consumer calls it
>> from process context.
>>
>> Deferring the cancel to process context in the NVMe target is not
>> straightforward: nvmet_tcp_schedule_release_queue() must call
>> tls_handshake_cancel() atomically with its state transition to
>> DISCONNECTING. If the cancel were deferred, the handshake completion
>> callback could fire in the window before the cancel runs, observe the
>> unexpected state, and return without dropping its kref on the queue.
>> Reworking that interlock is considerably more invasive than hardening
>> the handshake lock. Convert all hn->hn_lock acquisitions from
>> spin_lock/spin_unlock to spin_lock_bh/spin_unlock_bh so the lock is
>> never taken with softirqs enabled.
>>
>> Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>> net/handshake/netlink.c | 4 ++--
>> net/handshake/request.c | 14 +++++++-------
>> net/handshake/tlshd.c | 2 ++
>> 3 files changed, 11 insertions(+), 9 deletions(-)
>>
> ... and there is always the question whather we should be calling
> tls_handshake_cancel() in the first place.
> We already call 'tls_handshake_cancel()' from the handshake timeout
> handler, and this instance of 'tls_handshake_cancel()' is called
> from a socket callback (ie when the soecket is closed or something).
> I would have expected that the handshake code cleans up outstanding
> requests on socket close already; if so, we can delete the cancel
> here.
The handshake layer doesn't auto-cancel on socket close.
handshake_sk_destruct() fires when the sock refcount reaches
zero, but it only frees the handshake_req and removes it from
the hash table. It does not set HANDSHAKE_F_REQ_COMPLETED or
invoke the consumer's done callback. It assumes the request
was already completed or canceled before the socket reached
final destruction.
The timeout handler and the state-change path cover different
failure modes. The timeout fires when tlshd is slow or absent.
The state-change callback fires when the remote peer resets or
closes the connection while the handshake is still in
progress -- that can happen well before the timeout expires.
Without the cancel in nvmet_tcp_schedule_release_queue(), the
race plays out as follows: the state-change callback
transitions the queue to DISCONNECTING and drops its kref.
Then tlshd finishes and the handshake layer calls
nvmet_tcp_tls_handshake_done(). That callback sees the queue
is not in TLS_HANDSHAKE state, hits the WARN_ON, and returns
without dropping its own kref -- leaking the queue.
The cancel makes the teardown race-free: either cancel wins
and the completion callback never fires, or completion already
won and cancel returns false. Either way, exactly one side
drops the kref.
The SunRPC consumers (client and server) avoid this problem by
blocking a thread on wait_for_completion() during the
handshake. A socket state change wakes the waiter indirectly
through the completion callback, and the waiter handles
cleanup. nvme-tcp's handshake is fully asynchronous -- there
is no blocked thread to wake, so the state-change callback is
the only context that can cancel the handshake before queue
teardown begins.
--
Chuck Lever
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock
2026-05-27 14:24 ` Chuck Lever
@ 2026-05-27 17:32 ` Hannes Reinecke
0 siblings, 0 replies; 13+ messages in thread
From: Hannes Reinecke @ 2026-05-27 17:32 UTC (permalink / raw)
To: Chuck Lever, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman
Cc: Chris Mason, Christian Brauner, kernel-tls-handshake, linux-nvme,
netdev, Chuck Lever
On 5/27/26 16:24, Chuck Lever wrote:
>
>
> On Wed, May 27, 2026, at 4:59 AM, Hannes Reinecke wrote:
>> On 5/25/26 18:51, Chuck Lever wrote:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> nvmet_tcp_state_change(), a socket callback that runs in BH context,
>>> can reach handshake_req_cancel() via nvmet_tcp_schedule_release_queue()
>>> and tls_handshake_cancel(). handshake_req_cancel() acquires
>>> hn->hn_lock with plain spin_lock(). If a process-context thread on
>>> the same CPU holds hn->hn_lock when a softirq invokes the cancel path,
>>> the lock attempt deadlocks. This is the only caller that invokes
>>> tls_handshake_cancel() from BH context; every other consumer calls it
>>> from process context.
>>>
>>> Deferring the cancel to process context in the NVMe target is not
>>> straightforward: nvmet_tcp_schedule_release_queue() must call
>>> tls_handshake_cancel() atomically with its state transition to
>>> DISCONNECTING. If the cancel were deferred, the handshake completion
>>> callback could fire in the window before the cancel runs, observe the
>>> unexpected state, and return without dropping its kref on the queue.
>>> Reworking that interlock is considerably more invasive than hardening
>>> the handshake lock. Convert all hn->hn_lock acquisitions from
>>> spin_lock/spin_unlock to spin_lock_bh/spin_unlock_bh so the lock is
>>> never taken with softirqs enabled.
>>>
>>> Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> ---
>>> net/handshake/netlink.c | 4 ++--
>>> net/handshake/request.c | 14 +++++++-------
>>> net/handshake/tlshd.c | 2 ++
>>> 3 files changed, 11 insertions(+), 9 deletions(-)
>>>
>> ... and there is always the question whather we should be calling
>> tls_handshake_cancel() in the first place.
>> We already call 'tls_handshake_cancel()' from the handshake timeout
>> handler, and this instance of 'tls_handshake_cancel()' is called
>> from a socket callback (ie when the soecket is closed or something).
>> I would have expected that the handshake code cleans up outstanding
>> requests on socket close already; if so, we can delete the cancel
>> here.
>
> The handshake layer doesn't auto-cancel on socket close.
>
> handshake_sk_destruct() fires when the sock refcount reaches
> zero, but it only frees the handshake_req and removes it from
> the hash table. It does not set HANDSHAKE_F_REQ_COMPLETED or
> invoke the consumer's done callback. It assumes the request
> was already completed or canceled before the socket reached
> final destruction.
>
> The timeout handler and the state-change path cover different
> failure modes. The timeout fires when tlshd is slow or absent.
> The state-change callback fires when the remote peer resets or
> closes the connection while the handshake is still in
> progress -- that can happen well before the timeout expires.
>
> Without the cancel in nvmet_tcp_schedule_release_queue(), the
> race plays out as follows: the state-change callback
> transitions the queue to DISCONNECTING and drops its kref.
> Then tlshd finishes and the handshake layer calls
> nvmet_tcp_tls_handshake_done(). That callback sees the queue
> is not in TLS_HANDSHAKE state, hits the WARN_ON, and returns
> without dropping its own kref -- leaking the queue.
>
> The cancel makes the teardown race-free: either cancel wins
> and the completion callback never fires, or completion already
> won and cancel returns false. Either way, exactly one side
> drops the kref.
>
> The SunRPC consumers (client and server) avoid this problem by
> blocking a thread on wait_for_completion() during the
> handshake. A socket state change wakes the waiter indirectly
> through the completion callback, and the waiter handles
> cleanup. nvme-tcp's handshake is fully asynchronous -- there
> is no blocked thread to wake, so the state-change callback is
> the only context that can cancel the handshake before queue
> teardown begins.
>
Thanks for the clarification. That's what I had (implicitely)
assumed, so the tls_handshake_cancel() needs to stay.
So you can add my:
Reviewed-by: Hannes Reinecke <hare@kernel.org>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
` (7 preceding siblings ...)
2026-05-25 16:51 ` [PATCH net v3 8/8] net/handshake: Drain pending requests at net namespace exit Chuck Lever
@ 2026-05-28 11:50 ` patchwork-bot+netdevbpf
8 siblings, 0 replies; 13+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-05-28 11:50 UTC (permalink / raw)
To: Chuck Lever
Cc: davem, edumazet, kuba, pabeni, horms, clm, brauner,
kernel-tls-handshake, linux-nvme, netdev, chuck.lever, hare,
alistair.francis
Hello:
This series was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Mon, 25 May 2026 12:51:14 -0400 you wrote:
> handshake_nl_accept_doit() has accumulated four follow-on fixes
> since 3b3009ea8abb ("net/handshake: Create a NETLINK service for
> handling handshake requests"): 7ea9c1ec66bc, 7798b59409c3,
> fe67b063f687, and dabac51b8102. Each was a local refcount or
> NULL-check correction; none moved where the file reference is
> owned, and the same code keeps producing the same class of bug.
> Reworking the ownership is what breaks the pattern.
>
> [...]
Here is the summary with links:
- [net,v3,1/8] net/handshake: Use spin_lock_bh for hn_lock
https://git.kernel.org/netdev/net/c/cc993e0927ec
- [net,v3,2/8] nvme-tcp: store negative errno in queue->tls_err
https://git.kernel.org/netdev/net/c/9015985b5eb1
- [net,v3,3/8] net/handshake: Pass negative errno through handshake_complete()
https://git.kernel.org/netdev/net/c/6b22d433aa13
- [net,v3,4/8] net/handshake: Take a long-lived file reference at submit
https://git.kernel.org/netdev/net/c/09dba37eee70
- [net,v3,5/8] net/handshake: hand off the pinned file reference to accept_doit
https://git.kernel.org/netdev/net/c/f4251190e58b
- [net,v3,6/8] net/handshake: Close the submit-side sock_hold race
https://git.kernel.org/netdev/net/c/5da98f55b131
- [net,v3,7/8] net/handshake: Verify file-reference balance in submit paths
https://git.kernel.org/netdev/net/c/204a5efde5ed
- [net,v3,8/8] net/handshake: Drain pending requests at net namespace exit
https://git.kernel.org/netdev/net/c/ea5fe6a73ca5
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2026-05-28 11:50 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-25 16:51 [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock Chuck Lever
2026-05-27 8:59 ` Hannes Reinecke
2026-05-27 14:24 ` Chuck Lever
2026-05-27 17:32 ` Hannes Reinecke
2026-05-25 16:51 ` [PATCH net v3 2/8] nvme-tcp: store negative errno in queue->tls_err Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 3/8] net/handshake: Pass negative errno through handshake_complete() Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 4/8] net/handshake: Take a long-lived file reference at submit Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 5/8] net/handshake: hand off the pinned file reference to accept_doit Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 6/8] net/handshake: Close the submit-side sock_hold race Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 7/8] net/handshake: Verify file-reference balance in submit paths Chuck Lever
2026-05-25 16:51 ` [PATCH net v3 8/8] net/handshake: Drain pending requests at net namespace exit Chuck Lever
2026-05-28 11:50 ` [PATCH net v3 0/8] net/handshake: anchor request lifetime to a pinned file reference patchwork-bot+netdevbpf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox