[PATCH net-next] netlink: clean up failed initial dump-start state

public inbox for netdev@vger.kernel.org

* [PATCH net-next] netlink: clean up failed initial dump-start state
@ 2026-04-20 16:27 Michael Bommarito
  2026-04-20 17:37 ` Jakub Kicinski
  2026-04-23 21:28 ` [PATCH net-next v2] " Michael Bommarito
  0 siblings, 2 replies; 4+ messages in thread
From: Michael Bommarito @ 2026-04-20 16:27 UTC (permalink / raw)
  To: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev
  Cc: Simon Horman, Kuniyuki Iwashima, Kees Cook, Feng Yang,
	linux-kernel

When __netlink_dump_start() has already installed cb->skb, taken the
module reference and set cb_running, a failure from the first
netlink_dump(sk, true) call returns via errout_skb without unwinding the
callback lifetime. That leaves cb_running set and defers module_put()
and consume_skb(cb->skb) until userspace drains the socket or closes it.

Share the normal callback teardown in a helper and use it on successful
completion and on the initial lock_taken=true failure path. Keep the
lock_taken=false continuation path unchanged, because recvmsg()-driven
retries legitimately preserve cb_running when they run out of receive
room.

Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.")
Assisted-by: Claude:claude-opus-4-6
Assisted-by: Codex:gpt-5-4
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
Validation inside a UML guest on current mainline:

  - An unprivileged local task (uid=65534, no CAP_NET_ADMIN) opens a
    plain NETLINK_ROUTE socket, preloads sk_rmem_alloc with echoed
    NLMSG_ERROR replies from an unsupported rtnetlink type, then issues
    RTM_GETLINK | NLM_F_DUMP | NLM_F_ACK.
  - Stock kernel: the initial __netlink_dump_start() hits the rmem gate
    and returns via errout_skb with cb_running stuck at 1 until
    recvmsg() or close() drives forward progress.
  - Patched kernel: the same probe leaves cb_running clear immediately
    on the lock_taken=true failure, and the larger-rcvbuf continuation
    path (legitimate dump in progress) is unchanged.
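
For readers who want to picture the last step of the probe, here is a
minimal sketch of how the dump request could be laid out. It is not the
reproducer used for the results above: struct getlink_req and
build_getlink_dump() are hypothetical names, NLM_F_REQUEST is an assumed
addition (the kernel only processes messages flagged as requests), and
the sk_rmem_alloc preloading step is not shown.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request layout: netlink header followed by an ifinfomsg. */
struct getlink_req {
	struct nlmsghdr nlh;
	struct ifinfomsg ifm;
};

/* Fill @req with an RTM_GETLINK dump request that also asks for an ACK. */
void build_getlink_dump(struct getlink_req *req, unsigned int seq)
{
	assert(req != NULL);

	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(req->ifm));
	req->nlh.nlmsg_type = RTM_GETLINK;
	/* NLM_F_REQUEST is an assumption of this sketch, see above. */
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP | NLM_F_ACK;
	req->nlh.nlmsg_seq = seq;
	req->ifm.ifi_family = AF_UNSPEC;
}
```

The filled request would then be passed to send() on a socket opened
with socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE).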

A scaling pass on 3500 such wedged sockets in a 256M UML guest shows
about 3.8-3.9 MiB of extra unreclaimable slab (/proc/meminfo
SUnreclaim) beyond the visible queued rmem on the vulnerable kernel,
roughly 1.1 KiB/socket. Real accumulation, but the test hits
RLIMIT_NOFILE long before the guest approaches OOM, so this still
looks like a local availability cleanup rather than an exhaustion
primitive.
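
The per-socket figure is just the slab delta divided by the socket
count. A small sketch of that arithmetic, with per_socket_kib() as a
hypothetical helper name:

```c
#include <assert.h>

/* Convert a total overhead in MiB into an average per-socket cost in KiB. */
double per_socket_kib(double total_mib, unsigned int sockets)
{
	assert(sockets > 0);
	return total_mib * 1024.0 / sockets;
}
```

With 3500 sockets, 3.8 MiB works out to about 1.11 KiB per socket and
3.9 MiB to about 1.14 KiB, which matches the rounded 1.1 KiB quoted
above.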

No Cc: stable@ on the theory that the bug self-heals on
recvmsg()/close and the accumulation is mild. Happy to add it and
route to net if you'd rather see it backported.

 net/netlink/af_netlink.c | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 4d609d5cf406..7019c17e6879 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2250,6 +2250,20 @@ static int netlink_dump_done(struct netlink_sock *nlk, struct sk_buff *skb,
 	return 0;
 }
 
+static void netlink_dump_cleanup(struct netlink_sock *nlk)
+{
+	struct module *module = nlk->cb.module;
+	struct sk_buff *skb = nlk->cb.skb;
+
+	if (nlk->cb.done)
+		nlk->cb.done(&nlk->cb);
+
+	WRITE_ONCE(nlk->cb_running, false);
+	mutex_unlock(&nlk->nl_cb_mutex);
+	module_put(module);
+	consume_skb(skb);
+}
+
 static int netlink_dump(struct sock *sk, bool lock_taken)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
@@ -2258,7 +2272,6 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 	struct sk_buff *skb = NULL;
 	unsigned int rmem, rcvbuf;
 	size_t max_recvmsg_len;
-	struct module *module;
 	int err = -ENOBUFS;
 	int alloc_min_size;
 	int alloc_size;
@@ -2366,19 +2379,14 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 	else
 		__netlink_sendskb(sk, skb);
 
-	if (cb->done)
-		cb->done(cb);
-
-	WRITE_ONCE(nlk->cb_running, false);
-	module = cb->module;
-	skb = cb->skb;
-	mutex_unlock(&nlk->nl_cb_mutex);
-	module_put(module);
-	consume_skb(skb);
+	netlink_dump_cleanup(nlk);
 	return 0;
 
 errout_skb:
-	mutex_unlock(&nlk->nl_cb_mutex);
+	if (lock_taken)
+		netlink_dump_cleanup(nlk);
+	else
+		mutex_unlock(&nlk->nl_cb_mutex);
 	kfree_skb(skb);
 	return err;
 }
-- 
2.53.0



* Re: [PATCH net-next] netlink: clean up failed initial dump-start state
  2026-04-20 16:27 [PATCH net-next] netlink: clean up failed initial dump-start state Michael Bommarito
@ 2026-04-20 17:37 ` Jakub Kicinski
  2026-04-20 17:56   ` Michael Bommarito
  2026-04-23 21:28 ` [PATCH net-next v2] " Michael Bommarito
  1 sibling, 1 reply; 4+ messages in thread
From: Jakub Kicinski @ 2026-04-20 17:37 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: David S . Miller, Eric Dumazet, Paolo Abeni, netdev, Simon Horman,
	Kuniyuki Iwashima, Kees Cook, Feng Yang, linux-kernel

On Mon, 20 Apr 2026 12:27:34 -0400 Michael Bommarito wrote:
> When __netlink_dump_start() has already installed cb->skb, taken the
> module reference and set cb_running, a failure from the first
> netlink_dump(sk, true) call returns via errout_skb without unwinding the
> callback lifetime. That leaves cb_running set and defers module_put()
> and consume_skb(cb->skb) until userspace drains the socket or closes it.

On a quick look I can't see which path clears the dump state in case we
keep failing to allocate an skb. Could you add more info on that?

> Share the normal callback teardown in a helper and use it on successful
> completion and on the initial lock_taken=true failure path. Keep the
> lock_taken=false continuation path unchanged, because recvmsg()-driven
> retries legitimately preserve cb_running when they run out of receive
> room.
> 
> Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.")
> Assisted-by: Claude:claude-opus-4-6
> Assisted-by: Codex:gpt-5-4
> Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
> ---
> Validation inside a UML guest on current mainline:
> 
>   - An unprivileged local task (uid=65534, no CAP_NET_ADMIN) opens a
>     plain NETLINK_ROUTE socket, preloads sk_rmem_alloc with echoed
>     NLMSG_ERROR replies from an unsupported rtnetlink type, then issues
>     RTM_GETLINK | NLM_F_DUMP | NLM_F_ACK.
>   - Stock kernel: the initial __netlink_dump_start() hits the rmem gate
>     and returns via errout_skb with cb_running stuck at 1 until
>     recvmsg() or close() drives forward progress.
>   - Patched kernel: the same probe leaves cb_running clear immediately
>     on the lock_taken=true failure, and the larger-rcvbuf continuation
>     path (legitimate dump in progress) is unchanged.
> 
> A scaling pass on 3500 such wedged sockets in a 256M UML guest shows
> about 3.8-3.9 MiB of extra unreclaimable slab (/proc/meminfo
> SUnreclaim) beyond the visible queued rmem on the vulnerable kernel,
> roughly 1.1 KiB/socket. Real accumulation, but the test hits
> RLIMIT_NOFILE long before the guest approaches OOM, so this still
> looks like a local availability cleanup rather than an exhaustion
> primitive.

This should be part of the commit message, it's useful for understanding
the problem. Actually more than the current commit msg TBH.

> No Cc: stable@ on the theory that the bug self-heals on
> recvmsg()/close and the accumulation is mild. Happy to add it and
> route to net if you'd rather see it backported.
> 
>  net/netlink/af_netlink.c | 30 +++++++++++++++++++-----------
>  1 file changed, 19 insertions(+), 11 deletions(-)
> 
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 4d609d5cf406..7019c17e6879 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2250,6 +2250,20 @@ static int netlink_dump_done(struct netlink_sock *nlk, struct sk_buff *skb,
>  	return 0;
>  }
>  
> +static void netlink_dump_cleanup(struct netlink_sock *nlk)
> +{
> +	struct module *module = nlk->cb.module;
> +	struct sk_buff *skb = nlk->cb.skb;
> +
> +	if (nlk->cb.done)
> +		nlk->cb.done(&nlk->cb);
> +
> +	WRITE_ONCE(nlk->cb_running, false);
> +	mutex_unlock(&nlk->nl_cb_mutex);
> +	module_put(module);
> +	consume_skb(skb);
> +}

It's probably better to create a helper that shares the code with 
the release path as well. And try not to switch the skb freeing 
to consume_skb().

>  static int netlink_dump(struct sock *sk, bool lock_taken)
>  {
>  	struct netlink_sock *nlk = nlk_sk(sk);
> @@ -2258,7 +2272,6 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
>  	struct sk_buff *skb = NULL;
>  	unsigned int rmem, rcvbuf;
>  	size_t max_recvmsg_len;
> -	struct module *module;
>  	int err = -ENOBUFS;
>  	int alloc_min_size;
>  	int alloc_size;
> @@ -2366,19 +2379,14 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
>  	else
>  		__netlink_sendskb(sk, skb);
>  
> -	if (cb->done)
> -		cb->done(cb);
> -
> -	WRITE_ONCE(nlk->cb_running, false);
> -	module = cb->module;
> -	skb = cb->skb;
> -	mutex_unlock(&nlk->nl_cb_mutex);
> -	module_put(module);
> -	consume_skb(skb);
> +	netlink_dump_cleanup(nlk);
>  	return 0;
>  
>  errout_skb:
> -	mutex_unlock(&nlk->nl_cb_mutex);
> +	if (lock_taken)
> +		netlink_dump_cleanup(nlk);
> +	else
> +		mutex_unlock(&nlk->nl_cb_mutex);
>  	kfree_skb(skb);
>  	return err;
>  }

If you're planning to repost - please wait until tomorrow, we ask that
revisions are at least 24h apart so that people across the timezones
have a chance to chime in.


* Re: [PATCH net-next] netlink: clean up failed initial dump-start state
  2026-04-20 17:37 ` Jakub Kicinski
@ 2026-04-20 17:56   ` Michael Bommarito
  0 siblings, 0 replies; 4+ messages in thread
From: Michael Bommarito @ 2026-04-20 17:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S . Miller, Eric Dumazet, Paolo Abeni, netdev, Simon Horman,
	Kuniyuki Iwashima, Kees Cook, Feng Yang, linux-kernel

On Mon, Apr 20, 2026 at 1:37 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On a quick look I can't see which path clears the dump state in case we
> keep failing to allocate an skb. Could you add more info on that?
> ...
> This should be part of the commit message, it's useful for understanding
> the problem. Actually more than the current commit msg TBH.
> ...
> If you're planning to repost - please wait until tomorrow, we ask that
> revisions are at least 24h apart so that people across the timezones
> have a chance to chime in.

Thanks, good points.  I'll set a reminder and follow up tomorrow with
your ideas if we don't hear from others.

Thanks,
Mike Bommarito


* [PATCH net-next v2] netlink: clean up failed initial dump-start state
  2026-04-20 16:27 [PATCH net-next] netlink: clean up failed initial dump-start state Michael Bommarito
  2026-04-20 17:37 ` Jakub Kicinski
@ 2026-04-23 21:28 ` Michael Bommarito
  1 sibling, 0 replies; 4+ messages in thread
From: Michael Bommarito @ 2026-04-23 21:28 UTC (permalink / raw)
  To: Jakub Kicinski, David S . Miller, Eric Dumazet, Paolo Abeni
  Cc: Simon Horman, Kuniyuki Iwashima, Kees Cook, Feng Yang, netdev,
	linux-kernel, Michael Bommarito

__netlink_dump_start() installs cb->skb, takes the module reference,
and sets cb_running before calling netlink_dump(sk, true). If that
first call returns via errout_skb the callback state is left behind:
cb_running stays set, module_put() and consume_skb(cb->skb) are
deferred until recvmsg() drives the dump back through the success
path, or netlink_release() on close runs the catch-all cleanup. On
sustained alloc failure neither fires.

Factor the teardown into netlink_dump_cleanup(nlk, drop) shared by
the dump success path, the lock_taken=true errout_skb path, and
netlink_release(). The @drop flag preserves the existing split:
consume_skb() on normal completion, kfree_skb() on abort.

Validation on a UML guest: an unprivileged task opens NETLINK_ROUTE,
preloads sk_rmem_alloc, then issues RTM_GETLINK | NLM_F_DUMP. Stock
kernel leaves cb_running stuck at 1 until recvmsg() or close()
drives it. Patched kernel clears cb_running immediately on the
lock_taken=true failure; the recvmsg continuation path is unchanged.

At scale: 3500 wedged sockets in a 256M guest show about 3.8-3.9
MiB of extra unreclaimable slab (~1.1 KiB/sock) on stock vs zero on
patched. RLIMIT_NOFILE bounds the test before OOM, so this is a
local availability cleanup rather than an exhaustion primitive.

Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.")
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
---
v2 (per Jakub's review <20260420103715.347fbd4a@kernel.org>):
  * commit message names both paths that do clear the state
    (recvmsg-driven retry on drain, netlink_release() on close)
    and notes that neither fires on sustained alloc failure
  * moved the UML validation into the commit message
  * extracted netlink_dump_cleanup(nlk, bool drop); shared with
    netlink_release() and the success path. The bool preserves
    the existing kfree_skb / consume_skb split.
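
As a reading aid, the ownership rule described above can be modelled in
userspace roughly as follows. This is only a sketch under simplifying
assumptions, not kernel code: struct model_sock and the model_*()
functions are hypothetical stand-ins, locking and the cb->done callback
are left out, and the skb is reduced to two counters that record which
free flavour was used.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dump-related fields of struct netlink_sock. */
struct model_sock {
	bool cb_running;	/* dump callback state is installed */
	int module_refs;	/* references held on the callback's module */
	int skbs_dropped;	/* request skbs freed the kfree_skb() way */
	int skbs_consumed;	/* request skbs freed the consume_skb() way */
};

/* Model of __netlink_dump_start() installing the callback state. */
void model_dump_start(struct model_sock *s)
{
	s->cb_running = true;
	s->module_refs++;
}

/* Model of the shared teardown: it always clears the state and drops the
 * module reference; @drop only selects how the request skb is accounted.
 */
void model_dump_cleanup(struct model_sock *s, bool drop)
{
	assert(s->cb_running);

	s->cb_running = false;
	s->module_refs--;
	if (drop)
		s->skbs_dropped++;
	else
		s->skbs_consumed++;
}

/* Model of a failed dump pass: only the initial start (lock_taken=true)
 * owns the teardown; a recvmsg()-driven retry keeps the state for later.
 */
void model_dump_failed(struct model_sock *s, bool lock_taken)
{
	if (lock_taken)
		model_dump_cleanup(s, true);
}
```

In this model a failed initial start leaves nothing behind, while a
failed retry keeps cb_running and the module reference until a later
pass completes or the socket is released.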

v1: https://lore.kernel.org/netdev/20260420162734.854587-1-michael.bommarito@gmail.com/

 net/netlink/af_netlink.c | 47 ++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 4d609d5cf406..ab21a6218631 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -131,6 +131,7 @@ static const char *const nlk_cb_mutex_key_strings[MAX_LINKS + 1] = {
 };
 
 static int netlink_dump(struct sock *sk, bool lock_taken);
+static void netlink_dump_cleanup(struct netlink_sock *nlk, bool drop);
 
 /* nl_table locking explained:
  * Lookup and traversal are protected with an RCU read-side lock. Insertion
@@ -763,13 +764,8 @@ static int netlink_release(struct socket *sock)
 	}
 
 	/* Terminate any outstanding dump */
-	if (nlk->cb_running) {
-		if (nlk->cb.done)
-			nlk->cb.done(&nlk->cb);
-		module_put(nlk->cb.module);
-		kfree_skb(nlk->cb.skb);
-		WRITE_ONCE(nlk->cb_running, false);
-	}
+	if (nlk->cb_running)
+		netlink_dump_cleanup(nlk, true);
 
 	module_put(nlk->module);
 
@@ -2250,6 +2246,26 @@ static int netlink_dump_done(struct netlink_sock *nlk, struct sk_buff *skb,
 	return 0;
 }
 
+/* Must be called with nl_cb_mutex NOT held. @drop=true frees the skb
+ * via kfree_skb() so drop-monitor sees the teardown; @drop=false uses
+ * consume_skb() for the normal-completion path.
+ */
+static void netlink_dump_cleanup(struct netlink_sock *nlk, bool drop)
+{
+	struct module *module = nlk->cb.module;
+	struct sk_buff *skb = nlk->cb.skb;
+
+	if (nlk->cb.done)
+		nlk->cb.done(&nlk->cb);
+
+	WRITE_ONCE(nlk->cb_running, false);
+	module_put(module);
+	if (drop)
+		kfree_skb(skb);
+	else
+		consume_skb(skb);
+}
+
 static int netlink_dump(struct sock *sk, bool lock_taken)
 {
 	struct netlink_sock *nlk = nlk_sk(sk);
@@ -2258,7 +2274,6 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 	struct sk_buff *skb = NULL;
 	unsigned int rmem, rcvbuf;
 	size_t max_recvmsg_len;
-	struct module *module;
 	int err = -ENOBUFS;
 	int alloc_min_size;
 	int alloc_size;
@@ -2366,19 +2381,19 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 	else
 		__netlink_sendskb(sk, skb);
 
-	if (cb->done)
-		cb->done(cb);
-
-	WRITE_ONCE(nlk->cb_running, false);
-	module = cb->module;
-	skb = cb->skb;
 	mutex_unlock(&nlk->nl_cb_mutex);
-	module_put(module);
-	consume_skb(skb);
+	netlink_dump_cleanup(nlk, false);
 	return 0;
 
 errout_skb:
 	mutex_unlock(&nlk->nl_cb_mutex);
+	/* The recvmsg() retry path (lock_taken=false) keeps cb_running so
+	 * the next recvmsg() can drive the dump forward once receive room
+	 * is available; only the initial __netlink_dump_start() failure
+	 * owns the teardown.
+	 */
+	if (lock_taken)
+		netlink_dump_cleanup(nlk, true);
 	kfree_skb(skb);
 	return err;
 }
-- 
2.53.0


