Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH 0/2] sched: Introduce and use deferred WARNs in sched
From: Sebastian Andrzej Siewior @ 2026-06-23 14:26 UTC (permalink / raw)
  To: linux-arch, linux-kernel, sched-ext, netdev
  Cc: David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, K Prateek Nayak, Paolo Abeni,
	Peter Zijlstra, Petr Mladek, Sergey Senozhatsky, Simon Horman,
	Steven Rostedt, Tejun Heo, Vincent Guittot, Vlad Poenaru,
	Sebastian Andrzej Siewior

This is a follow-up to the netconsole lockup reported
	https://lore.kernel.org/all/20260610183621.3915271-1-vlad.wing@gmail.com/

The idea is to use deferred printing for WARNs and use them in sched. I
tried to use only where it looks that the rq lock acquired instead a
plain s/WARN_ON/WARN_ON_DEFFERED which would be simpler.

This unholy deferred mess can be removed once we don't have legacy
consoles anymore _or_ force force_legacy_kthread=true.

The initial report is against v6.16 and netconsole. The reported problem
does not occur upstream since commit 7eab73b18630e ("netconsole: convert
to NBCON console infrastructure") which is v7.0-rc1.

Should this be rejected outright because the preferred sollution is to
| - stick msg in buffer (lockless)
| - print to atomic consoles (lockless)
| - use irq_work to wake console kthreads (lockless)
| - each kthread then tries to flush buffer to its own non-atomic console
|   in non-atomic context."

then this means to force force_legacy_kthread=true.
The threaded legacy printer is available since v6.12-rc1. It terms of stable
fix, this could go back as of v6.12 stable and not earlier (in case we care).

I tested this on a x86 box with 8250 and warning in put_prev_entity().
After it printed the initial warning, it dead-locked shortly after
because systemd was writing to the kernel buffer it acquired the
uart_port_lock then attempted to write lockdep report which required the
same lock…

Sebastian Andrzej Siewior (2):
  bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
  sched: Use WARN_ON.*_DEFERRED()

 include/asm-generic/bug.h  |  41 ++++++++++++++
 kernel/sched/core.c        |  78 +++++++++++++-------------
 kernel/sched/core_sched.c  |   6 +-
 kernel/sched/cpudeadline.c |   6 +-
 kernel/sched/deadline.c    |  62 ++++++++++-----------
 kernel/sched/ext.c         | 110 ++++++++++++++++++-------------------
 kernel/sched/fair.c        |  88 ++++++++++++++---------------
 kernel/sched/rt.c          |  36 ++++++------
 kernel/sched/sched.h       |  18 +++---
 lib/bug.c                  |  16 +++++-
 10 files changed, 257 insertions(+), 204 deletions(-)

-- 
2.53.0

^ permalink raw reply

* [PATCH 1/2] bug: Provide WARN_ON.*DEFERRED() macros for console deferred output
From: Sebastian Andrzej Siewior @ 2026-06-23 14:26 UTC (permalink / raw)
  To: linux-arch, linux-kernel, sched-ext, netdev
  Cc: David S . Miller, Andrea Righi, Andrew Morton, Arnd Bergmann,
	Ben Segall, Breno Leitao, Changwoo Min, David Vernet,
	Dietmar Eggemann, Eric Dumazet, Ingo Molnar, Jakub Kicinski,
	John Ogness, Juri Lelli, K Prateek Nayak, Paolo Abeni,
	Peter Zijlstra, Petr Mladek, Sergey Senozhatsky, Simon Horman,
	Steven Rostedt, Tejun Heo, Vincent Guittot, Vlad Poenaru,
	Sebastian Andrzej Siewior
In-Reply-To: <20260623142650.265721-1-bigeasy@linutronix.de>

Provide a deferred version of the WARN_ON() macro. It will delay
flushing the console until a later context. It is needed in a context
where the caller holds locks which can lead to a deadlock content is
flushed to the console driver.
An example would from a warning from within the scheduler resulting in a
wake-up of a task.

Deferring the output works by using printk_deferred_enter/ exit() around
the printing output. This must be used in a context where the task can't
migrate to another CPU. This should be the case usually, since the
scheduler would acquire the rq lock whith disabled interrupts, but to be
safe preemption is disabled to guarantee this.

In order not to bloat the code on architectures which provide an
optimized __WARN_FLAGS() define BUGFLAG_DEFERRED which is handled by
__report_bug() and does not increase the code size.

Provide the DEFERRED macros based on __WARN_FLAGS and __WARN_FLAGS
macros. Extend __report_bug() to handle the deferred case.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/asm-generic/bug.h | 41 +++++++++++++++++++++++++++++++++++++++
 lib/bug.c                 | 16 +++++++++++++--
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 09e8eccee8ed9..1e3ff00f709b8 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -14,6 +14,7 @@
 #define BUGFLAG_DONE		(1 << 2)
 #define BUGFLAG_NO_CUT_HERE	(1 << 3)	/* CUT_HERE already sent */
 #define BUGFLAG_ARGS		(1 << 4)
+#define BUGFLAG_DEFERRED	(1 << 5)
 #define BUGFLAG_TAINT(taint)	((taint) << 8)
 #define BUG_GET_TAINT(bug)	((bug)->flags >> 8)
 #endif
@@ -115,6 +116,16 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 })
 #endif
 
+#define WARN_ON_DEFERRED(condition) ({					\
+	int __ret_warn_on = !!(condition);				\
+	if (unlikely(__ret_warn_on)) {					\
+		__WARN_FLAGS(#condition,				\
+			     BUGFLAG_DEFERRED |				\
+			     BUGFLAG_TAINT(TAINT_WARN));		\
+	}								\
+	unlikely(__ret_warn_on);					\
+})
+
 #ifndef WARN_ON_ONCE
 #define WARN_ON_ONCE(condition) ({					\
 	int __ret_warn_on = !!(condition);				\
@@ -125,6 +136,16 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 	unlikely(__ret_warn_on);					\
 })
 #endif
+
+#define WARN_ON_ONCE_DEFERRED(condition) ({				\
+	int __ret_warn_on = !!(condition);				\
+	if (unlikely(__ret_warn_on)) {					\
+		__WARN_FLAGS(#condition,				\
+			     BUGFLAG_ONCE | BUGFLAG_DEFERRED |		\
+			     BUGFLAG_TAINT(TAINT_WARN));		\
+	}								\
+	unlikely(__ret_warn_on);					\
+})
 #endif /* __WARN_FLAGS */
 
 #if defined(__WARN_FLAGS) && !defined(__WARN_printf)
@@ -159,6 +180,19 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 })
 #endif
 
+#ifndef WARN_ON_DEFERRED
+#define WARN_ON_DEFERRED(condition) ({					\
+	int __ret_warn_on = !!(condition);				\
+	if (unlikely(__ret_warn_on)) {					\
+		guard(preempt)();					\
+		printk_deferred_enter()					\
+		__WARN();						\
+		printk_deferred_exit()					\
+	}								\
+	unlikely(__ret_warn_on);					\
+})
+#endif
+
 #ifndef WARN
 #define WARN(condition, format...) ({					\
 	int __ret_warn_on = !!(condition);				\
@@ -180,6 +214,11 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 	DO_ONCE_LITE_IF(condition, WARN_ON, 1)
 #endif
 
+#ifndef WARN_ON_ONCE_DEFERRED
+#define WARN_ON_ONCE_DEFERRED(condition)				\
+	DO_ONCE_LITE_IF(condition, WARN_ON_DEFERRED, 1)
+#endif
+
 #ifndef WARN_ONCE
 #define WARN_ONCE(condition, format...)				\
 	DO_ONCE_LITE_IF(condition, WARN, 1, format)
@@ -215,7 +254,9 @@ extern __printf(1, 2) void __warn_printk(const char *fmt, ...);
 })
 #endif
 
+#define WARN_ON_DEFERRED(condition) WARN_ON(condition)
 #define WARN_ON_ONCE(condition) WARN_ON(condition)
+#define WARN_ON_ONCE_DEFERRED(condition) WARN_ON(condition)
 #define WARN_ONCE(condition, format...) WARN(condition, format)
 #define WARN_TAINT(condition, taint, format...) WARN(condition, format)
 #define WARN_TAINT_ONCE(condition, taint, format...) WARN(condition, format)
diff --git a/lib/bug.c b/lib/bug.c
index 224f4cfa4aa31..f5768f5d17b47 100644
--- a/lib/bug.c
+++ b/lib/bug.c
@@ -196,7 +196,7 @@ void __warn_printf(const char *fmt, struct pt_regs *regs)
 
 static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long bugaddr, struct pt_regs *regs)
 {
-	bool warning, once, done, no_cut, has_args;
+	bool warning, once, done, no_cut, has_args, deferred;
 	const char *file, *fmt;
 	unsigned line;
 
@@ -219,6 +219,7 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
 	done     = bug->flags & BUGFLAG_DONE;
 	no_cut   = bug->flags & BUGFLAG_NO_CUT_HERE;
 	has_args = bug->flags & BUGFLAG_ARGS;
+	deferred = bug->flags & BUGFLAG_DEFERRED;
 
 	if (warning && once) {
 		if (done)
@@ -229,7 +230,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
 		 */
 		bug->flags |= BUGFLAG_DONE;
 	}
-
+	if (deferred) {
+		preempt_disable_notrace();
+		printk_deferred_enter();
+	}
 	/*
 	 * BUG() and WARN_ON() families don't print a custom debug message
 	 * before triggering the exception handler, so we must add the
@@ -245,6 +249,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
 		/* this is a WARN_ON rather than BUG/BUG_ON */
 		__warn(file, line, (void *)bugaddr, BUG_GET_TAINT(bug), regs,
 		       NULL);
+		if (deferred) {
+			printk_deferred_exit();
+			preempt_enable_notrace();
+		}
 		return BUG_TRAP_TYPE_WARN;
 	}
 
@@ -254,6 +262,10 @@ static enum bug_trap_type __report_bug(struct bug_entry *bug, unsigned long buga
 		pr_crit("kernel BUG at %pB [verbose debug info unavailable]\n",
 			(void *)bugaddr);
 
+	if (deferred) {
+		printk_deferred_exit();
+		preempt_enable_notrace();
+	}
 	return BUG_TRAP_TYPE_BUG;
 }
 
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH net] tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
From: Xin Long @ 2026-06-23 14:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev, eric.dumazet, syzbot+e14bc5d4942756023b77,
	Jon Maloy
In-Reply-To: <CANn89iLOUWwECfTyiPbk--nqNo=tbshZbYAS9Zy_9OFNdpADoA@mail.gmail.com>

On Tue, Jun 23, 2026 at 9:58 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Jun 23, 2026 at 6:56 AM Xin Long <lucien.xin@gmail.com> wrote:
> >
> > On Tue, Jun 23, 2026 at 2:35 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Mon, Jun 22, 2026 at 10:37 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Mon, Jun 22, 2026 at 6:48 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > > >
> > > >
> > > > > Could this corrupt the list for concurrent RCU readers?
> > > > > When list_del_rcu() is called, it intentionally leaves the next pointer
> > > > > intact so concurrent readers can continue their traversal. However, the
> > > > > immediate call to list_add() overwrites both the next and prev pointers
> > > > > to link the entry into private_list.
> > > > > If a concurrent reader is currently positioned at rcast, won't it follow
> > > > > the newly clobbered next pointer and jump from the original RCU list
> > > > > directly into private_list?
> > > > > Because private_list is allocated on the local stack, the reader might
> > > > > interpret stack memory as a struct udp_replicast. Furthermore, the reader
> > > > > would miss its loop termination condition because it expects to reach the
> > > > > original list head, potentially resulting in an infinite loop or a crash.
> > > > > [ ... ]
> > > >
> > > > I think you are right.
> > > >
> > > > Considering there is already one rcu_head in udp_replicast I will use it in V2.
> > >
> > > While looking at many syzbot reports with RTNL pressure. I found this
> > > gem in  tipc_exit_net()
> > >
> > > while (atomic_read(&tn->wq_count))
> > >       cond_resched();
> > >
> > > On some kernel builds cond_resched() can be a NOP, so we might loop
> > > here for a while :/
> > >
> > True, thanks for the report,
> >
> > I think a cleanup_wq should be added for 'ub->work' instead of using system_wq,
> > and then do flush_workqueue(cleanup_wq) in tipc_init_net().
> >
>
>  I will send a series of 2 patches.
>
> Second one looks like:
>
Cool, wait_var_event() looks better.

Thanks.

> More complex stuff can be added later in net-next.
>
> commit 2f6b56e70f7048a9a2577715b8cfdb0ec94c2469
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Tue Jun 23 06:48:33 2026 +0000
>
>     tipc: avoid busy looping in tipc_exit_net()
>
>     Blamed commit introduced a busy-wait loop in tipc_exit_net()
>     to wait for pending UDP bearer cleanup works to complete:
>
>            while (atomic_read(&tn->wq_count))
>                    cond_resched();
>
>     This loop can busy-wait for a long time if cond_resched() is a NOP. This
>     typically happens if the netns exit is executed by a high priority task,
>     or under kernels configured without preemption (CONFIG_PREEMPT_NONE). In
>     such cases, it wastes CPU cycles and can lead to soft lockups.
>
>     Fix this by replacing the busy loop with wait_var_event(), allowing the
>     thread to sleep properly until the work queue count reaches zero.
>
>     Accordingly, update cleanup_bearer() to use atomic_dec_and_test() and
>     wake_up_var() to wake up the waiter when the count drops to zero.
>
>     This uses the global wait queue hash table, avoiding the need to bloat
>     struct tipc_net with a wait_queue_head_t. The atomic_dec_and_test()
>     provides the necessary memory barrier to ensure the wakeup is not missed.
>
>     Fixes: 04c26faa51d1 ("tipc: wait and exit until all work queues are done")
>     Signed-off-by: Eric Dumazet <edumazet@google.com>
>     Cc: Xin Long <lucien.xin@gmail.com>
>     Cc: Jon Maloy <jmaloy@redhat.com>
>     Cc: tipc-discussion@lists.sourceforge.net
>
> diff --git a/net/tipc/core.c b/net/tipc/core.c
> index 1ddecea1df6e9100334c47a28ff6c065292fb9ad..315975c3be8186784e9c44c9ff69d62c17ffd4b9
> 100644
> --- a/net/tipc/core.c
> +++ b/net/tipc/core.c
> @@ -45,6 +45,7 @@
>  #include "crypto.h"
>
>  #include <linux/module.h>
> +#include <linux/wait_bit.h>
>
>  /* configurable TIPC parameters */
>  unsigned int tipc_net_id __read_mostly;
> @@ -118,8 +119,7 @@ static void __net_exit tipc_exit_net(struct net *net)
>  #ifdef CONFIG_TIPC_CRYPTO
>         tipc_crypto_stop(&tipc_net(net)->crypto_tx);
>  #endif
> -       while (atomic_read(&tn->wq_count))
> -               cond_resched();
> +       wait_var_event(&tn->wq_count, atomic_read(&tn->wq_count) == 0);
>  }
>
>  static void __net_exit tipc_pernet_pre_exit(struct net *net)
> diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
> index 66f3cb87a0aaaac8f40e8f237ab9a44d539b1cd8..62ae7f5b58409c89798c915dee752ac42487581f
> 100644
> --- a/net/tipc/udp_media.c
> +++ b/net/tipc/udp_media.c
> @@ -40,6 +40,7 @@
>  #include <linux/igmp.h>
>  #include <linux/kernel.h>
>  #include <linux/workqueue.h>
> +#include <linux/wait_bit.h>
>  #include <linux/list.h>
>  #include <net/sock.h>
>  #include <net/ip.h>
> @@ -830,7 +831,8 @@ static void cleanup_bearer(struct work_struct *work)
>         synchronize_net();
>
>         dst_cache_destroy(&ub->rcast.dst_cache);
> -       atomic_dec(&tn->wq_count);
> +       if (atomic_dec_and_test(&tn->wq_count))
> +               wake_up_var(&tn->wq_count);
>         kfree(ub);
>  }

^ permalink raw reply

* Re: [PATCH net] tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
From: Eric Dumazet @ 2026-06-23 13:58 UTC (permalink / raw)
  To: Xin Long
  Cc: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev, eric.dumazet, syzbot+e14bc5d4942756023b77,
	Jon Maloy
In-Reply-To: <CADvbK_cMHtBFGb87P6CoqJN+DCasn_5=RwtNhymdZ6p1eFnjuQ@mail.gmail.com>

On Tue, Jun 23, 2026 at 6:56 AM Xin Long <lucien.xin@gmail.com> wrote:
>
> On Tue, Jun 23, 2026 at 2:35 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Jun 22, 2026 at 10:37 PM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Mon, Jun 22, 2026 at 6:48 PM Xin Long <lucien.xin@gmail.com> wrote:
> > > >
> > >
> > > > Could this corrupt the list for concurrent RCU readers?
> > > > When list_del_rcu() is called, it intentionally leaves the next pointer
> > > > intact so concurrent readers can continue their traversal. However, the
> > > > immediate call to list_add() overwrites both the next and prev pointers
> > > > to link the entry into private_list.
> > > > If a concurrent reader is currently positioned at rcast, won't it follow
> > > > the newly clobbered next pointer and jump from the original RCU list
> > > > directly into private_list?
> > > > Because private_list is allocated on the local stack, the reader might
> > > > interpret stack memory as a struct udp_replicast. Furthermore, the reader
> > > > would miss its loop termination condition because it expects to reach the
> > > > original list head, potentially resulting in an infinite loop or a crash.
> > > > [ ... ]
> > >
> > > I think you are right.
> > >
> > > Considering there is already one rcu_head in udp_replicast I will use it in V2.
> >
> > While looking at many syzbot reports with RTNL pressure. I found this
> > gem in  tipc_exit_net()
> >
> > while (atomic_read(&tn->wq_count))
> >       cond_resched();
> >
> > On some kernel builds cond_resched() can be a NOP, so we might loop
> > here for a while :/
> >
> True, thanks for the report,
>
> I think a cleanup_wq should be added for 'ub->work' instead of using system_wq,
> and then do flush_workqueue(cleanup_wq) in tipc_init_net().
>

 I will send a series of 2 patches.

Second one looks like:

More complex stuff can be added later in net-next.

commit 2f6b56e70f7048a9a2577715b8cfdb0ec94c2469
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jun 23 06:48:33 2026 +0000

    tipc: avoid busy looping in tipc_exit_net()

    Blamed commit introduced a busy-wait loop in tipc_exit_net()
    to wait for pending UDP bearer cleanup works to complete:

           while (atomic_read(&tn->wq_count))
                   cond_resched();

    This loop can busy-wait for a long time if cond_resched() is a NOP. This
    typically happens if the netns exit is executed by a high priority task,
    or under kernels configured without preemption (CONFIG_PREEMPT_NONE). In
    such cases, it wastes CPU cycles and can lead to soft lockups.

    Fix this by replacing the busy loop with wait_var_event(), allowing the
    thread to sleep properly until the work queue count reaches zero.

    Accordingly, update cleanup_bearer() to use atomic_dec_and_test() and
    wake_up_var() to wake up the waiter when the count drops to zero.

    This uses the global wait queue hash table, avoiding the need to bloat
    struct tipc_net with a wait_queue_head_t. The atomic_dec_and_test()
    provides the necessary memory barrier to ensure the wakeup is not missed.

    Fixes: 04c26faa51d1 ("tipc: wait and exit until all work queues are done")
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Xin Long <lucien.xin@gmail.com>
    Cc: Jon Maloy <jmaloy@redhat.com>
    Cc: tipc-discussion@lists.sourceforge.net

diff --git a/net/tipc/core.c b/net/tipc/core.c
index 1ddecea1df6e9100334c47a28ff6c065292fb9ad..315975c3be8186784e9c44c9ff69d62c17ffd4b9
100644
--- a/net/tipc/core.c
+++ b/net/tipc/core.c
@@ -45,6 +45,7 @@
 #include "crypto.h"

 #include <linux/module.h>
+#include <linux/wait_bit.h>

 /* configurable TIPC parameters */
 unsigned int tipc_net_id __read_mostly;
@@ -118,8 +119,7 @@ static void __net_exit tipc_exit_net(struct net *net)
 #ifdef CONFIG_TIPC_CRYPTO
        tipc_crypto_stop(&tipc_net(net)->crypto_tx);
 #endif
-       while (atomic_read(&tn->wq_count))
-               cond_resched();
+       wait_var_event(&tn->wq_count, atomic_read(&tn->wq_count) == 0);
 }

 static void __net_exit tipc_pernet_pre_exit(struct net *net)
diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
index 66f3cb87a0aaaac8f40e8f237ab9a44d539b1cd8..62ae7f5b58409c89798c915dee752ac42487581f
100644
--- a/net/tipc/udp_media.c
+++ b/net/tipc/udp_media.c
@@ -40,6 +40,7 @@
 #include <linux/igmp.h>
 #include <linux/kernel.h>
 #include <linux/workqueue.h>
+#include <linux/wait_bit.h>
 #include <linux/list.h>
 #include <net/sock.h>
 #include <net/ip.h>
@@ -830,7 +831,8 @@ static void cleanup_bearer(struct work_struct *work)
        synchronize_net();

        dst_cache_destroy(&ub->rcast.dst_cache);
-       atomic_dec(&tn->wq_count);
+       if (atomic_dec_and_test(&tn->wq_count))
+               wake_up_var(&tn->wq_count);
        kfree(ub);
 }

^ permalink raw reply

* Re: [PATCH net] ice: fix stats array overflow when VF requests more queues
From: Przemek Kitszel @ 2026-06-23 13:59 UTC (permalink / raw)
  To: Michal Schmidt
  Cc: Tony Nguyen, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jacob Keller, Petr Oros,
	intel-wired-lan, netdev, linux-kernel
In-Reply-To: <CADEbmW0BsQsu1pPX=kk58tTz_5EArjCKgmp_MKxRFcuvb3TDGg@mail.gmail.com>

On 4/29/26 23:59, Michal Schmidt wrote:
> On Tue, Apr 28, 2026 at 4:00 PM Przemek Kitszel
> <przemyslaw.kitszel@intel.com> wrote:
>> On 4/27/26 17:18, Michal Schmidt wrote:
>>> When a VF increases its queue count via VIRTCHNL_OP_REQUEST_QUEUES,
>>> ice_vc_request_qs_msg() sets vf->num_req_qs and triggers a VF reset.
>>> The reset calls ice_vf_reconfig_vsi(), which does ice_vsi_decfg()
>>> followed by ice_vsi_cfg(). ice_vsi_decfg() does not free the per-ring
>>> stats arrays. Inside ice_vsi_cfg_def(), ice_vsi_set_num_qs() updates
>>> alloc_txq/alloc_rxq to the new larger value, but
>>> ice_vsi_alloc_stat_arrays() returns early because the stats already
>>> exist. ice_vsi_alloc_ring_stats() then iterates using the new larger
>>> alloc_txq and writes beyond the bounds of the old, smaller
>>> tx_ring_stats/rx_ring_stats pointer arrays, corrupting adjacent SLUB
>>> metadata.
>>>
>>
>> thank you for reproducing the bug, it is exactly the situation that
>> I was facing
>> have you tried with my proposed (unfortunately not public yet) fix
>> to just combine ice_vsi_alloc_stat_arrays() and
>> ice_vsi_realloc_stat_arrays() into one function?
> 
> I tried that now and the result is: yes, your patch fixes the bug too.
> Michal
> 

Hi,
are you going to make your patch more robust against on CHNL VSIs?
https://lore.kernel.org/netdev/20260523001618.1757240-1-kuba@kernel.org

alternatively I could sent my "alternative fix" which covers that case

^ permalink raw reply

* Re: [PATCH net] tipc: fix UAF in cleanup_bearer() due to premature dst_cache_destroy()
From: Xin Long @ 2026-06-23 13:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kuniyuki Iwashima, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Simon Horman, netdev, eric.dumazet, syzbot+e14bc5d4942756023b77,
	Jon Maloy
In-Reply-To: <CANn89i+dkbrSAwvaWXW7yWMfcwUebuTBLG5T7AGZaZcpVYGyfQ@mail.gmail.com>

On Tue, Jun 23, 2026 at 2:35 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jun 22, 2026 at 10:37 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Jun 22, 2026 at 6:48 PM Xin Long <lucien.xin@gmail.com> wrote:
> > >
> >
> > > Could this corrupt the list for concurrent RCU readers?
> > > When list_del_rcu() is called, it intentionally leaves the next pointer
> > > intact so concurrent readers can continue their traversal. However, the
> > > immediate call to list_add() overwrites both the next and prev pointers
> > > to link the entry into private_list.
> > > If a concurrent reader is currently positioned at rcast, won't it follow
> > > the newly clobbered next pointer and jump from the original RCU list
> > > directly into private_list?
> > > Because private_list is allocated on the local stack, the reader might
> > > interpret stack memory as a struct udp_replicast. Furthermore, the reader
> > > would miss its loop termination condition because it expects to reach the
> > > original list head, potentially resulting in an infinite loop or a crash.
> > > [ ... ]
> >
> > I think you are right.
> >
> > Considering there is already one rcu_head in udp_replicast I will use it in V2.
>
> While looking at many syzbot reports with RTNL pressure. I found this
> gem in  tipc_exit_net()
>
> while (atomic_read(&tn->wq_count))
>       cond_resched();
>
> On some kernel builds cond_resched() can be a NOP, so we might loop
> here for a while :/
>
True, thanks for the report,

I think a cleanup_wq should be added for 'ub->work' instead of using system_wq,
and then do flush_workqueue(cleanup_wq) in tipc_init_net().

> Added in
>
> commit 04c26faa51d1e2fe71cf13c45791f5174c37f986    tipc: wait and exit
> until all work queues are done

^ permalink raw reply

* [PATCH net] tipc: fix out-of-bounds read in broadcast Gap ACK blocks
From: Samuel Page @ 2026-06-23 13:54 UTC (permalink / raw)
  To: Jon Maloy
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Tung Quang Nguyen, netdev, tipc-discussion,
	linux-kernel, Samuel Page

A broadcast PROTOCOL/STATE_MSG can carry a Gap ACK blocks record in its
data area. tipc_get_gap_ack_blks() only verifies that the record's len
field is self-consistent with its ugack_cnt/bgack_cnt counts
(sz == struct_size(p, gacks, ugack_cnt + bgack_cnt)); it does not check
that the record actually fits in the message data area, msg_data_sz().

The unicast caller tipc_link_proto_rcv() bounds it ("if (glen > dlen)
break;"), but the broadcast caller tipc_bcast_sync_rcv() discards the
returned size, so tipc_link_advance_transmq() copies the record off the
receive skb with an attacker-controlled count:

	this_ga = kmemdup(ga, struct_size(ga, gacks, ga->bgack_cnt),
			  GFP_ATOMIC);

A TIPC neighbour that negotiated TIPC_GAP_ACK_BLOCK triggers it with one
ordinary broadcast STATE_MSG (msg_bc_ack_invalid() clear), sized so its
data area is short, carrying a Gap ACK record with len = 0x400,
bgack_cnt = 0xff and ugack_cnt = 0. len then equals
struct_size(p, gacks, 255), so the consistency check passes and ga is
non-NULL; kmemdup() reads struct_size(ga, gacks, 255) = 1024 bytes out
of the much smaller skb:

  BUG: KASAN: slab-out-of-bounds in kmemdup_noprof+0x48/0x60
  Read of size 1024 at addr ffff0000c7030d38 by task poc864/69
  Call trace:
   kmemdup_noprof+0x48/0x60
   tipc_link_advance_transmq+0x86c/0xb80
   tipc_link_bc_ack_rcv+0x19c/0x1e0
   tipc_bcast_sync_rcv+0x1c4/0x2c4
   tipc_rcv+0x85c/0x1340
   tipc_l2_rcv_msg+0xac/0x104
  The buggy address belongs to the object at ffff0000c7030d00
   which belongs to the cache skbuff_small_head of size 704
  The buggy address is located 56 bytes inside of
   allocated 704-byte region [ffff0000c7030d00, ffff0000c7030fc0)

The copied-out bytes are subsequently consumed as gap/ack values, but
the read is already out of bounds at the kmemdup() regardless of how
they are used.

Apply the same bound the unicast path uses to the broadcast caller: drop
the Gap ACK blocks when the reported size exceeds the message data size.
A NULL ga is already the defined "no Gap ACK blocks" case, so well-formed
state messages are unaffected.

Fixes: d7626b5acff9 ("tipc: introduce Gap ACK blocks for broadcast link")
Cc: stable@vger.kernel.org
Assisted-by: Bynario AI
Signed-off-by: Samuel Page <sam@bynar.io>
---
Before posting I found an earlier thread for what looks like the same (or a
very closely related) issue:

  https://lore.kernel.org/netdev/1316452e465e9a96fce44ec15130a14f3872149f.1775809727.git.caoruide123@gmail.com/
  [PATCH net 1/1] tipc: validate Gap ACK blocks in STATE message

That one added the validation inside tipc_get_gap_ack_blks() and the thread
stalled on whether the extra checks were redundant. This patch instead adds,
on the broadcast caller, only the same bound the unicast path already applies,
and includes the KASAN reproducer that was asked for there. 

 net/tipc/bcast.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index 76a1585d3f6b..61c83bd95755 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -502,6 +502,7 @@ int tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
 	struct sk_buff_head *inputq = &tipc_bc_base(net)->inputq;
 	struct tipc_gap_ack_blks *ga;
 	struct sk_buff_head xmitq;
+	u16 glen;
 	int rc = 0;

 	__skb_queue_head_init(&xmitq);
@@ -510,7 +511,10 @@ int tipc_bcast_sync_rcv(struct net *net, struct tipc_link *l,
 	if (msg_type(hdr) != STATE_MSG) {
 		tipc_link_bc_init_rcv(l, hdr);
 	} else if (!msg_bc_ack_invalid(hdr)) {
-		tipc_get_gap_ack_blks(&ga, l, hdr, false);
+		/* Validate Gap ACK blocks, drop if invalid */
+		glen = tipc_get_gap_ack_blks(&ga, l, hdr, false);
+		if (glen > msg_data_sz(hdr))
+			ga = NULL;
 		if (!sysctl_tipc_bc_retruni)
 			retrq = &xmitq;
 		rc = tipc_link_bc_ack_rcv(l, msg_bcast_ack(hdr),

base-commit: a986fde914d88af47eb78fd29c5d1af7952c3500
-- 
2.54.0

^ permalink raw reply related

* Re: [PATCH v4 9/9] rust: macros: remove `THIS_MODULE` static from `module!`
From: Gary Guo @ 2026-06-23 13:53 UTC (permalink / raw)
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe, Dave Ertman, Ira Weiny, Leon Romanovsky, Igor Korotin,
	FUJITA Tomonori, Bjorn Helgaas, Krzysztof Wilczyński,
	Arve Hjønnevåg, Todd Kjos, Christian Brauner,
	Carlos Llamas
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, linux-kernel, netdev,
	linux-pci
In-Reply-To: <20260623-fix-fops-owner-v4-9-0daf5f077d5c@linux.dev>

On Tue Jun 23, 2026 at 7:29 AM BST, Alvin Sun wrote:
> All users have been migrated to `ModuleMetadata::THIS_MODULE` const or
> `this_module::<LocalModule>()` helper. The `static THIS_MODULE`
> generated by the `module!` macro is no longer referenced anywhere,
> so remove it to avoid having two sources of the same `ThisModule`
> pointer.
> 
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>

Reviewed-by: Gary Guo <gary@garyguo.net>

> ---
>  rust/macros/module.rs | 16 ----------------
>  1 file changed, 16 deletions(-)


^ permalink raw reply

* Re: [PATCH v4 8/9] rust: binder: use `LocalModule` for `THIS_MODULE`
From: Gary Guo @ 2026-06-23 13:53 UTC (permalink / raw)
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe, Dave Ertman, Ira Weiny, Leon Romanovsky, Igor Korotin,
	FUJITA Tomonori, Bjorn Helgaas, Krzysztof Wilczyński,
	Arve Hjønnevåg, Todd Kjos, Christian Brauner,
	Carlos Llamas
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, linux-kernel, netdev,
	linux-pci
In-Reply-To: <20260623-fix-fops-owner-v4-8-0daf5f077d5c@linux.dev>

On Tue Jun 23, 2026 at 7:29 AM BST, Alvin Sun wrote:
> Replace the `THIS_MODULE` static reference in the binder fops with
> `this_module::<LocalModule>()`, consistent with the move of
> `THIS_MODULE` into the `ModuleMetadata` trait.
> 
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>

Reviewed-by: Gary Guo <gary@garyguo.net>

> ---
>  drivers/android/binder/rust_binder_main.rs | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)


^ permalink raw reply

* Re: [PATCH v4 7/9] rust: configfs: use `LocalModule` for `THIS_MODULE`
From: Gary Guo @ 2026-06-23 13:53 UTC (permalink / raw)
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe, Dave Ertman, Ira Weiny, Leon Romanovsky, Igor Korotin,
	FUJITA Tomonori, Bjorn Helgaas, Krzysztof Wilczyński,
	Arve Hjønnevåg, Todd Kjos, Christian Brauner,
	Carlos Llamas
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, linux-kernel, netdev,
	linux-pci
In-Reply-To: <20260623-fix-fops-owner-v4-7-0daf5f077d5c@linux.dev>

On Tue Jun 23, 2026 at 7:29 AM BST, Alvin Sun wrote:
> Replace the `THIS_MODULE` static reference in the `configfs_attrs!`
> macro with `this_module::<LocalModule>()`, and update
> rnull to import `LocalModule` instead of `THIS_MODULE`, consistent
> with the move of `THIS_MODULE` into the `ModuleMetadata` trait.
>
> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
> ---
>  drivers/block/rnull/configfs.rs | 6 ++----
>  rust/kernel/configfs.rs         | 8 +++++---
>  2 files changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/block/rnull/configfs.rs b/drivers/block/rnull/configfs.rs
> index c10a55fc58948..b2547ad1e5ddd 100644
> --- a/drivers/block/rnull/configfs.rs
> +++ b/drivers/block/rnull/configfs.rs
> @@ -1,9 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  
> -use super::{
> -    NullBlkDevice,
> -    THIS_MODULE, //
> -};
> +use super::NullBlkDevice;
> +use crate::LocalModule;
>  use kernel::{
>      block::mq::gen_disk::{
>          GenDisk,
> diff --git a/rust/kernel/configfs.rs b/rust/kernel/configfs.rs
> index 2339c6467325d..b542422115461 100644
> --- a/rust/kernel/configfs.rs
> +++ b/rust/kernel/configfs.rs
> @@ -875,7 +875,7 @@ fn as_ptr(&self) -> *const bindings::config_item_type {
>  ///                 configfs::Subsystem<Configuration>,
>  ///                 Configuration
>  ///                 >::new_with_child_ctor::<N,Child>(
> -///             &THIS_MODULE,
> +///             ::kernel::module::this_module::<LocalModule>(),

This should be `crate::LocalModule`.

Best,
Gary

>  ///             &CONFIGURATION_ATTRS
>  ///         );
>  ///
> @@ -1021,7 +1021,8 @@ macro_rules! configfs_attrs {
>  
>                      static [< $data:upper _TPE >] : $crate::configfs::ItemType<$container, $data>  =
>                          $crate::configfs::ItemType::<$container, $data>::new::<N>(
> -                            &THIS_MODULE, &[<$ data:upper _ATTRS >]
> +                            $crate::module::this_module::<LocalModule>(),
> +                            &[<$ data:upper _ATTRS >]
>                          );
>                  )?
>  
> @@ -1030,7 +1031,8 @@ macro_rules! configfs_attrs {
>                          $crate::configfs::ItemType<$container, $data>  =
>                              $crate::configfs::ItemType::<$container, $data>::
>                              new_with_child_ctor::<N, $child>(
> -                                &THIS_MODULE, &[<$ data:upper _ATTRS >]
> +                                $crate::module::this_module::<LocalModule>(),
> +                                &[<$ data:upper _ATTRS >]
>                              );
>                  )?
>  



^ permalink raw reply

* Re: [PATCH v4 6/9] rust: miscdevice: set fops.owner from driver module pointer
From: Gary Guo @ 2026-06-23 13:51 UTC (permalink / raw)
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe, Dave Ertman, Ira Weiny, Leon Romanovsky, Igor Korotin,
	FUJITA Tomonori, Bjorn Helgaas, Krzysztof Wilczyński,
	Arve Hjønnevåg, Todd Kjos, Christian Brauner,
	Carlos Llamas
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, linux-kernel, netdev,
	linux-pci
In-Reply-To: <20260623-fix-fops-owner-v4-6-0daf5f077d5c@linux.dev>

On Tue Jun 23, 2026 at 7:29 AM BST, Alvin Sun wrote:
> Set the miscdevice fops owner field from the driver module pointer
> via the `this_module::<T::OwnerModule>()` helper, instead of
> defaulting to null.
> 
> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>

Reviewed-by: Gary Guo <gary@garyguo.net>

> ---
>  rust/kernel/miscdevice.rs | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)


^ permalink raw reply

* Re: [PATCH v4 4/9] rust: macros: auto-insert OwnerModule in #[vtable]
From: Gary Guo @ 2026-06-23 13:50 UTC (permalink / raw)
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe, Dave Ertman, Ira Weiny, Leon Romanovsky, Igor Korotin,
	FUJITA Tomonori, Bjorn Helgaas, Krzysztof Wilczyński,
	Arve Hjønnevåg, Todd Kjos, Christian Brauner,
	Carlos Llamas
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, linux-kernel, netdev,
	linux-pci
In-Reply-To: <20260623-fix-fops-owner-v4-4-0daf5f077d5c@linux.dev>

On Tue Jun 23, 2026 at 7:29 AM BST, Alvin Sun wrote:
> Auto-add `type OwnerModule: ::kernel::ModuleMetadata;` as a required
> associated type on the trait side if not already defined, and
> auto-insert `type OwnerModule = crate::LocalModule;` on the impl side
> if not explicitly provided, eliminating the need to manually declare
> and implement `OwnerModule` in every vtable trait and impl.
> 
> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
> Suggested-by: Gary Guo <gary@garyguo.net>
> Link: https://lore.kernel.org/all/DIMMWHUOLPSH.13JFRHDKDQJGO@garyguo.net
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>

Reviewed-by: Gary Guo <gary@garyguo.net>

> ---
>  rust/macros/lib.rs    |  6 ++++++
>  rust/macros/vtable.rs | 41 ++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 42 insertions(+), 5 deletions(-)


^ permalink raw reply

* Re: [PATCH v4 3/9] rust: doctest: add LocalModule fallback for #[vtable] ThisModule
From: Gary Guo @ 2026-06-23 13:49 UTC (permalink / raw)
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe, Dave Ertman, Ira Weiny, Leon Romanovsky, Igor Korotin,
	FUJITA Tomonori, Bjorn Helgaas, Krzysztof Wilczyński,
	Arve Hjønnevåg, Todd Kjos, Christian Brauner,
	Carlos Llamas
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, linux-kernel, netdev,
	linux-pci
In-Reply-To: <20260623-fix-fops-owner-v4-3-0daf5f077d5c@linux.dev>

On Tue Jun 23, 2026 at 7:29 AM BST, Alvin Sun wrote:
> Add a `LocalModule` struct with a null-pointer `ModuleMetadata` impl
> in the doctest harness, so that `crate::LocalModule` (auto-inserted
> by `#[vtable]`) resolves correctly when there is no `module!` macro.
>
> Reviewed-by: Andreas Hindborg <a.hindborg@kernel.org>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
> ---
>  scripts/rustdoc_test_gen.rs | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> diff --git a/scripts/rustdoc_test_gen.rs b/scripts/rustdoc_test_gen.rs
> index ee76e96b41eea..198af4e446c8c 100644
> --- a/scripts/rustdoc_test_gen.rs
> +++ b/scripts/rustdoc_test_gen.rs
> @@ -239,6 +239,22 @@ macro_rules! assert_eq {{
>  
>  const __LOG_PREFIX: &[u8] = b"rust_doctests_kernel\0";
>  
> +/// Dummy module type for doctest context.
> +struct LocalModule;
> +
> +use kernel::{{
> +    str::CStr,
> +    ModuleMetadata,
> +    ThisModule, //
> +}};
> +use core::ptr::null_mut;
> +
> +impl ModuleMetadata for LocalModule {{
> +    const NAME: &'static CStr = c"rust_doctests_kernel";
> +    // SAFETY: `try_module_get`/`module_put` handle null module pointers gracefully.
> +    const THIS_MODULE: ThisModule = unsafe {{ ThisModule::from_ptr(null_mut()) }};
> +}}

We probably a macro for crates that are built-in or are not the main crate of a
multi-crate module, and this would be able to use that mechanism.

But this looks okay for now.

Reviewed-by: Gary Guo <gary@garyguo.net>

> +
>  {rust_tests}
>  "#
>      )



^ permalink raw reply

* Re: [PATCH v4 2/9] rust: module: add `THIS_MODULE` const to `ModuleMetadata` trait
From: Gary Guo @ 2026-06-23 13:46 UTC (permalink / raw)
  To: Alvin Sun, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Luis Chamberlain, Petr Pavlu,
	Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Greg Kroah-Hartman,
	Rafael J. Wysocki, David Airlie, Simona Vetter, Daniel Almeida,
	Arnd Bergmann, Brendan Higgins, David Gow, Rae Moar, Breno Leitao,
	Jens Axboe, Dave Ertman, Ira Weiny, Leon Romanovsky, Igor Korotin,
	FUJITA Tomonori, Bjorn Helgaas, Krzysztof Wilczyński,
	Arve Hjønnevåg, Todd Kjos, Christian Brauner,
	Carlos Llamas
  Cc: rust-for-linux, linux-modules, driver-core, dri-devel, nova-gpu,
	linux-kselftest, kunit-dev, linux-block, linux-kernel, netdev,
	linux-pci
In-Reply-To: <20260623-fix-fops-owner-v4-2-0daf5f077d5c@linux.dev>

On Tue Jun 23, 2026 at 7:29 AM BST, Alvin Sun wrote:
> Since `const_refs_to_static` has been stable as of the MSRV bump, a
> `ThisModule` pointer can now be used in const contexts.
>
> Add a `THIS_MODULE` const to the `ModuleMetadata` trait so that modules
> can provide their `ThisModule` pointer in const contexts such as static
> `file_operations`.
>
> Add a `this_module()` helper to retrieve the `THIS_MODULE` pointer of a
> given module type, and update `__init` to use it instead of the
> `THIS_MODULE` static generated by the `module!` macro.
>
> The `static THIS_MODULE` generated by the `module!` macro is retained
> for backwards compatibility with existing users and removed in a later
> patch once all references have been migrated.
>
> Signed-off-by: Alvin Sun <alvin.sun@linux.dev>
> ---
>  rust/kernel/module.rs |  8 ++++++++
>  rust/macros/module.rs | 18 +++++++++++++++++-
>  2 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/rust/kernel/module.rs b/rust/kernel/module.rs
> index be242a82e86d2..5aca42f7a33fc 100644
> --- a/rust/kernel/module.rs
> +++ b/rust/kernel/module.rs
> @@ -42,6 +42,14 @@ fn init(module: &'static ThisModule) -> impl pin_init::PinInit<Self, crate::erro
>  pub trait ModuleMetadata {
>      /// The name of the module as specified in the `module!` macro.
>      const NAME: &'static crate::str::CStr;
> +
> +    /// The module's `THIS_MODULE` pointer.
> +    const THIS_MODULE: ThisModule;
> +}
> +
> +/// Returns a reference to the `THIS_MODULE` of the given module type.

#[inline]

> +pub const fn this_module<M: ModuleMetadata>() -> &'static ThisModule {
> +    &M::THIS_MODULE
>  }


With the change,

Reviewed-by: Gary Guo <gary@garyguo.net>

>  
>  /// Equivalent to `THIS_MODULE` in the C API.
> diff --git a/rust/macros/module.rs b/rust/macros/module.rs
> index 06c18e2075083..aa9a618d5d19e 100644
> --- a/rust/macros/module.rs
> +++ b/rust/macros/module.rs
> @@ -519,6 +519,22 @@ pub(crate) fn module(info: ModuleInfo) -> Result<TokenStream> {
>  
>          impl ::kernel::ModuleMetadata for #type_ {
>              const NAME: &'static ::kernel::str::CStr = #name_cstr;
> +
> +            #[cfg(MODULE)]
> +            const THIS_MODULE: ::kernel::ThisModule = {
> +                extern "C" {
> +                    static __this_module: ::kernel::types::Opaque<::kernel::bindings::module>;
> +                }
> +
> +                // SAFETY: `__this_module` is constructed by the kernel at load time
> +                // and lives until the module is unloaded.
> +                unsafe { ::kernel::ThisModule::from_ptr(__this_module.get()) }
> +            };
> +
> +            #[cfg(not(MODULE))]
> +            const THIS_MODULE: ::kernel::ThisModule = unsafe {
> +                ::kernel::ThisModule::from_ptr(::core::ptr::null_mut())
> +            };
>          }
>  
>          // Double nested modules, since then nobody can access the public items inside.
> @@ -616,7 +632,7 @@ pub extern "C" fn #ident_exit() {
>                  /// This function must only be called once.
>                  unsafe fn __init() -> ::kernel::ffi::c_int {
>                      let initer = <super::super::LocalModule as ::kernel::InPlaceModule>::init(
> -                        &super::super::THIS_MODULE
> +                        ::kernel::module::this_module::<super::super::LocalModule>()
>                      );
>                      // SAFETY: No data race, since `__MOD` can only be accessed by this module
>                      // and there only `__init` and `__exit` access it. These functions are only



^ permalink raw reply

* [PATCH net 7/7] selftests/xsk: account invalid multi-buffer Tx descriptors
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>

Invalid descriptors in the middle of a multi-buffer packet still belong
to the packet being consumed from the Tx ring. The tests should therefore
count the whole invalid packet as outstanding in verbatim mode, even
though the packet must not be expected on the Rx side.

Make fragment counting follow the packet boundary instead of stopping at
the first invalid fragment. Update custom stream generation so invalid
middle fragments terminate the generated Rx packet while Tx accounting
still covers all descriptors consumed from the invalid multi-buffer
packet.

Also add explicit end fragments after invalid middle descriptors. This
exercises the kernel drain logic and verifies that subsequent valid
packets are not interpreted as continuations of the invalid packet.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 24 ++++++++++++-------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index de1e63c3fdf6..d8a1c0d40e5a 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -433,14 +433,14 @@ static u32 pkt_nb_frags(u32 frame_size, struct pkt_stream *pkt_stream, struct pk
 	}
 
 	/* Search for the end of the packet in verbatim mode */
-	if (!pkt_continues(pkt->options) || !pkt->valid)
+	if (!pkt_continues(pkt->options))
 		return nb_frags;
 
 	next_frag = pkt_stream->current_pkt_nb;
 	pkt++;
 	while (next_frag++ < pkt_stream->nb_pkts) {
 		nb_frags++;
-		if (!pkt_continues(pkt->options) || !pkt->valid)
+		if (!pkt_continues(pkt->options))
 			break;
 		pkt++;
 	}
@@ -671,11 +671,11 @@ static struct pkt_stream *__pkt_stream_generate_custom(struct ifobject *ifobj, s
 			if (!frame->valid || !pkt_continues(frame->options))
 				payload++;
 		} else {
-			if (frame->valid)
+			if (frame->valid) {
 				len += frame->len;
-			if (frame->valid && pkt_continues(frame->options))
-				continue;
-
+				if (pkt_continues(frame->options))
+					continue;
+			}
 			pkt->pkt_nb = pkt_nb;
 			pkt->len = len;
 			pkt->valid = frame->valid;
@@ -1214,6 +1214,7 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 	for (i = 0; i < xsk->batch_size; i++) {
 		struct pkt *pkt = pkt_stream_get_next_tx_pkt(pkt_stream);
 		u32 nb_frags_left, nb_frags, bytes_written = 0;
+		struct pkt *first_pkt = pkt;
 
 		if (!pkt)
 			break;
@@ -1258,6 +1259,8 @@ static int __send_pkts(struct ifobject *ifobject, struct xsk_socket_info *xsk, b
 		if (pkt && pkt->valid) {
 			valid_pkts++;
 			valid_frags += nb_frags;
+		} else if (pkt_stream->verbatim && pkt_continues(first_pkt->options)) {
+			valid_frags += nb_frags;
 		}
 	}
 
@@ -2104,13 +2107,16 @@ int testapp_invalid_desc_mb(struct test_spec *test)
 		{0, 0, 0, false, 0},
 		/* Invalid address in the second frame */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
-		{umem_sz, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{umem_sz * 2, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{0, MIN_PKT_SIZE, 0, false, 0},
 		/* Invalid len in the middle */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
 		{0, XSK_UMEM__INVALID_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
+		{0, MIN_PKT_SIZE, 0, false, 0},
 		/* Invalid options in the middle */
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XDP_PKT_CONTD},
 		{0, XSK_UMEM__LARGE_FRAME_SIZE, 0, false, XSK_DESC__INVALID_OPTION},
+		{0, MIN_PKT_SIZE, 0, false, 0},
 		/* Transmit 2 frags, receive 3 */
 		{0, XSK_UMEM__MAX_FRAME_SIZE, 0, true, XDP_PKT_CONTD},
 		{0, XSK_UMEM__MAX_FRAME_SIZE, 0, true, 0},
@@ -2122,8 +2128,8 @@ int testapp_invalid_desc_mb(struct test_spec *test)
 
 	if (umem->unaligned_mode) {
 		/* Crossing a chunk boundary allowed */
-		pkts[12].valid = true;
-		pkts[13].valid = true;
+		pkts[15].valid = true;
+		pkts[16].valid = true;
 	}
 
 	test->mtu = MAX_ETH_JUMBO_SIZE;
-- 
2.43.0


^ permalink raw reply related

* [PATCH net 6/7] selftests/xsk: fix too-many-frags multi-buffer Tx test
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>

The too-many-frags test describes a packet that is valid from the Tx
ring ownership point of view, but invalid for transmission because it
exceeds the supported number of fragments.

Keep the generated Tx descriptors valid so that __send_pkts() accounts
them as outstanding descriptors that must be reclaimed through the CQ.
Then mark the corresponding Rx packet invalid so the test still does
not expect the oversized packet to appear on the receive side.

Add a valid synchronization packet after the oversized packet so the
test can verify that the Tx path drains the bad packet and resumes at
the next packet boundary.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 .../selftests/bpf/prog_tests/test_xsk.c       | 20 +++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/test_xsk.c b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
index 72875071d4f1..de1e63c3fdf6 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_xsk.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_xsk.c
@@ -2258,7 +2258,7 @@ int testapp_too_many_frags(struct test_spec *test)
 		max_frags += 1;
 	}
 
-	pkts = calloc(2 * max_frags + 2, sizeof(struct pkt));
+	pkts = calloc(2 * max_frags + 3, sizeof(struct pkt));
 	if (!pkts)
 		return TEST_FAILURE;
 
@@ -2279,21 +2279,29 @@ int testapp_too_many_frags(struct test_spec *test)
 	/* An invalid packet with the max amount of frags but signals packet
 	 * continues on the last frag
 	 */
-	for (i = max_frags + 1; i < 2 * max_frags + 1; i++) {
+	for (i = max_frags + 1; i < 2 * max_frags + 2; i++) {
 		pkts[i].len = MIN_PKT_SIZE;
 		pkts[i].options = XDP_PKT_CONTD;
-		pkts[i].valid = false;
+		pkts[i].valid = true;
 	}
+	pkts[2 * max_frags + 1].options = 0;
 
 	/* Valid packet for synch */
-	pkts[2 * max_frags + 1].len = MIN_PKT_SIZE;
-	pkts[2 * max_frags + 1].valid = true;
+	pkts[2 * max_frags + 2].len = MIN_PKT_SIZE;
+	pkts[2 * max_frags + 2].valid = true;
 
-	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 2)) {
+	if (pkt_stream_generate_custom(test, pkts, 2 * max_frags + 3)) {
 		free(pkts);
 		return TEST_FAILURE;
 	}
 
+	/* The generated Tx stream must keep the too-big packet valid so that
+	 * __send_pkts() accounts its descriptors in outstanding_tx. The Rx
+	 * stream, however, must not expect this packet on the wire.
+	 */
+	test->ifobj_rx->xsk->pkt_stream->pkts[2].valid = false;
+	test->ifobj_rx->xsk->pkt_stream->nb_valid_entries--;
+
 	ret = testapp_validate_traffic(test);
 	free(pkts);
 	return ret;
-- 
2.43.0


^ permalink raw reply related

* [PATCH net 5/7] xsk: reclaim invalid multi-buffer Tx descs in ZC path
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>

Currently, the zero-copy Tx batching path stops when it encounters an
invalid descriptor. For multi-buffer packets this can leave descriptors
consumed from the Tx ring without returning their buffers to userspace
through the completion ring.

Handle invalid multi-buffer packets as a packet-sized unit. Keep
descriptors that are valid for transmission separate from descriptors
that are consumed only because they belong to an invalid multi-buffer
packet. The former are returned to the driver as Tx work, while the
latter are written to the CQ address area so they can be reclaimed by
userspace.

The batched path can retain drain state when the producer has not yet
supplied the end of an invalid packet. Do not allow a second Tx socket to
join the pool while such state exists. Gate the batched data path while a
same-pool bind waits for pre-existing readers, then either add the new
socket or fail the bind with -EAGAIN. This guarantees that drain state is
handled only by the singular batched path and avoids teaching the shared
UMEM fallback path about multi-buffer packet draining.

The reclaim-only descriptors must not be submitted to the completion
ring immediately when they follow real Tx descriptors in the same batch.
Drivers may complete only part of the Tx work returned by
xsk_tx_peek_release_desc_batch(), and publishing the reclaim descriptors
too early would also publish earlier real Tx descriptors that the driver
has not completed yet.

Track the number of driver-visible Tx descriptors that precede pending
reclaim descriptors. xsk_tx_completed() first advances through the real
Tx completions and submits the reclaim descriptors only after all earlier
Tx descriptors in the CQ address order have been completed. If a batch
contains only reclaim descriptors, complete them immediately because
there is no driver-visible Tx work in front of them.

This preserves CQ ordering while ensuring that every descriptor consumed
as part of an invalid multi-buffer packet is eventually returned to
userspace.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 include/net/xsk_buff_pool.h |  6 ++++
 net/xdp/xsk.c               | 62 +++++++++++++++++++++++++++++++---
 net/xdp/xsk_buff_pool.c     | 66 +++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h         | 66 +++++++++++++++++++++++++++----------
 4 files changed, 177 insertions(+), 23 deletions(-)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index ccb3b350001f..4e5abacfcbb7 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -78,9 +78,12 @@ struct xsk_buff_pool {
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
+	u32 reclaim_descs;
+	u32 tx_zc_pending_descs;
 	u32 xdp_zc_max_segs;
 	u8 tx_metadata_len; /* inherited from umem */
 	u8 cached_need_wakeup;
+	bool tx_share_pending;
 	bool uses_need_wakeup;
 	bool unaligned;
 	bool tx_sw_csum;
@@ -113,6 +116,9 @@ void xp_get_pool(struct xsk_buff_pool *pool);
 bool xp_put_pool(struct xsk_buff_pool *pool);
 void xp_clear_dev(struct xsk_buff_pool *pool);
 void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
+int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
+			    bool *pending);
+void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool);
 void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs);
 
 /* AF_XDP, and XDP core. */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 43791647cf18..2dda854c6590 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -499,6 +499,18 @@ void __xsk_map_flush(struct list_head *flush_list)
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
 {
+	if (unlikely(pool->reclaim_descs)) {
+		if (nb_entries < pool->tx_zc_pending_descs) {
+			pool->tx_zc_pending_descs -= nb_entries;
+			xskq_prod_submit_n(pool->cq, nb_entries);
+			return;
+		}
+
+		pool->tx_zc_pending_descs = 0;
+		nb_entries += pool->reclaim_descs;
+		pool->reclaim_descs = 0;
+	}
+
 	xskq_prod_submit_n(pool->cq, nb_entries);
 }
 EXPORT_SYMBOL(xsk_tx_completed);
@@ -576,9 +588,20 @@ static u32 xsk_tx_peek_release_fallback(struct xsk_buff_pool *pool, u32 max_entr
 
 u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 {
+	struct xsk_tx_batch batch = {};
 	struct xdp_sock *xs;
+	u32 cq_cached_prod;
 
 	rcu_read_lock();
+
+	/* Pairs with the release stores in xp_prepare_xsk_tx_share() and
+	 * xp_finish_xsk_tx_share(). If bind is converting a singular Tx pool
+	 * to shared, do not enter the singular batched path.
+	 */
+	if (smp_load_acquire(&pool->tx_share_pending))
+		goto out;
+	if (unlikely(pool->reclaim_descs))
+		goto out;
 	if (!list_is_singular(&pool->xsk_tx_list)) {
 		/* Fallback to the non-batched version */
 		rcu_read_unlock();
@@ -586,10 +609,8 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 	}
 
 	xs = list_first_or_null_rcu(&pool->xsk_tx_list, struct xdp_sock, tx_list);
-	if (!xs) {
-		nb_pkts = 0;
+	if (!xs)
 		goto out;
-	}
 
 	nb_pkts = xskq_cons_nb_entries(xs->tx, nb_pkts);
 
@@ -603,19 +624,38 @@ u32 xsk_tx_peek_release_desc_batch(struct xsk_buff_pool *pool, u32 nb_pkts)
 	if (!nb_pkts)
 		goto out;
 
-	nb_pkts = xskq_cons_read_desc_batch(xs->tx, pool, nb_pkts);
+	batch = xskq_cons_read_desc_batch(xs, pool, nb_pkts);
+	nb_pkts = xsk_tx_batch_cq_descs(&batch);
 	if (!nb_pkts) {
 		xs->tx->queue_empty_descs++;
 		goto out;
 	}
 
 	__xskq_cons_release(xs->tx);
+	cq_cached_prod = pool->cq->cached_prod;
+
 	xskq_prod_write_addr_batch(pool->cq, pool->tx_descs, nb_pkts);
+
+	if (unlikely(batch.reclaim_descs)) {
+		u32 cq_pending_descs;
+
+		/* CQ is positional. Descriptors already written but not
+		 * submitted must complete before any reclaim-only descriptors
+		 * appended below.
+		 */
+		cq_pending_descs = cq_cached_prod - xskq_get_prod(pool->cq);
+
+		pool->tx_zc_pending_descs = batch.tx_descs + cq_pending_descs;
+		pool->reclaim_descs = batch.reclaim_descs;
+		if (unlikely(!pool->tx_zc_pending_descs))
+			xsk_tx_completed(pool, 0);
+	}
+
 	xs->sk.sk_write_space(&xs->sk);
 
 out:
 	rcu_read_unlock();
-	return nb_pkts;
+	return batch.tx_descs;
 }
 EXPORT_SYMBOL(xsk_tx_peek_release_desc_batch);
 
@@ -1442,6 +1482,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
 	struct sock *sk = sock->sk;
 	struct xdp_sock *xs = xdp_sk(sk);
+	bool tx_share_pending = false;
 	struct net_device *dev;
 	int bound_dev_if;
 	u32 flags, qid;
@@ -1549,6 +1590,13 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 				goto out_unlock;
 			}
 
+			err = xp_prepare_xsk_tx_share(umem_xs->pool, xs,
+						      &tx_share_pending);
+			if (err) {
+				sockfd_put(sock);
+				goto out_unlock;
+			}
+
 			xp_get_pool(umem_xs->pool);
 			xs->pool = umem_xs->pool;
 
@@ -1559,6 +1607,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 			if (xs->tx && !xs->pool->tx_descs) {
 				err = xp_alloc_tx_descs(xs->pool, xs);
 				if (err) {
+					if (tx_share_pending)
+						xp_finish_xsk_tx_share(xs->pool);
 					xp_put_pool(xs->pool);
 					xs->pool = NULL;
 					sockfd_put(sock);
@@ -1598,6 +1648,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr_unsized *addr, int addr
 	xs->sg = !!(xs->umem->flags & XDP_UMEM_SG_FLAG);
 	xs->queue_id = qid;
 	xp_add_xsk(xs->pool, xs);
+	if (tx_share_pending)
+		xp_finish_xsk_tx_share(xs->pool);
 
 	if (qid < dev->real_num_rx_queues) {
 		struct netdev_rx_queue *rxq;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 1f28a9641571..6fa732a843a9 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -22,6 +22,72 @@ void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
 	spin_unlock(&pool->xsk_tx_list_lock);
 }
 
+int xp_prepare_xsk_tx_share(struct xsk_buff_pool *pool, struct xdp_sock *xs,
+			    bool *pending)
+{
+	struct xdp_sock *tmp;
+	int err = 0;
+
+	*pending = false;
+	if (!xs->tx)
+		return 0;
+
+	spin_lock(&pool->xsk_tx_list_lock);
+	if (!list_is_singular(&pool->xsk_tx_list)) {
+		spin_unlock(&pool->xsk_tx_list_lock);
+		return 0;
+	}
+
+	if (pool->tx_share_pending) {
+		spin_unlock(&pool->xsk_tx_list_lock);
+		return -EAGAIN;
+	}
+
+	/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+	 * Stop new singular batched Tx readers before synchronize_net()
+	 * waits for readers that may already have observed a singular list.
+	 */
+	smp_store_release(&pool->tx_share_pending, true);
+	*pending = true;
+	spin_unlock(&pool->xsk_tx_list_lock);
+
+	/* A batch that observed a singular Tx socket list before the gate was
+	 * armed may set drain_cont. Wait for all such readers before checking
+	 * whether the pool can safely become shared.
+	 */
+	synchronize_net();
+
+	spin_lock(&pool->xsk_tx_list_lock);
+	list_for_each_entry(tmp, &pool->xsk_tx_list, tx_list) {
+		if (READ_ONCE(tmp->drain_cont)) {
+			err = -EAGAIN;
+			break;
+		}
+	}
+
+	if (err) {
+		/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+		 * No socket was added; clear the gate so Tx can resume.
+		 */
+		smp_store_release(&pool->tx_share_pending, false);
+		*pending = false;
+	}
+	spin_unlock(&pool->xsk_tx_list_lock);
+
+	return err;
+}
+
+void xp_finish_xsk_tx_share(struct xsk_buff_pool *pool)
+{
+	spin_lock(&pool->xsk_tx_list_lock);
+	/* Pairs with the acquire load in xsk_tx_peek_release_desc_batch().
+	 * Publish the preceding xp_add_xsk() list update before allowing Tx
+	 * to observe that the share transition has finished.
+	 */
+	smp_store_release(&pool->tx_share_pending, false);
+	spin_unlock(&pool->xsk_tx_list_lock);
+}
+
 void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs)
 {
 	if (!xs->tx)
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 3e3fbb73d23e..99fa62e0d337 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -58,6 +58,16 @@ struct parsed_desc {
 	u32 valid;
 };
 
+struct xsk_tx_batch {
+	u32 tx_descs;
+	u32 reclaim_descs;
+};
+
+static inline u32 xsk_tx_batch_cq_descs(const struct xsk_tx_batch *batch)
+{
+	return batch->tx_descs + batch->reclaim_descs;
+}
+
 /* The structure of the shared state of the rings are a simple
  * circular buffer, as outlined in
  * Documentation/core-api/circular-buffers.rst. For the Rx and
@@ -263,17 +273,19 @@ static inline void parse_desc(struct xsk_queue *q, struct xsk_buff_pool *pool,
 	parsed->mb = xp_mb_desc(desc);
 }
 
-static inline
-u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
-			      u32 max)
+static inline struct xsk_tx_batch
+xskq_cons_read_desc_batch(struct xdp_sock *xs, struct xsk_buff_pool *pool,
+			  u32 max)
 {
-	u32 cached_cons = q->cached_cons, nb_entries = 0;
 	struct xdp_desc *descs = pool->tx_descs;
-	u32 total_descs = 0, nr_frags = 0;
+	bool drain = READ_ONCE(xs->drain_cont);
+	u32 cached_cons, nb_entries = 0;
+	struct xsk_tx_batch batch = {};
+	struct xsk_queue *q = xs->tx;
+	u32 nr_frags = 0;
+
+	cached_cons = q->cached_cons;
 
-	/* track first entry, if stumble upon *any* invalid descriptor, rewind
-	 * current packet that consists of frags and stop the processing
-	 */
 	while (cached_cons != q->cached_prod && nb_entries < max) {
 		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
 		u32 idx = cached_cons & q->ring_mask;
@@ -282,26 +294,44 @@ u32 xskq_cons_read_desc_batch(struct xsk_queue *q, struct xsk_buff_pool *pool,
 		descs[nb_entries] = ring->desc[idx];
 		cached_cons++;
 		parse_desc(q, pool, &descs[nb_entries], &parsed);
-		if (unlikely(!parsed.valid))
-			break;
+		if (unlikely(!parsed.valid)) {
+			if (!drain && !nr_frags && !parsed.mb)
+				break;
+
+			drain = true;
+		}
+
+		nr_frags++;
+		nb_entries++;
 
 		if (likely(!parsed.mb)) {
-			total_descs += (nr_frags + 1);
-			nr_frags = 0;
-		} else {
-			nr_frags++;
-			if (nr_frags == pool->xdp_zc_max_segs) {
+			if (unlikely(drain)) {
+				batch.reclaim_descs = nr_frags;
+				WRITE_ONCE(xs->drain_cont, false);
 				nr_frags = 0;
 				break;
 			}
+
+			batch.tx_descs += nr_frags;
+			nr_frags = 0;
+			continue;
 		}
-		nb_entries++;
+
+		if (nr_frags == pool->xdp_zc_max_segs)
+			drain = true;
 	}
 
-	cached_cons -= nr_frags;
+	if (nr_frags) {
+		if (drain) {
+			batch.reclaim_descs = nr_frags;
+			WRITE_ONCE(xs->drain_cont, true);
+		} else {
+			cached_cons -= nr_frags;
+		}
+	}
 	/* Release valid plus any invalid entries */
 	xskq_cons_release_n(q, cached_cons - q->cached_cons);
-	return total_descs;
+	return batch;
 }
 
 /* Functions for consumers */
-- 
2.43.0


^ permalink raw reply related

* [PATCH net 4/7] xsk: reclaim offending invalid desc in generic multi-buffer Tx
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>

After an invalid descriptor is found in __xsk_generic_xmit(),
xskq_cons_peek_desc() returns false and the loop body is not entered.
Jason's drain fixes reclaim descriptors already attached to xs->skb and
later continuation descriptors handled through drain_cont, but the
offending descriptor that made peek fail is only released from the Tx
ring.

This loses one completion for each invalid multi-buffer packet in the
generic path. Userspace then waits forever for a descriptor that has
already been consumed by the kernel.

If the failed descriptor belongs to an already-started or already-draining
multi-buffer packet, publish its address to the completion ring before
releasing it. Standalone invalid descriptors keep the existing behavior.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
---
 net/xdp/xsk.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index c489fadc3608..43791647cf18 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1125,8 +1125,22 @@ static int __xsk_generic_xmit(struct sock *sk)
 	}

 	if (xskq_has_descs(xs->tx)) {
+		bool reclaim_desc = xs->skb || xs->drain_cont;
+
+		if (reclaim_desc) {
+			err = xsk_cq_reserve_locked(xs->pool);
+			if (err) {
+				err = -EAGAIN;
+				goto out;
+			}
+		}
+
 		if (xs->skb)
 			xsk_drop_skb(xs->skb);
+
+		if (reclaim_desc)
+			xsk_cq_submit_addr_single_locked(xs->pool, &desc);
+
 		xskq_cons_release(xs->tx);
 		xs->drain_cont = xp_mb_desc(&desc);
 	}
-- 
2.43.0

^ permalink raw reply related

* [PATCH net 3/7] xsk: drain continuation descs on invalid descriptor in __xsk_generic_xmit()
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Jason Xing
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>

From: Jason Xing <kernelxing@tencent.com>

When the TX loop in __xsk_generic_xmit() encounters an invalid
descriptor mid-packet (e.g. an out-of-bounds address), the partial
skb is dropped and the offending descriptor is released. However,
remaining continuation descriptors belonging to the same multi-buffer
packet still sit in the TX ring. Since xs->skb becomes NULL after the
drop, the next iteration treats the leftover continuation fragment as
a brand-new packet, corrupting the packet stream.

Fix this by setting the drain_cont flag when the released descriptor
has XDP_PKT_CONTD set. On the next call to __xsk_generic_xmit(), the
drain logic introduced in the previous patch handles the remaining
fragments with normal CQ backpressure.

There is one subtle case: if a subsequent continuation descriptor also
has an invalid address, xskq_cons_peek_desc() rejects it and the
while loop is never entered, so the in-loop drain path cannot clear
drain_cont. The post-loop code already handles this: it sees
xskq_has_descs() is true (the failed descriptor was read but not
released by peek), releases it, and checks its XDP_PKT_CONTD flag.
Add an else branch so that when the released descriptor is the
last fragment (no XDP_PKT_CONTD), drain_cont is cleared. This
prevents the next valid packet from being incorrectly drained.

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e80c035a7af5..c489fadc3608 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1128,6 +1128,7 @@ static int __xsk_generic_xmit(struct sock *sk)
 		if (xs->skb)
 			xsk_drop_skb(xs->skb);
 		xskq_cons_release(xs->tx);
+		xs->drain_cont = xp_mb_desc(&desc);
 	}

 out:
-- 
2.43.0

^ permalink raw reply related

* [PATCH net 2/7] xsk: drain continuation descs after overflow in xsk_build_skb()
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Jason Xing, Maciej Fijalkowski
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>

From: Jason Xing <kernelxing@tencent.com>

When a multi-buffer packet exceeds MAX_SKB_FRAGS and triggers -EOVERFLOW,
only the current descriptor is released from the TX ring. The remaining
continuation descriptors of the same packet stay in the ring. Since
xs->skb is set to NULL after the drop, the TX loop picks up these
leftover frags and misinterprets each one as the beginning of a new
packet, corrupting the packet stream.

Fix this by adding a drain_cont flag to xdp_sock. When overflow occurs
and the dropped descriptor has XDP_PKT_CONTD set, the flag is raised,
so we have a chance to examine and handle the potential remaining descs
of this big overflow'ed skb.

When the last fragment (without XDP_PKT_CONTD) is processed, the flag
is cleared and the loop continues to process subsequent descriptors
with the remaining budget. This behavior follows how previous xmit path
treats overflow packets.

Closes: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/
Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> # wrapped cq addr submission onto routine
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 include/net/xdp_sock.h |  1 +
 net/xdp/xsk.c          | 24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ebac60a3d8a1..8b51876efbed 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -80,6 +80,7 @@ struct xdp_sock {
 	 * call of __xsk_generic_xmit().
 	 */
 	struct sk_buff *skb;
+	bool drain_cont;
 
 	struct list_head map_list;
 	/* Protects map_list */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a7a83dc4546a..e80c035a7af5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -737,6 +737,19 @@ static void xsk_cq_submit_addr_locked(struct xsk_buff_pool *pool,
 	spin_unlock_irqrestore(&pool->cq_prod_lock, flags);
 }
 
+static void xsk_cq_submit_addr_single_locked(struct xsk_buff_pool *pool,
+					     struct xdp_desc *desc)
+{
+	unsigned long flags;
+	u32 idx;
+
+	spin_lock_irqsave(&pool->cq_prod_lock, flags);
+	idx = xskq_get_prod(pool->cq);
+	xskq_prod_write_addr(pool->cq, idx, desc->addr);
+	xskq_prod_submit_n(pool->cq, 1);
+	spin_unlock_irqrestore(&pool->cq_prod_lock, flags);
+}
+
 static void xsk_cq_cancel_locked(struct xsk_buff_pool *pool, u32 n)
 {
 	spin_lock(&pool->cq->cq_cached_prod_lock);
@@ -1063,11 +1076,22 @@ static int __xsk_generic_xmit(struct sock *sk)
 			goto out;
 		}
 
+		if (unlikely(xs->drain_cont)) {
+			xsk_cq_submit_addr_single_locked(xs->pool, &desc);
+
+			xs->tx->invalid_descs++;
+			xskq_cons_release(xs->tx);
+			xs->drain_cont = xp_mb_desc(&desc);
+			continue;
+		}
+
 		skb = xsk_build_skb(xs, &desc);
 		if (IS_ERR(skb)) {
 			err = PTR_ERR(skb);
 			if (err != -EOVERFLOW)
 				goto out;
+			if (xp_mb_desc(&desc))
+				xs->drain_cont = true;
 			err = 0;
 			continue;
 		}
-- 
2.43.0


^ permalink raw reply related

* [PATCH net 1/7] xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Jason Xing
In-Reply-To: <20260623133240.1048434-1-maciej.fijalkowski@intel.com>

From: Jason Xing <kernelxing@tencent.com>

This patch is inspired by the check[1] from sashiko. It says when
overflow happens, the address of cq to be published is invalid.
Actually the severer thing is the whole process of publishing the
address of cq in this particular case is not right: it should truely
publish the address and advance the cached_prod in cq as long as it
reads descriptors from txq.

The following is the full analysis.
xsk_drop_skb() is called in three places, which all discard a partially
built multi-buffer skb:
1) xsk_build_skb() -EOVERFLOW error path: packet exceeds MAX_SKB_FRAGS
2) __xsk_generic_xmit() post-loop cleanup: an invalid descriptor in
   the TX ring prevents the partial packet from completing
3) xsk_release(): socket close while xs->skb holds an incomplete packet

In all three cases, the TX descriptors for the already-processed frags
have been consumed from the TX ring (xskq_cons_release), and CQ slots
have been reserved. However, xsk_drop_skb() calls xsk_consume_skb()
which cancels the CQ reservations via xsk_cq_cancel_locked(). Since
the buffer addresses never appear in the completion queue, userspace
permanently loses track of these buffers.

Fix this by letting consume_skb() trigger the existing xsk_destruct_skb
destructor, which already submits buffer addresses to the CQ via
xsk_cq_submit_addr_locked().

Note that cancelling the descriptors back to the TX ring (via
xskq_cons_cancel_n) is not a appropriate option because an oversized
packet that always exceeds MAX_SKB_FRAGS would be retried indefinitely,
which is an obviously deadlock bug in the TX path.

Also move the desc->addr assignment in xsk_build_skb() above the
overflow check so that the current descriptor's address is recorded
before a potential -EOVERFLOW jump to free_err, consistent with the
zerocopy path in xsk_build_skb_zerocopy().

[1]: https://lore.kernel.org/all/20260425041726.85FB3C2BCB2@smtp.kernel.org/

Fixes: cf24f5a5feea ("xsk: add support for AF_XDP multi-buffer on Tx path")
Signed-off-by: Jason Xing <kernelxing@tencent.com>
---
 net/xdp/xsk.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b970f30ea9b9..a7a83dc4546a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -794,8 +794,11 @@ static void xsk_consume_skb(struct sk_buff *skb)

 static void xsk_drop_skb(struct sk_buff *skb)
 {
-	xdp_sk(skb->sk)->tx->invalid_descs += xsk_get_num_desc(skb);
-	xsk_consume_skb(skb);
+	struct xdp_sock *xs = xdp_sk(skb->sk);
+
+	xs->tx->invalid_descs += xsk_get_num_desc(skb);
+	consume_skb(skb);
+	xs->skb = NULL;
 }

 static int xsk_skb_metadata(struct sk_buff *skb, void *buffer,
@@ -877,7 +880,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 			return ERR_PTR(-ENOMEM);

 		/* in case of -EOVERFLOW that could happen below,
-		 * xsk_consume_skb() will release this node as whole skb
+		 * xsk_drop_skb() will release this node as whole skb
 		 * would be dropped, which implies freeing all list elements
 		 */
 		xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
@@ -969,6 +972,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				goto free_err;
 			}

+			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
+
 			if (unlikely(nr_frags == (MAX_SKB_FRAGS - 1) && xp_mb_desc(desc))) {
 				err = -EOVERFLOW;
 				goto free_err;
@@ -986,8 +991,6 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,

 			skb_add_rx_frag(skb, nr_frags, page, 0, len, PAGE_SIZE);
 			refcount_add(PAGE_SIZE, &xs->sk.sk_wmem_alloc);
-
-			xsk_addr->addrs[xsk_addr->num_descs] = desc->addr;
 		}
 	}

-- 
2.43.0

^ permalink raw reply related

* [PATCH net 0/7] xsk: fix AF_XDP multi-buffer Tx descriptor reclaim
From: Maciej Fijalkowski @ 2026-06-23 13:32 UTC (permalink / raw)
  To: netdev
  Cc: bpf, magnus.karlsson, stfomichev, kuba, pabeni, horms,
	kerneljasonxing, bjorn, Maciej Fijalkowski

Hi,

This series fixes several AF_XDP multi-buffer Tx paths where descriptors
consumed from the Tx ring are not consistently returned to userspace
through the completion ring when the packet is later dropped as invalid.

The affected cases are invalid or oversized multi-buffer Tx packets in
both the generic and zero-copy paths. In these cases, the kernel can
consume one or more Tx descriptors while building or validating a
multi-buffer packet, then drop the packet before it reaches the device.
Userspace still owns the UMEM buffers only after the corresponding
addresses are returned through the CQ. Missing completions therefore
make userspace lose track of those buffers.

The generic path fixes cover three related cases:
* partially built multi-buffer skbs dropped by xsk_drop_skb();
  continuation descriptors left in the Tx ring after xsk_build_skb()
  reports overflow;
* invalid descriptors encountered in the middle of a multi-buffer
  packet, including the offending invalid descriptor itself.

The zero-copy path is handled separately. The batched Tx parser now
distinguishes descriptors that can be passed to the driver from
descriptors that are consumed only because they belong to an invalid
multi-buffer packet. Reclaim-only descriptors are written to the CQ
address area and published in completion order, after any earlier
driver-visible Tx descriptors.

The ZC batching path can also retain drain state when userspace has not
yet provided the end of an invalid multi-buffer packet. To keep this
state local to the singular batched path, the series prevents a second
Tx socket from joining the same pool while such drain state exists.
During the singular-to-shared transition, Tx batching is gated,
pre-existing readers are waited out, and bind fails with -EAGAIN if the
existing socket still has pending drain state. This avoids adding
multi-buffer drain handling to the shared-UMEM fallback path.

The last two patches update xskxceiver so the tests account invalid
multi-buffer Tx packets as descriptors that must be reclaimed, while
still not expecting those invalid packets on the Rx side.

This is a follow-up to Jason's changes [0] which were addressing generic
xmit only and this set allows me to pass full xskxceiver test suite run
against ice driver.

Thanks,
Maciej

[0]: https://lore.kernel.org/netdev/20260520004244.55663-1-kerneljasonxing@gmail.com/

Jason Xing (3):
  xsk: fix buffer leak in xsk_drop_skb() for AF_XDP multi-buffer Tx
  xsk: drain continuation descs after overflow in xsk_build_skb()
  xsk: drain continuation descs on invalid descriptor in
    __xsk_generic_xmit()

Maciej Fijalkowski (4):
  xsk: reclaim offending invalid desc in generic multi-buffer Tx
  xsk: reclaim invalid multi-buffer Tx descs in ZC path
  selftests/xsk: fix too-many-frags multi-buffer Tx test
  selftests/xsk: account invalid multi-buffer Tx descriptors

 include/net/xdp_sock.h                        |   1 +
 include/net/xsk_buff_pool.h                   |   6 +
 net/xdp/xsk.c                                 | 114 ++++++++++++++++--
 net/xdp/xsk_buff_pool.c                       |  66 ++++++++++
 net/xdp/xsk_queue.h                           |  66 +++++++---
 .../selftests/bpf/prog_tests/test_xsk.c       |  44 ++++---
 6 files changed, 254 insertions(+), 43 deletions(-)

-- 
2.43.0

^ permalink raw reply

* Re: [PATCH] net: liquidio: Check soft command allocation in lio_main setup_nic_devices()
From: Breno Leitao @ 2026-06-23 13:31 UTC (permalink / raw)
  To: Haoxiang Li
  Cc: andrew+netdev, davem, kuba, pabeni, kory.maincent, zilin, petrm,
	u.kleine-koenig, marco.crivellari, vadim.fedorenko,
	Aleksey.Makarov, satananda.burla, felix.manlunas, derek.chickles,
	rvatsavayi, netdev, linux-kernel, stable
In-Reply-To: <20260623125611.2228149-1-haoxiang_li2024@163.com>

On Tue, Jun 23, 2026 at 08:56:11PM +0800, Haoxiang Li wrote:
> octeon_alloc_soft_command() returns NULL when the soft command buffer
> pool is empty. setup_nic_devices() dereferences the returned pointer
> immediately when preparing the interface configuration command, which
> can lead to a NULL pointer dereference if the pool is exhausted.
> 
> Return -ENOMEM when the allocation fails and let the existing NIC init
> failure path handle the error.
> 
> Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters")
> Cc: stable@vger.kernel.org
> Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
> ---
>  drivers/net/ethernet/cavium/liquidio/lio_main.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
> index 0db08ac3d098..5077129656e8 100644
> --- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
> +++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
> @@ -3363,6 +3363,9 @@ static int setup_nic_devices(struct octeon_device *octeon_dev)
>  		sc = (struct octeon_soft_command *)
>  			octeon_alloc_soft_command(octeon_dev, data_size,
>  						  resp_size, 0);
> +		if (!sc)
> +			return -ENOMEM;

Is it fine to return in here, given that the
octeon_register_reqtype_free_fn()  and octeon_register_dispatch_fn()
functions succeed above? Do you need to clean any side effect by them?

--breno

^ permalink raw reply

* [PATCH] net: liquidio: Check soft command allocation in lio_main setup_nic_devices()
From: Haoxiang Li @ 2026-06-23 12:56 UTC (permalink / raw)
  To: andrew+netdev, davem, kuba, pabeni, kory.maincent, zilin, petrm,
	u.kleine-koenig, marco.crivellari, vadim.fedorenko,
	Aleksey.Makarov, satananda.burla, felix.manlunas, derek.chickles,
	rvatsavayi
  Cc: netdev, linux-kernel, Haoxiang Li, stable

octeon_alloc_soft_command() returns NULL when the soft command buffer
pool is empty. setup_nic_devices() dereferences the returned pointer
immediately when preparing the interface configuration command, which
can lead to a NULL pointer dereference if the pool is exhausted.

Return -ENOMEM when the allocation fails and let the existing NIC init
failure path handle the error.

Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters")
Cc: stable@vger.kernel.org
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 0db08ac3d098..5077129656e8 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3363,6 +3363,9 @@ static int setup_nic_devices(struct octeon_device *octeon_dev)
 		sc = (struct octeon_soft_command *)
 			octeon_alloc_soft_command(octeon_dev, data_size,
 						  resp_size, 0);
+		if (!sc)
+			return -ENOMEM;
+
 		resp = (struct liquidio_if_cfg_resp *)sc->virtrptr;
 		vdata = (struct lio_version *)sc->virtdptr;
 
-- 
2.25.1


^ permalink raw reply related

* [PATCH] af_unix: move proto info out of CONFIG_BPF_SYSCALL
From: Ben Dooks @ 2026-06-23 12:49 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, netdev, linux-kernel
  Cc: Ben Dooks

These two structs are defined even if CONFIG_BPF_SYSCALL but
the header does not export them, so declare them anyway and
move the check for CONFIG_BPF_SYSCALL lower into the file.

This removes the two sparse warnings:
net/unix/af_unix.c:1060:14: warning: symbol 'unix_dgram_proto' was not declared. Should it be static?
net/unix/af_unix.c:1071:14: warning: symbol 'unix_stream_proto' was not declared. Should it be static?

This change is less complicated than trying to make those two
structs static based on the CONFIG_BPF_SYSCALL configuration.

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
---
 net/unix/af_unix.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/unix/af_unix.h b/net/unix/af_unix.h
index 8119dbeef3a3..2a6a26b3a2db 100644
--- a/net/unix/af_unix.h
+++ b/net/unix/af_unix.h
@@ -55,10 +55,10 @@ static inline void unix_sysctl_unregister(struct net *net)
 int __unix_dgram_recvmsg(struct sock *sk, struct msghdr *msg, size_t size, int flags);
 int __unix_stream_recvmsg(struct sock *sk, struct msghdr *msg, size_t size, int flags);

-#ifdef CONFIG_BPF_SYSCALL
 extern struct proto unix_dgram_proto;
 extern struct proto unix_stream_proto;

+#ifdef CONFIG_BPF_SYSCALL
 int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
 int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore);
 void __init unix_bpf_build_proto(void);
-- 
2.37.2.352.g3c44437643

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox