[PATCH net-next v2 0/3] netconsole: Fix reported problems

Netdev List

* [PATCH net-next v2 0/3] netconsole: Fix reported problems
@ 2026-06-02 14:26 Breno Leitao
  2026-06-02 14:26 ` [PATCH net-next v2 1/3] netconsole: do not schedule skb pool refill from NMI Breno Leitao
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-02 14:26 UTC (permalink / raw)
  To: Breno Leitao, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, linux-kernel, kernel-team

These are some of the issues that an LLM review reported against
netconsole. They are addressed here, ahead of a larger refactor.

While working on that refactor, the LLM review flagged several
pre-existing issues, which makes it hard to guarantee that the refactor
itself is not introducing any bug. So let's clean up these pre-existing
bugs first, and then submit the refactor.

The issues fixed in this patchset were reported during the review of
https://lore.kernel.org/all/20260524-netconsole_move_more-v1-0-909d1ab398b4@debian.org/

Not all of them are fixed here, only those that were easy to reason
about.

Why net-next and not the 'net' tree?

Most of the functions being fixed here have moved from netpoll to
netconsole, so fixing them in 'net' would cause merge conflicts between
'net' and 'net-next'. I therefore decided to fix them in 'net-next',
given that we are at 7.1-rc6 already. Sorry if that is not the right
approach.

Changed from v1:
  * Change it from 'net' to 'net-next'.

---
Breno Leitao (3):
      netconsole: do not schedule skb pool refill from NMI
      netconsole: do not dequeue pooled skbs that cannot satisfy len
      netconsole: take target_cleanup_list_lock in drop_netconsole_target()

 drivers/net/netconsole.c | 26 ++++++++++++++++++++++++--
 include/linux/netpoll.h  | 16 ++++++++++++++++
 net/core/netpoll.c       |  7 -------
 3 files changed, 40 insertions(+), 9 deletions(-)
---
base-commit: 08484c504b55a98bd100527fbe10a3caf55ff3ff
change-id: 20260528-netcons_fix_before_move-cd6cfec4e8f5

Best regards,
-- 
Breno Leitao <leitao@debian.org>


* [PATCH net-next v2 1/3] netconsole: do not schedule skb pool refill from NMI
  2026-06-02 14:26 [PATCH net-next v2 0/3] netconsole: Fix reported problems Breno Leitao
@ 2026-06-02 14:26 ` Breno Leitao
  2026-06-02 14:26 ` [PATCH net-next v2 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len Breno Leitao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-02 14:26 UTC (permalink / raw)
  To: Breno Leitao, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, linux-kernel, kernel-team

When alloc_skb() fails in find_skb(), the fallback path dequeues an skb
from np->skb_pool and unconditionally calls schedule_work() to top the
pool back up. schedule_work() ends up taking the workqueue pool locks,
which are not NMI-safe.

netconsole_write() is registered as the nbcon write_atomic callback and
is explicitly marked CON_NBCON_ATOMIC_UNSAFE, meaning it is invoked from
emergency/panic contexts including NMIs. If the NMI interrupts a thread
already holding the workqueue pool lock, calling schedule_work()
self-deadlocks and the panic message that was being printed is lost.

Introduce netcons_skb_pop() to fold the pool dequeue and the refill
request into a single helper. The helper skips schedule_work() when
called from NMI context; the pool is best-effort, and the next non-NMI
invocation of find_skb() will refill it. This keeps the fast path
untouched, the panic path NMI-safe, and the locking rules around the
fallback pool documented in one place.
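
As an illustration only, the helper's behaviour can be modelled in
userspace C. This is a sketch under stated assumptions, not the kernel
code: struct buf, struct pool, request_refill() and pool_pop() are
hypothetical names, a counter stands in for schedule_work(), and the
caller passes the context explicitly instead of calling in_nmi().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct buf {
	struct buf *next;
};

struct pool {
	struct buf *head;	/* stands in for np->skb_pool */
	int refill_requests;	/* counts simulated schedule_work() calls */
};

/* Stands in for schedule_work(), which must not run in NMI context. */
static void request_refill(struct pool *p)
{
	p->refill_requests++;
}

/* Pop one buffer; request a refill only when the context allows it. */
static struct buf *pool_pop(struct pool *p, bool in_nmi_ctx)
{
	struct buf *b;

	assert(p != NULL);

	b = p->head;
	if (b)
		p->head = b->next;
	if (!in_nmi_ctx)
		request_refill(p);

	return b;
}
```

A caller in the simulated NMI context still gets a buffer, but the
refill counter stays unchanged until the next non-NMI call.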

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/net/netconsole.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 8ecc2c71c699..918e4a9f4456 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1654,6 +1654,23 @@ static struct notifier_block netconsole_netdev_notifier = {
 	.notifier_call  = netconsole_netdev_event,
 };

+/* Pop a pre-allocated skb from the pool and request a refill.
+ *
+ * The refill is requested via schedule_work(), which takes the workqueue
+ * pool locks and is therefore not NMI-safe. Skip the refill when called
+ * from NMI context; the next non-NMI caller will top the pool back up.
+ */
+static struct sk_buff *netcons_skb_pop(struct netpoll *np)
+{
+	struct sk_buff *skb;
+
+	skb = skb_dequeue(&np->skb_pool);
+	if (!in_nmi())
+		schedule_work(&np->refill_wq);
+
+	return skb;
+}
+
 static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 {
 	int count = 0;
@@ -1663,10 +1680,8 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 repeat:

 	skb = alloc_skb(len, GFP_ATOMIC);
-	if (!skb) {
-		skb = skb_dequeue(&np->skb_pool);
-		schedule_work(&np->refill_wq);
-	}
+	if (!skb)
+		skb = netcons_skb_pop(np);

 	if (!skb) {
 		if (++count < 10) {

-- 
2.54.0


* [PATCH net-next v2 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len
  2026-06-02 14:26 [PATCH net-next v2 0/3] netconsole: Fix reported problems Breno Leitao
  2026-06-02 14:26 ` [PATCH net-next v2 1/3] netconsole: do not schedule skb pool refill from NMI Breno Leitao
@ 2026-06-02 14:26 ` Breno Leitao
  2026-06-02 14:26 ` [PATCH net-next v2 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target() Breno Leitao
  2026-06-04 13:06 ` [PATCH net-next v2 0/3] netconsole: Fix reported problems Simon Horman
  3 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-02 14:26 UTC (permalink / raw)
  To: Breno Leitao, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, linux-kernel, kernel-team

find_skb() falls back to np->skb_pool when the GFP_ATOMIC alloc_skb()
fails. The pool is refilled by refill_skbs(), which always allocates
buffers of MAX_SKB_SIZE (ethhdr + iphdr + udphdr + MAX_UDP_CHUNK ==
1502 bytes).

netconsole, however, computes the requested length dynamically as

        total_len + np->dev->needed_tailroom

If the egress device declares a non-zero needed_tailroom (e.g. some
tunnel or hardware accelerator devices), the required length can exceed
MAX_SKB_SIZE. The pooled skb is then handed back to the caller, which
immediately performs skb_put(skb, len), trips the tail > end check, and
triggers skb_over_panic().
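
The size arithmetic above can be checked with a small userspace sketch.
The header sizes are written out as constants (Ethernet 14, IPv4 without
options 20, UDP 8) instead of using sizeof() on the kernel structs, and
pool_buf_fits() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Header sizes as plain constants, for illustration only. */
enum {
	ETH_HDR_LEN = 14,	/* sizeof(struct ethhdr) */
	IP_HDR_LEN = 20,	/* sizeof(struct iphdr), no options */
	UDP_HDR_LEN = 8,	/* sizeof(struct udphdr) */
	UDP_CHUNK_MAX = 1460,	/* MAX_UDP_CHUNK */
	POOL_BUF_SIZE = ETH_HDR_LEN + IP_HDR_LEN + UDP_HDR_LEN + UDP_CHUNK_MAX,
};

static_assert(POOL_BUF_SIZE == 1502, "matches the 1502 bytes quoted above");

/* True when a pooled buffer is large enough for the requested length. */
static bool pool_buf_fits(size_t total_len, size_t needed_tailroom)
{
	return total_len + needed_tailroom <= (size_t)POOL_BUF_SIZE;
}
```

For example, a 1502-byte request with a hypothetical needed_tailroom of
16 asks for 1518 bytes, which no pooled buffer can hold.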

Leave the normal alloc_skb(len, GFP_ATOMIC) path untouched -- the slab
allocator can still satisfy oversized requests when memory is available,
so senders to devices with non-zero needed_tailroom keep working in the
common case. Only the pool fallback is gated: when alloc_skb() failed
and len exceeds the pool buffer size, skip the skb_dequeue() instead of
burning a pre-allocated skb on a request that would later trip
skb_over_panic(). Reserving pool entries for requests they can actually
satisfy also keeps the panic path, which depends on the pool being
primed, intact.

When that drop happens, emit a rate-limited net_warn() so the user
notices that netconsole is unable to push messages on the egress device.
The warn is skipped under in_nmi() for the same reason schedule_work()
is: printk machinery taken by net_warn_ratelimited() is not NMI-safe and
would risk recursing into the same nbcon console we are servicing.
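
A minimal userspace sketch of the gating described above, assuming a
counter in place of the rate-limited warning and an explicit flag in
place of in_nmi(). struct fallback_pool and pool_fallback() are
hypothetical names. Note that the sketch models the behaviour this
message describes; the diff in this posting uses WARN_ON_ONCE() for the
size check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_BUF_SIZE 1502u	/* MAX_SKB_SIZE value quoted in the text */

/* Hypothetical model of the fallback pool state. */
struct fallback_pool {
	int available;	/* pre-allocated buffers left */
	int warnings;	/* simulated rate-limited warnings emitted */
};

/* Returns true when a pooled buffer was handed out. */
static bool pool_fallback(struct fallback_pool *p, size_t len,
			  bool in_nmi_ctx)
{
	assert(p != NULL);

	if (len > POOL_BUF_SIZE) {
		/* Oversized: keep the pool intact, warn only where safe. */
		if (!in_nmi_ctx)
			p->warnings++;
		return false;
	}

	if (p->available == 0)
		return false;

	p->available--;
	return true;
}
```

An oversized request leaves the pool untouched in either context, so
the remaining buffers stay available for requests they can satisfy.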

MAX_SKB_SIZE / MAX_UDP_CHUNK were private to net/core/netpoll.c. Move
them to include/linux/netpoll.h so netconsole can reference the same
definition that refill_skbs() uses, keeping the two in sync by
construction. The header now pulls in <linux/ip.h> and <linux/udp.h>
explicitly so MAX_SKB_SIZE remains self-contained for any future user.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/net/netconsole.c |  7 ++++++-
 include/linux/netpoll.h  | 16 ++++++++++++++++
 net/core/netpoll.c       |  7 -------
 3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 918e4a9f4456..b77879ead641 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1680,8 +1680,13 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 repeat:

 	skb = alloc_skb(len, GFP_ATOMIC);
-	if (!skb)
+	if (!skb) {
+		/* The pool is refilled with MAX_SKB_SIZE buffers */
+		if (WARN_ON_ONCE(len > MAX_SKB_SIZE))
+			return NULL;
+
 		skb = netcons_skb_pop(np);
+	}

 	if (!skb) {
 		if (++count < 10) {
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index e4b8f1f91e54..88f7daa8560e 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -13,12 +13,28 @@
 #include <linux/rcupdate.h>
 #include <linux/list.h>
 #include <linux/refcount.h>
+#include <linux/ip.h>
+#include <linux/udp.h>

 union inet_addr {
 	__be32		ip;
 	struct in6_addr	in6;
 };

+/*
+ * Maximum payload netpoll's preallocated skb pool can carry. Keep this in
+ * sync with the buffer size used by refill_skbs() in net/core/netpoll.c;
+ * callers (e.g. netconsole) use it to detect requests the pool can never
+ * satisfy and avoid dequeuing a pooled skb that would later trip
+ * skb_over_panic() in skb_put().
+ */
+#define MAX_UDP_CHUNK	1460
+#define MAX_SKB_SIZE						\
+	(sizeof(struct ethhdr) +				\
+	 sizeof(struct iphdr) +					\
+	 sizeof(struct udphdr) +				\
+	 MAX_UDP_CHUNK)
+
 struct netpoll {
 	struct net_device *dev;
 	netdevice_tracker dev_tracker;
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index b3fe59445f2d..229dde818ab3 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -41,16 +41,9 @@
  * message gets out even in extreme OOM situations.
  */

-#define MAX_UDP_CHUNK 1460
 #define MAX_SKBS 32
 #define USEC_PER_POLL	50

-#define MAX_SKB_SIZE							\
-	(sizeof(struct ethhdr) +					\
-	 sizeof(struct iphdr) +						\
-	 sizeof(struct udphdr) +					\
-	 MAX_UDP_CHUNK)
-
 static unsigned int carrier_timeout = 4;
 module_param(carrier_timeout, uint, 0644);

-- 
2.54.0


* [PATCH net-next v2 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target()
  2026-06-02 14:26 [PATCH net-next v2 0/3] netconsole: Fix reported problems Breno Leitao
  2026-06-02 14:26 ` [PATCH net-next v2 1/3] netconsole: do not schedule skb pool refill from NMI Breno Leitao
  2026-06-02 14:26 ` [PATCH net-next v2 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len Breno Leitao
@ 2026-06-02 14:26 ` Breno Leitao
  2026-06-04 13:06 ` [PATCH net-next v2 0/3] netconsole: Fix reported problems Simon Horman
  3 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-02 14:26 UTC (permalink / raw)
  To: Breno Leitao, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netdev, linux-kernel, kernel-team

drop_netconsole_target() unlinks the target while only holding
target_list_lock. However, when the underlying interface has been
unregistered, netconsole_netdev_event() moves the target from
target_list to target_cleanup_list, and netconsole_process_cleanups_core()
walks that list under target_cleanup_list_lock only.

If a user removes the configfs target at the same time the cleanup
worker is iterating target_cleanup_list, list_del() can corrupt the list
because the two paths take disjoint locks while operating on the same
list node.

Acquire target_cleanup_list_lock around the list_del() so the unlink is
serialised against netconsole_process_cleanups_core() regardless of
which list the target currently belongs to. The state transition that
downgrades STATE_DEACTIVATED to STATE_DISABLED is left intact and is
performed under the same combined locking, preserving the existing
ordering with resume_target().
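
The locking pattern can be sketched in userspace with two pthread
mutexes taken in a fixed order around the unlink. This is only a model
under stated assumptions: the kernel code pairs a mutex with an
irqsave spinlock, and struct node, node_init(), node_add() and
node_unlink() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical circular doubly linked list node. */
struct node {
	struct node *prev, *next;
};

/* Outer lock models target_cleanup_list_lock, inner models target_list_lock. */
static pthread_mutex_t cleanup_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static void node_init(struct node *n)
{
	n->prev = n->next = n;
}

/* Insert n right after head; locking is omitted here for brevity. */
static void node_add(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink n while holding both locks, whichever list currently owns it. */
static void node_unlink(struct node *n)
{
	assert(n != NULL);

	pthread_mutex_lock(&cleanup_lock);
	pthread_mutex_lock(&list_lock);
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
	pthread_mutex_unlock(&list_lock);
	pthread_mutex_unlock(&cleanup_lock);
}
```

Any walker that holds either lock is excluded while the node is being
unlinked, which is the property the patch relies on.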

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/net/netconsole.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index b77879ead641..59712d1d75fd 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1452,6 +1452,7 @@ static void drop_netconsole_target(struct config_group *group,

 	dynamic_netconsole_mutex_lock();

+	mutex_lock(&target_cleanup_list_lock);
 	spin_lock_irqsave(&target_list_lock, flags);
 	/* Disable deactivated target to prevent races between resume attempt
 	 * and target removal.
@@ -1460,6 +1461,7 @@ static void drop_netconsole_target(struct config_group *group,
 		nt->state = STATE_DISABLED;
 	list_del(&nt->list);
 	spin_unlock_irqrestore(&target_list_lock, flags);
+	mutex_unlock(&target_cleanup_list_lock);

 	dynamic_netconsole_mutex_unlock();

-- 
2.54.0


* Re: [PATCH net-next v2 0/3] netconsole: Fix reported problems
  2026-06-02 14:26 [PATCH net-next v2 0/3] netconsole: Fix reported problems Breno Leitao
                   ` (2 preceding siblings ...)
  2026-06-02 14:26 ` [PATCH net-next v2 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target() Breno Leitao
@ 2026-06-04 13:06 ` Simon Horman
  2026-06-04 13:59   ` Breno Leitao
  3 siblings, 1 reply; 6+ messages in thread
From: Simon Horman @ 2026-06-04 13:06 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel, kernel-team

On Tue, Jun 02, 2026 at 07:26:56AM -0700, Breno Leitao wrote:
> These are some of the issues that LLM reported to netconsole, and they
> are being addressed here before big refactors.
> 
> I was doing some big refactors, and got some "pre-existent-issues"
> during LLM review of the refactor, that make them hard to guarantee that
> refactor is not introducing any bug, so, let's clean these pre-existent
> bugs first, and then submit the refactor.
> 
> The issues fixed in this patchset were reported during the review of
> https://lore.kernel.org/all/20260524-netconsole_move_more-v1-0-909d1ab398b4@debian.org/
> 
> Not all of them got fixed, but, those that were easy to reason about.
> 
> Why net-next and not 'net' tree.
> 
> Most of the functions that are being fixed here moved from netpoll to
> netconsole, thus, fixing this on net will cause merge conflicts from
> 'net' to 'net-next', thus I decided to fix it on 'net-next', given we
> are on 7.1-rc6 already. Sorry if that is not the right approach.
> 
> Changed from v1:
>   * Change it from 'net' to 'net-next'.

Hi Breno,

There is an AI-generated review of this patch-set available on both
https://sashiko.dev and https://netdev-ai.bots.linux.dev/sashiko/

I would appreciate it if you could look over that with a view
to addressing any issues that directly affect this patch-set.


* Re: [PATCH net-next v2 0/3] netconsole: Fix reported problems
  2026-06-04 13:06 ` [PATCH net-next v2 0/3] netconsole: Fix reported problems Simon Horman
@ 2026-06-04 13:59   ` Breno Leitao
  0 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-04 13:59 UTC (permalink / raw)
  To: Simon Horman
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, netdev, linux-kernel, kernel-team

On Thu, Jun 04, 2026 at 02:06:22PM +0100, Simon Horman wrote:
> On Tue, Jun 02, 2026 at 07:26:56AM -0700, Breno Leitao wrote:
> > These are some of the issues that LLM reported to netconsole, and they
> > are being addressed here before big refactors.
> >
> > I was doing some big refactors, and got some "pre-existent-issues"
> > during LLM review of the refactor, that make them hard to guarantee that
> > refactor is not introducing any bug, so, let's clean these pre-existent
> > bugs first, and then submit the refactor.
> >
> > The issues fixed in this patchset were reported during the review of
> > https://lore.kernel.org/all/20260524-netconsole_move_more-v1-0-909d1ab398b4@debian.org/
> >
> > Not all of them got fixed, but, those that were easy to reason about.
> >
> > Why net-next and not 'net' tree.
> >
> > Most of the functions that are being fixed here moved from netpoll to
> > netconsole, thus, fixing this on net will cause merge conflicts from
> > 'net' to 'net-next', thus I decided to fix it on 'net-next', given we
> > are on 7.1-rc6 already. Sorry if that is not the right approach.
> >
> > Changed from v1:
> >   * Change it from 'net' to 'net-next'.
>
> Hi Breno,
>
> There is AI-generated review of this patch-set available on both
> https://sashiko.dev and https://netdev-ai.bots.linux.dev/sashiko/
>
> I would appreciate it if you could look over that with a view
> to addressing any issues that directly affect this patch-set.

Ack. While most of the reported issues are pre-existing and are being
fixed, there is one genuine regression: invoking WARN_ON_ONCE() in a
potential NMI context is problematic, since it may trigger panic_on_warn.

I will send a revised version.

Thanks,
--breno

--
pw-bot: cr

