Netdev List
* [PATCH net 0/3] netconsole: Fix reported problems
@ 2026-05-29  7:45 Breno Leitao
  2026-05-29  7:45 ` [PATCH net 1/3] netconsole: do not schedule skb pool refill from NMI Breno Leitao
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Breno Leitao @ 2026-05-29  7:45 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Neil Horman, Cong Wang
  Cc: netdev, linux-kernel, Breno Leitao, kernel-team

This series fixes some of the netconsole issues that an LLM reported,
ahead of a larger refactor.

While working on that refactor, the LLM review flagged several
pre-existing issues. They make it hard to guarantee that the refactor
itself does not introduce any bug, so let's fix these pre-existing
bugs first, and then submit the refactor.

The issues fixed in this patchset were reported during the review of
https://lore.kernel.org/all/20260524-netconsole_move_more-v1-0-909d1ab398b4@debian.org/

Not all of the reported issues are fixed here, only those that were easy
to reason about.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (3):
      netconsole: do not schedule skb pool refill from NMI
      netconsole: do not dequeue pooled skbs that cannot satisfy len
      netconsole: take target_cleanup_list_lock in drop_netconsole_target()

 drivers/net/netconsole.c | 26 ++++++++++++++++++++++++--
 include/linux/netpoll.h  | 16 ++++++++++++++++
 net/core/netpoll.c       |  7 -------
 3 files changed, 40 insertions(+), 9 deletions(-)
---
base-commit: e7e28506af98ce4e1059e5ec59334b335c00a246
change-id: 20260528-netcons_fix_before_move-cd6cfec4e8f5
prerequisite-change-id: 20260528-netconsole_fixes-35d4e1f88828:v1

Best regards,
--  
Breno Leitao <leitao@debian.org>



* [PATCH net 1/3] netconsole: do not schedule skb pool refill from NMI
  2026-05-29  7:45 [PATCH net 0/3] netconsole: Fix reported problems Breno Leitao
@ 2026-05-29  7:45 ` Breno Leitao
  2026-05-29  7:45 ` [PATCH net 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len Breno Leitao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Breno Leitao @ 2026-05-29  7:45 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Neil Horman, Cong Wang
  Cc: netdev, linux-kernel, Breno Leitao, kernel-team

When alloc_skb() fails in find_skb(), the fallback path dequeues an skb
from np->skb_pool and unconditionally calls schedule_work() to top the
pool back up. schedule_work() ends up taking the workqueue pool locks,
which are not NMI-safe.

netconsole_write() is registered as the nbcon write_atomic callback and
is explicitly marked CON_NBCON_ATOMIC_UNSAFE, meaning it is invoked from
emergency/panic contexts including NMIs. If the NMI interrupts a thread
already holding the workqueue pool lock, calling schedule_work()
self-deadlocks and the panic message that was being printed is lost.

Introduce netcons_skb_pop() to fold the pool dequeue and the refill
request into a single helper. The helper skips schedule_work() when
called from NMI context; the pool is best-effort, and the next non-NMI
invocation of find_skb() will refill it. This keeps the fast path
untouched, the panic path NMI-safe, and the locking rules around the
fallback pool documented in one place.

Fixes: 248f6571fd4c ("netpoll: Optimize skb refilling on critical path")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
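For readers following the reasoning above, here is a minimal userspace
sketch of the pop-then-maybe-refill pattern. It is not the kernel code:
struct fake_pool, fake_in_nmi, fake_schedule_refill() and fake_pool_pop()
are hypothetical stand-ins for np->skb_pool, in_nmi(), schedule_work()
and netcons_skb_pop(), and the "refill" only bumps a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct fake_pool {
	int available;       /* pre-allocated buffers left in the pool */
	int refill_requests; /* how many times a refill was requested */
};

static bool fake_in_nmi; /* stand-in for the kernel's in_nmi() */

/* Stand-in for schedule_work(): only counts the request. */
static void fake_schedule_refill(struct fake_pool *pool)
{
	pool->refill_requests++;
}

/* Same shape as netcons_skb_pop(): always try to dequeue, but only
 * request a refill outside NMI context.
 * Returns true when a buffer was handed out.
 */
static bool fake_pool_pop(struct fake_pool *pool)
{
	bool got = pool->available > 0;

	if (got)
		pool->available--;
	if (!fake_in_nmi)
		fake_schedule_refill(pool);

	return got;
}

/* Walks the NMI path, the ordinary path, and the empty-pool case. */
static void fake_pool_selftest(void)
{
	struct fake_pool pool = { .available = 2, .refill_requests = 0 };

	fake_in_nmi = true;            /* NMI path: pop, no refill */
	assert(fake_pool_pop(&pool));
	assert(pool.refill_requests == 0);

	fake_in_nmi = false;           /* ordinary path: pop and refill */
	assert(fake_pool_pop(&pool));
	assert(pool.refill_requests == 1);

	assert(!fake_pool_pop(&pool)); /* empty pool still asks for refill */
	assert(pool.refill_requests == 2);
}
```

As in the patch, the refill request does not depend on whether the
dequeue succeeded; only the calling context decides.
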
 drivers/net/netconsole.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index d804d44af87c..699bdfa1fb45 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1654,6 +1654,23 @@ static struct notifier_block netconsole_netdev_notifier = {
 	.notifier_call  = netconsole_netdev_event,
 };
 
+/* Pop a pre-allocated skb from the pool and request a refill.
+ *
+ * The refill is requested via schedule_work(), which takes the workqueue
+ * pool locks and is therefore not NMI-safe. Skip the refill when called
+ * from NMI context; the next non-NMI caller will top the pool back up.
+ */
+static struct sk_buff *netcons_skb_pop(struct netpoll *np)
+{
+	struct sk_buff *skb;
+
+	skb = skb_dequeue(&np->skb_pool);
+	if (!in_nmi())
+		schedule_work(&np->refill_wq);
+
+	return skb;
+}
+
 static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 {
 	int count = 0;
@@ -1663,10 +1680,8 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 repeat:
 
 	skb = alloc_skb(len, GFP_ATOMIC);
-	if (!skb) {
-		skb = skb_dequeue(&np->skb_pool);
-		schedule_work(&np->refill_wq);
-	}
+	if (!skb)
+		skb = netcons_skb_pop(np);
 
 	if (!skb) {
 		if (++count < 10) {

-- 
2.51.0



* [PATCH net 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len
  2026-05-29  7:45 [PATCH net 0/3] netconsole: Fix reported problems Breno Leitao
  2026-05-29  7:45 ` [PATCH net 1/3] netconsole: do not schedule skb pool refill from NMI Breno Leitao
@ 2026-05-29  7:45 ` Breno Leitao
  2026-05-29  7:45 ` [PATCH net 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target() Breno Leitao
  2026-06-01 10:31 ` [PATCH net 0/3] netconsole: Fix reported problems Breno Leitao
  3 siblings, 0 replies; 5+ messages in thread
From: Breno Leitao @ 2026-05-29  7:45 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Neil Horman, Cong Wang
  Cc: netdev, linux-kernel, Breno Leitao, kernel-team

find_skb() falls back to np->skb_pool when the GFP_ATOMIC alloc_skb()
fails. The pool is refilled by refill_skbs(), which always allocates
buffers of MAX_SKB_SIZE (ethhdr + iphdr + udphdr + MAX_UDP_CHUNK ==
1502 bytes).

netconsole, however, computes the requested length dynamically as

        total_len + np->dev->needed_tailroom

If the egress device declares a non-zero needed_tailroom (e.g. some
tunnel or hardware accelerator devices), the required length can exceed
MAX_SKB_SIZE. The pooled skb is then handed back to the caller, which
immediately performs skb_put(skb, len), trips the tail > end check, and
triggers skb_over_panic().

Leave the normal alloc_skb(len, GFP_ATOMIC) path untouched -- the slab
allocator can still satisfy oversized requests when memory is available,
so senders to devices with non-zero needed_tailroom keep working in the
common case. Only the pool fallback is gated: when alloc_skb() failed
and len exceeds the pool buffer size, skip the skb_dequeue() instead of
burning a pre-allocated skb on a request that would later trip
skb_over_panic(). Reserving pool entries for requests they can actually
satisfy also keeps the panic path, which depends on the pool being
primed, intact.

When that drop happens, emit a rate-limited net_warn() so the user
notices that netconsole is unable to push messages on the egress device.
The warn is skipped under in_nmi() for the same reason schedule_work()
is: printk machinery taken by net_warn_ratelimited() is not NMI-safe and
would risk recursing into the same nbcon console we are servicing.

MAX_SKB_SIZE / MAX_UDP_CHUNK were private to net/core/netpoll.c. Move
them to include/linux/netpoll.h so netconsole can reference the same
definition that refill_skbs() uses, keeping the two in sync by
construction. The header now pulls in <linux/ip.h> and <linux/udp.h>
explicitly so MAX_SKB_SIZE remains self-contained for any future user.

Fixes: 954fba027405 ("netpoll: fix netpoll_send_udp() bugs")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
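The size arithmetic in the message above can be checked with a small
userspace sketch. The SKETCH_* constants and sketch_pool_can_satisfy()
are hypothetical names introduced for illustration; the header sizes are
written out as plain numbers (14-byte Ethernet header, 20-byte IPv4
header without options, 8-byte UDP header) instead of using the kernel's
sizeof() expressions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace mirror of the kernel macros. */
#define SKETCH_ETH_HLEN       14   /* sizeof(struct ethhdr) */
#define SKETCH_IP_HLEN        20   /* sizeof(struct iphdr)  */
#define SKETCH_UDP_HLEN        8   /* sizeof(struct udphdr) */
#define SKETCH_MAX_UDP_CHUNK 1460
#define SKETCH_MAX_SKB_SIZE \
	(SKETCH_ETH_HLEN + SKETCH_IP_HLEN + SKETCH_UDP_HLEN + \
	 SKETCH_MAX_UDP_CHUNK)

/* Same comparison as the gate added to find_skb(): a request is only
 * rejected when its length is strictly larger than the pool buffer size.
 */
static bool sketch_pool_can_satisfy(size_t total_len, size_t needed_tailroom)
{
	return total_len + needed_tailroom <= SKETCH_MAX_SKB_SIZE;
}

static void sketch_size_selftest(void)
{
	assert(SKETCH_MAX_SKB_SIZE == 1502);

	/* A full-size message with no tailroom still fits the pool. */
	assert(sketch_pool_can_satisfy(1502, 0));

	/* Any extra tailroom on a full-size message no longer fits. */
	assert(!sketch_pool_can_satisfy(1502, 16));

	/* Small messages leave room for device tailroom. */
	assert(sketch_pool_can_satisfy(200, 64));
}
```

The 16- and 64-byte tailroom values are made up for the example; real
devices report their own needed_tailroom.
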
 drivers/net/netconsole.c |  7 ++++++-
 include/linux/netpoll.h  | 16 ++++++++++++++++
 net/core/netpoll.c       |  7 -------
 3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 699bdfa1fb45..a3dcbe713a0b 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1680,8 +1680,13 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 repeat:
 
 	skb = alloc_skb(len, GFP_ATOMIC);
-	if (!skb)
+	if (!skb) {
+		/* The pool is refilled with MAX_SKB_SIZE buffers */
+		if (WARN_ON_ONCE(len > MAX_SKB_SIZE))
+			return NULL;
+
 		skb = netcons_skb_pop(np);
+	}
 
 	if (!skb) {
 		if (++count < 10) {
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index e4b8f1f91e54..88f7daa8560e 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -13,12 +13,28 @@
 #include <linux/rcupdate.h>
 #include <linux/list.h>
 #include <linux/refcount.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
 
 union inet_addr {
 	__be32		ip;
 	struct in6_addr	in6;
 };
 
+/*
+ * Maximum payload netpoll's preallocated skb pool can carry. Keep this in
+ * sync with the buffer size used by refill_skbs() in net/core/netpoll.c;
+ * callers (e.g. netconsole) use it to detect requests the pool can never
+ * satisfy and avoid dequeuing a pooled skb that would later trip
+ * skb_over_panic() in skb_put().
+ */
+#define MAX_UDP_CHUNK	1460
+#define MAX_SKB_SIZE						\
+	(sizeof(struct ethhdr) +				\
+	 sizeof(struct iphdr) +					\
+	 sizeof(struct udphdr) +				\
+	 MAX_UDP_CHUNK)
+
 struct netpoll {
 	struct net_device *dev;
 	netdevice_tracker dev_tracker;
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index b3fe59445f2d..229dde818ab3 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -41,16 +41,9 @@
  * message gets out even in extreme OOM situations.
  */
 
-#define MAX_UDP_CHUNK 1460
 #define MAX_SKBS 32
 #define USEC_PER_POLL	50
 
-#define MAX_SKB_SIZE							\
-	(sizeof(struct ethhdr) +					\
-	 sizeof(struct iphdr) +						\
-	 sizeof(struct udphdr) +					\
-	 MAX_UDP_CHUNK)
-
 static unsigned int carrier_timeout = 4;
 module_param(carrier_timeout, uint, 0644);
 

-- 
2.51.0



* [PATCH net 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target()
  2026-05-29  7:45 [PATCH net 0/3] netconsole: Fix reported problems Breno Leitao
  2026-05-29  7:45 ` [PATCH net 1/3] netconsole: do not schedule skb pool refill from NMI Breno Leitao
  2026-05-29  7:45 ` [PATCH net 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len Breno Leitao
@ 2026-05-29  7:45 ` Breno Leitao
  2026-06-01 10:31 ` [PATCH net 0/3] netconsole: Fix reported problems Breno Leitao
  3 siblings, 0 replies; 5+ messages in thread
From: Breno Leitao @ 2026-05-29  7:45 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Neil Horman, Cong Wang
  Cc: netdev, linux-kernel, Breno Leitao, kernel-team

drop_netconsole_target() unlinks the target while only holding
target_list_lock. However, when the underlying interface has been
unregistered, netconsole_netdev_event() moves the target from
target_list to target_cleanup_list, and netconsole_process_cleanups_core()
walks that list under target_cleanup_list_lock only.

If a user removes the configfs target at the same time the cleanup
worker is iterating target_cleanup_list, list_del() can corrupt the list
because the two paths take disjoint locks while operating on the same
list node.

Acquire target_cleanup_list_lock around the list_del() so the unlink is
serialised against netconsole_process_cleanups_core() regardless of
which list the target currently belongs to. The state transition that
downgrades STATE_DEACTIVATED to STATE_DISABLED is left intact and is
performed under the same combined locking, preserving the existing
ordering with resume_target().

Fixes: 97714695ef90 ("net: netconsole: Defer netpoll cleanup to avoid lock release during list traversal")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
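The locking rule described above can be modelled in userspace as
follows. This is only a sketch under stated assumptions: pthread mutexes
stand in for both target_cleanup_list_lock (a mutex) and
target_list_lock (a spinlock in the kernel), and every sketch_* name is
hypothetical. The point it shows is that the unlink takes both locks, so
it is safe whichever list the node is on.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Minimal circular doubly linked list, similar in spirit to list_head. */
struct sketch_node {
	struct sketch_node *prev, *next;
};

static void sketch_list_init(struct sketch_node *head)
{
	head->prev = head;
	head->next = head;
}

static void sketch_list_add(struct sketch_node *n, struct sketch_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void sketch_list_del(struct sketch_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n;
	n->next = n;
}

static int sketch_list_empty(const struct sketch_node *head)
{
	return head->next == head;
}

static struct sketch_node sketch_active_list;  /* models target_list */
static struct sketch_node sketch_cleanup_list; /* models target_cleanup_list */
static pthread_mutex_t sketch_cleanup_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sketch_active_lock = PTHREAD_MUTEX_INITIALIZER;

/* Unlink a node without knowing which list it is on: take the outer
 * (sleeping) lock first, then the inner one, and release in reverse.
 */
static void sketch_drop_target(struct sketch_node *n)
{
	pthread_mutex_lock(&sketch_cleanup_lock);
	pthread_mutex_lock(&sketch_active_lock);
	sketch_list_del(n);
	pthread_mutex_unlock(&sketch_active_lock);
	pthread_mutex_unlock(&sketch_cleanup_lock);
}

static void sketch_drop_selftest(void)
{
	struct sketch_node on_active, on_cleanup;

	sketch_list_init(&sketch_active_list);
	sketch_list_init(&sketch_cleanup_list);
	sketch_list_add(&on_active, &sketch_active_list);
	sketch_list_add(&on_cleanup, &sketch_cleanup_list);

	sketch_drop_target(&on_active);
	sketch_drop_target(&on_cleanup);

	assert(sketch_list_empty(&sketch_active_list));
	assert(sketch_list_empty(&sketch_cleanup_list));
}
```

The mutex is taken outside the spinlock stand-in, matching the order in
the diff below, because a sleeping lock cannot be acquired while a
spinlock is held.
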
 drivers/net/netconsole.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index a3dcbe713a0b..9e15d4186436 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1452,6 +1452,7 @@ static void drop_netconsole_target(struct config_group *group,
 
 	dynamic_netconsole_mutex_lock();
 
+	mutex_lock(&target_cleanup_list_lock);
 	spin_lock_irqsave(&target_list_lock, flags);
 	/* Disable deactivated target to prevent races between resume attempt
 	 * and target removal.
@@ -1460,6 +1461,7 @@ static void drop_netconsole_target(struct config_group *group,
 		nt->state = STATE_DISABLED;
 	list_del(&nt->list);
 	spin_unlock_irqrestore(&target_list_lock, flags);
+	mutex_unlock(&target_cleanup_list_lock);
 
 	dynamic_netconsole_mutex_unlock();
 

-- 
2.51.0



* Re: [PATCH net 0/3] netconsole: Fix reported problems
  2026-05-29  7:45 [PATCH net 0/3] netconsole: Fix reported problems Breno Leitao
                   ` (2 preceding siblings ...)
  2026-05-29  7:45 ` [PATCH net 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target() Breno Leitao
@ 2026-06-01 10:31 ` Breno Leitao
  3 siblings, 0 replies; 5+ messages in thread
From: Breno Leitao @ 2026-06-01 10:31 UTC (permalink / raw)
  To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Neil Horman, Cong Wang
  Cc: netdev, linux-kernel, kernel-team

On Fri, May 29, 2026 at 03:45:10AM -0400, Breno Leitao wrote:
> This series fixes some of the netconsole issues that an LLM reported,
> ahead of a larger refactor.
> 
> While working on that refactor, the LLM review flagged several
> pre-existing issues. They make it hard to guarantee that the refactor
> itself does not introduce any bug, so let's fix these pre-existing
> bugs first, and then submit the refactor.
> 
> The issues fixed in this patchset were reported during the review of
> https://lore.kernel.org/all/20260524-netconsole_move_more-v1-0-909d1ab398b4@debian.org/
> 
> Not all of the reported issues are fixed here, only those that were easy
> to reason about.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Somehow this patch didn't apply to the 'net' tree and the tests haven't
run.

https://patchwork.kernel.org/project/netdevbpf/patch/20260529-netcons_fix_before_move-v1-1-cb2d1426dd75@debian.org/

I will respin it.

--
pw-bot: cr
