* [PATCH net 1/3] netconsole: do not schedule skb pool refill from NMI
2026-05-29 7:45 [PATCH net 0/3] netconsole: Fix reported problems Breno Leitao
@ 2026-05-29 7:45 ` Breno Leitao
2026-05-29 7:45 ` [PATCH net 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len Breno Leitao
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Breno Leitao @ 2026-05-29 7:45 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Neil Horman, Cong Wang
Cc: netdev, linux-kernel, Breno Leitao, kernel-team
When alloc_skb() fails in find_skb(), the fallback path dequeues an skb
from np->skb_pool and unconditionally calls schedule_work() to top the
pool back up. schedule_work() ends up taking the workqueue pool locks,
which are not NMI-safe.

netconsole_write() is registered as the nbcon write_atomic callback and
is explicitly marked CON_NBCON_ATOMIC_UNSAFE, meaning it is invoked from
emergency/panic contexts including NMIs. If the NMI interrupts a thread
already holding the workqueue pool lock, calling schedule_work()
self-deadlocks and the panic message that was being printed is lost.

Introduce netcons_skb_pop() to fold the pool dequeue and the refill
request into a single helper. The helper skips schedule_work() when
called from NMI context; the pool is best-effort, and the next non-NMI
invocation of find_skb() will refill it. This keeps the fast path
untouched, the panic path NMI-safe, and the locking rules around the
fallback pool documented in one place.

Fixes: 248f6571fd4c ("netpoll: Optimize skb refilling on critical path")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/net/netconsole.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index d804d44af87c..699bdfa1fb45 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1654,6 +1654,23 @@ static struct notifier_block netconsole_netdev_notifier = {
 	.notifier_call = netconsole_netdev_event,
 };
+/* Pop a pre-allocated skb from the pool and request a refill.
+ *
+ * The refill is requested via schedule_work(), which takes the workqueue
+ * pool locks and is therefore not NMI-safe. Skip the refill when called
+ * from NMI context; the next non-NMI caller will top the pool back up.
+ */
+static struct sk_buff *netcons_skb_pop(struct netpoll *np)
+{
+	struct sk_buff *skb;
+
+	skb = skb_dequeue(&np->skb_pool);
+	if (!in_nmi())
+		schedule_work(&np->refill_wq);
+
+	return skb;
+}
+
 static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 {
 	int count = 0;
@@ -1663,10 +1680,8 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 repeat:
 	skb = alloc_skb(len, GFP_ATOMIC);
-	if (!skb) {
-		skb = skb_dequeue(&np->skb_pool);
-		schedule_work(&np->refill_wq);
-	}
+	if (!skb)
+		skb = netcons_skb_pop(np);
 	if (!skb) {
 		if (++count < 10) {
--
2.51.0
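
The gating in netcons_skb_pop() can be tried outside the kernel with a
small userspace model. This is only a sketch under stated assumptions:
struct model_pool, model_skb_pop() and the in_nmi_ctx flag are
hypothetical stand-ins for np->skb_pool, netcons_skb_pop() and in_nmi(),
and a plain counter replaces schedule_work().

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the fallback pool; not a kernel API. */
struct model_pool {
	int avail;		/* buffers left in the fallback pool */
	int refill_requests;	/* times a refill would have been scheduled */
};

/*
 * Models the shape of netcons_skb_pop(): always try to dequeue, but only
 * request a refill when the (assumed) caller context is not NMI.
 * Returns true if a buffer was taken from the pool.
 */
static bool model_skb_pop(struct model_pool *pool, bool in_nmi_ctx)
{
	bool got = false;

	if (pool->avail > 0) {
		pool->avail--;
		got = true;
	}

	/* Stand-in for schedule_work(); skipped in the simulated NMI case. */
	if (!in_nmi_ctx)
		pool->refill_requests++;

	return got;
}
```

A call with in_nmi_ctx set takes a buffer without touching
refill_requests. A later call with it clear requests the refill even
when the pool is already empty, which mirrors how the next non-NMI
caller tops the pool back up.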
* [PATCH net 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len
From: Breno Leitao @ 2026-05-29 7:45 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Neil Horman, Cong Wang
Cc: netdev, linux-kernel, Breno Leitao, kernel-team
find_skb() falls back to np->skb_pool when the GFP_ATOMIC alloc_skb()
fails. The pool is refilled by refill_skbs(), which always allocates
buffers of MAX_SKB_SIZE (ethhdr + iphdr + udphdr + MAX_UDP_CHUNK ==
1502 bytes).

netconsole, however, computes the requested length dynamically as:

	total_len + np->dev->needed_tailroom

If the egress device declares a non-zero needed_tailroom (e.g. some
tunnel or hardware accelerator devices), the required length can exceed
MAX_SKB_SIZE. The pooled skb is then handed back to the caller, which
immediately performs skb_put(skb, len), trips the tail > end check, and
triggers skb_over_panic().

Leave the normal alloc_skb(len, GFP_ATOMIC) path untouched -- the slab
allocator can still satisfy oversized requests when memory is available,
so senders to devices with non-zero needed_tailroom keep working in the
common case. Only the pool fallback is gated: when alloc_skb() failed
and len exceeds the pool buffer size, skip the skb_dequeue() instead of
burning a pre-allocated skb on a request that would later trip
skb_over_panic(). Reserving pool entries for requests they can actually
satisfy also keeps the panic path, which depends on the pool being
primed, intact.

When that drop happens, emit a rate-limited net_warn() so the user
notices that netconsole is unable to push messages on the egress device.
The warn is skipped under in_nmi() for the same reason schedule_work()
is: printk machinery taken by net_warn_ratelimited() is not NMI-safe and
would risk recursing into the same nbcon console we are servicing.

MAX_SKB_SIZE / MAX_UDP_CHUNK were private to net/core/netpoll.c. Move
them to include/linux/netpoll.h so netconsole can reference the same
definition that refill_skbs() uses, keeping the two in sync by
construction. The header now pulls in <linux/ip.h> and <linux/udp.h>
explicitly so MAX_SKB_SIZE remains self-contained for any future user.

Fixes: 954fba027405 ("netpoll: fix netpoll_send_udp() bugs")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/net/netconsole.c | 7 ++++++-
include/linux/netpoll.h | 16 ++++++++++++++++
net/core/netpoll.c | 7 -------
3 files changed, 22 insertions(+), 8 deletions(-)
diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 699bdfa1fb45..a3dcbe713a0b 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1680,8 +1680,13 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
 repeat:
 	skb = alloc_skb(len, GFP_ATOMIC);
-	if (!skb)
+	if (!skb) {
+		/* The pool is refilled with MAX_SKB_SIZE buffers */
+		if (WARN_ON_ONCE(len > MAX_SKB_SIZE))
+			return NULL;
+
 		skb = netcons_skb_pop(np);
+	}
 	if (!skb) {
 		if (++count < 10) {
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index e4b8f1f91e54..88f7daa8560e 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -13,12 +13,28 @@
 #include <linux/rcupdate.h>
 #include <linux/list.h>
 #include <linux/refcount.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
 union inet_addr {
 	__be32 ip;
 	struct in6_addr in6;
 };
+/*
+ * Maximum payload netpoll's preallocated skb pool can carry. Keep this in
+ * sync with the buffer size used by refill_skbs() in net/core/netpoll.c;
+ * callers (e.g. netconsole) use it to detect requests the pool can never
+ * satisfy and avoid dequeuing a pooled skb that would later trip
+ * skb_over_panic() in skb_put().
+ */
+#define MAX_UDP_CHUNK 1460
+#define MAX_SKB_SIZE \
+	(sizeof(struct ethhdr) + \
+	 sizeof(struct iphdr) + \
+	 sizeof(struct udphdr) + \
+	 MAX_UDP_CHUNK)
+
 struct netpoll {
 	struct net_device *dev;
 	netdevice_tracker dev_tracker;
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index b3fe59445f2d..229dde818ab3 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -41,16 +41,9 @@
  * message gets out even in extreme OOM situations.
  */
-#define MAX_UDP_CHUNK 1460
 #define MAX_SKBS 32
 #define USEC_PER_POLL 50
-#define MAX_SKB_SIZE \
-	(sizeof(struct ethhdr) + \
-	 sizeof(struct iphdr) + \
-	 sizeof(struct udphdr) + \
-	 MAX_UDP_CHUNK)
-
 static unsigned int carrier_timeout = 4;
 module_param(carrier_timeout, uint, 0644);
--
2.51.0
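
The size arithmetic and the gating condition can be checked in
isolation. The sketch below is an assumption-laden model, not the patch
itself: the MODEL_* constants hard-code the usual header sizes (14-byte
Ethernet header, 20-byte IPv4 header without options, 8-byte UDP header)
instead of using sizeof() on the kernel structs, and
model_pool_can_satisfy() is a hypothetical helper that mirrors the
"len > MAX_SKB_SIZE" test.

```c
#include <stdbool.h>

/* Assumed header sizes; the kernel derives these from sizeof(). */
enum {
	MODEL_ETH_HLEN = 14,	/* struct ethhdr */
	MODEL_IP_HLEN = 20,	/* struct iphdr, no options */
	MODEL_UDP_HLEN = 8,	/* struct udphdr */
	MODEL_MAX_UDP_CHUNK = 1460,
	MODEL_MAX_SKB_SIZE = MODEL_ETH_HLEN + MODEL_IP_HLEN +
			     MODEL_UDP_HLEN + MODEL_MAX_UDP_CHUNK,
};

/*
 * Hypothetical helper: true when a pooled buffer of MODEL_MAX_SKB_SIZE
 * bytes is large enough for total_len plus the device's needed_tailroom.
 */
static bool model_pool_can_satisfy(int total_len, int needed_tailroom)
{
	int len = total_len + needed_tailroom;

	return len <= MODEL_MAX_SKB_SIZE;
}
```

With these assumptions MODEL_MAX_SKB_SIZE works out to 1502, so a
full-size message with zero tailroom still fits, while the same message
on a device asking for any extra tailroom does not and would have to be
dropped on the fallback path.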
* [PATCH net 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target()
From: Breno Leitao @ 2026-05-29 7:45 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Neil Horman, Cong Wang
Cc: netdev, linux-kernel, Breno Leitao, kernel-team
drop_netconsole_target() unlinks the target while only holding
target_list_lock. However, when the underlying interface has been
unregistered, netconsole_netdev_event() moves the target from
target_list to target_cleanup_list, and netconsole_process_cleanups_core()
walks that list under target_cleanup_list_lock only.

If a user removes the configfs target at the same time the cleanup
worker is iterating target_cleanup_list, list_del() can corrupt the list
because the two paths take disjoint locks while operating on the same
list node.

Acquire target_cleanup_list_lock around the list_del() so the unlink is
serialised against netconsole_process_cleanups_core() regardless of
which list the target currently belongs to. The state transition that
downgrades STATE_DEACTIVATED to STATE_DISABLED is left intact and is
performed under the same combined locking, preserving the existing
ordering with resume_target().

Fixes: 97714695ef90 ("net: netconsole: Defer netpoll cleanup to avoid lock release during list traversal")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/net/netconsole.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index a3dcbe713a0b..9e15d4186436 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1452,6 +1452,7 @@ static void drop_netconsole_target(struct config_group *group,
 	dynamic_netconsole_mutex_lock();
+	mutex_lock(&target_cleanup_list_lock);
 	spin_lock_irqsave(&target_list_lock, flags);
 	/* Disable deactivated target to prevent races between resume attempt
 	 * and target removal.
@@ -1460,6 +1461,7 @@ static void drop_netconsole_target(struct config_group *group,
 		nt->state = STATE_DISABLED;
 	list_del(&nt->list);
 	spin_unlock_irqrestore(&target_list_lock, flags);
+	mutex_unlock(&target_cleanup_list_lock);
 	dynamic_netconsole_mutex_unlock();
--
2.51.0
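
The combined locking can be illustrated with a userspace model. This is
a sketch under stated assumptions, not the driver code: pthread mutexes
stand in for both the kernel mutex and the irq-saving spinlock, and
struct model_node, the model_list_*() helpers and model_drop_target()
are hypothetical names. The point is only that the unlink runs with both
locks held, nested in the same order as in the diff, so it is safe
whichever list the node is on.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical minimal doubly linked list node, modelled on list_head. */
struct model_node {
	struct model_node *prev, *next;
};

/* Stand-ins for target_cleanup_list_lock and target_list_lock. */
static pthread_mutex_t model_cleanup_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t model_target_list_lock = PTHREAD_MUTEX_INITIALIZER;

static void model_list_init(struct model_node *head)
{
	head->next = head;
	head->prev = head;
}

static void model_list_add(struct model_node *n, struct model_node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static void model_list_del(struct model_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n;
	n->prev = n;
}

static bool model_list_empty(const struct model_node *head)
{
	return head->next == head;
}

/* Unlink with both locks held: cleanup-list lock outside, list lock inside. */
static void model_drop_target(struct model_node *nt)
{
	pthread_mutex_lock(&model_cleanup_list_lock);
	pthread_mutex_lock(&model_target_list_lock);
	model_list_del(nt);
	pthread_mutex_unlock(&model_target_list_lock);
	pthread_mutex_unlock(&model_cleanup_list_lock);
}
```

Any walker of the modelled cleanup list would take
model_cleanup_list_lock for the duration of its walk, so it can no
longer observe a node being unlinked underneath it.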
* Re: [PATCH net 0/3] netconsole: Fix reported problems
From: Breno Leitao @ 2026-06-01 10:31 UTC (permalink / raw)
To: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Neil Horman, Cong Wang
Cc: netdev, linux-kernel, kernel-team
On Fri, May 29, 2026 at 03:45:10AM -0400, Breno Leitao wrote:
> These are some of the issues that an LLM reported in netconsole, and
> they are being addressed here before the big refactors.
>
> I was doing some big refactors and got some "pre-existing issues"
> flagged during LLM review of the refactor. They make it hard to
> guarantee that the refactor is not introducing any bugs, so let's fix
> these pre-existing bugs first and then submit the refactor.
>
> The issues fixed in this patchset were reported during the review of
> https://lore.kernel.org/all/20260524-netconsole_move_more-v1-0-909d1ab398b4@debian.org/
>
> Not all of them are fixed here, only those that were easy to reason
> about.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Somehow this patch didn't apply to the 'net' tree, so the tests haven't
run.
https://patchwork.kernel.org/project/netdevbpf/patch/20260529-netcons_fix_before_move-v1-1-cb2d1426dd75@debian.org/
I will respin it.
--
pw-bot: cr