* [PATCH net-next v2 1/3] netconsole: do not schedule skb pool refill from NMI
2026-06-02 14:26 [PATCH net-next v2 0/3] netconsole: Fix reported problems Breno Leitao
@ 2026-06-02 14:26 ` Breno Leitao
2026-06-02 14:26 ` [PATCH net-next v2 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len Breno Leitao
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-02 14:26 UTC (permalink / raw)
To: Breno Leitao, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: netdev, linux-kernel, kernel-team
When alloc_skb() fails in find_skb(), the fallback path dequeues an skb
from np->skb_pool and unconditionally calls schedule_work() to top the
pool back up. schedule_work() ends up taking the workqueue pool locks,
which are not NMI-safe.

netconsole_write() is registered as the nbcon write_atomic callback and
is explicitly marked CON_NBCON_ATOMIC_UNSAFE, meaning it is invoked from
emergency/panic contexts including NMIs. If the NMI interrupts a thread
already holding the workqueue pool lock, calling schedule_work()
self-deadlocks and the panic message that was being printed is lost.

Introduce netcons_skb_pop() to fold the pool dequeue and the refill
request into a single helper. The helper skips schedule_work() when
called from NMI context; the pool is best-effort, and the next non-NMI
invocation of find_skb() will refill it. This keeps the fast path
untouched, the panic path NMI-safe, and the locking rules around the
fallback pool documented in one place.
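
To make the control flow of the helper easier to follow outside the
kernel, here is a minimal userspace sketch. It is an illustration only,
not the patch itself: struct model_pool, model_in_nmi and
model_pool_pop() are hypothetical stand-ins for np->skb_pool, in_nmi()
and netcons_skb_pop(), and the refill counter merely models the
schedule_work() call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct model_pool {
	int available;       /* pre-allocated buffers left in the pool */
	int refill_requests; /* times a refill was queued (models schedule_work()) */
};

/* Models in_nmi(): set by the caller to simulate the execution context. */
static bool model_in_nmi;

/*
 * Models netcons_skb_pop(): hand out one buffer if any is left, and
 * queue a refill only when the context allows it. Returns true if a
 * buffer was handed out.
 */
static bool model_pool_pop(struct model_pool *pool)
{
	bool got = false;

	assert(pool != NULL);

	if (pool->available > 0) {
		pool->available--;
		got = true;
	}

	/* Queuing work takes locks that are unsafe in NMI, so skip it there. */
	if (!model_in_nmi)
		pool->refill_requests++;

	return got;
}
```

In this model, a pop with model_in_nmi set leaves refill_requests
untouched, while the next pop outside NMI bumps it, which mirrors the
best-effort refill described above.
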
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/net/netconsole.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 8ecc2c71c699..918e4a9f4456 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1654,6 +1654,23 @@ static struct notifier_block netconsole_netdev_notifier = {
.notifier_call = netconsole_netdev_event,
};
+/* Pop a pre-allocated skb from the pool and request a refill.
+ *
+ * The refill is requested via schedule_work(), which takes the workqueue
+ * pool locks and is therefore not NMI-safe. Skip the refill when called
+ * from NMI context; the next non-NMI caller will top the pool back up.
+ */
+static struct sk_buff *netcons_skb_pop(struct netpoll *np)
+{
+ struct sk_buff *skb;
+
+ skb = skb_dequeue(&np->skb_pool);
+ if (!in_nmi())
+ schedule_work(&np->refill_wq);
+
+ return skb;
+}
+
static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
{
int count = 0;
@@ -1663,10 +1680,8 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
repeat:
skb = alloc_skb(len, GFP_ATOMIC);
- if (!skb) {
- skb = skb_dequeue(&np->skb_pool);
- schedule_work(&np->refill_wq);
- }
+ if (!skb)
+ skb = netcons_skb_pop(np);
if (!skb) {
if (++count < 10) {
--
2.54.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH net-next v2 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len
2026-06-02 14:26 [PATCH net-next v2 0/3] netconsole: Fix reported problems Breno Leitao
2026-06-02 14:26 ` [PATCH net-next v2 1/3] netconsole: do not schedule skb pool refill from NMI Breno Leitao
@ 2026-06-02 14:26 ` Breno Leitao
2026-06-02 14:26 ` [PATCH net-next v2 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target() Breno Leitao
2026-06-04 13:06 ` [PATCH net-next v2 0/3] netconsole: Fix reported problems Simon Horman
3 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-02 14:26 UTC (permalink / raw)
To: Breno Leitao, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: netdev, linux-kernel, kernel-team
find_skb() falls back to np->skb_pool when the GFP_ATOMIC alloc_skb()
fails. The pool is refilled by refill_skbs(), which always allocates
buffers of MAX_SKB_SIZE (ethhdr + iphdr + udphdr + MAX_UDP_CHUNK ==
1502 bytes).

netconsole, however, computes the requested length dynamically as:

    total_len + np->dev->needed_tailroom

If the egress device declares a non-zero needed_tailroom (e.g. some
tunnel or hardware accelerator devices), the required length can exceed
MAX_SKB_SIZE. The pooled skb is then handed back to the caller, which
immediately performs skb_put(skb, len), trips the tail > end check, and
triggers skb_over_panic().

Leave the normal alloc_skb(len, GFP_ATOMIC) path untouched -- the slab
allocator can still satisfy oversized requests when memory is available,
so senders to devices with non-zero needed_tailroom keep working in the
common case. Only the pool fallback is gated: when alloc_skb() failed
and len exceeds the pool buffer size, skip the skb_dequeue() instead of
burning a pre-allocated skb on a request that would later trip
skb_over_panic(). Reserving pool entries for requests they can actually
satisfy also keeps the panic path, which depends on the pool being
primed, intact.

When that drop happens, emit a rate-limited net_warn() so the user
notices that netconsole is unable to push messages on the egress device.
The warn is skipped under in_nmi() for the same reason schedule_work()
is: printk machinery taken by net_warn_ratelimited() is not NMI-safe and
would risk recursing into the same nbcon console we are servicing.

MAX_SKB_SIZE / MAX_UDP_CHUNK were private to net/core/netpoll.c. Move
them to include/linux/netpoll.h so netconsole can reference the same
definition that refill_skbs() uses, keeping the two in sync by
construction. The header now pulls in <linux/ip.h> and <linux/udp.h>
explicitly so MAX_SKB_SIZE remains self-contained for any future user.
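
The size arithmetic can be checked with a small userspace sketch. The
MODEL_* names and both helpers are hypothetical; the header sizes assume
an Ethernet header plus option-less IPv4 and UDP headers, which is what
the 1502-byte figure implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed header sizes: Ethernet 14, IPv4 without options 20, UDP 8. */
#define MODEL_ETH_HLEN       14
#define MODEL_IPV4_HLEN      20
#define MODEL_UDP_HLEN        8
#define MODEL_MAX_UDP_CHUNK 1460
#define MODEL_MAX_SKB_SIZE \
	(MODEL_ETH_HLEN + MODEL_IPV4_HLEN + MODEL_UDP_HLEN + MODEL_MAX_UDP_CHUNK)

/* Models the length netconsole asks for: payload plus device tailroom. */
static size_t model_request_len(size_t total_len, size_t needed_tailroom)
{
	/* Guard the addition against wrap-around in this model. */
	assert(total_len <= SIZE_MAX - needed_tailroom);

	return total_len + needed_tailroom;
}

/* Models the gate on the pool fallback: pooled buffers have a fixed size. */
static bool model_pool_can_satisfy(size_t len)
{
	return len <= MODEL_MAX_SKB_SIZE;
}
```

With these assumptions a 1502-byte request with no tailroom fits a
pooled buffer, while the same request on a device that needs, say, 16
bytes of tailroom does not and must not be served from the pool.
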
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/net/netconsole.c | 7 ++++++-
include/linux/netpoll.h | 16 ++++++++++++++++
net/core/netpoll.c | 7 -------
3 files changed, 22 insertions(+), 8 deletions(-)
diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 918e4a9f4456..b77879ead641 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1680,8 +1680,13 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)
repeat:
skb = alloc_skb(len, GFP_ATOMIC);
- if (!skb)
+ if (!skb) {
+ /* The pool is refilled with MAX_SKB_SIZE buffers */
+ if (WARN_ON_ONCE(len > MAX_SKB_SIZE))
+ return NULL;
+
skb = netcons_skb_pop(np);
+ }
if (!skb) {
if (++count < 10) {
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index e4b8f1f91e54..88f7daa8560e 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -13,12 +13,28 @@
#include <linux/rcupdate.h>
#include <linux/list.h>
#include <linux/refcount.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
union inet_addr {
__be32 ip;
struct in6_addr in6;
};
+/*
+ * Maximum payload netpoll's preallocated skb pool can carry. Keep this in
+ * sync with the buffer size used by refill_skbs() in net/core/netpoll.c;
+ * callers (e.g. netconsole) use it to detect requests the pool can never
+ * satisfy and avoid dequeuing a pooled skb that would later trip
+ * skb_over_panic() in skb_put().
+ */
+#define MAX_UDP_CHUNK 1460
+#define MAX_SKB_SIZE \
+ (sizeof(struct ethhdr) + \
+ sizeof(struct iphdr) + \
+ sizeof(struct udphdr) + \
+ MAX_UDP_CHUNK)
+
struct netpoll {
struct net_device *dev;
netdevice_tracker dev_tracker;
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index b3fe59445f2d..229dde818ab3 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -41,16 +41,9 @@
* message gets out even in extreme OOM situations.
*/
-#define MAX_UDP_CHUNK 1460
#define MAX_SKBS 32
#define USEC_PER_POLL 50
-#define MAX_SKB_SIZE \
- (sizeof(struct ethhdr) + \
- sizeof(struct iphdr) + \
- sizeof(struct udphdr) + \
- MAX_UDP_CHUNK)
-
static unsigned int carrier_timeout = 4;
module_param(carrier_timeout, uint, 0644);
--
2.54.0
* [PATCH net-next v2 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target()
2026-06-02 14:26 [PATCH net-next v2 0/3] netconsole: Fix reported problems Breno Leitao
2026-06-02 14:26 ` [PATCH net-next v2 1/3] netconsole: do not schedule skb pool refill from NMI Breno Leitao
2026-06-02 14:26 ` [PATCH net-next v2 2/3] netconsole: do not dequeue pooled skbs that cannot satisfy len Breno Leitao
@ 2026-06-02 14:26 ` Breno Leitao
2026-06-04 13:06 ` [PATCH net-next v2 0/3] netconsole: Fix reported problems Simon Horman
3 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-02 14:26 UTC (permalink / raw)
To: Breno Leitao, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman
Cc: netdev, linux-kernel, kernel-team
drop_netconsole_target() unlinks the target while only holding
target_list_lock. However, when the underlying interface has been
unregistered, netconsole_netdev_event() moves the target from
target_list to target_cleanup_list, and netconsole_process_cleanups_core()
walks that list under target_cleanup_list_lock only.

If a user removes the configfs target at the same time the cleanup
worker is iterating target_cleanup_list, list_del() can corrupt the list
because the two paths take disjoint locks while operating on the same
list node.

Acquire target_cleanup_list_lock around the list_del() so the unlink is
serialised against netconsole_process_cleanups_core() regardless of
which list the target currently belongs to. The state transition that
downgrades STATE_DEACTIVATED to STATE_DISABLED is left intact and is
performed under the same combined locking, preserving the existing
ordering with resume_target().
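
The locking rule can be illustrated with a userspace sketch, assuming
pthread mutexes as stand-ins for both kernel locks (in the kernel,
target_list_lock is a spinlock taken with interrupts disabled). All
model_* names are hypothetical, and the hand-rolled list only mimics
the kernel's list_head.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace model of a node that can sit on either list. */
struct model_node {
	struct model_node *prev, *next;
};

/* Lock order in this model: cleanup-list lock outside, target-list lock inside. */
static pthread_mutex_t model_cleanup_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t model_target_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Circular list heads that point at themselves while empty. */
static struct model_node model_target_list = { &model_target_list, &model_target_list };
static struct model_node model_cleanup_list = { &model_cleanup_list, &model_cleanup_list };

/* Insert right after head. Left unlocked here to keep the sketch short. */
static void model_list_add(struct model_node *node, struct model_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

static int model_list_empty(const struct model_node *head)
{
	return head->next == head;
}

/*
 * Models the unlink in drop_netconsole_target(): the caller does not
 * know which list the node is on, so it holds both locks, always in the
 * same order, while unlinking.
 */
static void model_drop_target(struct model_node *node)
{
	assert(node != NULL);

	pthread_mutex_lock(&model_cleanup_list_lock);
	pthread_mutex_lock(&model_target_list_lock);

	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = node; /* leave the node self-linked */

	pthread_mutex_unlock(&model_target_list_lock);
	pthread_mutex_unlock(&model_cleanup_list_lock);
}
```

Because model_drop_target() holds both locks, it excludes a walker of
either list, whichever one the node happens to be linked on at the time.
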
Signed-off-by: Breno Leitao <leitao@debian.org>
---
drivers/net/netconsole.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index b77879ead641..59712d1d75fd 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -1452,6 +1452,7 @@ static void drop_netconsole_target(struct config_group *group,
dynamic_netconsole_mutex_lock();
+ mutex_lock(&target_cleanup_list_lock);
spin_lock_irqsave(&target_list_lock, flags);
/* Disable deactivated target to prevent races between resume attempt
* and target removal.
@@ -1460,6 +1461,7 @@ static void drop_netconsole_target(struct config_group *group,
nt->state = STATE_DISABLED;
list_del(&nt->list);
spin_unlock_irqrestore(&target_list_lock, flags);
+ mutex_unlock(&target_cleanup_list_lock);
dynamic_netconsole_mutex_unlock();
--
2.54.0
* Re: [PATCH net-next v2 0/3] netconsole: Fix reported problems
2026-06-02 14:26 [PATCH net-next v2 0/3] netconsole: Fix reported problems Breno Leitao
` (2 preceding siblings ...)
2026-06-02 14:26 ` [PATCH net-next v2 3/3] netconsole: take target_cleanup_list_lock in drop_netconsole_target() Breno Leitao
@ 2026-06-04 13:06 ` Simon Horman
2026-06-04 13:59 ` Breno Leitao
3 siblings, 1 reply; 6+ messages in thread
From: Simon Horman @ 2026-06-04 13:06 UTC (permalink / raw)
To: Breno Leitao
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev, linux-kernel, kernel-team
On Tue, Jun 02, 2026 at 07:26:56AM -0700, Breno Leitao wrote:
> These are some of the issues that LLM reported to netconsole, and they
> are being addressed here before big refactors.
>
> I was doing some big refactors, and got some "pre-existent-issues"
> during LLM review of the refactor, that make them hard to guarantee that
> refactor is not introducing any bug, so, let's clean these pre-existent
> bugs first, and then submit the refactor.
>
> The issues fixed in this patchset were reported during the review of
> https://lore.kernel.org/all/20260524-netconsole_move_more-v1-0-909d1ab398b4@debian.org/
>
> Not all of them got fixed, but, those that were easy to reason about.
>
> Why net-next and not 'net' tree.
>
> Most of the functions that are being fixed here moved from netpoll to
> netconsole, thus, fixing this on net will cause merge conflicts from
> 'net' to 'net-next', thus I decided to fix it on 'net-next', given we
> are on 7.1-rc6 already. Sorry if that is not the right approach.
>
> Changed from v1:
> * Change it from 'net' to 'net-next'.
Hi Breno,
There is AI-generated review of this patch-set available on both
https://sashiko.dev and https://netdev-ai.bots.linux.dev/sashiko/
I would appreciate it if you could look over that with a view
to addressing any issues that directly affect this patch-set.
* Re: [PATCH net-next v2 0/3] netconsole: Fix reported problems
2026-06-04 13:06 ` [PATCH net-next v2 0/3] netconsole: Fix reported problems Simon Horman
@ 2026-06-04 13:59 ` Breno Leitao
0 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-06-04 13:59 UTC (permalink / raw)
To: Simon Horman
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev, linux-kernel, kernel-team
On Thu, Jun 04, 2026 at 02:06:22PM +0100, Simon Horman wrote:
> On Tue, Jun 02, 2026 at 07:26:56AM -0700, Breno Leitao wrote:
> > These are some of the issues that LLM reported to netconsole, and they
> > are being addressed here before big refactors.
> >
> > I was doing some big refactors, and got some "pre-existent-issues"
> > during LLM review of the refactor, that make them hard to guarantee that
> > refactor is not introducing any bug, so, let's clean these pre-existent
> > bugs first, and then submit the refactor.
> >
> > The issues fixed in this patchset were reported during the review of
> > https://lore.kernel.org/all/20260524-netconsole_move_more-v1-0-909d1ab398b4@debian.org/
> >
> > Not all of them got fixed, but, those that were easy to reason about.
> >
> > Why net-next and not 'net' tree.
> >
> > Most of the functions that are being fixed here moved from netpoll to
> > netconsole, thus, fixing this on net will cause merge conflicts from
> > 'net' to 'net-next', thus I decided to fix it on 'net-next', given we
> > are on 7.1-rc6 already. Sorry if that is not the right approach.
> >
> > Changed from v1:
> > * Change it from 'net' to 'net-next'.
>
> Hi Breno,
>
> There is AI-generated review of this patch-set available on both
> https://sashiko.dev and https://netdev-ai.bots.linux.dev/sashiko/
>
> I would appreciate it if you could look over that with a view
> to addressing any issues that directly affect this patch-set.
Ack. While most of the reported issues are pre-existing and are being fixed,
there is one genuine regression: invoking WARN_ON_ONCE() in a potential
NMI context is problematic since it may trigger panic_on_warn.
I will send a revised version.
Thanks,
--breno
--
pw-bot: cr