[PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock

* [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
@ 2026-06-10 18:36 Vlad Poenaru
  2026-06-11 18:36 ` sashiko-bot
  2026-06-12  2:11 ` Jakub Kicinski
  0 siblings, 2 replies; 3+ messages in thread
From: Vlad Poenaru @ 2026-06-10 18:36 UTC
  To: netdev, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Breno Leitao, Sebastian Andrzej Siewior,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable

netpoll_poll_dev() can be called from process context with interrupts
disabled, most notably from printk() -> netconsole when a WARN()/printk()
is emitted while holding a runqueue lock inside __schedule() (e.g. from
put_prev_entity() during a context switch). console_unlock() then flushes
netconsole inline, which polls the NIC to drain its TX ring.

Drivers free completed TX skbs from their ->poll() via
dev_kfree_skb_irq_reason(), which queues the skb and calls
raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that helper
takes the !in_interrupt() path and calls wakeup_softirqd() ->
try_to_wake_up(). Waking the local ksoftirqd takes the current CPU's
rq->lock (ttwu_queue() -> rq_lock(); ttwu_queue_cond() refuses the remote
wakelist for a same-CPU wakeup). If the caller already holds that rq->lock
this recursively acquires a non-recursive spinlock: the CPU spins forever
with IRQs disabled. Every other CPU that subsequently load-balances
against this runqueue spins on the same lock, TLB-shootdown IPIs to the
wedged CPUs go unanswered, and the machine dies under the NMI hard-lockup
watchdog.
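
The decision described above can be sketched as a small userspace model. This
is not kernel code: every model_* name, the struct, and the two constants are
hypothetical stand-ins chosen for illustration, and the rq->lock is reduced to
a boolean. It only shows why raising a softirq outside a BH-disabled or
interrupt context, while the local rq->lock is held, ends in a self-deadlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; every name below is hypothetical, not a kernel API. */
#define MODEL_SOFTIRQ_DISABLE_OFFSET 0x200u  /* what a BH-disabled section adds */
#define MODEL_IN_INTERRUPT_MASK      0xff00u /* bits that make in_interrupt() true */

enum { MODEL_NET_TX_SOFTIRQ = 2 };

struct model_cpu {
	unsigned int preempt_count;   /* stand-in for the per-CPU preempt count */
	unsigned int softirq_pending; /* bitmask of raised softirqs */
	bool rq_lock_held;            /* caller already holds this CPU's rq->lock */
	bool self_deadlock;           /* set where the real CPU would spin forever */
	unsigned int wakeups;         /* successful ksoftirqd wakeups */
};

static bool model_in_interrupt(const struct model_cpu *cpu)
{
	return (cpu->preempt_count & MODEL_IN_INTERRUPT_MASK) != 0;
}

/* Waking the local ksoftirqd needs this CPU's rq->lock. */
static void model_wakeup_softirqd(struct model_cpu *cpu)
{
	if (cpu->rq_lock_held) {
		cpu->self_deadlock = true;
		return;
	}
	cpu->wakeups++;
}

/* Always set the pending bit; wake ksoftirqd only outside interrupt context. */
static void model_raise_softirq_irqoff(struct model_cpu *cpu, unsigned int nr)
{
	cpu->softirq_pending |= 1u << nr;
	if (!model_in_interrupt(cpu))
		model_wakeup_softirqd(cpu);
}

/* Returns true if a TX completion raised from this context self-deadlocks. */
static bool model_tx_completion_deadlocks(bool rq_lock_held, bool bh_disabled)
{
	struct model_cpu cpu = { .rq_lock_held = rq_lock_held };

	if (bh_disabled)
		cpu.preempt_count += MODEL_SOFTIRQ_DISABLE_OFFSET;
	model_raise_softirq_irqoff(&cpu, MODEL_NET_TX_SOFTIRQ);
	return cpu.self_deadlock;
}
```

In this model, model_tx_completion_deadlocks(true, false) reports the
self-deadlock, while the same call with bh_disabled set only records the
pending bit.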

This was hit in production on a 252-CPU AMD system running a 6.16-based
kernel. A scheduler WARN_ON_ONCE() fired from __enqueue_entity() with the
rq lock held during a context switch; flushing it to netconsole reentered
the scheduler and the CPU deadlocked on its own rq->lock. The backtrace of
the wedged CPU (spinning at the top of the stack on the rq->lock it is
already holding further down):

  native_queued_spin_lock_slowpath
  _raw_spin_lock
  raw_spin_rq_lock_nested
  rq_lock
  ttwu_queue
  try_to_wake_up                  // wakes ksoftirqd/N
  dev_kfree_skb_irq_reason        // raise_softirq_irqoff(NET_TX_SOFTIRQ)
  __bnxt_tx_int
  bnxt_poll_p5
  poll_one_napi
  poll_napi
  netpoll_poll_dev
  netpoll_send_udp
  write_ext_msg                   // netconsole
  console_unlock
  vprintk_emit
  __warn
  __enqueue_entity                // WARN_ON_ONCE() here -- rq->lock held
  put_prev_entity
  put_prev_task_fair
  __schedule
  sched_exec
  bprm_execve
  __x64_sys_execve

About 215 of the 252 CPUs then piled up in sched_balance_rq() spinning on
that runqueue's lock; pending TLB shootdowns to the wedged CPUs stalled in
csd_lock_wait(), and a victim CPU finally took down the box with "Kernel
panic - not syncing: Hard LOCKUP". The particular WARN is incidental --
any printk() that reaches netconsole while a rq->lock is held reproduces
the same self-deadlock.

In the normal receive path this cannot happen because net_rx_action()
runs ->poll() with bottom halves disabled, so raise_softirq_irqoff() sees
in_interrupt() and merely sets the pending bit. Make netpoll do the same:
wrap the poll callbacks in local_bh_disable(). On !PREEMPT_RT all callers
invoke netpoll_poll_dev() with IRQs disabled (see the WARN_ONCE() in
netpoll_send_skb_on_dev()), so pair it with _local_bh_enable() to leave
the section without running softirqs inline -- running them here would
re-enable IRQs and execute softirq handlers deep in a lock-holding
context. On PREEMPT_RT the path runs with IRQs enabled and softirqs are
threaded; _local_bh_enable() is not available there and would not drop
the softirq_ctrl local_lock taken by local_bh_disable(), so use the
regular local_bh_enable(). The raised NET_TX softirq is harmless: netpoll
reaps the freed skbs via zap_completion_queue() and the pending softirq
is serviced at the next irq_exit().
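
The difference between the two ways of leaving the BH-disabled section on
!PREEMPT_RT can be sketched in userspace as follows. This is a simplified
model under stated assumptions, not the kernel implementation: the model_*
names, the struct, and the constants are hypothetical, and PREEMPT_RT
behaviour is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; every name below is hypothetical, not a kernel API. */
#define MODEL_SOFTIRQ_DISABLE_OFFSET 0x200u
#define MODEL_IN_INTERRUPT_MASK      0xff00u

struct model_bh_state {
	unsigned int preempt_count;       /* stand-in for the preempt count */
	unsigned int softirq_pending;     /* bitmask of raised softirqs */
	unsigned int handlers_run_inline; /* softirq runs triggered on enable */
};

static void model_local_bh_disable(struct model_bh_state *s)
{
	s->preempt_count += MODEL_SOFTIRQ_DISABLE_OFFSET;
}

/* Models _local_bh_enable(): leave the section, run nothing. */
static void model_bh_enable_no_run(struct model_bh_state *s)
{
	s->preempt_count -= MODEL_SOFTIRQ_DISABLE_OFFSET;
}

/*
 * Models the regular local_bh_enable() on !PREEMPT_RT: leave the section,
 * then run pending softirqs inline if no longer in interrupt context.
 */
static void model_bh_enable_and_run(struct model_bh_state *s)
{
	s->preempt_count -= MODEL_SOFTIRQ_DISABLE_OFFSET;
	if (!(s->preempt_count & MODEL_IN_INTERRUPT_MASK) && s->softirq_pending) {
		s->handlers_run_inline++;
		s->softirq_pending = 0;
	}
}

/* One poll that raises a softirq (bit 2) inside the BH-disabled section. */
static struct model_bh_state model_poll_once(bool run_on_enable)
{
	struct model_bh_state s = { 0 };

	model_local_bh_disable(&s);
	s.softirq_pending |= 1u << 2; /* TX completion sets only the pending bit */
	if (run_on_enable)
		model_bh_enable_and_run(&s);
	else
		model_bh_enable_no_run(&s);
	return s;
}
```

With run_on_enable false the pending bit survives the section and no handler
runs inline; with it true the handler runs on the way out, which is the
behaviour the paragraph above wants to avoid in an IRQ-disabled caller.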

Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
 net/core/netpoll.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3f4a17fa5713..18da97eff532 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
 	}

 	ops = dev->netdev_ops;
+
+	/*
+	 * Run the poll callbacks in softirq context, exactly as net_rx_action()
+	 * does for the normal NAPI path. netpoll_poll_dev() is called from
+	 * process context with IRQs disabled (e.g. printk() -> netconsole while
+	 * holding a rq->lock inside __schedule()). Drivers free completed TX
+	 * skbs from their ->poll() via dev_kfree_skb_irq_reason(), which calls
+	 * raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that
+	 * helper sees !in_interrupt() and calls wakeup_softirqd() ->
+	 * try_to_wake_up(), which takes the rq->lock of the current CPU. If the
+	 * caller already holds that rq->lock this self-deadlocks, wedging the
+	 * CPU (and then the whole machine via rq->lock contention) until the
+	 * hard-lockup watchdog panics.
+	 *
+	 * Disabling BH makes in_interrupt() true for the duration of the poll,
+	 * so the TX completion only sets the softirq-pending bit and never wakes
+	 * ksoftirqd. The raised softirq is harmless and benign: netpoll reaps
+	 * the freed skbs itself via zap_completion_queue() below, and the
+	 * pending NET_TX softirq is serviced at the next irq_exit().
+	 */
+	local_bh_disable();
+
 	if (ops->ndo_poll_controller)
 		ops->ndo_poll_controller(dev);

 	poll_napi(dev);

+#ifndef CONFIG_PREEMPT_RT
+	/*
+	 * On !PREEMPT_RT all netpoll_poll_dev() callers invoke us with IRQs
+	 * disabled (see the WARN_ONCE() in netpoll_send_skb_on_dev()). Use
+	 * _local_bh_enable(), which leaves the BH-disabled section without
+	 * running pending softirqs inline -- the full local_bh_enable() would
+	 * re-enable IRQs and run softirq handlers deep inside this restricted,
+	 * lock-holding context. The raised NET_TX softirq is benign: netpoll
+	 * reaps the freed skbs itself via zap_completion_queue() below, and the
+	 * pending softirq is serviced at the next irq_exit().
+	 */
+	_local_bh_enable();
+#else
+	/*
+	 * On PREEMPT_RT this path runs with IRQs enabled and softirqs are
+	 * threaded, so there is no IRQ-disabled, lock-holding context to
+	 * protect. _local_bh_enable() is not available on RT, and local_bh_disable()
+	 * there takes the per-CPU softirq_ctrl local_lock that only the full
+	 * local_bh_enable() releases -- so use it.
+	 */
+	local_bh_enable();
+#endif
+
 	up(&ni->dev_lock);

 	zap_completion_queue();
-- 
2.53.0-Meta


* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
  2026-06-10 18:36 [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock Vlad Poenaru
@ 2026-06-11 18:36 ` sashiko-bot
  2026-06-12  2:11 ` Jakub Kicinski
  1 sibling, 0 replies; 3+ messages in thread
From: sashiko-bot @ 2026-06-11 18:36 UTC
  To: Vlad Poenaru; +Cc: linux-rt-devel, linux-kernel

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

New issues:
- [High] On PREEMPT_RT, `local_bh_enable()` triggers a lockdep splat and self-deadlocks because IRQs are disabled.
- [High] The addition of `local_bh_disable()` and `_local_bh_enable()` triggers a `WARN_ON_ONCE(in_hardirq())` splat when netconsole is used from hardirq context.

Pre-existing issues:
- [Critical] The patch fails to resolve the self-deadlock because it shifts the ksoftirqd wakeup to `zap_completion_queue()`.
--

commit 5559ca23d7e6dcba4ded6d9783c2b7954c8e659f
Author: Vlad Poenaru <vlad.wing@gmail.com>

netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock

This commit aims to prevent a self-deadlock caused by netconsole flushing
messages while holding a runqueue lock. It wraps the netpoll NAPI callback
in local_bh_disable() to ensure softirqs are not raised out of band,
deferring them instead of waking ksoftirqd inline.

> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 3f4a17fa5713a..18da97eff5322 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
[ ... ]
> +	 * Disabling BH makes in_interrupt() true for the duration of the poll,
> +	 * so the TX completion only sets the softirq-pending bit and never wakes
> +	 * ksoftirqd. The raised softirq is harmless and benign: netpoll reaps
> +	 * the freed skbs itself via zap_completion_queue() below, and the
> +	 * pending NET_TX softirq is serviced at the next irq_exit().
> +	 */
> +	local_bh_disable();

[Severity: High]
Does adding local_bh_disable() here trigger a warning when netpoll is called
from hardirq context?

Netconsole is frequently used to flush dmesg synchronously during kernel
panics or sysrq, which occur in hardware interrupt handlers. Both
local_bh_disable() (via __local_bh_disable_ip()) and _local_bh_enable()
explicitly assert WARN_ON_ONCE(in_hardirq()).

Could this cause recursive warnings or disrupt the logging path during a
panic sequence?

>  	if (ops->ndo_poll_controller)
>  		ops->ndo_poll_controller(dev);
>  
>  	poll_napi(dev);
>  
> +#ifndef CONFIG_PREEMPT_RT
[ ... ]
> +#else
> +	/*
> +	 * On PREEMPT_RT this path runs with IRQs enabled and softirqs are
> +	 * threaded, so there is no IRQ-disabled, lock-holding context to
> +	 * protect. _local_bh_enable() is not available on RT, and local_bh_disable()
> +	 * there takes the per-CPU softirq_ctrl local_lock that only the full
> +	 * local_bh_enable() releases -- so use it.
> +	 */
> +	local_bh_enable();

[Severity: High]
The commit message states that on PREEMPT_RT this path runs with IRQs
enabled. Is this accurate?

Looking at netpoll_send_skb(), it unconditionally calls local_irq_save(),
meaning hardware interrupts are disabled here. Calling local_bh_enable()
maps to __local_bh_enable_ip(), which enforces lockdep_assert_irqs_enabled().
Will this trigger a lockdep splat?

Additionally, because IRQs are disabled, preemptible() evaluates to false.
Does the RT implementation then skip inline softirq processing and instead
fall back to wakeup_softirqd(), resulting in the exact same rq->lock
self-deadlock?

> +#endif
> +
>  	up(&ni->dev_lock);
>  
>  	zap_completion_queue();

[Severity: Critical]
This isn't a bug introduced by this patch, but does the patch fail
to fix the self-deadlock it intends to resolve?

The fix wraps poll_napi() to ensure in_interrupt() is true, preventing
the ksoftirqd wakeup. However, zap_completion_queue() is called after
_local_bh_enable(), meaning in_interrupt() is false again.

When zap_completion_queue() processes unfreeable SKBs, it calls:

zap_completion_queue()
  dev_kfree_skb_any()
    dev_kfree_skb_irq_reason()
      raise_softirq_irqoff(NET_TX_SOFTIRQ)

Because in_interrupt() is now false, this immediately invokes
wakeup_softirqd(). Won't this attempt to acquire the CPU's rq->lock
that the caller (e.g., __schedule()) already holds, resulting in the
exact same self-deadlock?
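
The ordering concern raised above can be illustrated with a small userspace
model. It is a sketch of the question as stated, not a confirmation that the
kernel behaves this way: the model_* names and constants are hypothetical, the
rq->lock is a boolean, and the zap_inside_bh variant exists only for
comparison and is not something proposed in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; every name below is hypothetical, not a kernel API. */
#define MODEL_SOFTIRQ_DISABLE_OFFSET 0x200u
#define MODEL_IN_INTERRUPT_MASK      0xff00u

struct model_ctx {
	unsigned int preempt_count; /* stand-in for the preempt count */
	bool rq_lock_held;          /* caller holds this CPU's rq->lock */
	bool self_deadlock;         /* set where a ksoftirqd wakeup would spin */
};

/* Raising outside interrupt context wakes ksoftirqd, which needs rq->lock. */
static void model_raise(struct model_ctx *c)
{
	bool in_interrupt = (c->preempt_count & MODEL_IN_INTERRUPT_MASK) != 0;

	if (!in_interrupt && c->rq_lock_held)
		c->self_deadlock = true;
}

/*
 * requeue_in_zap: the zap step hands an skb back through the IRQ-safe free
 * path, which raises the softirq again.
 * zap_inside_bh == false models the ordering in the patch (zap after the
 * BH-disabled section); true is a hypothetical variant for comparison.
 */
static bool model_poll_dev_deadlocks(bool requeue_in_zap, bool zap_inside_bh)
{
	struct model_ctx c = { .rq_lock_held = true };

	c.preempt_count += MODEL_SOFTIRQ_DISABLE_OFFSET;
	model_raise(&c); /* poll step: TX completion inside the section */
	if (zap_inside_bh && requeue_in_zap)
		model_raise(&c);
	c.preempt_count -= MODEL_SOFTIRQ_DISABLE_OFFSET;
	if (!zap_inside_bh && requeue_in_zap)
		model_raise(&c); /* zap step after the section has ended */
	return c.self_deadlock;
}
```

In the model, only the case where the zap step raises after the section has
ended reports a self-deadlock.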

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260610183621.3915271-1-vlad.wing@gmail.com?part=1


* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
  2026-06-10 18:36 [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock Vlad Poenaru
  2026-06-11 18:36 ` sashiko-bot
@ 2026-06-12  2:11 ` Jakub Kicinski
  1 sibling, 0 replies; 3+ messages in thread
From: Jakub Kicinski @ 2026-06-12  2:11 UTC
  To: Vlad Poenaru, Sebastian Andrzej Siewior, Thomas Gleixner
  Cc: netdev, David S . Miller, Eric Dumazet, Paolo Abeni, Simon Horman,
	Breno Leitao, Clark Williams, Steven Rostedt, linux-rt-devel,
	linux-kernel, stable, Frederic Weisbecker

Please trim the pages of slop in the commit message and the comments.

On Wed, 10 Jun 2026 11:36:21 -0700 Vlad Poenaru wrote:
> @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
> +	local_bh_disable();
> + 	poll_napi(dev);
> +	_local_bh_enable();

tglx, Sebastian, are you okay with using _local_bh_enable() to trick
softirq into not waking ksoftirqd? The problematic path is:

  scheduler -> printk -> netconsole -> raise softirq -> scheduler (deadlock)

so the softirq may never get serviced.

In netcons we try to avoid touching the network driver if the Tx path
locks are already held. Ideally we'd do something similar with the
scheduler. Try to do bare minimum if we may be in the scheduler.
Failing that - don't poll the driver if we were called with irqs
already disabled.

Or maybe we only poll from console->write_thread ?

