* [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
@ 2026-06-10 18:36 Vlad Poenaru
[not found] ` <20260611191114.5bc43a59@kernel.org>
0 siblings, 1 reply; 2+ messages in thread
From: Vlad Poenaru @ 2026-06-10 18:36 UTC (permalink / raw)
To: netdev, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Simon Horman, Breno Leitao, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
stable
netpoll_poll_dev() can be called from process context with interrupts
disabled, most notably from printk() -> netconsole when a WARN()/printk()
is emitted while holding a runqueue lock inside __schedule() (e.g. from
put_prev_entity() during a context switch). console_unlock() then flushes
netconsole inline, which polls the NIC to drain its TX ring.
Drivers free completed TX skbs from their ->poll() via
dev_kfree_skb_irq_reason(), which queues the skb and calls
raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that helper
takes the !in_interrupt() path and calls wakeup_softirqd() ->
try_to_wake_up(). Waking the local ksoftirqd takes the current CPU's
rq->lock (ttwu_queue() -> rq_lock(); ttwu_queue_cond() refuses the remote
wakelist for a same-CPU wakeup). If the caller already holds that rq->lock
this recursively acquires a non-recursive spinlock: the CPU spins forever
with IRQs disabled. Every other CPU that subsequently load-balances
against this runqueue spins on the same lock, TLB-shootdown IPIs to the
wedged CPUs go unanswered, and the machine dies under the NMI hard-lockup
watchdog.
This was hit in production on a 252-CPU AMD system running a 6.16-based
kernel. A scheduler WARN_ON_ONCE() fired from __enqueue_entity() with the
rq lock held during a context switch; flushing it to netconsole reentered
the scheduler and the CPU deadlocked on its own rq->lock. The backtrace of
the wedged CPU (spinning at the top of the stack on the rq->lock it is
already holding further down):
native_queued_spin_lock_slowpath
_raw_spin_lock
raw_spin_rq_lock_nested
rq_lock
ttwu_queue
try_to_wake_up // wakes ksoftirqd/N
dev_kfree_skb_irq_reason // raise_softirq_irqoff(NET_TX_SOFTIRQ)
__bnxt_tx_int
bnxt_poll_p5
poll_one_napi
poll_napi
netpoll_poll_dev
netpoll_send_udp
write_ext_msg // netconsole
console_unlock
vprintk_emit
__warn
__enqueue_entity // WARN_ON_ONCE() here -- rq->lock held
put_prev_entity
put_prev_task_fair
__schedule
sched_exec
bprm_execve
__x64_sys_execve
About 215 of the 252 CPUs then piled up in sched_balance_rq() spinning on
that runqueue's lock; pending TLB shootdowns to the wedged CPUs stalled in
csd_lock_wait(), and a victim CPU finally took down the box with "Kernel
panic - not syncing: Hard LOCKUP". The particular WARN is incidental --
any printk() that reaches netconsole while a rq->lock is held reproduces
the same self-deadlock.
In the normal receive path this cannot happen because net_rx_action()
runs ->poll() with bottom halves disabled, so raise_softirq_irqoff() sees
in_interrupt() and merely sets the pending bit. Make netpoll do the same:
wrap the poll callbacks in local_bh_disable(). On !PREEMPT_RT all callers
invoke netpoll_poll_dev() with IRQs disabled (see the WARN_ONCE() in
netpoll_send_skb_on_dev()), so pair it with _local_bh_enable() to leave
the section without running softirqs inline -- running them here would
re-enable IRQs and execute softirq handlers deep in a lock-holding
context. On PREEMPT_RT the path runs with IRQs enabled and softirqs are
threaded; _local_bh_enable() is not available there and would not drop
the softirq_ctrl local_lock taken by local_bh_disable(), so use the
regular local_bh_enable(). The raised NET_TX softirq is harmless: netpoll
reaps the freed skbs via zap_completion_queue() and the pending softirq
is serviced at the next irq_exit().
Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
net/core/netpoll.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3f4a17fa5713..18da97eff532 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
}
ops = dev->netdev_ops;
+
+ /*
+ * Run the poll callbacks in softirq context, exactly as net_rx_action()
+ * does for the normal NAPI path. netpoll_poll_dev() is called from
+ * process context with IRQs disabled (e.g. printk() -> netconsole while
+ * holding a rq->lock inside __schedule()). Drivers free completed TX
+ * skbs from their ->poll() via dev_kfree_skb_irq_reason(), which calls
+ * raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that
+ * helper sees !in_interrupt() and calls wakeup_softirqd() ->
+ * try_to_wake_up(), which takes the rq->lock of the current CPU. If the
+ * caller already holds that rq->lock this self-deadlocks, wedging the
+ * CPU (and then the whole machine via rq->lock contention) until the
+ * hard-lockup watchdog panics.
+ *
+ * Disabling BH makes in_interrupt() true for the duration of the poll,
+ * so the TX completion only sets the softirq-pending bit and never wakes
+ * ksoftirqd. The raised softirq is harmless and benign: netpoll reaps
+ * the freed skbs itself via zap_completion_queue() below, and the
+ * pending NET_TX softirq is serviced at the next irq_exit().
+ */
+ local_bh_disable();
+
if (ops->ndo_poll_controller)
ops->ndo_poll_controller(dev);
poll_napi(dev);
+#ifndef CONFIG_PREEMPT_RT
+ /*
+ * On !PREEMPT_RT all netpoll_poll_dev() callers invoke us with IRQs
+ * disabled (see the WARN_ONCE() in netpoll_send_skb_on_dev()). Use
+ * _local_bh_enable(), which leaves the BH-disabled section without
+ * running pending softirqs inline -- the full local_bh_enable() would
+ * re-enable IRQs and run softirq handlers deep inside this restricted,
+ * lock-holding context. The raised NET_TX softirq is benign: netpoll
+ * reaps the freed skbs itself via zap_completion_queue() below, and the
+ * pending softirq is serviced at the next irq_exit().
+ */
+ _local_bh_enable();
+#else
+ /*
+ * On PREEMPT_RT this path runs with IRQs enabled and softirqs are
+ * threaded, so there is no IRQ-disabled, lock-holding context to
+ * protect. _local_bh_enable() is not available on RT, and local_bh_disable()
+ * there takes the per-CPU softirq_ctrl local_lock that only the full
+ * local_bh_enable() releases -- so use it.
+ */
+ local_bh_enable();
+#endif
+
up(&ni->dev_lock);
zap_completion_queue();
--
2.53.0-Meta
^ permalink raw reply related [flat|nested] 2+ messages in thread
* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
[not found] ` <20260611191114.5bc43a59@kernel.org>
@ 2026-06-15 13:56 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 2+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-15 13:56 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
stable, Frederic Weisbecker
On 2026-06-11 19:11:14 [-0700], Jakub Kicinski wrote:
> Please trim the pages of slop in the commit message and the comments.
>
> On Wed, 10 Jun 2026 11:36:21 -0700 Vlad Poenaru wrote:
> > @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
> > + local_bh_disable();
> > + poll_napi(dev);
> > + _local_bh_enable();
>
> tglx, Sebastian, are you okay with using _local_bh_enable() to trick
> softirq into not waking ksoftirqd? The problematic path is:
The I planned to get to this today but I won't make it. I try to get to
this as soon I can…
Sebastian
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2026-06-15 13:56 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-10 18:36 [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock Vlad Poenaru
[not found] ` <20260611191114.5bc43a59@kernel.org>
2026-06-15 13:56 ` Sebastian Andrzej Siewior
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox