* [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
@ 2026-06-10 18:36 Vlad Poenaru
  2026-06-11 18:36 ` sashiko-bot
  2026-06-12  2:11 ` Jakub Kicinski
  0 siblings, 2 replies; 3+ messages in thread
From: Vlad Poenaru @ 2026-06-10 18:36 UTC
  To: netdev, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Breno Leitao, Sebastian Andrzej Siewior,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable

netpoll_poll_dev() can be called from process context with interrupts
disabled, most notably from printk() -> netconsole when a WARN()/printk()
is emitted while holding a runqueue lock inside __schedule() (e.g. from
put_prev_entity() during a context switch). console_unlock() then flushes
netconsole inline, which polls the NIC to drain its TX ring.

Drivers free completed TX skbs from their ->poll() via
dev_kfree_skb_irq_reason(), which queues the skb and calls
raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that helper
takes the !in_interrupt() path and calls wakeup_softirqd() ->
try_to_wake_up(). Waking the local ksoftirqd takes the current CPU's
rq->lock (ttwu_queue() -> rq_lock(); ttwu_queue_cond() refuses the remote
wakelist for a same-CPU wakeup). If the caller already holds that rq->lock
this recursively acquires a non-recursive spinlock: the CPU spins forever
with IRQs disabled. Every other CPU that subsequently load-balances
against this runqueue spins on the same lock, TLB-shootdown IPIs to the
wedged CPUs go unanswered, and the machine dies under the NMI hard-lockup
watchdog.

This was hit in production on a 252-CPU AMD system running a 6.16-based
kernel. A scheduler WARN_ON_ONCE() fired from __enqueue_entity() with the
rq lock held during a context switch; flushing it to netconsole reentered
the scheduler and the CPU deadlocked on its own rq->lock. The backtrace of
the wedged CPU (spinning at the top of the stack on the rq->lock it is
already holding further down):

  native_queued_spin_lock_slowpath
  _raw_spin_lock
  raw_spin_rq_lock_nested
  rq_lock
  ttwu_queue
  try_to_wake_up                  // wakes ksoftirqd/N
  dev_kfree_skb_irq_reason        // raise_softirq_irqoff(NET_TX_SOFTIRQ)
  __bnxt_tx_int
  bnxt_poll_p5
  poll_one_napi
  poll_napi
  netpoll_poll_dev
  netpoll_send_udp
  write_ext_msg                   // netconsole
  console_unlock
  vprintk_emit
  __warn
  __enqueue_entity                // WARN_ON_ONCE() here -- rq->lock held
  put_prev_entity
  put_prev_task_fair
  __schedule
  sched_exec
  bprm_execve
  __x64_sys_execve

About 215 of the 252 CPUs then piled up in sched_balance_rq() spinning on
that runqueue's lock; pending TLB shootdowns to the wedged CPUs stalled in
csd_lock_wait(), and a victim CPU finally took down the box with "Kernel
panic - not syncing: Hard LOCKUP". The particular WARN is incidental --
any printk() that reaches netconsole while a rq->lock is held reproduces
the same self-deadlock.

In the normal receive path this cannot happen because net_rx_action()
runs ->poll() with bottom halves disabled, so raise_softirq_irqoff() sees
in_interrupt() and merely sets the pending bit. Make netpoll do the same:
wrap the poll callbacks in local_bh_disable(). On !PREEMPT_RT all callers
invoke netpoll_poll_dev() with IRQs disabled (see the WARN_ONCE() in
netpoll_send_skb_on_dev()), so pair it with _local_bh_enable() to leave
the section without running softirqs inline -- running them here would
re-enable IRQs and execute softirq handlers deep in a lock-holding
context. On PREEMPT_RT the path runs with IRQs enabled and softirqs are
threaded; _local_bh_enable() is not available there and would not drop
the softirq_ctrl local_lock taken by local_bh_disable(), so use the
regular local_bh_enable(). The raised NET_TX softirq is harmless: netpoll
reaps the freed skbs via zap_completion_queue() and the pending softirq
is serviced at the next irq_exit().

Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
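Illustration only, not part of the patch: a minimal userspace sketch of
the wake-up decision described above, assuming a simplified model in
which only the softirq bits of preempt_count are tracked (hardirq and NMI
bits, and the should_wake_ksoftirqd() check, are left out). Every model_*
and MODEL_* name is hypothetical and exists only in this sketch; none of
it is kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; mirrors only the softirq byte of preempt_count. */
#define MODEL_SOFTIRQ_SHIFT		8
#define MODEL_SOFTIRQ_OFFSET		(1u << MODEL_SOFTIRQ_SHIFT)
#define MODEL_SOFTIRQ_MASK		(0xffu << MODEL_SOFTIRQ_SHIFT)
#define MODEL_SOFTIRQ_DISABLE_OFFSET	(2u * MODEL_SOFTIRQ_OFFSET)
#define MODEL_NET_TX_SOFTIRQ		2

struct model_cpu {
	unsigned int preempt_count;	/* softirq nesting only */
	unsigned int softirq_pending;	/* bitmask of raised softirqs */
	bool rq_lock_held;		/* caller holds this CPU's rq->lock */
	unsigned int ksoftirqd_wakeups;	/* try_to_wake_up() attempts */
	bool self_deadlock;		/* wakeup needed a lock already held */
};

/* Simplified in_interrupt(): true while BH is disabled or a softirq runs. */
static bool model_in_interrupt(const struct model_cpu *cpu)
{
	return (cpu->preempt_count & MODEL_SOFTIRQ_MASK) != 0;
}

/* A same-CPU ksoftirqd wakeup must take this CPU's rq->lock. */
static void model_wakeup_softirqd(struct model_cpu *cpu)
{
	if (cpu->rq_lock_held)
		cpu->self_deadlock = true;	/* the real CPU would spin forever */
	cpu->ksoftirqd_wakeups++;
}

/* Set the pending bit; wake ksoftirqd only outside interrupt context. */
static void model_raise_softirq_irqoff(struct model_cpu *cpu, unsigned int nr)
{
	cpu->softirq_pending |= 1u << nr;
	if (!model_in_interrupt(cpu))
		model_wakeup_softirqd(cpu);
}

static void model_local_bh_disable(struct model_cpu *cpu)
{
	cpu->preempt_count += MODEL_SOFTIRQ_DISABLE_OFFSET;
}

/* Like _local_bh_enable(): drop the count without running softirqs. */
static void model_bh_enable_no_run(struct model_cpu *cpu)
{
	assert((cpu->preempt_count & MODEL_SOFTIRQ_MASK) >=
	       MODEL_SOFTIRQ_DISABLE_OFFSET);
	cpu->preempt_count -= MODEL_SOFTIRQ_DISABLE_OFFSET;
}

/* One netpoll-style poll that frees a TX skb while rq->lock is held. */
static struct model_cpu model_netpoll_poll(bool bh_disabled_around_poll)
{
	struct model_cpu cpu = { .rq_lock_held = true };

	if (bh_disabled_around_poll)
		model_local_bh_disable(&cpu);

	/* What dev_kfree_skb_irq_reason() ends up doing from ->poll(). */
	model_raise_softirq_irqoff(&cpu, MODEL_NET_TX_SOFTIRQ);

	if (bh_disabled_around_poll)
		model_bh_enable_no_run(&cpu);

	return cpu;
}
```

In this model, model_netpoll_poll(false) attempts a ksoftirqd wakeup and
sets self_deadlock, while model_netpoll_poll(true) only leaves the NET_TX
bit pending, which is the behaviour the patch relies on.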
 net/core/netpoll.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3f4a17fa5713..18da97eff532 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
 	}
 
 	ops = dev->netdev_ops;
+
+	/*
+	 * Run the poll callbacks in softirq context, exactly as net_rx_action()
+	 * does for the normal NAPI path. netpoll_poll_dev() is called from
+	 * process context with IRQs disabled (e.g. printk() -> netconsole while
+	 * holding a rq->lock inside __schedule()). Drivers free completed TX
+	 * skbs from their ->poll() via dev_kfree_skb_irq_reason(), which calls
+	 * raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that
+	 * helper sees !in_interrupt() and calls wakeup_softirqd() ->
+	 * try_to_wake_up(), which takes the rq->lock of the current CPU. If the
+	 * caller already holds that rq->lock this self-deadlocks, wedging the
+	 * CPU (and then the whole machine via rq->lock contention) until the
+	 * hard-lockup watchdog panics.
+	 *
+	 * Disabling BH makes in_interrupt() true for the duration of the poll,
+	 * so the TX completion only sets the softirq-pending bit and never wakes
+	 * ksoftirqd. The raised softirq is harmless: netpoll reaps
+	 * the freed skbs itself via zap_completion_queue() below, and the
+	 * pending NET_TX softirq is serviced at the next irq_exit().
+	 */
+	local_bh_disable();
+
 	if (ops->ndo_poll_controller)
 		ops->ndo_poll_controller(dev);
 
 	poll_napi(dev);
 
+#ifndef CONFIG_PREEMPT_RT
+	/*
+	 * On !PREEMPT_RT all netpoll_poll_dev() callers invoke us with IRQs
+	 * disabled (see the WARN_ONCE() in netpoll_send_skb_on_dev()). Use
+	 * _local_bh_enable(), which leaves the BH-disabled section without
+	 * running pending softirqs inline -- the full local_bh_enable() would
+	 * re-enable IRQs and run softirq handlers deep inside this restricted,
+	 * lock-holding context. The raised NET_TX softirq is benign: netpoll
+	 * reaps the freed skbs itself via zap_completion_queue() below, and the
+	 * pending softirq is serviced at the next irq_exit().
+	 */
+	_local_bh_enable();
+#else
+	/*
+	 * On PREEMPT_RT this path runs with IRQs enabled and softirqs are
+	 * threaded, so there is no IRQ-disabled, lock-holding context to
+	 * protect. _local_bh_enable() is not available on RT, and
+	 * local_bh_disable() there takes the per-CPU softirq_ctrl local_lock
+	 * that only the full local_bh_enable() releases -- so use it.
+	 */
+	local_bh_enable();
+#endif
+
 	up(&ni->dev_lock);
 
 	zap_completion_queue();
-- 
2.53.0-Meta

