Netdev List
 help / color / mirror / Atom feed
* [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
@ 2026-06-10 18:36 Vlad Poenaru
       [not found] ` <20260611191114.5bc43a59@kernel.org>
  0 siblings, 1 reply; 3+ messages in thread
From: Vlad Poenaru @ 2026-06-10 18:36 UTC (permalink / raw)
  To: netdev, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Simon Horman, Breno Leitao, Sebastian Andrzej Siewior,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable

netpoll_poll_dev() can be called from process context with interrupts
disabled, most notably from printk() -> netconsole when a WARN()/printk()
is emitted while holding a runqueue lock inside __schedule() (e.g. from
put_prev_entity() during a context switch). console_unlock() then flushes
netconsole inline, which polls the NIC to drain its TX ring.

Drivers free completed TX skbs from their ->poll() via
dev_kfree_skb_irq_reason(), which queues the skb and calls
raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that helper
takes the !in_interrupt() path and calls wakeup_softirqd() ->
try_to_wake_up(). Waking the local ksoftirqd takes the current CPU's
rq->lock (ttwu_queue() -> rq_lock(); ttwu_queue_cond() refuses the remote
wakelist for a same-CPU wakeup). If the caller already holds that rq->lock
this recursively acquires a non-recursive spinlock: the CPU spins forever
with IRQs disabled. Every other CPU that subsequently load-balances
against this runqueue spins on the same lock, TLB-shootdown IPIs to the
wedged CPUs go unanswered, and the machine dies under the NMI hard-lockup
watchdog.

This was hit in production on a 252-CPU AMD system running a 6.16-based
kernel. A scheduler WARN_ON_ONCE() fired from __enqueue_entity() with the
rq lock held during a context switch; flushing it to netconsole reentered
the scheduler and the CPU deadlocked on its own rq->lock. The backtrace of
the wedged CPU (spinning at the top of the stack on the rq->lock it is
already holding further down):

  native_queued_spin_lock_slowpath
  _raw_spin_lock
  raw_spin_rq_lock_nested
  rq_lock
  ttwu_queue
  try_to_wake_up                  // wakes ksoftirqd/N
  dev_kfree_skb_irq_reason        // raise_softirq_irqoff(NET_TX_SOFTIRQ)
  __bnxt_tx_int
  bnxt_poll_p5
  poll_one_napi
  poll_napi
  netpoll_poll_dev
  netpoll_send_udp
  write_ext_msg                   // netconsole
  console_unlock
  vprintk_emit
  __warn
  __enqueue_entity                // WARN_ON_ONCE() here -- rq->lock held
  put_prev_entity
  put_prev_task_fair
  __schedule
  sched_exec
  bprm_execve
  __x64_sys_execve

About 215 of the 252 CPUs then piled up in sched_balance_rq() spinning on
that runqueue's lock; pending TLB shootdowns to the wedged CPUs stalled in
csd_lock_wait(), and a victim CPU finally took down the box with "Kernel
panic - not syncing: Hard LOCKUP". The particular WARN is incidental --
any printk() that reaches netconsole while a rq->lock is held reproduces
the same self-deadlock.

In the normal receive path this cannot happen because net_rx_action()
runs ->poll() with bottom halves disabled, so raise_softirq_irqoff() sees
in_interrupt() and merely sets the pending bit. Make netpoll do the same:
wrap the poll callbacks in local_bh_disable(). On !PREEMPT_RT all callers
invoke netpoll_poll_dev() with IRQs disabled (see the WARN_ONCE() in
netpoll_send_skb_on_dev()), so pair it with _local_bh_enable() to leave
the section without running softirqs inline -- running them here would
re-enable IRQs and execute softirq handlers deep in a lock-holding
context. On PREEMPT_RT the path runs with IRQs enabled and softirqs are
threaded; _local_bh_enable() is not available there and would not drop
the softirq_ctrl local_lock taken by local_bh_disable(), so use the
regular local_bh_enable(). The raised NET_TX softirq is harmless: netpoll
reaps the freed skbs via zap_completion_queue() and the pending softirq
is serviced at the next irq_exit().

Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
 net/core/netpoll.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3f4a17fa5713..18da97eff532 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
 	}
 
 	ops = dev->netdev_ops;
+
+	/*
+	 * Run the poll callbacks in softirq context, exactly as net_rx_action()
+	 * does for the normal NAPI path. netpoll_poll_dev() is called from
+	 * process context with IRQs disabled (e.g. printk() -> netconsole while
+	 * holding a rq->lock inside __schedule()). Drivers free completed TX
+	 * skbs from their ->poll() via dev_kfree_skb_irq_reason(), which calls
+	 * raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that
+	 * helper sees !in_interrupt() and calls wakeup_softirqd() ->
+	 * try_to_wake_up(), which takes the rq->lock of the current CPU. If the
+	 * caller already holds that rq->lock this self-deadlocks, wedging the
+	 * CPU (and then the whole machine via rq->lock contention) until the
+	 * hard-lockup watchdog panics.
+	 *
+	 * Disabling BH makes in_interrupt() true for the duration of the poll,
+	 * so the TX completion only sets the softirq-pending bit and never wakes
+	 * ksoftirqd. The raised softirq is harmless and benign: netpoll reaps
+	 * the freed skbs itself via zap_completion_queue() below, and the
+	 * pending NET_TX softirq is serviced at the next irq_exit().
+	 */
+	local_bh_disable();
+
 	if (ops->ndo_poll_controller)
 		ops->ndo_poll_controller(dev);
 
 	poll_napi(dev);
 
+#ifndef CONFIG_PREEMPT_RT
+	/*
+	 * On !PREEMPT_RT all netpoll_poll_dev() callers invoke us with IRQs
+	 * disabled (see the WARN_ONCE() in netpoll_send_skb_on_dev()). Use
+	 * _local_bh_enable(), which leaves the BH-disabled section without
+	 * running pending softirqs inline -- the full local_bh_enable() would
+	 * re-enable IRQs and run softirq handlers deep inside this restricted,
+	 * lock-holding context. The raised NET_TX softirq is benign: netpoll
+	 * reaps the freed skbs itself via zap_completion_queue() below, and the
+	 * pending softirq is serviced at the next irq_exit().
+	 */
+	_local_bh_enable();
+#else
+	/*
+	 * On PREEMPT_RT this path runs with IRQs enabled and softirqs are
+	 * threaded, so there is no IRQ-disabled, lock-holding context to
+	 * protect. _local_bh_enable() is not available on RT, and local_bh_disable()
+	 * there takes the per-CPU softirq_ctrl local_lock that only the full
+	 * local_bh_enable() releases -- so use it.
+	 */
+	local_bh_enable();
+#endif
+
 	up(&ni->dev_lock);
 
 	zap_completion_queue();
-- 
2.53.0-Meta


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
       [not found] ` <20260611191114.5bc43a59@kernel.org>
@ 2026-06-15 13:56   ` Sebastian Andrzej Siewior
  2026-06-16 10:35   ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 3+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-15 13:56 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker

On 2026-06-11 19:11:14 [-0700], Jakub Kicinski wrote:
> Please trim the pages of slop in the commit message and the comments.
> 
> On Wed, 10 Jun 2026 11:36:21 -0700 Vlad Poenaru wrote:
> > @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
> > +	local_bh_disable();
> > + 	poll_napi(dev);
> > +	_local_bh_enable();
> 
> tglx, Sebastian, are you okay with using _local_bh_enable() to trick
> softirq into not waking ksoftirqd? The problematic path is:

The I planned to get to this today but I won't make it. I try to get to
this as soon I can…

Sebastian

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
       [not found] ` <20260611191114.5bc43a59@kernel.org>
  2026-06-15 13:56   ` Sebastian Andrzej Siewior
@ 2026-06-16 10:35   ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 3+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-16 10:35 UTC (permalink / raw)
  To: Jakub Kicinski, Petr Mladek, John Ogness, Sergey Senozhatsky,
	Peter Zijlstra
  Cc: Vlad Poenaru, Thomas Gleixner, netdev, David S . Miller,
	Eric Dumazet, Paolo Abeni, Simon Horman, Breno Leitao,
	Clark Williams, Steven Rostedt, linux-rt-devel, linux-kernel,
	stable, Frederic Weisbecker, Ingo Molnar, Vincent Guittot,
	Dietmar Eggemann, K Prateek Nayak

On 2026-06-11 19:11:14 [-0700], Jakub Kicinski wrote:
> On Wed, 10 Jun 2026 11:36:21 -0700 Vlad Poenaru wrote:
> > @@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
> > +	local_bh_disable();
> > + 	poll_napi(dev);
> > +	_local_bh_enable();
> 
> tglx, Sebastian, are you okay with using _local_bh_enable() to trick
> softirq into not waking ksoftirqd? The problematic path is:
> 
>   scheduler -> printk -> netconsole -> raise softirq -> scheduler (deadlock)
> 
> so the softirq may never get serviced.
> 
> In netcons we try to avoid touching the network driver if the Tx path
> locks are already held. Ideally we'd do something similar with the
> scheduler. Try to do bare minimum if we may be in the scheduler.
> Failing that - don't poll the driver if we were called with irqs
> already disabled.
> 
> Or maybe we only poll from console->write_thread ?

So this is not an issue since commit 7eab73b18630e ("netconsole: convert
to NBCON console infrastructure"). Because from here now on writes are
deferred to the nbcon thread. So this purely about -stable in this case.

Looking at the patch and the amount of comments vs code changes look
somehow hackish. That ifdef for PREEMPT_RT is not needed because on
PREEMPT_RT we have either nbcon or the legacy console (including
netconsole before the mentioned commit) wrapped in a dedicated thread
(via force_legacy_kthread()).
That means in both cases the flow never ends there and the problem is
limited to !PREEMPT_RT.

Now. The scheduler usually does printk_deferred() because of the rq lock
so it does not deadlock for various reasons. It is kind of a pity that
the various WARN macros don't do that.
I don't think that patch is enough. It works around the problem in this
scenario but should the NIC driver invoke schedule_work() then we are
back here again.
Should the network driver acquire a lock then lockdep might observe
rq -> driver-lock and then driver-lock -> rq and yell dead lock (CPU1
doing AB and CPU2 doing BA). This includes also other console driver so
it is not limited to netconsole.

Point being made is that we should avoid the callchain:

|  console_unlock
|  vprintk_emit
|  __warn
|  __enqueue_entity                // WARN_ON_ONCE() here -- rq->lock held
|  put_prev_entity
|  put_prev_task_fair
|  __schedule

basically a printk under the rq lock.

We could add printk_deferred_enter/exit() to all the rq_lock() variants.
I think PeterZ loves this the most. And Greg will appreciate it too
while backporting because of all the context changes.

We could also introduce WARN_ON_DEFERRED +variants which do the
printk_deferred_enter/exit() thingy should around the printk and replace
all the WARNs in kernel/sched/.
I *think* the tty/console layer has also a deadlock problem where it
holds locks and then the WARN(), that never triggers, asks for the same
locks again so we might have a second user…

Adding sched and printk folks for opinions while eyeballing
WARN_ON_DEFERRED().

Sebastian

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-06-16 10:35 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-10 18:36 [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock Vlad Poenaru
     [not found] ` <20260611191114.5bc43a59@kernel.org>
2026-06-15 13:56   ` Sebastian Andrzej Siewior
2026-06-16 10:35   ` Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox