From: Vlad Poenaru <vlad.wing@gmail.com>
To: netdev@vger.kernel.org, "David S . Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>, Breno Leitao <leitao@debian.org>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
	Clark Williams <clrkwllms@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	linux-rt-devel@lists.linux.dev, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org
Subject: [PATCH net] netpoll: run NAPI poll in softirq context to avoid rq->lock self-deadlock
Date: Wed, 10 Jun 2026 11:36:21 -0700
Message-ID: <20260610183621.3915271-1-vlad.wing@gmail.com>

netpoll_poll_dev() can be called from process context with interrupts
disabled, most notably from printk() -> netconsole when a WARN()/printk()
is emitted while holding a runqueue lock inside __schedule() (e.g. from
put_prev_entity() during a context switch). console_unlock() then flushes
netconsole inline, which polls the NIC to drain its TX ring.

Drivers free completed TX skbs from their ->poll() via
dev_kfree_skb_irq_reason(), which queues the skb and calls
raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that helper
takes the !in_interrupt() path and calls wakeup_softirqd() ->
try_to_wake_up(). Waking the local ksoftirqd takes the current CPU's
rq->lock (ttwu_queue() -> rq_lock(); ttwu_queue_cond() refuses the remote
wakelist for a same-CPU wakeup). If the caller already holds that rq->lock
this recursively acquires a non-recursive spinlock: the CPU spins forever
with IRQs disabled. Every other CPU that subsequently load-balances
against this runqueue spins on the same lock, TLB-shootdown IPIs to the
wedged CPUs go unanswered, and the machine dies under the NMI hard-lockup
watchdog.

This was hit in production on a 252-CPU AMD system running a 6.16-based
kernel. A scheduler WARN_ON_ONCE() fired from __enqueue_entity() with the
rq lock held during a context switch; flushing it to netconsole reentered
the scheduler and the CPU deadlocked on its own rq->lock. The backtrace of
the wedged CPU (spinning at the top of the stack on the rq->lock it is
already holding further down):

  native_queued_spin_lock_slowpath
  _raw_spin_lock
  raw_spin_rq_lock_nested
  rq_lock
  ttwu_queue
  try_to_wake_up // wakes ksoftirqd/N
  dev_kfree_skb_irq_reason // raise_softirq_irqoff(NET_TX_SOFTIRQ)
  __bnxt_tx_int
  bnxt_poll_p5
  poll_one_napi
  poll_napi
  netpoll_poll_dev
  netpoll_send_udp
  write_ext_msg // netconsole
  console_unlock
  vprintk_emit
  __warn
  __enqueue_entity // WARN_ON_ONCE() here -- rq->lock held
  put_prev_entity
  put_prev_task_fair
  __schedule
  sched_exec
  bprm_execve
  __x64_sys_execve

About 215 of the 252 CPUs then piled up in sched_balance_rq() spinning on
that runqueue's lock; pending TLB shootdowns to the wedged CPUs stalled in
csd_lock_wait(), and a victim CPU finally took down the box with "Kernel
panic - not syncing: Hard LOCKUP". The particular WARN is incidental --
any printk() that reaches netconsole while a rq->lock is held reproduces
the same self-deadlock.

In the normal receive path this cannot happen because net_rx_action()
runs ->poll() with bottom halves disabled, so raise_softirq_irqoff() sees
in_interrupt() and merely sets the pending bit. Make netpoll do the same:
wrap the poll callbacks in local_bh_disable(). On !PREEMPT_RT all callers
invoke netpoll_poll_dev() with IRQs disabled (see the WARN_ONCE() in
netpoll_send_skb_on_dev()), so pair it with _local_bh_enable() to leave
the section without running softirqs inline -- running them here would
re-enable IRQs and execute softirq handlers deep in a lock-holding
context. On PREEMPT_RT the path runs with IRQs enabled and softirqs are
threaded; _local_bh_enable() is not available there and would not drop
the softirq_ctrl local_lock taken by local_bh_disable(), so use the
regular local_bh_enable(). The raised NET_TX softirq is harmless: netpoll
reaps the freed skbs via zap_completion_queue() and the pending softirq
is serviced at the next irq_exit().

Cc: stable@vger.kernel.org
Signed-off-by: Vlad Poenaru <vlad.wing@gmail.com>
---
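Note, not part of the commit message: the decision that leads to the
self-deadlock can be modelled in userspace. The sketch below is a toy
model under stated assumptions, not kernel code. struct cpu_model, the
*_model() helpers and the *_MODEL constants are hypothetical names that
only mirror the kernel's in_interrupt(), raise_softirq_irqoff() and
wakeup_softirqd(); real locking, IRQ state and PREEMPT_RT behaviour are
not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Model constants (assumptions): softirq-disable bits and the NET_TX index. */
#define SOFTIRQ_DISABLE_OFFSET_MODEL 0x200u
#define NET_TX_SOFTIRQ_MODEL 2u

/* Hypothetical per-CPU state, reduced to what the commit message discusses. */
struct cpu_model {
	unsigned int preempt_count;     /* nonzero => "in_interrupt()" */
	unsigned int softirq_pending;   /* bitmask of raised softirqs */
	unsigned int ksoftirqd_wakeups; /* how often ksoftirqd was woken */
	bool rq_lock_held;              /* caller is inside __schedule() */
	bool deadlocked;                /* rq->lock would be taken twice */
};

static bool in_interrupt_model(const struct cpu_model *cpu)
{
	return cpu->preempt_count != 0;
}

/* Waking the local ksoftirqd needs this CPU's rq->lock. */
static void wakeup_softirqd_model(struct cpu_model *cpu)
{
	cpu->ksoftirqd_wakeups++;
	if (cpu->rq_lock_held)
		cpu->deadlocked = true;
}

/* Always set the pending bit; wake ksoftirqd only outside "interrupt". */
static void raise_softirq_irqoff_model(struct cpu_model *cpu, unsigned int nr)
{
	cpu->softirq_pending |= 1u << nr;
	if (!in_interrupt_model(cpu))
		wakeup_softirqd_model(cpu);
}

/* One driver ->poll() that frees a TX skb, with or without BH disabled. */
static void poll_model(struct cpu_model *cpu, bool bh_disabled)
{
	if (bh_disabled)
		cpu->preempt_count += SOFTIRQ_DISABLE_OFFSET_MODEL;

	raise_softirq_irqoff_model(cpu, NET_TX_SOFTIRQ_MODEL);

	if (bh_disabled) {
		assert(cpu->preempt_count >= SOFTIRQ_DISABLE_OFFSET_MODEL);
		/* Leave the section without running softirqs inline. */
		cpu->preempt_count -= SOFTIRQ_DISABLE_OFFSET_MODEL;
	}
}
```

With rq_lock_held set, poll_model(&cpu, false) marks the model as
deadlocked after one ksoftirqd wakeup, whereas poll_model(&cpu, true)
only leaves the NET_TX bit pending and wakes nothing, which is the
behaviour this patch relies on.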
 net/core/netpoll.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3f4a17fa5713..18da97eff532 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -194,11 +194,56 @@ void netpoll_poll_dev(struct net_device *dev)
 	}

 	ops = dev->netdev_ops;
+
+	/*
+	 * Run the poll callbacks in softirq context, exactly as net_rx_action()
+	 * does for the normal NAPI path. netpoll_poll_dev() can be called from
+	 * process context with IRQs disabled (e.g. printk() -> netconsole while
+	 * holding a rq->lock inside __schedule()). Drivers free completed TX
+	 * skbs from their ->poll() via dev_kfree_skb_irq_reason(), which calls
+	 * raise_softirq_irqoff(NET_TX_SOFTIRQ). Outside softirq context that
+	 * helper sees !in_interrupt() and calls wakeup_softirqd() ->
+	 * try_to_wake_up(), which takes the rq->lock of the current CPU. If the
+	 * caller already holds that rq->lock this self-deadlocks, wedging the
+	 * CPU (and then the whole machine via rq->lock contention) until the
+	 * hard-lockup watchdog panics.
+	 *
+	 * Disabling BH makes in_interrupt() true for the duration of the poll,
+	 * so the TX completion only sets the softirq-pending bit and never wakes
+	 * ksoftirqd. The raised softirq is benign: netpoll reaps
+	 * the freed skbs itself via zap_completion_queue() below, and the
+	 * pending NET_TX softirq is serviced at the next irq_exit().
+	 */
+	local_bh_disable();
+
 	if (ops->ndo_poll_controller)
 		ops->ndo_poll_controller(dev);

 	poll_napi(dev);

+#ifndef CONFIG_PREEMPT_RT
+	/*
+	 * On !PREEMPT_RT all netpoll_poll_dev() callers invoke us with IRQs
+	 * disabled (see the WARN_ONCE() in netpoll_send_skb_on_dev()). Use
+	 * _local_bh_enable(), which leaves the BH-disabled section without
+	 * running pending softirqs inline -- the full local_bh_enable() would
+	 * re-enable IRQs and run softirq handlers deep inside this restricted,
+	 * lock-holding context. The raised NET_TX softirq is benign: netpoll
+	 * reaps the freed skbs itself via zap_completion_queue() below, and the
+	 * pending softirq is serviced at the next irq_exit().
+	 */
+	_local_bh_enable();
+#else
+	/*
+	 * On PREEMPT_RT this path runs with IRQs enabled and softirqs are
+	 * threaded, so there is no IRQ-disabled, lock-holding context to
+	 * protect. _local_bh_enable() is not available on RT, and
+	 * local_bh_disable() there takes the per-CPU softirq_ctrl local_lock
+	 * that only the full local_bh_enable() releases -- so use it.
+	 */
+	local_bh_enable();
+#endif
+
 	up(&ni->dev_lock);

 	zap_completion_queue();
--
2.53.0-Meta