* [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled
@ 2025-07-26 1:08 Jakub Kicinski
2025-07-28 3:20 ` Jason Wang
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: Jakub Kicinski @ 2025-07-26 1:08 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
Jason Wang, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
Eugenio Pérez, leitao, sdf
Paolo spotted hangs in NIPA running driver tests against virtio.
The tests hang in virtnet_close() -> virtnet_napi_tx_disable().
The problem is only reproducible if running multiple of our tests
in sequence (I used TEST_PROGS="xdp.py ping.py netcons_basic.sh \
netpoll_basic.py stats.py"). Initial suspicion was that this is
a simple case of double-disable of NAPI, but instrumenting the
code reveals:
Deadlocked on NAPI ffff888007cd82c0 (virtnet_poll_tx):
state: 0x37, disabled: false, owner: 0, listed: false, weight: 64
The NAPI was not in fact disabled, owner is 0 (rather than -1),
so the NAPI "thinks" it's scheduled for CPU 0 but it's not listed
(!list_empty(&n->poll_list) => false). It seems odd that normal NAPI
processing would wedge itself like this.
Better suspicion is that netpoll gets enabled while NAPI is polling,
and also grabs the NAPI instance. This confuses napi_complete_done():
[netpoll] [normal NAPI]
napi_poll()
have = netpoll_poll_lock()
rcu_access_pointer(dev->npinfo)
return NULL # no netpoll
__napi_poll()
->poll(->weight)
poll_napi()
cmpxchg(->poll_owner, -1, cpu)
poll_one_napi()
set_bit(NAPI_STATE_NPSVC, ->state)
napi_complete_done()
if (NAPIF_STATE_NPSVC)
return false
# exit without clearing SCHED
This feels very unlikely, but perhaps virtio has some interactions
with the hypervisor in the NAPI ->poll that makes the race window
larger?
Best I could to to prove the theory was to add and trigger this
warning in napi_poll (just before netpoll_poll_unlock()):
WARN_ONCE(!have && rcu_access_pointer(n->dev->npinfo) &&
napi_is_scheduled(n) && list_empty(&n->poll_list),
"NAPI race with netpoll %px", n);
If this warning hits the next virtio_close() will hang.
This patch survived 30 test iterations without a hang (without it
the longest clean run was around 10). Credit for triggering this
goes to Breno's recent netconsole tests.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/c5a93ed1-9abe-4880-a3bb-8d1678018b1d@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
v2:
- move the sync to netpoll_setup()
v1: https://lore.kernel.org/20250725024454.690517-1-kuba@kernel.org
CC: Jason Wang <jasowang@redhat.com>
CC: Zigit Zo <zuozhijie@bytedance.com>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
CC: Eugenio Pérez <eperezma@redhat.com>
CC: leitao@debian.org
CC: sdf@fomichev.me
---
net/core/netpoll.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index a1da97b5b30b..5f65b62346d4 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -768,6 +768,13 @@ int netpoll_setup(struct netpoll *np)
if (err)
goto flush;
rtnl_unlock();
+
+ /* Make sure all NAPI polls which started before dev->npinfo
+ * was visible have exited before we start calling NAPI poll.
+ * NAPI skips locking if dev->npinfo is NULL.
+ */
+ synchronize_rcu();
+
return 0;
flush:
--
2.50.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled
2025-07-26 1:08 [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled Jakub Kicinski
@ 2025-07-28 3:20 ` Jason Wang
2025-07-28 3:37 ` Xuan Zhuo
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Jason Wang @ 2025-07-28 3:20 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, Zigit Zo,
Michael S. Tsirkin, Xuan Zhuo, Eugenio Pérez, leitao, sdf
On Sat, Jul 26, 2025 at 9:08 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Paolo spotted hangs in NIPA running driver tests against virtio.
> The tests hang in virtnet_close() -> virtnet_napi_tx_disable().
>
> The problem is only reproducible if running multiple of our tests
> in sequence (I used TEST_PROGS="xdp.py ping.py netcons_basic.sh \
> netpoll_basic.py stats.py"). Initial suspicion was that this is
> a simple case of double-disable of NAPI, but instrumenting the
> code reveals:
>
> Deadlocked on NAPI ffff888007cd82c0 (virtnet_poll_tx):
> state: 0x37, disabled: false, owner: 0, listed: false, weight: 64
>
> The NAPI was not in fact disabled, owner is 0 (rather than -1),
> so the NAPI "thinks" it's scheduled for CPU 0 but it's not listed
> (!list_empty(&n->poll_list) => false). It seems odd that normal NAPI
> processing would wedge itself like this.
>
> Better suspicion is that netpoll gets enabled while NAPI is polling,
> and also grabs the NAPI instance. This confuses napi_complete_done():
>
> [netpoll] [normal NAPI]
> napi_poll()
> have = netpoll_poll_lock()
> rcu_access_pointer(dev->npinfo)
> return NULL # no netpoll
> __napi_poll()
> ->poll(->weight)
> poll_napi()
> cmpxchg(->poll_owner, -1, cpu)
> poll_one_napi()
> set_bit(NAPI_STATE_NPSVC, ->state)
> napi_complete_done()
> if (NAPIF_STATE_NPSVC)
> return false
> # exit without clearing SCHED
>
> This feels very unlikely, but perhaps virtio has some interactions
> with the hypervisor in the NAPI ->poll that makes the race window
> larger?
Somehow, for example KVM can schedule out the vcpu that runs normal
NAPI at this time.
>
> Best I could to to prove the theory was to add and trigger this
> warning in napi_poll (just before netpoll_poll_unlock()):
>
> WARN_ONCE(!have && rcu_access_pointer(n->dev->npinfo) &&
> napi_is_scheduled(n) && list_empty(&n->poll_list),
> "NAPI race with netpoll %px", n);
>
> If this warning hits the next virtio_close() will hang.
>
> This patch survived 30 test iterations without a hang (without it
> the longest clean run was around 10). Credit for triggering this
> goes to Breno's recent netconsole tests.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: Paolo Abeni <pabeni@redhat.com>
> Link: https://lore.kernel.org/c5a93ed1-9abe-4880-a3bb-8d1678018b1d@redhat.com
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> v2:
> - move the sync to netpoll_setup()
> v1: https://lore.kernel.org/20250725024454.690517-1-kuba@kernel.org
>
> CC: Jason Wang <jasowang@redhat.com>
> CC: Zigit Zo <zuozhijie@bytedance.com>
> CC: "Michael S. Tsirkin" <mst@redhat.com>
> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> CC: Eugenio Pérez <eperezma@redhat.com>
>
> CC: leitao@debian.org
> CC: sdf@fomichev.me
> ---
> net/core/netpoll.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index a1da97b5b30b..5f65b62346d4 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -768,6 +768,13 @@ int netpoll_setup(struct netpoll *np)
> if (err)
> goto flush;
> rtnl_unlock();
> +
> + /* Make sure all NAPI polls which started before dev->npinfo
> + * was visible have exited before we start calling NAPI poll.
> + * NAPI skips locking if dev->npinfo is NULL.
> + */
> + synchronize_rcu();
> +
> return 0;
>
> flush:
> --
> 2.50.1
>
Acked-by: Jason Wang <jasowang@redhat.com>
Thanks
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled
2025-07-26 1:08 [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled Jakub Kicinski
2025-07-28 3:20 ` Jason Wang
@ 2025-07-28 3:37 ` Xuan Zhuo
2025-07-28 7:52 ` Breno Leitao
2025-07-31 1:20 ` patchwork-bot+netdevbpf
3 siblings, 0 replies; 5+ messages in thread
From: Xuan Zhuo @ 2025-07-28 3:37 UTC (permalink / raw)
To: Jakub Kicinski
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
Jason Wang, Zigit Zo, Michael S. Tsirkin, Eugenio Pérez,
leitao, sdf, davem
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Thanks.
On Fri, 25 Jul 2025 18:08:46 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> Paolo spotted hangs in NIPA running driver tests against virtio.
> The tests hang in virtnet_close() -> virtnet_napi_tx_disable().
>
> The problem is only reproducible if running multiple of our tests
> in sequence (I used TEST_PROGS="xdp.py ping.py netcons_basic.sh \
> netpoll_basic.py stats.py"). Initial suspicion was that this is
> a simple case of double-disable of NAPI, but instrumenting the
> code reveals:
>
> Deadlocked on NAPI ffff888007cd82c0 (virtnet_poll_tx):
> state: 0x37, disabled: false, owner: 0, listed: false, weight: 64
>
> The NAPI was not in fact disabled, owner is 0 (rather than -1),
> so the NAPI "thinks" it's scheduled for CPU 0 but it's not listed
> (!list_empty(&n->poll_list) => false). It seems odd that normal NAPI
> processing would wedge itself like this.
>
> Better suspicion is that netpoll gets enabled while NAPI is polling,
> and also grabs the NAPI instance. This confuses napi_complete_done():
>
> [netpoll] [normal NAPI]
> napi_poll()
> have = netpoll_poll_lock()
> rcu_access_pointer(dev->npinfo)
> return NULL # no netpoll
> __napi_poll()
> ->poll(->weight)
> poll_napi()
> cmpxchg(->poll_owner, -1, cpu)
> poll_one_napi()
> set_bit(NAPI_STATE_NPSVC, ->state)
> napi_complete_done()
> if (NAPIF_STATE_NPSVC)
> return false
> # exit without clearing SCHED
>
> This feels very unlikely, but perhaps virtio has some interactions
> with the hypervisor in the NAPI ->poll that makes the race window
> larger?
>
> Best I could to to prove the theory was to add and trigger this
> warning in napi_poll (just before netpoll_poll_unlock()):
>
> WARN_ONCE(!have && rcu_access_pointer(n->dev->npinfo) &&
> napi_is_scheduled(n) && list_empty(&n->poll_list),
> "NAPI race with netpoll %px", n);
>
> If this warning hits the next virtio_close() will hang.
>
> This patch survived 30 test iterations without a hang (without it
> the longest clean run was around 10). Credit for triggering this
> goes to Breno's recent netconsole tests.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: Paolo Abeni <pabeni@redhat.com>
> Link: https://lore.kernel.org/c5a93ed1-9abe-4880-a3bb-8d1678018b1d@redhat.com
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> v2:
> - move the sync to netpoll_setup()
> v1: https://lore.kernel.org/20250725024454.690517-1-kuba@kernel.org
>
> CC: Jason Wang <jasowang@redhat.com>
> CC: Zigit Zo <zuozhijie@bytedance.com>
> CC: "Michael S. Tsirkin" <mst@redhat.com>
> CC: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> CC: Eugenio Pérez <eperezma@redhat.com>
>
> CC: leitao@debian.org
> CC: sdf@fomichev.me
> ---
> net/core/netpoll.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index a1da97b5b30b..5f65b62346d4 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -768,6 +768,13 @@ int netpoll_setup(struct netpoll *np)
> if (err)
> goto flush;
> rtnl_unlock();
> +
> + /* Make sure all NAPI polls which started before dev->npinfo
> + * was visible have exited before we start calling NAPI poll.
> + * NAPI skips locking if dev->npinfo is NULL.
> + */
> + synchronize_rcu();
> +
> return 0;
>
> flush:
> --
> 2.50.1
>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled
2025-07-26 1:08 [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled Jakub Kicinski
2025-07-28 3:20 ` Jason Wang
2025-07-28 3:37 ` Xuan Zhuo
@ 2025-07-28 7:52 ` Breno Leitao
2025-07-31 1:20 ` patchwork-bot+netdevbpf
3 siblings, 0 replies; 5+ messages in thread
From: Breno Leitao @ 2025-07-28 7:52 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, Jason Wang,
Zigit Zo, Michael S. Tsirkin, Xuan Zhuo, Eugenio Pérez, sdf
On Fri, Jul 25, 2025 at 06:08:46PM -0700, Jakub Kicinski wrote:
> Paolo spotted hangs in NIPA running driver tests against virtio.
> The tests hang in virtnet_close() -> virtnet_napi_tx_disable().
>
> The problem is only reproducible if running multiple of our tests
> in sequence (I used TEST_PROGS="xdp.py ping.py netcons_basic.sh \
> netpoll_basic.py stats.py"). Initial suspicion was that this is
> a simple case of double-disable of NAPI, but instrumenting the
> code reveals:
>
> Deadlocked on NAPI ffff888007cd82c0 (virtnet_poll_tx):
> state: 0x37, disabled: false, owner: 0, listed: false, weight: 64
>
> The NAPI was not in fact disabled, owner is 0 (rather than -1),
> so the NAPI "thinks" it's scheduled for CPU 0 but it's not listed
> (!list_empty(&n->poll_list) => false). It seems odd that normal NAPI
> processing would wedge itself like this.
>
> Better suspicion is that netpoll gets enabled while NAPI is polling,
> and also grabs the NAPI instance. This confuses napi_complete_done():
>
> [netpoll] [normal NAPI]
> napi_poll()
> have = netpoll_poll_lock()
> rcu_access_pointer(dev->npinfo)
> return NULL # no netpoll
> __napi_poll()
> ->poll(->weight)
> poll_napi()
> cmpxchg(->poll_owner, -1, cpu)
> poll_one_napi()
> set_bit(NAPI_STATE_NPSVC, ->state)
> napi_complete_done()
> if (NAPIF_STATE_NPSVC)
> return false
> # exit without clearing SCHED
>
> This feels very unlikely, but perhaps virtio has some interactions
> with the hypervisor in the NAPI ->poll that makes the race window
> larger?
>
> Best I could to to prove the theory was to add and trigger this
> warning in napi_poll (just before netpoll_poll_unlock()):
>
> WARN_ONCE(!have && rcu_access_pointer(n->dev->npinfo) &&
> napi_is_scheduled(n) && list_empty(&n->poll_list),
> "NAPI race with netpoll %px", n);
>
> If this warning hits the next virtio_close() will hang.
>
> This patch survived 30 test iterations without a hang (without it
> the longest clean run was around 10). Credit for triggering this
> goes to Breno's recent netconsole tests.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: Paolo Abeni <pabeni@redhat.com>
> Link: https://lore.kernel.org/c5a93ed1-9abe-4880-a3bb-8d1678018b1d@redhat.com
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviwed-by: Breno Leitao <leitao@debian.org>
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled
2025-07-26 1:08 [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled Jakub Kicinski
` (2 preceding siblings ...)
2025-07-28 7:52 ` Breno Leitao
@ 2025-07-31 1:20 ` patchwork-bot+netdevbpf
3 siblings, 0 replies; 5+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-07-31 1:20 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, jasowang,
zuozhijie, mst, xuanzhuo, eperezma, leitao, sdf
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 25 Jul 2025 18:08:46 -0700 you wrote:
> Paolo spotted hangs in NIPA running driver tests against virtio.
> The tests hang in virtnet_close() -> virtnet_napi_tx_disable().
>
> The problem is only reproducible if running multiple of our tests
> in sequence (I used TEST_PROGS="xdp.py ping.py netcons_basic.sh \
> netpoll_basic.py stats.py"). Initial suspicion was that this is
> a simple case of double-disable of NAPI, but instrumenting the
> code reveals:
>
> [...]
Here is the summary with links:
- [net,v2] netpoll: prevent hanging NAPI when netcons gets enabled
https://git.kernel.org/netdev/net/c/2da4def0f487
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2025-07-31 1:19 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-26 1:08 [PATCH net v2] netpoll: prevent hanging NAPI when netcons gets enabled Jakub Kicinski
2025-07-28 3:20 ` Jason Wang
2025-07-28 3:37 ` Xuan Zhuo
2025-07-28 7:52 ` Breno Leitao
2025-07-31 1:20 ` patchwork-bot+netdevbpf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).