* virtio_close() stuck on napi_disable_locked()
From: Paolo Abeni @ 2025-07-22 11:00 UTC
To: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez
Cc: netdev@vger.kernel.org
Hi,
The NIPA CI is reporting hangs in the stats.py test, caused by the
virtio_net driver getting stuck at close time.
A sample splat is available here:
https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
AFAICS the issue happens only on debug builds.
My wild guess is something similar to the issue addressed by
commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
but I could not spot anything obvious.
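For reference, the reason close() can hang at all: napi_disable_locked()
waits for any in-flight poll to finish before claiming the instance. A
simplified sketch of that wait loop - from memory, not verbatim upstream
code - looks roughly like this:

  /* close() spins here until the poller drops NAPI_STATE_SCHED; if the
   * bit never clears - e.g. because something keeps re-scheduling the
   * instance - virtio_close() is stuck forever.
   */
  void napi_disable_locked(struct napi_struct *n)
  {
          unsigned long val, new;

          set_bit(NAPI_STATE_DISABLE, &n->state);

          val = READ_ONCE(n->state);
          do {
                  while (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
                          usleep_range(20, 200);  /* poll still running */
                          val = READ_ONCE(n->state);
                  }
                  /* claim the instance so it cannot be scheduled again */
                  new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
          } while (!try_cmpxchg(&n->state, &val, new));

          hrtimer_cancel(&n->timer);
          clear_bit(NAPI_STATE_DISABLE, &n->state);
  }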
Could you please have a look?
Thanks,
Paolo
* Re: virtio_close() stuck on napi_disable_locked()
From: Jakub Kicinski @ 2025-07-22 21:55 UTC
To: Paolo Abeni, Zigit Zo
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
netdev@vger.kernel.org
On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> Hi,
>
> The NIPA CI is reporting hangs in the stats.py test, caused by the
> virtio_net driver getting stuck at close time.
>
> A sample splat is available here:
>
> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>
> AFAICS the issue happens only on debug builds.
>
> My wild guess is something similar to the issue addressed by
> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
> but I could not spot anything obvious.
>
> Could you please have a look?
It only hits in around 1 in 5 runs. Likely some pre-existing race, but
it started popping up for us when be5dcaed694e ("virtio-net: fix
recursived rtnl_lock() during probe()") was merged. It never hit before.
If we can't find a quick fix I think we should revert be5dcaed694e for
now, so that it doesn't end up regressing 6.16 final.
* Re: virtio_close() stuck on napi_disable_locked()
From: Jason Wang @ 2025-07-23 5:14 UTC
To: Jakub Kicinski
Cc: Paolo Abeni, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
Eugenio Pérez, netdev@vger.kernel.org
On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> > Hi,
> >
> > The NIPA CI is reporting hangs in the stats.py test, caused by the
> > virtio_net driver getting stuck at close time.
> >
> > A sample splat is available here:
> >
> > https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
> >
> > AFAICS the issue happens only on debug builds.
> >
> > My wild guess is something similar to the issue addressed by
> > commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
> > but I could not spot anything obvious.
> >
> > Could you please have a look?
>
> It only hits in around 1 in 5 runs.
I tried to reproduce this locally but failed. Where can I see the qemu
command line for the VM?
> Likely some pre-existing race, but
> it started popping up for us when be5dcaed694e ("virtio-net: fix
> recursived rtnl_lock() during probe()") was merged.
Probably, but I didn't see a direct connection with that commit. It
looks like the root cause is napi_disable() looping forever for some
reason, as Paolo said.
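To illustrate the kind of loop I mean - a purely conceptual sketch, not
the actual virtio_net code, with function names simplified - a second
napi_disable() on an already-disabled NAPI spins until a matching
napi_enable(), which may never come:

  /* conceptual sketch only, names are illustrative */

  /* close()/pause path: takes "ownership" of the NAPI */
  static void rx_pause(struct receive_queue *rq)
  {
          napi_disable(&rq->napi);
          /* ... */
  }

  /* deferred work that may still be queued at this point */
  static void refill_work_fn(struct receive_queue *rq)
  {
          napi_disable(&rq->napi);        /* already disabled above: this
                                           * spins forever, only a matching
                                           * napi_enable() releases it */
          /* ... refill rx buffers ... */
          napi_enable(&rq->napi);
  }

Something along those lines on the rx side is, AFAIU, what the commit
Paolo referenced addressed.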
> It never hit before.
> If we can't find a quick fix I think we should revert be5dcaed694e for
> now, so that it doesn't end up regressing 6.16 final.
>
Thanks
* Re: virtio_close() stuck on napi_disable_locked()
From: Jakub Kicinski @ 2025-07-23 13:51 UTC
To: Jason Wang
Cc: Paolo Abeni, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
Eugenio Pérez, netdev@vger.kernel.org
On Wed, 23 Jul 2025 13:14:38 +0800 Jason Wang wrote:
> > It only hits in around 1 in 5 runs.
>
> I tried to reproduce this locally but failed. Where can I see the qemu
> command line for the VM?
Please see:
https://github.com/linux-netdev/nipa/wiki/Running-driver-tests-on-virtio
* Re: virtio_close() stuck on napi_disable_locked()
From: Paolo Abeni @ 2025-07-24 8:43 UTC
To: Jason Wang, Jakub Kicinski
Cc: Zigit Zo, Michael S. Tsirkin, Xuan Zhuo, Eugenio Pérez,
netdev@vger.kernel.org
On 7/23/25 7:14 AM, Jason Wang wrote:
> On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>> The NIPA CI is reporting hangs in the stats.py test, caused by the
>>> virtio_net driver getting stuck at close time.
>>>
>>> A sample splat is available here:
>>>
>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>
>>> AFAICS the issue happens only on debug builds.
>>>
>>> My wild guess is something similar to the issue addressed by
>>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
>>> but I could not spot anything obvious.
>>>
>>> Could you please have a look?
>>
>> It only hits in around 1 in 5 runs.
>
> I tried to reproduce this locally but failed. Where can I see the qemu
> command line for the VM?
>
>> Likely some pre-existing race, but
>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>> recursived rtnl_lock() during probe()") was merged.
>
> Probably, but I didn't see a direct connection with that commit. It
> looks like the root cause is napi_disable() looping forever for some
> reason, as Paolo said.
>
>> It never hit before.
>> If we can't find a quick fix I think we should revert be5dcaed694e for
>> now, so that it doesn't end up regressing 6.16 final.
I tried hard to reproduce the issue locally - to validate a possible
revert before pushing it. But so far I have failed quite miserably.
Given the above, I would avoid the revert and just mention the known
issue in the net PR to Linus.
/P
* Re: virtio_close() stuck on napi_disable_locked()
From: Zigit Zo @ 2025-07-24 8:58 UTC
To: Jakub Kicinski, Paolo Abeni
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
netdev@vger.kernel.org
On 7/23/25 5:55 AM, Jakub Kicinski wrote:
> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>> Hi,
>>
>> The NIPA CI is reporting hangs in the stats.py test, caused by the
>> virtio_net driver getting stuck at close time.
>>
>> A sample splat is available here:
>>
>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>
>> AFAICS the issue happens only on debug builds.
>>
>> My wild guess is something similar to the issue addressed by
>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
>> but I could not spot anything obvious.
>>
>> Could you please have a look?
>
> It only hits in around 1 in 5 runs. Likely some pre-existing race, but
> it started popping up for us when be5dcaed694e ("virtio-net: fix
> recursived rtnl_lock() during probe()") was merged. It never hit before.
> If we can't find a quick fix I think we should revert be5dcaed694e for
> now, so that it doesn't end up regressing 6.16 final.
I just noticed that there is a new test script, `netpoll_basic.py`. Before
run 209441 this test did not exist at all; I randomly picked some test
results prior to its introduction and did not find any hang logs. Could it
be relevant?
Regards,
* Re: virtio_close() stuck on napi_disable_locked()
From: Paolo Abeni @ 2025-07-24 9:18 UTC
To: Zigit Zo, Jakub Kicinski
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
netdev@vger.kernel.org
On 7/24/25 10:58 AM, Zigit Zo wrote:
> On 7/23/25 5:55 AM, Jakub Kicinski wrote:
>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>> The NIPA CI is reporting hangs in the stats.py test, caused by the
>>> virtio_net driver getting stuck at close time.
>>>
>>> A sample splat is available here:
>>>
>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>
>>> AFAICS the issue happens only on debug builds.
>>>
>>> My wild guess is something similar to the issue addressed by
>>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
>>> but I could not spot anything obvious.
>>>
>>> Could you please have a look?
>>
>> It only hits in around 1 in 5 runs. Likely some pre-existing race, but
>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>> recursived rtnl_lock() during probe()") was merged. It never hit before.
>> If we can't find a quick fix I think we should revert be5dcaed694e for
>> now, so that it doesn't end up regressing 6.16 final.
>
> I just noticed that there is a new test script, `netpoll_basic.py`. Before
> run 209441 this test did not exist at all; I randomly picked some test
> results prior to its introduction and did not find any hang logs. Could it
> be relevant?
If I read the nipa configuration correctly, it executes the tests in
sequence. `netpoll_basic.py` runs just before `stats.py`, so it could
cause the failure if it leaves some inconsistent state behind - though
not in a deterministic way, as the issue is sporadic.
Technically possible, but IMHO unlikely. Still, it would explain why I
can't reproduce the issue here. With unlimited time available it would
be worth checking whether the issue can be reproduced by running the
two tests in sequence.
/P
* Re: virtio_close() stuck on napi_disable_locked()
From: Jason Wang @ 2025-07-24 10:53 UTC
To: Paolo Abeni
Cc: Jakub Kicinski, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
Eugenio Pérez, netdev@vger.kernel.org
On Thu, Jul 24, 2025 at 4:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 7/23/25 7:14 AM, Jason Wang wrote:
> > On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> >>> The NIPA CI is reporting hangs in the stats.py test, caused by the
> >>> virtio_net driver getting stuck at close time.
> >>>
> >>> A sample splat is available here:
> >>>
> >>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
> >>>
> >>> AFAICS the issue happens only on debug builds.
> >>>
> >>> My wild guess is something similar to the issue addressed by
> >>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
> >>> but I could not spot anything obvious.
> >>>
> >>> Could you please have a look?
> >>
> >> It only hits in around 1 in 5 runs.
> >
> > I tried to reproduce this locally but failed. Where can I see the qemu
> > command line for the VM?
> >
> >> Likely some pre-existing race, but
> >> it started popping up for us when be5dcaed694e ("virtio-net: fix
> >> recursived rtnl_lock() during probe()") was merged.
> >
> > Probably, but I didn't see a direct connection with that commit. It
> > looks like the root cause is napi_disable() looping forever for some
> > reason, as Paolo said.
> >
> >> It never hit before.
> >> If we can't find a quick fix I think we should revert be5dcaed694e for
> >> now, so that it doesn't end up regressing 6.16 final.
>
> I tried hard to reproduce the issue locally - to validate a possible
> revert before pushing it. But so far I have failed quite miserably.
>
I've also tried to follow the nipa instructions to set up two virtio
devices and connect the relevant taps to a bridge on the host, but I
failed to reproduce it locally after several hours.
Is there a log of the nipa test execution from which we can get more
information, like:
1) full qemu command line
2) host kernel version
3) Qemu version
> Given the above, I would avoid the revert and just mention the known
> issue in the net PR to Linus.
+1
Thanks
>
> /P
>
* Re: virtio_close() stuck on napi_disable_locked()
From: Paolo Abeni @ 2025-07-24 11:29 UTC
To: Jason Wang
Cc: Jakub Kicinski, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
Eugenio Pérez, netdev@vger.kernel.org
On 7/24/25 12:53 PM, Jason Wang wrote:
> On Thu, Jul 24, 2025 at 4:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 7/23/25 7:14 AM, Jason Wang wrote:
>>> On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>>>> The NIPA CI is reporting hangs in the stats.py test, caused by the
>>>>> virtio_net driver getting stuck at close time.
>>>>>
>>>>> A sample splat is available here:
>>>>>
>>>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>>>
>>>>> AFAICS the issue happens only on debug builds.
>>>>>
>>>>> My wild guess is something similar to the issue addressed by
>>>>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
>>>>> but I could not spot anything obvious.
>>>>>
>>>>> Could you please have a look?
>>>>
>>>> It only hits in around 1 in 5 runs.
>>>
>>> I tried to reproduce this locally but failed. Where can I see the qemu
>>> command line for the VM?
>>>
>>>> Likely some pre-existing race, but
>>>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>>>> recursived rtnl_lock() during probe()") was merged.
>>>
>>> Probably, but I didn't see a direct connection with that commit. It
>>> looks like the root cause is napi_disable() looping forever for some
>>> reason, as Paolo said.
>>>
>>>> It never hit before.
>>>> If we can't find a quick fix I think we should revert be5dcaed694e for
>>>> now, so that it doesn't end up regressing 6.16 final.
>>
>> I tried hard to reproduce the issue locally - to validate a possible
>> revert before pushing it. But so far I have failed quite miserably.
>>
>
> I've also tried to follow the nipa instructions to set up two virtio
> devices and connect the relevant taps to a bridge on the host, but I
> failed to reproduce it locally after several hours.
>
> Is there a log of the nipa test execution from which we can get more
> information, like:
>
> 1) full qemu command line
I guess it could depend on the vng version; here I'm getting:
qemu-system-x86_64 -name virtme-ng -m 1G
  -chardev socket,id=charvirtfs5,path=/tmp/virtmebyfqshp5
  -device vhost-user-fs-device,chardev=charvirtfs5,tag=ROOTFS
  -object memory-backend-memfd,id=mem,size=1G,share=on
  -numa node,memdev=mem
  -machine accel=kvm:tcg -M microvm,accel=kvm,pcie=on,rtc=on -cpu host
  -parallel none -net none -echr 1
  -chardev file,path=/proc/self/fd/2,id=dmesg
  -device virtio-serial-device -device virtconsole,chardev=dmesg
  -chardev stdio,id=console,signal=off,mux=on
  -serial chardev:console -mon chardev=console -vga none -display none
  -smp 4 -kernel ./arch/x86/boot/bzImage
  -append virtme_hostname=virtme-ng nr_open=1073741816
    virtme_link_mods=/data/net-next/.virtme_mods/lib/modules/0.0.0
    virtme_rw_overlay0=/etc virtme_rw_overlay1=/lib virtme_rw_overlay2=/home
    virtme_rw_overlay3=/opt virtme_rw_overlay4=/srv virtme_rw_overlay5=/usr
    virtme_rw_overlay6=/var virtme_rw_overlay7=/tmp console=hvc0
    earlyprintk=serial,ttyS0,115200 virtme_console=ttyS0 psmouse.proto=exps
    "virtme_stty_con=rows 32 cols 136 iutf8" TERM=xterm-256color
    virtme_chdir=data/net-next virtme_root_user=1 rootfstype=virtiofs
    root=ROOTFS raid=noautodetect ro debug
    init=/usr/lib/python3.13/site-packages/virtme/guest/bin/virtme-ng-init
  -device virtio-net-pci,netdev=n0,iommu_platform=on,disable-legacy=on,mq=on,vectors=18
  -netdev tap,id=n0,ifname=tap0,vhost=on,script=no,downscript=no,queues=8
  -device virtio-net-pci,netdev=n1,iommu_platform=on,disable-legacy=on,mq=on,vectors=18
  -netdev tap,id=n1,ifname=tap1,vhost=on,script=no,downscript=no,queues=8
I guess the significant part is: ' -smp 4 -m 1G'. The networking bits
are the verbatim configuration from the wiki.
> 2) host kernel version
I can give a reasonably sure answer only for this point; the kernel is
'current' net-next with 'current' net merged in and all the patches
pending on patchwork merged in, too.
For a given test iteration nipa provides a snapshot of the patches
merged in, e.g. for the tests run on 2025/07/24 at 00:00 see:
https://netdev.bots.linux.dev/static/nipa/branch_deltas/net-next-hw-2025-07-24--00-00.html
> 3) Qemu version
It should be a stock, recent Ubuntu build.
/P