* virtio_close() stuck on napi_disable_locked()
@ 2025-07-22 11:00 Paolo Abeni
  2025-07-22 21:55 ` Jakub Kicinski
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Abeni @ 2025-07-22 11:00 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez
  Cc: netdev@vger.kernel.org

Hi,

The NIPA CI is reporting hangs in the stats.py test, caused by the
virtio_net driver getting stuck at close time.

A sample splat is available here:

https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr

AFAICS the issue happens only on debug builds.

My wild guess is something similar to the issue addressed by
commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
but I could not spot anything obvious.
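
FTR, napi_disable_locked() simply busy-waits (with short sleeps) until
the in-flight poller releases the NAPI instance, so a queue whose
SCHED/NPSVC state bits never clear will hang close() forever. A
simplified sketch of the wait loop, paraphrased from net/core/dev.c
(not verbatim, details vary by kernel version):

void napi_disable_locked(struct napi_struct *n)
{
	unsigned long val, new;

	set_bit(NAPI_STATE_DISABLE, &n->state);

	val = READ_ONCE(n->state);
	do {
		/* Wait for any in-flight poller to finish; this is
		 * where the task in the splat is stuck spinning. */
		while (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
			usleep_range(20, 200);
			val = READ_ONCE(n->state);
		}
		/* Grab SCHED|NPSVC ourselves so nobody else can. */
		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
	} while (!try_cmpxchg(&n->state, &val, new));

	hrtimer_cancel(&n->timer);
	clear_bit(NAPI_STATE_DISABLE, &n->state);
}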

Could you please have a look?

Thanks,

Paolo



* Re: virtio_close() stuck on napi_disable_locked()
  2025-07-22 11:00 virtio_close() stuck on napi_disable_locked() Paolo Abeni
@ 2025-07-22 21:55 ` Jakub Kicinski
  2025-07-23  5:14   ` Jason Wang
  2025-07-24  8:58   ` Zigit Zo
  0 siblings, 2 replies; 9+ messages in thread
From: Jakub Kicinski @ 2025-07-22 21:55 UTC (permalink / raw)
  To: Paolo Abeni, Zigit Zo
  Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	netdev@vger.kernel.org

On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> Hi,
> 
> The NIPA CI is reporting hangs in the stats.py test, caused by the
> virtio_net driver getting stuck at close time.
> 
> A sample splat is available here:
> 
> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
> 
> AFAICS the issue happens only on debug builds.
> 
> My wild guess is something similar to the issue addressed by
> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
> but I could not spot anything obvious.
> 
> Could you please have a look?

It only hits in around 1 in 5 runs. Likely some pre-existing race, but
it started popping up for us when be5dcaed694e ("virtio-net: fix
recursived rtnl_lock() during probe()") was merged. It never hit before.
If we can't find a quick fix I think we should revert be5dcaed694e for
now, so that it doesn't end up regressing 6.16 final.


* Re: virtio_close() stuck on napi_disable_locked()
  2025-07-22 21:55 ` Jakub Kicinski
@ 2025-07-23  5:14   ` Jason Wang
  2025-07-23 13:51     ` Jakub Kicinski
  2025-07-24  8:43     ` Paolo Abeni
  2025-07-24  8:58   ` Zigit Zo
  1 sibling, 2 replies; 9+ messages in thread
From: Jason Wang @ 2025-07-23  5:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paolo Abeni, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
	Eugenio Pérez, netdev@vger.kernel.org

On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> > Hi,
> >
> > The NIPA CI is reporting hangs in the stats.py test, caused by the
> > virtio_net driver getting stuck at close time.
> >
> > A sample splat is available here:
> >
> > https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
> >
> > AFAICS the issue happens only on debug builds.
> >
> > My wild guess is something similar to the issue addressed by
> > commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
> > but I could not spot anything obvious.
> >
> > Could you please have a look?
>
> It only hits in around 1 in 5 runs.

I tried to reproduce this locally but failed. Where can I see the qemu
command line for the VM?

> Likely some pre-existing race, but
> it started popping up for us when be5dcaed694e ("virtio-net: fix
> recursived rtnl_lock() during probe()") was merged.

Probably, but I didn't see a direct connection with that commit. It
looks like the root cause is napi_disable() looping forever for some
reason, as Paolo said.
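
To be more specific: virtnet_close() walks every queue pair and
disables both rx and tx NAPIs, so a single queue whose poller never
yields is enough to wedge the whole close(). Roughly, paraphrased
from drivers/net/virtio_net.c (the exact helpers differ across
versions):

static int virtnet_close(struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);
	int i;

	/* Make sure refill_work doesn't re-enable napi! */
	cancel_delayed_work_sync(&vi->refill);

	for (i = 0; i < vi->max_queue_pairs; i++)
		/* rx + tx napi_disable(); either one can get stuck */
		virtnet_disable_queue_pair(vi, i);

	return 0;
}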

> It never hit before.
> If we can't find a quick fix I think we should revert be5dcaed694e for
> now, so that it doesn't end up regressing 6.16 final.
>

Thanks



* Re: virtio_close() stuck on napi_disable_locked()
  2025-07-23  5:14   ` Jason Wang
@ 2025-07-23 13:51     ` Jakub Kicinski
  2025-07-24  8:43     ` Paolo Abeni
  1 sibling, 0 replies; 9+ messages in thread
From: Jakub Kicinski @ 2025-07-23 13:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: Paolo Abeni, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
	Eugenio Pérez, netdev@vger.kernel.org

On Wed, 23 Jul 2025 13:14:38 +0800 Jason Wang wrote:
> > It only hits in around 1 in 5 runs.  
> 
> I tried to reproduce this locally but failed. Where can I see the qemu
> command line for the VM?

Please see:
https://github.com/linux-netdev/nipa/wiki/Running-driver-tests-on-virtio


* Re: virtio_close() stuck on napi_disable_locked()
  2025-07-23  5:14   ` Jason Wang
  2025-07-23 13:51     ` Jakub Kicinski
@ 2025-07-24  8:43     ` Paolo Abeni
  2025-07-24 10:53       ` Jason Wang
  1 sibling, 1 reply; 9+ messages in thread
From: Paolo Abeni @ 2025-07-24  8:43 UTC (permalink / raw)
  To: Jason Wang, Jakub Kicinski
  Cc: Zigit Zo, Michael S. Tsirkin, Xuan Zhuo, Eugenio Pérez,
	netdev@vger.kernel.org

On 7/23/25 7:14 AM, Jason Wang wrote:
> On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>> The NIPA CI is reporting hangs in the stats.py test, caused by the
>>> virtio_net driver getting stuck at close time.
>>>
>>> A sample splat is available here:
>>>
>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>
>>> AFAICS the issue happens only on debug builds.
>>>
>>> My wild guess is something similar to the issue addressed by
>>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
>>> but I could not spot anything obvious.
>>>
>>> Could you please have a look?
>>
>> It only hits in around 1 in 5 runs.
> 
> I tried to reproduce this locally but failed. Where can I see the qemu
> command line for the VM?
> 
>> Likely some pre-existing race, but
>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>> recursived rtnl_lock() during probe()") was merged.
> 
> Probably, but I didn't see a direct connection with that commit. It
> looks like the root cause is napi_disable() looping forever for some
> reason, as Paolo said.
> 
>> It never hit before.
>> If we can't find a quick fix I think we should revert be5dcaed694e for
>> now, so that it doesn't end up regressing 6.16 final.

I tried hard to reproduce the issue locally, to validate a possible
revert before pushing it, but so far I have failed quite miserably.

Given the above, I would avoid the revert and just mention the known
issue in the net PR to Linus.

/P



* Re: virtio_close() stuck on napi_disable_locked()
  2025-07-22 21:55 ` Jakub Kicinski
  2025-07-23  5:14   ` Jason Wang
@ 2025-07-24  8:58   ` Zigit Zo
  2025-07-24  9:18     ` Paolo Abeni
  1 sibling, 1 reply; 9+ messages in thread
From: Zigit Zo @ 2025-07-24  8:58 UTC (permalink / raw)
  To: Jakub Kicinski, Paolo Abeni
  Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	netdev@vger.kernel.org

On 7/23/25 5:55 AM, Jakub Kicinski wrote:
> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>> Hi,
>>
>> The NIPA CI is reporting hangs in the stats.py test, caused by the
>> virtio_net driver getting stuck at close time.
>>
>> A sample splat is available here:
>>
>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>
>> AFAICS the issue happens only on debug builds.
>>
>> My wild guess is something similar to the issue addressed by
>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
>> but I could not spot anything obvious.
>>
>> Could you please have a look?
> 
> It only hits in around 1 in 5 runs. Likely some pre-existing race, but
> it started popping up for us when be5dcaed694e ("virtio-net: fix
> recursived rtnl_lock() during probe()") was merged. It never hit before.
> If we can't find a quick fix I think we should revert be5dcaed694e for
> now, so that it doesn't end up regressing 6.16 final.

I just noticed that there's a new test script, `netpoll_basic.py`. Before
run 209441 this test did not exist at all; I randomly picked some test
results prior to its introduction and did not find any hang logs. Could
it be relevant?

Regards,


* Re: virtio_close() stuck on napi_disable_locked()
  2025-07-24  8:58   ` Zigit Zo
@ 2025-07-24  9:18     ` Paolo Abeni
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2025-07-24  9:18 UTC (permalink / raw)
  To: Zigit Zo, Jakub Kicinski
  Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	netdev@vger.kernel.org

On 7/24/25 10:58 AM, Zigit Zo wrote:
> On 7/23/25 5:55 AM, Jakub Kicinski wrote:
>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>> The NIPA CI is reporting hangs in the stats.py test, caused by the
>>> virtio_net driver getting stuck at close time.
>>>
>>> A sample splat is available here:
>>>
>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>
>>> AFAICS the issue happens only on debug builds.
>>>
>>> My wild guess is something similar to the issue addressed by
>>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
>>> but I could not spot anything obvious.
>>>
>>> Could you please have a look?
>>
>> It only hits in around 1 in 5 runs. Likely some pre-existing race, but
>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>> recursived rtnl_lock() during probe()") was merged. It never hit before.
>> If we can't find a quick fix I think we should revert be5dcaed694e for
>> now, so that it doesn't end up regressing 6.16 final.
> 
> I just noticed that there's a new test script, `netpoll_basic.py`. Before
> run 209441 this test did not exist at all; I randomly picked some test
> results prior to its introduction and did not find any hang logs. Could
> it be relevant?

If I read the NIPA configuration correctly, it executes the tests in
sequence. `netpoll_basic.py` runs just before `stats.py`, so it could
cause a failure if it leaves some inconsistent state behind - but not
in a deterministic way, as the issue is sporadic.

Technically possible, but IMHO unlikely. Still, it would explain why I
can't repro the issue here. With unlimited time available, it would be
worth validating whether the issue can be reproduced by running the two
tests in sequence.
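
FWIW, the state netpoll touches is exactly what napi_disable() waits
on: poll_one_napi() takes and releases NAPI_STATE_NPSVC around the
poll callback. A rough sketch, paraphrased from net/core/netpoll.c
(simplified, version-dependent):

static void poll_one_napi(struct napi_struct *napi)
{
	/* Bail out if the instance is already being serviced or
	 * disabled; NPSVC is one of the bits the napi_disable()
	 * wait loop spins on. */
	if (test_and_set_bit(NAPI_STATE_NPSVC, &napi->state))
		return;

	napi->poll(napi, 0);	/* budget 0: tx completions only */

	clear_bit(NAPI_STATE_NPSVC, &napi->state);
}

So if the netpoll test somehow left NPSVC or SCHED set on one of the
virtio queues, the later stats.py close() would spin exactly as in the
splat - speculation on my part, to be clear.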

/P



* Re: virtio_close() stuck on napi_disable_locked()
  2025-07-24  8:43     ` Paolo Abeni
@ 2025-07-24 10:53       ` Jason Wang
  2025-07-24 11:29         ` Paolo Abeni
  0 siblings, 1 reply; 9+ messages in thread
From: Jason Wang @ 2025-07-24 10:53 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Jakub Kicinski, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
	Eugenio Pérez, netdev@vger.kernel.org

On Thu, Jul 24, 2025 at 4:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 7/23/25 7:14 AM, Jason Wang wrote:
> > On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> >>> The NIPA CI is reporting hangs in the stats.py test, caused by the
> >>> virtio_net driver getting stuck at close time.
> >>>
> >>> A sample splat is available here:
> >>>
> >>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
> >>>
> >>> AFAICS the issue happens only on debug builds.
> >>>
> >>> My wild guess is something similar to the issue addressed by
> >>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
> >>> but I could not spot anything obvious.
> >>>
> >>> Could you please have a look?
> >>
> >> It only hits in around 1 in 5 runs.
> >
> > I tried to reproduce this locally but failed. Where can I see the qemu
> > command line for the VM?
> >
> >> Likely some pre-existing race, but
> >> it started popping up for us when be5dcaed694e ("virtio-net: fix
> >> recursived rtnl_lock() during probe()") was merged.
> >
> > Probably, but I didn't see a direct connection with that commit. It
> > looks like the root cause is napi_disable() looping forever for some
> > reason, as Paolo said.
> >
> >> It never hit before.
> >> If we can't find a quick fix I think we should revert be5dcaed694e for
> >> now, so that it doesn't end up regressing 6.16 final.
>
> I tried hard to reproduce the issue locally, to validate a possible
> revert before pushing it, but so far I have failed quite miserably.
>

I've also tried to follow the NIPA instructions to set up 2 virtio
devices and connect the relevant taps with a bridge on the host, but I
failed to reproduce it locally after several hours.

Is there a log of the NIPA test execution where we can find more
information, like:

1) full qemu command line
2) host kernel version
3) QEMU version

> Given the above, I would avoid the revert and just mention the known
> issue in the net PR to Linus.

+1

Thanks

>
> /P
>



* Re: virtio_close() stuck on napi_disable_locked()
  2025-07-24 10:53       ` Jason Wang
@ 2025-07-24 11:29         ` Paolo Abeni
  0 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2025-07-24 11:29 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jakub Kicinski, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
	Eugenio Pérez, netdev@vger.kernel.org

On 7/24/25 12:53 PM, Jason Wang wrote:
> On Thu, Jul 24, 2025 at 4:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 7/23/25 7:14 AM, Jason Wang wrote:
>>> On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>>>> The NIPA CI is reporting hangs in the stats.py test, caused by the
>>>>> virtio_net driver getting stuck at close time.
>>>>>
>>>>> A sample splat is available here:
>>>>>
>>>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>>>
>>>>> AFAICS the issue happens only on debug builds.
>>>>>
>>>>> My wild guess is something similar to the issue addressed by
>>>>> commit 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi,
>>>>> but I could not spot anything obvious.
>>>>>
>>>>> Could you please have a look?
>>>>
>>>> It only hits in around 1 in 5 runs.
>>>
>>> I tried to reproduce this locally but failed. Where can I see the qemu
>>> command line for the VM?
>>>
>>>> Likely some pre-existing race, but
>>>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>>>> recursived rtnl_lock() during probe()") was merged.
>>>
>>> Probably, but I didn't see a direct connection with that commit. It
>>> looks like the root cause is napi_disable() looping forever for some
>>> reason, as Paolo said.
>>>
>>>> It never hit before.
>>>> If we can't find a quick fix I think we should revert be5dcaed694e for
>>>> now, so that it doesn't end up regressing 6.16 final.
>>
>> I tried hard to reproduce the issue locally, to validate a possible
>> revert before pushing it, but so far I have failed quite miserably.
>>
> 
> I've also tried to follow the NIPA instructions to set up 2 virtio
> devices and connect the relevant taps with a bridge on the host, but I
> failed to reproduce it locally after several hours.
> 
> Is there a log of the NIPA test execution where we can find more
> information, like:
> 
> 1) full qemu command line

I guess it could depend on the vng version; here I'm getting:

qemu-system-x86_64 -name virtme-ng -m 1G -chardev
socket,id=charvirtfs5,path=/tmp/virtmebyfqshp5 -device
vhost-user-fs-device,chardev=charvirtfs5,tag=ROOTFS -object
memory-backend-memfd,id=mem,size=1G,share=on -numa node,memdev=mem
-machine accel=kvm:tcg -M microvm,accel=kvm,pcie=on,rtc=on -cpu host
-parallel none -net none -echr 1 -chardev
file,path=/proc/self/fd/2,id=dmesg -device virtio-serial-device -device
virtconsole,chardev=dmesg -chardev stdio,id=console,signal=off,mux=on
-serial chardev:console -mon chardev=console -vga none -display none
-smp 4 -kernel ./arch/x86/boot/bzImage -append virtme_hostname=virtme-ng
nr_open=1073741816
virtme_link_mods=/data/net-next/.virtme_mods/lib/modules/0.0.0
virtme_rw_overlay0=/etc virtme_rw_overlay1=/lib virtme_rw_overlay2=/home
virtme_rw_overlay3=/opt virtme_rw_overlay4=/srv virtme_rw_overlay5=/usr
virtme_rw_overlay6=/var virtme_rw_overlay7=/tmp console=hvc0
earlyprintk=serial,ttyS0,115200 virtme_console=ttyS0 psmouse.proto=exps
"virtme_stty_con=rows 32 cols 136 iutf8" TERM=xterm-256color
virtme_chdir=data/net-next virtme_root_user=1 rootfstype=virtiofs
root=ROOTFS raid=noautodetect ro debug
init=/usr/lib/python3.13/site-packages/virtme/guest/bin/virtme-ng-init
-device
virtio-net-pci,netdev=n0,iommu_platform=on,disable-legacy=on,mq=on,vectors=18
-netdev tap,id=n0,ifname=tap0,vhost=on,script=no,downscript=no,queues=8
-device
virtio-net-pci,netdev=n1,iommu_platform=on,disable-legacy=on,mq=on,vectors=18
-netdev tap,id=n1,ifname=tap1,vhost=on,script=no,downscript=no,queues=8

I guess the significant part is: ' -smp 4 -m 1G'. The networking bits
are the verbatim configuration from the wiki.

> 2) host kernel version

I can give a reasonably sure answer only for this point: the kernel is
'current' net-next with 'current' net merged in, plus all the patches
pending on patchwork.

For a given test iteration NIPA provides a snapshot of the patches
merged in, e.g. for the tests run on 2025/07/24 at 00:00 see:

https://netdev.bots.linux.dev/static/nipa/branch_deltas/net-next-hw-2025-07-24--00-00.html

> 3) QEMU version

It should be a stock, recent Ubuntu build.

/P


