* virtio_close() stuck on napi_disable_locked()

From: Paolo Abeni @ 2025-07-22 11:00 UTC
To: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez
Cc: netdev@vger.kernel.org

Hi,

The NIPA CI is reporting hangs in the stats.py test caused by the
virtio_net driver getting stuck at close time.

A sample splat is available here:

https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr

AFAICS the issue happens only on debug builds.

My wild guess is something similar to the issue addressed by commit
4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi, but I
could not spot anything obvious.

Could you please have a look?

Thanks,

Paolo
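For context, "stuck" here means the close path never gets past the wait loop
in napi_disable_locked(): it sleeps until NAPI_STATE_SCHED / NAPI_STATE_NPSVC
clear, so if the NAPI instance keeps being owned or re-scheduled, virtio_close()
hangs forever. A condensed sketch of that loop (simplified from net/core/dev.c;
the exact code in the tree under test may differ):

/* Condensed sketch, not the verbatim kernel source: close sleeps here
 * until nothing owns the NAPI instance anymore. If some path keeps
 * NAPI_STATE_SCHED set - e.g. a poll that never completes, or a stray
 * re-schedule after disabling started - this loop never exits, which is
 * what the splat above shows for virtio_close().
 */
void napi_disable_locked(struct napi_struct *n)
{
	unsigned long val, new;

	might_sleep();
	set_bit(NAPI_STATE_DISABLE, &n->state);

	val = READ_ONCE(n->state);
	do {
		while (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) {
			usleep_range(20, 200);	/* the loop the task is stuck in */
			val = READ_ONCE(n->state);
		}
		new = val | NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;
		new &= ~(NAPIF_STATE_THREADED | NAPIF_STATE_PREFER_BUSY_POLL);
	} while (!try_cmpxchg(&n->state, &val, new));

	hrtimer_cancel(&n->timer);
	clear_bit(NAPI_STATE_DISABLE, &n->state);
}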
* Re: virtio_close() stuck on napi_disable_locked()

From: Jakub Kicinski @ 2025-07-22 21:55 UTC
To: Paolo Abeni, Zigit Zo
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
    netdev@vger.kernel.org

On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> Hi,
>
> The NIPA CI is reporting hangs in the stats.py test caused by the
> virtio_net driver getting stuck at close time.
>
> A sample splat is available here:
>
> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>
> AFAICS the issue happens only on debug builds.
>
> My wild guess is something similar to the issue addressed by commit
> 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi, but I
> could not spot anything obvious.
>
> Could you please have a look?

It only hits in around 1 in 5 runs. Likely some pre-existing race, but
it started popping up for us when be5dcaed694e ("virtio-net: fix
recursived rtnl_lock() during probe()") was merged. It never hit before.
If we can't find a quick fix I think we should revert be5dcaed694e for
now, so that it doesn't end up regressing 6.16 final.
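Should the revert route be taken, the mechanical step would presumably be the
usual one-liner on top of the net tree (illustrative only; whether it still
reverts cleanly against current net was not checked in the thread):

# Hypothetical quick A/B check for the CI: drop the suspected commit.
git revert be5dcaed694e    # "virtio-net: fix recursived rtnl_lock() during probe()"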
* Re: virtio_close() stuck on napi_disable_locked()

From: Jason Wang @ 2025-07-23 5:14 UTC
To: Jakub Kicinski
Cc: Paolo Abeni, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
    Eugenio Pérez, netdev@vger.kernel.org

On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> > Hi,
> >
> > The NIPA CI is reporting hangs in the stats.py test caused by the
> > virtio_net driver getting stuck at close time.
> >
> > A sample splat is available here:
> >
> > https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
> >
> > AFAICS the issue happens only on debug builds.
> >
> > My wild guess is something similar to the issue addressed by commit
> > 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi, but I
> > could not spot anything obvious.
> >
> > Could you please have a look?
>
> It only hits in around 1 in 5 runs.

I tried to reproduce this locally but failed. Where can I see the qemu
command line for the VM?

> Likely some pre-existing race, but
> it started popping up for us when be5dcaed694e ("virtio-net: fix
> recursived rtnl_lock() during probe()") was merged.

Probably, but I didn't see a direct connection with that commit. It
looks like the root cause is napi_disable() looping forever for some
reason, as Paolo said.

> It never hit before.
> If we can't find a quick fix I think we should revert be5dcaed694e for
> now, so that it doesn't end up regressing 6.16 final.

Thanks
* Re: virtio_close() stuck on napi_disable_locked()

From: Jakub Kicinski @ 2025-07-23 13:51 UTC
To: Jason Wang
Cc: Paolo Abeni, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
    Eugenio Pérez, netdev@vger.kernel.org

On Wed, 23 Jul 2025 13:14:38 +0800 Jason Wang wrote:
> > It only hits in around 1 in 5 runs.
>
> I tried to reproduce this locally but failed. Where can I see the qemu
> command line for the VM?

Please see:
https://github.com/linux-netdev/nipa/wiki/Running-driver-tests-on-virtio
* Re: virtio_close() stuck on napi_disable_locked()

From: Paolo Abeni @ 2025-07-24 8:43 UTC
To: Jason Wang, Jakub Kicinski
Cc: Zigit Zo, Michael S. Tsirkin, Xuan Zhuo, Eugenio Pérez,
    netdev@vger.kernel.org

On 7/23/25 7:14 AM, Jason Wang wrote:
> On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>> The NIPA CI is reporting hangs in the stats.py test caused by the
>>> virtio_net driver getting stuck at close time.
>>>
>>> A sample splat is available here:
>>>
>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>
>>> AFAICS the issue happens only on debug builds.
>>>
>>> My wild guess is something similar to the issue addressed by commit
>>> 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi, but I
>>> could not spot anything obvious.
>>>
>>> Could you please have a look?
>>
>> It only hits in around 1 in 5 runs.
>
> I tried to reproduce this locally but failed. Where can I see the qemu
> command line for the VM?
>
>> Likely some pre-existing race, but
>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>> recursived rtnl_lock() during probe()") was merged.
>
> Probably, but I didn't see a direct connection with that commit. It
> looks like the root cause is napi_disable() looping forever for some
> reason, as Paolo said.
>
>> It never hit before.
>> If we can't find a quick fix I think we should revert be5dcaed694e for
>> now, so that it doesn't end up regressing 6.16 final.

I tried hard to reproduce the issue locally - to validate an eventual
revert before pushing it. But so far I have failed quite miserably.

Given the above, I would avoid the revert and just mention the known
issue in the net PR to Linus.

/P
* Re: virtio_close() stuck on napi_disable_locked()

From: Jason Wang @ 2025-07-24 10:53 UTC
To: Paolo Abeni
Cc: Jakub Kicinski, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
    Eugenio Pérez, netdev@vger.kernel.org

On Thu, Jul 24, 2025 at 4:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 7/23/25 7:14 AM, Jason Wang wrote:
> > On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
> >>> The NIPA CI is reporting hangs in the stats.py test caused by the
> >>> virtio_net driver getting stuck at close time.
> >>>
> >>> A sample splat is available here:
> >>>
> >>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
> >>>
> >>> AFAICS the issue happens only on debug builds.
> >>>
> >>> My wild guess is something similar to the issue addressed by commit
> >>> 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi, but I
> >>> could not spot anything obvious.
> >>>
> >>> Could you please have a look?
> >>
> >> It only hits in around 1 in 5 runs.
> >
> > I tried to reproduce this locally but failed. Where can I see the qemu
> > command line for the VM?
> >
> >> Likely some pre-existing race, but
> >> it started popping up for us when be5dcaed694e ("virtio-net: fix
> >> recursived rtnl_lock() during probe()") was merged.
> >
> > Probably, but I didn't see a direct connection with that commit. It
> > looks like the root cause is napi_disable() looping forever for some
> > reason, as Paolo said.
> >
> >> It never hit before.
> >> If we can't find a quick fix I think we should revert be5dcaed694e for
> >> now, so that it doesn't end up regressing 6.16 final.
>
> I tried hard to reproduce the issue locally - to validate an eventual
> revert before pushing it. But so far I have failed quite miserably.

I've also tried to follow the NIPA instructions: set up two virtio
devices and connect the relevant taps with a bridge on the host. But I
failed to reproduce it locally after several hours.

Is there a log of the NIPA test execution where we can find more
information, like:

1) full qemu command line
2) host kernel version
3) Qemu version

> Given the above, I would avoid the revert and just mention the known
> issue in the net PR to Linus.

+1

Thanks
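For reference, the host-side plumbing Jason describes (two multi-queue taps
attached to a bridge, matching the tap0/tap1 interfaces the qemu command line
later in the thread expects) might look roughly like the sketch below; the
bridge name and the vnet_hdr flag are assumptions, not details taken from the
thread or the wiki:

# Rough sketch: create two persistent multi-queue taps and bridge them,
# so the guest's two virtio-net ports can reach each other via the host.
ip link add name br0 type bridge
ip link set br0 up

for tap in tap0 tap1; do
	ip tuntap add name "$tap" mode tap multi_queue vnet_hdr
	ip link set "$tap" master br0
	ip link set "$tap" up
done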
* Re: virtio_close() stuck on napi_disable_locked()

From: Paolo Abeni @ 2025-07-24 11:29 UTC
To: Jason Wang
Cc: Jakub Kicinski, Zigit Zo, Michael S. Tsirkin, Xuan Zhuo,
    Eugenio Pérez, netdev@vger.kernel.org

On 7/24/25 12:53 PM, Jason Wang wrote:
> On Thu, Jul 24, 2025 at 4:43 PM Paolo Abeni <pabeni@redhat.com> wrote:
>> On 7/23/25 7:14 AM, Jason Wang wrote:
>>> On Wed, Jul 23, 2025 at 5:55 AM Jakub Kicinski <kuba@kernel.org> wrote:
>>>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>>>> The NIPA CI is reporting hangs in the stats.py test caused by the
>>>>> virtio_net driver getting stuck at close time.
>>>>>
>>>>> A sample splat is available here:
>>>>>
>>>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>>>
>>>>> AFAICS the issue happens only on debug builds.
>>>>>
>>>>> My wild guess is something similar to the issue addressed by commit
>>>>> 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi, but I
>>>>> could not spot anything obvious.
>>>>>
>>>>> Could you please have a look?
>>>>
>>>> It only hits in around 1 in 5 runs.
>>>
>>> I tried to reproduce this locally but failed. Where can I see the qemu
>>> command line for the VM?
>>>
>>>> Likely some pre-existing race, but
>>>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>>>> recursived rtnl_lock() during probe()") was merged.
>>>
>>> Probably, but I didn't see a direct connection with that commit. It
>>> looks like the root cause is napi_disable() looping forever for some
>>> reason, as Paolo said.
>>>
>>>> It never hit before.
>>>> If we can't find a quick fix I think we should revert be5dcaed694e for
>>>> now, so that it doesn't end up regressing 6.16 final.
>>
>> I tried hard to reproduce the issue locally - to validate an eventual
>> revert before pushing it. But so far I have failed quite miserably.
>
> I've also tried to follow the NIPA instructions: set up two virtio
> devices and connect the relevant taps with a bridge on the host. But I
> failed to reproduce it locally after several hours.
>
> Is there a log of the NIPA test execution where we can find more
> information, like:
>
> 1) full qemu command line

I guess it could depend on the vng version; here I'm getting:

qemu-system-x86_64 -name virtme-ng -m 1G
 -chardev socket,id=charvirtfs5,path=/tmp/virtmebyfqshp5
 -device vhost-user-fs-device,chardev=charvirtfs5,tag=ROOTFS
 -object memory-backend-memfd,id=mem,size=1G,share=on -numa node,memdev=mem
 -machine accel=kvm:tcg -M microvm,accel=kvm,pcie=on,rtc=on -cpu host
 -parallel none -net none -echr 1 -chardev file,path=/proc/self/fd/2,id=dmesg
 -device virtio-serial-device -device virtconsole,chardev=dmesg
 -chardev stdio,id=console,signal=off,mux=on -serial chardev:console
 -mon chardev=console -vga none -display none -smp 4
 -kernel ./arch/x86/boot/bzImage
 -append virtme_hostname=virtme-ng nr_open=1073741816
 virtme_link_mods=/data/net-next/.virtme_mods/lib/modules/0.0.0
 virtme_rw_overlay0=/etc virtme_rw_overlay1=/lib virtme_rw_overlay2=/home
 virtme_rw_overlay3=/opt virtme_rw_overlay4=/srv virtme_rw_overlay5=/usr
 virtme_rw_overlay6=/var virtme_rw_overlay7=/tmp console=hvc0
 earlyprintk=serial,ttyS0,115200 virtme_console=ttyS0 psmouse.proto=exps
 "virtme_stty_con=rows 32 cols 136 iutf8" TERM=xterm-256color
 virtme_chdir=data/net-next virtme_root_user=1 rootfstype=virtiofs
 root=ROOTFS raid=noautodetect ro debug
 init=/usr/lib/python3.13/site-packages/virtme/guest/bin/virtme-ng-init
 -device virtio-net-pci,netdev=n0,iommu_platform=on,disable-legacy=on,mq=on,vectors=18
 -netdev tap,id=n0,ifname=tap0,vhost=on,script=no,downscript=no,queues=8
 -device virtio-net-pci,netdev=n1,iommu_platform=on,disable-legacy=on,mq=on,vectors=18
 -netdev tap,id=n1,ifname=tap1,vhost=on,script=no,downscript=no,queues=8

I guess the significant part is: ' -smp 4 -m 1G'. The networking bits
are the verbatim configuration from the wiki.

> 2) host kernel version

I can give a reasonably sure answer only for this point; the kernel is
'current' net-next with 'current' net merged in, and all the patches
pending on patchwork merged in, too. For a given test iteration nipa
provides a snapshot of the patches merged in, e.g. for the tests run on
2025/07/24 at 00:00 see:

https://netdev.bots.linux.dev/static/nipa/branch_deltas/net-next-hw-2025-07-24--00-00.html

> 3) Qemu version

Should be a stock, recent Ubuntu build.

/P
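Pulling out the part of that command line that actually matters for the driver
under test (options verbatim from the message above, only reflowed; the '...'
elides the virtme-ng boilerplate):

# Two multi-queue virtio-net devices, each backed by an 8-queue vhost tap.
qemu-system-x86_64 ... -smp 4 -m 1G \
  -device virtio-net-pci,netdev=n0,iommu_platform=on,disable-legacy=on,mq=on,vectors=18 \
  -netdev tap,id=n0,ifname=tap0,vhost=on,script=no,downscript=no,queues=8 \
  -device virtio-net-pci,netdev=n1,iommu_platform=on,disable-legacy=on,mq=on,vectors=18 \
  -netdev tap,id=n1,ifname=tap1,vhost=on,script=no,downscript=no,queues=8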
* Re: Re: virtio_close() stuck on napi_disable_locked()

From: Zigit Zo @ 2025-07-24 8:58 UTC
To: Jakub Kicinski, Paolo Abeni
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
    netdev@vger.kernel.org

On 7/23/25 5:55 AM, Jakub Kicinski wrote:
> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>> Hi,
>>
>> The NIPA CI is reporting hangs in the stats.py test caused by the
>> virtio_net driver getting stuck at close time.
>>
>> A sample splat is available here:
>>
>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>
>> AFAICS the issue happens only on debug builds.
>>
>> My wild guess is something similar to the issue addressed by commit
>> 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi, but I
>> could not spot anything obvious.
>>
>> Could you please have a look?
>
> It only hits in around 1 in 5 runs. Likely some pre-existing race, but
> it started popping up for us when be5dcaed694e ("virtio-net: fix
> recursived rtnl_lock() during probe()") was merged. It never hit before.
> If we can't find a quick fix I think we should revert be5dcaed694e for
> now, so that it doesn't end up regressing 6.16 final.

I just noticed that there is a new test script, netpoll_basic.py. Before
run 209441 this test did not exist at all; I randomly picked some test
results prior to its introduction and did not find any hung logs. Is it
relevant?

Regards,
* Re: virtio_close() stuck on napi_disable_locked()

From: Paolo Abeni @ 2025-07-24 9:18 UTC
To: Zigit Zo, Jakub Kicinski
Cc: Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
    netdev@vger.kernel.org

On 7/24/25 10:58 AM, Zigit Zo wrote:
> On 7/23/25 5:55 AM, Jakub Kicinski wrote:
>> On Tue, 22 Jul 2025 13:00:14 +0200 Paolo Abeni wrote:
>>> The NIPA CI is reporting hangs in the stats.py test caused by the
>>> virtio_net driver getting stuck at close time.
>>>
>>> A sample splat is available here:
>>>
>>> https://netdev-3.bots.linux.dev/vmksft-drv-hw-dbg/results/209441/4-stats-py/stderr
>>>
>>> AFAICS the issue happens only on debug builds.
>>>
>>> My wild guess is something similar to the issue addressed by commit
>>> 4bc12818b363bd30f0f7348dd9ab077290a637ae, possibly for tx_napi, but I
>>> could not spot anything obvious.
>>>
>>> Could you please have a look?
>>
>> It only hits in around 1 in 5 runs. Likely some pre-existing race, but
>> it started popping up for us when be5dcaed694e ("virtio-net: fix
>> recursived rtnl_lock() during probe()") was merged. It never hit before.
>> If we can't find a quick fix I think we should revert be5dcaed694e for
>> now, so that it doesn't end up regressing 6.16 final.
>
> I just noticed that there is a new test script, netpoll_basic.py. Before
> run 209441 this test did not exist at all; I randomly picked some test
> results prior to its introduction and did not find any hung logs. Is it
> relevant?

If I read the nipa configuration correctly, it executes the tests in
sequence. netpoll_basic.py runs just before stats.py, so it could cause
the failure if it leaves some inconsistent state behind - but not in a
deterministic way, as the issue is sporadic. Technically possible, IMHO
unlikely.

Still, it would explain why I can't reproduce the issue here. With
unlimited time available, it could be worth validating whether the issue
can be reproduced by running the 2 tests in sequence.

/P
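A minimal way to try Paolo's suggestion of chaining the two tests, for anyone
attempting a local reproduction - the directory, the interface name and the
iteration count are assumptions, not details taken from the thread:

# Hypothetical repro loop: run netpoll_basic.py immediately followed by
# stats.py, as the CI does, until the close-time hang shows up.
# Assumes both scripts sit in the same selftest directory and that NETIF
# points at the virtio-net interface under test inside the VM.
cd tools/testing/selftests/drivers/net/hw
export NETIF=eth1                      # assumed interface name

for i in $(seq 1 50); do
	echo "=== iteration $i ==="
	./netpoll_basic.py || break
	./stats.py || break
done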