* [RFC] Observed lockdep circular dependency in SR-IOV paths
@ 2026-03-31 17:23 Raghavendra Rao Ananta
2026-04-06 22:34 ` Jason Gunthorpe
0 siblings, 1 reply; 5+ messages in thread
From: Raghavendra Rao Ananta @ 2026-03-31 17:23 UTC (permalink / raw)
To: Alex Williamson
Cc: David Matlack, Vipin Sharma, Josh Hilke, linux-kernel, kvm
Hi Alex,
While running the vfio_pci_sriov_uapi_test [1] on a CONFIG_LOCKDEP
enabled kernel (7.0-rc1), we observed the following lockdep circular
locking dependency warning:
[ 286.997167] ======================================================
[ 287.003363] WARNING: possible circular locking dependency detected
[ 287.009562] 7.0.0-dbg-DEV #3 Tainted: G S
[ 287.015074] ------------------------------------------------------
[ 287.021270] vfio_pci_sriov_/18636 is trying to acquire lock:
[ 287.026942] ff45bea2294d4968 (&vdev->memory_lock){+.+.}-{4:4}, at:
vfio_pci_core_runtime_resume+0x1f/0xa0
[ 287.036530]
[ 287.036530] but task is already holding lock:
[ 287.042383] ff45bea3a96b8230 (&new_dev_set->lock){+.+.}-{4:4}, at:
vfio_group_fops_unl_ioctl+0x44d/0x7b0
[ 287.051879]
[ 287.051879] which lock already depends on the new lock.
[ 287.051879]
[ 287.060070]
[ 287.060070] the existing dependency chain (in reverse order) is:
[ 287.067568]
[ 287.067568] -> #2 (&new_dev_set->lock){+.+.}-{4:4}:
[ 287.073941] __mutex_lock+0x92/0xb80
[ 287.078058] vfio_assign_device_set+0x66/0x1b0
[ 287.083042] vfio_pci_core_register_device+0xd1/0x2a0
[ 287.088638] vfio_pci_probe+0xd2/0x100
[ 287.092933] local_pci_probe_callback+0x4d/0xa0
[ 287.098001] process_scheduled_works+0x2ca/0x680
[ 287.103158] worker_thread+0x1e8/0x2f0
[ 287.107452] kthread+0x10c/0x140
[ 287.111230] ret_from_fork+0x18e/0x360
[ 287.115519] ret_from_fork_asm+0x1a/0x30
[ 287.119983]
[ 287.119983] -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
[ 287.127219] __flush_work+0x345/0x490
[ 287.131429] pci_device_probe+0x2e3/0x490
[ 287.135979] really_probe+0x1f9/0x4e0
[ 287.140180] __driver_probe_device+0x77/0x100
[ 287.145079] driver_probe_device+0x1e/0x110
[ 287.149803] __device_attach_driver+0xe3/0x170
[ 287.154789] bus_for_each_drv+0x125/0x150
[ 287.159346] __device_attach+0xca/0x1a0
[ 287.163720] device_initial_probe+0x34/0x50
[ 287.168445] pci_bus_add_device+0x6e/0x90
[ 287.172995] pci_iov_add_virtfn+0x3c9/0x3e0
[ 287.177719] sriov_add_vfs+0x2c/0x60
[ 287.181838] sriov_enable+0x306/0x4a0
[ 287.186038] vfio_pci_core_sriov_configure+0x184/0x220
[ 287.191715] sriov_numvfs_store+0xd9/0x1c0
[ 287.196351] kernfs_fop_write_iter+0x13f/0x1d0
[ 287.201338] vfs_write+0x2be/0x3b0
[ 287.205286] ksys_write+0x73/0x100
[ 287.209233] do_syscall_64+0x14d/0x750
[ 287.213529] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 287.219120]
[ 287.219120] -> #0 (&vdev->memory_lock){+.+.}-{4:4}:
[ 287.225491] __lock_acquire+0x14c6/0x2800
[ 287.230048] lock_acquire+0xd3/0x2f0
[ 287.234168] down_write+0x3a/0xc0
[ 287.238019] vfio_pci_core_runtime_resume+0x1f/0xa0
[ 287.243436] __rpm_callback+0x8c/0x310
[ 287.247730] rpm_resume+0x529/0x6f0
[ 287.251765] __pm_runtime_resume+0x68/0x90
[ 287.256402] vfio_pci_core_enable+0x44/0x310
[ 287.261216] vfio_pci_open_device+0x1c/0x80
[ 287.265947] vfio_df_open+0x10f/0x150
[ 287.270148] vfio_group_fops_unl_ioctl+0x4a4/0x7b0
[ 287.275476] __se_sys_ioctl+0x71/0xc0
[ 287.279679] do_syscall_64+0x14d/0x750
[ 287.283975] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 287.289559]
[ 287.289559] other info that might help us debug this:
[ 287.289559]
[ 287.297582] Chain exists of:
[ 287.297582] &vdev->memory_lock --> (work_completion)(&arg.work)
--> &new_dev_set->lock
[ 287.297582]
[ 287.310023] Possible unsafe locking scenario:
[ 287.310023]
[ 287.315961]        CPU0                    CPU1
[ 287.320510]        ----                    ----
[ 287.325059]   lock(&new_dev_set->lock);
[ 287.328917]                                lock((work_completion)(&arg.work));
[ 287.336153]                                lock(&new_dev_set->lock);
[ 287.342523]   lock(&vdev->memory_lock);
[ 287.346382]
[ 287.346382] *** DEADLOCK ***
[ 287.346382]
[ 287.352315] 2 locks held by vfio_pci_sriov_/18636:
[ 287.357125] #0: ff45bea208ed3e18 (&group->group_lock){+.+.}-{4:4},
at: vfio_group_fops_unl_ioctl+0x3e3/0x7b0
[ 287.367048] #1: ff45bea3a96b8230 (&new_dev_set->lock){+.+.}-{4:4},
at: vfio_group_fops_unl_ioctl+0x44d/0x7b0
[ 287.376976]
[ 287.376976] stack backtrace:
[ 287.381353] CPU: 191 UID: 0 PID: 18636 Comm: vfio_pci_sriov_
Tainted: G S 7.0.0-dbg-DEV #3 PREEMPT_LAZY
[ 287.381355] Tainted: [S]=CPU_OUT_OF_SPEC
[ 287.381356] Call Trace:
[ 287.381357] <TASK>
[ 287.381358] dump_stack_lvl+0x54/0x70
[ 287.381361] print_circular_bug+0x2e1/0x300
[ 287.381363] check_noncircular+0xf9/0x120
[ 287.381364] ? __lock_acquire+0x5b4/0x2800
[ 287.381366] __lock_acquire+0x14c6/0x2800
[ 287.381368] ? pci_mmcfg_read+0x4f/0x220
[ 287.381370] ? pci_mmcfg_write+0x57/0x220
[ 287.381371] ? lock_acquire+0xd3/0x2f0
[ 287.381373] ? pci_mmcfg_write+0x57/0x220
[ 287.381374] ? lock_release+0xef/0x360
[ 287.381376] ? vfio_pci_core_runtime_resume+0x1f/0xa0
[ 287.381377] lock_acquire+0xd3/0x2f0
[ 287.381378] ? vfio_pci_core_runtime_resume+0x1f/0xa0
[ 287.381379] ? lock_is_held_type+0x76/0x100
[ 287.381382] down_write+0x3a/0xc0
[ 287.381382] ? vfio_pci_core_runtime_resume+0x1f/0xa0
[ 287.381383] vfio_pci_core_runtime_resume+0x1f/0xa0
[ 287.381384] ? __pfx_pci_pm_runtime_resume+0x10/0x10
[ 287.381385] __rpm_callback+0x8c/0x310
[ 287.381386] ? ktime_get_mono_fast_ns+0x3d/0xb0
[ 287.381389] ? __pfx_pci_pm_runtime_resume+0x10/0x10
[ 287.381390] rpm_resume+0x529/0x6f0
[ 287.381392] ? lock_is_held_type+0x76/0x100
[ 287.381394] __pm_runtime_resume+0x68/0x90
[ 287.381396] vfio_pci_core_enable+0x44/0x310
[ 287.381398] vfio_pci_open_device+0x1c/0x80
[ 287.381399] vfio_df_open+0x10f/0x150
[ 287.381401] vfio_group_fops_unl_ioctl+0x4a4/0x7b0
[ 287.381402] __se_sys_ioctl+0x71/0xc0
[ 287.381404] do_syscall_64+0x14d/0x750
[ 287.381405] ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 287.381406] ? trace_irq_disable+0x25/0xd0
[ 287.381409] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 287.381410] RIP: 0033:0x447c8b
[ 287.381413] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28
00 00
[ 287.381414] RSP: 002b:00007ffff5fb2530 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 287.381415] RAX: ffffffffffffffda RBX: 00007ffff5fb2d90 RCX: 0000000000447c8b
[ 287.381416] RDX: 00007ffff5fb25b0 RSI: 0000000000003b6a RDI: 0000000000000006
[ 287.381417] RBP: 00007ffff5fb2620 R08: 0000000000000000 R09: 00000000ffffffff
[ 287.381418] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffff5fb2d78
[ 287.381419] R13: 0000000000000016 R14: 00000000004f5440 R15: 0000000000000016
[ 287.381423] </TASK>
Initial analysis suggests this could be a false positive. The report
depicts a classic ABBA circular locking scenario:
* Trace#1 and Trace#2: Triggered while setting sriov_numvfs, where the
PF's vdev->memory_lock (A) is taken first, and then the VF's
dev_set->lock is allocated and taken (B).
* Trace#0: Triggered during VFIO_GROUP_GET_DEVICE_FD where first the
VF's dev_set->lock is taken (B) and then the PF's vdev->memory_lock
(A) is taken.
Looking at Trace#1/Trace#2, the dev_set->lock (B) is freshly allocated
and immediately taken during VF creation. The ioctl path cannot take it
until after the device is fully created and VFIO_GROUP_GET_DEVICE_FD is
called on it, so the two acquisitions are strictly ordered and cannot
actually deadlock. As far as we know, this is at least true for the VFs
being created, i.e., we were able to find a 'dev_set' from the
'vfio_device_set_xa' xarray using 'idx' in vfio_assign_device_set().
I've tried this naive diff, which seems to help with the issue, but it
may not be a complete solution:
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -106,7 +107,8 @@ int vfio_assign_device_set(struct vfio_device *device, void *set_id)
 found_get_ref:
 	dev_set->device_count++;
 	xa_unlock(&vfio_device_set_xa);
-	mutex_lock(&dev_set->lock);
+	mutex_lock_nested(&dev_set->lock, SINGLE_DEPTH_NESTING);
 	device->dev_set = dev_set;
 	list_add_tail(&device->dev_set_list, &dev_set->device_list);
 	mutex_unlock(&dev_set->lock);
I wanted your opinion or suggestions on how to proceed with the warning.
Thank you.
Raghavendra
[1]: https://lore.kernel.org/all/20260303193822.2526335-1-rananta@google.com/
^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] Observed lockdep circular dependency in SR-IOV paths
  2026-03-31 17:23 [RFC] Observed lockdep circular dependency in SR-IOV paths Raghavendra Rao Ananta
@ 2026-04-06 22:34 ` Jason Gunthorpe
  2026-04-07  3:59   ` Alex Williamson
  0 siblings, 1 reply; 5+ messages in thread
From: Jason Gunthorpe @ 2026-04-06 22:34 UTC (permalink / raw)
To: Raghavendra Rao Ananta
Cc: Alex Williamson, David Matlack, Vipin Sharma, Josh Hilke,
	linux-kernel, kvm

On Tue, Mar 31, 2026 at 10:23:06AM -0700, Raghavendra Rao Ananta wrote:
> Hi Alex,
>
> While running the vfio_pci_sriov_uapi_test [1] on a CONFIG_LOCKDEP
> enabled kernel (7.0-rc1), we observed the following lockdep circular
> locking dependency warning:

This looks more like a class issue: the locks used by the VF driver are
different from the locks used by the PF driver, and even though the
dev_sets are shared, a dev_set should never hold both the PF and a VF.

Maybe shifting memory_lock into a different lock class for VFs is
enough.

However, I think it is a bad idea to hold the memory_lock while
probing a device. I'd prefer to revisit f4162eb1e2fc ("vfio/pci:
Change the PF power state to D0 before enabling VFs") and change its
logic to rely on a private flag instead of pci_num_vf().

Then we don't need to hold the lock any longer than just setting the
flag.

Jason

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [RFC] Observed lockdep circular dependency in SR-IOV paths
  2026-04-06 22:34 ` Jason Gunthorpe
@ 2026-04-07  3:59   ` Alex Williamson
  2026-04-08 15:46     ` Raghavendra Rao Ananta
  0 siblings, 1 reply; 5+ messages in thread
From: Alex Williamson @ 2026-04-07 3:59 UTC (permalink / raw)
To: Jason Gunthorpe, Raghavendra Rao Ananta
Cc: David Matlack, Vipin Sharma, Josh Hilke, linux-kernel, kvm

On Mon, Apr 6, 2026, at 4:34 PM, Jason Gunthorpe wrote:
> On Tue, Mar 31, 2026 at 10:23:06AM -0700, Raghavendra Rao Ananta wrote:
>> Hi Alex,
>>
>> While running the vfio_pci_sriov_uapi_test [1] on a CONFIG_LOCKDEP
>> enabled kernel (7.0-rc1), we observed the following lockdep circular
>> locking dependency warning:
>
> This looks more like a class issue: the locks used by the VF driver are
> different from the locks used by the PF driver, and even though the
> dev_sets are shared, a dev_set should never hold both the PF and a VF.
>
> Maybe shifting memory_lock into a different lock class for VFs is
> enough.
>
> However, I think it is a bad idea to hold the memory_lock while
> probing a device. I'd prefer to revisit f4162eb1e2fc ("vfio/pci:
> Change the PF power state to D0 before enabling VFs") and change its
> logic to rely on a private flag instead of pci_num_vf().
>
> Then we don't need to hold the lock any longer than just setting the
> flag.

I agree, that's cleaner: only hold the write-lock on memory_lock in
vfio_pci_core_sriov_configure() to bring the device to D0 and to set a
flag on the PF vdev. That flag would replace the pci_num_vf() test in
vfio_pci_set_power_state(). The flag is cleared without memory_lock if
pci_enable_sriov() fails or after pci_disable_sriov().

It's a nice, compartmentalized fix. The lockdep splat is a false
positive, but this would eliminate the memory_lock -> work_completion
arc of the detection. Thanks,

Alex

^ permalink raw reply	[flat|nested] 5+ messages in thread
* Re: [RFC] Observed lockdep circular dependency in SR-IOV paths
  2026-04-07  3:59   ` Alex Williamson
@ 2026-04-08 15:46     ` Raghavendra Rao Ananta
  2026-04-10 16:43       ` Alex Williamson
  0 siblings, 1 reply; 5+ messages in thread
From: Raghavendra Rao Ananta @ 2026-04-08 15:46 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, David Matlack, Vipin Sharma, Josh Hilke,
	linux-kernel, kvm

On Mon, Apr 6, 2026 at 9:00 PM Alex Williamson <alex@shazbot.org> wrote:
>
> On Mon, Apr 6, 2026, at 4:34 PM, Jason Gunthorpe wrote:
> > On Tue, Mar 31, 2026 at 10:23:06AM -0700, Raghavendra Rao Ananta wrote:
> >> Hi Alex,
> >>
> >> While running the vfio_pci_sriov_uapi_test [1] on a CONFIG_LOCKDEP
> >> enabled kernel (7.0-rc1), we observed the following lockdep circular
> >> locking dependency warning:
> >
> > This looks more like a class issue: the locks used by the VF driver are
> > different from the locks used by the PF driver, and even though the
> > dev_sets are shared, a dev_set should never hold both the PF and a VF.
> >
> > Maybe shifting memory_lock into a different lock class for VFs is
> > enough.
> >
> > However, I think it is a bad idea to hold the memory_lock while
> > probing a device. I'd prefer to revisit f4162eb1e2fc ("vfio/pci:
> > Change the PF power state to D0 before enabling VFs") and change its
> > logic to rely on a private flag instead of pci_num_vf().
> >
> > Then we don't need to hold the lock any longer than just setting the
> > flag.
>
> I agree, that's cleaner: only hold the write-lock on memory_lock in
> vfio_pci_core_sriov_configure() to bring the device to D0 and to set a
> flag on the PF vdev. That flag would replace the pci_num_vf() test in
> vfio_pci_set_power_state(). The flag is cleared without memory_lock if
> pci_enable_sriov() fails or after pci_disable_sriov().
>
> It's a nice, compartmentalized fix. The lockdep splat is a false
> positive, but this would eliminate the memory_lock -> work_completion
> arc of the detection.

Thanks for the suggestion. The pointer to f4162eb1e2fc helped provide
context. Based on the suggestion, I tried this diff and we no longer
see the lockdep warning:

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d43745fe4c843..f857a23c034a4 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -271,7 +271,7 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	int ret;
 
 	/* Prevent changing power state for PFs with VFs enabled */
-	if (pci_num_vf(pdev) && state > PCI_D0)
+	if (vdev->sriov_pwr_active && state > PCI_D0)
 		return -EBUSY;
 
 	if (vdev->needs_pm_restore) {
@@ -2294,8 +2294,9 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
 
 	down_write(&vdev->memory_lock);
 	vfio_pci_set_power_state(vdev, PCI_D0);
-	ret = pci_enable_sriov(pdev, nr_virtfn);
+	vdev->sriov_pwr_active = true;
 	up_write(&vdev->memory_lock);
+	ret = pci_enable_sriov(pdev, nr_virtfn);
 	if (ret) {
 		pm_runtime_put(&pdev->dev);
 		goto out_del;
@@ -2309,6 +2310,7 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
 	}
 
 out_del:
+	vdev->sriov_pwr_active = false;
 	mutex_lock(&vfio_pci_sriov_pfs_mutex);
 	list_del_init(&vdev->sriov_pfs_item);
 out_unlock:

My only concern was clearing 'sriov_pwr_active' after disabling SR-IOV,
since the PF would still remain in D0. But from f4162eb1e2fc's commit
description, it seems like that was intentional?

Thank you.
Raghavendra

^ permalink raw reply related	[flat|nested] 5+ messages in thread
* Re: [RFC] Observed lockdep circular dependency in SR-IOV paths
  2026-04-08 15:46     ` Raghavendra Rao Ananta
@ 2026-04-10 16:43       ` Alex Williamson
  0 siblings, 0 replies; 5+ messages in thread
From: Alex Williamson @ 2026-04-10 16:43 UTC (permalink / raw)
To: Raghavendra Rao Ananta
Cc: Jason Gunthorpe, David Matlack, Vipin Sharma, Josh Hilke,
	linux-kernel, kvm, alex

On Wed, 8 Apr 2026 08:46:30 -0700
Raghavendra Rao Ananta <rananta@google.com> wrote:

> On Mon, Apr 6, 2026 at 9:00 PM Alex Williamson <alex@shazbot.org> wrote:
> >
> > On Mon, Apr 6, 2026, at 4:34 PM, Jason Gunthorpe wrote:
> > > On Tue, Mar 31, 2026 at 10:23:06AM -0700, Raghavendra Rao Ananta wrote:
> > >> Hi Alex,
> > >>
> > >> While running the vfio_pci_sriov_uapi_test [1] on a CONFIG_LOCKDEP
> > >> enabled kernel (7.0-rc1), we observed the following lockdep circular
> > >> locking dependency warning:
> > >
> > > This looks more like a class issue: the locks used by the VF driver are
> > > different from the locks used by the PF driver, and even though the
> > > dev_sets are shared, a dev_set should never hold both the PF and a VF.
> > >
> > > Maybe shifting memory_lock into a different lock class for VFs is
> > > enough.
> > >
> > > However, I think it is a bad idea to hold the memory_lock while
> > > probing a device. I'd prefer to revisit f4162eb1e2fc ("vfio/pci:
> > > Change the PF power state to D0 before enabling VFs") and change its
> > > logic to rely on a private flag instead of pci_num_vf().
> > >
> > > Then we don't need to hold the lock any longer than just setting the
> > > flag.
> >
> > I agree, that's cleaner: only hold the write-lock on memory_lock in
> > vfio_pci_core_sriov_configure() to bring the device to D0 and to
> > set a flag on the PF vdev. That flag would replace the
> > pci_num_vf() test in vfio_pci_set_power_state(). The flag is
> > cleared without memory_lock if pci_enable_sriov() fails or after
> > pci_disable_sriov().
> >
> > It's a nice, compartmentalized fix. The lockdep splat is a false
> > positive, but this would eliminate the memory_lock ->
> > work_completion arc of the detection.
>
> Thanks for the suggestion. The pointer to f4162eb1e2fc helped provide
> context. Based on the suggestion, I tried this diff and we no longer
> see the lockdep warning:
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index d43745fe4c843..f857a23c034a4 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -271,7 +271,7 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	int ret;
>
>  	/* Prevent changing power state for PFs with VFs enabled */
> -	if (pci_num_vf(pdev) && state > PCI_D0)
> +	if (vdev->sriov_pwr_active && state > PCI_D0)
>  		return -EBUSY;
>
>  	if (vdev->needs_pm_restore) {
> @@ -2294,8 +2294,9 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
>
>  	down_write(&vdev->memory_lock);
>  	vfio_pci_set_power_state(vdev, PCI_D0);
> -	ret = pci_enable_sriov(pdev, nr_virtfn);
> +	vdev->sriov_pwr_active = true;
>  	up_write(&vdev->memory_lock);
> +	ret = pci_enable_sriov(pdev, nr_virtfn);
>  	if (ret) {
>  		pm_runtime_put(&pdev->dev);
>  		goto out_del;
> @@ -2309,6 +2310,7 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
>  	}
>
> out_del:
> +	vdev->sriov_pwr_active = false;
>  	mutex_lock(&vfio_pci_sriov_pfs_mutex);
>  	list_del_init(&vdev->sriov_pfs_item);
> out_unlock:
>
> My only concern was clearing 'sriov_pwr_active' after disabling
> SR-IOV, since the PF would still remain in D0. But from
> f4162eb1e2fc's commit description, it seems like that was
> intentional?

I don't see any obvious problem with that ordering: once SR-IOV is
disabled, the PF is again subject to both PM and directed power
control. Thanks,

Alex

^ permalink raw reply	[flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-04-10 16:43 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-31 17:23 [RFC] Observed lockdep circular dependency in SR-IOV paths Raghavendra Rao Ananta
2026-04-06 22:34 ` Jason Gunthorpe
2026-04-07  3:59   ` Alex Williamson
2026-04-08 15:46     ` Raghavendra Rao Ananta
2026-04-10 16:43       ` Alex Williamson