Kernel KVM virtualization development
 help / color / mirror / Atom feed
From: Alex Williamson <alex@shazbot.org>
To: Raghavendra Rao Ananta <rananta@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
	David Matlack <dmatlack@google.com>,
	Vipin Sharma <vipinsh@google.com>,
	Josh Hilke <jrhilke@google.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	stable@vger.kernel.org, alex@shazbot.org
Subject: Re: [PATCH] vfio/pci: Use a private flag to prevent power state change with VFs
Date: Thu, 7 May 2026 15:09:25 -0600	[thread overview]
Message-ID: <20260507150925.11e9681e@shazbot.org> (raw)
In-Reply-To: <20260504224142.1041477-1-rananta@google.com>

On Mon,  4 May 2026 22:41:42 +0000
Raghavendra Rao Ananta <rananta@google.com> wrote:

> The current implementation uses pci_num_vf() while holding the
> memory_lock to prevent changing the power state of a PF when
> VFs are enabled. This creates a lockdep circular dependency
> warning in because memory_lock is held during device probing.

s/in// ?
 
> [  286.997167] ======================================================
> [  287.003363] WARNING: possible circular locking dependency detected
> [  287.009562] 7.0.0-dbg-DEV #3 Tainted: G S
> [  287.015074] ------------------------------------------------------
> [  287.021270] vfio_pci_sriov_/18636 is trying to acquire lock:
> [  287.026942] ff45bea2294d4968 (&vdev->memory_lock){+.+.}-{4:4}, at:
> vfio_pci_core_runtime_resume+0x1f/0xa0
> [  287.036530]
> [  287.036530] but task is already holding lock:
> [  287.042383] ff45bea3a96b8230 (&new_dev_set->lock){+.+.}-{4:4}, at:
> vfio_group_fops_unl_ioctl+0x44d/0x7b0
> [  287.051879]
> [  287.051879] which lock already depends on the new lock.
> [  287.051879]
> [  287.060070]
> [  287.060070] the existing dependency chain (in reverse order) is:
> [  287.067568]
> [  287.067568] -> #2 (&new_dev_set->lock){+.+.}-{4:4}:
> [  287.073941]        __mutex_lock+0x92/0xb80
> [  287.078058]        vfio_assign_device_set+0x66/0x1b0
> [  287.083042]        vfio_pci_core_register_device+0xd1/0x2a0
> [  287.088638]        vfio_pci_probe+0xd2/0x100
> [  287.092933]        local_pci_probe_callback+0x4d/0xa0
> [  287.098001]        process_scheduled_works+0x2ca/0x680
> [  287.103158]        worker_thread+0x1e8/0x2f0
> [  287.107452]        kthread+0x10c/0x140
> [  287.111230]        ret_from_fork+0x18e/0x360
> [  287.115519]        ret_from_fork_asm+0x1a/0x30
> [  287.119983]
> [  287.119983] -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
> [  287.127219]        __flush_work+0x345/0x490
> [  287.131429]        pci_device_probe+0x2e3/0x490
> [  287.135979]        really_probe+0x1f9/0x4e0
> [  287.140180]        __driver_probe_device+0x77/0x100
> [  287.145079]        driver_probe_device+0x1e/0x110
> [  287.149803]        __device_attach_driver+0xe3/0x170
> [  287.154789]        bus_for_each_drv+0x125/0x150
> [  287.159346]        __device_attach+0xca/0x1a0
> [  287.163720]        device_initial_probe+0x34/0x50
> [  287.168445]        pci_bus_add_device+0x6e/0x90
> [  287.172995]        pci_iov_add_virtfn+0x3c9/0x3e0
> [  287.177719]        sriov_add_vfs+0x2c/0x60
> [  287.181838]        sriov_enable+0x306/0x4a0
> [  287.186038]        vfio_pci_core_sriov_configure+0x184/0x220
> [  287.191715]        sriov_numvfs_store+0xd9/0x1c0
> [  287.196351]        kernfs_fop_write_iter+0x13f/0x1d0
> [  287.201338]        vfs_write+0x2be/0x3b0
> [  287.205286]        ksys_write+0x73/0x100
> [  287.209233]        do_syscall_64+0x14d/0x750
> [  287.213529]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [  287.219120]
> [  287.219120] -> #0 (&vdev->memory_lock){+.+.}-{4:4}:
> [  287.225491]        __lock_acquire+0x14c6/0x2800
> [  287.230048]        lock_acquire+0xd3/0x2f0
> [  287.234168]        down_write+0x3a/0xc0
> [  287.238019]        vfio_pci_core_runtime_resume+0x1f/0xa0
> [  287.243436]        __rpm_callback+0x8c/0x310
> [  287.247730]        rpm_resume+0x529/0x6f0
> [  287.251765]        __pm_runtime_resume+0x68/0x90
> [  287.256402]        vfio_pci_core_enable+0x44/0x310
> [  287.261216]        vfio_pci_open_device+0x1c/0x80
> [  287.265947]        vfio_df_open+0x10f/0x150
> [  287.270148]        vfio_group_fops_unl_ioctl+0x4a4/0x7b0
> [  287.275476]        __se_sys_ioctl+0x71/0xc0
> [  287.279679]        do_syscall_64+0x14d/0x750
> [  287.283975]        entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [  287.289559]
> [  287.289559] other info that might help us debug this:
> [  287.289559]
> [  287.297582] Chain exists of:
> [  287.297582]   &vdev->memory_lock --> (work_completion)(&arg.work)
> --> &new_dev_set->lock  
> [  287.297582]
> [  287.310023]  Possible unsafe locking scenario:
> [  287.310023]
> [  287.315961]        CPU0                    CPU1
> [  287.320510]        ----                    ----
> [  287.325059]   lock(&new_dev_set->lock);
> [  287.328917]
> lock((work_completion)(&arg.work));
> [  287.336153]                                lock(&new_dev_set->lock);
> [  287.342523]   lock(&vdev->memory_lock);
> [  287.346382]
> [  287.346382]  *** DEADLOCK ***
> [  287.346382]
> [  287.352315] 2 locks held by vfio_pci_sriov_/18636:
> [  287.357125]  #0: ff45bea208ed3e18 (&group->group_lock){+.+.}-{4:4},
> at: vfio_group_fops_unl_ioctl+0x3e3/0x7b0
> [  287.367048]  #1: ff45bea3a96b8230 (&new_dev_set->lock){+.+.}-{4:4},
> at: vfio_group_fops_unl_ioctl+0x44d/0x7b0
> [  287.376976]
> [  287.376976] stack backtrace:
> [  287.381353] CPU: 191 UID: 0 PID: 18636 Comm: vfio_pci_sriov_
> Tainted: G S                  7.0.0-dbg-DEV #3 PREEMPTLAZY
> [  287.381355] Tainted: [S]=CPU_OUT_OF_SPEC
> [  287.381356] Call Trace:
> [  287.381357]  <TASK>
> [  287.381358]  dump_stack_lvl+0x54/0x70
> [  287.381361]  print_circular_bug+0x2e1/0x300
> [  287.381363]  check_noncircular+0xf9/0x120
> [  287.381364]  ? __lock_acquire+0x5b4/0x2800
> [  287.381366]  __lock_acquire+0x14c6/0x2800
> [  287.381368]  ? pci_mmcfg_read+0x4f/0x220
> [  287.381370]  ? pci_mmcfg_write+0x57/0x220
> [  287.381371]  ? lock_acquire+0xd3/0x2f0
> [  287.381373]  ? pci_mmcfg_write+0x57/0x220
> [  287.381374]  ? lock_release+0xef/0x360
> [  287.381376]  ? vfio_pci_core_runtime_resume+0x1f/0xa0
> [  287.381377]  lock_acquire+0xd3/0x2f0
> [  287.381378]  ? vfio_pci_core_runtime_resume+0x1f/0xa0
> [  287.381379]  ? lock_is_held_type+0x76/0x100
> [  287.381382]  down_write+0x3a/0xc0
> [  287.381382]  ? vfio_pci_core_runtime_resume+0x1f/0xa0
> [  287.381383]  vfio_pci_core_runtime_resume+0x1f/0xa0
> [  287.381384]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
> [  287.381385]  __rpm_callback+0x8c/0x310
> [  287.381386]  ? ktime_get_mono_fast_ns+0x3d/0xb0
> [  287.381389]  ? __pfx_pci_pm_runtime_resume+0x10/0x10
> [  287.381390]  rpm_resume+0x529/0x6f0
> [  287.381392]  ? lock_is_held_type+0x76/0x100
> [  287.381394]  __pm_runtime_resume+0x68/0x90
> [  287.381396]  vfio_pci_core_enable+0x44/0x310
> [  287.381398]  vfio_pci_open_device+0x1c/0x80
> [  287.381399]  vfio_df_open+0x10f/0x150
> [  287.381401]  vfio_group_fops_unl_ioctl+0x4a4/0x7b0
> [  287.381402]  __se_sys_ioctl+0x71/0xc0
> [  287.381404]  do_syscall_64+0x14d/0x750
> [  287.381405]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [  287.381406]  ? trace_irq_disable+0x25/0xd0
> [  287.381409]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> Introduce a private flag 'sriov_pwr_active' in the vfio_pci_core_device
> struct. This  allows the driver to track the SR-IOV power state requirement
> without  relying on pci_num_vf() while holding the memory_lock. The lock is
> now  only held to set the flag and ensure the device is in D0, after which
> pci_enable_sriov() can be called without the lock.
> 
> Fixes: f4162eb1e2fc ("vfio/pci: Change the PF power state to D0 before enabling VFs")
> Cc: stable@vger.kernel.org
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> Suggested-by: Alex Williamson <alex@shazbot.org>
> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 6 ++++--
>  include/linux/vfio_pci_core.h    | 1 +
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 3f8d093aacf8a..0e4a73e541d3a 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -271,7 +271,7 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	int ret;
>  
>  	/* Prevent changing power state for PFs with VFs enabled */
> -	if (pci_num_vf(pdev) && state > PCI_D0)
> +	if (vdev->sriov_pwr_active && state > PCI_D0)
>  		return -EBUSY;

AIUI, clearing the flag in the out_del: below can be lockless because
at worst we'll deny a low power transition, but I think the test here
for any state >PCI_D0 does expect memory_lock, right?  Maybe this
should be something like:

	if (state > PCI_D0) {
		lockdep_assert_held_write(&vdev->memory_lock);
		if (vdev->sriov_pwr_active)
			return -EBUSY;
	}

>  
>  	if (vdev->needs_pm_restore) {
> @@ -2292,8 +2292,9 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
>  
>  		down_write(&vdev->memory_lock);
>  		vfio_pci_set_power_state(vdev, PCI_D0);
> -		ret = pci_enable_sriov(pdev, nr_virtfn);
> +		vdev->sriov_pwr_active = true;
>  		up_write(&vdev->memory_lock);
> +		ret = pci_enable_sriov(pdev, nr_virtfn);
>  		if (ret) {
>  			pm_runtime_put(&pdev->dev);
>  			goto out_del;
> @@ -2307,6 +2308,7 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
>  	}
>  
>  out_del:
> +	vdev->sriov_pwr_active = false;

A comment that this is intentionally lockless would be useful.

>  	mutex_lock(&vfio_pci_sriov_pfs_mutex);
>  	list_del_init(&vdev->sriov_pfs_item);
>  out_unlock:
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 2ebba746c18f7..9a39a13a65766 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -127,6 +127,7 @@ struct vfio_pci_core_device {
>  	bool			needs_pm_restore:1;
>  	bool			pm_intx_masked:1;
>  	bool			pm_runtime_engaged:1;
> +	bool			sriov_pwr_active:1;

Just 'sriov_active'?  Adding 'pwr' into it implies the use case rather
than the state it represents.  Thanks,

Alex

>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;
> 
> base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731


      reply	other threads:[~2026-05-07 21:09 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-04 22:41 [PATCH] vfio/pci: Use a private flag to prevent power state change with VFs Raghavendra Rao Ananta
2026-05-07 21:09 ` Alex Williamson [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260507150925.11e9681e@shazbot.org \
    --to=alex@shazbot.org \
    --cc=dmatlack@google.com \
    --cc=jgg@ziepe.ca \
    --cc=jrhilke@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=rananta@google.com \
    --cc=stable@vger.kernel.org \
    --cc=vipinsh@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox