From: Yi Liu <yi.l.liu@intel.com>
To: Anthony DeRossi <ajderossi@gmail.com>, <kvm@vger.kernel.org>
Cc: <alex.williamson@redhat.com>, <cohuck@redhat.com>,
<jgg@nvidia.com>, <kevin.tian@intel.com>, <abhsahu@nvidia.com>,
<yishaih@nvidia.com>
Subject: Re: [PATCH v6 3/3] vfio/pci: Check the device set open count on reset
Date: Thu, 10 Nov 2022 11:03:29 +0800 [thread overview]
Message-ID: <49b64e4b-43b9-ec7b-23d2-2fa1bf921046@intel.com> (raw)
In-Reply-To: <20221110014027.28780-4-ajderossi@gmail.com>
Hi DeRossi,
On 2022/11/10 09:40, Anthony DeRossi wrote:
> vfio_pci_dev_set_needs_reset() inspects the open_count of every device
> in the set to determine whether a reset is allowed. The current device
> always has open_count == 1 within vfio_pci_core_disable(), effectively
> disabling the reset logic. This field is also documented as private in
> vfio_device, so it should not be used to determine whether other devices
> in the set are open.
haven't went through the prior version. maybe may question has been already
answered. My question is:
the major reason is the order problem in vfio_main.c. close_device() is
always called before decreasing open_count to be 0. So even other device
has no open fd, the current vfio_device still have one open count. So why
can't we just switch the order of open_count-- and close_device()?
> Checking for vfio_device_set_open_count() > 1 on the device set fixes
> both issues
tbh. it's weird to me that a driver needs to know the internal logic of
vfio core before knowing it needs to check the vfio_device_set_open_count()
in this way. Is vfio-pci the only driver that needs to do this check or
there are other drivers? If there are other drivers, maybe fixing the order
in core is better.
> After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set
> infrastructure"), failure to create a new file for a device would cause
> the reset to be skipped due to open_count being decremented after
> calling close_device() in the error path.
>
> After commit eadd86f835c6 ("vfio: Remove calls to
> vfio_group_add_container_user()"), releasing a device would always skip
> the reset due to an ordering change in vfio_device_fops_release().
>
> Failing to reset the device leaves it in an unknown state, potentially
> causing errors when it is accessed later or bound to a different driver.
>
> This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3)
> assigned to a Windows guest. After shutting down the guest, unbinding
> the device from vfio-pci, and binding the device to amdgpu:
>
> [ 548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
> [ 548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
> [ 548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
> [ 548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed
> [ 548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init
>
> Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure")
> Fixes: eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()")
> Signed-off-by: Anthony DeRossi <ajderossi@gmail.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> ---
> drivers/vfio/pci/vfio_pci_core.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index badc9d828cac..e030c2120183 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -2488,12 +2488,12 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
> struct vfio_pci_core_device *cur;
> bool needs_reset = false;
>
> - list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> - /* No VFIO device in the set can have an open device FD */
> - if (cur->vdev.open_count)
> - return false;
> + /* No other VFIO device in the set can be open. */
> + if (vfio_device_set_open_count(dev_set) > 1)
> + return false;
> +
> + list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> needs_reset |= cur->needs_reset;
> - }
> return needs_reset;
> }
>
--
Regards,
Yi Liu
next prev parent reply other threads:[~2022-11-10 3:03 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-11-10 1:40 [PATCH v6 0/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
2022-11-10 1:40 ` [PATCH v6 1/3] vfio: Fix container device registration life cycle Anthony DeRossi
2022-11-10 2:36 ` Yi Liu
2022-11-10 1:40 ` [PATCH v6 2/3] vfio: Export the device set open count Anthony DeRossi
2022-11-10 2:46 ` Yi Liu
2022-11-10 1:40 ` [PATCH v6 3/3] vfio/pci: Check the device set open count on reset Anthony DeRossi
2022-11-10 3:03 ` Yi Liu [this message]
2022-11-10 4:17 ` Alex Williamson
2022-11-10 4:33 ` Yi Liu
2022-11-10 20:16 ` [PATCH v6 0/3] " Alex Williamson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=49b64e4b-43b9-ec7b-23d2-2fa1bf921046@intel.com \
--to=yi.l.liu@intel.com \
--cc=abhsahu@nvidia.com \
--cc=ajderossi@gmail.com \
--cc=alex.williamson@redhat.com \
--cc=cohuck@redhat.com \
--cc=jgg@nvidia.com \
--cc=kevin.tian@intel.com \
--cc=kvm@vger.kernel.org \
--cc=yishaih@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.