* Locking between vfio hot-remove and pci sysfs sriov_numvfs
       [not found] <CGME20231207223824uscas1p27dd91f0af56cda282cd28046cc981fe9@uscas1p2.samsung.com>
@ 2023-12-07 22:38 ` Jim Harris
  2023-12-07 23:21 ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-07 22:38 UTC (permalink / raw)
To: bhelgaas@google.com, alex.williamson@redhat.com,
	linux-pci@vger.kernel.org, kvm@vger.kernel.org, ben@nvidia.com,
	jgg@nvidia.com

I am seeing a deadlock using SPDK with hotplug detection using vfio-pci
and an SR-IOV enabled NVMe SSD. It is not clear if this deadlock is intended
or if it's a kernel bug.

Note: SPDK uses DPDK's PCI device enumeration framework, so I'll reference
both SPDK and DPDK in this description.

DPDK registers an eventfd with vfio for hotplug notifications. If the associated
device is removed (i.e. write 1 to its pci sysfs remove entry), vfio
writes to the eventfd, requesting DPDK to release the device. It does this
while holding the device_lock(), and then waits for completion.

DPDK gets the notification, and passes it up to SPDK. SPDK does not release
the device immediately. It has some asynchronous operations that need to be
performed first, so it will release the device a bit later.

But before the device is released, SPDK also triggers DPDK to do a sysfs scan
looking for newly inserted devices. Note that the removed device is not
completely removed yet from kernel PCI perspective - all of its sysfs entries
are still available, including sriov_numvfs.

DPDK explicitly reads sriov_numvfs to see if the device is SR-IOV capable.
SPDK itself doesn't actually use this value, but it is part of the scan
triggered by SPDK and directly leads to the deadlock. sriov_numvfs_show()
deadlocks because it tries to hold device_lock() while reading the pci
device's pdev->sriov->num_VFs.

We're able to work around this in SPDK by deferring the sysfs scan if
a device removal is in process. And maybe that is what we are supposed to
be doing, to avoid this deadlock?

Reference to the SPDK issue, for some more details (plus simple repro steps for
anyone already familiar with SPDK): https://github.com/spdk/spdk/issues/3205

^ permalink raw reply	[flat|nested] 16+ messages in thread
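To make the lock dependency concrete, here is a simplified sketch of the two paths involved (the sriov_numvfs_show() side follows the mainline function; the vfio removal side is paraphrased from the description above, with details elided):

```c
/* Thread A: echo 1 > /sys/bus/pci/devices/.../remove
 * The driver core holds the device lock across the driver's
 * remove callback. */
device_lock(&pdev->dev);
	/* vfio-pci remove: signal the user's request eventfd, then
	 * block until userspace closes the device file descriptor */
	wait_for_completion(...);

/* Thread B: cat /sys/bus/pci/devices/.../sriov_numvfs
 * sriov_numvfs_show() blocks behind thread A. If this is the very
 * thread that must eventually release the device to vfio, neither
 * side can make progress. */
device_lock(&pdev->dev);
```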
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-07 22:38 ` Locking between vfio hot-remove and pci sysfs sriov_numvfs Jim Harris
@ 2023-12-07 23:21 ` Alex Williamson
  2023-12-07 23:48   ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Alex Williamson @ 2023-12-07 23:21 UTC (permalink / raw)
To: Jim Harris
Cc: bhelgaas@google.com, linux-pci@vger.kernel.org, kvm@vger.kernel.org,
	ben@nvidia.com, jgg@nvidia.com

On Thu, 7 Dec 2023 22:38:23 +0000
Jim Harris <jim.harris@samsung.com> wrote:

> I am seeing a deadlock using SPDK with hotplug detection using vfio-pci
> and an SR-IOV enabled NVMe SSD. It is not clear if this deadlock is intended
> or if it's a kernel bug.
> 
> Note: SPDK uses DPDK's PCI device enumeration framework, so I'll reference
> both SPDK and DPDK in this description.
> 
> DPDK registers an eventfd with vfio for hotplug notifications. If the associated
> device is removed (i.e. write 1 to its pci sysfs remove entry), vfio
> writes to the eventfd, requesting DPDK to release the device. It does this
> while holding the device_lock(), and then waits for completion.
> 
> DPDK gets the notification, and passes it up to SPDK. SPDK does not release
> the device immediately. It has some asynchronous operations that need to be
> performed first, so it will release the device a bit later.
> 
> But before the device is released, SPDK also triggers DPDK to do a sysfs scan
> looking for newly inserted devices. Note that the removed device is not
> completely removed yet from kernel PCI perspective - all of its sysfs entries
> are still available, including sriov_numvfs.
> 
> DPDK explicitly reads sriov_numvfs to see if the device is SR-IOV capable.
> SPDK itself doesn't actually use this value, but it is part of the scan
> triggered by SPDK and directly leads to the deadlock. sriov_numvfs_show()
> deadlocks because it tries to hold device_lock() while reading the pci
> device's pdev->sriov->num_VFs.
> 
> We're able to work around this in SPDK by deferring the sysfs scan if
> a device removal is in process. And maybe that is what we are supposed to
> be doing, to avoid this deadlock?
> 
> Reference to the SPDK issue, for some more details (plus simple repro steps for
> anyone already familiar with SPDK): https://github.com/spdk/spdk/issues/3205

device_lock() has been a recurring problem.  We don't have a lot of
leeway in how we support the driver remove callback, the device needs
to be released.  We can't return -EBUSY and I don't think we can drop
the mutex while we're waiting on userspace.

I've done some fix-ups in the past to use device_trylock() to avoid
deadlocks, which might be an option here, e.g. reading sriov_numvfs
could return -EBUSY in this scenario.

We keep running into these scenarios though and we might just need to
pick a point at which we kill the user process holding the device.  I'm
open to suggestions.  Thanks,

Alex
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-07 23:21 ` Alex Williamson
@ 2023-12-07 23:48 ` Jason Gunthorpe
  2023-12-08 17:07   ` Jim Harris
  2023-12-08 17:38   ` Jim Harris
  0 siblings, 2 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-07 23:48 UTC (permalink / raw)
To: Alex Williamson
Cc: Jim Harris, bhelgaas@google.com, linux-pci@vger.kernel.org,
	kvm@vger.kernel.org, ben@nvidia.com

On Thu, Dec 07, 2023 at 04:21:48PM -0700, Alex Williamson wrote:
> On Thu, 7 Dec 2023 22:38:23 +0000
> Jim Harris <jim.harris@samsung.com> wrote:
> 
> > I am seeing a deadlock using SPDK with hotplug detection using vfio-pci
> > and an SR-IOV enabled NVMe SSD. It is not clear if this deadlock is intended
> > or if it's a kernel bug.
> > 
> > Note: SPDK uses DPDK's PCI device enumeration framework, so I'll reference
> > both SPDK and DPDK in this description.
> > 
> > DPDK registers an eventfd with vfio for hotplug notifications. If the associated
> > device is removed (i.e. write 1 to its pci sysfs remove entry), vfio
> > writes to the eventfd, requesting DPDK to release the device. It does this
> > while holding the device_lock(), and then waits for completion.
> > 
> > DPDK gets the notification, and passes it up to SPDK. SPDK does not release
> > the device immediately. It has some asynchronous operations that need to be
> > performed first, so it will release the device a bit later.
> > 
> > But before the device is released, SPDK also triggers DPDK to do a sysfs scan
> > looking for newly inserted devices. Note that the removed device is not
> > completely removed yet from kernel PCI perspective - all of its sysfs entries
> > are still available, including sriov_numvfs.
> > 
> > DPDK explicitly reads sriov_numvfs to see if the device is SR-IOV capable.
> > SPDK itself doesn't actually use this value, but it is part of the scan
> > triggered by SPDK and directly leads to the deadlock. sriov_numvfs_show()
> > deadlocks because it tries to hold device_lock() while reading the pci
> > device's pdev->sriov->num_VFs.
> > 
> > We're able to work around this in SPDK by deferring the sysfs scan if
> > a device removal is in process. And maybe that is what we are supposed to
> > be doing, to avoid this deadlock?
> > 
> > Reference to the SPDK issue, for some more details (plus simple repro steps for
> > anyone already familiar with SPDK): https://github.com/spdk/spdk/issues/3205
> 
> device_lock() has been a recurring problem.  We don't have a lot of
> leeway in how we support the driver remove callback, the device needs
> to be released.  We can't return -EBUSY and I don't think we can drop
> the mutex while we're waiting on userspace.

The mechanism of waiting in remove for userspace is inherently flawed,
it can never work fully correctly. :( I've hit this many times.

Upon remove VFIO should immediately remove itself and leave behind a
non-functional file descriptor. Userspace should catch up eventually
and see it is toast.

The kernel locking model just cannot support userspace delaying this
process.

Jason
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-07 23:48 ` Jason Gunthorpe
@ 2023-12-08 17:07 ` Jim Harris
  2023-12-08 19:41   ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-08 17:07 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, bhelgaas@google.com, linux-pci@vger.kernel.org,
	kvm@vger.kernel.org, ben@nvidia.com

On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 07, 2023 at 04:21:48PM -0700, Alex Williamson wrote:
> > On Thu, 7 Dec 2023 22:38:23 +0000
> > Jim Harris <jim.harris@samsung.com> wrote:
> > 
> > device_lock() has been a recurring problem.  We don't have a lot of
> > leeway in how we support the driver remove callback, the device needs
> > to be released.  We can't return -EBUSY and I don't think we can drop
> > the mutex while we're waiting on userspace.
> 
> The mechanism of waiting in remove for userspace is inherently flawed,
> it can never work fully correctly. :( I've hit this many times.
> 
> Upon remove VFIO should immediately remove itself and leave behind a
> non-functional file descriptor. Userspace should catch up eventually
> and see it is toast.
> 
> The kernel locking model just cannot support userspace delaying this
> process.
> 
> Jason

Maybe for now we just whack this specific mole with a separate mutex
for synchronizing access to sriov->num_VFs in the sysfs paths?
Something like this (tested on my system):

---

Author: Jim Harris <jim.harris@samsung.com>

PCI/IOV: sync sriov->num_VFs sysfs access with its own mutex

If an SR-IOV-enabled device is held by vfio, and the device is removed,
vfio will hold the device lock and notify userspace of the removal. If
userspace reads the sriov_numvfs sysfs entry, that thread will be
blocked, since sriov_numvfs_show() also tries to acquire the device
lock. If that same thread is responsible for releasing the device to
vfio, it results in a deadlock.

So add a separate mutex, specifically for struct pci_sriov. Use it to
synchronize accesses to sriov->num_VFs in the sysfs paths.
sriov_numvfs_store() will also still hold the device lock while
configuring SR-IOV.

Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes")
Signed-off-by: Jim Harris <jim.harris@samsung.com>

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 25dbe85c4217..8910cf6c97be 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -398,9 +398,9 @@ static ssize_t sriov_numvfs_show(struct device *dev,
 	u16 num_vfs;
 
 	/* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */
-	device_lock(&pdev->dev);
+	mutex_lock(&pdev->sriov->lock);
 	num_vfs = pdev->sriov->num_VFs;
-	device_unlock(&pdev->dev);
+	mutex_unlock(&pdev->sriov->lock);
 
 	return sysfs_emit(buf, "%u\n", num_vfs);
 }
@@ -427,6 +427,7 @@ static ssize_t sriov_numvfs_store(struct device *dev,
 		return -ERANGE;
 
 	device_lock(&pdev->dev);
+	mutex_lock(&pdev->sriov->lock);
 	if (num_vfs == pdev->sriov->num_VFs)
 		goto exit;
 
@@ -468,6 +469,7 @@ static ssize_t sriov_numvfs_store(struct device *dev,
 			num_vfs, ret);
 
 exit:
+	mutex_unlock(&pdev->sriov->lock);
 	device_unlock(&pdev->dev);
 
 	if (ret < 0)
@@ -808,6 +810,7 @@ static int sriov_init(struct pci_dev *dev, int pos)
 		nres++;
 	}
 
+	mutex_init(&iov->lock);
 	iov->pos = pos;
 	iov->nres = nres;
 	iov->ctrl = ctrl;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5ecbcf041179..04e636ab50e5 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -313,6 +313,7 @@ struct pci_sriov {
 	u16		subsystem_device; /* VF subsystem device */
 	resource_size_t	barsz[PCI_SRIOV_NUM_BARS]; /* VF BAR size */
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
+	struct mutex	lock;	/* for synchronizing num_VFs sysfs accesses */
 };
 
 #ifdef CONFIG_PCI_DOE
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 17:07 ` Jim Harris
@ 2023-12-08 19:41 ` Jason Gunthorpe
  2023-12-08 20:09   ` Jim Harris
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-08 19:41 UTC (permalink / raw)
To: Jim Harris, Leon Romanovsky
Cc: Alex Williamson, bhelgaas@google.com, linux-pci@vger.kernel.org,
	kvm@vger.kernel.org, ben@nvidia.com

On Fri, Dec 08, 2023 at 05:07:22PM +0000, Jim Harris wrote:
> On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> > On Thu, Dec 07, 2023 at 04:21:48PM -0700, Alex Williamson wrote:
> > > On Thu, 7 Dec 2023 22:38:23 +0000
> > > Jim Harris <jim.harris@samsung.com> wrote:
> > > 
> > > device_lock() has been a recurring problem.  We don't have a lot of
> > > leeway in how we support the driver remove callback, the device needs
> > > to be released.  We can't return -EBUSY and I don't think we can drop
> > > the mutex while we're waiting on userspace.
> > 
> > The mechanism of waiting in remove for userspace is inherently flawed,
> > it can never work fully correctly. :( I've hit this many times.
> > 
> > Upon remove VFIO should immediately remove itself and leave behind a
> > non-functional file descriptor. Userspace should catch up eventually
> > and see it is toast.
> > 
> > The kernel locking model just cannot support userspace delaying this
> > process.
> > 
> > Jason
> 
> Maybe for now we just whack this specific mole with a separate mutex
> for synchronizing access to sriov->num_VFs in the sysfs paths?
> Something like this (tested on my system):

TBH, I don't have the time right now to unpack this locking
mystery. Maybe Leon remembers?

device_lock() gets everywhere and does a lot of different stuff, so I
would be surprised if it was so easy..

Jason
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 19:41 ` Jason Gunthorpe
@ 2023-12-08 20:09 ` Jim Harris
  2023-12-10 19:05   ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-08 20:09 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Leon Romanovsky, Alex Williamson, bhelgaas@google.com,
	linux-pci@vger.kernel.org, kvm@vger.kernel.org, ben@nvidia.com,
	pierre.cregut@orange.com

On Fri, Dec 08, 2023 at 03:41:59PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 08, 2023 at 05:07:22PM +0000, Jim Harris wrote:
> > 
> > Maybe for now we just whack this specific mole with a separate mutex
> > for synchronizing access to sriov->num_VFs in the sysfs paths?
> > Something like this (tested on my system):
> 
> TBH, I don't have the time right now to unpack this locking
> mystery. Maybe Leon remembers?
> 
> device_lock() gets everywhere and does a lot of different stuff, so I
> would be surprised if it was so easy..

The store() side still keeps the device_lock(), it just also acquires this
new sriov lock. So the store() side should observe zero differences. The only
difference is now the show() side can acquire just the more-granular lock,
since it is only trying to synchronize on sriov->num_VFs with the store()
side. But maybe I'm missing something subtle here...

Adding Pierre, who authored the 35ff867b7657 commit.
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 20:09 ` Jim Harris
@ 2023-12-10 19:05 ` Jason Gunthorpe
  2023-12-11  7:20   ` Leon Romanovsky
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-10 19:05 UTC (permalink / raw)
To: Jim Harris
Cc: Leon Romanovsky, Alex Williamson, bhelgaas@google.com,
	linux-pci@vger.kernel.org, kvm@vger.kernel.org, ben@nvidia.com,
	pierre.cregut@orange.com

On Fri, Dec 08, 2023 at 08:09:34PM +0000, Jim Harris wrote:
> On Fri, Dec 08, 2023 at 03:41:59PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 08, 2023 at 05:07:22PM +0000, Jim Harris wrote:
> > > 
> > > Maybe for now we just whack this specific mole with a separate mutex
> > > for synchronizing access to sriov->num_VFs in the sysfs paths?
> > > Something like this (tested on my system):
> > 
> > TBH, I don't have the time right now to unpack this locking
> > mystery. Maybe Leon remembers?
> > 
> > device_lock() gets everywhere and does a lot of different stuff, so I
> > would be surprised if it was so easy..
> 
> The store() side still keeps the device_lock(), it just also acquires this
> new sriov lock. So the store() side should observe zero differences. The only
> difference is now the show() side can acquire just the more-granular lock,
> since it is only trying to synchronize on sriov->num_VFs with the store()
> side. But maybe I'm missing something subtle here...

Oh, if that is the only goal then probably a READ_ONCE is fine.

Jason
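For reference, a lockless READ_ONCE() variant of the reader might look something like this (a sketch against the sriov_numvfs_show() quoted earlier in the thread, not a tested patch; it gives up the read/write serialization entirely rather than preserving it):

```c
static ssize_t sriov_numvfs_show(struct device *dev,
				 struct device_attribute *attr, char *buf)
{
	struct pci_dev *pdev = to_pci_dev(dev);

	/* No locking: an aligned u16 load cannot tear, and READ_ONCE()
	 * stops the compiler from reloading num_VFs. The value may be
	 * stale while a write is in flight, as discussed later in the
	 * thread. */
	return sysfs_emit(buf, "%u\n", READ_ONCE(pdev->sriov->num_VFs));
}
```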
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-10 19:05 ` Jason Gunthorpe
@ 2023-12-11  7:20 ` Leon Romanovsky
  2023-12-12 21:34   ` Jim Harris
  0 siblings, 1 reply; 16+ messages in thread
From: Leon Romanovsky @ 2023-12-11  7:20 UTC (permalink / raw)
To: Jason Gunthorpe, Jim Harris
Cc: Alex Williamson, bhelgaas@google.com, linux-pci@vger.kernel.org,
	kvm@vger.kernel.org, ben@nvidia.com, pierre.cregut@orange.com

On Sun, Dec 10, 2023 at 03:05:49PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 08, 2023 at 08:09:34PM +0000, Jim Harris wrote:
> > On Fri, Dec 08, 2023 at 03:41:59PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Dec 08, 2023 at 05:07:22PM +0000, Jim Harris wrote:
> > > > 
> > > > Maybe for now we just whack this specific mole with a separate mutex
> > > > for synchronizing access to sriov->num_VFs in the sysfs paths?
> > > > Something like this (tested on my system):
> > > 
> > > TBH, I don't have the time right now to unpack this locking
> > > mystery. Maybe Leon remembers?
> > > 
> > > device_lock() gets everywhere and does a lot of different stuff, so I
> > > would be surprised if it was so easy..
> > 
> > The store() side still keeps the device_lock(), it just also acquires this
> > new sriov lock. So the store() side should observe zero differences. The only
> > difference is now the show() side can acquire just the more-granular lock,
> > since it is only trying to synchronize on sriov->num_VFs with the store()
> > side. But maybe I'm missing something subtle here...
> 
> Oh if that is the only goal then probably a READ_ONCE is fine

I would say it is worth reverting patch
35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes"),
as there is no promise that netdev devices (as presented in the script at
https://bugzilla.kernel.org/show_bug.cgi?id=202991), which have a different
lifetime model, will appear only after the sysfs change in the PF.

A netlink event means netdev FOO is ready; anyone who needs to follow
sriov_numvfs should listen for sysfs events instead.

In addition, I would do this change:

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 25dbe85c4217..3b768e20c7ab 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -683,8 +683,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
 	if (rc)
 		goto err_pcibios;
 
-	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
 	iov->num_VFs = nr_virtfn;
+	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
 
 	return 0;

Thanks

> 
> Jason
> 
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-11  7:20 ` Leon Romanovsky
@ 2023-12-12 21:34 ` Jim Harris
  2023-12-13  6:55   ` Leon Romanovsky
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-12 21:34 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jason Gunthorpe, Alex Williamson, bhelgaas@google.com,
	linux-pci@vger.kernel.org, kvm@vger.kernel.org, ben@nvidia.com,
	pierre.cregut@orange.com

On Mon, Dec 11, 2023 at 09:20:06AM +0200, Leon Romanovsky wrote:
> On Sun, Dec 10, 2023 at 03:05:49PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 08, 2023 at 08:09:34PM +0000, Jim Harris wrote:
> > > 
> > > The store() side still keeps the device_lock(), it just also acquires this
> > > new sriov lock. So the store() side should observe zero differences. The only
> > > difference is now the show() side can acquire just the more-granular lock,
> > > since it is only trying to synchronize on sriov->num_VFs with the store()
> > > side. But maybe I'm missing something subtle here...
> > 
> > Oh if that is the only goal then probably a READ_ONCE is fine

IIUC, the synchronization was to block readers of sriov_numvfs if a writer was
in the process of driver->sriov_configure(). Presumably sriov_configure()
can take a long time, and it was better to block the sysfs read rather than
return a stale value.

> I would say it is worth reverting patch
> 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes"),
> as there is no promise that netdev devices (as presented in the script at
> https://bugzilla.kernel.org/show_bug.cgi?id=202991), which have a different
> lifetime model, will appear only after the sysfs change in the PF.

But I guess you're saying using the sysfs change as any kind of indicator
is wrong to begin with.

> A netlink event means netdev FOO is ready; anyone who needs to follow
> sriov_numvfs should listen for sysfs events instead.
> 
> In addition, I would do this change:
> 
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index 25dbe85c4217..3b768e20c7ab 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -683,8 +683,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
>  	if (rc)
>  		goto err_pcibios;
>  
> -	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
>  	iov->num_VFs = nr_virtfn;
> +	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);

Ack. I'll post patches for both of these suggestions.
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-12 21:34 ` Jim Harris
@ 2023-12-13  6:55 ` Leon Romanovsky
  0 siblings, 0 replies; 16+ messages in thread
From: Leon Romanovsky @ 2023-12-13  6:55 UTC (permalink / raw)
To: Jim Harris
Cc: Jason Gunthorpe, Alex Williamson, bhelgaas@google.com,
	linux-pci@vger.kernel.org, kvm@vger.kernel.org, ben@nvidia.com,
	pierre.cregut@orange.com

On Tue, Dec 12, 2023 at 09:34:43PM +0000, Jim Harris wrote:
> On Mon, Dec 11, 2023 at 09:20:06AM +0200, Leon Romanovsky wrote:
> > On Sun, Dec 10, 2023 at 03:05:49PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Dec 08, 2023 at 08:09:34PM +0000, Jim Harris wrote:
> > > > 
> > > > The store() side still keeps the device_lock(), it just also acquires this
> > > > new sriov lock. So the store() side should observe zero differences. The only
> > > > difference is now the show() side can acquire just the more-granular lock,
> > > > since it is only trying to synchronize on sriov->num_VFs with the store()
> > > > side. But maybe I'm missing something subtle here...
> > > 
> > > Oh if that is the only goal then probably a READ_ONCE is fine
> 
> IIUC, the synchronization was to block readers of sriov_numvfs if a writer was
> in the process of driver->sriov_configure(). Presumably sriov_configure()
> can take a long time, and it was better to block the sysfs read rather than
> return a stale value.

Nothing prevents the user from writing to sriov_numvfs in one thread and
reading it from another thread. The user will get a stale value anyway.

> > I would say it is worth reverting patch
> > 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes"),
> > as there is no promise that netdev devices (as presented in the script at
> > https://bugzilla.kernel.org/show_bug.cgi?id=202991), which have a different
> > lifetime model, will appear only after the sysfs change in the PF.
> 
> But I guess you're saying using the sysfs change as any kind of indicator
> is wrong to begin with.

Yes

> > A netlink event means netdev FOO is ready; anyone who needs to follow
> > sriov_numvfs should listen for sysfs events instead.
> > 
> > In addition, I would do this change:
> > 
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 25dbe85c4217..3b768e20c7ab 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -683,8 +683,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn)
> >  	if (rc)
> >  		goto err_pcibios;
> >  
> > -	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> >  	iov->num_VFs = nr_virtfn;
> > +	kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE);
> 
> Ack. I'll post patches for both of these suggestions.

Thanks
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-07 23:48 ` Jason Gunthorpe
  2023-12-08 17:07 ` Jim Harris
@ 2023-12-08 17:38 ` Jim Harris
  2023-12-08 17:41   ` Jason Gunthorpe
  1 sibling, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-08 17:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, bhelgaas@google.com, linux-pci@vger.kernel.org,
	kvm@vger.kernel.org, ben@nvidia.com

On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> 
> The mechanism of waiting in remove for userspace is inherently flawed,
> it can never work fully correctly. :( I've hit this many times.
> 
> Upon remove VFIO should immediately remove itself and leave behind a
> non-functional file descriptor. Userspace should catch up eventually
> and see it is toast.

One nice aspect of the current design is that vfio will leave the BARs
mapped until userspace releases the vfio handle. It avoids some rather
nasty hacks for handling SIGBUS errors in the fast path (e.g. writing
NVMe doorbells) where we cannot try to check for device removal on
every MMIO write. Would your proposal immediately yank the BARs, without
waiting for userspace to respond? This is mostly for my curiosity - SPDK
already has these hacks implemented, so I don't think it would be
affected by this kind of change in behavior.
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 17:38 ` Jim Harris
@ 2023-12-08 17:41 ` Jason Gunthorpe
  2023-12-08 17:59   ` Jim Harris
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-08 17:41 UTC (permalink / raw)
To: Jim Harris
Cc: Alex Williamson, bhelgaas@google.com, linux-pci@vger.kernel.org,
	kvm@vger.kernel.org, ben@nvidia.com

On Fri, Dec 08, 2023 at 05:38:51PM +0000, Jim Harris wrote:
> On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> > 
> > The mechanism of waiting in remove for userspace is inherently flawed,
> > it can never work fully correctly. :( I've hit this many times.
> > 
> > Upon remove VFIO should immediately remove itself and leave behind a
> > non-functional file descriptor. Userspace should catch up eventually
> > and see it is toast.
> 
> One nice aspect of the current design is that vfio will leave the BARs
> mapped until userspace releases the vfio handle. It avoids some rather
> nasty hacks for handling SIGBUS errors in the fast path (e.g. writing
> NVMe doorbells) where we cannot try to check for device removal on
> every MMIO write. Would your proposal immediately yank the BARs, without
> waiting for userspace to respond? This is mostly for my curiosity - SPDK
> already has these hacks implemented, so I don't think it would be
> affected by this kind of change in behavior.

What we did in RDMA was map a dummy page to the BARs so the SIGBUS was
avoided. But in that case RDMA knows the BAR memory is used only for
doorbell writes, so this is a reasonable thing to do.

Jason
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 17:41 ` Jason Gunthorpe
@ 2023-12-08 17:59 ` Jim Harris
  2023-12-08 18:01   ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Jim Harris @ 2023-12-08 17:59 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Alex Williamson, bhelgaas@google.com, linux-pci@vger.kernel.org,
	kvm@vger.kernel.org, ben@nvidia.com

On Fri, Dec 08, 2023 at 01:41:09PM -0400, Jason Gunthorpe wrote:
> On Fri, Dec 08, 2023 at 05:38:51PM +0000, Jim Harris wrote:
> > On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> > > 
> > > The mechanism of waiting in remove for userspace is inherently flawed,
> > > it can never work fully correctly. :( I've hit this many times.
> > > 
> > > Upon remove VFIO should immediately remove itself and leave behind a
> > > non-functional file descriptor. Userspace should catch up eventually
> > > and see it is toast.
> > 
> > One nice aspect of the current design is that vfio will leave the BARs
> > mapped until userspace releases the vfio handle. It avoids some rather
> > nasty hacks for handling SIGBUS errors in the fast path (e.g. writing
> > NVMe doorbells) where we cannot try to check for device removal on
> > every MMIO write. Would your proposal immediately yank the BARs, without
> > waiting for userspace to respond? This is mostly for my curiosity - SPDK
> > already has these hacks implemented, so I don't think it would be
> > affected by this kind of change in behavior.
> 
> What we did in RDMA was map a dummy page to the BARs so the SIGBUS was
> avoided. But in that case RDMA knows the BAR memory is used only for
> doorbell writes, so this is a reasonable thing to do.

Yeah, this is exactly what SPDK (and DPDK) does today.
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs
  2023-12-08 17:59 ` Jim Harris
@ 2023-12-08 18:01 ` Jason Gunthorpe
  2023-12-08 18:12   ` Alex Williamson
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-12-08 18:01 UTC (permalink / raw)
To: Jim Harris
Cc: Alex Williamson, bhelgaas@google.com, linux-pci@vger.kernel.org,
	kvm@vger.kernel.org, ben@nvidia.com

On Fri, Dec 08, 2023 at 05:59:17PM +0000, Jim Harris wrote:
> On Fri, Dec 08, 2023 at 01:41:09PM -0400, Jason Gunthorpe wrote:
> > On Fri, Dec 08, 2023 at 05:38:51PM +0000, Jim Harris wrote:
> > > On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote:
> > > > 
> > > > The mechanism of waiting in remove for userspace is inherently flawed,
> > > > it can never work fully correctly. :( I've hit this many times.
> > > > 
> > > > Upon remove VFIO should immediately remove itself and leave behind a
> > > > non-functional file descriptor. Userspace should catch up eventually
> > > > and see it is toast.
> > > 
> > > One nice aspect of the current design is that vfio will leave the BARs
> > > mapped until userspace releases the vfio handle. It avoids some rather
> > > nasty hacks for handling SIGBUS errors in the fast path (e.g. writing
> > > NVMe doorbells) where we cannot try to check for device removal on
> > > every MMIO write. Would your proposal immediately yank the BARs, without
> > > waiting for userspace to respond? This is mostly for my curiosity - SPDK
> > > already has these hacks implemented, so I don't think it would be
> > > affected by this kind of change in behavior.
> > 
> > What we did in RDMA was map a dummy page to the BARs so the SIGBUS was
> > avoided. But in that case RDMA knows the BAR memory is used only for
> > doorbell writes, so this is a reasonable thing to do.
> 
> Yeah, this is exactly what SPDK (and DPDK) does today.

To be clear, I mean we did it in the kernel.

When the device driver is removed we zap all the VMAs and install a
fault handler that installs the dummy page instead of SIGBUS.

The application doesn't do anything, and this is how SPDK already will
be supporting device hot unplug of the RDMA drivers.

Jason
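The in-kernel scheme Jason describes could be sketched roughly like this (a hypothetical, simplified illustration, not the actual RDMA driver code; dummy_page and the mapping argument are placeholders):

```c
/* remove path: tear down all existing userspace PTEs for the BARs;
 * later accesses fault into dummy_fault() below instead */
static void zap_bar_mappings(struct address_space *mapping)
{
	unmap_mapping_range(mapping, 0, 0, 1);
}

static struct page *dummy_page;	/* placeholder: allocated at remove time */

static vm_fault_t dummy_fault(struct vm_fault *vmf)
{
	/* back the faulting page with the shared dummy page instead of
	 * raising SIGBUS: writes are silently discarded, reads return
	 * junk, and the application keeps running */
	get_page(dummy_page);
	vmf->page = dummy_page;
	return 0;
}

static const struct vm_operations_struct zapped_bar_vm_ops = {
	.fault = dummy_fault,
};
```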
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs 2023-12-08 18:01 ` Jason Gunthorpe @ 2023-12-08 18:12 ` Alex Williamson 2023-12-08 19:43 ` Jason Gunthorpe 0 siblings, 1 reply; 16+ messages in thread From: Alex Williamson @ 2023-12-08 18:12 UTC (permalink / raw) To: Jason Gunthorpe Cc: Jim Harris, bhelgaas@google.com, linux-pci@vger.kernel.org, kvm@vger.kernel.org, ben@nvidia.com On Fri, 8 Dec 2023 14:01:57 -0400 Jason Gunthorpe <jgg@nvidia.com> wrote: > On Fri, Dec 08, 2023 at 05:59:17PM +0000, Jim Harris wrote: > > On Fri, Dec 08, 2023 at 01:41:09PM -0400, Jason Gunthorpe wrote: > > > On Fri, Dec 08, 2023 at 05:38:51PM +0000, Jim Harris wrote: > > > > On Thu, Dec 07, 2023 at 07:48:10PM -0400, Jason Gunthorpe wrote: > > > > > > > > > > The mechanism of waiting in remove for userspace is inherently flawed, > > > > > it can never work fully correctly. :( I've hit this many times. > > > > > > > > > > Upon remove VFIO should immediately remove itself and leave behind a > > > > > non-functional file descriptor. Userspace should catch up eventually > > > > > and see it is toast. > > > > > > > > One nice aspect of the current design is that vfio will leave the BARs > > > > mapped until userspace releases the vfio handle. It avoids some rather > > > > nasty hacks for handling SIGBUS errors in the fast path (i.e. writing > > > > NVMe doorbells) where we cannot try to check for device removal on > > > > every MMIO write. Would your proposal immediately yank the BARs, without > > > > waiting for userspace to respond? This is mostly for my curiosity - SPDK > > > > already has these hacks implemented, so I don't think it would be > > > > affected by this kind of change in behavior. > > > > > > What we did in RDMA was map a dummy page to the BARs so the sigbus was > > > avoided. But in that case RDMA knows the BAR memory is used only for > > > doorbell write so this is a reasonable thing to do. > > > > Yeah, this is exactly what SPDK (and DPDK) does today. 
> > To be clear, I mean we did it in the kernel. > > When the device driver is removed we zap all the VMAs and install a > fault handler that installs the dummy page instead of SIGBUS > > The application doesn't do anything, and this is how SPDK already will > be supporting device hot unplug of the RDMA drivers. But I think you can only do that in the kernel because you understand the device uses those pages for doorbells and it's not a general-purpose solution, right? Perhaps a variant driver could do something similar for NVMe devices' doorbell pages, but a device-agnostic driver like vfio-pci would need to SIGBUS on access or else we risk significant data integrity issues. Thanks, Alex ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Locking between vfio hot-remove and pci sysfs sriov_numvfs 2023-12-08 18:12 ` Alex Williamson @ 2023-12-08 19:43 ` Jason Gunthorpe 0 siblings, 0 replies; 16+ messages in thread From: Jason Gunthorpe @ 2023-12-08 19:43 UTC (permalink / raw) To: Alex Williamson Cc: Jim Harris, bhelgaas@google.com, linux-pci@vger.kernel.org, kvm@vger.kernel.org, ben@nvidia.com On Fri, Dec 08, 2023 at 11:12:15AM -0700, Alex Williamson wrote: > > > > avoided. But in that case RDMA knows the BAR memory is used only for > > > > doorbell write so this is a reasonable thing to do. > > > > > > Yeah, this is exactly what SPDK (and DPDK) does today. > > > > To be clear, I mean we did it in the kernel. > > > > When the device driver is removed we zap all the VMAs and install a > > fault handler that installs the dummy page instead of SIGBUS > > > > The application doesn't do anything, and this is how SPDK already will > > be supporting device hot unplug of the RDMA drivers. > > But I think you can only do that in the kernel because you understand > the device uses those pages for doorbells and it's not a general > purpose solution, right? > > Perhaps a variant driver could do something similar for NVMe devices > doorbell pages, but a device agnostic driver like vfio-pci would need > to SIGBUS on access or else we risk significant data integrity issues. > Thanks, Yes, basically. Might be interesting to consider having a VFIO FEATURE flag to opt into SIGBUS or dummy page, perhaps even on a VMA-by-VMA basis. Jason ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads: [~2023-12-13 6:56 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <CGME20231207223824uscas1p27dd91f0af56cda282cd28046cc981fe9@uscas1p2.samsung.com>
2023-12-07 22:38 ` Locking between vfio hot-remove and pci sysfs sriov_numvfs Jim Harris
2023-12-07 23:21 ` Alex Williamson
2023-12-07 23:48 ` Jason Gunthorpe
2023-12-08 17:07 ` Jim Harris
2023-12-08 19:41 ` Jason Gunthorpe
2023-12-08 20:09 ` Jim Harris
2023-12-10 19:05 ` Jason Gunthorpe
2023-12-11 7:20 ` Leon Romanovsky
2023-12-12 21:34 ` Jim Harris
2023-12-13 6:55 ` Leon Romanovsky
2023-12-08 17:38 ` Jim Harris
2023-12-08 17:41 ` Jason Gunthorpe
2023-12-08 17:59 ` Jim Harris
2023-12-08 18:01 ` Jason Gunthorpe
2023-12-08 18:12 ` Alex Williamson
2023-12-08 19:43 ` Jason Gunthorpe