* [PATCH v2 0/2] PCI/IOV: sriov_numvfs bug fixes [not found] <CGME20240209235208uscas1p26c658c64cc85711cd3aa6312224164fc@uscas1p2.samsung.com> @ 2024-02-09 23:52 ` Jim Harris 2024-02-09 23:52 ` [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" Jim Harris 2024-02-09 23:52 ` [PATCH v2 2/2] pci/iov: fix kobject_uevent() ordering in sriov_enable() Jim Harris 0 siblings, 2 replies; 24+ messages in thread From: Jim Harris @ 2024-02-09 23:52 UTC (permalink / raw) To: Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Leon Romanovsky, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com A deadlock condition was discovered by SPDK when removing an SRIOV-enabled and VFIO-attached device, with a specific threading model. While discussing this on the mailing list, a separate issue around updating sriov_numvfs and its kobject_uevent() was also discovered. This series fixes both of those issues. --- v1 => v2: * No code changes * Updated commit messages per feedback from Bjorn * Added Leon's Reviewed-by tags * Moved bulk of the v1 cover letter to the revert patch commit message where it belonged Jim Harris (2): PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" PCI/IOV: fix kobject_uevent() ordering in sriov_enable() drivers/pci/iov.c | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) -- ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-09 23:52 ` [PATCH v2 0/2] PCI/IOV: sriov_numvfs bug fixes Jim Harris @ 2024-02-09 23:52 ` Jim Harris 2024-02-10 3:20 ` Kuppuswamy Sathyanarayanan 2024-02-09 23:52 ` [PATCH v2 2/2] pci/iov: fix kobject_uevent() ordering in sriov_enable() Jim Harris 1 sibling, 1 reply; 24+ messages in thread From: Jim Harris @ 2024-02-09 23:52 UTC (permalink / raw) To: Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Leon Romanovsky, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com If an SR-IOV enabled device is held by vfio, and the device is removed, vfio will hold device lock and notify userspace of the removal. If userspace reads the sriov_numvfs sysfs entry, that thread will be blocked since sriov_numvfs_show() also tries to acquire the device lock. If that same thread is responsible for releasing the device to vfio, it results in a deadlock. The proper way to detect a change to the num_VFs value is to listen for a sysfs event, not to add a device_lock() on the attribute _show() in the kernel. This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ Suggested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jim Harris <jim.harris@samsung.com> --- drivers/pci/iov.c | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index aaa33e8dc4c9..0ca20cd518d5 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, char *buf) { struct pci_dev *pdev = to_pci_dev(dev); - u16 num_vfs; - - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ - device_lock(&pdev->dev); - num_vfs = pdev->sriov->num_VFs; - device_unlock(&pdev->dev); - return sysfs_emit(buf, "%u\n", num_vfs); + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); } /* ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-09 23:52 ` [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" Jim Harris @ 2024-02-10 3:20 ` Kuppuswamy Sathyanarayanan 2024-02-11 8:48 ` Leon Romanovsky 0 siblings, 1 reply; 24+ messages in thread From: Kuppuswamy Sathyanarayanan @ 2024-02-10 3:20 UTC (permalink / raw) To: Jim Harris, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Leon Romanovsky, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com On 2/9/24 3:52 PM, Jim Harris wrote: > If an SR-IOV enabled device is held by vfio, and the device is removed, > vfio will hold device lock and notify userspace of the removal. If > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > since sriov_numvfs_show() also tries to acquire the device lock. If that > same thread is responsible for releasing the device to vfio, it results in > a deadlock. > > The proper way to detect a change to the num_VFs value is to listen for a > sysfs event, not to add a device_lock() on the attribute _show() in the > kernel. Since you are reverting a commit that synchronizes SysFS read /write, please add some comments about why it is not an issue anymore. > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > Signed-off-by: Jim Harris <jim.harris@samsung.com> > --- > drivers/pci/iov.c | 8 +------- > 1 file changed, 1 insertion(+), 7 deletions(-) > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index aaa33e8dc4c9..0ca20cd518d5 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > char *buf) > { > struct pci_dev *pdev = to_pci_dev(dev); > - u16 num_vfs; > - > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > - device_lock(&pdev->dev); > - num_vfs = pdev->sriov->num_VFs; > - device_unlock(&pdev->dev); > > - return sysfs_emit(buf, "%u\n", num_vfs); > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > } > > /* > -- Sathyanarayanan Kuppuswamy Linux Kernel Developer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-10 3:20 ` Kuppuswamy Sathyanarayanan @ 2024-02-11 8:48 ` Leon Romanovsky 2024-02-11 19:15 ` Kuppuswamy Sathyanarayanan 2024-02-12 20:27 ` Bjorn Helgaas 0 siblings, 2 replies; 24+ messages in thread From: Leon Romanovsky @ 2024-02-11 8:48 UTC (permalink / raw) To: Kuppuswamy Sathyanarayanan Cc: Jim Harris, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > On 2/9/24 3:52 PM, Jim Harris wrote: > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > vfio will hold device lock and notify userspace of the removal. If > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > same thread is responsible for releasing the device to vfio, it results in > > a deadlock. > > > > The proper way to detect a change to the num_VFs value is to listen for a > > sysfs event, not to add a device_lock() on the attribute _show() in the > > kernel. > > Since you are reverting a commit that synchronizes SysFS read > /write, please add some comments about why it is not an > issue anymore. It was never an issue, the idea that sysfs read and write should be serialized by kernel is not correct by definition. Thanks > > > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > --- > > drivers/pci/iov.c | 8 +------- > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > --- a/drivers/pci/iov.c > > +++ b/drivers/pci/iov.c > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > char *buf) > > { > > struct pci_dev *pdev = to_pci_dev(dev); > > - u16 num_vfs; > > - > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > - device_lock(&pdev->dev); > > - num_vfs = pdev->sriov->num_VFs; > > - device_unlock(&pdev->dev); > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > } > > > > /* > > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-11 8:48 ` Leon Romanovsky @ 2024-02-11 19:15 ` Kuppuswamy Sathyanarayanan 2024-02-12 9:31 ` Leon Romanovsky 2024-02-12 20:27 ` Bjorn Helgaas 1 sibling, 1 reply; 24+ messages in thread From: Kuppuswamy Sathyanarayanan @ 2024-02-11 19:15 UTC (permalink / raw) To: Leon Romanovsky Cc: Jim Harris, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com On 2/11/24 12:48 AM, Leon Romanovsky wrote: > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: >> On 2/9/24 3:52 PM, Jim Harris wrote: >>> If an SR-IOV enabled device is held by vfio, and the device is removed, >>> vfio will hold device lock and notify userspace of the removal. If >>> userspace reads the sriov_numvfs sysfs entry, that thread will be blocked >>> since sriov_numvfs_show() also tries to acquire the device lock. If that >>> same thread is responsible for releasing the device to vfio, it results in >>> a deadlock. >>> >>> The proper way to detect a change to the num_VFs value is to listen for a >>> sysfs event, not to add a device_lock() on the attribute _show() in the >>> kernel. >> Since you are reverting a commit that synchronizes SysFS read >> /write, please add some comments about why it is not an >> issue anymore. > It was never an issue, the idea that sysfs read and write should be serialized by kernel > is not correct by definition. What: /sys/bus/pci/devices/.../sriov_numvfs Date: November 2012 Contact: Donald Dutile <ddutile@redhat.com> Description: This file appears when a physical PCIe device supports SR-IOV. Userspace applications can read and write to this file to determine and control the enablement or disablement of Virtual Functions (VFs) on the physical function (PF). A read of this file will return the number of VFs that are enabled on this PF. I am not very clear about the user of this SysFs. But, as per above description, this sysfs seems to controls the number of VFs. A typical usage is to allow user to write a value and then read to check the enabled/disabled number of VMs, right? If you are not synchronizing, then the value returned may not reflect the actual number of enabled / disabled VFs. So wont this change affect the existing user of this SysFS. > > Thanks > >>> This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. >>> Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). >>> >>> Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ >>> Suggested-by: Leon Romanovsky <leonro@nvidia.com> >>> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> >>> Signed-off-by: Jim Harris <jim.harris@samsung.com> >>> --- >>> drivers/pci/iov.c | 8 +------- >>> 1 file changed, 1 insertion(+), 7 deletions(-) >>> >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c >>> index aaa33e8dc4c9..0ca20cd518d5 100644 >>> --- a/drivers/pci/iov.c >>> +++ b/drivers/pci/iov.c >>> @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, >>> char *buf) >>> { >>> struct pci_dev *pdev = to_pci_dev(dev); >>> - u16 num_vfs; >>> - >>> - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ >>> - device_lock(&pdev->dev); >>> - num_vfs = pdev->sriov->num_VFs; >>> - device_unlock(&pdev->dev); >>> >>> - return sysfs_emit(buf, "%u\n", num_vfs); >>> + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); >>> } >>> >>> /* >>> >> -- >> Sathyanarayanan Kuppuswamy >> Linux Kernel Developer >> -- Sathyanarayanan Kuppuswamy Linux Kernel Developer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-11 19:15 ` Kuppuswamy Sathyanarayanan @ 2024-02-12 9:31 ` Leon Romanovsky 0 siblings, 0 replies; 24+ messages in thread From: Leon Romanovsky @ 2024-02-12 9:31 UTC (permalink / raw) To: Kuppuswamy Sathyanarayanan Cc: Jim Harris, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com On Sun, Feb 11, 2024 at 11:15:44AM -0800, Kuppuswamy Sathyanarayanan wrote: > > On 2/11/24 12:48 AM, Leon Romanovsky wrote: > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > >> On 2/9/24 3:52 PM, Jim Harris wrote: > >>> If an SR-IOV enabled device is held by vfio, and the device is removed, > >>> vfio will hold device lock and notify userspace of the removal. If > >>> userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > >>> since sriov_numvfs_show() also tries to acquire the device lock. If that > >>> same thread is responsible for releasing the device to vfio, it results in > >>> a deadlock. > >>> > >>> The proper way to detect a change to the num_VFs value is to listen for a > >>> sysfs event, not to add a device_lock() on the attribute _show() in the > >>> kernel. > >> Since you are reverting a commit that synchronizes SysFS read > >> /write, please add some comments about why it is not an > >> issue anymore. > > It was never an issue, the idea that sysfs read and write should be serialized by kernel > > is not correct by definition. > > What: /sys/bus/pci/devices/.../sriov_numvfs > Date: November 2012 > Contact: Donald Dutile <ddutile@redhat.com> > Description: > This file appears when a physical PCIe device supports SR-IOV. > Userspace applications can read and write to this file to > determine and control the enablement or disablement of Virtual > Functions (VFs) on the physical function (PF). A read of this > file will return the number of VFs that are enabled on this PF. > > I am not very clear about the user of this SysFs. But, as per above description, > this sysfs seems to controls the number of VFs. A typical usage is to allow user > to write a value and then read to check the enabled/disabled number of VMs, > right? No, typical usage is to write a value and observe spawned VFs. The read from sysfs is allowed for completeness and performed in sequence with write. Any attempt to read and write in parallel is prone to races by definition as they are not controlled by the kernel. > > If you are not synchronizing, then the value returned may not reflect the actual > number of enabled / disabled VFs. So wont this change affect the existing user > of this SysFS. No, it won't. Users were supposed to synchronize their read and write before calling sysfs. If they didn't do it, they already have broken code. Please read original discussion pointed by Jim. Thanks > > > > > Thanks > > > >>> This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > >>> Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > >>> > >>> Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > >>> Suggested-by: Leon Romanovsky <leonro@nvidia.com> > >>> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > >>> Signed-off-by: Jim Harris <jim.harris@samsung.com> > >>> --- > >>> drivers/pci/iov.c | 8 +------- > >>> 1 file changed, 1 insertion(+), 7 deletions(-) > >>> > >>> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > >>> index aaa33e8dc4c9..0ca20cd518d5 100644 > >>> --- a/drivers/pci/iov.c > >>> +++ b/drivers/pci/iov.c > >>> @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > >>> char *buf) > >>> { > >>> struct pci_dev *pdev = to_pci_dev(dev); > >>> - u16 num_vfs; > >>> - > >>> - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > >>> - device_lock(&pdev->dev); > >>> - num_vfs = pdev->sriov->num_VFs; > >>> - device_unlock(&pdev->dev); > >>> > >>> - return sysfs_emit(buf, "%u\n", num_vfs); > >>> + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > >>> } > >>> > >>> /* > >>> > >> -- > >> Sathyanarayanan Kuppuswamy > >> Linux Kernel Developer > >> > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-11 8:48 ` Leon Romanovsky 2024-02-11 19:15 ` Kuppuswamy Sathyanarayanan @ 2024-02-12 20:27 ` Bjorn Helgaas 2024-02-12 22:59 ` Jim Harris 2024-02-13 7:34 ` Leon Romanovsky 1 sibling, 2 replies; 24+ messages in thread From: Bjorn Helgaas @ 2024-02-12 20:27 UTC (permalink / raw) To: Leon Romanovsky Cc: Kuppuswamy Sathyanarayanan, Jim Harris, Bjorn Helgaas, linux-pci, linux-kernel, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > vfio will hold device lock and notify userspace of the removal. If > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > same thread is responsible for releasing the device to vfio, it results in > > > a deadlock. > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > kernel. The lock was not about detecting a change; Pierre did this: ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ cat ${path}/device/sriov_numvfs; \ which I assume works by listening for sysfs events. The problem was that after the event occurred, the sriov_numvfs read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). So I would drop this sentence because I don't think it accurately reflects the reason for 35ff867b7657. > > Since you are reverting a commit that synchronizes SysFS read > > /write, please add some comments about why it is not an > > issue anymore. > > It was never an issue, the idea that sysfs read and write should be > serialized by kernel is not correct by definition. I think it *was* an issue. The behavior Pierre observed at was clearly wrong, and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") to resolve it. We should try to avoid reintroducing the problem, so I think we should probably squash these two patches and describe it as a deadlock fix instead of dismissing 35ff867b7657 as being based on false premises. It would be awesome if you had time to verify that these patches also resolve the problem you saw, Pierre. I think we should also add: Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") as a trigger for backporting this to kernels that include 35ff867b7657. Bjorn > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > > --- > > > drivers/pci/iov.c | 8 +------- > > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > > --- a/drivers/pci/iov.c > > > +++ b/drivers/pci/iov.c > > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > > char *buf) > > > { > > > struct pci_dev *pdev = to_pci_dev(dev); > > > - u16 num_vfs; > > > - > > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > > - device_lock(&pdev->dev); > > > - num_vfs = pdev->sriov->num_VFs; > > > - device_unlock(&pdev->dev); > > > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > > } > > > > > > /* > > > > > -- > > Sathyanarayanan Kuppuswamy > > Linux Kernel Developer > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-12 20:27 ` Bjorn Helgaas @ 2024-02-12 22:59 ` Jim Harris 2024-02-13 7:37 ` Leon Romanovsky ` (2 more replies) 2024-02-13 7:34 ` Leon Romanovsky 1 sibling, 3 replies; 24+ messages in thread From: Jim Harris @ 2024-02-12 22:59 UTC (permalink / raw) To: Bjorn Helgaas Cc: Leon Romanovsky, Kuppuswamy Sathyanarayanan, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > > vfio will hold device lock and notify userspace of the removal. If > > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > > same thread is responsible for releasing the device to vfio, it results in > > > > a deadlock. > > > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > > kernel. > > The lock was not about detecting a change; Pierre did this: > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > cat ${path}/device/sriov_numvfs; \ > > which I assume works by listening for sysfs events. The problem was > that after the event occurred, the sriov_numvfs read got a stale value > (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). I don't think 'ip monitor dev' listens for any sysfs events. Or at least if I have this running and write values to sriov_numvfs, I don't see any output. It looks like the original bug report was against v5.0 (matching by dates and the patch file attached). In that code, we have: kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); iov->num_VFs = nr_virtfn; which is identical to how the code looks today. Is it possible that userspace could react to this uevent and read the stale num_VFs before iov->num_VFs gets written here? I mean, theoretically it's possible, but from the bug report it seems like the scenario Pierre was facing was 100% reproducible. It would be great if we could get input from Pierre on this. It isn't clear to me from the bug report what exactly is updating the sriov_numvfs sysfs entry, and what is triggering that update. We could also revisit my original suggestion, which was to use a discrete lock just for this sysfs entry, rather than overloading the device lock. That probably has lower risk of introducing an unintended regression. https://lore.kernel.org/linux-pci/ZXNNQkXzluoyeguu@bgt-140510-bm01.eng.stellus.in/ > > So I would drop this sentence because I don't think it accurately > reflects the reason for 35ff867b7657. > > > > Since you are reverting a commit that synchronizes SysFS read > > > /write, please add some comments about why it is not an > > > issue anymore. > > > > It was never an issue, the idea that sysfs read and write should be > > serialized by kernel is not correct by definition. > > I think it *was* an issue. The behavior Pierre observed at was > clearly wrong, and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > sriov_numvfs reads vs writes") to resolve it. > > We should try to avoid reintroducing the problem, so I think we should > probably squash these two patches and describe it as a deadlock fix > instead of dismissing 35ff867b7657 as being based on false premises. > > It would be awesome if you had time to verify that these patches also > resolve the problem you saw, Pierre. > > I think we should also add: > > Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") > > as a trigger for backporting this to kernels that include > 35ff867b7657. > > Bjorn > > > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > > > --- > > > > drivers/pci/iov.c | 8 +------- > > > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > > > --- a/drivers/pci/iov.c > > > > +++ b/drivers/pci/iov.c > > > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > > > char *buf) > > > > { > > > > struct pci_dev *pdev = to_pci_dev(dev); > > > > - u16 num_vfs; > > > > - > > > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > > > - device_lock(&pdev->dev); > > > > - num_vfs = pdev->sriov->num_VFs; > > > > - device_unlock(&pdev->dev); > > > > > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > > > } > > > > > > > > /* > > > > > > > -- > > > Sathyanarayanan Kuppuswamy > > > Linux Kernel Developer > > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-12 22:59 ` Jim Harris @ 2024-02-13 7:37 ` Leon Romanovsky 2024-02-13 9:40 ` pierre.cregut 2024-02-13 14:59 ` Jason Gunthorpe 2 siblings, 0 replies; 24+ messages in thread From: Leon Romanovsky @ 2024-02-13 7:37 UTC (permalink / raw) To: Jim Harris Cc: Bjorn Helgaas, Kuppuswamy Sathyanarayanan, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Mon, Feb 12, 2024 at 10:59:03PM +0000, Jim Harris wrote: > On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > > > vfio will hold device lock and notify userspace of the removal. If > > > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > > > same thread is responsible for releasing the device to vfio, it results in > > > > > a deadlock. > > > > > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > > > kernel. > > > > The lock was not about detecting a change; Pierre did this: > > > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > > cat ${path}/device/sriov_numvfs; \ > > > > which I assume works by listening for sysfs events. The problem was > > that after the event occurred, the sriov_numvfs read got a stale value > > (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). > > I don't think 'ip monitor dev' listens for any sysfs events. Or at least if > I have this running and write values to sriov_numvfs, I don't see any > output. > > It looks like the original bug report was against v5.0 (matching by dates > and the patch file attached). In that code, we have: > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > iov->num_VFs = nr_virtfn; > > which is identical to how the code looks today. Is it possible that > userspace could react to this uevent and read the stale num_VFs before > iov->num_VFs gets written here? I mean, theoretically it's possible, but > from the bug report it seems like the scenario Pierre was facing was > 100% reproducible. > > It would be great if we could get input from Pierre on this. It isn't clear > to me from the bug report what exactly is updating the sriov_numvfs sysfs > entry, and what is triggering that update. > > We could also revisit my original suggestion, which was to use a > discrete lock just for this sysfs entry, rather than overloading the > device lock. That probably has lower risk of introducing an unintended > regression. The idea that lock issues are need to be solved by adding more locks doesn't sound good to me. Thanks > > https://lore.kernel.org/linux-pci/ZXNNQkXzluoyeguu@bgt-140510-bm01.eng.stellus.in/ > > > > > So I would drop this sentence because I don't think it accurately > > reflects the reason for 35ff867b7657. > > > > > > Since you are reverting a commit that synchronizes SysFS read > > > > /write, please add some comments about why it is not an > > > > issue anymore. > > > > > > It was never an issue, the idea that sysfs read and write should be > > > serialized by kernel is not correct by definition. > > > > I think it *was* an issue. The behavior Pierre observed at was > > clearly wrong, and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > > sriov_numvfs reads vs writes") to resolve it. > > > > We should try to avoid reintroducing the problem, so I think we should > > probably squash these two patches and describe it as a deadlock fix > > instead of dismissing 35ff867b7657 as being based on false premises. > > > > It would be awesome if you had time to verify that these patches also > > resolve the problem you saw, Pierre. > > > > I think we should also add: > > > > Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") > > > > as a trigger for backporting this to kernels that include > > 35ff867b7657. > > > > Bjorn > > > > > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > > > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > > > > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > > > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > > > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > > > > --- > > > > > drivers/pci/iov.c | 8 +------- > > > > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > > > > --- a/drivers/pci/iov.c > > > > > +++ b/drivers/pci/iov.c > > > > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > > > > char *buf) > > > > > { > > > > > struct pci_dev *pdev = to_pci_dev(dev); > > > > > - u16 num_vfs; > > > > > - > > > > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > > > > - device_lock(&pdev->dev); > > > > > - num_vfs = pdev->sriov->num_VFs; > > > > > - device_unlock(&pdev->dev); > > > > > > > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > > > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > > > > } > > > > > > > > > > /* > > > > > > > > > -- > > > > Sathyanarayanan Kuppuswamy > > > > Linux Kernel Developer > > > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-12 22:59 ` Jim Harris 2024-02-13 7:37 ` Leon Romanovsky @ 2024-02-13 9:40 ` pierre.cregut 2024-02-13 14:59 ` Jason Gunthorpe 2 siblings, 0 replies; 24+ messages in thread From: pierre.cregut @ 2024-02-13 9:40 UTC (permalink / raw) To: Jim Harris, Bjorn Helgaas Cc: Leon Romanovsky, Kuppuswamy Sathyanarayanan, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson First sorry for not answering earlier but it is a long time ago. I do not work on the topic any more (a monitoring tool Skydive, an open source project no more actively developed as far as I know) and only have vague memories of it. > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > iov->num_VFs = nr_virtfn; > > which is identical to how the code looks today. Is it possible that > userspace could react to this uevent and read the stale num_VFs before > iov->num_VFs gets written here? I mean, theoretically it's possible, but > from the bug report it seems like the scenario Pierre was facing was > 100% reproducible. From my memories yes that was exactly the problem. Any stable method that could detect the change of configuration in user land and ensure that we get a reliable value of num_vfs after we received it would be fine. Best regards, Pierre Crégut ____________________________________________________________________________________________________________ Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration, Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci. This message and its attachments may contain confidential or privileged information that may be protected by law; they should not be distributed, used or copied without authorisation. If you have received this email in error, please notify the sender and delete this message and its attachments. As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified. Thank you. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-12 22:59 ` Jim Harris 2024-02-13 7:37 ` Leon Romanovsky 2024-02-13 9:40 ` pierre.cregut @ 2024-02-13 14:59 ` Jason Gunthorpe 2 siblings, 0 replies; 24+ messages in thread From: Jason Gunthorpe @ 2024-02-13 14:59 UTC (permalink / raw) To: Jim Harris Cc: Bjorn Helgaas, Leon Romanovsky, Kuppuswamy Sathyanarayanan, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Alex Williamson, Pierre Crégut On Mon, Feb 12, 2024 at 10:59:03PM +0000, Jim Harris wrote: > On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > > > vfio will hold device lock and notify userspace of the removal. If > > > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > > > same thread is responsible for releasing the device to vfio, it results in > > > > > a deadlock. > > > > > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > > > kernel. > > > > The lock was not about detecting a change; Pierre did this: > > > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > > cat ${path}/device/sriov_numvfs; \ > > > > which I assume works by listening for sysfs events. The problem was > > that after the event occurred, the sriov_numvfs read got a stale value > > (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). > > I don't think 'ip monitor dev' listens for any sysfs events. Or at least if > I have this running and write values to sriov_numvfs, I don't see any > output. The issue is that the sysfs change inadvertently throws out a netlink event (or udev event, or whatever) and something can observe that event and then turn around and read the sysfs and observe a sysfs result that hasn't caught up to the event launch. The lock fixed this because it held it across the event launch and the update of the internal state. > It looks like the original bug report was against v5.0 (matching by dates > and the patch file attached). In that code, we have: > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > iov->num_VFs = nr_virtfn; This is a udev event, I suspect the ip monitor event was thrown by driver binding during the VF creation. Jason ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-12 20:27 ` Bjorn Helgaas 2024-02-12 22:59 ` Jim Harris @ 2024-02-13 7:34 ` Leon Romanovsky 2024-02-13 15:59 ` Bjorn Helgaas 1 sibling, 1 reply; 24+ messages in thread From: Leon Romanovsky @ 2024-02-13 7:34 UTC (permalink / raw) To: Bjorn Helgaas Cc: Kuppuswamy Sathyanarayanan, Jim Harris, Bjorn Helgaas, linux-pci, linux-kernel, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > If an SR-IOV enabled device is held by vfio, and the device is removed, > > > > vfio will hold device lock and notify userspace of the removal. If > > > > userspace reads the sriov_numvfs sysfs entry, that thread will be blocked > > > > since sriov_numvfs_show() also tries to acquire the device lock. If that > > > > same thread is responsible for releasing the device to vfio, it results in > > > > a deadlock. > > > > > > > > The proper way to detect a change to the num_VFs value is to listen for a > > > > sysfs event, not to add a device_lock() on the attribute _show() in the > > > > kernel. > > The lock was not about detecting a change; Pierre did this: > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > cat ${path}/device/sriov_numvfs; \ > > which I assume works by listening for sysfs events. It is not, "ip monitor ..." listens to netlink events emitted by netdev core and not sysfs events. Sysfs events are not involved in this case. > The problem was that after the event occurred, the sriov_numvfs > read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). Yes, and it is outcome of such cross-subsytem involvement, which is racy by definition. Someone can come with even simpler example of why locking sysfs read and write is not a good idea. For example, let's consider the following scenario with two CPUs and locks on sysfs read and write: CPU1 CPU2 echo 1 > ${path}/device/sriov_numvfs context_switch -> cat ${path}/device/sriov_numvfs lock return 0 unlock context_switch <- lock set 1 unlock CPU1 CPU2 echo 1 > ${path}/device/sriov_numvfs lock set 1 unlock context_switch -> cat ${path}/device/sriov_numvfs lock return 1 unlock So same scenario will return different values if user doesn't protect such case with external to the kernel lock. But if we return back to Pierre report and if you want to provide completely bullet proof solution to solve cross-subsystem interaction, you will need to prohibit device probe till sriov_numvfs update is completed. However, it is overkill for something that is not a real issue. > > So I would drop this sentence because I don't think it accurately > reflects the reason for 35ff867b7657. > > > > Since you are reverting a commit that synchronizes SysFS read > > > /write, please add some comments about why it is not an > > > issue anymore. > > > > It was never an issue, the idea that sysfs read and write should be > > serialized by kernel is not correct by definition. > > I think it *was* an issue. The behavior Pierre observed at was > clearly wrong, I disagree with this sentence. > and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > sriov_numvfs reads vs writes") to resolve it. > > We should try to avoid reintroducing the problem, so I think we should > probably squash these two patches and describe it as a deadlock fix > instead of dismissing 35ff867b7657 as being based on false premises. > > It would be awesome if you had time to verify that these patches also > resolve the problem you saw, Pierre. They won't resolve his problem, because he is not listening to sysfs events, but rely on something from netdev side. > > I think we should also add: > > Fixes: 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes") > > as a trigger for backporting this to kernels that include > 35ff867b7657. > > Bjorn > > > > > This reverts commit 35ff867b76576e32f34c698ccd11343f7d616204. > > > > Revert had a small conflict, the sprintf() is now changed to sysfs_emit(). > > > > > > > > Link: https://lore.kernel.org/linux-pci/ZXJI5+f8bUelVXqu@ubuntu/ > > > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > > > > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > > > > Signed-off-by: Jim Harris <jim.harris@samsung.com> > > > > --- > > > > drivers/pci/iov.c | 8 +------- > > > > 1 file changed, 1 insertion(+), 7 deletions(-) > > > > > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > > index aaa33e8dc4c9..0ca20cd518d5 100644 > > > > --- a/drivers/pci/iov.c > > > > +++ b/drivers/pci/iov.c > > > > @@ -395,14 +395,8 @@ static ssize_t sriov_numvfs_show(struct device *dev, > > > > char *buf) > > > > { > > > > struct pci_dev *pdev = to_pci_dev(dev); > > > > - u16 num_vfs; > > > > - > > > > - /* Serialize vs sriov_numvfs_store() so readers see valid num_VFs */ > > > > - device_lock(&pdev->dev); > > > > - num_vfs = pdev->sriov->num_VFs; > > > > - device_unlock(&pdev->dev); > > > > > > > > - return sysfs_emit(buf, "%u\n", num_vfs); > > > > + return sysfs_emit(buf, "%u\n", pdev->sriov->num_VFs); > > > > } > > > > > > > > /* > > > > > > > -- > > > Sathyanarayanan Kuppuswamy > > > Linux Kernel Developer > > > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-13 7:34 ` Leon Romanovsky @ 2024-02-13 15:59 ` Bjorn Helgaas 2024-02-13 17:46 ` Leon Romanovsky 0 siblings, 1 reply; 24+ messages in thread From: Bjorn Helgaas @ 2024-02-13 15:59 UTC (permalink / raw) To: Leon Romanovsky Cc: Kuppuswamy Sathyanarayanan, Jim Harris, Bjorn Helgaas, linux-pci, linux-kernel, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Tue, Feb 13, 2024 at 09:34:50AM +0200, Leon Romanovsky wrote: > On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > > If an SR-IOV enabled device is held by vfio, and the device > > > > > is removed, vfio will hold device lock and notify userspace > > > > > of the removal. If userspace reads the sriov_numvfs sysfs > > > > > entry, that thread will be blocked since sriov_numvfs_show() > > > > > also tries to acquire the device lock. If that same thread > > > > > is responsible for releasing the device to vfio, it results > > > > > in a deadlock. > > > > > > > > > > The proper way to detect a change to the num_VFs value is to > > > > > listen for a sysfs event, not to add a device_lock() on the > > > > > attribute _show() in the kernel. > > > > The lock was not about detecting a change; Pierre did this: > > > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > > cat ${path}/device/sriov_numvfs; \ > > > > which I assume works by listening for sysfs events. > > It is not, "ip monitor ..." listens to netlink events emitted by > netdev core and not sysfs events. Sysfs events are not involved in > this case. Thanks for correcting my hasty assumption! > > The problem was that after the event occurred, the sriov_numvfs > > read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). > > Yes, and it is outcome of such cross-subsytem involvement, which > is racy by definition. Someone can come with even simpler example of why > locking sysfs read and write is not a good idea. > > For example, let's consider the following scenario with two CPUs and > locks on sysfs read and write: > > CPU1 CPU2 > echo 1 > ${path}/device/sriov_numvfs > context_switch -> > cat ${path}/device/sriov_numvfs > lock > return 0 > unlock > context_switch <- > lock > set 1 > unlock > > CPU1 CPU2 > echo 1 > ${path}/device/sriov_numvfs > lock > set 1 > unlock > context_switch -> > cat ${path}/device/sriov_numvfs > lock > return 1 > unlock > > So same scenario will return different values if user doesn't protect > such case with external to the kernel lock. > > But if we return back to Pierre report and if you want to provide > completely bullet proof solution to solve cross-subsystem interaction, > you will need to prohibit device probe till sriov_numvfs update is completed. > However, it is overkill for something that is not a real issue. Pierre wanted to detect the configuration change and learn the new num_vfs, which seems like a reasonable thing to do. Is there a way to do both via netlink or some other mechanism? > > So I would drop this sentence because I don't think it accurately > > reflects the reason for 35ff867b7657. > > > > > > Since you are reverting a commit that synchronizes SysFS read > > > > /write, please add some comments about why it is not an > > > > issue anymore. > > > > > > It was never an issue, the idea that sysfs read and write should be > > > serialized by kernel is not correct by definition. > > > > I think it *was* an issue. The behavior Pierre observed at was > > clearly wrong, > > I disagree with this sentence. > > > and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > > sriov_numvfs reads vs writes") to resolve it. > > > > We should try to avoid reintroducing the problem, so I think we should > > probably squash these two patches and describe it as a deadlock fix > > instead of dismissing 35ff867b7657 as being based on false premises. > > > > It would be awesome if you had time to verify that these patches also > > resolve the problem you saw, Pierre. > > They won't resolve his problem, because he is not listening to sysfs > events, but rely on something from netdev side. I guess that means that if we apply this revert, the problem Pierre reported will return. Obviously the deadlock is more important than the inconsistency Pierre observed, but from the user's point of view this will look like a regression. Maybe listening to netlink and then looking at sysfs isn't the "correct" way to do this, but I don't want to just casually break existing user code. If we do contemplate doing the revert, at the very least we should include specific details about what the user code *should* do instead, at the level of the actual commands to use instead of "ip monitor dev; cat ${path}/device/sriov_numvfs". Bjorn ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-13 15:59 ` Bjorn Helgaas @ 2024-02-13 17:46 ` Leon Romanovsky 2024-02-13 18:00 ` Kuppuswamy Sathyanarayanan 2024-02-13 19:45 ` Bjorn Helgaas 0 siblings, 2 replies; 24+ messages in thread From: Leon Romanovsky @ 2024-02-13 17:46 UTC (permalink / raw) To: Bjorn Helgaas Cc: Kuppuswamy Sathyanarayanan, Jim Harris, Bjorn Helgaas, linux-pci, linux-kernel, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Tue, Feb 13, 2024 at 09:59:54AM -0600, Bjorn Helgaas wrote: > On Tue, Feb 13, 2024 at 09:34:50AM +0200, Leon Romanovsky wrote: > > On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: > > > On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: > > > > On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: > > > > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > > > > If an SR-IOV enabled device is held by vfio, and the device > > > > > > is removed, vfio will hold device lock and notify userspace > > > > > > of the removal. If userspace reads the sriov_numvfs sysfs > > > > > > entry, that thread will be blocked since sriov_numvfs_show() > > > > > > also tries to acquire the device lock. If that same thread > > > > > > is responsible for releasing the device to vfio, it results > > > > > > in a deadlock. > > > > > > > > > > > > The proper way to detect a change to the num_VFs value is to > > > > > > listen for a sysfs event, not to add a device_lock() on the > > > > > > attribute _show() in the kernel. > > > > > > The lock was not about detecting a change; Pierre did this: > > > > > > ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ > > > cat ${path}/device/sriov_numvfs; \ > > > > > > which I assume works by listening for sysfs events. > > > > It is not, "ip monitor ..." listens to netlink events emitted by > > netdev core and not sysfs events. Sysfs events are not involved in > > this case. > > Thanks for correcting my hasty assumption! > > > > The problem was that after the event occurred, the sriov_numvfs > > > read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). > > > > Yes, and it is outcome of such cross-subsytem involvement, which > > is racy by definition. Someone can come with even simpler example of why > > locking sysfs read and write is not a good idea. > > > > For example, let's consider the following scenario with two CPUs and > > locks on sysfs read and write: > > > > CPU1 CPU2 > > echo 1 > ${path}/device/sriov_numvfs > > context_switch -> > > cat ${path}/device/sriov_numvfs > > lock > > return 0 > > unlock > > context_switch <- > > lock > > set 1 > > unlock > > > > CPU1 CPU2 > > echo 1 > ${path}/device/sriov_numvfs > > lock > > set 1 > > unlock > > context_switch -> > > cat ${path}/device/sriov_numvfs > > lock > > return 1 > > unlock > > > > So same scenario will return different values if user doesn't protect > > such case with external to the kernel lock. > > > > But if we return back to Pierre report and if you want to provide > > completely bullet proof solution to solve cross-subsystem interaction, > > you will need to prohibit device probe till sriov_numvfs update is completed. > > However, it is overkill for something that is not a real issue. > > Pierre wanted to detect the configuration change and learn the new > num_vfs, which seems like a reasonable thing to do. Is there a way to > do both via netlink or some other mechanism? Please pay attention that Pierre listened to specific netdevice and not to something general. After patch #2 in Jim's series, he will be able to rely on "udevadm monitor" instead of "ip monitor". > > > > So I would drop this sentence because I don't think it accurately > > > reflects the reason for 35ff867b7657. > > > > > > > > Since you are reverting a commit that synchronizes SysFS read > > > > > /write, please add some comments about why it is not an > > > > > issue anymore. > > > > > > > > It was never an issue, the idea that sysfs read and write should be > > > > serialized by kernel is not correct by definition. > > > > > > I think it *was* an issue. The behavior Pierre observed at was > > > clearly wrong, > > > > I disagree with this sentence. > > > > > and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs > > > sriov_numvfs reads vs writes") to resolve it. > > > > > > We should try to avoid reintroducing the problem, so I think we should > > > probably squash these two patches and describe it as a deadlock fix > > > instead of dismissing 35ff867b7657 as being based on false premises. > > > > > > It would be awesome if you had time to verify that these patches also > > > resolve the problem you saw, Pierre. > > > > They won't resolve his problem, because he is not listening to sysfs > > events, but rely on something from netdev side. > > I guess that means that if we apply this revert, the problem Pierre > reported will return. Obviously the deadlock is more important than > the inconsistency Pierre observed, but from the user's point of view > this will look like a regression. > > Maybe listening to netlink and then looking at sysfs isn't the > "correct" way to do this, but I don't want to just casually break > existing user code. If we do contemplate doing the revert, at the > very least we should include specific details about what the user code > *should* do instead, at the level of the actual commands to use > instead of "ip monitor dev; cat ${path}/device/sriov_numvfs". udevadm monitor will do the trick. Another possible solution is to refactor the code to make sure that .probe on VFs happens only after sriov_numvfs is updated. Thanks > > Bjorn > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-13 17:46 ` Leon Romanovsky @ 2024-02-13 18:00 ` Kuppuswamy Sathyanarayanan 2024-02-13 19:45 ` Bjorn Helgaas 1 sibling, 0 replies; 24+ messages in thread From: Kuppuswamy Sathyanarayanan @ 2024-02-13 18:00 UTC (permalink / raw) To: Leon Romanovsky, Bjorn Helgaas Cc: Jim Harris, Bjorn Helgaas, linux-pci, linux-kernel, Jason Gunthorpe, Alex Williamson, Pierre Crégut On 2/13/24 9:46 AM, Leon Romanovsky wrote: > On Tue, Feb 13, 2024 at 09:59:54AM -0600, Bjorn Helgaas wrote: >> On Tue, Feb 13, 2024 at 09:34:50AM +0200, Leon Romanovsky wrote: >>> On Mon, Feb 12, 2024 at 02:27:14PM -0600, Bjorn Helgaas wrote: >>>> On Sun, Feb 11, 2024 at 10:48:44AM +0200, Leon Romanovsky wrote: >>>>> On Fri, Feb 09, 2024 at 07:20:28PM -0800, Kuppuswamy Sathyanarayanan wrote: >>>>>> On 2/9/24 3:52 PM, Jim Harris wrote: >>>>>>> If an SR-IOV enabled device is held by vfio, and the device >>>>>>> is removed, vfio will hold device lock and notify userspace >>>>>>> of the removal. If userspace reads the sriov_numvfs sysfs >>>>>>> entry, that thread will be blocked since sriov_numvfs_show() >>>>>>> also tries to acquire the device lock. If that same thread >>>>>>> is responsible for releasing the device to vfio, it results >>>>>>> in a deadlock. >>>>>>> >>>>>>> The proper way to detect a change to the num_VFs value is to >>>>>>> listen for a sysfs event, not to add a device_lock() on the >>>>>>> attribute _show() in the kernel. >>>> The lock was not about detecting a change; Pierre did this: >>>> >>>> ip monitor dev ${DEVICE} | grep --line-buffered "^${id}:" | while read line; do \ >>>> cat ${path}/device/sriov_numvfs; \ >>>> >>>> which I assume works by listening for sysfs events. >>> It is not, "ip monitor ..." listens to netlink events emitted by >>> netdev core and not sysfs events. Sysfs events are not involved in >>> this case. >> Thanks for correcting my hasty assumption! >> >>>> The problem was that after the event occurred, the sriov_numvfs >>>> read got a stale value (see https://bugzilla.kernel.org/show_bug.cgi?id=202991). >>> Yes, and it is outcome of such cross-subsytem involvement, which >>> is racy by definition. Someone can come with even simpler example of why >>> locking sysfs read and write is not a good idea. >>> >>> For example, let's consider the following scenario with two CPUs and >>> locks on sysfs read and write: >>> >>> CPU1 CPU2 >>> echo 1 > ${path}/device/sriov_numvfs >>> context_switch -> >>> cat ${path}/device/sriov_numvfs >>> lock >>> return 0 >>> unlock >>> context_switch <- >>> lock >>> set 1 >>> unlock >>> >>> CPU1 CPU2 >>> echo 1 > ${path}/device/sriov_numvfs >>> lock >>> set 1 >>> unlock >>> context_switch -> >>> cat ${path}/device/sriov_numvfs >>> lock >>> return 1 >>> unlock >>> >>> So same scenario will return different values if user doesn't protect >>> such case with external to the kernel lock. >>> >>> But if we return back to Pierre report and if you want to provide >>> completely bullet proof solution to solve cross-subsystem interaction, >>> you will need to prohibit device probe till sriov_numvfs update is completed. >>> However, it is overkill for something that is not a real issue. >> Pierre wanted to detect the configuration change and learn the new >> num_vfs, which seems like a reasonable thing to do. Is there a way to >> do both via netlink or some other mechanism? > Please pay attention that Pierre listened to specific netdevice and not > to something general. After patch #2 in Jim's series, he will be able to > rely on "udevadm monitor" instead of "ip monitor". > >>>> So I would drop this sentence because I don't think it accurately >>>> reflects the reason for 35ff867b7657. >>>> >>>>>> Since you are reverting a commit that synchronizes SysFS read >>>>>> /write, please add some comments about why it is not an >>>>>> issue anymore. >>>>> It was never an issue, the idea that sysfs read and write should be >>>>> serialized by kernel is not correct by definition. >>>> I think it *was* an issue. The behavior Pierre observed at was >>>> clearly wrong, >>> I disagree with this sentence. >>> >>>> and we added 35ff867b7657 ("PCI/IOV: Serialize sysfs >>>> sriov_numvfs reads vs writes") to resolve it. >>>> >>>> We should try to avoid reintroducing the problem, so I think we should >>>> probably squash these two patches and describe it as a deadlock fix >>>> instead of dismissing 35ff867b7657 as being based on false premises. >>>> >>>> It would be awesome if you had time to verify that these patches also >>>> resolve the problem you saw, Pierre. >>> They won't resolve his problem, because he is not listening to sysfs >>> events, but rely on something from netdev side. >> I guess that means that if we apply this revert, the problem Pierre >> reported will return. Obviously the deadlock is more important than >> the inconsistency Pierre observed, but from the user's point of view >> this will look like a regression. >> >> Maybe listening to netlink and then looking at sysfs isn't the >> "correct" way to do this, but I don't want to just casually break >> existing user code. If we do contemplate doing the revert, at the >> very least we should include specific details about what the user code >> *should* do instead, at the level of the actual commands to use >> instead of "ip monitor dev; cat ${path}/device/sriov_numvfs". > udevadm monitor will do the trick. > > Another possible solution is to refactor the code to make sure that > .probe on VFs happens only after sriov_numvfs is updated. > > Thanks IMO, we can update the sriov_numvfs documentation to let users aware of the possible race condition between read/write, and also suggestion about using uevents for device changes. >> Bjorn >> -- Sathyanarayanan Kuppuswamy Linux Kernel Developer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-13 17:46 ` Leon Romanovsky 2024-02-13 18:00 ` Kuppuswamy Sathyanarayanan @ 2024-02-13 19:45 ` Bjorn Helgaas 2024-02-14 7:16 ` Leon Romanovsky 1 sibling, 1 reply; 24+ messages in thread From: Bjorn Helgaas @ 2024-02-13 19:45 UTC (permalink / raw) To: Leon Romanovsky Cc: Kuppuswamy Sathyanarayanan, Jim Harris, Bjorn Helgaas, linux-pci, linux-kernel, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Tue, Feb 13, 2024 at 07:46:02PM +0200, Leon Romanovsky wrote: > On Tue, Feb 13, 2024 at 09:59:54AM -0600, Bjorn Helgaas wrote: > ... > > I guess that means that if we apply this revert, the problem Pierre > > reported will return. Obviously the deadlock is more important than > > the inconsistency Pierre observed, but from the user's point of view > > this will look like a regression. > > > > Maybe listening to netlink and then looking at sysfs isn't the > > "correct" way to do this, but I don't want to just casually break > > existing user code. If we do contemplate doing the revert, at the > > very least we should include specific details about what the user code > > *should* do instead, at the level of the actual commands to use > > instead of "ip monitor dev; cat ${path}/device/sriov_numvfs". > > udevadm monitor will do the trick. > > Another possible solution is to refactor the code to make sure that > .probe on VFs happens only after sriov_numvfs is updated. I like the idea of refactoring it so as to preserve the existing ordering while also fixing the deadlock. If, in addition, we want to update the sriov_numvfs documentation to recommend "udevadm monitor" + /sys/.../sriov_numvfs, that's fine, too, but I don't really like the idea of merging a patch and forcing users to update their code. So I'll table the v2 series for now and watch for a v3 with the refactoring. Bjorn ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-13 19:45 ` Bjorn Helgaas @ 2024-02-14 7:16 ` Leon Romanovsky 2024-02-14 17:04 ` Jim Harris 0 siblings, 1 reply; 24+ messages in thread From: Leon Romanovsky @ 2024-02-14 7:16 UTC (permalink / raw) To: Bjorn Helgaas Cc: Kuppuswamy Sathyanarayanan, Jim Harris, Bjorn Helgaas, linux-pci, linux-kernel, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Tue, Feb 13, 2024 at 01:45:56PM -0600, Bjorn Helgaas wrote: > On Tue, Feb 13, 2024 at 07:46:02PM +0200, Leon Romanovsky wrote: > > On Tue, Feb 13, 2024 at 09:59:54AM -0600, Bjorn Helgaas wrote: > > ... > > > > I guess that means that if we apply this revert, the problem Pierre > > > reported will return. Obviously the deadlock is more important than > > > the inconsistency Pierre observed, but from the user's point of view > > > this will look like a regression. > > > > > > Maybe listening to netlink and then looking at sysfs isn't the > > > "correct" way to do this, but I don't want to just casually break > > > existing user code. If we do contemplate doing the revert, at the > > > very least we should include specific details about what the user code > > > *should* do instead, at the level of the actual commands to use > > > instead of "ip monitor dev; cat ${path}/device/sriov_numvfs". > > > > udevadm monitor will do the trick. > > > > Another possible solution is to refactor the code to make sure that > > .probe on VFs happens only after sriov_numvfs is updated. > > I like the idea of refactoring it so as to preserve the existing > ordering while also fixing the deadlock. I think something like this will be enough (not tested). It will et the number of VFs before we make VFs visible to probe: diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index aaa33e8dc4c9..0cdfaae80594 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -679,12 +679,14 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) msleep(100); pci_cfg_access_unlock(dev); + iov->num_VFs = nr_virtfn; rc = sriov_add_vfs(dev, initial); - if (rc) + if (rc) { + iov->num_VFs = 0; goto err_pcibios; + } kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); - iov->num_VFs = nr_virtfn; return 0; ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-14 7:16 ` Leon Romanovsky @ 2024-02-14 17:04 ` Jim Harris 2024-02-14 17:50 ` Bjorn Helgaas 0 siblings, 1 reply; 24+ messages in thread From: Jim Harris @ 2024-02-14 17:04 UTC (permalink / raw) To: Leon Romanovsky Cc: Bjorn Helgaas, Kuppuswamy Sathyanarayanan, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Wed, Feb 14, 2024 at 09:16:18AM +0200, Leon Romanovsky wrote: > On Tue, Feb 13, 2024 at 01:45:56PM -0600, Bjorn Helgaas wrote: > > On Tue, Feb 13, 2024 at 07:46:02PM +0200, Leon Romanovsky wrote: > > > On Tue, Feb 13, 2024 at 09:59:54AM -0600, Bjorn Helgaas wrote: > > > ... > > > > > > I guess that means that if we apply this revert, the problem Pierre > > > > reported will return. Obviously the deadlock is more important than > > > > the inconsistency Pierre observed, but from the user's point of view > > > > this will look like a regression. > > > > > > > > Maybe listening to netlink and then looking at sysfs isn't the > > > > "correct" way to do this, but I don't want to just casually break > > > > existing user code. If we do contemplate doing the revert, at the > > > > very least we should include specific details about what the user code > > > > *should* do instead, at the level of the actual commands to use > > > > instead of "ip monitor dev; cat ${path}/device/sriov_numvfs". > > > > > > udevadm monitor will do the trick. > > > > > > Another possible solution is to refactor the code to make sure that > > > .probe on VFs happens only after sriov_numvfs is updated. > > > > I like the idea of refactoring it so as to preserve the existing > > ordering while also fixing the deadlock. > > I think something like this will be enough (not tested). It will et the number of VFs > before we make VFs visible to probe: I'll push a v3, replacing the second patch with this one instead. Although based on this discussion it seems we're moving towards squashing the revert with Leon's suggested patch. Bjorn, I'll assume you're still OK with just squashing these on your end. I would like some input on how to actually test this though. Presumably we see some event on device PF and we want to make sure if we read PF/device/sriov_numvfs that we see the updated value. But the only type of event I think we can expect is the PF's sriov_numvfs CHANGE event. Is there any way for VFs to be created outside of writing to the sriov_numvfs sysfs file? My understanding is some older devices/drivers will auto-create VFs when the PF is initialized, but it wasn't clear from the bug report whether that was part of the configuration here. Pierre, do you have any recollection on this? Or maybe testing for this case just means compile and verify with udevadm monitor that we see the CHANGE event before any of the VFs are actually created... > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index aaa33e8dc4c9..0cdfaae80594 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -679,12 +679,14 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) > msleep(100); > pci_cfg_access_unlock(dev); > > + iov->num_VFs = nr_virtfn; > rc = sriov_add_vfs(dev, initial); > - if (rc) > + if (rc) { > + iov->num_VFs = 0; > goto err_pcibios; > + } > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > - iov->num_VFs = nr_virtfn; > > return 0; > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-14 17:04 ` Jim Harris @ 2024-02-14 17:50 ` Bjorn Helgaas 2024-02-14 22:55 ` Jim Harris 0 siblings, 1 reply; 24+ messages in thread From: Bjorn Helgaas @ 2024-02-14 17:50 UTC (permalink / raw) To: Jim Harris Cc: Leon Romanovsky, Kuppuswamy Sathyanarayanan, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Wed, Feb 14, 2024 at 05:04:08PM +0000, Jim Harris wrote: > On Wed, Feb 14, 2024 at 09:16:18AM +0200, Leon Romanovsky wrote: > > On Tue, Feb 13, 2024 at 01:45:56PM -0600, Bjorn Helgaas wrote: > > > On Tue, Feb 13, 2024 at 07:46:02PM +0200, Leon Romanovsky wrote: > > > > On Tue, Feb 13, 2024 at 09:59:54AM -0600, Bjorn Helgaas wrote: > > > > ... > > > > > > > > I guess that means that if we apply this revert, the problem Pierre > > > > > reported will return. Obviously the deadlock is more important than > > > > > the inconsistency Pierre observed, but from the user's point of view > > > > > this will look like a regression. > > > > > > > > > > Maybe listening to netlink and then looking at sysfs isn't the > > > > > "correct" way to do this, but I don't want to just casually break > > > > > existing user code. If we do contemplate doing the revert, at the > > > > > very least we should include specific details about what the user code > > > > > *should* do instead, at the level of the actual commands to use > > > > > instead of "ip monitor dev; cat ${path}/device/sriov_numvfs". > > > > > > > > udevadm monitor will do the trick. > > > > > > > > Another possible solution is to refactor the code to make sure that > > > > .probe on VFs happens only after sriov_numvfs is updated. > > > > > > I like the idea of refactoring it so as to preserve the existing > > > ordering while also fixing the deadlock. > > > > I think something like this will be enough (not tested). It will et the number of VFs > > before we make VFs visible to probe: > > I'll push a v3, replacing the second patch with this one instead. Although > based on this discussion it seems we're moving towards squashing the revert > with Leon's suggested patch. Bjorn, I'll assume you're still OK with just > squashing these on your end. Yep. > I would like some input on how to actually test this though. > Presumably we see some event on device PF and we want to make sure > if we read PF/device/sriov_numvfs that we see the updated value. But > the only type of event I think we can expect is the PF's > sriov_numvfs CHANGE event. > > Is there any way for VFs to be created outside of writing to the > sriov_numvfs sysfs file? My understanding is some older > devices/drivers will auto-create VFs when the PF is initialized, but > it wasn't clear from the bug report whether that was part of the > configuration here. Pierre, do you have any recollection on this? > > Or maybe testing for this case just means compile and verify with > udevadm monitor that we see the CHANGE event before any of the VFs > are actually created... I just want to make sure that Pierre's existing code continues to work unchanged. Ideally we could revert 35ff867b7657 ("PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes"), reproduce the problem with the shell script attached to https://bugzilla.kernel.org/show_bug.cgi?id=202991 (I assume Pierre used a /sys/.../sriov_numvfs write to trigger the change). Then we could verify that with 35ff867b7657 still reverted but the change below added, the problem is no longer reproducible. > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > index aaa33e8dc4c9..0cdfaae80594 100644 > > --- a/drivers/pci/iov.c > > +++ b/drivers/pci/iov.c > > @@ -679,12 +679,14 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) > > msleep(100); > > pci_cfg_access_unlock(dev); > > > > + iov->num_VFs = nr_virtfn; > > rc = sriov_add_vfs(dev, initial); > > - if (rc) > > + if (rc) { > > + iov->num_VFs = 0; > > goto err_pcibios; > > + } > > > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > > - iov->num_VFs = nr_virtfn; > > > > return 0; > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" 2024-02-14 17:50 ` Bjorn Helgaas @ 2024-02-14 22:55 ` Jim Harris 0 siblings, 0 replies; 24+ messages in thread From: Jim Harris @ 2024-02-14 22:55 UTC (permalink / raw) To: Bjorn Helgaas Cc: Leon Romanovsky, Kuppuswamy Sathyanarayanan, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson, Pierre Crégut On Wed, Feb 14, 2024 at 11:50:00AM -0600, Bjorn Helgaas wrote: > On Wed, Feb 14, 2024 at 05:04:08PM +0000, Jim Harris wrote: > > On Wed, Feb 14, 2024 at 09:16:18AM +0200, Leon Romanovsky wrote: > > > On Tue, Feb 13, 2024 at 01:45:56PM -0600, Bjorn Helgaas wrote: > > > > On Tue, Feb 13, 2024 at 07:46:02PM +0200, Leon Romanovsky wrote: > > > > > On Tue, Feb 13, 2024 at 09:59:54AM -0600, Bjorn Helgaas wrote: > > > > > ... > > > > > > > > > > I guess that means that if we apply this revert, the problem Pierre > > > > > > reported will return. Obviously the deadlock is more important than > > > > > > the inconsistency Pierre observed, but from the user's point of view > > > > > > this will look like a regression. > > > > > > > > > > > > Maybe listening to netlink and then looking at sysfs isn't the > > > > > > "correct" way to do this, but I don't want to just casually break > > > > > > existing user code. If we do contemplate doing the revert, at the > > > > > > very least we should include specific details about what the user code > > > > > > *should* do instead, at the level of the actual commands to use > > > > > > instead of "ip monitor dev; cat ${path}/device/sriov_numvfs". > > > > > > > > > > udevadm monitor will do the trick. > > > > > > > > > > Another possible solution is to refactor the code to make sure that > > > > > .probe on VFs happens only after sriov_numvfs is updated. > > > > > > > > I like the idea of refactoring it so as to preserve the existing > > > > ordering while also fixing the deadlock. > > > > > > I think something like this will be enough (not tested). It will et the number of VFs > > > before we make VFs visible to probe: > > > > I'll push a v3, replacing the second patch with this one instead. Although > > based on this discussion it seems we're moving towards squashing the revert > > with Leon's suggested patch. Bjorn, I'll assume you're still OK with just > > squashing these on your end. > > Yep. > > > I would like some input on how to actually test this though. > > Presumably we see some event on device PF and we want to make sure > > if we read PF/device/sriov_numvfs that we see the updated value. But > > the only type of event I think we can expect is the PF's > > sriov_numvfs CHANGE event. > > > > Is there any way for VFs to be created outside of writing to the > > sriov_numvfs sysfs file? My understanding is some older > > devices/drivers will auto-create VFs when the PF is initialized, but > > it wasn't clear from the bug report whether that was part of the > > configuration here. Pierre, do you have any recollection on this? > > > > Or maybe testing for this case just means compile and verify with > > udevadm monitor that we see the CHANGE event before any of the VFs > > are actually created... > > I just want to make sure that Pierre's existing code continues to work > unchanged. > > Ideally we could revert 35ff867b7657 ("PCI/IOV: Serialize sysfs > sriov_numvfs reads vs writes"), reproduce the problem with the shell > script attached to https://bugzilla.kernel.org/show_bug.cgi?id=202991 > (I assume Pierre used a /sys/.../sriov_numvfs write to trigger the > change). That shell script generates no output when writing to sriov_numvfs, so I'm unable to reproduce the problem. Terminal 1: # ip monitor dev ens7f0np0 Terminal 2: # echo 1 > /sys/class/net/ens7f0np0/device/sriov_numvfs # No output in terminal 1. I've done what testing I can with the proposed patch below, I'll send out the v3 series here shortly. > Then we could verify that with 35ff867b7657 still reverted but the > change below added, the problem is no longer reproducible. > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > index aaa33e8dc4c9..0cdfaae80594 100644 > > > --- a/drivers/pci/iov.c > > > +++ b/drivers/pci/iov.c > > > @@ -679,12 +679,14 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) > > > msleep(100); > > > pci_cfg_access_unlock(dev); > > > > > > + iov->num_VFs = nr_virtfn; > > > rc = sriov_add_vfs(dev, initial); > > > - if (rc) > > > + if (rc) { > > > + iov->num_VFs = 0; > > > goto err_pcibios; > > > + } > > > > > > kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > > > - iov->num_VFs = nr_virtfn; > > > > > > return 0; > > > ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v2 2/2] pci/iov: fix kobject_uevent() ordering in sriov_enable() 2024-02-09 23:52 ` [PATCH v2 0/2] PCI/IOV: sriov_numvfs bug fixes Jim Harris 2024-02-09 23:52 ` [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" Jim Harris @ 2024-02-09 23:52 ` Jim Harris 2024-02-10 3:22 ` Kuppuswamy Sathyanarayanan 1 sibling, 1 reply; 24+ messages in thread From: Jim Harris @ 2024-02-09 23:52 UTC (permalink / raw) To: Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Leon Romanovsky, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com Wait to call kobject_uevent() until all of the associated changes are done, including updating the num_VFs value. This avoids any possibility of userspace responding to the event and getting a stale value (although probably impossible to actually happen). Suggested-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jim Harris <jim.harris@samsung.com> --- drivers/pci/iov.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c index 0ca20cd518d5..c00b0f822526 100644 --- a/drivers/pci/iov.c +++ b/drivers/pci/iov.c @@ -677,8 +677,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) if (rc) goto err_pcibios; - kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); iov->num_VFs = nr_virtfn; + kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); return 0; ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v2 2/2] pci/iov: fix kobject_uevent() ordering in sriov_enable() 2024-02-09 23:52 ` [PATCH v2 2/2] pci/iov: fix kobject_uevent() ordering in sriov_enable() Jim Harris @ 2024-02-10 3:22 ` Kuppuswamy Sathyanarayanan 2024-02-12 15:17 ` Keith Busch 0 siblings, 1 reply; 24+ messages in thread From: Kuppuswamy Sathyanarayanan @ 2024-02-10 3:22 UTC (permalink / raw) To: Jim Harris, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Leon Romanovsky, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com On 2/9/24 3:52 PM, Jim Harris wrote: > Wait to call kobject_uevent() until all of the associated changes are done, > including updating the num_VFs value. This avoids any possibility of > userspace responding to the event and getting a stale value (although > probably impossible to actually happen). > > Suggested-by: Leon Romanovsky <leonro@nvidia.com> > Reviewed-by: Leon Romanovsky <leonro@nvidia.com> > Signed-off-by: Jim Harris <jim.harris@samsung.com> > --- It looks fine to me. Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> > drivers/pci/iov.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > index 0ca20cd518d5..c00b0f822526 100644 > --- a/drivers/pci/iov.c > +++ b/drivers/pci/iov.c > @@ -677,8 +677,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) > if (rc) > goto err_pcibios; > > - kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > iov->num_VFs = nr_virtfn; > + kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > > return 0; > > -- Sathyanarayanan Kuppuswamy Linux Kernel Developer ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 2/2] pci/iov: fix kobject_uevent() ordering in sriov_enable() 2024-02-10 3:22 ` Kuppuswamy Sathyanarayanan @ 2024-02-12 15:17 ` Keith Busch 2024-02-12 15:29 ` Leon Romanovsky 0 siblings, 1 reply; 24+ messages in thread From: Keith Busch @ 2024-02-12 15:17 UTC (permalink / raw) To: Kuppuswamy Sathyanarayanan Cc: Jim Harris, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Leon Romanovsky, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com On Fri, Feb 09, 2024 at 07:22:17PM -0800, Kuppuswamy Sathyanarayanan wrote: > On 2/9/24 3:52 PM, Jim Harris wrote: > > @@ -677,8 +677,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) > > if (rc) > > goto err_pcibios; > > > > - kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > > iov->num_VFs = nr_virtfn; > > + kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); Since it's accessed unlocked now, I *think* this wants appropriate barriers to ensure the order is observed the same on all CPUs. Something like 'smp_store_release(&iov->num_VFs, nr_virtfn)' for writing it, and use 'smp_load_acquire()' on the read-side. ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v2 2/2] pci/iov: fix kobject_uevent() ordering in sriov_enable() 2024-02-12 15:17 ` Keith Busch @ 2024-02-12 15:29 ` Leon Romanovsky 0 siblings, 0 replies; 24+ messages in thread From: Leon Romanovsky @ 2024-02-12 15:29 UTC (permalink / raw) To: Keith Busch Cc: Kuppuswamy Sathyanarayanan, Jim Harris, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Jason Gunthorpe, Alex Williamson, pierre.cregut@orange.com On Mon, Feb 12, 2024 at 08:17:58AM -0700, Keith Busch wrote: > On Fri, Feb 09, 2024 at 07:22:17PM -0800, Kuppuswamy Sathyanarayanan wrote: > > On 2/9/24 3:52 PM, Jim Harris wrote: > > > @@ -677,8 +677,8 @@ static int sriov_enable(struct pci_dev *dev, int nr_virtfn) > > > if (rc) > > > goto err_pcibios; > > > > > > - kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > > > iov->num_VFs = nr_virtfn; > > > + kobject_uevent(&dev->dev.kobj, KOBJ_CHANGE); > > Since it's accessed unlocked now, I *think* this wants appropriate > barriers to ensure the order is observed the same on all CPUs. Something > like 'smp_store_release(&iov->num_VFs, nr_virtfn)' for writing it, and > use 'smp_load_acquire()' on the read-side. It is unlocked only for sysfs read. IMHO it is overkill to use stores for this case. Thanks > ^ permalink raw reply [flat|nested] 24+ messages in thread
end of thread, other threads:[~2024-02-14 22:55 UTC | newest]
Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <CGME20240209235208uscas1p26c658c64cc85711cd3aa6312224164fc@uscas1p2.samsung.com>
2024-02-09 23:52 ` [PATCH v2 0/2] PCI/IOV: sriov_numvfs bug fixes Jim Harris
2024-02-09 23:52 ` [PATCH v2 1/2] PCI/IOV: Revert "PCI/IOV: Serialize sysfs sriov_numvfs reads vs writes" Jim Harris
2024-02-10 3:20 ` Kuppuswamy Sathyanarayanan
2024-02-11 8:48 ` Leon Romanovsky
2024-02-11 19:15 ` Kuppuswamy Sathyanarayanan
2024-02-12 9:31 ` Leon Romanovsky
2024-02-12 20:27 ` Bjorn Helgaas
2024-02-12 22:59 ` Jim Harris
2024-02-13 7:37 ` Leon Romanovsky
2024-02-13 9:40 ` pierre.cregut
2024-02-13 14:59 ` Jason Gunthorpe
2024-02-13 7:34 ` Leon Romanovsky
2024-02-13 15:59 ` Bjorn Helgaas
2024-02-13 17:46 ` Leon Romanovsky
2024-02-13 18:00 ` Kuppuswamy Sathyanarayanan
2024-02-13 19:45 ` Bjorn Helgaas
2024-02-14 7:16 ` Leon Romanovsky
2024-02-14 17:04 ` Jim Harris
2024-02-14 17:50 ` Bjorn Helgaas
2024-02-14 22:55 ` Jim Harris
2024-02-09 23:52 ` [PATCH v2 2/2] pci/iov: fix kobject_uevent() ordering in sriov_enable() Jim Harris
2024-02-10 3:22 ` Kuppuswamy Sathyanarayanan
2024-02-12 15:17 ` Keith Busch
2024-02-12 15:29 ` Leon Romanovsky
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox