* [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management @ 2023-08-03 17:12 Alex Williamson 2023-08-03 17:12 ` [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface Alex Williamson ` (2 more replies) 0 siblings, 3 replies; 18+ messages in thread From: Alex Williamson @ 2023-08-03 17:12 UTC (permalink / raw) To: bhelgaas; +Cc: Alex Williamson, linux-pci, linux-kernel, eric.auger Since v5.19, vfio-pci makes use of runtime power management on devices. This has the effect of potentially putting entire sub-hierarchies into lower power states, which has exposed some gaps in the PCI subsystem around power management support. The first issue is that lspci accesses the VPD sysfs interface, which does not provide the same power management wrappers as general config space. The next covers PME, where we attempt to skip devices based on their PCI power state, but don't protect changes to that state or look at the overall runtime power management state of the device. This latter patch addresses the issue noted by Eric in the follow-ups to v1 linked below. These patches are logically independent, but only together resolve an issue on Eric's system where a pair of endpoints bound to vfio-pci and unused by userspace drivers trigger faults through lspci and PME polling. Thanks, Alex v1: https://lore.kernel.org/all/20230707151044.1311544-1-alex.williamson@redhat.com/ Alex Williamson (2): PCI/VPD: Add runtime power management to sysfs interface PCI: Fix runtime PM race with PME polling drivers/pci/pci.c | 23 ++++++++++++++++------- drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++-- 2 files changed, 48 insertions(+), 9 deletions(-) -- 2.40.1 ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface 2023-08-03 17:12 [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Alex Williamson @ 2023-08-03 17:12 ` Alex Williamson 2023-08-10 15:59 ` Bjorn Helgaas 2023-08-11 19:25 ` Bjorn Helgaas 2023-08-03 17:12 ` [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling Alex Williamson 2023-08-10 18:29 ` [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Bjorn Helgaas 2 siblings, 2 replies; 18+ messages in thread From: Alex Williamson @ 2023-08-03 17:12 UTC (permalink / raw) To: bhelgaas; +Cc: Alex Williamson, linux-pci, linux-kernel, eric.auger Unlike default access to config space through sysfs, the vpd read and write function don't actively manage the runtime power management state of the device during access. Since commit 7ab5e10eda02 ("vfio/pci: Move the unused device into low power state with runtime PM"), the vfio-pci driver will use runtime power management and release unused devices to make use of low power states. Attempting to access VPD information in this low power state can result in incorrect information or kernel crashes depending on the system behavior. Wrap the vpd read/write bin attribute handlers in runtime PM and take into account the potential quirk to select the correct device to wake. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> --- drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c index a4fc4d0690fe..81217dd4789f 100644 --- a/drivers/pci/vpd.c +++ b/drivers/pci/vpd.c @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj, size_t count) { struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); + struct pci_dev *vpd_dev = dev; + ssize_t ret; + + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { + vpd_dev = pci_get_func0_dev(dev); + if (!vpd_dev) + return -ENODEV; + } + + pci_config_pm_runtime_get(vpd_dev); + ret = pci_read_vpd(vpd_dev, off, count, buf); + pci_config_pm_runtime_put(vpd_dev); + + if (dev != vpd_dev) + pci_dev_put(vpd_dev); - return pci_read_vpd(dev, off, count, buf); + return ret; } static ssize_t vpd_write(struct file *filp, struct kobject *kobj, @@ -284,8 +299,23 @@ static ssize_t vpd_write(struct file *filp, struct kobject *kobj, size_t count) { struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); + struct pci_dev *vpd_dev = dev; + ssize_t ret; + + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { + vpd_dev = pci_get_func0_dev(dev); + if (!vpd_dev) + return -ENODEV; + } + + pci_config_pm_runtime_get(vpd_dev); + ret = pci_write_vpd(vpd_dev, off, count, buf); + pci_config_pm_runtime_put(vpd_dev); + + if (dev != vpd_dev) + pci_dev_put(vpd_dev); - return pci_write_vpd(dev, off, count, buf); + return ret; } static BIN_ATTR(vpd, 0600, vpd_read, vpd_write, 0); -- 2.40.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface 2023-08-03 17:12 ` [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface Alex Williamson @ 2023-08-10 15:59 ` Bjorn Helgaas 2023-08-10 16:26 ` Alex Williamson 2023-08-11 19:25 ` Bjorn Helgaas 1 sibling, 1 reply; 18+ messages in thread From: Bjorn Helgaas @ 2023-08-10 15:59 UTC (permalink / raw) To: Alex Williamson; +Cc: bhelgaas, linux-pci, linux-kernel, eric.auger On Thu, Aug 03, 2023 at 11:12:32AM -0600, Alex Williamson wrote: > Unlike default access to config space through sysfs, the vpd read and > write function don't actively manage the runtime power management state > of the device during access. Since commit 7ab5e10eda02 ("vfio/pci: Move > the unused device into low power state with runtime PM"), the vfio-pci > driver will use runtime power management and release unused devices to > make use of low power states. Attempting to access VPD information in > this low power state can result in incorrect information or kernel > crashes depending on the system behavior. > > Wrap the vpd read/write bin attribute handlers in runtime PM and take > into account the potential quirk to select the correct device to wake. > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++-- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c > index a4fc4d0690fe..81217dd4789f 100644 > --- a/drivers/pci/vpd.c > +++ b/drivers/pci/vpd.c > @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj, > size_t count) > { > struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); > + struct pci_dev *vpd_dev = dev; > + ssize_t ret; > + > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { > + vpd_dev = pci_get_func0_dev(dev); > + if (!vpd_dev) > + return -ENODEV; > + } > + > + pci_config_pm_runtime_get(vpd_dev); > + ret = pci_read_vpd(vpd_dev, off, count, buf); > + pci_config_pm_runtime_put(vpd_dev); > + > + if (dev != vpd_dev) > + pci_dev_put(vpd_dev); I first thought this would leak a reference if dev was func0 and had PCI_DEV_FLAGS_VPD_REF_F0 set, because in that case vpd_dev would be the same as dev. But I think that case can't happen because quirk_f0_vpd_link() does nothing for func0 devices, so PCI_DEV_FLAGS_VPD_REF_F0 should never be set for func0. But it seems like this might be easier to analyze as: if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) pci_dev_put(vpd_dev); Or am I missing something? > - return pci_read_vpd(dev, off, count, buf); > + return ret; > } > > static ssize_t vpd_write(struct file *filp, struct kobject *kobj, > @@ -284,8 +299,23 @@ static ssize_t vpd_write(struct file *filp, struct kobject *kobj, > size_t count) > { > struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); > + struct pci_dev *vpd_dev = dev; > + ssize_t ret; > + > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { > + vpd_dev = pci_get_func0_dev(dev); > + if (!vpd_dev) > + return -ENODEV; > + } > + > + pci_config_pm_runtime_get(vpd_dev); > + ret = pci_write_vpd(vpd_dev, off, count, buf); > + pci_config_pm_runtime_put(vpd_dev); > + > + if (dev != vpd_dev) > + pci_dev_put(vpd_dev); > > - return pci_write_vpd(dev, off, count, buf); > + return ret; > } > static BIN_ATTR(vpd, 0600, vpd_read, vpd_write, 0); > > -- > 2.40.1 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface 2023-08-10 15:59 ` Bjorn Helgaas @ 2023-08-10 16:26 ` Alex Williamson 0 siblings, 0 replies; 18+ messages in thread From: Alex Williamson @ 2023-08-10 16:26 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: bhelgaas, linux-pci, linux-kernel, eric.auger On Thu, 10 Aug 2023 10:59:26 -0500 Bjorn Helgaas <helgaas@kernel.org> wrote: > On Thu, Aug 03, 2023 at 11:12:32AM -0600, Alex Williamson wrote: > > Unlike default access to config space through sysfs, the vpd read and > > write function don't actively manage the runtime power management state > > of the device during access. Since commit 7ab5e10eda02 ("vfio/pci: Move > > the unused device into low power state with runtime PM"), the vfio-pci > > driver will use runtime power management and release unused devices to > > make use of low power states. Attempting to access VPD information in > > this low power state can result in incorrect information or kernel > > crashes depending on the system behavior. > > > > Wrap the vpd read/write bin attribute handlers in runtime PM and take > > into account the potential quirk to select the correct device to wake. > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > --- > > drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++-- > > 1 file changed, 32 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c > > index a4fc4d0690fe..81217dd4789f 100644 > > --- a/drivers/pci/vpd.c > > +++ b/drivers/pci/vpd.c > > @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj, > > size_t count) > > { > > struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); > > + struct pci_dev *vpd_dev = dev; > > + ssize_t ret; > > + > > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { > > + vpd_dev = pci_get_func0_dev(dev); > > + if (!vpd_dev) > > + return -ENODEV; > > + } > > + > > + pci_config_pm_runtime_get(vpd_dev); > > + ret = pci_read_vpd(vpd_dev, off, count, buf); > > + pci_config_pm_runtime_put(vpd_dev); > > + > > + if (dev != vpd_dev) > > + pci_dev_put(vpd_dev); > > I first thought this would leak a reference if dev was func0 and had > PCI_DEV_FLAGS_VPD_REF_F0 set, because in that case vpd_dev would be > the same as dev. > > But I think that case can't happen because quirk_f0_vpd_link() does > nothing for func0 devices, so PCI_DEV_FLAGS_VPD_REF_F0 should never be > set for func0. But it seems like this might be easier to analyze as: > > if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) > pci_dev_put(vpd_dev); > > Or am I missing something? Nope, your analysis is correct, it doesn't make any sense to have a flag on func0 redirecting VPD access to func0 so vpd_dev can only be different on non-zero functions. The alternative test is equally valid so if you think it's more intuitive, let's use it. Thanks, Alex ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface 2023-08-03 17:12 ` [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface Alex Williamson 2023-08-10 15:59 ` Bjorn Helgaas @ 2023-08-11 19:25 ` Bjorn Helgaas 2023-08-11 19:56 ` Alex Williamson 1 sibling, 1 reply; 18+ messages in thread From: Bjorn Helgaas @ 2023-08-11 19:25 UTC (permalink / raw) To: Alex Williamson; +Cc: bhelgaas, linux-pci, linux-kernel, eric.auger On Thu, Aug 03, 2023 at 11:12:32AM -0600, Alex Williamson wrote: > Unlike default access to config space through sysfs, the vpd read and > write function don't actively manage the runtime power management state > of the device during access. Since commit 7ab5e10eda02 ("vfio/pci: Move > the unused device into low power state with runtime PM"), the vfio-pci > driver will use runtime power management and release unused devices to > make use of low power states. Attempting to access VPD information in > this low power state can result in incorrect information or kernel > crashes depending on the system behavior. I guess this specifically refers to D3cold, right? VPD is accessed via config space, which should work in all D0-D3hot states, but not in D3cold. I don't see anything in the spec about needing to be in D0 to access VPD. I assume there's no public problem report we could cite here? I suppose the behavior in D3cold is however the system handles a UR error. > Wrap the vpd read/write bin attribute handlers in runtime PM and take > into account the potential quirk to select the correct device to wake. > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++-- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c > index a4fc4d0690fe..81217dd4789f 100644 > --- a/drivers/pci/vpd.c > +++ b/drivers/pci/vpd.c > @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj, > size_t count) > { > struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); > + struct pci_dev *vpd_dev = dev; > + ssize_t ret; > + > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { > + vpd_dev = pci_get_func0_dev(dev); > + if (!vpd_dev) > + return -ENODEV; > + } > + > + pci_config_pm_runtime_get(vpd_dev); > + ret = pci_read_vpd(vpd_dev, off, count, buf); > + pci_config_pm_runtime_put(vpd_dev); > + > + if (dev != vpd_dev) > + pci_dev_put(vpd_dev); > > - return pci_read_vpd(dev, off, count, buf); > + return ret; > } > > static ssize_t vpd_write(struct file *filp, struct kobject *kobj, > @@ -284,8 +299,23 @@ static ssize_t vpd_write(struct file *filp, struct kobject *kobj, > size_t count) > { > struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); > + struct pci_dev *vpd_dev = dev; > + ssize_t ret; > + > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { > + vpd_dev = pci_get_func0_dev(dev); > + if (!vpd_dev) > + return -ENODEV; > + } > + > + pci_config_pm_runtime_get(vpd_dev); > + ret = pci_write_vpd(vpd_dev, off, count, buf); > + pci_config_pm_runtime_put(vpd_dev); > + > + if (dev != vpd_dev) > + pci_dev_put(vpd_dev); > > - return pci_write_vpd(dev, off, count, buf); > + return ret; > } > static BIN_ATTR(vpd, 0600, vpd_read, vpd_write, 0); > > -- > 2.40.1 > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface 2023-08-11 19:25 ` Bjorn Helgaas @ 2023-08-11 19:56 ` Alex Williamson 0 siblings, 0 replies; 18+ messages in thread From: Alex Williamson @ 2023-08-11 19:56 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: bhelgaas, linux-pci, linux-kernel, eric.auger On Fri, 11 Aug 2023 14:25:43 -0500 Bjorn Helgaas <helgaas@kernel.org> wrote: > On Thu, Aug 03, 2023 at 11:12:32AM -0600, Alex Williamson wrote: > > Unlike default access to config space through sysfs, the vpd read and > > write function don't actively manage the runtime power management state > > of the device during access. Since commit 7ab5e10eda02 ("vfio/pci: Move > > the unused device into low power state with runtime PM"), the vfio-pci > > driver will use runtime power management and release unused devices to > > make use of low power states. Attempting to access VPD information in > > this low power state can result in incorrect information or kernel > > crashes depending on the system behavior. > > I guess this specifically refers to D3cold, right? VPD is accessed > via config space, which should work in all D0-D3hot states, but not in > D3cold. I don't see anything in the spec about needing to be in D0 to > access VPD. > > I assume there's no public problem report we could cite here? I > suppose the behavior in D3cold is however the system handles a UR > error. Yes, Eric tested that pcie_port_pm=off is a viable workaround resolving both the VPD and PME accesses, so I think the issue is actually that the root port is in D3cold. This aligns with commit 7ab5e10eda02 above, since prior to that we were only manipulating the endpoint power state. The oops indicates an "Internal error: synchronous external abort", with a stack trace ending in pci_generic_config_read, so I suspect this is a UR. Unfortunately the bz is not currently public :-\ Thanks, Alex > > Wrap the vpd read/write bin attribute handlers in runtime PM and take > > into account the potential quirk to select the correct device to wake. > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > --- > > drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++-- > > 1 file changed, 32 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/pci/vpd.c b/drivers/pci/vpd.c > > index a4fc4d0690fe..81217dd4789f 100644 > > --- a/drivers/pci/vpd.c > > +++ b/drivers/pci/vpd.c > > @@ -275,8 +275,23 @@ static ssize_t vpd_read(struct file *filp, struct kobject *kobj, > > size_t count) > > { > > struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); > > + struct pci_dev *vpd_dev = dev; > > + ssize_t ret; > > + > > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { > > + vpd_dev = pci_get_func0_dev(dev); > > + if (!vpd_dev) > > + return -ENODEV; > > + } > > + > > + pci_config_pm_runtime_get(vpd_dev); > > + ret = pci_read_vpd(vpd_dev, off, count, buf); > > + pci_config_pm_runtime_put(vpd_dev); > > + > > + if (dev != vpd_dev) > > + pci_dev_put(vpd_dev); > > > > - return pci_read_vpd(dev, off, count, buf); > > + return ret; > > } > > > > static ssize_t vpd_write(struct file *filp, struct kobject *kobj, > > @@ -284,8 +299,23 @@ static ssize_t vpd_write(struct file *filp, struct kobject *kobj, > > size_t count) > > { > > struct pci_dev *dev = to_pci_dev(kobj_to_dev(kobj)); > > + struct pci_dev *vpd_dev = dev; > > + ssize_t ret; > > + > > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) { > > + vpd_dev = pci_get_func0_dev(dev); > > + if (!vpd_dev) > > + return -ENODEV; > > + } > > + > > + pci_config_pm_runtime_get(vpd_dev); > > + ret = pci_write_vpd(vpd_dev, off, count, buf); > > + pci_config_pm_runtime_put(vpd_dev); > > + > > + if (dev != vpd_dev) > > + pci_dev_put(vpd_dev); > > > > - return pci_write_vpd(dev, off, count, buf); > > + return ret; > > } > > static BIN_ATTR(vpd, 0600, vpd_read, vpd_write, 0); > > > > -- > > 2.40.1 > > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2023-08-03 17:12 [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Alex Williamson 2023-08-03 17:12 ` [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface Alex Williamson @ 2023-08-03 17:12 ` Alex Williamson 2024-01-18 18:50 ` Alex Williamson 2023-08-10 18:29 ` [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Bjorn Helgaas 2 siblings, 1 reply; 18+ messages in thread From: Alex Williamson @ 2023-08-03 17:12 UTC (permalink / raw) To: bhelgaas; +Cc: Alex Williamson, linux-pci, linux-kernel, eric.auger Testing that a device is not currently in a low power state provides no guarantees that the device is not immenently transitioning to such a state. We need to increment the PM usage counter before accessing the device. Since we don't wish to wake the device for PME polling, do so only if the device is already active by using pm_runtime_get_if_active(). Signed-off-by: Alex Williamson <alex.williamson@redhat.com> --- drivers/pci/pci.c | 23 ++++++++++++++++------- 1 file changed, 16 insertions(+), 7 deletions(-) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 60230da957e0..bc266f290b2c 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -2415,10 +2415,13 @@ static void pci_pme_list_scan(struct work_struct *work) mutex_lock(&pci_pme_list_mutex); list_for_each_entry_safe(pme_dev, n, &pci_pme_list, list) { - if (pme_dev->dev->pme_poll) { - struct pci_dev *bridge; + struct pci_dev *pdev = pme_dev->dev; + + if (pdev->pme_poll) { + struct pci_dev *bridge = pdev->bus->self; + struct device *dev = &pdev->dev; + int pm_status; - bridge = pme_dev->dev->bus->self; /* * If bridge is in low power state, the * configuration space of subordinate devices @@ -2426,14 +2429,20 @@ static void pci_pme_list_scan(struct work_struct *work) */ if (bridge && bridge->current_state != PCI_D0) continue; + /* - * If the device is in D3cold it should not be - * polled either. + * If the device is in a low power state it + * should not be polled either. */ - if (pme_dev->dev->current_state == PCI_D3cold) + pm_status = pm_runtime_get_if_active(dev, true); + if (!pm_status) continue; - pci_pme_wakeup(pme_dev->dev, NULL); + if (pdev->current_state != PCI_D3cold) + pci_pme_wakeup(pdev, NULL); + + if (pm_status > 0) + pm_runtime_put(dev); } else { list_del(&pme_dev->list); kfree(pme_dev); -- 2.40.1 ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2023-08-03 17:12 ` [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling Alex Williamson @ 2024-01-18 18:50 ` Alex Williamson 2024-01-22 22:17 ` Lukas Wunner 0 siblings, 1 reply; 18+ messages in thread From: Alex Williamson @ 2024-01-18 18:50 UTC (permalink / raw) To: linux-pci Cc: bhelgaas, linux-kernel, eric.auger, mika.westerberg, lukas, rafael.j.wysocki, Sanath.S On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > Testing that a device is not currently in a low power state provides no > guarantees that the device is not immenently transitioning to such a state. > We need to increment the PM usage counter before accessing the device. > Since we don't wish to wake the device for PME polling, do so only if the > device is already active by using pm_runtime_get_if_active(). > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > --- > drivers/pci/pci.c | 23 ++++++++++++++++------- > 1 file changed, 16 insertions(+), 7 deletions(-) Hey folks, Resurrecting this patch (currently commit d3fcd7360338) for discussion as it's been identified as the source of a regression in: https://bugzilla.kernel.org/show_bug.cgi?id=218360 Copying Mika, Lukas, and Rafael as it's related to: 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold") where we skip devices in D3cold when processing the PME list. I think the issue in the above bz is that the downstream TB3/USB4 port is in D3 (presumably D3hot) and I therefore infer the device is in state RPM_SUSPENDED. This commit is attempting to make sure the device power state is stable across the call such that it does not transition into D3cold while we're accessing it. To do that I used pm_runtime_get_if_active(), but in retrospect this requires the device to be in RPM_ACTIVE so we end up skipping anything suspended or transitioning. As reported in the above bz, I tried replacing this with: pm_runtime_get_noresume(dev); pm_runtime_barrier(dev); The theory here being that the barrier would wait for any transitioning states such that as far as runtime power management is concerned, the device power state is stable. This causes live locks where the barrier never returns. Instead I'm considering that since we're polling the PME list, maybe we could just defer devices in transition states, for instance something that looks like pm_runtime_get_if_active(), but would return zero if the device was in RPM_SUSPENDING or RPM_RESUMING rather than requiring RPM_ACTIVE. I'm not an expert in PME or runtime power management though, so I'm looking for advice. Thanks, Alex > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 60230da957e0..bc266f290b2c 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -2415,10 +2415,13 @@ static void pci_pme_list_scan(struct work_struct *work) > > mutex_lock(&pci_pme_list_mutex); > list_for_each_entry_safe(pme_dev, n, &pci_pme_list, list) { > - if (pme_dev->dev->pme_poll) { > - struct pci_dev *bridge; > + struct pci_dev *pdev = pme_dev->dev; > + > + if (pdev->pme_poll) { > + struct pci_dev *bridge = pdev->bus->self; > + struct device *dev = &pdev->dev; > + int pm_status; > > - bridge = pme_dev->dev->bus->self; > /* > * If bridge is in low power state, the > * configuration space of subordinate devices > @@ -2426,14 +2429,20 @@ static void pci_pme_list_scan(struct work_struct *work) > */ > if (bridge && bridge->current_state != PCI_D0) > continue; > + > /* > - * If the device is in D3cold it should not be > - * polled either. > + * If the device is in a low power state it > + * should not be polled either. > */ > - if (pme_dev->dev->current_state == PCI_D3cold) > + pm_status = pm_runtime_get_if_active(dev, true); > + if (!pm_status) > continue; > > - pci_pme_wakeup(pme_dev->dev, NULL); > + if (pdev->current_state != PCI_D3cold) > + pci_pme_wakeup(pdev, NULL); > + > + if (pm_status > 0) > + pm_runtime_put(dev); > } else { > list_del(&pme_dev->list); > kfree(pme_dev); ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2024-01-18 18:50 ` Alex Williamson @ 2024-01-22 22:17 ` Lukas Wunner 2024-01-22 22:50 ` Alex Williamson 0 siblings, 1 reply; 18+ messages in thread From: Lukas Wunner @ 2024-01-22 22:17 UTC (permalink / raw) To: Alex Williamson Cc: linux-pci, bhelgaas, linux-kernel, eric.auger, mika.westerberg, rafael.j.wysocki, Sanath.S On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote: > On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com wrote: > > Testing that a device is not currently in a low power state provides no > > guarantees that the device is not immenently transitioning to such a state. > > We need to increment the PM usage counter before accessing the device. > > Since we don't wish to wake the device for PME polling, do so only if the > > device is already active by using pm_runtime_get_if_active(). > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > --- > > drivers/pci/pci.c | 23 ++++++++++++++++------- > > 1 file changed, 16 insertions(+), 7 deletions(-) > > Resurrecting this patch (currently commit d3fcd7360338) for discussion > as it's been identified as the source of a regression in: > > https://bugzilla.kernel.org/show_bug.cgi?id=218360 > > Copying Mika, Lukas, and Rafael as it's related to: > > 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold") > > where we skip devices in D3cold when processing the PME list. > > I think the issue in the above bz is that the downstream TB3/USB4 port > is in D3 (presumably D3hot) and I therefore infer the device is in state > RPM_SUSPENDED. This commit is attempting to make sure the device power > state is stable across the call such that it does not transition into > D3cold while we're accessing it. > > To do that I used pm_runtime_get_if_active(), but in retrospect this > requires the device to be in RPM_ACTIVE so we end up skipping anything > suspended or transitioning. How about dropping the calls to pm_runtime_get_if_active() and pm_runtime_put() and instead simply do: if (pm_runtime_suspended(&pdev->dev) && pdev->current_state != PCI_D3cold) pci_pme_wakeup(pdev, NULL); Thanks, Lukas ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2024-01-22 22:17 ` Lukas Wunner @ 2024-01-22 22:50 ` Alex Williamson 2024-01-23 4:46 ` Alex Williamson 2024-01-23 10:45 ` Lukas Wunner 0 siblings, 2 replies; 18+ messages in thread From: Alex Williamson @ 2024-01-22 22:50 UTC (permalink / raw) To: Lukas Wunner Cc: linux-pci, bhelgaas, linux-kernel, eric.auger, mika.westerberg, rafael.j.wysocki, Sanath.S On Mon, 22 Jan 2024 23:17:30 +0100 Lukas Wunner <lukas@wunner.de> wrote: > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote: > > On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com wrote: > > > Testing that a device is not currently in a low power state provides no > > > guarantees that the device is not immenently transitioning to such a state. > > > We need to increment the PM usage counter before accessing the device. > > > Since we don't wish to wake the device for PME polling, do so only if the > > > device is already active by using pm_runtime_get_if_active(). > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > > --- > > > drivers/pci/pci.c | 23 ++++++++++++++++------- > > > 1 file changed, 16 insertions(+), 7 deletions(-) > > > > Resurrecting this patch (currently commit d3fcd7360338) for discussion > > as it's been identified as the source of a regression in: > > > > https://bugzilla.kernel.org/show_bug.cgi?id=218360 > > > > Copying Mika, Lukas, and Rafael as it's related to: > > > > 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold") > > > > where we skip devices in D3cold when processing the PME list. > > > > I think the issue in the above bz is that the downstream TB3/USB4 port > > is in D3 (presumably D3hot) and I therefore infer the device is in state > > RPM_SUSPENDED. This commit is attempting to make sure the device power > > state is stable across the call such that it does not transition into > > D3cold while we're accessing it. > > > > To do that I used pm_runtime_get_if_active(), but in retrospect this > > requires the device to be in RPM_ACTIVE so we end up skipping anything > > suspended or transitioning. > > How about dropping the calls to pm_runtime_get_if_active() and > pm_runtime_put() and instead simply do: > > if (pm_runtime_suspended(&pdev->dev) && > pdev->current_state != PCI_D3cold) > pci_pme_wakeup(pdev, NULL); Hi Lukas, Do we require that the polled device is in the RPM_SUSPENDED state? Also pm_runtime_suspended() can also only be trusted while holding the device power.lock, we need a usage count reference to maintain that state. I'm also seeing cases where the bridge is power state D0, but PM state RPM_SUSPENDING, so config space of the polled device becomes inaccessible even while we're holding a reference once we allow polling in RPM_SUSPENDED. I'm currently working with the below patch, which I believe addresses all these issues, but I'd welcome review and testing. Thanks, Alex commit 0a063b8e91d0bc807db712c81c8b270864f99ecb Author: Alex Williamson <alex.williamson@redhat.com> Date: Tue Jan 16 13:28:33 2024 -0700 PCI: Fix active state requirement in PME polling The commit noted in fixes added a bogus requirement that runtime PM managed devices need to be in the RPM_ACTIVE state for PME polling. In fact, there is no requirement of a specific runtime PM state, it is only required that the state is stable such that testing config space availability, ie. !D3cold, remains valid across the PME wakeup. To that effect, defer polling of runtime PM managed devices that are not in either the RPM_ACTIVE or RPM_SUSPENDED states. Devices in transitional states remain on the pci_pme_list and will be re-queued. However in allowing polling of devices in the RPM_SUSPENDED state, the bridge state requires further refinement as it's possible to poll while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING. An asynchronous completion of the bridge transition to a low power state can make config space of the subordinate device become unavailable. A runtime PM reference to the bridge is therefore added with a supplementary requirement that the bridge is in the RPM_ACTIVE state. Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling") Reported-by: Sanath S <sanath.s@amd.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360 Signed-off-by: Alex Williamson <alex.williamson@redhat.com> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index bdbf8a94b4d0..31dbf1834b07 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work) if (pdev->pme_poll) { struct pci_dev *bridge = pdev->bus->self; struct device *dev = &pdev->dev; - int pm_status; + struct device *bdev = bridge ? &bridge->dev : NULL; /* - * If bridge is in low power state, the - * configuration space of subordinate devices - * may be not accessible + * If we have a bridge, it should be in an active/D0 + * state or the configuration space of subordinate + * devices may not be accessible. */ - if (bridge && bridge->current_state != PCI_D0) - continue; + if (bdev) { + spin_lock_irq(&bdev->power.lock); + if (!pm_runtime_active(bdev) || + bridge->current_state != PCI_D0) { + spin_unlock_irq(&bdev->power.lock); + continue; + } + pm_runtime_get_noresume(bdev); + spin_unlock_irq(&bdev->power.lock); + } /* - * If the device is in a low power state it - * should not be polled either. + * The device itself may be either in active or + * suspended state, but must not be in D3cold so + * that configuration space is accessible. The + * transitional resuming and suspending states are + * skipped to avoid D3cold races. */ - pm_status = pm_runtime_get_if_active(dev, true); - if (!pm_status) - continue; - - if (pdev->current_state != PCI_D3cold) + spin_lock_irq(&dev->power.lock); + if ((pm_runtime_active(dev) || + pm_runtime_suspended(dev)) && + pdev->current_state != PCI_D3cold) { + pm_runtime_get_noresume(dev); + spin_unlock_irq(&dev->power.lock); pci_pme_wakeup(pdev, NULL); - - if (pm_status > 0) pm_runtime_put(dev); + } else { + spin_unlock_irq(&dev->power.lock); + } + + if (bdev) + pm_runtime_put(bdev); } else { list_del(&pme_dev->list); kfree(pme_dev); ^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2024-01-22 22:50 ` Alex Williamson @ 2024-01-23 4:46 ` Alex Williamson 2024-01-23 4:48 ` Sanath S 2024-01-23 10:45 ` Lukas Wunner 1 sibling, 1 reply; 18+ messages in thread From: Alex Williamson @ 2024-01-23 4:46 UTC (permalink / raw) To: Lukas Wunner Cc: linux-pci, bhelgaas, linux-kernel, eric.auger, mika.westerberg, rafael.j.wysocki, Sanath.S On Mon, 22 Jan 2024 15:50:03 -0700 Alex Williamson <alex.williamson@redhat.com> wrote: > On Mon, 22 Jan 2024 23:17:30 +0100 > Lukas Wunner <lukas@wunner.de> wrote: > > > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote: > > > On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com wrote: > > > > Testing that a device is not currently in a low power state provides no > > > > guarantees that the device is not immenently transitioning to such a state. > > > > We need to increment the PM usage counter before accessing the device. > > > > Since we don't wish to wake the device for PME polling, do so only if the > > > > device is already active by using pm_runtime_get_if_active(). > > > > > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > > > --- > > > > drivers/pci/pci.c | 23 ++++++++++++++++------- > > > > 1 file changed, 16 insertions(+), 7 deletions(-) > > > > > > Resurrecting this patch (currently commit d3fcd7360338) for discussion > > > as it's been identified as the source of a regression in: > > > > > > https://bugzilla.kernel.org/show_bug.cgi?id=218360 > > > > > > Copying Mika, Lukas, and Rafael as it's related to: > > > > > > 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold") > > > > > > where we skip devices in D3cold when processing the PME list. > > > > > > I think the issue in the above bz is that the downstream TB3/USB4 port > > > is in D3 (presumably D3hot) and I therefore infer the device is in state > > > RPM_SUSPENDED. This commit is attempting to make sure the device power > > > state is stable across the call such that it does not transition into > > > D3cold while we're accessing it. > > > > > > To do that I used pm_runtime_get_if_active(), but in retrospect this > > > requires the device to be in RPM_ACTIVE so we end up skipping anything > > > suspended or transitioning. > > > > How about dropping the calls to pm_runtime_get_if_active() and > > pm_runtime_put() and instead simply do: > > > > if (pm_runtime_suspended(&pdev->dev) && > > pdev->current_state != PCI_D3cold) > > pci_pme_wakeup(pdev, NULL); > > Hi Lukas, > > Do we require that the polled device is in the RPM_SUSPENDED state? > Also pm_runtime_suspended() can also only be trusted while holding the > device power.lock, we need a usage count reference to maintain that > state. > > I'm also seeing cases where the bridge is power state D0, but PM state > RPM_SUSPENDING, so config space of the polled device becomes > inaccessible even while we're holding a reference once we allow polling > in RPM_SUSPENDED. > > I'm currently working with the below patch, which I believe addresses > all these issues, but I'd welcome review and testing. Thanks, > > Alex > > commit 0a063b8e91d0bc807db712c81c8b270864f99ecb > Author: Alex Williamson <alex.williamson@redhat.com> > Date: Tue Jan 16 13:28:33 2024 -0700 > > PCI: Fix active state requirement in PME polling > > The commit noted in fixes added a bogus requirement that runtime PM > managed devices need to be in the RPM_ACTIVE state for PME polling. > In fact, there is no requirement of a specific runtime PM state, it > is only required that the state is stable such that testing config > space availability, ie. !D3cold, remains valid across the PME wakeup. > > To that effect, defer polling of runtime PM managed devices that are > not in either the RPM_ACTIVE or RPM_SUSPENDED states. Devices in > transitional states remain on the pci_pme_list and will be re-queued. > > However in allowing polling of devices in the RPM_SUSPENDED state, > the bridge state requires further refinement as it's possible to poll > while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING. > An asynchronous completion of the bridge transition to a low power > state can make config space of the subordinate device become > unavailable. A runtime PM reference to the bridge is therefore added > with a supplementary requirement that the bridge is in the RPM_ACTIVE > state. > > Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling") > Reported-by: Sanath S <sanath.s@amd.com> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360 > Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index bdbf8a94b4d0..31dbf1834b07 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work) > if (pdev->pme_poll) { > struct pci_dev *bridge = pdev->bus->self; > struct device *dev = &pdev->dev; > - int pm_status; > + struct device *bdev = bridge ? &bridge->dev : NULL; > > /* > - * If bridge is in low power state, the > - * configuration space of subordinate devices > - * may be not accessible > + * If we have a bridge, it should be in an active/D0 > + * state or the configuration space of subordinate > + * devices may not be accessible. > */ > - if (bridge && bridge->current_state != PCI_D0) > - continue; > + if (bdev) { > + spin_lock_irq(&bdev->power.lock); With the code as shown here I have one system that seems to be getting contention when reading the vpd sysfs attribute when the endpoints (QL41000) are bound to vfio-pci and unused, resulting in the root port and endpoints being suspended. A vpd read can take over a minute. Seems to be resolved changing the above spin_lock to a spin_trylock: if (!spin_trylock_irq(&bdev->power.lock)) continue; The pm_runtime_barrier() as used in the vpd path can be prone to such issues, I saw similar in the fix I previously proposed in the bz. I'll continue to do more testing with this change and hopefully Sanath can verify this resolves the bug report. Thanks, Alex > + if (!pm_runtime_active(bdev) || > + bridge->current_state != PCI_D0) { > + spin_unlock_irq(&bdev->power.lock); > + continue; > + } > + pm_runtime_get_noresume(bdev); > + spin_unlock_irq(&bdev->power.lock); > + } > > /* > - * If the device is in a low power state it > - * should not be polled either. > + * The device itself may be either in active or > + * suspended state, but must not be in D3cold so > + * that configuration space is accessible. The > + * transitional resuming and suspending states are > + * skipped to avoid D3cold races. > */ > - pm_status = pm_runtime_get_if_active(dev, true); > - if (!pm_status) > - continue; > - > - if (pdev->current_state != PCI_D3cold) > + spin_lock_irq(&dev->power.lock); > + if ((pm_runtime_active(dev) || > + pm_runtime_suspended(dev)) && > + pdev->current_state != PCI_D3cold) { > + pm_runtime_get_noresume(dev); > + spin_unlock_irq(&dev->power.lock); > pci_pme_wakeup(pdev, NULL); > - > - if (pm_status > 0) > pm_runtime_put(dev); > + } else { > + spin_unlock_irq(&dev->power.lock); > + } > + > + if (bdev) > + pm_runtime_put(bdev); > } else { > list_del(&pme_dev->list); > kfree(pme_dev); ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2024-01-23 4:46 ` Alex Williamson @ 2024-01-23 4:48 ` Sanath S 0 siblings, 0 replies; 18+ messages in thread From: Sanath S @ 2024-01-23 4:48 UTC (permalink / raw) To: Alex Williamson, Lukas Wunner Cc: linux-pci, bhelgaas, linux-kernel, eric.auger, mika.westerberg, rafael.j.wysocki, Sanath.S On 1/23/2024 10:16 AM, Alex Williamson wrote: > On Mon, 22 Jan 2024 15:50:03 -0700 > Alex Williamson <alex.williamson@redhat.com> wrote: > >> On Mon, 22 Jan 2024 23:17:30 +0100 >> Lukas Wunner <lukas@wunner.de> wrote: >> >>> On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote: >>>> On Thu, 3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com wrote: >>>>> Testing that a device is not currently in a low power state provides no >>>>> guarantees that the device is not immenently transitioning to such a state. >>>>> We need to increment the PM usage counter before accessing the device. >>>>> Since we don't wish to wake the device for PME polling, do so only if the >>>>> device is already active by using pm_runtime_get_if_active(). >>>>> >>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >>>>> --- >>>>> drivers/pci/pci.c | 23 ++++++++++++++++------- >>>>> 1 file changed, 16 insertions(+), 7 deletions(-) >>>> Resurrecting this patch (currently commit d3fcd7360338) for discussion >>>> as it's been identified as the source of a regression in: >>>> >>>> https://bugzilla.kernel.org/show_bug.cgi?id=218360 >>>> >>>> Copying Mika, Lukas, and Rafael as it's related to: >>>> >>>> 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold") >>>> >>>> where we skip devices in D3cold when processing the PME list. >>>> >>>> I think the issue in the above bz is that the downstream TB3/USB4 port >>>> is in D3 (presumably D3hot) and I therefore infer the device is in state >>>> RPM_SUSPENDED. This commit is attempting to make sure the device power >>>> state is stable across the call such that it does not transition into >>>> D3cold while we're accessing it. >>>> >>>> To do that I used pm_runtime_get_if_active(), but in retrospect this >>>> requires the device to be in RPM_ACTIVE so we end up skipping anything >>>> suspended or transitioning. >>> How about dropping the calls to pm_runtime_get_if_active() and >>> pm_runtime_put() and instead simply do: >>> >>> if (pm_runtime_suspended(&pdev->dev) && >>> pdev->current_state != PCI_D3cold) >>> pci_pme_wakeup(pdev, NULL); >> Hi Lukas, >> >> Do we require that the polled device is in the RPM_SUSPENDED state? >> Also pm_runtime_suspended() can also only be trusted while holding the >> device power.lock, we need a usage count reference to maintain that >> state. >> >> I'm also seeing cases where the bridge is power state D0, but PM state >> RPM_SUSPENDING, so config space of the polled device becomes >> inaccessible even while we're holding a reference once we allow polling >> in RPM_SUSPENDED. >> >> I'm currently working with the below patch, which I believe addresses >> all these issues, but I'd welcome review and testing. Thanks, >> >> Alex >> >> commit 0a063b8e91d0bc807db712c81c8b270864f99ecb >> Author: Alex Williamson <alex.williamson@redhat.com> >> Date: Tue Jan 16 13:28:33 2024 -0700 >> >> PCI: Fix active state requirement in PME polling >> >> The commit noted in fixes added a bogus requirement that runtime PM >> managed devices need to be in the RPM_ACTIVE state for PME polling. >> In fact, there is no requirement of a specific runtime PM state, it >> is only required that the state is stable such that testing config >> space availability, ie. !D3cold, remains valid across the PME wakeup. >> >> To that effect, defer polling of runtime PM managed devices that are >> not in either the RPM_ACTIVE or RPM_SUSPENDED states. Devices in >> transitional states remain on the pci_pme_list and will be re-queued. >> >> However in allowing polling of devices in the RPM_SUSPENDED state, >> the bridge state requires further refinement as it's possible to poll >> while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING. >> An asynchronous completion of the bridge transition to a low power >> state can make config space of the subordinate device become >> unavailable. A runtime PM reference to the bridge is therefore added >> with a supplementary requirement that the bridge is in the RPM_ACTIVE >> state. >> >> Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling") >> Reported-by: Sanath S <sanath.s@amd.com> >> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360 >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >> index bdbf8a94b4d0..31dbf1834b07 100644 >> --- a/drivers/pci/pci.c >> +++ b/drivers/pci/pci.c >> @@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work) >> if (pdev->pme_poll) { >> struct pci_dev *bridge = pdev->bus->self; >> struct device *dev = &pdev->dev; >> - int pm_status; >> + struct device *bdev = bridge ? &bridge->dev : NULL; >> >> /* >> - * If bridge is in low power state, the >> - * configuration space of subordinate devices >> - * may be not accessible >> + * If we have a bridge, it should be in an active/D0 >> + * state or the configuration space of subordinate >> + * devices may not be accessible. >> */ >> - if (bridge && bridge->current_state != PCI_D0) >> - continue; >> + if (bdev) { >> + spin_lock_irq(&bdev->power.lock); > With the code as shown here I have one system that seems to be getting > contention when reading the vpd sysfs attribute when the endpoints > (QL41000) are bound to vfio-pci and unused, resulting in the root port > and endpoints being suspended. A vpd read can take over a minute. > Seems to be resolved changing the above spin_lock to a spin_trylock: > > if (!spin_trylock_irq(&bdev->power.lock)) > continue; > > The pm_runtime_barrier() as used in the vpd path can be prone to such > issues, I saw similar in the fix I previously proposed in the bz. > > I'll continue to do more testing with this change and hopefully Sanath > can verify this resolves the bug report. Thanks, > > Alex I'll verify it today and let you know the observations. >> + if (!pm_runtime_active(bdev) || >> + bridge->current_state != PCI_D0) { >> + spin_unlock_irq(&bdev->power.lock); >> + continue; >> + } >> + pm_runtime_get_noresume(bdev); >> + spin_unlock_irq(&bdev->power.lock); >> + } >> >> /* >> - * If the device is in a low power state it >> - * should not be polled either. >> + * The device itself may be either in active or >> + * suspended state, but must not be in D3cold so >> + * that configuration space is accessible. The >> + * transitional resuming and suspending states are >> + * skipped to avoid D3cold races. >> */ >> - pm_status = pm_runtime_get_if_active(dev, true); >> - if (!pm_status) >> - continue; >> - >> - if (pdev->current_state != PCI_D3cold) >> + spin_lock_irq(&dev->power.lock); >> + if ((pm_runtime_active(dev) || >> + pm_runtime_suspended(dev)) && >> + pdev->current_state != PCI_D3cold) { >> + pm_runtime_get_noresume(dev); >> + spin_unlock_irq(&dev->power.lock); >> pci_pme_wakeup(pdev, NULL); >> - >> - if (pm_status > 0) >> pm_runtime_put(dev); >> + } else { >> + spin_unlock_irq(&dev->power.lock); >> + } >> + >> + if (bdev) >> + pm_runtime_put(bdev); >> } else { >> list_del(&pme_dev->list); >> kfree(pme_dev); ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2024-01-22 22:50 ` Alex Williamson 2024-01-23 4:46 ` Alex Williamson @ 2024-01-23 10:45 ` Lukas Wunner 2024-01-23 15:55 ` Alex Williamson 1 sibling, 1 reply; 18+ messages in thread From: Lukas Wunner @ 2024-01-23 10:45 UTC (permalink / raw) To: Alex Williamson Cc: linux-pci, bhelgaas, linux-kernel, eric.auger, mika.westerberg, rafael.j.wysocki, Sanath.S On Mon, Jan 22, 2024 at 03:50:03PM -0700, Alex Williamson wrote: > On Mon, 22 Jan 2024 23:17:30 +0100 Lukas Wunner <lukas@wunner.de> wrote: > > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote: > > > To do that I used pm_runtime_get_if_active(), but in retrospect this > > > requires the device to be in RPM_ACTIVE so we end up skipping anything > > > suspended or transitioning. > > > > How about dropping the calls to pm_runtime_get_if_active() and > > pm_runtime_put() and instead simply do: > > > > if (pm_runtime_suspended(&pdev->dev) && > > pdev->current_state != PCI_D3cold) > > pci_pme_wakeup(pdev, NULL); > > Do we require that the polled device is in the RPM_SUSPENDED state? If the device is RPM_SUSPENDING, why immediately resume it for polling? It's sufficient to poll it the next time around, i.e. 1 second later. Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need to poll PME. This leaves RPM_SUSPENDED as the only state in which it makes sense to poll. > Also pm_runtime_suspended() can also only be trusted while holding the > device power.lock, we need a usage count reference to maintain that > state. Why? Let's say there's a race and the device resumes immediately after we call pm_runtime_suspended() here. So we might call pci_pme_wakeup() gratuitouly. So what? No biggie. > + if (bdev) { > + spin_lock_irq(&bdev->power.lock); Hm, I'd expect that lock to be internal to the PM core, although there *are* a few stray users outside of it. Thanks, Lukas ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2024-01-23 10:45 ` Lukas Wunner @ 2024-01-23 15:55 ` Alex Williamson 2024-01-23 16:12 ` Lukas Wunner 0 siblings, 1 reply; 18+ messages in thread From: Alex Williamson @ 2024-01-23 15:55 UTC (permalink / raw) To: Lukas Wunner Cc: linux-pci, bhelgaas, linux-kernel, eric.auger, mika.westerberg, rafael.j.wysocki, Sanath.S On Tue, 23 Jan 2024 11:45:19 +0100 Lukas Wunner <lukas@wunner.de> wrote: > On Mon, Jan 22, 2024 at 03:50:03PM -0700, Alex Williamson wrote: > > On Mon, 22 Jan 2024 23:17:30 +0100 Lukas Wunner <lukas@wunner.de> wrote: > > > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote: > > > > To do that I used pm_runtime_get_if_active(), but in retrospect this > > > > requires the device to be in RPM_ACTIVE so we end up skipping anything > > > > suspended or transitioning. > > > > > > How about dropping the calls to pm_runtime_get_if_active() and > > > pm_runtime_put() and instead simply do: > > > > > > if (pm_runtime_suspended(&pdev->dev) && > > > pdev->current_state != PCI_D3cold) > > > pci_pme_wakeup(pdev, NULL); > > > > Do we require that the polled device is in the RPM_SUSPENDED state? > > If the device is RPM_SUSPENDING, why immediately resume it for polling? > It's sufficient to poll it the next time around, i.e. 1 second later. > > Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need > to poll PME. I'm clearly not an expert on PME, but this is not obvious to me and before the commit that went in through this thread, PME wakeup was triggered regardless of the PM state. I was trying to restore the behavior of not requiring a specific PM state other than deferring polling across transition states. > This leaves RPM_SUSPENDED as the only state in which it makes sense to > poll. > > > Also pm_runtime_suspended() can also only be trusted while holding the > > device power.lock, we need a usage count reference to maintain that > > state. > > Why? Let's say there's a race and the device resumes immediately after > we call pm_runtime_suspended() here. So we might call pci_pme_wakeup() > gratuitouly. So what? No biggie. The issue I'm trying to address is that config space of the device can become inaccessible while calling pci_pme_wakeup() on it, causing a system fault on some hardware. So a gratuitous pci_pme_wakeup() can be detrimental. We require the device config space to remain accessible, therefore the instantaneous test against D3cold and that the parent bridge is in D0 is not sufficient. I see traces where the parent bridge is in D0, but the PM state is RPM_SUSPENDING and the endpoint device transitions to D3cold while we're executing pci_pme_wakeup(). Therefore at a minimum, I think we need to enforce that the bridge is in RPM_ACTIVE and remains in that state across pci_pme_wakeup(), which means we need to hold a usage count reference, and that usage count reference must be acquired under power.lock in RPM_ACTIVE state to be effective. > > + if (bdev) { > > + spin_lock_irq(&bdev->power.lock); > > Hm, I'd expect that lock to be internal to the PM core, > although there *are* a few stray users outside of it. Right, there are. It's possible that if we only need to hold a reference on the bridge we can abstract this through pm_runtime_get_if_active(), the semantics worked better to essentially open code it in this iteration though. Thanks, Alex ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2024-01-23 15:55 ` Alex Williamson @ 2024-01-23 16:12 ` Lukas Wunner 2024-01-23 16:47 ` Alex Williamson 0 siblings, 1 reply; 18+ messages in thread From: Lukas Wunner @ 2024-01-23 16:12 UTC (permalink / raw) To: Alex Williamson Cc: linux-pci, bhelgaas, linux-kernel, eric.auger, mika.westerberg, rafael.j.wysocki, Sanath.S On Tue, Jan 23, 2024 at 08:55:21AM -0700, Alex Williamson wrote: > On Tue, 23 Jan 2024 11:45:19 +0100 Lukas Wunner <lukas@wunner.de> wrote: > > If the device is RPM_SUSPENDING, why immediately resume it for polling? > > It's sufficient to poll it the next time around, i.e. 1 second later. > > > > Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need > > to poll PME. > > I'm clearly not an expert on PME, but this is not obvious to me and > before the commit that went in through this thread, PME wakeup was > triggered regardless of the PM state. I was trying to restore the > behavior of not requiring a specific PM state other than deferring > polling across transition states. There are broken devices which are incapable of signaling PME. As a workaround, the kernel polls these devices once per second. The first time the device signals PME, the kernel stops polling that particular device because PME is clearly working. So this is just a best-effort way to support PME for broken devices. If it takes a little longer to detect that PME was signaled, it's not a big deal. > The issue I'm trying to address is that config space of the device can > become inaccessible while calling pci_pme_wakeup() on it, causing a > system fault on some hardware. So a gratuitous pci_pme_wakeup() can be > detrimental. > > We require the device config space to remain accessible, therefore the > instantaneous test against D3cold and that the parent bridge is in D0 > is not sufficient. I see traces where the parent bridge is in D0, but > the PM state is RPM_SUSPENDING and the endpoint device transitions to > D3cold while we're executing pci_pme_wakeup(). We have pci_config_pm_runtime_{get,put}() helpers to ensure the parent of a device is in D0 so that the device's config space is accessible. So you may need to use that in pci_pme_wakeup(). Thanks, Lukas ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling 2024-01-23 16:12 ` Lukas Wunner @ 2024-01-23 16:47 ` Alex Williamson 0 siblings, 0 replies; 18+ messages in thread From: Alex Williamson @ 2024-01-23 16:47 UTC (permalink / raw) To: Lukas Wunner Cc: linux-pci, bhelgaas, linux-kernel, eric.auger, mika.westerberg, rafael.j.wysocki, Sanath.S On Tue, 23 Jan 2024 17:12:39 +0100 Lukas Wunner <lukas@wunner.de> wrote: > On Tue, Jan 23, 2024 at 08:55:21AM -0700, Alex Williamson wrote: > > On Tue, 23 Jan 2024 11:45:19 +0100 Lukas Wunner <lukas@wunner.de> wrote: > > > If the device is RPM_SUSPENDING, why immediately resume it for polling? > > > It's sufficient to poll it the next time around, i.e. 1 second later. > > > > > > Likewise, if it's already RPM_RESUMING or RPM_ACTIVE anyway, no need > > > to poll PME. > > > > I'm clearly not an expert on PME, but this is not obvious to me and > > before the commit that went in through this thread, PME wakeup was > > triggered regardless of the PM state. I was trying to restore the > > behavior of not requiring a specific PM state other than deferring > > polling across transition states. > > There are broken devices which are incapable of signaling PME. > As a workaround, the kernel polls these devices once per second. > The first time the device signals PME, the kernel stops polling > that particular device because PME is clearly working. > > So this is just a best-effort way to support PME for broken devices. > If it takes a little longer to detect that PME was signaled, it's not > a big deal. > > > The issue I'm trying to address is that config space of the device can > > become inaccessible while calling pci_pme_wakeup() on it, causing a > > system fault on some hardware. So a gratuitous pci_pme_wakeup() can be > > detrimental. > > > > We require the device config space to remain accessible, therefore the > > instantaneous test against D3cold and that the parent bridge is in D0 > > is not sufficient. I see traces where the parent bridge is in D0, but > > the PM state is RPM_SUSPENDING and the endpoint device transitions to > > D3cold while we're executing pci_pme_wakeup(). > > We have pci_config_pm_runtime_{get,put}() helpers to ensure the parent > of a device is in D0 so that the device's config space is accessible. > So you may need to use that in pci_pme_wakeup(). pci_config_pm_runtime_get() doesn't seem to align with our current philosophy to defer polling devices that aren't in the correct power state. We require the bridge to be in D0, but we defer polling rather than resume it otherwise. We also defer device polling if the device is in D3cold, whereas the above function would resume a device in that state. I think our bridge D0 test could be reliable if it were done holding a reference acquired via pm_runtime_get_if_active(). Thanks, Alex ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management 2023-08-03 17:12 [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Alex Williamson 2023-08-03 17:12 ` [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface Alex Williamson 2023-08-03 17:12 ` [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling Alex Williamson @ 2023-08-10 18:29 ` Bjorn Helgaas 2023-08-10 18:54 ` Alex Williamson 2 siblings, 1 reply; 18+ messages in thread From: Bjorn Helgaas @ 2023-08-10 18:29 UTC (permalink / raw) To: Alex Williamson; +Cc: bhelgaas, linux-pci, linux-kernel, eric.auger On Thu, Aug 03, 2023 at 11:12:31AM -0600, Alex Williamson wrote: > Since v5.19, vfio-pci makes use of runtime power management on devices. > This has the effect of potentially putting entire sub-hierarchies into > lower power states, which has exposed some gaps in the PCI subsystem > around power management support. > > The first issue is that lspci accesses the VPD sysfs interface, which > does not provide the same power management wrappers as general config > space. > > The next covers PME, where we attempt to skip devices based on their PCI > power state, but don't protect changes to that state or look at the > overall runtime power management state of the device. > > This latter patch addresses the issue noted by Eric in the follow-ups to > v1 linked below. > > These patches are logically independent, but only together resolve an > issue on Eric's system where a pair of endpoints bound to vfio-pci and > unused by userspace drivers trigger faults through lspci and PME > polling. Thanks, > > Alex > > v1: https://lore.kernel.org/all/20230707151044.1311544-1-alex.williamson@redhat.com/ > > Alex Williamson (2): > PCI/VPD: Add runtime power management to sysfs interface > PCI: Fix runtime PM race with PME polling > > drivers/pci/pci.c | 23 ++++++++++++++++------- > drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++-- > 2 files changed, 48 insertions(+), 9 deletions(-) Applied with the tweak below to pci/vpd for v6.6, thanks! The idea is to match the pci_get_func0_dev() so the get/put balance is clear without having to analyze PCI_DEV_FLAGS_VPD_REF_F0 usage: - if (dev != vpd_dev) + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management 2023-08-10 18:29 ` [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Bjorn Helgaas @ 2023-08-10 18:54 ` Alex Williamson 0 siblings, 0 replies; 18+ messages in thread From: Alex Williamson @ 2023-08-10 18:54 UTC (permalink / raw) To: Bjorn Helgaas; +Cc: bhelgaas, linux-pci, linux-kernel, eric.auger On Thu, 10 Aug 2023 13:29:44 -0500 Bjorn Helgaas <helgaas@kernel.org> wrote: > On Thu, Aug 03, 2023 at 11:12:31AM -0600, Alex Williamson wrote: > > Since v5.19, vfio-pci makes use of runtime power management on devices. > > This has the effect of potentially putting entire sub-hierarchies into > > lower power states, which has exposed some gaps in the PCI subsystem > > around power management support. > > > > The first issue is that lspci accesses the VPD sysfs interface, which > > does not provide the same power management wrappers as general config > > space. > > > > The next covers PME, where we attempt to skip devices based on their PCI > > power state, but don't protect changes to that state or look at the > > overall runtime power management state of the device. > > > > This latter patch addresses the issue noted by Eric in the follow-ups to > > v1 linked below. > > > > These patches are logically independent, but only together resolve an > > issue on Eric's system where a pair of endpoints bound to vfio-pci and > > unused by userspace drivers trigger faults through lspci and PME > > polling. Thanks, > > > > Alex > > > > v1: https://lore.kernel.org/all/20230707151044.1311544-1-alex.williamson@redhat.com/ > > > > Alex Williamson (2): > > PCI/VPD: Add runtime power management to sysfs interface > > PCI: Fix runtime PM race with PME polling > > > > drivers/pci/pci.c | 23 ++++++++++++++++------- > > drivers/pci/vpd.c | 34 ++++++++++++++++++++++++++++++++-- > > 2 files changed, 48 insertions(+), 9 deletions(-) > > Applied with the tweak below to pci/vpd for v6.6, thanks! The idea is > to match the pci_get_func0_dev() so the get/put balance is clear > without having to analyze PCI_DEV_FLAGS_VPD_REF_F0 usage: > > - if (dev != vpd_dev) > + if (dev->dev_flags & PCI_DEV_FLAGS_VPD_REF_F0) > Looks good, thanks! Alex ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2024-01-23 16:47 UTC | newest] Thread overview: 18+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-08-03 17:12 [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Alex Williamson 2023-08-03 17:12 ` [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface Alex Williamson 2023-08-10 15:59 ` Bjorn Helgaas 2023-08-10 16:26 ` Alex Williamson 2023-08-11 19:25 ` Bjorn Helgaas 2023-08-11 19:56 ` Alex Williamson 2023-08-03 17:12 ` [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling Alex Williamson 2024-01-18 18:50 ` Alex Williamson 2024-01-22 22:17 ` Lukas Wunner 2024-01-22 22:50 ` Alex Williamson 2024-01-23 4:46 ` Alex Williamson 2024-01-23 4:48 ` Sanath S 2024-01-23 10:45 ` Lukas Wunner 2024-01-23 15:55 ` Alex Williamson 2024-01-23 16:12 ` Lukas Wunner 2024-01-23 16:47 ` Alex Williamson 2023-08-10 18:29 ` [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Bjorn Helgaas 2023-08-10 18:54 ` Alex Williamson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).