linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alex Williamson <alex.williamson@redhat.com>
To: Lukas Wunner <lukas@wunner.de>
Cc: linux-pci@vger.kernel.org, bhelgaas@google.com,
	linux-kernel@vger.kernel.org, eric.auger@redhat.com,
	mika.westerberg@linux.intel.com, rafael.j.wysocki@intel.com,
	Sanath.S@amd.com
Subject: Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling
Date: Mon, 22 Jan 2024 21:46:22 -0700	[thread overview]
Message-ID: <20240122214622.32e8c5fa.alex.williamson@redhat.com> (raw)
In-Reply-To: <20240122155003.587225aa.alex.williamson@redhat.com>

On Mon, 22 Jan 2024 15:50:03 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 22 Jan 2024 23:17:30 +0100
> Lukas Wunner <lukas@wunner.de> wrote:
> 
> > On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:  
> > > On Thu,  3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@redhat.com wrote:    
> > > > Testing that a device is not currently in a low power state provides no
> > > > guarantees that the device is not immenently transitioning to such a state.
> > > > We need to increment the PM usage counter before accessing the device.
> > > > Since we don't wish to wake the device for PME polling, do so only if the
> > > > device is already active by using pm_runtime_get_if_active().
> > > > 
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > ---
> > > >  drivers/pci/pci.c | 23 ++++++++++++++++-------
> > > >  1 file changed, 16 insertions(+), 7 deletions(-)    
> > > 
> > > Resurrecting this patch (currently commit d3fcd7360338) for discussion
> > > as it's been identified as the source of a regression in:
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=218360
> > > 
> > > Copying Mika, Lukas, and Rafael as it's related to:
> > > 
> > > 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")
> > > 
> > > where we skip devices in D3cold when processing the PME list.
> > > 
> > > I think the issue in the above bz is that the downstream TB3/USB4 port
> > > is in D3 (presumably D3hot) and I therefore infer the device is in state
> > > RPM_SUSPENDED.  This commit is attempting to make sure the device power
> > > state is stable across the call such that it does not transition into
> > > D3cold while we're accessing it.
> > > 
> > > To do that I used pm_runtime_get_if_active(), but in retrospect this
> > > requires the device to be in RPM_ACTIVE so we end up skipping anything
> > > suspended or transitioning.    
> > 
> > How about dropping the calls to pm_runtime_get_if_active() and
> > pm_runtime_put() and instead simply do:
> > 
> > 			if (pm_runtime_suspended(&pdev->dev) &&
> > 			    pdev->current_state != PCI_D3cold)
> > 				pci_pme_wakeup(pdev, NULL);  
> 
> Hi Lukas,
> 
> Do we require that the polled device is in the RPM_SUSPENDED state?
> Also pm_runtime_suspended() can also only be trusted while holding the
> device power.lock, we need a usage count reference to maintain that
> state.
> 
> I'm also seeing cases where the bridge is power state D0, but PM state
> RPM_SUSPENDING, so config space of the polled device becomes
> inaccessible even while we're holding a reference once we allow polling
> in RPM_SUSPENDED.
> 
> I'm currently working with the below patch, which I believe addresses
> all these issues, but I'd welcome review and testing. Thanks,
> 
> Alex
> 
> commit 0a063b8e91d0bc807db712c81c8b270864f99ecb
> Author: Alex Williamson <alex.williamson@redhat.com>
> Date:   Tue Jan 16 13:28:33 2024 -0700
> 
>     PCI: Fix active state requirement in PME polling
>     
>     The commit noted in fixes added a bogus requirement that runtime PM
>     managed devices need to be in the RPM_ACTIVE state for PME polling.
>     In fact, there is no requirement of a specific runtime PM state, it
>     is only required that the state is stable such that testing config
>     space availability, ie. !D3cold, remains valid across the PME wakeup.
>     
>     To that effect, defer polling of runtime PM managed devices that are
>     not in either the RPM_ACTIVE or RPM_SUSPENDED states.  Devices in
>     transitional states remain on the pci_pme_list and will be re-queued.
>     
>     However in allowing polling of devices in the RPM_SUSPENDED state,
>     the bridge state requires further refinement as it's possible to poll
>     while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING.
>     An asynchronous completion of the bridge transition to a low power
>     state can make config space of the subordinate device become
>     unavailable.  A runtime PM reference to the bridge is therefore added
>     with a supplementary requirement that the bridge is in the RPM_ACTIVE
>     state.
>     
>     Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling")
>     Reported-by: Sanath S <sanath.s@amd.com>
>     Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360
>     Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index bdbf8a94b4d0..31dbf1834b07 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work)
>  		if (pdev->pme_poll) {
>  			struct pci_dev *bridge = pdev->bus->self;
>  			struct device *dev = &pdev->dev;
> -			int pm_status;
> +			struct device *bdev = bridge ? &bridge->dev : NULL;
>  
>  			/*
> -			 * If bridge is in low power state, the
> -			 * configuration space of subordinate devices
> -			 * may be not accessible
> +			 * If we have a bridge, it should be in an active/D0
> +			 * state or the configuration space of subordinate
> +			 * devices may not be accessible.
>  			 */
> -			if (bridge && bridge->current_state != PCI_D0)
> -				continue;
> +			if (bdev) {
> +				spin_lock_irq(&bdev->power.lock);

With the code as shown here I have one system that seems to be getting
contention when reading the vpd sysfs attribute when the endpoints
(QL41000) are bound to vfio-pci and unused, resulting in the root port
and endpoints being suspended.  A vpd read can take over a minute.
Seems to be resolved changing the above spin_lock to a spin_trylock:

				if (!spin_trylock_irq(&bdev->power.lock))
					continue;

The pm_runtime_barrier() as used in the vpd path can be prone to such
issues, I saw similar in the fix I previously proposed in the bz.

I'll continue to do more testing with this change and hopefully Sanath
can verify this resolves the bug report.  Thanks,

Alex

> +				if (!pm_runtime_active(bdev) ||
> +				    bridge->current_state != PCI_D0) {
> +					spin_unlock_irq(&bdev->power.lock);
> +					continue;
> +				}
> +				pm_runtime_get_noresume(bdev);
> +				spin_unlock_irq(&bdev->power.lock);
> +			}
>  
>  			/*
> -			 * If the device is in a low power state it
> -			 * should not be polled either.
> +			 * The device itself may be either in active or
> +			 * suspended state, but must not be in D3cold so
> +			 * that configuration space is accessible.  The
> +			 * transitional resuming and suspending states are
> +			 * skipped to avoid D3cold races.
>  			 */
> -			pm_status = pm_runtime_get_if_active(dev, true);
> -			if (!pm_status)
> -				continue;
> -
> -			if (pdev->current_state != PCI_D3cold)
> +			spin_lock_irq(&dev->power.lock);
> +			if ((pm_runtime_active(dev) ||
> +			     pm_runtime_suspended(dev)) &&
> +			    pdev->current_state != PCI_D3cold) {
> +				pm_runtime_get_noresume(dev);
> +				spin_unlock_irq(&dev->power.lock);
>  				pci_pme_wakeup(pdev, NULL);
> -
> -			if (pm_status > 0)
>  				pm_runtime_put(dev);
> +			} else {
> +				spin_unlock_irq(&dev->power.lock);
> +			}
> +
> +			if (bdev)
> +				pm_runtime_put(bdev);
>  		} else {
>  			list_del(&pme_dev->list);
>  			kfree(pme_dev);


  reply	other threads:[~2024-01-23  4:46 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-08-03 17:12 [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Alex Williamson
2023-08-03 17:12 ` [PATCH v2 1/2] PCI/VPD: Add runtime power management to sysfs interface Alex Williamson
2023-08-10 15:59   ` Bjorn Helgaas
2023-08-10 16:26     ` Alex Williamson
2023-08-11 19:25   ` Bjorn Helgaas
2023-08-11 19:56     ` Alex Williamson
2023-08-03 17:12 ` [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling Alex Williamson
2024-01-18 18:50   ` Alex Williamson
2024-01-22 22:17     ` Lukas Wunner
2024-01-22 22:50       ` Alex Williamson
2024-01-23  4:46         ` Alex Williamson [this message]
2024-01-23  4:48           ` Sanath S
2024-01-23 10:45         ` Lukas Wunner
2024-01-23 15:55           ` Alex Williamson
2024-01-23 16:12             ` Lukas Wunner
2024-01-23 16:47               ` Alex Williamson
2023-08-10 18:29 ` [PATCH v2 0/2] PCI: Protect VPD and PME accesses from power management Bjorn Helgaas
2023-08-10 18:54   ` Alex Williamson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240122214622.32e8c5fa.alex.williamson@redhat.com \
    --to=alex.williamson@redhat.com \
    --cc=Sanath.S@amd.com \
    --cc=bhelgaas@google.com \
    --cc=eric.auger@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mika.westerberg@linux.intel.com \
    --cc=rafael.j.wysocki@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).