stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"
       [not found] <76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de>
@ 2024-01-02 23:25 ` Bjorn Helgaas
  2024-01-02 23:33   ` Kuppuswamy Sathyanarayanan
  2024-01-08  8:39   ` Johan Hovold
  0 siblings, 2 replies; 10+ messages in thread
From: Bjorn Helgaas @ 2024-01-02 23:25 UTC (permalink / raw)
  To: Michael Schaller, Kai-Heng Feng
  Cc: linux-pci, linux-kernel, regressions, Maciej W . Rozycki,
	Ajay Agarwal, Kuppuswamy Sathyanarayanan, Greg Kroah-Hartman,
	Heiner Kallweit, Johan Hovold, Bjorn Helgaas, stable

From: Bjorn Helgaas <bhelgaas@google.com>

This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.

Michael reported that when attempting to resume from suspend to RAM on ASUS
mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
with no output, followed by a reboot.

Workarounds include:

  - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
  - Booting with "pcie_aspm=off"
  - Booting with "pcie_aspm.policy=performance"
  - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
    before suspending
  - Connecting a USB flash drive

Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
Reported-by: Michael Schaller <michael@5challer.de>
Link: https://lore.kernel.org/r/76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: <stable@vger.kernel.org>
---
 drivers/pci/pci.c       |  6 ++++++
 drivers/pci/pci.h       |  2 ++
 drivers/pci/pcie/aspm.c | 19 +++++++++++++++++++
 3 files changed, 27 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 55bc3576a985..bdbf8a94b4d0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1335,6 +1335,9 @@ static int pci_set_full_power_state(struct pci_dev *dev)
 		pci_restore_bars(dev);
 	}
 
+	if (dev->bus->self)
+		pcie_aspm_pm_state_change(dev->bus->self);
+
 	return 0;
 }
 
@@ -1429,6 +1432,9 @@ static int pci_set_low_power_state(struct pci_dev *dev, pci_power_t state)
 				     pci_power_name(dev->current_state),
 				     pci_power_name(state));
 
+	if (dev->bus->self)
+		pcie_aspm_pm_state_change(dev->bus->self);
+
 	return 0;
 }
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 5ecbcf041179..f43873049d52 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -569,10 +569,12 @@ int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
 #ifdef CONFIG_PCIEASPM
 void pcie_aspm_init_link_state(struct pci_dev *pdev);
 void pcie_aspm_exit_link_state(struct pci_dev *pdev);
+void pcie_aspm_pm_state_change(struct pci_dev *pdev);
 void pcie_aspm_powersave_config_link(struct pci_dev *pdev);
 #else
 static inline void pcie_aspm_init_link_state(struct pci_dev *pdev) { }
 static inline void pcie_aspm_exit_link_state(struct pci_dev *pdev) { }
+static inline void pcie_aspm_pm_state_change(struct pci_dev *pdev) { }
 static inline void pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
 #endif
 
diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 50b04ae5c394..8715e951c491 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -1008,6 +1008,25 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev)
 	up_read(&pci_bus_sem);
 }
 
+/* @pdev: the root port or switch downstream port */
+void pcie_aspm_pm_state_change(struct pci_dev *pdev)
+{
+	struct pcie_link_state *link = pdev->link_state;
+
+	if (aspm_disabled || !link)
+		return;
+	/*
+	 * Devices changed PM state, we should recheck if latency
+	 * meets all functions' requirement
+	 */
+	down_read(&pci_bus_sem);
+	mutex_lock(&aspm_lock);
+	pcie_update_aspm_capable(link->root);
+	pcie_config_aspm_path(link);
+	mutex_unlock(&aspm_lock);
+	up_read(&pci_bus_sem);
+}
+
 void pcie_aspm_powersave_config_link(struct pci_dev *pdev)
 {
 	struct pcie_link_state *link = pdev->link_state;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"
  2024-01-02 23:25 ` [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()" Bjorn Helgaas
@ 2024-01-02 23:33   ` Kuppuswamy Sathyanarayanan
  2024-01-03  0:12     ` Bjorn Helgaas
  2024-01-08  8:39   ` Johan Hovold
  1 sibling, 1 reply; 10+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2024-01-02 23:33 UTC (permalink / raw)
  To: Bjorn Helgaas, Michael Schaller, Kai-Heng Feng
  Cc: linux-pci, linux-kernel, regressions, Maciej W . Rozycki,
	Ajay Agarwal, Greg Kroah-Hartman, Heiner Kallweit, Johan Hovold,
	Bjorn Helgaas, stable



On 1/2/2024 3:25 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
> 
> Michael reported that when attempting to resume from suspend to RAM on ASUS
> mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> with no output, followed by a reboot.
> 
> Workarounds include:
> 
>   - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
>   - Booting with "pcie_aspm=off"
>   - Booting with "pcie_aspm.policy=performance"
>   - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
>     before suspending
>   - Connecting a USB flash drive
> 

Did you find the root cause? Is this issue specific to that particular
device? If yes, can we do a quirk?

> Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> Reported-by: Michael Schaller <michael@5challer.de>
> Link: https://lore.kernel.org/r/76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/pci/pci.c       |  6 ++++++
>  drivers/pci/pci.h       |  2 ++
>  drivers/pci/pcie/aspm.c | 19 +++++++++++++++++++
>  3 files changed, 27 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 55bc3576a985..bdbf8a94b4d0 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1335,6 +1335,9 @@ static int pci_set_full_power_state(struct pci_dev *dev)
>  		pci_restore_bars(dev);
>  	}
>  
> +	if (dev->bus->self)
> +		pcie_aspm_pm_state_change(dev->bus->self);
> +
>  	return 0;
>  }
>  
> @@ -1429,6 +1432,9 @@ static int pci_set_low_power_state(struct pci_dev *dev, pci_power_t state)
>  				     pci_power_name(dev->current_state),
>  				     pci_power_name(state));
>  
> +	if (dev->bus->self)
> +		pcie_aspm_pm_state_change(dev->bus->self);
> +
>  	return 0;
>  }
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 5ecbcf041179..f43873049d52 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -569,10 +569,12 @@ int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
>  #ifdef CONFIG_PCIEASPM
>  void pcie_aspm_init_link_state(struct pci_dev *pdev);
>  void pcie_aspm_exit_link_state(struct pci_dev *pdev);
> +void pcie_aspm_pm_state_change(struct pci_dev *pdev);
>  void pcie_aspm_powersave_config_link(struct pci_dev *pdev);
>  #else
>  static inline void pcie_aspm_init_link_state(struct pci_dev *pdev) { }
>  static inline void pcie_aspm_exit_link_state(struct pci_dev *pdev) { }
> +static inline void pcie_aspm_pm_state_change(struct pci_dev *pdev) { }
>  static inline void pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
>  #endif
>  
> diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> index 50b04ae5c394..8715e951c491 100644
> --- a/drivers/pci/pcie/aspm.c
> +++ b/drivers/pci/pcie/aspm.c
> @@ -1008,6 +1008,25 @@ void pcie_aspm_exit_link_state(struct pci_dev *pdev)
>  	up_read(&pci_bus_sem);
>  }
>  
> +/* @pdev: the root port or switch downstream port */
> +void pcie_aspm_pm_state_change(struct pci_dev *pdev)
> +{
> +	struct pcie_link_state *link = pdev->link_state;
> +
> +	if (aspm_disabled || !link)
> +		return;
> +	/*
> +	 * Devices changed PM state, we should recheck if latency
> +	 * meets all functions' requirement
> +	 */
> +	down_read(&pci_bus_sem);
> +	mutex_lock(&aspm_lock);
> +	pcie_update_aspm_capable(link->root);
> +	pcie_config_aspm_path(link);
> +	mutex_unlock(&aspm_lock);
> +	up_read(&pci_bus_sem);
> +}
> +
>  void pcie_aspm_powersave_config_link(struct pci_dev *pdev)
>  {
>  	struct pcie_link_state *link = pdev->link_state;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"
  2024-01-02 23:33   ` Kuppuswamy Sathyanarayanan
@ 2024-01-03  0:12     ` Bjorn Helgaas
  0 siblings, 0 replies; 10+ messages in thread
From: Bjorn Helgaas @ 2024-01-03  0:12 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: Michael Schaller, Kai-Heng Feng, linux-pci, linux-kernel,
	regressions, Maciej W . Rozycki, Ajay Agarwal, Greg Kroah-Hartman,
	Heiner Kallweit, Johan Hovold, Bjorn Helgaas, stable

On Tue, Jan 02, 2024 at 03:33:51PM -0800, Kuppuswamy Sathyanarayanan wrote:
> On 1/2/2024 3:25 PM, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
> > 
> > Michael reported that when attempting to resume from suspend to RAM on ASUS
> > mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> > ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> > with no output, followed by a reboot.
> > 
> > Workarounds include:
> > 
> >   - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> >   - Booting with "pcie_aspm=off"
> >   - Booting with "pcie_aspm.policy=performance"
> >   - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
> >     before suspending
> >   - Connecting a USB flash drive
> 
> Did you find the root cause? Is this issue specific to that particular
> device? If yes, can we do a quirk?

Unfortunately we don't know the root cause yet.  Without knowing the
root cause, I don't think we can make a good quirk.

Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"
  2024-01-02 23:25 ` [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()" Bjorn Helgaas
  2024-01-02 23:33   ` Kuppuswamy Sathyanarayanan
@ 2024-01-08  8:39   ` Johan Hovold
  2024-01-22 10:53     ` PCI/ASPM locking regression in 6.7-final (was: Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()") Johan Hovold
  1 sibling, 1 reply; 10+ messages in thread
From: Johan Hovold @ 2024-01-08  8:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Michael Schaller, Kai-Heng Feng, linux-pci, linux-kernel,
	regressions, Maciej W . Rozycki, Ajay Agarwal,
	Kuppuswamy Sathyanarayanan, Greg Kroah-Hartman, Heiner Kallweit,
	Johan Hovold, Bjorn Helgaas, stable

Hi Bjorn,

On Tue, Jan 02, 2024 at 05:25:50PM -0600, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
> 
> Michael reported that when attempting to resume from suspend to RAM on ASUS
> mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> with no output, followed by a reboot.
> 
> Workarounds include:
> 
>   - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
>   - Booting with "pcie_aspm=off"
>   - Booting with "pcie_aspm.policy=performance"
>   - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
>     before suspending
>   - Connecting a USB flash drive
> 
> Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> Reported-by: Michael Schaller <michael@5challer.de>
> Link: https://lore.kernel.org/r/76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> Cc: <stable@vger.kernel.org>
> ---
 
> +/* @pdev: the root port or switch downstream port */
> +void pcie_aspm_pm_state_change(struct pci_dev *pdev)
> +{
> +	struct pcie_link_state *link = pdev->link_state;
> +
> +	if (aspm_disabled || !link)
> +		return;
> +	/*
> +	 * Devices changed PM state, we should recheck if latency
> +	 * meets all functions' requirement
> +	 */
> +	down_read(&pci_bus_sem);
> +	mutex_lock(&aspm_lock);
> +	pcie_update_aspm_capable(link->root);
> +	pcie_config_aspm_path(link);
> +	mutex_unlock(&aspm_lock);
> +	up_read(&pci_bus_sem);
> +}

This function is now restored in 6.7 final and is called in paths which
already hold the pci_bus_sem as reported by lockdep (see splat below).

This can potentially lead to a deadlock and specifically prevents using
lockdep on Qualcomm platforms.

Not sure if you want to propagate whether the bus semaphore is held to
pcie_aspm_pm_state_change() or if there was some alternative to
restoring this function which should be explored instead.

Johan


   ============================================
   WARNING: possible recursive locking detected
   6.7.0 #40 Not tainted
   --------------------------------------------
   kworker/u16:5/90 is trying to acquire lock:
   ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pcie_aspm_pm_state_change+0x58/0xdc
   pcieport 0002:00:00.0: PME: Signaling with IRQ 197
   
               but task is already holding lock:
   ffffacfa78ced000
   pcieport 0002:00:00.0: AER: enabled with IRQ 197
    (pci_bus_sem
   nvme nvme0: pci function 0002:01:00.0
   ){++++}-{3:3}
   nvme 0002:01:00.0: enabling device (0000 -> 0002)
   , at: pci_walk_bus+0x34/0xbc
   
               other info that might help us debug this:
    Possible unsafe locking scenario:

          CPU0
          ----
     lock(pci_bus_sem);
     lock(pci_bus_sem);
   
                *** DEADLOCK ***

    May be due to missing lock nesting notation

   4 locks held by kworker/u16:5/90:
    #0: ffff06c5c0008d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
    #1: ffff800081c0bdd0 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
    #2: ffff06c5c0b7d0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach_async_helper+0x3c/0xf4
    #3: ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
   
               stack backtrace:
   CPU: 1 PID: 90 Comm: kworker/u16:5 Not tainted 6.7.0 #40
   Hardware name: LENOVO 21BYZ9SRUS/21BYZ9SRUS, BIOS N3HET53W (1.25 ) 10/12/2022
   Workqueue: events_unbound async_run_entry_fn
   Call trace:
    dump_backtrace+0x9c/0x11c
    show_stack+0x18/0x24
    dump_stack_lvl+0x60/0xac
    dump_stack+0x18/0x24
    print_deadlock_bug+0x25c/0x348
    __lock_acquire+0x10a4/0x2064
    lock_acquire+0x1e8/0x318
    down_read+0x60/0x184
    pcie_aspm_pm_state_change+0x58/0xdc
    pci_set_full_power_state+0xa8/0x114
    pci_set_power_state+0xc4/0x120
    qcom_pcie_enable_aspm+0x1c/0x3c [pcie_qcom]
    pci_walk_bus+0x64/0xbc
    qcom_pcie_host_post_init_2_7_0+0x28/0x34 [pcie_qcom]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* PCI/ASPM locking regression in 6.7-final (was: Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()")
  2024-01-08  8:39   ` Johan Hovold
@ 2024-01-22 10:53     ` Johan Hovold
  2024-01-22 18:26       ` Bjorn Helgaas
  0 siblings, 1 reply; 10+ messages in thread
From: Johan Hovold @ 2024-01-22 10:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Michael Schaller, Kai-Heng Feng, linux-pci, linux-kernel,
	regressions, Maciej W . Rozycki, Ajay Agarwal,
	Kuppuswamy Sathyanarayanan, Greg Kroah-Hartman, Heiner Kallweit,
	Johan Hovold, Bjorn Helgaas, stable, regressions

Hi Bjorn,

I never got a reply to this one so resending with updated Subject in
case it got buried in your inbox.

On Mon, Jan 08, 2024 at 09:39:07AM +0100, Johan Hovold wrote:
 
> On Tue, Jan 02, 2024 at 05:25:50PM -0600, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
> > 
> > Michael reported that when attempting to resume from suspend to RAM on ASUS
> > mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> > ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> > with no output, followed by a reboot.
> > 
> > Workarounds include:
> > 
> >   - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> >   - Booting with "pcie_aspm=off"
> >   - Booting with "pcie_aspm.policy=performance"
> >   - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
> >     before suspending
> >   - Connecting a USB flash drive
> > 
> > Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > Reported-by: Michael Schaller <michael@5challer.de>
> > Link: https://lore.kernel.org/r/76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: <stable@vger.kernel.org>
> > ---
>  
> > +/* @pdev: the root port or switch downstream port */
> > +void pcie_aspm_pm_state_change(struct pci_dev *pdev)
> > +{
> > +	struct pcie_link_state *link = pdev->link_state;
> > +
> > +	if (aspm_disabled || !link)
> > +		return;
> > +	/*
> > +	 * Devices changed PM state, we should recheck if latency
> > +	 * meets all functions' requirement
> > +	 */
> > +	down_read(&pci_bus_sem);
> > +	mutex_lock(&aspm_lock);
> > +	pcie_update_aspm_capable(link->root);
> > +	pcie_config_aspm_path(link);
> > +	mutex_unlock(&aspm_lock);
> > +	up_read(&pci_bus_sem);
> > +}
> 
> This function is now restored in 6.7 final and is called in paths which
> already hold the pci_bus_sem as reported by lockdep (see splat below).
> 
> This can potentially lead to a deadlock and specifically prevents using
> lockdep on Qualcomm platforms.
> 
> Not sure if you want to propagate whether the bus semaphore is held to
> pcie_aspm_pm_state_change() or if there was some alternative to
> restoring this function which should be explored instead.

So to summarise, this patch, which is now commit

	f93e71aea6c6 ("Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"")

introduced a regression in 6.7-final for Qualcomm platforms (and some
Intel platforms) similar to the one recently fixed by commit

	f352ce999260 ("PCI: qcom: Fix potential deadlock when enabling ASPM").

Johan


#regzbot introduced: f93e71aea6c6

>    ============================================
>    WARNING: possible recursive locking detected
>    6.7.0 #40 Not tainted
>    --------------------------------------------
>    kworker/u16:5/90 is trying to acquire lock:
>    ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pcie_aspm_pm_state_change+0x58/0xdc
>    pcieport 0002:00:00.0: PME: Signaling with IRQ 197
>    
>                but task is already holding lock:
>    ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
>    
>                other info that might help us debug this:
>     Possible unsafe locking scenario:
> 
>           CPU0
>           ----
>      lock(pci_bus_sem);
>      lock(pci_bus_sem);
>    
>                 *** DEADLOCK ***
> 
>     May be due to missing lock nesting notation
> 
>    4 locks held by kworker/u16:5/90:
>     #0: ffff06c5c0008d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
>     #1: ffff800081c0bdd0 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
>     #2: ffff06c5c0b7d0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach_async_helper+0x3c/0xf4
>     #3: ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
>    
>                stack backtrace:
>    CPU: 1 PID: 90 Comm: kworker/u16:5 Not tainted 6.7.0 #40
>    Hardware name: LENOVO 21BYZ9SRUS/21BYZ9SRUS, BIOS N3HET53W (1.25 ) 10/12/2022
>    Workqueue: events_unbound async_run_entry_fn
>    Call trace:
>     dump_backtrace+0x9c/0x11c
>     show_stack+0x18/0x24
>     dump_stack_lvl+0x60/0xac
>     dump_stack+0x18/0x24
>     print_deadlock_bug+0x25c/0x348
>     __lock_acquire+0x10a4/0x2064
>     lock_acquire+0x1e8/0x318
>     down_read+0x60/0x184
>     pcie_aspm_pm_state_change+0x58/0xdc
>     pci_set_full_power_state+0xa8/0x114
>     pci_set_power_state+0xc4/0x120
>     qcom_pcie_enable_aspm+0x1c/0x3c [pcie_qcom]
>     pci_walk_bus+0x64/0xbc
>     qcom_pcie_host_post_init_2_7_0+0x28/0x34 [pcie_qcom]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: PCI/ASPM locking regression in 6.7-final (was: Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()")
  2024-01-22 10:53     ` PCI/ASPM locking regression in 6.7-final (was: Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()") Johan Hovold
@ 2024-01-22 18:26       ` Bjorn Helgaas
  2024-01-23 17:25         ` Johan Hovold
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2024-01-22 18:26 UTC (permalink / raw)
  To: Johan Hovold
  Cc: Michael Schaller, Kai-Heng Feng, linux-pci, linux-kernel,
	regressions, Maciej W . Rozycki, Ajay Agarwal,
	Kuppuswamy Sathyanarayanan, Greg Kroah-Hartman, Heiner Kallweit,
	Johan Hovold, Bjorn Helgaas, stable, regressions

On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
> Hi Bjorn,
> 
> I never got a reply to this one so resending with updated Subject in
> case it got buried in your inbox.

I did see it but decided it was better to fix the problem with resume
causing an unintended reboot, even though fixing that meant breaking
lockdep again, since I don't think we have user reports of the
potential deadlock lockdep finds.

08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a
start at fixing other problems and also improving the ASPM style, so I
hope somebody steps up to fix both it and the lockdep issue.  I
haven't looked at it enough to have a preference for *how* to fix it.

Bjorn

> On Mon, Jan 08, 2024 at 09:39:07AM +0100, Johan Hovold wrote:
>  
> > On Tue, Jan 02, 2024 at 05:25:50PM -0600, Bjorn Helgaas wrote:
> > > From: Bjorn Helgaas <bhelgaas@google.com>
> > > 
> > > This reverts commit 08d0cc5f34265d1a1e3031f319f594bd1970976c.
> > > 
> > > Michael reported that when attempting to resume from suspend to RAM on ASUS
> > > mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1), 08d0cc5f3426
> > > ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") caused a 12-second delay
> > > with no output, followed by a reboot.
> > > 
> > > Workarounds include:
> > > 
> > >   - Reverting 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > >   - Booting with "pcie_aspm=off"
> > >   - Booting with "pcie_aspm.policy=performance"
> > >   - "echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm"
> > >     before suspending
> > >   - Connecting a USB flash drive
> > > 
> > > Fixes: 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > > Reported-by: Michael Schaller <michael@5challer.de>
> > > Link: https://lore.kernel.org/r/76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de
> > > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > > Cc: <stable@vger.kernel.org>
> > > ---
> >  
> > > +/* @pdev: the root port or switch downstream port */
> > > +void pcie_aspm_pm_state_change(struct pci_dev *pdev)
> > > +{
> > > +	struct pcie_link_state *link = pdev->link_state;
> > > +
> > > +	if (aspm_disabled || !link)
> > > +		return;
> > > +	/*
> > > +	 * Devices changed PM state, we should recheck if latency
> > > +	 * meets all functions' requirement
> > > +	 */
> > > +	down_read(&pci_bus_sem);
> > > +	mutex_lock(&aspm_lock);
> > > +	pcie_update_aspm_capable(link->root);
> > > +	pcie_config_aspm_path(link);
> > > +	mutex_unlock(&aspm_lock);
> > > +	up_read(&pci_bus_sem);
> > > +}
> > 
> > This function is now restored in 6.7 final and is called in paths which
> > already hold the pci_bus_sem as reported by lockdep (see splat below).
> > 
> > This can potentially lead to a deadlock and specifically prevents using
> > lockdep on Qualcomm platforms.
> > 
> > Not sure if you want to propagate whether the bus semaphore is held to
> > pcie_aspm_pm_state_change() or if there was some alternative to
> > restoring this function which should be explored instead.
> 
> So to summarise, this patch, which is now commit
> 
> 	f93e71aea6c6 ("Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()"")
> 
> introduced a regression in 6.7-final for Qualcomm platforms (and some
> Intel platforms) similar to the one recently fixed by commit
> 
> 	f352ce999260 ("PCI: qcom: Fix potential deadlock when enabling ASPM").
> 
> Johan
> 
> 
> #regzbot introduced: f93e71aea6c6
> 
> >    ============================================
> >    WARNING: possible recursive locking detected
> >    6.7.0 #40 Not tainted
> >    --------------------------------------------
> >    kworker/u16:5/90 is trying to acquire lock:
> >    ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pcie_aspm_pm_state_change+0x58/0xdc
> >    pcieport 0002:00:00.0: PME: Signaling with IRQ 197
> >    
> >                but task is already holding lock:
> >    ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
> >    
> >                other info that might help us debug this:
> >     Possible unsafe locking scenario:
> > 
> >           CPU0
> >           ----
> >      lock(pci_bus_sem);
> >      lock(pci_bus_sem);
> >    
> >                 *** DEADLOCK ***
> > 
> >     May be due to missing lock nesting notation
> > 
> >    4 locks held by kworker/u16:5/90:
> >     #0: ffff06c5c0008d38 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
> >     #1: ffff800081c0bdd0 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x150/0x53c
> >     #2: ffff06c5c0b7d0f8 (&dev->mutex){....}-{3:3}, at: __driver_attach_async_helper+0x3c/0xf4
> >     #3: ffffacfa78ced000 (pci_bus_sem){++++}-{3:3}, at: pci_walk_bus+0x34/0xbc
> >    
> >                stack backtrace:
> >    CPU: 1 PID: 90 Comm: kworker/u16:5 Not tainted 6.7.0 #40
> >    Hardware name: LENOVO 21BYZ9SRUS/21BYZ9SRUS, BIOS N3HET53W (1.25 ) 10/12/2022
> >    Workqueue: events_unbound async_run_entry_fn
> >    Call trace:
> >     dump_backtrace+0x9c/0x11c
> >     show_stack+0x18/0x24
> >     dump_stack_lvl+0x60/0xac
> >     dump_stack+0x18/0x24
> >     print_deadlock_bug+0x25c/0x348
> >     __lock_acquire+0x10a4/0x2064
> >     lock_acquire+0x1e8/0x318
> >     down_read+0x60/0x184
> >     pcie_aspm_pm_state_change+0x58/0xdc
> >     pci_set_full_power_state+0xa8/0x114
> >     pci_set_power_state+0xc4/0x120
> >     qcom_pcie_enable_aspm+0x1c/0x3c [pcie_qcom]
> >     pci_walk_bus+0x64/0xbc
> >     qcom_pcie_host_post_init_2_7_0+0x28/0x34 [pcie_qcom]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: PCI/ASPM locking regression in 6.7-final (was: Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()")
  2024-01-22 18:26       ` Bjorn Helgaas
@ 2024-01-23 17:25         ` Johan Hovold
  2024-01-23 22:36           ` Bjorn Helgaas
  0 siblings, 1 reply; 10+ messages in thread
From: Johan Hovold @ 2024-01-23 17:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Michael Schaller, Kai-Heng Feng, linux-pci, linux-kernel,
	regressions, Maciej W . Rozycki, Ajay Agarwal,
	Kuppuswamy Sathyanarayanan, Greg Kroah-Hartman, Heiner Kallweit,
	Johan Hovold, Bjorn Helgaas, stable, regressions

On Mon, Jan 22, 2024 at 12:26:15PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:

> > I never got a reply to this one so resending with updated Subject in
> > case it got buried in your inbox.
> 
> I did see it but decided it was better to fix the problem with resume
> causing an unintended reboot, even though fixing that meant breaking
> lockdep again, since I don't think we have user reports of the
> potential deadlock lockdep finds.

That may be because I fixed the previous regression in 6.7-rc1 before
any users had a chance to hit the deadlock on Qualcomm platforms.

I can easily trigger a deadlock on the X13s by instrumenting 6.7-final
with a delay to increase the race window.

And any user hitting this occasionally is likely not going to be able to
track it down to this lock inversion (unless they have lockdep enabled).
 
> 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a
> start at fixing other problems and also improving the ASPM style, so I
> hope somebody steps up to fix both it and the lockdep issue.  I
> haven't looked at it enough to have a preference for *how* to fix it.

Ok, but since you were the one introducing the locking regression in
6.7-final shouldn't you look into fixing it?

Especially if there were alternatives to restoring the offending commit
which would solve the underlying issue for the resume failure without
breaking other platforms.

I don't want to spend more time on this if the offending commit could
simply be reverted.

Johan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: PCI/ASPM locking regression in 6.7-final (was: Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()")
  2024-01-23 17:25         ` Johan Hovold
@ 2024-01-23 22:36           ` Bjorn Helgaas
  2024-01-24  8:16             ` Johan Hovold
  0 siblings, 1 reply; 10+ messages in thread
From: Bjorn Helgaas @ 2024-01-23 22:36 UTC (permalink / raw)
  To: Johan Hovold
  Cc: Michael Schaller, Kai-Heng Feng, linux-pci, linux-kernel,
	regressions, Maciej W . Rozycki, Ajay Agarwal,
	Kuppuswamy Sathyanarayanan, Greg Kroah-Hartman, Heiner Kallweit,
	Johan Hovold, Bjorn Helgaas, stable, regressions

On Tue, Jan 23, 2024 at 06:25:52PM +0100, Johan Hovold wrote:
> On Mon, Jan 22, 2024 at 12:26:15PM -0600, Bjorn Helgaas wrote:
> > On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:
> > > I never got a reply to this one so resending with updated Subject in
> > > case it got buried in your inbox.
> > 
> > I did see it but decided it was better to fix the problem with resume
> > causing an unintended reboot, even though fixing that meant breaking
> > lockdep again, since I don't think we have user reports of the
> > potential deadlock lockdep finds.
> 
> That may be because I fixed the previous regression in 6.7-rc1 before
> any users had a chance to hit the deadlock on Qualcomm platforms.
> 
> I can easily trigger a deadlock on the X13s by instrumenting 6.7-final
> with a delay to increase the race window.
> 
> And any user hitting this occasionally is likely not going to be able to
> track it down to this lock inversion (unless they have lockdep enabled).

I agree, it's a problem we need to fix.

> > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a
> > start at fixing other problems and also improving the ASPM style, so I
> > hope somebody steps up to fix both it and the lockdep issue.  I
> > haven't looked at it enough to have a preference for *how* to fix it.
> 
> Ok, but since you were the one introducing the locking regression in
> 6.7-final shouldn't you look into fixing it?
> 
> Especially if there were alternatives to restoring the offending commit
> which would solve the underlying issue for the resume failure without
> breaking other platforms.

Did somebody propose an alternate patch?  If so, I missed it, but we
could look at it now.

> I don't want to spend more time on this if the offending commit could
> simply be reverted.

I don't quite follow.  By simply reverting, do you mean to revert
f93e71aea6c6 ("Revert "PCI/ASPM: Remove
pcie_aspm_pm_state_change()"")?  IIUC that would break Michael's
machine again.

Bjorn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: PCI/ASPM locking regression in 6.7-final (was: Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()")
  2024-01-23 22:36           ` Bjorn Helgaas
@ 2024-01-24  8:16             ` Johan Hovold
  2024-01-30 10:07               ` Johan Hovold
  0 siblings, 1 reply; 10+ messages in thread
From: Johan Hovold @ 2024-01-24  8:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Michael Schaller, Kai-Heng Feng, linux-pci, linux-kernel,
	regressions, Maciej W . Rozycki, Ajay Agarwal,
	Kuppuswamy Sathyanarayanan, Greg Kroah-Hartman, Heiner Kallweit,
	Johan Hovold, Bjorn Helgaas, stable, regressions

On Tue, Jan 23, 2024 at 04:36:48PM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 23, 2024 at 06:25:52PM +0100, Johan Hovold wrote:
> > On Mon, Jan 22, 2024 at 12:26:15PM -0600, Bjorn Helgaas wrote:
> > > On Mon, Jan 22, 2024 at 11:53:35AM +0100, Johan Hovold wrote:

> > > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") was a
> > > start at fixing other problems and also improving the ASPM style, so I
> > > hope somebody steps up to fix both it and the lockdep issue.  I
> > > haven't looked at it enough to have a preference for *how* to fix it.
> > 
> > Ok, but since you were the one introducing the locking regression in
> > 6.7-final shouldn't you look into fixing it?
> > 
> > Especially if there were alternatives to restoring the offending commit
> > which would solve the underlying issue for the resume failure without
> > breaking other platforms.
> 
> Did somebody propose an alternate patch?  If so, I missed it, but we
> could look at it now.

I've only skimmed the discussion leading up to the revert, but I got the
impression that other alternatives were looked at as it was still not
clear what the underlying issue actually was.

As Michael and Thorsten pointed out before the revert, it may have been
better not to do a last minute revert of a 16 month old commit which
risks introducing regressions (and brought back another sysfs issue
IIUC) before fully understanding what is really going on here.

> > I don't want to spend more time on this if the offending commit could
> > simply be reverted.
> 
> I don't quite follow.  By simply reverting, do you mean to revert
> f93e71aea6c6 ("Revert "PCI/ASPM: Remove
> pcie_aspm_pm_state_change()"")?  IIUC that would break Michael's
> machine again.

Right, at least until that issue is fully understood and alternative
fixes have been considered.

If that's not an option, we need to rework core to pass a flag through
more than one layer to indicate whether pcie_aspm_pm_state_change()
should take the bus semaphore or not. I'd rather not do that if it can
be avoided.

Johan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: PCI/ASPM locking regression in 6.7-final (was: Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()")
  2024-01-24  8:16             ` Johan Hovold
@ 2024-01-30 10:07               ` Johan Hovold
  0 siblings, 0 replies; 10+ messages in thread
From: Johan Hovold @ 2024-01-30 10:07 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Michael Schaller, Kai-Heng Feng, linux-pci, linux-kernel,
	regressions, Maciej W . Rozycki, Ajay Agarwal,
	Kuppuswamy Sathyanarayanan, Greg Kroah-Hartman, Heiner Kallweit,
	Johan Hovold, Bjorn Helgaas, stable, regressions

On Wed, Jan 24, 2024 at 09:16:38AM +0100, Johan Hovold wrote:
> On Tue, Jan 23, 2024 at 04:36:48PM -0600, Bjorn Helgaas wrote:

> > I don't quite follow.  By simply reverting, do you mean to revert
> > f93e71aea6c6 ("Revert "PCI/ASPM: Remove
> > pcie_aspm_pm_state_change()"")?  IIUC that would break Michael's
> > machine again.
> 
> Right, at least until that issue is fully understood and alternative
> fixes have been considered.
> 
> If that's not an option, we need to rework core to pass a flag through
> more than one layer to indicate whether pcie_aspm_pm_state_change()
> should take the bus semaphore or not. I'd rather not do that if it can
> be avoided.

As a revert appears unlikely to happen, let's fix the regression by
adding a new helper pci_set_power_state_locked() that can be called
with the bus lock held:

	https://lore.kernel.org/lkml/20240130100243.11011-1-johan+linaro@kernel.org/

Johan

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2024-01-30 10:07 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <76c61361-b8b4-435f-a9f1-32b716763d62@5challer.de>
2024-01-02 23:25 ` [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()" Bjorn Helgaas
2024-01-02 23:33   ` Kuppuswamy Sathyanarayanan
2024-01-03  0:12     ` Bjorn Helgaas
2024-01-08  8:39   ` Johan Hovold
2024-01-22 10:53     ` PCI/ASPM locking regression in 6.7-final (was: Re: [PATCH] Revert "PCI/ASPM: Remove pcie_aspm_pm_state_change()") Johan Hovold
2024-01-22 18:26       ` Bjorn Helgaas
2024-01-23 17:25         ` Johan Hovold
2024-01-23 22:36           ` Bjorn Helgaas
2024-01-24  8:16             ` Johan Hovold
2024-01-30 10:07               ` Johan Hovold

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).