linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mario Limonciello <superm1@kernel.org>
To: Lukas Wunner <lukas@wunner.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	"open list:PCI SUBSYSTEM" <linux-pci@vger.kernel.org>,
	linux-pm@vger.kernel.org,
	"Rafael J . Wysocki" <rjw@rjwysocki.net>,
	Mario Limonciello <mario.limonciello@amd.com>
Subject: Re: [PATCH v3 2/2] PCI: Fix runtime PM usage count underflow on device unplug
Date: Sat, 21 Jun 2025 14:56:08 -0500	[thread overview]
Message-ID: <8d4d98b6-fec5-466f-bd2c-059d702c7860@kernel.org> (raw)
In-Reply-To: <aFcCaw_IZr-JuUYY@wunner.de>



On 6/21/25 2:05 PM, Lukas Wunner wrote:
> On Thu, Jun 19, 2025 at 09:55:35PM -0500, Mario Limonciello wrote:
>> When a USB4 dock is unplugged the PCIe bridge it's connected to will
>> remove issue a "Link Down" and "Card not detected event". The PCI core
>> will treat this as a surprise hotplug event and unconfigure all downstream
>> devices.
>>
>> pci_stop_bus_device() will call device_release_driver(). As part of device
>> release sequence pm_runtime_put_sync() is called for the device which will
>> decrement the runtime counter to 0. After this, the device remove callback
>> (pci_device_remove()) will be called which again calls pm_runtime_put_sync()
>> but as the counter is already 0 will cause an underflow.
>>
>> This behavior was introduced in commit 967577b062417 ("PCI/PM: Keep runtime
>> PM enabled for unbound PCI devices") to prevent asymmetrical get/put from
>> probe/remove, but this misses out on the point that when releasing a driver
>> the usage count is decremented from the device core.
>>
>> Drop the extra call from pci_device_remove().
>>
>> Fixes: 967577b062417 ("PCI/PM: Keep runtime PM enabled for unbound PCI devices")
> 
> This doesn't look right.  The refcount underflow issue seems new,
> we surely haven't been doing the wrong thing since 2012.
> 
> 
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -478,9 +478,6 @@ static void pci_device_remove(struct device *dev)
>>   	pci_dev->driver = NULL;
>>   	pci_iov_remove(pci_dev);
>>   
>> -	/* Undo the runtime PM settings in local_pci_probe() */
>> -	pm_runtime_put_sync(dev);
>> -
> 
> local_pci_probe() increases the refcount to keep the device in D0.
> If the driver wants to use runtime suspend, it needs to decrement
> the refcount on ->probe() and re-increment on ->remove().
> 
> In the dmesg output attached to...
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=220216
> 
> ... the device exhibiting the refcount underflow is a PCIe port.
> Are you also seeing this on a PCIe port or is it a different device?

The device with the underflow is the disconnected PCIe bridge.

In my case it was this bridge that was downstream.

pci 0000:02:04.0: [8086:15ef] type 01 class 0x060400 PCIe Switch 
Downstream Port
pci 0000:02:04.0: PCI bridge to [bus 04]
pci 0000:02:04.0: enabling Extended Tags
pci 0000:02:04.0: supports D1 D2
pci 0000:02:04.0: PME# supported from D0 D1 D2 D3hot D3cold

> 
> So the refcount decrement happens in pcie_portdrv_probe() and
> the refcount increment happens in pcie_portdrv_remove().
> Both times it's conditional on pci_bridge_d3_possible().
> Does that return a different value on probe versus remove?
> 
> Does any of the port service drivers decrement the refcount
> once too often?  I've just looked through pciehp but cannot
> find anything out of the ordinary.
> 
> Looking through recent changes, 002bf2fbc00e and bca84a7b93fd
> look like potential candidates causing a regression, but the
> former is for AER (which isn't used in the dmesg attached to
> the bugzilla) and the latter touches suspend on system sleep,
> not runtime suspend.
> 
> Can you maybe instrument the pm_runtime_{get,put}*() functions
> with a printk() and/or dump_stack() to see where a gratuitous
> refcount decrement occurs?
> 
That's exactly what I did to conclude this call was an extra one.


Here's the drop to 0:

pm_runtime: 0000:02:04.0 usage count is now 0 from 
__pm_runtime_idle+0x6f/0xd0
CPU: 7 UID: 0 PID: 196 Comm: irq/36-pciehp Not tainted 
6.16.0-rc2-00086-g4156cebf8a54 #281 PREEMPT(full) 
48d9dd361274f6fbfa98cd17f346d017a60ec738
Hardware name: Framework Laptop 13 (AMD Ryzen AI 300 Series)/FRANMGCP09, 
BIOS 03.03 03/10/2025
Call Trace:
  <TASK>
  dump_stack_lvl+0x53/0x70
  rpm_drop_usage_count+0x50/0x90
  __pm_runtime_idle+0x6f/0xd0
  device_release_driver_internal+0x197/0x200
  pci_stop_bus_device+0x6d/0x90
  pci_stop_bus_device+0x2c/0x90
  pci_stop_and_remove_bus_device+0x12/0x20
  pciehp_unconfigure_device+0x97/0x160
  pciehp_disable_slot+0x66/0x140
  pciehp_handle_presence_or_link_change+0x86/0x4b0
  pciehp_ist+0x196/0x280
  irq_thread_fn+0x20/0x60
  irq_thread+0x1ae/0x290
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xfb/0x260
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x14a/0x180
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>

And then here is a dev_WARN at the underflow case.

------------[ cut here ]------------
pcieport 0000:02:04.0: Runtime PM usage count underflow!
WARNING: CPU: 7 PID: 196 at drivers/base/power/runtime.c:1084 
rpm_drop_usage_count+0x82/0x90
Modules linked in: vivaldi_fmap aesni_intel thunderbolt(+) ccp nvme_core 
i8042 tpm_tis i2c_hid_acpi serio tpm_tis_core i2c_hid tpm libaescfb 
ecdh_generic
CPU: 7 UID: 0 PID: 196 Comm: irq/36-pciehp Not tainted 
6.16.0-rc2-00086-g4156cebf8a54 #281 PREEMPT(full) 
48d9dd361274f6fbfa98cd17f346d017a60ec738
Hardware name: Framework Laptop 13 (AMD Ryzen AI 300 Series)/FRANMGCP09, 
BIOS 03.03 03/10/2025
RIP: 0010:rpm_drop_usage_count+0x82/0x90
Code: f0 ff 87 b0 01 00 00 48 8b 5f 50 48 85 db 75 03 48 8b 1f e8 20 44 
fe ff 48 89 da 48 c7 c7 80 52 c8 a0 48 89 c6 e8 8e 3f 6e ff <0f> 0b bb 
ea ff ff ff eb 9b>
RSP: 0018:ffffd240c09e7c28 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8997c213f430 RCX: 0000000000000027
RDX: ffff899f1dbdc288 RSI: 0000000000000001 RDI: ffff899f1dbdc280
RBP: ffff8997c29150c8 R08: 0000000000000000 R09: 0000000000000003
R10: ffffd240c09e7ab0 R11: ffff899f1e129368 R12: ffffffffa1161ca0
R13: ffff8997c2915148 R14: 0000000000000080 R15: ffff8997c1613914
FS:  0000000000000000(0000) GS:ffff899f7c175000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbbc1357750 CR3: 0000000111245000 CR4: 0000000000b50ef0
Call Trace:
  <TASK>
  __pm_runtime_idle+0x6f/0xd0
  pci_device_remove+0x7e/0xb0
  device_release_driver_internal+0x19f/0x200
  pci_stop_bus_device+0x6d/0x90
  pci_stop_bus_device+0x2c/0x90
  pci_stop_and_remove_bus_device+0x12/0x20
  pciehp_unconfigure_device+0x97/0x160
  pciehp_disable_slot+0x66/0x140
  pciehp_handle_presence_or_link_change+0x86/0x4b0
  pciehp_ist+0x196/0x280
  irq_thread_fn+0x20/0x60
  irq_thread+0x1ae/0x290
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xfb/0x260
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x14a/0x180
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
---[ end trace 0000000000000000 ]---

  reply	other threads:[~2025-06-21 19:56 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-20  2:55 [PATCH v3 0/2] Don't make noise about disconnected USB4 devices Mario Limonciello
2025-06-20  2:55 ` [PATCH v3 1/2] PCI/PM: Skip resuming to D0 if disconnected Mario Limonciello
2025-06-23 17:48   ` Lukas Wunner
2025-06-20  2:55 ` [PATCH v3 2/2] PCI: Fix runtime PM usage count underflow on device unplug Mario Limonciello
2025-06-21 19:05   ` Lukas Wunner
2025-06-21 19:56     ` Mario Limonciello [this message]
2025-06-22  4:43       ` Lukas Wunner
2025-06-22 18:39         ` Mario Limonciello
2025-06-23  1:47           ` Mario Limonciello
2025-06-23  6:53             ` Lukas Wunner
2025-06-23  6:43           ` Lukas Wunner
2025-06-23  7:37             ` Lukas Wunner
2025-06-23 10:05               ` Lukas Wunner
2025-06-23 10:11                 ` Rafael J. Wysocki
2025-06-23 11:37                   ` Mario Limonciello
2025-06-23 12:19                     ` Lukas Wunner
2025-06-23 12:45                       ` Mario Limonciello
2025-06-23 17:23                     ` Lukas Wunner
2025-06-23 17:25                       ` Mario Limonciello

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8d4d98b6-fec5-466f-bd2c-059d702c7860@kernel.org \
    --to=superm1@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mario.limonciello@amd.com \
    --cc=rjw@rjwysocki.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).