All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mario Limonciello <superm1@kernel.org>
To: Lukas Wunner <lukas@wunner.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>,
	"open list:PCI SUBSYSTEM" <linux-pci@vger.kernel.org>,
	linux-pm@vger.kernel.org,
	"Rafael J . Wysocki" <rjw@rjwysocki.net>,
	Mario Limonciello <mario.limonciello@amd.com>
Subject: Re: [PATCH v3 2/2] PCI: Fix runtime PM usage count underflow on device unplug
Date: Sat, 21 Jun 2025 14:56:08 -0500	[thread overview]
Message-ID: <8d4d98b6-fec5-466f-bd2c-059d702c7860@kernel.org> (raw)
In-Reply-To: <aFcCaw_IZr-JuUYY@wunner.de>



On 6/21/25 2:05 PM, Lukas Wunner wrote:
> On Thu, Jun 19, 2025 at 09:55:35PM -0500, Mario Limonciello wrote:
>> When a USB4 dock is unplugged the PCIe bridge it's connected to will
>> remove issue a "Link Down" and "Card not detected event". The PCI core
>> will treat this as a surprise hotplug event and unconfigure all downstream
>> devices.
>>
>> pci_stop_bus_device() will call device_release_driver(). As part of device
>> release sequence pm_runtime_put_sync() is called for the device which will
>> decrement the runtime counter to 0. After this, the device remove callback
>> (pci_device_remove()) will be called which again calls pm_runtime_put_sync()
>> but as the counter is already 0 will cause an underflow.
>>
>> This behavior was introduced in commit 967577b062417 ("PCI/PM: Keep runtime
>> PM enabled for unbound PCI devices") to prevent asymmetrical get/put from
>> probe/remove, but this misses out on the point that when releasing a driver
>> the usage count is decremented from the device core.
>>
>> Drop the extra call from pci_device_remove().
>>
>> Fixes: 967577b062417 ("PCI/PM: Keep runtime PM enabled for unbound PCI devices")
> 
> This doesn't look right.  The refcount underflow issue seems new,
> we surely haven't been doing the wrong thing since 2012.
> 
> 
>> --- a/drivers/pci/pci-driver.c
>> +++ b/drivers/pci/pci-driver.c
>> @@ -478,9 +478,6 @@ static void pci_device_remove(struct device *dev)
>>   	pci_dev->driver = NULL;
>>   	pci_iov_remove(pci_dev);
>>   
>> -	/* Undo the runtime PM settings in local_pci_probe() */
>> -	pm_runtime_put_sync(dev);
>> -
> 
> local_pci_probe() increases the refcount to keep the device in D0.
> If the driver wants to use runtime suspend, it needs to decrement
> the refcount on ->probe() and re-increment on ->remove().
> 
> In the dmesg output attached to...
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=220216
> 
> ... the device exhibiting the refcount underflow is a PCIe port.
> Are you also seeing this on a PCIe port or is it a different device?

The device with the underflow is the disconnected PCIe bridge.

In my case it was this bridge that was downstream.

pci 0000:02:04.0: [8086:15ef] type 01 class 0x060400 PCIe Switch 
Downstream Port
pci 0000:02:04.0: PCI bridge to [bus 04]
pci 0000:02:04.0: enabling Extended Tags
pci 0000:02:04.0: supports D1 D2
pci 0000:02:04.0: PME# supported from D0 D1 D2 D3hot D3cold

> 
> So the refcount decrement happens in pcie_portdrv_probe() and
> the refcount increment happens in pcie_portdrv_remove().
> Both times it's conditional on pci_bridge_d3_possible().
> Does that return a different value on probe versus remove?
> 
> Does any of the port service drivers decrement the refcount
> once too often?  I've just looked through pciehp but cannot
> find anything out of the ordinary.
> 
> Looking through recent changes, 002bf2fbc00e and bca84a7b93fd
> look like potential candidates causing a regression, but the
> former is for AER (which isn't used in the dmesg attached to
> the bugzilla) and the latter touches suspend on system sleep,
> not runtime suspend.
> 
> Can you maybe instrument the pm_runtime_{get,put}*() functions
> with a printk() and/or dump_stack() to see where a gratuitous
> refcount decrement occurs?
> 
That's exactly what I did to conclude this call was an extra one.


Here's the drop to 0:

pm_runtime: 0000:02:04.0 usage count is now 0 from 
__pm_runtime_idle+0x6f/0xd0
CPU: 7 UID: 0 PID: 196 Comm: irq/36-pciehp Not tainted 
6.16.0-rc2-00086-g4156cebf8a54 #281 PREEMPT(full) 
48d9dd361274f6fbfa98cd17f346d017a60ec738
Hardware name: Framework Laptop 13 (AMD Ryzen AI 300 Series)/FRANMGCP09, 
BIOS 03.03 03/10/2025
Call Trace:
  <TASK>
  dump_stack_lvl+0x53/0x70
  rpm_drop_usage_count+0x50/0x90
  __pm_runtime_idle+0x6f/0xd0
  device_release_driver_internal+0x197/0x200
  pci_stop_bus_device+0x6d/0x90
  pci_stop_bus_device+0x2c/0x90
  pci_stop_and_remove_bus_device+0x12/0x20
  pciehp_unconfigure_device+0x97/0x160
  pciehp_disable_slot+0x66/0x140
  pciehp_handle_presence_or_link_change+0x86/0x4b0
  pciehp_ist+0x196/0x280
  irq_thread_fn+0x20/0x60
  irq_thread+0x1ae/0x290
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xfb/0x260
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x14a/0x180
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>

And then here is a dev_WARN at the underflow case.

------------[ cut here ]------------
pcieport 0000:02:04.0: Runtime PM usage count underflow!
WARNING: CPU: 7 PID: 196 at drivers/base/power/runtime.c:1084 
rpm_drop_usage_count+0x82/0x90
Modules linked in: vivaldi_fmap aesni_intel thunderbolt(+) ccp nvme_core 
i8042 tpm_tis i2c_hid_acpi serio tpm_tis_core i2c_hid tpm libaescfb 
ecdh_generic
CPU: 7 UID: 0 PID: 196 Comm: irq/36-pciehp Not tainted 
6.16.0-rc2-00086-g4156cebf8a54 #281 PREEMPT(full) 
48d9dd361274f6fbfa98cd17f346d017a60ec738
Hardware name: Framework Laptop 13 (AMD Ryzen AI 300 Series)/FRANMGCP09, 
BIOS 03.03 03/10/2025
RIP: 0010:rpm_drop_usage_count+0x82/0x90
Code: f0 ff 87 b0 01 00 00 48 8b 5f 50 48 85 db 75 03 48 8b 1f e8 20 44 
fe ff 48 89 da 48 c7 c7 80 52 c8 a0 48 89 c6 e8 8e 3f 6e ff <0f> 0b bb 
ea ff ff ff eb 9b>
RSP: 0018:ffffd240c09e7c28 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff8997c213f430 RCX: 0000000000000027
RDX: ffff899f1dbdc288 RSI: 0000000000000001 RDI: ffff899f1dbdc280
RBP: ffff8997c29150c8 R08: 0000000000000000 R09: 0000000000000003
R10: ffffd240c09e7ab0 R11: ffff899f1e129368 R12: ffffffffa1161ca0
R13: ffff8997c2915148 R14: 0000000000000080 R15: ffff8997c1613914
FS:  0000000000000000(0000) GS:ffff899f7c175000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fbbc1357750 CR3: 0000000111245000 CR4: 0000000000b50ef0
Call Trace:
  <TASK>
  __pm_runtime_idle+0x6f/0xd0
  pci_device_remove+0x7e/0xb0
  device_release_driver_internal+0x19f/0x200
  pci_stop_bus_device+0x6d/0x90
  pci_stop_bus_device+0x2c/0x90
  pci_stop_and_remove_bus_device+0x12/0x20
  pciehp_unconfigure_device+0x97/0x160
  pciehp_disable_slot+0x66/0x140
  pciehp_handle_presence_or_link_change+0x86/0x4b0
  pciehp_ist+0x196/0x280
  irq_thread_fn+0x20/0x60
  irq_thread+0x1ae/0x290
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xfb/0x260
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x14a/0x180
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
---[ end trace 0000000000000000 ]---

  reply	other threads:[~2025-06-21 19:56 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-20  2:55 [PATCH v3 0/2] Don't make noise about disconnected USB4 devices Mario Limonciello
2025-06-20  2:55 ` [PATCH v3 1/2] PCI/PM: Skip resuming to D0 if disconnected Mario Limonciello
2025-06-23 17:48   ` Lukas Wunner
2025-06-20  2:55 ` [PATCH v3 2/2] PCI: Fix runtime PM usage count underflow on device unplug Mario Limonciello
2025-06-21 19:05   ` Lukas Wunner
2025-06-21 19:56     ` Mario Limonciello [this message]
2025-06-22  4:43       ` Lukas Wunner
2025-06-22 18:39         ` Mario Limonciello
2025-06-23  1:47           ` Mario Limonciello
2025-06-23  6:53             ` Lukas Wunner
2025-06-23  6:43           ` Lukas Wunner
2025-06-23  7:37             ` Lukas Wunner
2025-06-23 10:05               ` Lukas Wunner
2025-06-23 10:11                 ` Rafael J. Wysocki
2025-06-23 11:37                   ` Mario Limonciello
2025-06-23 12:19                     ` Lukas Wunner
2025-06-23 12:45                       ` Mario Limonciello
2025-06-23 17:23                     ` Lukas Wunner
2025-06-23 17:25                       ` Mario Limonciello

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8d4d98b6-fec5-466f-bd2c-059d702c7860@kernel.org \
    --to=superm1@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mario.limonciello@amd.com \
    --cc=rjw@rjwysocki.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.