From: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
To: Brian Norris <briannorris@chromium.org>
Cc: Lukas Wunner <lukas@wunner.de>,
Bjorn Helgaas <bhelgaas@google.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-pm@vger.kernel.org,
"Rafael J . Wysocki" <rafael@kernel.org>,
linux-pci@vger.kernel.org
Subject: Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
Date: Mon, 20 Oct 2025 18:56:41 +0300 (EEST) [thread overview]
Message-ID: <08976178-298f-79d9-1d63-cff5a4e56cc3@linux.intel.com> (raw)
In-Reply-To: <aPKANja_k1gogTAU@google.com>
[-- Attachment #1: Type: text/plain, Size: 7717 bytes --]
On Fri, 17 Oct 2025, Brian Norris wrote:
> Hi Ilpo and Lukas,
>
> I'll reply to both of you inline:
I see you posted v2 but I'm answering here to have the context available.
Mostly things seem okay after your explanation, I think the only question
mark is a driver calling pci_resize_resource() directly to resize BARs.
> On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> >
> > > [cc += Ilpo]
> > >
> > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > expected that user space may override this via sysfs [1].
> >
> > Is this true as pm_runtime_forbid() also increases PM usage count?
>
> Yes it's true. See below.
>
> > "void pm_runtime_forbid(struct device *dev);
> >
> > unset the power.runtime_auto flag for the device and increase its
> > usage counter (used by the /sys/devices/.../power/control interface to
> > effectively prevent the device from being power managed at run time)"
>
> Right, but sysfs `echo auto > .../power/control` performs the inverse --
> pm_runtime_allow() -- which decrements that count.
Fair enough, I didn't check what it does.
IMO, the details about how the usage count behaves should be part of the
changelog as that documentation I quoted sounded like user control is
prevented when forbidden. I see you've put this part of the explanation
into the v2 as well so I suggest you explain the usage count in the change
so it is recorded in the commit if somebody has to look at this commit
years from now.
> > > > Now, sometime after initial scan, a PCI device receives its BAR
> > > > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > > >
> > > > If a PCI device is allowed to suspend between pci_scan_slot() and
> > > > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > > > save/restore incorrect BAR configuration for the device, and the device
> > > > may cease to function.
> > > >
> > > > This behavior races with user space, since user space may enable runtime
> > > > PM [1] as soon as it sees the device, which may be before BAR
> > > > configuration.
> > > >
> > > > Prevent suspending in this intermediate state by holding a runtime PM
> > > > reference until the device is fully initialized and ready for probe().
> > >
> > > Not sure if that is comprehensible by everybody.
>
> Yeah, thanks for trying to clarify. After getting too far into the weeds
> on a bug, I sometimes don't spend the appropriate time on writing a
> simple problem description. Maybe I can do better on a v2.
>
> > > The point is that
> > > unbound devices are left in D0 but are nevertheless allowed to
> > > (logically) runtime suspend. And pci_pm_runtime_suspend() may call
> > > pci_save_state() while config space isn't fully initialized yet,
> > > or pci_pm_runtime_resume() may call pci_restore_state() (via
> > > pci_pm_default_resume_early()) and overwrite initialized config space
> > > with uninitialized data.
>
> Ack.
>
> > > Have you actually seen this happen in practice?
>
> Yes, that's why I spent my time debugging and submitting this patch :)
Thanks for doing it! :-)
> > > Normally enumeration
> > > happens during subsys_initcall time, when user space isn't running yet.
> > > Hotplug may be an exception though.
>
> Hotplug, rescan (e.g., when pwrctrl is in use, power may be stablished
> later on, and it triggers a bus rescan; pwrctrl drivers can be modules),
> or PCI controller drivers built as modules.
>
> I happen to be using both pwrctrl and controller drivers as modules, so
> I hit it that way.
>
> > Adding that pm_runtime_get_noresume() doesn't look useful given
> > pm_runtime_forbid() already increases PM usage count. If this problem is
> > actually seen in practice, there could a bug elsewhere where something
> > decrements usage count too early so this change "helps" by double
> > incrementing the usage count.
> >
> > To find more information what's going on, one could try to trace events
> > for the PM usage count (though last time I looked not all paths that
> > change PM usage count were covered by the event and adding the
> > trace_event() calls into the header turned out too much magic for me to
> > figure out so I couldn't solve the problem).
>
> See above. forbid() is not a guaranteed blocker, because user space can
> undo it.
>
> > > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > > resource allocation and may judge whether this can actually happen.
> >
> > I can see there could be other failure modes besides just saving wrong
> > config if devices get suspended underneath the resource assignment
> > algorithm.
> >
> > Besides hotplug, also BAR resize does changes the resources and BARs.
> > This case is not helped by this patch.
>
> Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
> references (pci_config_pm_runtime_{get,put}()) and therefore should not
> generally have the same problem.
Okay, seem fine for the PCI core part.
Driver's can also trigger BAR resize by calling pci_resize_resource()
directly but I've no idea how the usage counts behave (TBH, PM isn't my
strongest forte even if Lukas pulled me in to comment).
> pci-driver.c's runtime suspend will
> save a new copy of the registers the next time we suspend after resize.
>
> (Now, some drivers could have problems if they try to stash a static
> copy via pci_store_saved_state()/pci_load_saved_state(), but that
> invites plenty of its own problems anyway.)
>
> > I also recently learned some DT platforms do the "actual" scan for the bus
> > later on Link Up event through irq which could perhaps occur late enough,
> > I dunno for sure.
>
> Sure, but that'd be covered by my patch, as those (re)scans would
> discover new devices in the same scan+add flow.
Okay, maybe it's fine like the rest. I was mostly trying to think
non-subsys_initcall() cases I knew of without contextualizing them back to
this patch.
> > > I think the code comments you're adding are a little verbose and a simple
> > > /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.
> >
> > I'm also not entirely convinced these always pair
>
> That's a very valid question. There are so many variations of scan+add
> that it's been hard for me to tell.
I've noticed... unfortunately I find myself often in the same boat. :-/
> I've studied the code pretty
> closely, and tested what I could, but I don't have hotplug systems on
> hand, and I definitely could miss something.
>
> FWIW, Rafael suggested/implied an alternative, where I could simply
> delay pm_runtime_enable() until pci_bus_add_device(). That would dodge
> the pairing questions, I think.
Yes.
> > or if the pci_dev may
> > get removed before pci_bus_add_device(), see e.g., enable_slot() in
> > hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before
> > pci_bus_add_devices() (which could have other bugs too as is).
>
> I believe it should be OK if a device is removed before the
> pm_runtime_put_noidle() half of the pair. It just means the device gets
> destroyed with a nonzero PM usage count, which is legal.
Ah yes, I think I was too attached to the "pairing" thought at that point
to see this isn't like e.g. a lock/unlock pair.
--
i.
next prev parent reply other threads:[~2025-10-20 15:56 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-16 22:53 [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized Brian Norris
2025-10-17 8:32 ` Lukas Wunner
2025-10-17 11:49 ` Ilpo Järvinen
2025-10-17 17:43 ` Brian Norris
2025-10-17 19:20 ` Brian Norris
2025-10-20 15:56 ` Ilpo Järvinen [this message]
2025-10-20 18:52 ` Brian Norris
2025-10-21 11:27 ` Ilpo Järvinen
2025-10-21 12:53 ` Rafael J. Wysocki
2025-10-21 13:18 ` Ilpo Järvinen
2025-10-21 18:27 ` Brian Norris
2025-10-21 18:56 ` Rafael J. Wysocki
2025-10-21 18:13 ` Brian Norris
2025-10-22 13:38 ` Ilpo Järvinen
2025-10-17 9:45 ` Rafael J. Wysocki
2025-10-17 17:11 ` Brian Norris
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=08976178-298f-79d9-1d63-cff5a4e56cc3@linux.intel.com \
--to=ilpo.jarvinen@linux.intel.com \
--cc=bhelgaas@google.com \
--cc=briannorris@chromium.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=linux-pm@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=rafael@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).