From: Brian Norris <briannorris@chromium.org>
To: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
Cc: Lukas Wunner <lukas@wunner.de>,
Bjorn Helgaas <bhelgaas@google.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-pm@vger.kernel.org,
"Rafael J . Wysocki" <rafael@kernel.org>,
linux-pci@vger.kernel.org
Subject: Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
Date: Fri, 17 Oct 2025 10:43:18 -0700 [thread overview]
Message-ID: <aPKANja_k1gogTAU@google.com> (raw)
In-Reply-To: <67381f3b-4aee-a314-b5dd-2b7d987a7794@linux.intel.com>
Hi Ilpo and Lukas,
I'll reply to both of you inline:
On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> On Fri, 17 Oct 2025, Lukas Wunner wrote:
>
> > [cc += Ilpo]
> >
> > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > PCI devices are created via pci_scan_slot() and similar, and are
> > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > expected that user space may override this via sysfs [1].
>
> Is this true as pm_runtime_forbid() also increases PM usage count?
Yes it's true. See below.
> "void pm_runtime_forbid(struct device *dev);
>
> unset the power.runtime_auto flag for the device and increase its
> usage counter (used by the /sys/devices/.../power/control interface to
> effectively prevent the device from being power managed at run time)"
Right, but sysfs `echo auto > .../power/control` performs the inverse --
pm_runtime_allow() -- which decrements that count.
> > > Now, sometime after initial scan, a PCI device receives its BAR
> > > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > >
> > > If a PCI device is allowed to suspend between pci_scan_slot() and
> > > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > > save/restore incorrect BAR configuration for the device, and the device
> > > may cease to function.
> > >
> > > This behavior races with user space, since user space may enable runtime
> > > PM [1] as soon as it sees the device, which may be before BAR
> > > configuration.
> > >
> > > Prevent suspending in this intermediate state by holding a runtime PM
> > > reference until the device is fully initialized and ready for probe().
> >
> > Not sure if that is comprehensible by everybody.
Yeah, thanks for trying to clarify. After getting too far into the weeds
on a bug, I sometimes don't spend the appropriate time on writing a
simple problem description. Maybe I can do better on a v2.
> > The point is that
> > unbound devices are left in D0 but are nevertheless allowed to
> > (logically) runtime suspend. And pci_pm_runtime_suspend() may call
> > pci_save_state() while config space isn't fully initialized yet,
> > or pci_pm_runtime_resume() may call pci_restore_state() (via
> > pci_pm_default_resume_early()) and overwrite initialized config space
> > with uninitialized data.
Ack.
> > Have you actually seen this happen in practice?
Yes, that's why I spent my time debugging and submitting this patch :)
> > Normally enumeration
> > happens during subsys_initcall time, when user space isn't running yet.
> > Hotplug may be an exception though.
Hotplug, rescan (e.g., when pwrctrl is in use, power may be stablished
later on, and it triggers a bus rescan; pwrctrl drivers can be modules),
or PCI controller drivers built as modules.
I happen to be using both pwrctrl and controller drivers as modules, so
I hit it that way.
> Adding that pm_runtime_get_noresume() doesn't look useful given
> pm_runtime_forbid() already increases PM usage count. If this problem is
> actually seen in practice, there could a bug elsewhere where something
> decrements usage count too early so this change "helps" by double
> incrementing the usage count.
>
> To find more information what's going on, one could try to trace events
> for the PM usage count (though last time I looked not all paths that
> change PM usage count were covered by the event and adding the
> trace_event() calls into the header turned out too much magic for me to
> figure out so I couldn't solve the problem).
See above. forbid() is not a guaranteed blocker, because user space can
undo it.
> > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > resource allocation and may judge whether this can actually happen.
>
> I can see there could be other failure modes besides just saving wrong
> config if devices get suspended underneath the resource assignment
> algorithm.
>
> Besides hotplug, also BAR resize does changes the resources and BARs.
> This case is not helped by this patch.
Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
references (pci_config_pm_runtime_{get,put}()) and therefore should not
generally have the same problem. pci-driver.c's runtime suspend will
save a new copy of the registers the next time we suspend after resize.
(Now, some drivers could have problems if they try to stash a static
copy via pci_store_saved_state()/pci_load_saved_state(), but that
invites plenty of its own problems anyway.)
> I also recently learned some DT platforms do the "actual" scan for the bus
> later on Link Up event through irq which could perhaps occur late enough,
> I dunno for sure.
Sure, but that'd be covered by my patch, as those (re)scans would
discover new devices in the same scan+add flow.
> > I think the code comments you're adding are a little verbose and a simple
> > /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.
>
> I'm also not entirely convinced these always pair
That's a very valid question. There are so many variations of scan+add
that it's been hard for me to tell. I've studied the code pretty
closely, and tested what I could, but I don't have hotplug systems on
hand, and I definitely could miss something.
FWIW, Rafael suggested/implied an alternative, where I could simply
delay pm_runtime_enable() until pci_bus_add_device(). That would dodge
the pairing questions, I think.
> or if the pci_dev may
> get removed before pci_bus_add_device(), see e.g., enable_slot() in
> hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before
> pci_bus_add_devices() (which could have other bugs too as is).
I believe it should be OK if a device is removed before the
pm_runtime_put_noidle() half of the pair. It just means the device gets
destroyed with a nonzero PM usage count, which is legal.
> > Also, I think it is neither necessary nor useful to actually cc the e-mail
> > to stable@vger.kernel.org if you include a stable designation in the
> > patch. I believe stable maintainers only pick up backports from that list,
> > not patches intended for upstream.
>
> Probably some tool will auto-insert the Cc: tags as receipients so it
> might be non-trivial to get rid of it.
Yeah, git-send-email applies "Cc:" lines automatically, so I expect it's
very common for people to do that. I use U-Boot's patman, which wraps
git-send-email. I just assume stable@ folks expect that or are at least
used to it, because ... well, "Cc:" usually means "copy this on the
email" ...
git-send-email has --suppress-cc, and maybe I can convince my tools to
do that. (e.g., `git config sendemail.suppresscc cc`)
Brian
next prev parent reply other threads:[~2025-10-17 17:43 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-16 22:53 [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized Brian Norris
2025-10-17 8:32 ` Lukas Wunner
2025-10-17 11:49 ` Ilpo Järvinen
2025-10-17 17:43 ` Brian Norris [this message]
2025-10-17 19:20 ` Brian Norris
2025-10-20 15:56 ` Ilpo Järvinen
2025-10-20 18:52 ` Brian Norris
2025-10-21 11:27 ` Ilpo Järvinen
2025-10-21 12:53 ` Rafael J. Wysocki
2025-10-21 13:18 ` Ilpo Järvinen
2025-10-21 18:27 ` Brian Norris
2025-10-21 18:56 ` Rafael J. Wysocki
2025-10-21 18:13 ` Brian Norris
2025-10-22 13:38 ` Ilpo Järvinen
2025-10-17 9:45 ` Rafael J. Wysocki
2025-10-17 17:11 ` Brian Norris
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aPKANja_k1gogTAU@google.com \
--to=briannorris@chromium.org \
--cc=bhelgaas@google.com \
--cc=ilpo.jarvinen@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=linux-pm@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=rafael@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).