From: Brian Norris <briannorris@chromium.org>
To: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
Cc: Lukas Wunner <lukas@wunner.de>,
Bjorn Helgaas <bhelgaas@google.com>,
LKML <linux-kernel@vger.kernel.org>,
linux-pm@vger.kernel.org,
"Rafael J . Wysocki" <rafael@kernel.org>,
linux-pci@vger.kernel.org
Subject: Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
Date: Fri, 17 Oct 2025 10:43:18 -0700 [thread overview]
Message-ID: <aPKANja_k1gogTAU@google.com> (raw)
In-Reply-To: <67381f3b-4aee-a314-b5dd-2b7d987a7794@linux.intel.com>
Hi Ilpo and Lukas,
I'll reply to both of you inline:
On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> On Fri, 17 Oct 2025, Lukas Wunner wrote:
>
> > [cc += Ilpo]
> >
> > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > PCI devices are created via pci_scan_slot() and similar, and are
> > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > expected that user space may override this via sysfs [1].
>
> Is this true as pm_runtime_forbid() also increases PM usage count?
Yes it's true. See below.
> "void pm_runtime_forbid(struct device *dev);
>
> unset the power.runtime_auto flag for the device and increase its
> usage counter (used by the /sys/devices/.../power/control interface to
> effectively prevent the device from being power managed at run time)"
Right, but sysfs `echo auto > .../power/control` performs the inverse --
pm_runtime_allow() -- which decrements that count.
> > > Now, sometime after initial scan, a PCI device receives its BAR
> > > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > >
> > > If a PCI device is allowed to suspend between pci_scan_slot() and
> > > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > > save/restore incorrect BAR configuration for the device, and the device
> > > may cease to function.
> > >
> > > This behavior races with user space, since user space may enable runtime
> > > PM [1] as soon as it sees the device, which may be before BAR
> > > configuration.
> > >
> > > Prevent suspending in this intermediate state by holding a runtime PM
> > > reference until the device is fully initialized and ready for probe().
> >
> > Not sure if that is comprehensible by everybody.
Yeah, thanks for trying to clarify. After getting too far into the weeds
on a bug, I sometimes don't spend the appropriate time on writing a
simple problem description. Maybe I can do better on a v2.
> > The point is that
> > unbound devices are left in D0 but are nevertheless allowed to
> > (logically) runtime suspend. And pci_pm_runtime_suspend() may call
> > pci_save_state() while config space isn't fully initialized yet,
> > or pci_pm_runtime_resume() may call pci_restore_state() (via
> > pci_pm_default_resume_early()) and overwrite initialized config space
> > with uninitialized data.
Ack.
> > Have you actually seen this happen in practice?
Yes, that's why I spent my time debugging and submitting this patch :)
> > Normally enumeration
> > happens during subsys_initcall time, when user space isn't running yet.
> > Hotplug may be an exception though.
Hotplug, rescan (e.g., when pwrctrl is in use, power may be stablished
later on, and it triggers a bus rescan; pwrctrl drivers can be modules),
or PCI controller drivers built as modules.
I happen to be using both pwrctrl and controller drivers as modules, so
I hit it that way.
> Adding that pm_runtime_get_noresume() doesn't look useful given
> pm_runtime_forbid() already increases PM usage count. If this problem is
> actually seen in practice, there could a bug elsewhere where something
> decrements usage count too early so this change "helps" by double
> incrementing the usage count.
>
> To find more information what's going on, one could try to trace events
> for the PM usage count (though last time I looked not all paths that
> change PM usage count were covered by the event and adding the
> trace_event() calls into the header turned out too much magic for me to
> figure out so I couldn't solve the problem).
See above. forbid() is not a guaranteed blocker, because user space can
undo it.
> > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > resource allocation and may judge whether this can actually happen.
>
> I can see there could be other failure modes besides just saving wrong
> config if devices get suspended underneath the resource assignment
> algorithm.
>
> Besides hotplug, also BAR resize does changes the resources and BARs.
> This case is not helped by this patch.
Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
references (pci_config_pm_runtime_{get,put}()) and therefore should not
generally have the same problem. pci-driver.c's runtime suspend will
save a new copy of the registers the next time we suspend after resize.
(Now, some drivers could have problems if they try to stash a static
copy via pci_store_saved_state()/pci_load_saved_state(), but that
invites plenty of its own problems anyway.)
> I also recently learned some DT platforms do the "actual" scan for the bus
> later on Link Up event through irq which could perhaps occur late enough,
> I dunno for sure.
Sure, but that'd be covered by my patch, as those (re)scans would
discover new devices in the same scan+add flow.
> > I think the code comments you're adding are a little verbose and a simple
> > /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.
>
> I'm also not entirely convinced these always pair
That's a very valid question. There are so many variations of scan+add
that it's been hard for me to tell. I've studied the code pretty
closely, and tested what I could, but I don't have hotplug systems on
hand, and I definitely could miss something.
FWIW, Rafael suggested/implied an alternative, where I could simply
delay pm_runtime_enable() until pci_bus_add_device(). That would dodge
the pairing questions, I think.
> or if the pci_dev may
> get removed before pci_bus_add_device(), see e.g., enable_slot() in
> hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before
> pci_bus_add_devices() (which could have other bugs too as is).
I believe it should be OK if a device is removed before the
pm_runtime_put_noidle() half of the pair. It just means the device gets
destroyed with a nonzero PM usage count, which is legal.
> > Also, I think it is neither necessary nor useful to actually cc the e-mail
> > to stable@vger.kernel.org if you include a stable designation in the
> > patch. I believe stable maintainers only pick up backports from that list,
> > not patches intended for upstream.
>
> Probably some tool will auto-insert the Cc: tags as receipients so it
> might be non-trivial to get rid of it.
Yeah, git-send-email applies "Cc:" lines automatically, so I expect it's
very common for people to do that. I use U-Boot's patman, which wraps
git-send-email. I just assume stable@ folks expect that or are at least
used to it, because ... well, "Cc:" usually means "copy this on the
email" ...
git-send-email has --suppress-cc, and maybe I can convince my tools to
do that. (e.g., `git config sendemail.suppresscc cc`)
Brian
next prev parent reply other threads:[~2025-10-17 17:43 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-16 22:53 [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized Brian Norris
2025-10-17 8:32 ` Lukas Wunner
2025-10-17 11:49 ` Ilpo Järvinen
2025-10-17 17:43 ` Brian Norris [this message]
2025-10-17 19:20 ` Brian Norris
2025-10-20 15:56 ` Ilpo Järvinen
2025-10-20 18:52 ` Brian Norris
2025-10-21 11:27 ` Ilpo Järvinen
2025-10-21 12:53 ` Rafael J. Wysocki
2025-10-21 13:18 ` Ilpo Järvinen
2025-10-21 18:27 ` Brian Norris
2025-10-21 18:56 ` Rafael J. Wysocki
2025-10-21 18:13 ` Brian Norris
2025-10-22 13:38 ` Ilpo Järvinen
2025-10-17 9:45 ` Rafael J. Wysocki
2025-10-17 17:11 ` Brian Norris
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aPKANja_k1gogTAU@google.com \
--to=briannorris@chromium.org \
--cc=bhelgaas@google.com \
--cc=ilpo.jarvinen@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=linux-pm@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=rafael@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.