linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>
To: Brian Norris <briannorris@chromium.org>
Cc: Lukas Wunner <lukas@wunner.de>,
	Bjorn Helgaas <bhelgaas@google.com>,
	 LKML <linux-kernel@vger.kernel.org>,
	linux-pm@vger.kernel.org,
	 "Rafael J . Wysocki" <rafael@kernel.org>,
	linux-pci@vger.kernel.org
Subject: Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
Date: Mon, 20 Oct 2025 18:56:41 +0300 (EEST)	[thread overview]
Message-ID: <08976178-298f-79d9-1d63-cff5a4e56cc3@linux.intel.com> (raw)
In-Reply-To: <aPKANja_k1gogTAU@google.com>

[-- Attachment #1: Type: text/plain, Size: 7717 bytes --]

On Fri, 17 Oct 2025, Brian Norris wrote:

> Hi Ilpo and Lukas,
> 
> I'll reply to both of you inline:

I see you posted v2 but I'm answering here to have the context available. 
Mostly things seem okay after your explanation, I think the only question 
mark is a driver calling pci_resize_resource() directly to resize BARs.

> On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> > 
> > > [cc += Ilpo]
> > > 
> > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > expected that user space may override this via sysfs [1].
> > 
> > Is this true as pm_runtime_forbid() also increases PM usage count?
> 
> Yes it's true. See below.
> 
> > "void pm_runtime_forbid(struct device *dev);
> > 
> > unset the power.runtime_auto flag for the device and increase its 
> > usage counter (used by the /sys/devices/.../power/control interface to 
> > effectively prevent the device from being power managed at run time)"
> 
> Right, but sysfs `echo auto > .../power/control` performs the inverse --
> pm_runtime_allow() -- which decrements that count.

Fair enough, I didn't check what it does.

IMO, the details about how the usage count behaves should be part of the 
changelog as that documentation I quoted sounded like user control is 
prevented when forbidden. I see you've put this part of the explanation 
into the v2 as well so I suggest you explain the usage count in the change 
so it is recorded in the commit if somebody has to look at this commit 
years from now.

> > > > Now, sometime after initial scan, a PCI device receives its BAR
> > > > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > > > 
> > > > If a PCI device is allowed to suspend between pci_scan_slot() and
> > > > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > > > save/restore incorrect BAR configuration for the device, and the device
> > > > may cease to function.
> > > > 
> > > > This behavior races with user space, since user space may enable runtime
> > > > PM [1] as soon as it sees the device, which may be before BAR
> > > > configuration.
> > > > 
> > > > Prevent suspending in this intermediate state by holding a runtime PM
> > > > reference until the device is fully initialized and ready for probe().
> > >
> > > Not sure if that is comprehensible by everybody.
> 
> Yeah, thanks for trying to clarify. After getting too far into the weeds
> on a bug, I sometimes don't spend the appropriate time on writing a
> simple problem description. Maybe I can do better on a v2.
> 
> > > The point is that
> > > unbound devices are left in D0 but are nevertheless allowed to
> > > (logically) runtime suspend.  And pci_pm_runtime_suspend() may call
> > > pci_save_state() while config space isn't fully initialized yet,
> > > or pci_pm_runtime_resume() may call pci_restore_state() (via
> > > pci_pm_default_resume_early()) and overwrite initialized config space
> > > with uninitialized data.
> 
> Ack.
> 
> > > Have you actually seen this happen in practice?
> 
> Yes, that's why I spent my time debugging and submitting this patch :)

Thanks for doing it! :-)

> > > Normally enumeration
> > > happens during subsys_initcall time, when user space isn't running yet.
> > > Hotplug may be an exception though.
> 
> Hotplug, rescan (e.g., when pwrctrl is in use, power may be stablished
> later on, and it triggers a bus rescan; pwrctrl drivers can be modules),
> or PCI controller drivers built as modules.
> 
> I happen to be using both pwrctrl and controller drivers as modules, so
> I hit it that way.
> 
> > Adding that pm_runtime_get_noresume() doesn't look useful given 
> > pm_runtime_forbid() already increases PM usage count. If this problem is 
> > actually seen in practice, there could a bug elsewhere where something 
> > decrements usage count too early so this change "helps" by double 
> > incrementing the usage count.
> > 
> > To find more information what's going on, one could try to trace events 
> > for the PM usage count (though last time I looked not all paths that 
> > change PM usage count were covered by the event and adding the 
> > trace_event() calls into the header turned out too much magic for me to 
> > figure out so I couldn't solve the problem).
> 
> See above. forbid() is not a guaranteed blocker, because user space can
> undo it.
> 
> > > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > > resource allocation and may judge whether this can actually happen.
> > 
> > I can see there could be other failure modes besides just saving wrong 
> > config if devices get suspended underneath the resource assignment 
> > algorithm.
> > 
> > Besides hotplug, also BAR resize does changes the resources and BARs.
> > This case is not helped by this patch.
> 
> Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
> references (pci_config_pm_runtime_{get,put}()) and therefore should not
> generally have the same problem.

Okay, seem fine for the PCI core part.

Driver's can also trigger BAR resize by calling pci_resize_resource() 
directly but I've no idea how the usage counts behave (TBH, PM isn't my 
strongest forte even if Lukas pulled me in to comment).

> pci-driver.c's runtime suspend will
> save a new copy of the registers the next time we suspend after resize.
> 
> (Now, some drivers could have problems if they try to stash a static
> copy via pci_store_saved_state()/pci_load_saved_state(), but that
> invites plenty of its own problems anyway.)
> 
> > I also recently learned some DT platforms do the "actual" scan for the bus 
> > later on Link Up event through irq which could perhaps occur late enough, 
> > I dunno for sure.
> 
> Sure, but that'd be covered by my patch, as those (re)scans would
> discover new devices in the same scan+add flow.

Okay, maybe it's fine like the rest. I was mostly trying to think 
non-subsys_initcall() cases I knew of without contextualizing them back to 
this patch.

> > > I think the code comments you're adding are a little verbose and a simple
> > > /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.
> > 
> > I'm also not entirely convinced these always pair
> 
> That's a very valid question. There are so many variations of scan+add
> that it's been hard for me to tell.

I've noticed... unfortunately I find myself often in the same boat. :-/

> I've studied the code pretty
> closely, and tested what I could, but I don't have hotplug systems on
> hand, and I definitely could miss something.
> 
> FWIW, Rafael suggested/implied an alternative, where I could simply
> delay pm_runtime_enable() until pci_bus_add_device(). That would dodge
> the pairing questions, I think.

Yes.

> > or if the pci_dev may 
> > get removed before pci_bus_add_device(), see e.g., enable_slot() in 
> > hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before 
> > pci_bus_add_devices() (which could have other bugs too as is).
> 
> I believe it should be OK if a device is removed before the
> pm_runtime_put_noidle() half of the pair. It just means the device gets
> destroyed with a nonzero PM usage count, which is legal.

Ah yes, I think I was too attached to the "pairing" thought at that point 
to see this isn't like e.g. a lock/unlock pair.


-- 
 i.

  parent reply	other threads:[~2025-10-20 15:56 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-16 22:53 [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized Brian Norris
2025-10-17  8:32 ` Lukas Wunner
2025-10-17 11:49   ` Ilpo Järvinen
2025-10-17 17:43     ` Brian Norris
2025-10-17 19:20       ` Brian Norris
2025-10-20 15:56       ` Ilpo Järvinen [this message]
2025-10-20 18:52         ` Brian Norris
2025-10-21 11:27           ` Ilpo Järvinen
2025-10-21 12:53             ` Rafael J. Wysocki
2025-10-21 13:18               ` Ilpo Järvinen
2025-10-21 18:27                 ` Brian Norris
2025-10-21 18:56                   ` Rafael J. Wysocki
2025-10-21 18:13             ` Brian Norris
2025-10-22 13:38               ` Ilpo Järvinen
2025-10-17  9:45 ` Rafael J. Wysocki
2025-10-17 17:11   ` Brian Norris

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=08976178-298f-79d9-1d63-cff5a4e56cc3@linux.intel.com \
    --to=ilpo.jarvinen@linux.intel.com \
    --cc=bhelgaas@google.com \
    --cc=briannorris@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=rafael@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).