[PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized

linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
@ 2025-10-16 22:53 Brian Norris
  2025-10-17  8:32 ` Lukas Wunner
  2025-10-17  9:45 ` Rafael J. Wysocki
  0 siblings, 2 replies; 16+ messages in thread
From: Brian Norris @ 2025-10-16 22:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-kernel, linux-pm, Rafael J . Wysocki, Lukas Wunner,
	linux-pci, Brian Norris, stable

PCI devices are created via pci_scan_slot() and similar, and are
promptly configured for runtime PM (pci_pm_init()). They are initially
prevented from suspending by way of pm_runtime_forbid(); however, it's
expected that user space may override this via sysfs [1].

Now, sometime after initial scan, a PCI device receives its BAR
configuration (pci_assign_unassigned_bus_resources(), etc.).

If a PCI device is allowed to suspend between pci_scan_slot() and
pci_assign_unassigned_bus_resources(), then pci-driver.c will
save/restore incorrect BAR configuration for the device, and the device
may cease to function.

This behavior races with user space, since user space may enable runtime
PM [1] as soon as it sees the device, which may be before BAR
configuration.

Prevent suspending in this intermediate state by holding a runtime PM
reference until the device is fully initialized and ready for probe().

[1] echo auto > /sys/bus/pci/devices/.../power/control

Cc: <stable@vger.kernel.org>
Signed-off-by: Brian Norris <briannorris@chromium.org>
---

 drivers/pci/bus.c | 7 +++++++
 drivers/pci/pci.c | 6 ++++++
 2 files changed, 13 insertions(+)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index f26aec6ff588..227a8898acac 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -14,6 +14,7 @@
 #include <linux/of.h>
 #include <linux/of_platform.h>
 #include <linux/platform_device.h>
+#include <linux/pm_runtime.h>
 #include <linux/proc_fs.h>
 #include <linux/slab.h>

@@ -375,6 +376,12 @@ void pci_bus_add_device(struct pci_dev *dev)
 		put_device(&pdev->dev);
 	}

+	/*
+	 * Now that resources are assigned, drop the reference we grabbed in
+	 * pci_pm_init().
+	 */
+	pm_runtime_put_noidle(&dev->dev);
+
 	if (!dn || of_device_is_available(dn))
 		pci_dev_allow_binding(dev);

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b14dd064006c..06a901214f2c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3226,6 +3226,12 @@ void pci_pm_init(struct pci_dev *dev)
 	pci_pm_power_up_and_verify_state(dev);
 	pm_runtime_forbid(&dev->dev);
 	pm_runtime_set_active(&dev->dev);
+	/*
+	 * We cannot allow a device to suspend before its resources are
+	 * configured. Otherwise, we may allow saving/restoring unexpected BAR
+	 * configuration.
+	 */
+	pm_runtime_get_noresume(&dev->dev);
 	pm_runtime_enable(&dev->dev);
 }

-- 
2.51.0.858.gf9c4a03a3a-goog

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-16 22:53 [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized Brian Norris
@ 2025-10-17  8:32 ` Lukas Wunner
  2025-10-17 11:49   ` Ilpo Järvinen
  2025-10-17  9:45 ` Rafael J. Wysocki
  1 sibling, 1 reply; 16+ messages in thread
From: Lukas Wunner @ 2025-10-17  8:32 UTC (permalink / raw)
  To: Brian Norris
  Cc: Bjorn Helgaas, linux-kernel, linux-pm, Rafael J . Wysocki,
	linux-pci, Ilpo Järvinen

[cc += Ilpo]

On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> PCI devices are created via pci_scan_slot() and similar, and are
> promptly configured for runtime PM (pci_pm_init()). They are initially
> prevented from suspending by way of pm_runtime_forbid(); however, it's
> expected that user space may override this via sysfs [1].
> 
> Now, sometime after initial scan, a PCI device receives its BAR
> configuration (pci_assign_unassigned_bus_resources(), etc.).
> 
> If a PCI device is allowed to suspend between pci_scan_slot() and
> pci_assign_unassigned_bus_resources(), then pci-driver.c will
> save/restore incorrect BAR configuration for the device, and the device
> may cease to function.
> 
> This behavior races with user space, since user space may enable runtime
> PM [1] as soon as it sees the device, which may be before BAR
> configuration.
> 
> Prevent suspending in this intermediate state by holding a runtime PM
> reference until the device is fully initialized and ready for probe().

Not sure if that is comprehensible by everybody.  The point is that
unbound devices are left in D0 but are nevertheless allowed to
(logically) runtime suspend.  And pci_pm_runtime_suspend() may call
pci_save_state() while config space isn't fully initialized yet,
or pci_pm_runtime_resume() may call pci_restore_state() (via
pci_pm_default_resume_early()) and overwrite initialized config space
with uninitialized data.

Have you actually seen this happen in practice?  Normally enumeration
happens during subsys_initcall time, when user space isn't running yet.
Hotplug may be an exception though.

Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
resource allocation and may judge whether this can actually happen.

I think the code comments you're adding are a little verbose and a simple
/* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.

Also, I think it is neither necessary nor useful to actually cc the e-mail
to stable@vger.kernel.org if you include a stable designation in the
patch.  I believe stable maintainers only pick up backports from that list,
not patches intended for upstream.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-16 22:53 [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized Brian Norris
  2025-10-17  8:32 ` Lukas Wunner
@ 2025-10-17  9:45 ` Rafael J. Wysocki
  2025-10-17 17:11   ` Brian Norris
  1 sibling, 1 reply; 16+ messages in thread
From: Rafael J. Wysocki @ 2025-10-17  9:45 UTC (permalink / raw)
  To: Brian Norris
  Cc: Bjorn Helgaas, linux-kernel, linux-pm, Rafael J . Wysocki,
	Lukas Wunner, linux-pci, stable

On Fri, Oct 17, 2025 at 1:28 AM Brian Norris <briannorris@chromium.org> wrote:
>
> PCI devices are created via pci_scan_slot() and similar, and are
> promptly configured for runtime PM (pci_pm_init()). They are initially
> prevented from suspending by way of pm_runtime_forbid(); however, it's
> expected that user space may override this via sysfs [1].
>
> Now, sometime after initial scan, a PCI device receives its BAR
> configuration (pci_assign_unassigned_bus_resources(), etc.).
>
> If a PCI device is allowed to suspend between pci_scan_slot() and
> pci_assign_unassigned_bus_resources(), then pci-driver.c will
> save/restore incorrect BAR configuration for the device, and the device
> may cease to function.
>
> This behavior races with user space, since user space may enable runtime
> PM [1] as soon as it sees the device, which may be before BAR
> configuration.
>
> Prevent suspending in this intermediate state by holding a runtime PM
> reference until the device is fully initialized and ready for probe().
>
> [1] echo auto > /sys/bus/pci/devices/.../power/control
>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Brian Norris <briannorris@chromium.org>
> ---
>
>  drivers/pci/bus.c | 7 +++++++
>  drivers/pci/pci.c | 6 ++++++
>  2 files changed, 13 insertions(+)
>
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index f26aec6ff588..227a8898acac 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -14,6 +14,7 @@
>  #include <linux/of.h>
>  #include <linux/of_platform.h>
>  #include <linux/platform_device.h>
> +#include <linux/pm_runtime.h>
>  #include <linux/proc_fs.h>
>  #include <linux/slab.h>
>
> @@ -375,6 +376,12 @@ void pci_bus_add_device(struct pci_dev *dev)
>                 put_device(&pdev->dev);
>         }
>
> +       /*
> +        * Now that resources are assigned, drop the reference we grabbed in
> +        * pci_pm_init().
> +        */
> +       pm_runtime_put_noidle(&dev->dev);
> +
>         if (!dn || of_device_is_available(dn))
>                 pci_dev_allow_binding(dev);
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b14dd064006c..06a901214f2c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3226,6 +3226,12 @@ void pci_pm_init(struct pci_dev *dev)
>         pci_pm_power_up_and_verify_state(dev);
>         pm_runtime_forbid(&dev->dev);
>         pm_runtime_set_active(&dev->dev);
> +       /*
> +        * We cannot allow a device to suspend before its resources are
> +        * configured. Otherwise, we may allow saving/restoring unexpected BAR
> +        * configuration.
> +        */
> +       pm_runtime_get_noresume(&dev->dev);
>         pm_runtime_enable(&dev->dev);

So runtime PM should not be enabled here, should it?

>  }
>
> --

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-17  8:32 ` Lukas Wunner
@ 2025-10-17 11:49   ` Ilpo Järvinen
  2025-10-17 17:43     ` Brian Norris
  0 siblings, 1 reply; 16+ messages in thread
From: Ilpo Järvinen @ 2025-10-17 11:49 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Brian Norris, Bjorn Helgaas, LKML, linux-pm, Rafael J . Wysocki,
	linux-pci

On Fri, 17 Oct 2025, Lukas Wunner wrote:

> [cc += Ilpo]
> 
> On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > PCI devices are created via pci_scan_slot() and similar, and are
> > promptly configured for runtime PM (pci_pm_init()). They are initially
> > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > expected that user space may override this via sysfs [1].

Is this true as pm_runtime_forbid() also increases PM usage count?

"void pm_runtime_forbid(struct device *dev);

unset the power.runtime_auto flag for the device and increase its 
usage counter (used by the /sys/devices/.../power/control interface to 
effectively prevent the device from being power managed at run time)"

> > Now, sometime after initial scan, a PCI device receives its BAR
> > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > 
> > If a PCI device is allowed to suspend between pci_scan_slot() and
> > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > save/restore incorrect BAR configuration for the device, and the device
> > may cease to function.
> > 
> > This behavior races with user space, since user space may enable runtime
> > PM [1] as soon as it sees the device, which may be before BAR
> > configuration.
> > 
> > Prevent suspending in this intermediate state by holding a runtime PM
> > reference until the device is fully initialized and ready for probe().
>
> Not sure if that is comprehensible by everybody.  The point is that
> unbound devices are left in D0 but are nevertheless allowed to
> (logically) runtime suspend.  And pci_pm_runtime_suspend() may call
> pci_save_state() while config space isn't fully initialized yet,
> or pci_pm_runtime_resume() may call pci_restore_state() (via
> pci_pm_default_resume_early()) and overwrite initialized config space
> with uninitialized data.
> 
> Have you actually seen this happen in practice?  Normally enumeration
> happens during subsys_initcall time, when user space isn't running yet.
> Hotplug may be an exception though.

Adding that pm_runtime_get_noresume() doesn't look useful given 
pm_runtime_forbid() already increases PM usage count. If this problem is 
actually seen in practice, there could a bug elsewhere where something 
decrements usage count too early so this change "helps" by double 
incrementing the usage count.

To find more information what's going on, one could try to trace events 
for the PM usage count (though last time I looked not all paths that 
change PM usage count were covered by the event and adding the 
trace_event() calls into the header turned out too much magic for me to 
figure out so I couldn't solve the problem).

> Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> resource allocation and may judge whether this can actually happen.

I can see there could be other failure modes besides just saving wrong 
config if devices get suspended underneath the resource assignment 
algorithm.

Besides hotplug, also BAR resize does changes the resources and BARs.
This case is not helped by this patch.

I also recently learned some DT platforms do the "actual" scan for the bus 
later on Link Up event through irq which could perhaps occur late enough, 
I dunno for sure.

> I think the code comments you're adding are a little verbose and a simple
> /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.

I'm also not entirely convinced these always pair or if the pci_dev may 
get removed before pci_bus_add_device(), see e.g., enable_slot() in 
hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before 
pci_bus_add_devices() (which could have other bugs too as is).

> Also, I think it is neither necessary nor useful to actually cc the e-mail
> to stable@vger.kernel.org if you include a stable designation in the
> patch.  I believe stable maintainers only pick up backports from that list,
> not patches intended for upstream.

Probably some tool will auto-insert the Cc: tags as receipients so it 
might be non-trivial to get rid of it.

-- 
 i.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-17  9:45 ` Rafael J. Wysocki
@ 2025-10-17 17:11   ` Brian Norris
  0 siblings, 0 replies; 16+ messages in thread
From: Brian Norris @ 2025-10-17 17:11 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, linux-kernel, linux-pm, Lukas Wunner, linux-pci,
	stable

Hi Rafael,

On Fri, Oct 17, 2025 at 11:45:14AM +0200, Rafael J. Wysocki wrote:
> On Fri, Oct 17, 2025 at 1:28 AM Brian Norris <briannorris@chromium.org> wrote:
> >
> > PCI devices are created via pci_scan_slot() and similar, and are
> > promptly configured for runtime PM (pci_pm_init()). They are initially
> > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > expected that user space may override this via sysfs [1].
> >
> > Now, sometime after initial scan, a PCI device receives its BAR
> > configuration (pci_assign_unassigned_bus_resources(), etc.).
> >
> > If a PCI device is allowed to suspend between pci_scan_slot() and
> > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > save/restore incorrect BAR configuration for the device, and the device
> > may cease to function.
> >
> > This behavior races with user space, since user space may enable runtime
> > PM [1] as soon as it sees the device, which may be before BAR
> > configuration.
> >
> > Prevent suspending in this intermediate state by holding a runtime PM
> > reference until the device is fully initialized and ready for probe().
> >
> > [1] echo auto > /sys/bus/pci/devices/.../power/control
> >
> > Cc: <stable@vger.kernel.org>
> > Signed-off-by: Brian Norris <briannorris@chromium.org>
> > ---
> >
> >  drivers/pci/bus.c | 7 +++++++
> >  drivers/pci/pci.c | 6 ++++++
> >  2 files changed, 13 insertions(+)
> >
> > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > index f26aec6ff588..227a8898acac 100644
> > --- a/drivers/pci/bus.c
> > +++ b/drivers/pci/bus.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/of.h>
> >  #include <linux/of_platform.h>
> >  #include <linux/platform_device.h>
> > +#include <linux/pm_runtime.h>
> >  #include <linux/proc_fs.h>
> >  #include <linux/slab.h>
> >
> > @@ -375,6 +376,12 @@ void pci_bus_add_device(struct pci_dev *dev)
> >                 put_device(&pdev->dev);
> >         }
> >
> > +       /*
> > +        * Now that resources are assigned, drop the reference we grabbed in
> > +        * pci_pm_init().
> > +        */
> > +       pm_runtime_put_noidle(&dev->dev);
> > +
> >         if (!dn || of_device_is_available(dn))
> >                 pci_dev_allow_binding(dev);
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index b14dd064006c..06a901214f2c 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -3226,6 +3226,12 @@ void pci_pm_init(struct pci_dev *dev)
> >         pci_pm_power_up_and_verify_state(dev);
> >         pm_runtime_forbid(&dev->dev);
> >         pm_runtime_set_active(&dev->dev);
> > +       /*
> > +        * We cannot allow a device to suspend before its resources are
> > +        * configured. Otherwise, we may allow saving/restoring unexpected BAR
> > +        * configuration.
> > +        */
> > +       pm_runtime_get_noresume(&dev->dev);
> >         pm_runtime_enable(&dev->dev);
> 
> So runtime PM should not be enabled here, should it?

Hmm, I suppose not. Does that imply it would be a better solution to
simply defer pm_runtime_enable() to pci_bus_add_device() or some similar
point? I'll give that a shot, since that seems like a simpler and
cleaner solution.

Thanks,
Brian

> >  }
> >
> > --

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-17 11:49   ` Ilpo Järvinen
@ 2025-10-17 17:43     ` Brian Norris
  2025-10-17 19:20       ` Brian Norris
  2025-10-20 15:56       ` Ilpo Järvinen
  0 siblings, 2 replies; 16+ messages in thread
From: Brian Norris @ 2025-10-17 17:43 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Lukas Wunner, Bjorn Helgaas, LKML, linux-pm, Rafael J . Wysocki,
	linux-pci

Hi Ilpo and Lukas,

I'll reply to both of you inline:

On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> On Fri, 17 Oct 2025, Lukas Wunner wrote:
> 
> > [cc += Ilpo]
> > 
> > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > PCI devices are created via pci_scan_slot() and similar, and are
> > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > expected that user space may override this via sysfs [1].
> 
> Is this true as pm_runtime_forbid() also increases PM usage count?

Yes it's true. See below.

> "void pm_runtime_forbid(struct device *dev);
> 
> unset the power.runtime_auto flag for the device and increase its 
> usage counter (used by the /sys/devices/.../power/control interface to 
> effectively prevent the device from being power managed at run time)"

Right, but sysfs `echo auto > .../power/control` performs the inverse --
pm_runtime_allow() -- which decrements that count.

> > > Now, sometime after initial scan, a PCI device receives its BAR
> > > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > > 
> > > If a PCI device is allowed to suspend between pci_scan_slot() and
> > > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > > save/restore incorrect BAR configuration for the device, and the device
> > > may cease to function.
> > > 
> > > This behavior races with user space, since user space may enable runtime
> > > PM [1] as soon as it sees the device, which may be before BAR
> > > configuration.
> > > 
> > > Prevent suspending in this intermediate state by holding a runtime PM
> > > reference until the device is fully initialized and ready for probe().
> >
> > Not sure if that is comprehensible by everybody.

Yeah, thanks for trying to clarify. After getting too far into the weeds
on a bug, I sometimes don't spend the appropriate time on writing a
simple problem description. Maybe I can do better on a v2.

> > The point is that
> > unbound devices are left in D0 but are nevertheless allowed to
> > (logically) runtime suspend.  And pci_pm_runtime_suspend() may call
> > pci_save_state() while config space isn't fully initialized yet,
> > or pci_pm_runtime_resume() may call pci_restore_state() (via
> > pci_pm_default_resume_early()) and overwrite initialized config space
> > with uninitialized data.

Ack.

> > Have you actually seen this happen in practice?

Yes, that's why I spent my time debugging and submitting this patch :)

> > Normally enumeration
> > happens during subsys_initcall time, when user space isn't running yet.
> > Hotplug may be an exception though.

Hotplug, rescan (e.g., when pwrctrl is in use, power may be stablished
later on, and it triggers a bus rescan; pwrctrl drivers can be modules),
or PCI controller drivers built as modules.

I happen to be using both pwrctrl and controller drivers as modules, so
I hit it that way.

> Adding that pm_runtime_get_noresume() doesn't look useful given 
> pm_runtime_forbid() already increases PM usage count. If this problem is 
> actually seen in practice, there could a bug elsewhere where something 
> decrements usage count too early so this change "helps" by double 
> incrementing the usage count.
> 
> To find more information what's going on, one could try to trace events 
> for the PM usage count (though last time I looked not all paths that 
> change PM usage count were covered by the event and adding the 
> trace_event() calls into the header turned out too much magic for me to 
> figure out so I couldn't solve the problem).

See above. forbid() is not a guaranteed blocker, because user space can
undo it.

> > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > resource allocation and may judge whether this can actually happen.
> 
> I can see there could be other failure modes besides just saving wrong 
> config if devices get suspended underneath the resource assignment 
> algorithm.
> 
> Besides hotplug, also BAR resize does changes the resources and BARs.
> This case is not helped by this patch.

Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
references (pci_config_pm_runtime_{get,put}()) and therefore should not
generally have the same problem. pci-driver.c's runtime suspend will
save a new copy of the registers the next time we suspend after resize.

(Now, some drivers could have problems if they try to stash a static
copy via pci_store_saved_state()/pci_load_saved_state(), but that
invites plenty of its own problems anyway.)

> I also recently learned some DT platforms do the "actual" scan for the bus 
> later on Link Up event through irq which could perhaps occur late enough, 
> I dunno for sure.

Sure, but that'd be covered by my patch, as those (re)scans would
discover new devices in the same scan+add flow.

> > I think the code comments you're adding are a little verbose and a simple
> > /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.
> 
> I'm also not entirely convinced these always pair

That's a very valid question. There are so many variations of scan+add
that it's been hard for me to tell. I've studied the code pretty
closely, and tested what I could, but I don't have hotplug systems on
hand, and I definitely could miss something.

FWIW, Rafael suggested/implied an alternative, where I could simply
delay pm_runtime_enable() until pci_bus_add_device(). That would dodge
the pairing questions, I think.

> or if the pci_dev may 
> get removed before pci_bus_add_device(), see e.g., enable_slot() in 
> hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before 
> pci_bus_add_devices() (which could have other bugs too as is).

I believe it should be OK if a device is removed before the
pm_runtime_put_noidle() half of the pair. It just means the device gets
destroyed with a nonzero PM usage count, which is legal.

> > Also, I think it is neither necessary nor useful to actually cc the e-mail
> > to stable@vger.kernel.org if you include a stable designation in the
> > patch.  I believe stable maintainers only pick up backports from that list,
> > not patches intended for upstream.
> 
> Probably some tool will auto-insert the Cc: tags as receipients so it 
> might be non-trivial to get rid of it.

Yeah, git-send-email applies "Cc:" lines automatically, so I expect it's
very common for people to do that. I use U-Boot's patman, which wraps
git-send-email. I just assume stable@ folks expect that or are at least
used to it, because ... well, "Cc:" usually means "copy this on the
email" ...

git-send-email has --suppress-cc, and maybe I can convince my tools to
do that. (e.g., `git config sendemail.suppresscc cc`)

Brian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-17 17:43     ` Brian Norris
@ 2025-10-17 19:20       ` Brian Norris
  2025-10-20 15:56       ` Ilpo Järvinen
  1 sibling, 0 replies; 16+ messages in thread
From: Brian Norris @ 2025-10-17 19:20 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Lukas Wunner, Bjorn Helgaas, LKML, linux-pm, Rafael J . Wysocki,
	linux-pci

On Fri, Oct 17, 2025 at 10:43:18AM -0700, Brian Norris wrote:
> git-send-email has --suppress-cc, and maybe I can convince my tools to
> do that. (e.g., `git config sendemail.suppresscc cc`)

For the record, that'd actually be:

  git config sendemail.suppressCc bodycc

I think I'll use that, since I usually track intentional Cc's with other
metadata, not (just) body-trailer "Cc:" lines.

Brian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-17 17:43     ` Brian Norris
  2025-10-17 19:20       ` Brian Norris
@ 2025-10-20 15:56       ` Ilpo Järvinen
  2025-10-20 18:52         ` Brian Norris
  1 sibling, 1 reply; 16+ messages in thread
From: Ilpo Järvinen @ 2025-10-20 15:56 UTC (permalink / raw)
  To: Brian Norris
  Cc: Lukas Wunner, Bjorn Helgaas, LKML, linux-pm, Rafael J . Wysocki,
	linux-pci

[-- Attachment #1: Type: text/plain, Size: 7717 bytes --]

On Fri, 17 Oct 2025, Brian Norris wrote:

> Hi Ilpo and Lukas,
> 
> I'll reply to both of you inline:

I see you posted v2 but I'm answering here to have the context available. 
Mostly things seem okay after your explanation, I think the only question 
mark is a driver calling pci_resize_resource() directly to resize BARs.

> On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> > 
> > > [cc += Ilpo]
> > > 
> > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > expected that user space may override this via sysfs [1].
> > 
> > Is this true as pm_runtime_forbid() also increases PM usage count?
> 
> Yes it's true. See below.
> 
> > "void pm_runtime_forbid(struct device *dev);
> > 
> > unset the power.runtime_auto flag for the device and increase its 
> > usage counter (used by the /sys/devices/.../power/control interface to 
> > effectively prevent the device from being power managed at run time)"
> 
> Right, but sysfs `echo auto > .../power/control` performs the inverse --
> pm_runtime_allow() -- which decrements that count.

Fair enough, I didn't check what it does.

IMO, the details about how the usage count behaves should be part of the 
changelog as that documentation I quoted sounded like user control is 
prevented when forbidden. I see you've put this part of the explanation 
into the v2 as well so I suggest you explain the usage count in the change 
so it is recorded in the commit if somebody has to look at this commit 
years from now.

> > > > Now, sometime after initial scan, a PCI device receives its BAR
> > > > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > > > 
> > > > If a PCI device is allowed to suspend between pci_scan_slot() and
> > > > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > > > save/restore incorrect BAR configuration for the device, and the device
> > > > may cease to function.
> > > > 
> > > > This behavior races with user space, since user space may enable runtime
> > > > PM [1] as soon as it sees the device, which may be before BAR
> > > > configuration.
> > > > 
> > > > Prevent suspending in this intermediate state by holding a runtime PM
> > > > reference until the device is fully initialized and ready for probe().
> > >
> > > Not sure if that is comprehensible by everybody.
> 
> Yeah, thanks for trying to clarify. After getting too far into the weeds
> on a bug, I sometimes don't spend the appropriate time on writing a
> simple problem description. Maybe I can do better on a v2.
> 
> > > The point is that
> > > unbound devices are left in D0 but are nevertheless allowed to
> > > (logically) runtime suspend.  And pci_pm_runtime_suspend() may call
> > > pci_save_state() while config space isn't fully initialized yet,
> > > or pci_pm_runtime_resume() may call pci_restore_state() (via
> > > pci_pm_default_resume_early()) and overwrite initialized config space
> > > with uninitialized data.
> 
> Ack.
> 
> > > Have you actually seen this happen in practice?
> 
> Yes, that's why I spent my time debugging and submitting this patch :)

Thanks for doing it! :-)

> > > Normally enumeration
> > > happens during subsys_initcall time, when user space isn't running yet.
> > > Hotplug may be an exception though.
> 
> Hotplug, rescan (e.g., when pwrctrl is in use, power may be stablished
> later on, and it triggers a bus rescan; pwrctrl drivers can be modules),
> or PCI controller drivers built as modules.
> 
> I happen to be using both pwrctrl and controller drivers as modules, so
> I hit it that way.
> 
> > Adding that pm_runtime_get_noresume() doesn't look useful given 
> > pm_runtime_forbid() already increases PM usage count. If this problem is 
> > actually seen in practice, there could a bug elsewhere where something 
> > decrements usage count too early so this change "helps" by double 
> > incrementing the usage count.
> > 
> > To find more information what's going on, one could try to trace events 
> > for the PM usage count (though last time I looked not all paths that 
> > change PM usage count were covered by the event and adding the 
> > trace_event() calls into the header turned out too much magic for me to 
> > figure out so I couldn't solve the problem).
> 
> See above. forbid() is not a guaranteed blocker, because user space can
> undo it.
> 
> > > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > > resource allocation and may judge whether this can actually happen.
> > 
> > I can see there could be other failure modes besides just saving wrong 
> > config if devices get suspended underneath the resource assignment 
> > algorithm.
> > 
> > Besides hotplug, also BAR resize does changes the resources and BARs.
> > This case is not helped by this patch.
> 
> Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
> references (pci_config_pm_runtime_{get,put}()) and therefore should not
> generally have the same problem.

Okay, seem fine for the PCI core part.

Driver's can also trigger BAR resize by calling pci_resize_resource() 
directly but I've no idea how the usage counts behave (TBH, PM isn't my 
strongest forte even if Lukas pulled me in to comment).

> pci-driver.c's runtime suspend will
> save a new copy of the registers the next time we suspend after resize.
> 
> (Now, some drivers could have problems if they try to stash a static
> copy via pci_store_saved_state()/pci_load_saved_state(), but that
> invites plenty of its own problems anyway.)
> 
> > I also recently learned some DT platforms do the "actual" scan for the bus 
> > later on Link Up event through irq which could perhaps occur late enough, 
> > I dunno for sure.
> 
> Sure, but that'd be covered by my patch, as those (re)scans would
> discover new devices in the same scan+add flow.

Okay, maybe it's fine like the rest. I was mostly trying to think 
non-subsys_initcall() cases I knew of without contextualizing them back to 
this patch.

> > > I think the code comments you're adding are a little verbose and a simple
> > > /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.
> > 
> > I'm also not entirely convinced these always pair
> 
> That's a very valid question. There are so many variations of scan+add
> that it's been hard for me to tell.

I've noticed... unfortunately I find myself often in the same boat. :-/

> I've studied the code pretty
> closely, and tested what I could, but I don't have hotplug systems on
> hand, and I definitely could miss something.
> 
> FWIW, Rafael suggested/implied an alternative, where I could simply
> delay pm_runtime_enable() until pci_bus_add_device(). That would dodge
> the pairing questions, I think.

Yes.

> > or if the pci_dev may 
> > get removed before pci_bus_add_device(), see e.g., enable_slot() in 
> > hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before 
> > pci_bus_add_devices() (which could have other bugs too as is).
> 
> I believe it should be OK if a device is removed before the
> pm_runtime_put_noidle() half of the pair. It just means the device gets
> destroyed with a nonzero PM usage count, which is legal.

Ah yes, I think I was too attached to the "pairing" thought at that point 
to see this isn't like e.g. a lock/unlock pair.


-- 
 i.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-20 15:56       ` Ilpo Järvinen
@ 2025-10-20 18:52         ` Brian Norris
  2025-10-21 11:27           ` Ilpo Järvinen
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Norris @ 2025-10-20 18:52 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Lukas Wunner, Bjorn Helgaas, LKML, linux-pm, Rafael J . Wysocki,
	linux-pci

Hi Ilpo,

On Mon, Oct 20, 2025 at 06:56:41PM +0300, Ilpo Järvinen wrote:
> On Fri, 17 Oct 2025, Brian Norris wrote:
> 
> > Hi Ilpo and Lukas,
> > 
> > I'll reply to both of you inline:
> 
> I see you posted v2 but I'm answering here to have the context available. 

Sure!

> Mostly things seem okay after your explanation, I think the only question 
> mark is a driver calling pci_resize_resource() directly to resize BARs.
> 
> > On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> > > 
> > > > [cc += Ilpo]
> > > > 
> > > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > > expected that user space may override this via sysfs [1].
> > > 
> > > Is this true as pm_runtime_forbid() also increases PM usage count?
> > 
> > Yes it's true. See below.
> > 
> > > "void pm_runtime_forbid(struct device *dev);
> > > 
> > > unset the power.runtime_auto flag for the device and increase its 
> > > usage counter (used by the /sys/devices/.../power/control interface to 
> > > effectively prevent the device from being power managed at run time)"

I see this doc line confused you, and I can sympathize.

IIUC, the parenthetical means that sysfs *uses* pm_runtime_forbid() to
"effectively prevent runtime power management"; pm_runtime_forbid() does
not block user space from doing anything.

> > Right, but sysfs `echo auto > .../power/control` performs the inverse --
> > pm_runtime_allow() -- which decrements that count.
> 
> Fair enough, I didn't check what it does.
> 
> IMO, the details about how the usage count behaves should be part of the 
> changelog as that documentation I quoted sounded like user control is 
> prevented when forbidden.

I tried to elaborate on the API doc confusion above. But frankly, I'm
not sure how best to explain runtime PM.

> I see you've put this part of the explanation 
> into the v2 as well so I suggest you explain the usage count in the change 
> so it is recorded in the commit if somebody has to look at this commit 
> years from now.

Both v1 and v2 mention that the sysfs 'power/control' file can override
the kernel calling pm_runtime_forbid(). They don't mention the usage
count, since that's an implementation detail IMO. (To me, the mental
model works best if "usage count" (usually get()/put()) is considered
mostly orthogonal to forbid()/allow()/sysfs, because "forbid()" can be
overridden at any time.)

This is also covered here:

https://docs.kernel.org/power/runtime_pm.html#runtime-pm-initialization-device-probing-and-removal

"In principle, this mechanism may also be used by the driver to
effectively turn off the runtime power management of the device until
the user space turns it on."

But admittedly, I find the runtime PM API surface to be enormous, and
its documentation ... not very helpful to outsiders. It's pretty much
written by and for PM experts. Case in point: you read and quoted the
appropriate docs, but it still misled you quite a bit :(

I'm more tempted to try to improve the runtime PM docs than to try to
make up for them in the commit message, but maybe I can be convinced
otherwise.

> > > > > Now, sometime after initial scan, a PCI device receives its BAR
> > > > > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > > > > 
> > > > > If a PCI device is allowed to suspend between pci_scan_slot() and
> > > > > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > > > > save/restore incorrect BAR configuration for the device, and the device
> > > > > may cease to function.
> > > > > 
> > > > > This behavior races with user space, since user space may enable runtime
> > > > > PM [1] as soon as it sees the device, which may be before BAR
> > > > > configuration.
> > > > > 
> > > > > Prevent suspending in this intermediate state by holding a runtime PM
> > > > > reference until the device is fully initialized and ready for probe().
> > > >
> > > > Not sure if that is comprehensible by everybody.
> > 
> > Yeah, thanks for trying to clarify. After getting too far into the weeds
> > on a bug, I sometimes don't spend the appropriate time on writing a
> > simple problem description. Maybe I can do better on a v2.
> > 
> > > > The point is that
> > > > unbound devices are left in D0 but are nevertheless allowed to
> > > > (logically) runtime suspend.  And pci_pm_runtime_suspend() may call
> > > > pci_save_state() while config space isn't fully initialized yet,
> > > > or pci_pm_runtime_resume() may call pci_restore_state() (via
> > > > pci_pm_default_resume_early()) and overwrite initialized config space
> > > > with uninitialized data.
> > 
> > Ack.
> > 
> > > > Have you actually seen this happen in practice?
> > 
> > Yes, that's why I spent my time debugging and submitting this patch :)
> 
> Thanks for doing it! :-)
> 
> > > > Normally enumeration
> > > > happens during subsys_initcall time, when user space isn't running yet.
> > > > Hotplug may be an exception though.
> > 
> > Hotplug, rescan (e.g., when pwrctrl is in use, power may be stablished
> > later on, and it triggers a bus rescan; pwrctrl drivers can be modules),
> > or PCI controller drivers built as modules.
> > 
> > I happen to be using both pwrctrl and controller drivers as modules, so
> > I hit it that way.
> > 
> > > Adding that pm_runtime_get_noresume() doesn't look useful given 
> > > pm_runtime_forbid() already increases PM usage count. If this problem is 
> > > actually seen in practice, there could a bug elsewhere where something 
> > > decrements usage count too early so this change "helps" by double 
> > > incrementing the usage count.
> > > 
> > > To find more information what's going on, one could try to trace events 
> > > for the PM usage count (though last time I looked not all paths that 
> > > change PM usage count were covered by the event and adding the 
> > > trace_event() calls into the header turned out too much magic for me to 
> > > figure out so I couldn't solve the problem).
> > 
> > See above. forbid() is not a guaranteed blocker, because user space can
> > undo it.
> > 
> > > > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > > > resource allocation and may judge whether this can actually happen.
> > > 
> > > I can see there could be other failure modes besides just saving wrong 
> > > config if devices get suspended underneath the resource assignment 
> > > algorithm.
> > > 
> > > Besides hotplug, also BAR resize does changes the resources and BARs.
> > > This case is not helped by this patch.
> > 
> > Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
> > references (pci_config_pm_runtime_{get,put}()) and therefore should not
> > generally have the same problem.
> 
> Okay, seem fine for the PCI core part.
> 
> Driver's can also trigger BAR resize by calling pci_resize_resource() 
> directly but I've no idea how the usage counts behave (TBH, PM isn't my 
> strongest forte even if Lukas pulled me in to comment).

There are only 3 drivers that call pci_resize_resource(). I looked into
them, and it looks like they all hold pm_runtime_* references while
doing this, or are calling it during pci_driver probe(), which docs say
will hold a reference.

https://docs.kernel.org/power/pci.html#device-runtime-power-management

So they should all be OK too.

> > pci-driver.c's runtime suspend will
> > save a new copy of the registers the next time we suspend after resize.
> > 
> > (Now, some drivers could have problems if they try to stash a static
> > copy via pci_store_saved_state()/pci_load_saved_state(), but that
> > invites plenty of its own problems anyway.)
> > 
> > > I also recently learned some DT platforms do the "actual" scan for the bus 
> > > later on Link Up event through irq which could perhaps occur late enough, 
> > > I dunno for sure.
> > 
> > Sure, but that'd be covered by my patch, as those (re)scans would
> > discover new devices in the same scan+add flow.
> 
> Okay, maybe it's fine like the rest. I was mostly trying to think 
> non-subsys_initcall() cases I knew of without contextualizing them back to 
> this patch.

Sure, thanks for the pointers. I'm happy to look into other areas if you
suspect there might similar PM-related bugs.

> > > > I think the code comments you're adding are a little verbose and a simple
> > > > /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.
> > > 
> > > I'm also not entirely convinced these always pair
> > 
> > That's a very valid question. There are so many variations of scan+add
> > that it's been hard for me to tell.
> 
> I've noticed... unfortunately I find myself often in the same boat. :-/
> 
> > I've studied the code pretty
> > closely, and tested what I could, but I don't have hotplug systems on
> > hand, and I definitely could miss something.
> > 
> > FWIW, Rafael suggested/implied an alternative, where I could simply
> > delay pm_runtime_enable() until pci_bus_add_device(). That would dodge
> > the pairing questions, I think.
> 
> Yes.
> 
> > > or if the pci_dev may 
> > > get removed before pci_bus_add_device(), see e.g., enable_slot() in 
> > > hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before 
> > > pci_bus_add_devices() (which could have other bugs too as is).
> > 
> > I believe it should be OK if a device is removed before the
> > pm_runtime_put_noidle() half of the pair. It just means the device gets
> > destroyed with a nonzero PM usage count, which is legal.
> 
> Ah yes, I think I was too attached to the "pairing" thought at that point 
> to see this isn't like e.g. a lock/unlock pair.

Thanks for looking things over.

Brian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-20 18:52         ` Brian Norris
@ 2025-10-21 11:27           ` Ilpo Järvinen
  2025-10-21 12:53             ` Rafael J. Wysocki
  2025-10-21 18:13             ` Brian Norris
  0 siblings, 2 replies; 16+ messages in thread
From: Ilpo Järvinen @ 2025-10-21 11:27 UTC (permalink / raw)
  To: Brian Norris
  Cc: Lukas Wunner, Bjorn Helgaas, LKML, linux-pm, Rafael J . Wysocki,
	linux-pci

[-- Attachment #1: Type: text/plain, Size: 6046 bytes --]

On Mon, 20 Oct 2025, Brian Norris wrote:

> Hi Ilpo,
> 
> On Mon, Oct 20, 2025 at 06:56:41PM +0300, Ilpo Järvinen wrote:
> > On Fri, 17 Oct 2025, Brian Norris wrote:
> > 
> > > On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > > > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> > > > 
> > > > > [cc += Ilpo]
> > > > > 
> > > > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > > > expected that user space may override this via sysfs [1].
> > > > 
> > > > Is this true as pm_runtime_forbid() also increases PM usage count?
> > > 
> > > Yes it's true. See below.
> > > 
> > > > "void pm_runtime_forbid(struct device *dev);
> > > > 
> > > > unset the power.runtime_auto flag for the device and increase its 
> > > > usage counter (used by the /sys/devices/.../power/control interface to 
> > > > effectively prevent the device from being power managed at run time)"
> 
> I see this doc line confused you, and I can sympathize.
> 
> IIUC, the parenthetical means that sysfs *uses* pm_runtime_forbid() to
> "effectively prevent runtime power management"; pm_runtime_forbid() does
> not block user space from doing anything.
>
> > > Right, but sysfs `echo auto > .../power/control` performs the inverse --
> > > pm_runtime_allow() -- which decrements that count.
> > 
> > Fair enough, I didn't check what it does.
> > 
> > IMO, the details about how the usage count behaves should be part of the 
> > changelog as that documentation I quoted sounded like user control is 
> > prevented when forbidden.
> 
> I tried to elaborate on the API doc confusion above. But frankly, I'm
> not sure how best to explain runtime PM.
> 
> > I see you've put this part of the explanation 
> > into the v2 as well so I suggest you explain the usage count in the change 
> > so it is recorded in the commit if somebody has to look at this commit 
> > years from now.
> 
> Both v1 and v2 mention that the sysfs 'power/control' file can override
> the kernel calling pm_runtime_forbid(). They don't mention the usage
> count, since that's an implementation detail IMO. (To me, the mental
> model works best if "usage count" (usually get()/put()) is considered
> mostly orthogonal to forbid()/allow()/sysfs, because "forbid()" can be
> overridden at any time.)
> 
> This is also covered here:
> 
> https://docs.kernel.org/power/runtime_pm.html#runtime-pm-initialization-device-probing-and-removal
> 
> "In principle, this mechanism may also be used by the driver to
> effectively turn off the runtime power management of the device until
> the user space turns it on."

The problem is already rooted into the function name, when a function is 
called "forbid", anyone unfamiliar will think it really forbids 
something. The docs just further reinforced the idea and the fact that it 
also increments usage count.

It is quite unexpected and feels quite illogical (for non-PM person like 
me) that user interface then goes to reverse that usage count increase, 
what would be the logical reason why there now are less users for it when 
user wants to turn on PM? (I understand things are done that way, no need 
to explain that further, but there are quite a few misleading things in 
this entire scenario, not just that parenthesis part of the docs.)

> But admittedly, I find the runtime PM API surface to be enormous, and
> its documentation ... not very helpful to outsiders. It's pretty much
> written by and for PM experts. Case in point: you read and quoted the
> appropriate docs, but it still misled you quite a bit :(
> 
> I'm more tempted to try to improve the runtime PM docs than to try to
> make up for them in the commit message, but maybe I can be convinced
> otherwise.

My personal approach is that if something comes up during review, it 
likely is something also the next person to look at this change (from 
history, maybe years from now) could similarly stumbles on when trying to 
understand the change. Thus, it often is worth to mention such things in 
the changelog too.

While I'm definitely not against improvements to docs too, the changelog 
for any patch should help to understand why the change was made. And 
IMO, this unexpected "internal detail" related to usage count which is 
quite significant here, if user interface wouldn't lower it, runtime PM 
would remain forbidden as forbid() promised.

> > > > > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > > > > resource allocation and may judge whether this can actually happen.
> > > > 
> > > > I can see there could be other failure modes besides just saving wrong 
> > > > config if devices get suspended underneath the resource assignment 
> > > > algorithm.
> > > > 
> > > > Besides hotplug, also BAR resize does changes the resources and BARs.
> > > > This case is not helped by this patch.
> > > 
> > > Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
> > > references (pci_config_pm_runtime_{get,put}()) and therefore should not
> > > generally have the same problem.
> > 
> > Okay, seem fine for the PCI core part.
> > 
> > Driver's can also trigger BAR resize by calling pci_resize_resource() 
> > directly but I've no idea how the usage counts behave (TBH, PM isn't my 
> > strongest forte even if Lukas pulled me in to comment).
> 
> There are only 3 drivers that call pci_resize_resource(). I looked into
> them, and it looks like they all hold pm_runtime_* references while
> doing this, or are calling it during pci_driver probe(), which docs say
> will hold a reference.
> 
> https://docs.kernel.org/power/pci.html#device-runtime-power-management
> 
> So they should all be OK too.

Thanks for checking.

-- 
 i.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-21 11:27           ` Ilpo Järvinen
@ 2025-10-21 12:53             ` Rafael J. Wysocki
  2025-10-21 13:18               ` Ilpo Järvinen
  2025-10-21 18:13             ` Brian Norris
  1 sibling, 1 reply; 16+ messages in thread
From: Rafael J. Wysocki @ 2025-10-21 12:53 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Brian Norris, Lukas Wunner, Bjorn Helgaas, LKML, linux-pm,
	Rafael J . Wysocki, linux-pci

On Tue, Oct 21, 2025 at 1:27 PM Ilpo Järvinen
<ilpo.jarvinen@linux.intel.com> wrote:
>
> On Mon, 20 Oct 2025, Brian Norris wrote:
>
> > Hi Ilpo,
> >
> > On Mon, Oct 20, 2025 at 06:56:41PM +0300, Ilpo Järvinen wrote:
> > > On Fri, 17 Oct 2025, Brian Norris wrote:
> > >
> > > > On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > > > > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> > > > >
> > > > > > [cc += Ilpo]
> > > > > >
> > > > > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > > > > expected that user space may override this via sysfs [1].
> > > > >
> > > > > Is this true as pm_runtime_forbid() also increases PM usage count?
> > > >
> > > > Yes it's true. See below.
> > > >
> > > > > "void pm_runtime_forbid(struct device *dev);
> > > > >
> > > > > unset the power.runtime_auto flag for the device and increase its
> > > > > usage counter (used by the /sys/devices/.../power/control interface to
> > > > > effectively prevent the device from being power managed at run time)"
> >
> > I see this doc line confused you, and I can sympathize.
> >
> > IIUC, the parenthetical means that sysfs *uses* pm_runtime_forbid() to
> > "effectively prevent runtime power management"; pm_runtime_forbid() does
> > not block user space from doing anything.
> >
> > > > Right, but sysfs `echo auto > .../power/control` performs the inverse --
> > > > pm_runtime_allow() -- which decrements that count.
> > >
> > > Fair enough, I didn't check what it does.
> > >
> > > IMO, the details about how the usage count behaves should be part of the
> > > changelog as that documentation I quoted sounded like user control is
> > > prevented when forbidden.
> >
> > I tried to elaborate on the API doc confusion above. But frankly, I'm
> > not sure how best to explain runtime PM.
> >
> > > I see you've put this part of the explanation
> > > into the v2 as well so I suggest you explain the usage count in the change
> > > so it is recorded in the commit if somebody has to look at this commit
> > > years from now.
> >
> > Both v1 and v2 mention that the sysfs 'power/control' file can override
> > the kernel calling pm_runtime_forbid(). They don't mention the usage
> > count, since that's an implementation detail IMO. (To me, the mental
> > model works best if "usage count" (usually get()/put()) is considered
> > mostly orthogonal to forbid()/allow()/sysfs, because "forbid()" can be
> > overridden at any time.)
> >
> > This is also covered here:
> >
> > https://docs.kernel.org/power/runtime_pm.html#runtime-pm-initialization-device-probing-and-removal
> >
> > "In principle, this mechanism may also be used by the driver to
> > effectively turn off the runtime power management of the device until
> > the user space turns it on."
>
> The problem is already rooted into the function name, when a function is
> called "forbid", anyone unfamiliar will think it really forbids
> something.

And it does, until the "allow" counterpart of it is called.

The confusing part here is that the "allow" counterpart is called from
a sysfs attribute.

> The docs just further reinforced the idea and the fact that it
> also increments usage count.
>
> It is quite unexpected and feels quite illogical (for non-PM person like
> me) that user interface then goes to reverse that usage count increase,
> what would be the logical reason why there now are less users for it when
> user wants to turn on PM? (I understand things are done that way, no need
> to explain that further, but there are quite a few misleading things in
> this entire scenario, not just that parenthesis part of the docs.)

So the purpose of this "forbid" call in pci_pm_init() is to "block"
runtime PM for PCI devices by default, but allow user space to
"unblock" it later.

Would adding a comment to that effect next to that call be useful?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-21 12:53             ` Rafael J. Wysocki
@ 2025-10-21 13:18               ` Ilpo Järvinen
  2025-10-21 18:27                 ` Brian Norris
  0 siblings, 1 reply; 16+ messages in thread
From: Ilpo Järvinen @ 2025-10-21 13:18 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Brian Norris, Lukas Wunner, Bjorn Helgaas, LKML, linux-pm,
	linux-pci

[-- Attachment #1: Type: text/plain, Size: 5914 bytes --]

On Tue, 21 Oct 2025, Rafael J. Wysocki wrote:
> On Tue, Oct 21, 2025 at 1:27 PM Ilpo Järvinen
> <ilpo.jarvinen@linux.intel.com> wrote:
> > On Mon, 20 Oct 2025, Brian Norris wrote:
> > > On Mon, Oct 20, 2025 at 06:56:41PM +0300, Ilpo Järvinen wrote:
> > > > On Fri, 17 Oct 2025, Brian Norris wrote:
> > > >
> > > > > On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > > > > > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> > > > > >
> > > > > > > [cc += Ilpo]
> > > > > > >
> > > > > > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > > > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > > > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > > > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > > > > > expected that user space may override this via sysfs [1].
> > > > > >
> > > > > > Is this true as pm_runtime_forbid() also increases PM usage count?
> > > > >
> > > > > Yes it's true. See below.
> > > > >
> > > > > > "void pm_runtime_forbid(struct device *dev);
> > > > > >
> > > > > > unset the power.runtime_auto flag for the device and increase its
> > > > > > usage counter (used by the /sys/devices/.../power/control interface to
> > > > > > effectively prevent the device from being power managed at run time)"
> > >
> > > I see this doc line confused you, and I can sympathize.
> > >
> > > IIUC, the parenthetical means that sysfs *uses* pm_runtime_forbid() to
> > > "effectively prevent runtime power management"; pm_runtime_forbid() does
> > > not block user space from doing anything.
> > >
> > > > > Right, but sysfs `echo auto > .../power/control` performs the inverse --
> > > > > pm_runtime_allow() -- which decrements that count.
> > > >
> > > > Fair enough, I didn't check what it does.
> > > >
> > > > IMO, the details about how the usage count behaves should be part of the
> > > > changelog as that documentation I quoted sounded like user control is
> > > > prevented when forbidden.
> > >
> > > I tried to elaborate on the API doc confusion above. But frankly, I'm
> > > not sure how best to explain runtime PM.
> > >
> > > > I see you've put this part of the explanation
> > > > into the v2 as well so I suggest you explain the usage count in the change
> > > > so it is recorded in the commit if somebody has to look at this commit
> > > > years from now.
> > >
> > > Both v1 and v2 mention that the sysfs 'power/control' file can override
> > > the kernel calling pm_runtime_forbid(). They don't mention the usage
> > > count, since that's an implementation detail IMO. (To me, the mental
> > > model works best if "usage count" (usually get()/put()) is considered
> > > mostly orthogonal to forbid()/allow()/sysfs, because "forbid()" can be
> > > overridden at any time.)
> > >
> > > This is also covered here:
> > >
> > > https://docs.kernel.org/power/runtime_pm.html#runtime-pm-initialization-device-probing-and-removal
> > >
> > > "In principle, this mechanism may also be used by the driver to
> > > effectively turn off the runtime power management of the device until
> > > the user space turns it on."
> >
> > The problem is already rooted into the function name, when a function is
> > called "forbid", anyone unfamiliar will think it really forbids
> > something.
> 
> And it does, until the "allow" counterpart of it is called.
> 
> The confusing part here is that the "allow" counterpart is called from
> a sysfs attribute.

Yes it is but the fact that allow then reduces usage count too even more 
so. I understand it's necessary for allowing the functionality but I hope 
you can see how illogical it sounds that usage suddenly is less because of 
an user action through sysfs, it just defies normal reasoning (no offense 
meant in any way to anyone :-)).

> > The docs just further reinforced the idea and the fact that it
> > also increments usage count.
> >
> > It is quite unexpected and feels quite illogical (for non-PM person like
> > me) that user interface then goes to reverse that usage count increase,
> > what would be the logical reason why there now are less users for it when
> > user wants to turn on PM? (I understand things are done that way, no need
> > to explain that further, but there are quite a few misleading things in
> > this entire scenario, not just that parenthesis part of the docs.)
> 
> So the purpose of this "forbid" call in pci_pm_init() is to "block"
> runtime PM for PCI devices by default, but allow user space to
> "unblock" it later.
> 
> Would adding a comment to that effect next to that call be useful?

It would be useful to improve the wording in PM documentation which is too 
ambiguous. I suggest changing this:

"void pm_runtime_forbid(struct device *dev);

unset the power.runtime_auto flag for the device and increase its
usage counter (used by the /sys/devices/.../power/control interface to 
effectively prevent the device from being power managed at run time).

to:

"... (used to prevent the device from being power managed at run time 
until pm_runtime_allow() or /sys/devices/.../power/control interface 
allows it)."


I have to admit I'd still end up thinking usage count remains 1 up at this 
point (if not knowing better like I do now) but at least it would
unambiguously tells what the interface does.


My another point related to whether that confusing usage count decrement 
(solely because of user action) is worth describing in the changelog of 
this patch (or v2 of it which follows your suggestion on moving enable 
locatoon). IMO, it's odd/confusing enough a note would be warranted.
I know it after this discussion, obviously, but I worry over others who 
trip over the same thing if somebody else has to look over this change.


-- 
 i.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-21 11:27           ` Ilpo Järvinen
  2025-10-21 12:53             ` Rafael J. Wysocki
@ 2025-10-21 18:13             ` Brian Norris
  2025-10-22 13:38               ` Ilpo Järvinen
  1 sibling, 1 reply; 16+ messages in thread
From: Brian Norris @ 2025-10-21 18:13 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Lukas Wunner, Bjorn Helgaas, LKML, linux-pm, Rafael J . Wysocki,
	linux-pci

Hi Ilpo,

On Tue, Oct 21, 2025 at 02:27:09PM +0300, Ilpo Järvinen wrote:
> On Mon, 20 Oct 2025, Brian Norris wrote:
> > On Mon, Oct 20, 2025 at 06:56:41PM +0300, Ilpo Järvinen wrote:
> > > On Fri, 17 Oct 2025, Brian Norris wrote:
> > > > On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > > > > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> > > > > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > > > > expected that user space may override this via sysfs [1].
> > > > > 
> > > > > Is this true as pm_runtime_forbid() also increases PM usage count?
> > > > 
> > > > Yes it's true. See below.
> > > > 
> > > > > "void pm_runtime_forbid(struct device *dev);
> > > > > 
> > > > > unset the power.runtime_auto flag for the device and increase its 
> > > > > usage counter (used by the /sys/devices/.../power/control interface to 
> > > > > effectively prevent the device from being power managed at run time)"
> > 
> > I see this doc line confused you, and I can sympathize.
> > 
> > IIUC, the parenthetical means that sysfs *uses* pm_runtime_forbid() to
> > "effectively prevent runtime power management"; pm_runtime_forbid() does
> > not block user space from doing anything.
> >
> > > > Right, but sysfs `echo auto > .../power/control` performs the inverse --
> > > > pm_runtime_allow() -- which decrements that count.
> > > 
> > > Fair enough, I didn't check what it does.
> > > 
> > > IMO, the details about how the usage count behaves should be part of the 
> > > changelog as that documentation I quoted sounded like user control is 
> > > prevented when forbidden.
> > 
> > I tried to elaborate on the API doc confusion above. But frankly, I'm
> > not sure how best to explain runtime PM.
> > 
> > > I see you've put this part of the explanation 
> > > into the v2 as well so I suggest you explain the usage count in the change 
> > > so it is recorded in the commit if somebody has to look at this commit 
> > > years from now.
> > 
> > Both v1 and v2 mention that the sysfs 'power/control' file can override
> > the kernel calling pm_runtime_forbid(). They don't mention the usage
> > count, since that's an implementation detail IMO. (To me, the mental
> > model works best if "usage count" (usually get()/put()) is considered
> > mostly orthogonal to forbid()/allow()/sysfs, because "forbid()" can be
> > overridden at any time.)
> > 
> > This is also covered here:
> > 
> > https://docs.kernel.org/power/runtime_pm.html#runtime-pm-initialization-device-probing-and-removal
> > 
> > "In principle, this mechanism may also be used by the driver to
> > effectively turn off the runtime power management of the device until
> > the user space turns it on."
> 
> The problem is already rooted into the function name, when a function is 
> called "forbid", anyone unfamiliar will think it really forbids 
> something. The docs just further reinforced the idea and the fact that it 
> also increments usage count.
> 
> It is quite unexpected and feels quite illogical (for non-PM person like 
> me) that user interface then goes to reverse that usage count increase, 
> what would be the logical reason why there now are less users for it when 
> user wants to turn on PM? (I understand things are done that way, no need 
> to explain that further, but there are quite a few misleading things in 
> this entire scenario, not just that parenthesis part of the docs.)

Ack. I'm definitely not defending the API names or the docs. I'm
frustrated by the same, by how long it took me to figure out what
everything really means, and by how the same footguns hit everyone I
work with, and we have to write these essays to explain it every time.

> > But admittedly, I find the runtime PM API surface to be enormous, and
> > its documentation ... not very helpful to outsiders. It's pretty much
> > written by and for PM experts. Case in point: you read and quoted the
> > appropriate docs, but it still misled you quite a bit :(
> > 
> > I'm more tempted to try to improve the runtime PM docs than to try to
> > make up for them in the commit message, but maybe I can be convinced
> > otherwise.
> 
> My personal approach is that if something comes up during review, it 
> likely is something also the next person to look at this change (from 
> history, maybe years from now) could similarly stumbles on when trying to 
> understand the change. Thus, it often is worth to mention such things in 
> the changelog too.

Yes, I like that approach too. Unfortunately in this case, it feels like
most of the confusion stems from the existing API and docs, and not from
anything in the description. I never even mentioned the usage count,
because frankly, I don't think it's relevant to the forbid()/allow()
topic. In fact, the whole bug is because forbid() does *not* guarantee
anything from the kernel's point of view -- it doesn't really hold a
"usage count" for functional correctness in the way get()/put() do.

Regardless, I can try to insert some language. Here's an attempt at yet
another footnote for a v3:

"""
[**] The relationship between pm_runtime_forbid(), pm_runtime_allow(),
/sys/.../power/control, and the runtime PM usage counter can be subtle.
It appears that the intention of pm_runtime_forbid() /
pm_runtime_allow() is twofold:

1. Allow the user to disable runtime_pm (force device to always be
   powered on) through sysfs.
2. Allow the driver to start with runtime_pm disabled (device forced
   on) and user space could later enable runtime_pm.

This conclusion comes from reading `Documentation/power/runtime_pm.rst`,
specifically the section starting "The user space can effectively
disallow".

This means that while pm_runtime_forbid() does technically increase the
runtime PM usage counter, this usage counter is not a guarantee of
functional correctness from the kernel's point of view, because user
space can decrease that count again via sysfs.
"""

Let me know what you think. Or feel free to tweak my v2 message and send
a rewrite back to me (seriously! I'd like to see an outsider's view on
what would explain it best).

(And if that footnote is useful: I think part of that could belong in
the actual docs.)

> While I'm definitely not against improvements to docs too, the changelog 
> for any patch should help to understand why the change was made. And 
> IMO, this unexpected "internal detail" related to usage count which is 
> quite significant here, if user interface wouldn't lower it, runtime PM 
> would remain forbidden as forbid() promised.

I think improving the docs would reap rewards. Like I said, this is by
far not the first time that developers have had significant problems due
to API confusion.

Brian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-21 13:18               ` Ilpo Järvinen
@ 2025-10-21 18:27                 ` Brian Norris
  2025-10-21 18:56                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 16+ messages in thread
From: Brian Norris @ 2025-10-21 18:27 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Rafael J. Wysocki, Lukas Wunner, Bjorn Helgaas, LKML, linux-pm,
	linux-pci

On Tue, Oct 21, 2025 at 04:18:54PM +0300, Ilpo Järvinen wrote:
> On Tue, 21 Oct 2025, Rafael J. Wysocki wrote:
> > So the purpose of this "forbid" call in pci_pm_init() is to "block"
> > runtime PM for PCI devices by default, but allow user space to
> > "unblock" it later.
> > 
> > Would adding a comment to that effect next to that call be useful?
> 
> It would be useful to improve the wording in PM documentation which is too 
> ambiguous. I suggest changing this:
> 
> "void pm_runtime_forbid(struct device *dev);
> 
> unset the power.runtime_auto flag for the device and increase its
> usage counter (used by the /sys/devices/.../power/control interface to 
> effectively prevent the device from being power managed at run time).
> 
> to:
> 
> "... (used to prevent the device from being power managed at run time 
> until pm_runtime_allow() or /sys/devices/.../power/control interface 
> allows it)."

Looks like a good change to me, even if just scratching the surface. If
this goes in a patch, you can add my:

Reviewed-by: Brian Norris <briannorris@chromium.org>

A separate problem that sorta stopped me from trying to rewrite some of
the Documentation/ is that we have both
Documentation/power/runtime_pm.rst and kerneldoc in
include/linux/pm_runtime.h + drivers/base/power/runtime.c. It doesn't
feel great having separate variations of the same API docs.

But hey, I shouldn't let "perfect" be the enemy of progress.

Brian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-21 18:27                 ` Brian Norris
@ 2025-10-21 18:56                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 16+ messages in thread
From: Rafael J. Wysocki @ 2025-10-21 18:56 UTC (permalink / raw)
  To: Brian Norris
  Cc: Ilpo Järvinen, Rafael J. Wysocki, Lukas Wunner,
	Bjorn Helgaas, LKML, linux-pm, linux-pci

On Tue, Oct 21, 2025 at 8:27 PM Brian Norris <briannorris@chromium.org> wrote:
>
> On Tue, Oct 21, 2025 at 04:18:54PM +0300, Ilpo Järvinen wrote:
> > On Tue, 21 Oct 2025, Rafael J. Wysocki wrote:
> > > So the purpose of this "forbid" call in pci_pm_init() is to "block"
> > > runtime PM for PCI devices by default, but allow user space to
> > > "unblock" it later.
> > >
> > > Would adding a comment to that effect next to that call be useful?
> >
> > It would be useful to improve the wording in PM documentation which is too
> > ambiguous. I suggest changing this:
> >
> > "void pm_runtime_forbid(struct device *dev);
> >
> > unset the power.runtime_auto flag for the device and increase its
> > usage counter (used by the /sys/devices/.../power/control interface to
> > effectively prevent the device from being power managed at run time).
> >
> > to:
> >
> > "... (used to prevent the device from being power managed at run time
> > until pm_runtime_allow() or /sys/devices/.../power/control interface
> > allows it)."
>
> Looks like a good change to me, even if just scratching the surface. If
> this goes in a patch, you can add my:
>
> Reviewed-by: Brian Norris <briannorris@chromium.org>
>
> A separate problem that sorta stopped me from trying to rewrite some of
> the Documentation/ is that we have both
> Documentation/power/runtime_pm.rst and kerneldoc in
> include/linux/pm_runtime.h + drivers/base/power/runtime.c. It doesn't
> feel great having separate variations of the same API docs.
>
> But hey, I shouldn't let "perfect" be the enemy of progress.

Documentation/power/runtime_pm.rst is generally outdated.

There was a plan to replace it with a new document mostly constructed
from pm_runtime.h and pm_runtime.c kerneldocs, but those also require
some work.

I would rather remove the reference to pm_runtime_forbid() from
Documentation/power/runtime_pm.rst entirely and make the other
documentation pieces describe it properly.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized
  2025-10-21 18:13             ` Brian Norris
@ 2025-10-22 13:38               ` Ilpo Järvinen
  0 siblings, 0 replies; 16+ messages in thread
From: Ilpo Järvinen @ 2025-10-22 13:38 UTC (permalink / raw)
  To: Brian Norris
  Cc: Lukas Wunner, Bjorn Helgaas, LKML, linux-pm, Rafael J . Wysocki,
	linux-pci

[-- Attachment #1: Type: text/plain, Size: 7489 bytes --]

On Tue, 21 Oct 2025, Brian Norris wrote:
> On Tue, Oct 21, 2025 at 02:27:09PM +0300, Ilpo Järvinen wrote:
> > On Mon, 20 Oct 2025, Brian Norris wrote:
> > > On Mon, Oct 20, 2025 at 06:56:41PM +0300, Ilpo Järvinen wrote:
> > > > On Fri, 17 Oct 2025, Brian Norris wrote:
> > > > > On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > > > > > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> > > > > > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > > > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > > > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > > > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > > > > > expected that user space may override this via sysfs [1].
> > > > > > 
> > > > > > Is this true as pm_runtime_forbid() also increases PM usage count?
> > > > > 
> > > > > Yes it's true. See below.
> > > > > 
> > > > > > "void pm_runtime_forbid(struct device *dev);
> > > > > > 
> > > > > > unset the power.runtime_auto flag for the device and increase its 
> > > > > > usage counter (used by the /sys/devices/.../power/control interface to 
> > > > > > effectively prevent the device from being power managed at run time)"
> > > 
> > > I see this doc line confused you, and I can sympathize.
> > > 
> > > IIUC, the parenthetical means that sysfs *uses* pm_runtime_forbid() to
> > > "effectively prevent runtime power management"; pm_runtime_forbid() does
> > > not block user space from doing anything.
> > >
> > > > > Right, but sysfs `echo auto > .../power/control` performs the inverse --
> > > > > pm_runtime_allow() -- which decrements that count.
> > > > 
> > > > Fair enough, I didn't check what it does.
> > > > 
> > > > IMO, the details about how the usage count behaves should be part of the 
> > > > changelog as that documentation I quoted sounded like user control is 
> > > > prevented when forbidden.
> > > 
> > > I tried to elaborate on the API doc confusion above. But frankly, I'm
> > > not sure how best to explain runtime PM.
> > > 
> > > > I see you've put this part of the explanation 
> > > > into the v2 as well so I suggest you explain the usage count in the change 
> > > > so it is recorded in the commit if somebody has to look at this commit 
> > > > years from now.
> > > 
> > > Both v1 and v2 mention that the sysfs 'power/control' file can override
> > > the kernel calling pm_runtime_forbid(). They don't mention the usage
> > > count, since that's an implementation detail IMO. (To me, the mental
> > > model works best if "usage count" (usually get()/put()) is considered
> > > mostly orthogonal to forbid()/allow()/sysfs, because "forbid()" can be
> > > overridden at any time.)
> > > 
> > > This is also covered here:
> > > 
> > > https://docs.kernel.org/power/runtime_pm.html#runtime-pm-initialization-device-probing-and-removal
> > > 
> > > "In principle, this mechanism may also be used by the driver to
> > > effectively turn off the runtime power management of the device until
> > > the user space turns it on."
> > 
> > The problem is already rooted into the function name, when a function is 
> > called "forbid", anyone unfamiliar will think it really forbids 
> > something. The docs just further reinforced the idea and the fact that it 
> > also increments usage count.
> > 
> > It is quite unexpected and feels quite illogical (for non-PM person like 
> > me) that user interface then goes to reverse that usage count increase, 
> > what would be the logical reason why there now are less users for it when 
> > user wants to turn on PM? (I understand things are done that way, no need 
> > to explain that further, but there are quite a few misleading things in 
> > this entire scenario, not just that parenthesis part of the docs.)
> 
> Ack. I'm definitely not defending the API names or the docs. I'm
> frustrated by the same, by how long it took me to figure out what
> everything really means, and by how the same footguns hit everyone I
> work with, and we have to write these essays to explain it every time.
> 
> > > But admittedly, I find the runtime PM API surface to be enormous, and
> > > its documentation ... not very helpful to outsiders. It's pretty much
> > > written by and for PM experts. Case in point: you read and quoted the
> > > appropriate docs, but it still misled you quite a bit :(
> > > 
> > > I'm more tempted to try to improve the runtime PM docs than to try to
> > > make up for them in the commit message, but maybe I can be convinced
> > > otherwise.
> > 
> > My personal approach is that if something comes up during review, it 
> > likely is something also the next person to look at this change (from 
> > history, maybe years from now) could similarly stumbles on when trying to 
> > understand the change. Thus, it often is worth to mention such things in 
> > the changelog too.
> 
> Yes, I like that approach too. Unfortunately in this case, it feels like
> most of the confusion stems from the existing API and docs, and not from
> anything in the description. I never even mentioned the usage count,
> because frankly, I don't think it's relevant to the forbid()/allow()
> topic. In fact, the whole bug is because forbid() does *not* guarantee
> anything from the kernel's point of view -- it doesn't really hold a
> "usage count" for functional correctness in the way get()/put() do.
> 
> Regardless, I can try to insert some language. Here's an attempt at yet
> another footnote for a v3:
> 
> """
> [**] The relationship between pm_runtime_forbid(), pm_runtime_allow(),
> /sys/.../power/control, and the runtime PM usage counter can be subtle.
> It appears that the intention of pm_runtime_forbid() /
> pm_runtime_allow() is twofold:
> 
> 1. Allow the user to disable runtime_pm (force device to always be
>    powered on) through sysfs.
> 2. Allow the driver to start with runtime_pm disabled (device forced
>    on) and user space could later enable runtime_pm.
> 
> This conclusion comes from reading `Documentation/power/runtime_pm.rst`,
> specifically the section starting "The user space can effectively
> disallow".
> 
> This means that while pm_runtime_forbid() does technically increase the
> runtime PM usage counter, this usage counter is not a guarantee of
> functional correctness from the kernel's point of view, because user
> space can decrease that count again via sysfs.
> """
> 
> Let me know what you think.

I'm fine with that, it explains well where the problem stems from.
Thanks.

--
 i.

> Or feel free to tweak my v2 message and send
> a rewrite back to me (seriously! I'd like to see an outsider's view on
> what would explain it best).
>
> (And if that footnote is useful: I think part of that could belong in
> the actual docs.)
> 
> > While I'm definitely not against improvements to docs too, the changelog 
> > for any patch should help to understand why the change was made. And 
> > IMO, this unexpected "internal detail" related to usage count which is 
> > quite significant here, if user interface wouldn't lower it, runtime PM 
> > would remain forbidden as forbid() promised.
> 
> I think improving the docs would reap rewards. Like I said, this is by
> far not the first time that developers have had significant problems due
> to API confusion.
> 
> Brian
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-10-22 13:38 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-16 22:53 [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized Brian Norris
2025-10-17  8:32 ` Lukas Wunner
2025-10-17 11:49   ` Ilpo Järvinen
2025-10-17 17:43     ` Brian Norris
2025-10-17 19:20       ` Brian Norris
2025-10-20 15:56       ` Ilpo Järvinen
2025-10-20 18:52         ` Brian Norris
2025-10-21 11:27           ` Ilpo Järvinen
2025-10-21 12:53             ` Rafael J. Wysocki
2025-10-21 13:18               ` Ilpo Järvinen
2025-10-21 18:27                 ` Brian Norris
2025-10-21 18:56                   ` Rafael J. Wysocki
2025-10-21 18:13             ` Brian Norris
2025-10-22 13:38               ` Ilpo Järvinen
2025-10-17  9:45 ` Rafael J. Wysocki
2025-10-17 17:11   ` Brian Norris

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).