All of lore.kernel.org
 help / color / mirror / Atom feed
From: Pranjal Shrivastava <praan@google.com>
To: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>,
	Robin Murphy <robin.murphy@arm.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	Nicolin Chen <nicolinc@nvidia.com>,
	Mostafa Saleh <smostafa@google.com>,
	Daniel Mentz <danielmentz@google.com>,
	iommu@lists.linux.dev
Subject: Re: [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended()
Date: Thu, 10 Jul 2025 08:59:55 +0000	[thread overview]
Message-ID: <aG-BC6B4R1ViEDM1@google.com> (raw)
In-Reply-To: <CAJZ5v0jyYpu=qA-_QKVoNhzRZRNKe5cFt718fo6pxR6iRkD2bw@mail.gmail.com>

On Wed, Jul 09, 2025 at 09:37:58PM +0200, Rafael J. Wysocki wrote:
> On Wed, Jul 9, 2025 at 7:06 PM Pranjal Shrivastava <praan@google.com> wrote:
> >
> > On Wed, Jul 09, 2025 at 06:35:41PM +0200, Rafael J. Wysocki wrote:
> > > On Wed, Jul 9, 2025 at 5:51 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > >
> > > > On Wed, Jul 09, 2025 at 08:44:06AM +0200, Rafael J. Wysocki wrote:
> > > > > On Mon, Jun 16, 2025 at 10:32 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > > > >
> > > > > > The existing opportunistic helpers like pm_runtime_get_if_active() and
> > > > > > pm_runtime_get_if_in_use(), are too strict for certain use cases. They
> > > > > > fail if the device is in a transient state like RPM_SUSPENDING, which
> > > > > > can lead to drivers making incorrect assumptions about the dev's state.
> > > > > >
> > > > > > These helpers don't suffice for cases where one wishes to elide HW clean
> > > > > > up like queue flushes or TLB invalidations if the device is powered off.
> > > > > > It is wasteful to wake up the device in cases where the resume callback
> > > > > > resets the device well OR if the HW resets in a clean state. Thus, if a
> > > > >ppy to hear your feedback or any alternative ideas you might have > device is powered off, it is preferred to elide any clean-up HW ops like
> > > > > > queue flushes / TLB invalidations when the device will resume afresh.
> > > > > >
> > > > > > Consider the following sequence of operations:
> > > > > >
> > > > > > 1. The device is in `RPM_SUSPENDING` state
> > > > > > 2. The driver calls pm_runtime_get_if_active/in_use
> > > > > > 3. Depending on these API, the driver elides a HW clean-up op like:
> > > > > >
> > > > > > if (pm_runtime_get_if_in_active(dev))
> > > > > >         invalidate_tlb(dev);
> > > > > > else
> > > > > >         // Skip flush, assuming device will fully suspend
> > > > > >
> > > > > > 4. Now, another rpm dev-linked device wakes up, causing the device's
> > > > > >    state to bounce from RPM_SUSPENDING to RPM_ACTIVE without invoking
> > > > > >    any rpm callbacks, preventing them from resetting the dev correctly.
> > > > >
> > > > > This never happens.
> > > > >
> > > > > If the status is RPM_SUSPENDING, the runtime suspend callback will be invoked.
> > > >
> > > > Ack, I believe we're talking about the check[1] in rpm_resume here:
> > >
> > > No, I'm talking about the fact that the status is changed to
> > > RPM_SUSPENDING right before invoking the callback.
> > >
> >
> > Right, but if the callback returns -EAGAIN or something, in
> > rpm_suspend() we go ahead and set the status to RPM_ACTIVE again
> > on jumping to the `fail` label[1].
> >
> > > >         if (dev->power.runtime_status == RPM_RESUMING ||
> > > >             dev->power.runtime_status == RPM_SUSPENDING) {
> > > >
> > > >             [...]
> > > >             /* Wait for the operation carried out in parallel with us. */
> > > >             [...]
> > > >             finish_wait(&dev->power.wait_queue, &wait);
> > > >             goto repeat;
> > > >
> > > > However, my worry is about the following situation:
> > > >
> > > >   rpm_suspend                      rpm_resume
> > > >
> > > >   rpm_status = RPM_SUSPENDING
> > > >                                    if (RPM_SUSPENDING)
> > > >                                    prepare_to wait(...)
> > > >
> > > >   // suspend fails
> > > >   retval = rpm_callback(...);
> > >
> > > You're talking about the callback returning an error and your
> > > changelog is talking about a different situation.
> >
> > Apologies, I should've been more clear. What I meant was for a situation
> > where due to *any* reason (except disabling runtime PM) we bounce back
> > from RPM_SUSPENDING to RPM_ACTIVE *without* invoking the resume callback
> 
> This only can happen if the suspend callback returns an error, but I'm
> not sure why and how this matters.
> 
> pm_runtime_get_if_active() does not guarantee anything in the case
> when 0 is returned anyway.
> 
> > > >   goto fail;
> > > >   [...]
> > > > fail:
> > > >   rpm_status = RPM_ACTIVE;
> > > >                                    finish_wait(...);
> > > >                                    goto repeat;
> > > >                                    repeat:
> > > >                                    if (rpm_status == RPM_ACTIVE) {
> > > >                                         retval = 1;
> > > >                                         goto out;
> > > >                                    }
> > > >                                 out:
> > > >                                    [ ... ] // put_parent if one
> > > >
> > > >                                    // we return without resume cb();
> > > >                                    return retval;
> > > >
> > > > Now, if we rely on APIs based on pm_runtime_get_conditional(), which
> > > > might return 0 if (dev->power.runtime_status != RPM_ACTIVE), we might
> > > > end up eliding some TLB invalidations in the period where the status was
> > > > RPM_SUSPENDING till it's back to RPM_ACTIVE due to a failed suspend.
> > >
> > > It actually depends on the reason why the callback returned an error.
> > >
> >
> > I'm not sure if I follow.. as per rpm_suspend[1] I see that upon failing
> > (which also happens on getting a non-zero retval from the suspend_cb) we
> > set the runtime status to RPM_ACTIVE.
> 
> Yes, it just goes back to the status from before the failure because
> when the error code is -EAGAIN or -EBUSY, the status should be still
> RPM_ACTIVE and otherwise runtime_error is set and the device likely
> requires some help.
> 
> > > > This can cause some severe address-aliasing/ghost hit issues since the
> > > > TLB still has some stale entries..
> > >
> > > I'm not quite sure what you mean.
> > >
> >
> > I meant if the TLB invalidations were elided (i.e. TLB still had stale
> > entries) in hope that the IOMMU would be suspended, but due to some
> > reason the suspend failed and the status gets back to RPM_ACTIVE while
> > exiting the rpm_suspend call and a client wakes up and performs a DMA
> > (IOMMU transaction), the TLB entry might hit for an address which was
> > supposed to be invalidated by now.
> 
> So as I said this is just one reason why you cannot rely on runtime PM
> to guarantee that the device will remain suspended, or in fact whether
> or not it will be suspended at all.
> 
> > > > Instead, we'd like to ensure that the status IS suspended
> > >
> > > But you can't.
> > >
> > > That's the difference between RPM_ACTIVE and RPM_SUSPENDED.  If you
> > > bump up the runtime PM usage counter while the device is RPM_ACTIVE
> > > (and while holding its power.spinlock), it will not suspend.  There's
> > > no way to prevent a suspended device from resuming (other than
> > > disabling runtime PM for it).
> > >
> >
> > The problem is avoiding bumping up the count while the device is in a
> > transient state like RPM_SUSPENDING and then bouncing back to the
> > RPM_ACTIVE state without invoking RPM resume callback, which seems to be
> > possible as per the rpm_suspend implementation [1].
> 
> I'm not sure what you mean here.
> 
> The usage counter is bumped up by pm_runtime_get_if_active() only if
> the device is RPM_ACTIVE in which case it will guarantee that the
> runtime PM status will not change.  There is no way to provide a
> similar guarantee on the RPM_SUSPENDED side short of disabling runtime
> PM.
> 
> > > > to make a decision to elide the TLBI or not.
> > > >
> > > > What are your thoughts on this? I'm open to approaching it differently.
> > >
> > > IMV this is all misguided, sorry.
> > >
> >
> > Ack. But I'd want to understand better here. Are you saying that it
> > isn't at all possible that RPM_SUSPENDING bounces back to RPM_ACTIVE
> > without invoking the rpm_resume ever? Because per the following snippet,
> > it seems likely:
> >
> >         __update_runtime_status(dev, RPM_SUSPENDING);
> >
> >         callback = RPM_GET_CALLBACK(dev, runtime_suspend);
> >
> >         dev_pm_enable_wake_irq_check(dev, true);
> >         retval = rpm_callback(callback, dev);
> >         if (retval)
> >                 goto fail;
> >
> >         [...]
> >
> > fail:
> >         dev_pm_disable_wake_irq_check(dev, true);
> >         __update_runtime_status(dev, RPM_ACTIVE);
> >         dev->power.deferred_resume = false;
> >         wake_up_all(&dev->power.wait_queue);
> >         [...]
> >
> > Sorry if I'm missing something here, but it does look like we can bounce
> > back to RPM_ACTIVE without invoking the resume callback, which is a
> > behaviour we'd like to avoid.
> 
> So let me repeat: This only happens if the suspend callback returns an
> error and in that case it is not clear whether or not the resume
> callback needs to be invoked (either it doesn't or the device can be
> assumed to be in a state in which invoking that callback will not help
> either way).
> 

Alright. My goal here is to ensure that TLB invalidations are elided
only when the device (IOMMU) is suspended. For that I believed we 
could do either of the following:

1. Ensure RPM_SUSPENDING -> RPM_ACTIVE transition *always* invokes the
resume callback (this is because the resume cb flushes the entire TLB).

2. Ensure that we are able to reliably *get* a PM ref if the device is
NOT suspended, i.e. even if it was in a transient state.

But I guess that maybe the new API (option 2) isn't the right way to go.

I'm assuming you mean that if we design the suspend callback in a way
that it doesn't return error, we can reliably ensure that the resume
callback will be called before transitioning to the RPM_ACTIVE state? 

[...]

Thanks,
Praan

  reply	other threads:[~2025-07-10  9:00 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-16 20:31 [RFC PATCH v3 0/8] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 1/8] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
2025-07-08 15:15   ` Mostafa Saleh
2025-06-16 20:31 ` [RFC PATCH v3 2/8] iommu/arm-smmu-v3: Add a helper to drain cmd queues Pranjal Shrivastava
2025-07-08 15:32   ` Mostafa Saleh
2025-07-14  9:24     ` Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 3/8] iommu/tegra241-cmdqv: Add a helper to drain VCMDQs Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 4/8] iommu/arm-smmu-v3: Cache and restore MSI config Pranjal Shrivastava
2025-07-08 15:34   ` Mostafa Saleh
2025-07-14  9:01     ` Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 5/8] pm: runtime: Introduce pm_runtime_get_if_not_suspended() Pranjal Shrivastava
2025-06-17  9:19   ` kernel test robot
2025-07-08 22:00   ` Nicolin Chen
2025-07-09 15:51     ` Pranjal Shrivastava
2025-07-09  6:44   ` Rafael J. Wysocki
2025-07-09 15:51     ` Pranjal Shrivastava
2025-07-09 16:35       ` Rafael J. Wysocki
2025-07-09 17:06         ` Pranjal Shrivastava
2025-07-09 19:37           ` Rafael J. Wysocki
2025-07-10  8:59             ` Pranjal Shrivastava [this message]
2025-07-10 10:29               ` Rafael J. Wysocki
2025-07-11 10:20                 ` Pranjal Shrivastava
2025-07-15 23:52                   ` Daniel Mentz
2025-07-16 12:53                     ` Rafael J. Wysocki
2025-07-21 12:44                       ` Will Deacon
2025-06-16 20:31 ` [RFC PATCH v3 6/8] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 7/8] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
2025-06-16 20:31 ` [RFC PATCH v3 8/8] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aG-BC6B4R1ViEDM1@google.com \
    --to=praan@google.com \
    --cc=danielmentz@google.com \
    --cc=iommu@lists.linux.dev \
    --cc=jgg@ziepe.ca \
    --cc=joro@8bytes.org \
    --cc=nicolinc@nvidia.com \
    --cc=rafael@kernel.org \
    --cc=robin.murphy@arm.com \
    --cc=smostafa@google.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.