From: Pranjal Shrivastava <praan@google.com>
To: Daniel Mentz <danielmentz@google.com>
Cc: iommu@lists.linux.dev, Will Deacon <will@kernel.org>,
Joerg Roedel <joro@8bytes.org>,
Robin Murphy <robin.murphy@arm.com>,
Jason Gunthorpe <jgg@ziepe.ca>,
Mostafa Saleh <smostafa@google.com>,
Nicolin Chen <nicolinc@nvidia.com>,
Ashish Mhetre <amhetre@nvidia.com>,
linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v8 07/12] iommu/arm-smmu-v3: Add CMDQ_PROD_STOP_FLAG to gate CMDQ submissions
Date: Tue, 9 Jun 2026 10:05:28 +0000 [thread overview]
Message-ID: <aiflaI4svEJvZbsC@google.com> (raw)
In-Reply-To: <CAE2F3rDZupDgq5tQrvpuixGi-kVs+oi-Tacqa=Rxtri03LBPHQ@mail.gmail.com>
On Mon, Jun 08, 2026 at 09:20:42PM -0700, Daniel Mentz wrote:
> On Sun, Jun 7, 2026 at 11:19 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > >
> > > > static void queue_poll_init(struct arm_smmu_device *smmu,
> > > > @@ -718,8 +719,25 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> > > > do {
> > > > u64 old;
> > > >
> > > > + /*
> > > > + * If the SMMU is suspended/suspending, any new CMDs are elided.
> > > > + * This loop is the Point of Commitment. If we haven't cmpxchg'd
> > > > + * our new indices yet, we can safely bail. Once the indices are
> > > > + * committed, we MUST write valid commands to those slots to
> > > > + * avoid indefinite polling in the drain function.
> > > > + */
> > > > + if (Q_STOP(llq.prod)) {
> > > > + local_irq_restore(flags);
> > > > + return 0;
> > >
> > > On second thought, I no longer believe that this is safe. I understand
> > > that READ_ONCE(cmdq->q.llq.val) implies no ordering guarantees with
> > > respect to any writes to translation tables. In the non-stopped case,
> > > we can rely on the call to dma_wmb() further down in this function.
> > > However, for the stopped case, I can't identify any barriers that
> > > would ensure that the STOP flag is checked only after the writes are
> > > visible to SMMU. Here is an example: Let's assume the following
> > > program order:
> > >
> > > * Write to invalidate PTE
> > > * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> > >
> > > What prevents the CPU from reordering these operations as follows?
> > >
> > > * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> > > * Write to invalidate PTE
>
> Now, I believe that the smp_mb() call in arm_smmu_domain_inv_range()
> actually prevents this (if we remove arm_smmu_cmdq_can_elide or move
> it to after the smp_mb() call)
>
> > >
> > > Can the following situation occur?
> >
> > Not really because:
> >
> > >
> > > * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> > > * (Different CPU resumes SMMU)
> >
> > We do a full smp_mb() here to ensure all writes are seen before SMMUEN=1
> > (note that the STOP_FLAG is already unset at that point, so we stop
> > eliding commands much before the SMMU is physically able to access any
> > config/xlation structures, I've explained everything below).
> >
> > > * SMMU loads old PTE value into TLB
> > > * Write to invalidate PTE
> > > * (stale PTE remains in TLB)
> > >
> > > I propose the following: If you find the STOP flag set, run dma_mb()
> > > and check again. I'm afraid that running dma_mb() unconditionally
> > > might incur too much of a performance penalty.
> >
> > IMHO, I think this might be overcomplicating things here..
> > Here's why the current version works according to me:
> >
> > Till SMMUEN=0, the SMMU is spec-guaranteed to never access translation
> > or configuration structures (Section 6.3.9.6).
> >
> > Our runtime_suspend callback sets SMMUEN=0 before setting the STOP_FLAG.
>
> Does "before" refer to program order or the order in which other
> agents can observe the STOP_FLAG relative to when SMMUEN=0 is set?
> Your program might read "Set SMMUEN=0 then set STOP_FLAG". However, my
> question is whether other observers might see the STOP_FLAG before
> SMMUEN is cleared. I couldn't identify any barriers in
> arm_smmu_runtime_suspend() that would prevent this type of
> re-ordering.
Hmm.. that's right. I guess I was wrongly relying on the fact that
_relaxed() variants are guaranteed to be ordered amongst themselves on
the same CPU but reading the barriers documentation [1] I see:
These are similar to readX() and writeX(),
[...] but they are still guaranteed to be ordered with respect to other
accesses from the same CPU thread to the same peripheral when operating
on __iomem pointers mapped with the default I/O attributes.
Hence, it seems it is possible for RAM writes to be re-ordered
before the writeX_relaxed(MMIO).
In that case, I guess we could just use the non-relaxed version to set
the STOP_FLAG...
>
> > Even if the worker CPU reorders the PTE write after the STOP_FLAG check,
> > it is benign because the SMMU is incapable of fetching that (or any) PTE
> > while the gate is closed. Because GATE_CLOSED == SMMUEN = 0, implying no
> > access to any HW structures whatsoever.
> >
> > The real synchronization happens in the Resume Path:
> >
> > 1. arm_smmu_device_reset() clears all caches / TLBs.
> > (None of these can have entries before SMMUEN=1)
> >
> > 2. We execute a full smp_mb() before setting SMMUEN=1. (that's why we
> > need smp_mb before SMMUEN=1). This barrier ensures that any PTE
> > writes made by any thread—including those that were elided while the
> > gate was closed, are globally visible before the SMMU hardware starts
> > fetching into TLBs again. (This is why Jason suggested this in v6 [1])
>
> A barrier on one CPU has no bearing on whether writes by any other CPU
> can be observed by any particular agent in the system.
>
> Let's compare this with the long comment in
> arm_smmu_domain_inv_range() which is what I believe Jason was
> referring to. In that example, you see smp_mb() in the code path on
> CPU0 and dma_wmb() in the code path on CPU1. Hence, barriers exist on
> both sides. If you compare the runtime PM design with
> arm_smmu_domain_inv_range(), then smp_mb() belongs in the CPU thread
> that performs the translation table updates not the one that performs
> the suspend/resume operation.
>
I might be missing something here, so please bear with me. My
understanding it that's needed because the IOMMU is live & actively
caching, which is not true for our case.
[Assuming we use non-relaxed semantics & ordering for the STOP flag,
i.e. set STOP_FLAG + barrier & clear STOP_FLAG (implicit dma_wmb())]
In our case, during the resume op, we first clear the STOP_FLAG before
setting SMMUEN=1 in program order. Thus, any PTE invalidations occurring
before SMMUEN=1 are executed, i.e. EVEN when the SMMU is guaranteed not
to access any structures, we've resumed invalidations.
Let's consider a few examples:
1. SUSPEND (say CPU0 is suspending)
[CPU0] SMMUEN = 0 ==> SMMU stops accessing HW structures (ABORT NOT set)
HW structures not accessed means no TLB / CFG
cache accesses as well according to the spec.
[CPU1] ==> PTE update => Invalidate => Succeeds (although SMMUEN = 0)
[CPU0] GBPA.Abort set ==> Txns are blocked
[CPU2] => PTE update => Invalidate => Succeeds [Txns blocked + SMMUEN=0]
[CPU0] ==> SET STOP_FLAG ==> Elision begins
[CPU3] ==> PTE update ==> Invalidation ==> Elided [Txns blocked + SMMUEN=0]
Hence, the races in the suspend sequence are handled correctly.
2. RESUME (say CPU0 is resuming)
[CPU1] ==> Update PTE ==> Invalidate ==> Elided [Txns blocked + SMMUEN=0]
[CPU0] ==> Clear STOP_FLAGs [Txns still blocked + SMMUEN=0]
[CPU2] ==> Update PTE ==> Invalidate ==> Succeeds [Txns blocked + SMMUEN=0]
[CPU0] ==> Invalidate all TLB ==> Succeeds [Txns still blocked + SMMUEN=0]
[CPU0] ==> Invalidate all CFG ==> Succeeds [Txns still blocked + SMMUEN=0]
[CPU2] ==> Update PTE ==> Invalidate ==> Succeeds [Txns still blocked + SMMUEN=0]
[CPU0] ==> Set SMMUEN = 1 [SMMU can now access in memory structures]
However, the TLBs and CFG caches are clean because everything
until this point couldn't have cached anything anyway.
Hence, right after clearing the STOP_FLAG, we're taking in invalidations
as normal in the resume, much before the real caching can begin.
Thus, by resuming invalidations before SMMUEN=1, we guarantee a
consistent state before the very first translation is performed.
Apart from this, I guess I'll drop the can_elide check from all
invalidation paths.
Does that sound fine?
> Actually, now that I think of it: We could just piggy pack on the
> existing smp_mb() in arm_smmu_domain_inv_range(). For your
> arm_smmu_cmdq_can_elide() to work correctly, though, you would need to
> move it to *after* the smp_mb() call. This being said, I would prefer
> to remove arm_smmu_cmdq_can_elide() as I wrote on the other thread.
> If we do take advantage of the existing smp_mb() in
> arm_smmu_domain_inv_range(), then we should probably update that
> comment to state that the runtime PM implementation now also depends
> on it.
>
> >
> > Adding a dma_mb() to the elision path would be redundant given the
> > SMMU can't access any structures and the resume barrier. We'd be
> > needlessly injecting dma_mbs when no entity is actually accessing those
> > structures while the STOP_FLAG is set.
> > >
> > > I have the same concerns with arm_smmu_cmdq_can_elide(): That
> > > READ_ONCE in arm_smmu_cmdq_can_elide() provides no guarantees in
> > > regards to ordering relative to PTE writes.
> > >
> >
> > The same as above, No structures are accessed during SMMUEN=0 (spec) and
> > during resume before setting SMMUEN=1, we do a full smp_mb() to ensure
> > all concurrent PTE writes are acquired in our resume thread before we
> > enable SMMUEN=1
>
> I think the definition of "all concurrent PTE writes" is a bit vague.
> I don't think the architecture allows you to wait for writes performed
> by other CPUs to be observable by all other agents. The only thing you
> can do is to ensure that other agents can observe that the STOP_FLAG
> has been cleared *before* SMMUEN is set. However, this is already
> achieved by the existing dma_wmb() in arm_smmu_cmdq_issue_cmdlist().
> Hence, the extra smp_mb() is not required.
>
Right, I meant that the smp_mb was needed for the STOP_FLAG is visible to
other CPUs performing PTE writes.. but I agree that the first CFGI
invalidation's dma_wmb() would ensure the STOP_FLAG is visible to all
other CPUs, I'll drop this and add a comment explaining this.
Thanks,
Praan
next prev parent reply other threads:[~2026-06-09 10:05 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-01 21:58 [PATCH v8 00/10] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
2026-06-01 21:58 ` [PATCH v8 01/12] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
2026-06-01 21:58 ` [PATCH v8 02/12] iommu/arm-smmu-v3: Add a helper to drain cmd queues Pranjal Shrivastava
2026-06-02 0:12 ` Nicolin Chen
2026-06-02 3:28 ` Pranjal Shrivastava
2026-06-02 5:21 ` Daniel Mentz
2026-06-01 21:59 ` [PATCH v8 03/12] iommu/tegra241-cmdqv: Add a helper to drain VCMDQs Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 04/12] iommu/tegra241-cmdqv: Restore PROD and CONS after resume Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 05/12] iommu/arm-smmu-v3: Cache and restore MSI config Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 06/12] iommu/arm-smmu-v3: Handle gerror during suspend Pranjal Shrivastava
2026-06-02 0:15 ` Nicolin Chen
2026-06-02 3:31 ` Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 07/12] iommu/arm-smmu-v3: Add CMDQ_PROD_STOP_FLAG to gate CMDQ submissions Pranjal Shrivastava
2026-06-07 21:42 ` Daniel Mentz
2026-06-08 6:19 ` Pranjal Shrivastava
2026-06-09 4:20 ` Daniel Mentz
2026-06-09 10:05 ` Pranjal Shrivastava [this message]
2026-06-01 21:59 ` [PATCH v8 08/12] iommu/tegra241-cmdqv: Add a helper to quiesce VCMDQs Pranjal Shrivastava
2026-06-02 0:14 ` Nicolin Chen
2026-06-02 3:37 ` Pranjal Shrivastava
2026-06-02 5:59 ` Nicolin Chen
2026-06-02 6:21 ` Pranjal Shrivastava
2026-06-02 6:29 ` Nicolin Chen
2026-06-01 21:59 ` [PATCH v8 09/12] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
2026-06-02 5:25 ` Daniel Mentz
2026-06-02 12:12 ` Pranjal Shrivastava
2026-06-07 22:36 ` Daniel Mentz
2026-06-02 15:27 ` Daniel Mentz
2026-06-09 10:13 ` Pranjal Shrivastava
2026-06-07 21:53 ` Daniel Mentz
2026-06-09 10:12 ` Pranjal Shrivastava
2026-06-07 22:30 ` Daniel Mentz
2026-06-09 10:09 ` Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 10/12] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 11/12] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava
2026-06-02 0:24 ` Nicolin Chen
2026-06-02 3:59 ` Pranjal Shrivastava
2026-06-02 5:51 ` Nicolin Chen
2026-06-02 6:24 ` Pranjal Shrivastava
2026-06-03 20:28 ` Daniel Mentz
2026-06-04 6:27 ` Pranjal Shrivastava
2026-06-07 22:22 ` Daniel Mentz
2026-06-09 10:34 ` Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 12/12] iommu/arm-smmu-v3: Add KUnit unit tests for Runtime PM Pranjal Shrivastava
2026-06-02 6:03 ` [PATCH v8 00/10] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Nicolin Chen
2026-06-02 12:04 ` Pranjal Shrivastava
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aiflaI4svEJvZbsC@google.com \
--to=praan@google.com \
--cc=amhetre@nvidia.com \
--cc=danielmentz@google.com \
--cc=iommu@lists.linux.dev \
--cc=jgg@ziepe.ca \
--cc=joro@8bytes.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=nicolinc@nvidia.com \
--cc=robin.murphy@arm.com \
--cc=smostafa@google.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.