From: Pranjal Shrivastava <praan@google.com>
To: Daniel Mentz <danielmentz@google.com>
Cc: iommu@lists.linux.dev, Will Deacon <will@kernel.org>,
Joerg Roedel <joro@8bytes.org>,
Robin Murphy <robin.murphy@arm.com>,
Jason Gunthorpe <jgg@ziepe.ca>,
Mostafa Saleh <smostafa@google.com>,
Nicolin Chen <nicolinc@nvidia.com>,
Ashish Mhetre <amhetre@nvidia.com>,
linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v8 07/12] iommu/arm-smmu-v3: Add CMDQ_PROD_STOP_FLAG to gate CMDQ submissions
Date: Tue, 9 Jun 2026 10:05:28 +0000 [thread overview]
Message-ID: <aiflaI4svEJvZbsC@google.com> (raw)
In-Reply-To: <CAE2F3rDZupDgq5tQrvpuixGi-kVs+oi-Tacqa=Rxtri03LBPHQ@mail.gmail.com>
On Mon, Jun 08, 2026 at 09:20:42PM -0700, Daniel Mentz wrote:
> On Sun, Jun 7, 2026 at 11:19 PM Pranjal Shrivastava <praan@google.com> wrote:
> > > >
> > > > static void queue_poll_init(struct arm_smmu_device *smmu,
> > > > @@ -718,8 +719,25 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> > > > do {
> > > > u64 old;
> > > >
> > > > + /*
> > > > + * If the SMMU is suspended/suspending, any new CMDs are elided.
> > > > + * This loop is the Point of Commitment. If we haven't cmpxchg'd
> > > > + * our new indices yet, we can safely bail. Once the indices are
> > > > + * committed, we MUST write valid commands to those slots to
> > > > + * avoid indefinite polling in the drain function.
> > > > + */
> > > > + if (Q_STOP(llq.prod)) {
> > > > + local_irq_restore(flags);
> > > > + return 0;
> > >
> > > On second thought, I no longer believe that this is safe. I understand
> > > that READ_ONCE(cmdq->q.llq.val) implies no ordering guarantees with
> > > respect to any writes to translation tables. In the non-stopped case,
> > > we can rely on the call to dma_wmb() further down in this function.
> > > However, for the stopped case, I can't identify any barriers that
> > > would ensure that the STOP flag is checked only after the writes are
> > > visible to SMMU. Here is an example: Let's assume the following
> > > program order:
> > >
> > > * Write to invalidate PTE
> > > * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> > >
> > > What prevents the CPU from reordering these operations as follows?
> > >
> > > * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> > > * Write to invalidate PTE
>
> Now, I believe that the smp_mb() call in arm_smmu_domain_inv_range()
> actually prevents this (if we remove arm_smmu_cmdq_can_elide or move
> it to after the smp_mb() call)
>
> > >
> > > Can the following situation occur?
> >
> > Not really because:
> >
> > >
> > > * Read from cmdq->q.llq.val, determine STOP flag is set, elide TLBI
> > > * (Different CPU resumes SMMU)
> >
> > We do a full smp_mb() here to ensure all writes are seen before SMMUEN=1
> > (note that the STOP_FLAG is already unset at that point, so we stop
> > eliding commands much before the SMMU is physically able to access any
> > config/xlation structures, I've explained everything below).
> >
> > > * SMMU loads old PTE value into TLB
> > > * Write to invalidate PTE
> > > * (stale PTE remains in TLB)
> > >
> > > I propose the following: If you find the STOP flag set, run dma_mb()
> > > and check again. I'm afraid that running dma_mb() unconditionally
> > > might incur too much of a performance penalty.
> >
> > IMHO, I think this might be overcomplicating things here..
> > Here's why the current version works according to me:
> >
> > Till SMMUEN=0, the SMMU is spec-guaranteed to never access translation
> > or configuration structures (Section 6.3.9.6).
> >
> > Our runtime_suspend callback sets SMMUEN=0 before setting the STOP_FLAG.
>
> Does "before" refer to program order or the order in which other
> agents can observe the STOP_FLAG relative to when SMMUEN=0 is set?
> Your program might read "Set SMMUEN=0 then set STOP_FLAG". However, my
> question is whether other observers might see the STOP_FLAG before
> SMMUEN is cleared. I couldn't identify any barriers in
> arm_smmu_runtime_suspend() that would prevent this type of
> re-ordering.
Hmm.. that's right. I guess I was wrongly relying on the fact that
_relaxed() variants are guaranteed to be ordered amongst themselves on
the same CPU but reading the barriers documentation [1] I see:
These are similar to readX() and writeX(),
[...] but they are still guaranteed to be ordered with respect to other
accesses from the same CPU thread to the same peripheral when operating
on __iomem pointers mapped with the default I/O attributes.
Hence, it seems it is possible for RAM writes to be re-ordered
before the writeX_relaxed(MMIO).
In that case, I guess we could just use the non-relaxed version to set
the STOP_FLAG...
>
> > Even if the worker CPU reorders the PTE write after the STOP_FLAG check,
> > it is benign because the SMMU is incapable of fetching that (or any) PTE
> > while the gate is closed. Because GATE_CLOSED == SMMUEN = 0, implying no
> > access to any HW structures whatsoever.
> >
> > The real synchronization happens in the Resume Path:
> >
> > 1. arm_smmu_device_reset() clears all caches / TLBs.
> > (None of these can have entries before SMMUEN=1)
> >
> > 2. We execute a full smp_mb() before setting SMMUEN=1. (that's why we
> > need smp_mb before SMMUEN=1). This barrier ensures that any PTE
> > writes made by any thread—including those that were elided while the
> > gate was closed, are globally visible before the SMMU hardware starts
> > fetching into TLBs again. (This is why Jason suggested this in v6 [1])
>
> A barrier on one CPU has no bearing on whether writes by any other CPU
> can be observed by any particular agent in the system.
>
> Let's compare this with the long comment in
> arm_smmu_domain_inv_range() which is what I believe Jason was
> referring to. In that example, you see smp_mb() in the code path on
> CPU0 and dma_wmb() in the code path on CPU1. Hence, barriers exist on
> both sides. If you compare the runtime PM design with
> arm_smmu_domain_inv_range(), then smp_mb() belongs in the CPU thread
> that performs the translation table updates not the one that performs
> the suspend/resume operation.
>
I might be missing something here, so please bear with me. My
understanding it that's needed because the IOMMU is live & actively
caching, which is not true for our case.
[Assuming we use non-relaxed semantics & ordering for the STOP flag,
i.e. set STOP_FLAG + barrier & clear STOP_FLAG (implicit dma_wmb())]
In our case, during the resume op, we first clear the STOP_FLAG before
setting SMMUEN=1 in program order. Thus, any PTE invalidations occurring
before SMMUEN=1 are executed, i.e. EVEN when the SMMU is guaranteed not
to access any structures, we've resumed invalidations.
Let's consider a few examples:
1. SUSPEND (say CPU0 is suspending)
[CPU0] SMMUEN = 0 ==> SMMU stops accessing HW structures (ABORT NOT set)
HW structures not accessed means no TLB / CFG
cache accesses as well according to the spec.
[CPU1] ==> PTE update => Invalidate => Succeeds (although SMMUEN = 0)
[CPU0] GBPA.Abort set ==> Txns are blocked
[CPU2] => PTE update => Invalidate => Succeeds [Txns blocked + SMMUEN=0]
[CPU0] ==> SET STOP_FLAG ==> Elision begins
[CPU3] ==> PTE update ==> Invalidation ==> Elided [Txns blocked + SMMUEN=0]
Hence, the races in the suspend sequence are handled correctly.
2. RESUME (say CPU0 is resuming)
[CPU1] ==> Update PTE ==> Invalidate ==> Elided [Txns blocked + SMMUEN=0]
[CPU0] ==> Clear STOP_FLAGs [Txns still blocked + SMMUEN=0]
[CPU2] ==> Update PTE ==> Invalidate ==> Succeeds [Txns blocked + SMMUEN=0]
[CPU0] ==> Invalidate all TLB ==> Succeeds [Txns still blocked + SMMUEN=0]
[CPU0] ==> Invalidate all CFG ==> Succeeds [Txns still blocked + SMMUEN=0]
[CPU2] ==> Update PTE ==> Invalidate ==> Succeeds [Txns still blocked + SMMUEN=0]
[CPU0] ==> Set SMMUEN = 1 [SMMU can now access in memory structures]
However, the TLBs and CFG caches are clean because everything
until this point couldn't have cached anything anyway.
Hence, right after clearing the STOP_FLAG, we're taking in invalidations
as normal in the resume, much before the real caching can begin.
Thus, by resuming invalidations before SMMUEN=1, we guarantee a
consistent state before the very first translation is performed.
Apart from this, I guess I'll drop the can_elide check from all
invalidation paths.
Does that sound fine?
> Actually, now that I think of it: We could just piggy pack on the
> existing smp_mb() in arm_smmu_domain_inv_range(). For your
> arm_smmu_cmdq_can_elide() to work correctly, though, you would need to
> move it to *after* the smp_mb() call. This being said, I would prefer
> to remove arm_smmu_cmdq_can_elide() as I wrote on the other thread.
> If we do take advantage of the existing smp_mb() in
> arm_smmu_domain_inv_range(), then we should probably update that
> comment to state that the runtime PM implementation now also depends
> on it.
>
> >
> > Adding a dma_mb() to the elision path would be redundant given the
> > SMMU can't access any structures and the resume barrier. We'd be
> > needlessly injecting dma_mbs when no entity is actually accessing those
> > structures while the STOP_FLAG is set.
> > >
> > > I have the same concerns with arm_smmu_cmdq_can_elide(): That
> > > READ_ONCE in arm_smmu_cmdq_can_elide() provides no guarantees in
> > > regards to ordering relative to PTE writes.
> > >
> >
> > The same as above, No structures are accessed during SMMUEN=0 (spec) and
> > during resume before setting SMMUEN=1, we do a full smp_mb() to ensure
> > all concurrent PTE writes are acquired in our resume thread before we
> > enable SMMUEN=1
>
> I think the definition of "all concurrent PTE writes" is a bit vague.
> I don't think the architecture allows you to wait for writes performed
> by other CPUs to be observable by all other agents. The only thing you
> can do is to ensure that other agents can observe that the STOP_FLAG
> has been cleared *before* SMMUEN is set. However, this is already
> achieved by the existing dma_wmb() in arm_smmu_cmdq_issue_cmdlist().
> Hence, the extra smp_mb() is not required.
>
Right, I meant that the smp_mb was needed for the STOP_FLAG is visible to
other CPUs performing PTE writes.. but I agree that the first CFGI
invalidation's dma_wmb() would ensure the STOP_FLAG is visible to all
other CPUs, I'll drop this and add a comment explaining this.
Thanks,
Praan
next prev parent reply other threads:[~2026-06-09 10:05 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-01 21:58 [PATCH v8 00/10] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Pranjal Shrivastava
2026-06-01 21:58 ` [PATCH v8 01/12] iommu/arm-smmu-v3: Refactor arm_smmu_setup_irqs Pranjal Shrivastava
2026-06-01 21:58 ` [PATCH v8 02/12] iommu/arm-smmu-v3: Add a helper to drain cmd queues Pranjal Shrivastava
2026-06-02 0:12 ` Nicolin Chen
2026-06-02 3:28 ` Pranjal Shrivastava
2026-06-02 5:21 ` Daniel Mentz
2026-06-01 21:59 ` [PATCH v8 03/12] iommu/tegra241-cmdqv: Add a helper to drain VCMDQs Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 04/12] iommu/tegra241-cmdqv: Restore PROD and CONS after resume Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 05/12] iommu/arm-smmu-v3: Cache and restore MSI config Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 06/12] iommu/arm-smmu-v3: Handle gerror during suspend Pranjal Shrivastava
2026-06-02 0:15 ` Nicolin Chen
2026-06-02 3:31 ` Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 07/12] iommu/arm-smmu-v3: Add CMDQ_PROD_STOP_FLAG to gate CMDQ submissions Pranjal Shrivastava
2026-06-07 21:42 ` Daniel Mentz
2026-06-08 6:19 ` Pranjal Shrivastava
2026-06-09 4:20 ` Daniel Mentz
2026-06-09 10:05 ` Pranjal Shrivastava [this message]
2026-06-01 21:59 ` [PATCH v8 08/12] iommu/tegra241-cmdqv: Add a helper to quiesce VCMDQs Pranjal Shrivastava
2026-06-02 0:14 ` Nicolin Chen
2026-06-02 3:37 ` Pranjal Shrivastava
2026-06-02 5:59 ` Nicolin Chen
2026-06-02 6:21 ` Pranjal Shrivastava
2026-06-02 6:29 ` Nicolin Chen
2026-06-01 21:59 ` [PATCH v8 09/12] iommu/arm-smmu-v3: Implement pm_runtime & system sleep ops Pranjal Shrivastava
2026-06-02 5:25 ` Daniel Mentz
2026-06-02 12:12 ` Pranjal Shrivastava
2026-06-07 22:36 ` Daniel Mentz
2026-06-02 15:27 ` Daniel Mentz
2026-06-09 10:13 ` Pranjal Shrivastava
2026-06-07 21:53 ` Daniel Mentz
2026-06-09 10:12 ` Pranjal Shrivastava
2026-06-07 22:30 ` Daniel Mentz
2026-06-09 10:09 ` Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 10/12] iommu/arm-smmu-v3: Enable pm_runtime and setup devlinks Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 11/12] iommu/arm-smmu-v3: Invoke pm_runtime before hw access Pranjal Shrivastava
2026-06-02 0:24 ` Nicolin Chen
2026-06-02 3:59 ` Pranjal Shrivastava
2026-06-02 5:51 ` Nicolin Chen
2026-06-02 6:24 ` Pranjal Shrivastava
2026-06-03 20:28 ` Daniel Mentz
2026-06-04 6:27 ` Pranjal Shrivastava
2026-06-07 22:22 ` Daniel Mentz
2026-06-09 10:34 ` Pranjal Shrivastava
2026-06-01 21:59 ` [PATCH v8 12/12] iommu/arm-smmu-v3: Add KUnit unit tests for Runtime PM Pranjal Shrivastava
2026-06-02 6:03 ` [PATCH v8 00/10] iommu/arm-smmu-v3: Implement Runtime/System Sleep ops Nicolin Chen
2026-06-02 12:04 ` Pranjal Shrivastava
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aiflaI4svEJvZbsC@google.com \
--to=praan@google.com \
--cc=amhetre@nvidia.com \
--cc=danielmentz@google.com \
--cc=iommu@lists.linux.dev \
--cc=jgg@ziepe.ca \
--cc=joro@8bytes.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=nicolinc@nvidia.com \
--cc=robin.murphy@arm.com \
--cc=smostafa@google.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox