* [PATCH v5 1/3] iommu/arm-smmu-v3: Parameterize wfe for CMDQ polling
2025-12-08 21:28 [PATCH v5 0/3] SMMU v3 CMDQ fix and improvement Jacob Pan
@ 2025-12-08 21:28 ` Jacob Pan
2025-12-08 21:28 ` [PATCH v5 2/3] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Jacob Pan @ 2025-12-08 21:28 UTC (permalink / raw)
To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Nicolin Chen
Cc: Jacob Pan, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
When SMMU_IDR0.SEV == 1, the SMMU triggers a WFE wake-up event when a
Command queue becomes non-full and an agent external to the SMMU could
have observed that the queue was previously full. However, WFE is not
always required or available during space polling. Introduce an optional
parameter to control WFE usage.
Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index bf67d9abc901..d637a5dcf48a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -191,11 +191,11 @@ static u32 queue_inc_prod_n(struct arm_smmu_ll_queue *q, int n)
}
static void queue_poll_init(struct arm_smmu_device *smmu,
- struct arm_smmu_queue_poll *qp)
+ struct arm_smmu_queue_poll *qp, bool want_wfe)
{
qp->delay = 1;
qp->spin_cnt = 0;
- qp->wfe = !!(smmu->features & ARM_SMMU_FEAT_SEV);
+ qp->wfe = want_wfe && (!!(smmu->features & ARM_SMMU_FEAT_SEV));
qp->timeout = ktime_add_us(ktime_get(), ARM_SMMU_POLL_TIMEOUT_US);
}
@@ -656,13 +656,11 @@ static int __arm_smmu_cmdq_poll_until_msi(struct arm_smmu_device *smmu,
struct arm_smmu_queue_poll qp;
u32 *cmd = (u32 *)(Q_ENT(&cmdq->q, llq->prod));
- queue_poll_init(smmu, &qp);
-
/*
* The MSI won't generate an event, since it's being written back
* into the command queue.
*/
- qp.wfe = false;
+ queue_poll_init(smmu, &qp, false);
smp_cond_load_relaxed(cmd, !VAL || (ret = queue_poll(&qp)));
llq->cons = ret ? llq->prod : queue_inc_prod_n(llq, 1);
return ret;
@@ -680,7 +678,7 @@ static int __arm_smmu_cmdq_poll_until_consumed(struct arm_smmu_device *smmu,
u32 prod = llq->prod;
int ret = 0;
- queue_poll_init(smmu, &qp);
+ queue_poll_init(smmu, &qp, true);
llq->val = READ_ONCE(cmdq->q.llq.val);
do {
if (queue_consumed(llq, prod))
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH v5 2/3] iommu/arm-smmu-v3: Fix CMDQ timeout warning

2025-12-08 21:28 [PATCH v5 0/3] SMMU v3 CMDQ fix and improvement Jacob Pan
2025-12-08 21:28 ` [PATCH v5 1/3] iommu/arm-smmu-v3: Parameterize wfe for CMDQ polling Jacob Pan
@ 2025-12-08 21:28 ` Jacob Pan
2025-12-10 3:16 ` Will Deacon
2025-12-08 21:28 ` [PATCH v5 3/3] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
2026-01-05 22:58 ` [PATCH v5 0/3] SMMU v3 CMDQ fix and improvement Will Deacon
3 siblings, 1 reply; 7+ messages in thread
From: Jacob Pan @ 2025-12-08 21:28 UTC (permalink / raw)
To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Nicolin Chen
Cc: Jacob Pan, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
While polling for n slots in the cmdq, the current code instead checks
whether the queue is full. If the queue is almost full but lacks enough
space (< n), the CMDQ timeout warning is never triggered even when
polling has exceeded the timeout limit.
The existing arm_smmu_cmdq_poll_until_not_full() is neither efficient
nor a good fit for its only caller, arm_smmu_cmdq_issue_cmdlist():
- It starts a new timer on every call, so the total wait is not bounded
by the preset ARM_SMMU_POLL_TIMEOUT_US per command issue.
- It has a redundant internal queue_full() check, which cannot detect
whether there is enough space for n commands.
This patch polls for the exact amount of space required instead of the
queue-full condition, and emits the timeout warning accordingly.
Fixes: 587e6c10a7ce ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion")
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
---
v5:
- Disable WFE for queue space polling (Robin, Will)
v4:
- Deleted non-ETIMEOUT error handling for queue_poll (Nicolin)
v3:
- Use a helper for cmdq poll instead of open coding (Nicolin)
- Add more explanation in the commit message (Nicolin)
v2:
- Reduced debug print info (Nicolin)
- Use a separate irq flags for exclusive lock
- Handle queue_poll err code other than ETIMEOUT
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 49 ++++++++++-----------
1 file changed, 24 insertions(+), 25 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d637a5dcf48a..3467c10be0d0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -117,12 +117,6 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
return space >= n;
}
-static bool queue_full(struct arm_smmu_ll_queue *q)
-{
- return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
- Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
-}
-
static bool queue_empty(struct arm_smmu_ll_queue *q)
{
return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
@@ -612,14 +606,13 @@ static void arm_smmu_cmdq_poll_valid_map(struct arm_smmu_cmdq *cmdq,
__arm_smmu_cmdq_poll_set_valid_map(cmdq, sprod, eprod, false);
}
-/* Wait for the command queue to become non-full */
-static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
- struct arm_smmu_cmdq *cmdq,
- struct arm_smmu_ll_queue *llq)
+
+static inline void arm_smmu_cmdq_poll(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq *cmdq,
+ struct arm_smmu_ll_queue *llq,
+ struct arm_smmu_queue_poll *qp)
{
unsigned long flags;
- struct arm_smmu_queue_poll qp;
- int ret = 0;
/*
* Try to update our copy of cons by grabbing exclusive cmdq access. If
@@ -629,19 +622,16 @@ static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
llq->val = READ_ONCE(cmdq->q.llq.val);
- return 0;
+ return;
}
- queue_poll_init(smmu, &qp);
- do {
- llq->val = READ_ONCE(cmdq->q.llq.val);
- if (!queue_full(llq))
- break;
-
- ret = queue_poll(&qp);
- } while (!ret);
-
- return ret;
+ if (queue_poll(qp) == -ETIMEDOUT) {
+ dev_err_ratelimited(smmu->dev, "CMDQ timed out, cons: %08x, prod: 0x%08x\n",
+ llq->cons, llq->prod);
+ /* Restart the timer */
+ queue_poll_init(smmu, qp, false);
+ }
+ llq->val = READ_ONCE(cmdq->q.llq.val);
}
/*
@@ -781,12 +771,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
local_irq_save(flags);
llq.val = READ_ONCE(cmdq->q.llq.val);
do {
+ struct arm_smmu_queue_poll qp;
u64 old;
+ /*
+ * Poll without WFE because:
+ * 1) Running out of space should be rare. Power saving is not
+ * an issue.
+ * 2) WFE depends on queue full break events, which occur only
+ * when the queue is full, but here we’re polling for
+ * sufficient space, not just queue full condition.
+ */
+ queue_poll_init(smmu, &qp, false);
while (!queue_has_space(&llq, n + sync)) {
local_irq_restore(flags);
- if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
- dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
+ arm_smmu_cmdq_poll(smmu, cmdq, &llq, &qp);
local_irq_save(flags);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v5 2/3] iommu/arm-smmu-v3: Fix CMDQ timeout warning
2025-12-08 21:28 ` [PATCH v5 2/3] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
@ 2025-12-10 3:16 ` Will Deacon
2025-12-12 20:05 ` Jacob Pan
0 siblings, 1 reply; 7+ messages in thread
From: Will Deacon @ 2025-12-10 3:16 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, iommu@lists.linux.dev, Joerg Roedel, Mostafa Saleh,
Jason Gunthorpe, Robin Murphy, Nicolin Chen, Zhang Yu,
Jean Philippe-Brucker, Alexander Grest
On Mon, Dec 08, 2025 at 01:28:56PM -0800, Jacob Pan wrote:
> @@ -781,12 +771,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> local_irq_save(flags);
> llq.val = READ_ONCE(cmdq->q.llq.val);
> do {
> + struct arm_smmu_queue_poll qp;
> u64 old;
>
> + /*
> + * Poll without WFE because:
> + * 1) Running out of space should be rare. Power saving is not
> + * an issue.
> + * 2) WFE depends on queue full break events, which occur only
> + * when the queue is full, but here we’re polling for
> + * sufficient space, not just queue full condition.
> + */
I don't think this is reasonable; we should be able to use wfe instead of
polling on hardware that supports it and that is an important power-saving
measure in mobile parts.
If this is really an issue, we could take a spinlock around the
command-queue allocation loop for hardware with small queue sizes relative
to the number of CPUs, but it's not clear to me that we need to do anything
at all. I'm happy with the locking change in patch 3.
If we apply _only_ the locking change in the next patch, does that solve the
reported problem for you?
Will
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [PATCH v5 2/3] iommu/arm-smmu-v3: Fix CMDQ timeout warning
2025-12-10 3:16 ` Will Deacon
@ 2025-12-12 20:05 ` Jacob Pan
0 siblings, 0 replies; 7+ messages in thread
From: Jacob Pan @ 2025-12-12 20:05 UTC (permalink / raw)
To: Will Deacon
Cc: linux-kernel, iommu@lists.linux.dev, Joerg Roedel, Mostafa Saleh,
Jason Gunthorpe, Robin Murphy, Nicolin Chen, Zhang Yu,
Jean Philippe-Brucker, Alexander Grest
Hi Will,
On Wed, 10 Dec 2025 12:16:19 +0900
Will Deacon <will@kernel.org> wrote:
> On Mon, Dec 08, 2025 at 01:28:56PM -0800, Jacob Pan wrote:
> > @@ -781,12 +771,21 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> > local_irq_save(flags);
> > llq.val = READ_ONCE(cmdq->q.llq.val);
> > do {
> > + struct arm_smmu_queue_poll qp;
> > u64 old;
> >
> > + /*
> > + * Poll without WFE because:
> > + * 1) Running out of space should be rare. Power saving is not
> > + * an issue.
> > + * 2) WFE depends on queue full break events, which occur only
> > + * when the queue is full, but here we’re polling for
> > + * sufficient space, not just queue full condition.
> > + */
>
> I don't think this is reasonable; we should be able to use wfe
> instead of polling on hardware that supports it and that is an
> important power-saving measure in mobile parts.
>
After an offline discussion, I now understand that WFE essentially
stops the CPU clock, making energy savings almost always beneficial.
This differs from certain C-state or idle-state transitions, where the
energy-saving break-even point depends on how long the CPU remains
idle. Previously, I assumed power savings were not guaranteed due to
the unpredictability of wake events (e.g., timing relative to scheduler
ticks or queue-full conditions).
So I agree we should leverage WFE as much as we can here.
> If this is really an issue, we could take a spinlock around the
> command-queue allocation loop for hardware with small queue sizes
> relative to the number of CPUs, but it's not clear to me that we need
> to do anything at all. I'm happy with the locking change in patch 3.
>
> If we apply _only_ the locking change in the next patch, does that
> solve the reported problem for you?
Yes, please take #3. It should take care of the functional problem.
Thanks,
Jacob
^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v5 3/3] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
2025-12-08 21:28 [PATCH v5 0/3] SMMU v3 CMDQ fix and improvement Jacob Pan
2025-12-08 21:28 ` [PATCH v5 1/3] iommu/arm-smmu-v3: Parameterize wfe for CMDQ polling Jacob Pan
2025-12-08 21:28 ` [PATCH v5 2/3] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
@ 2025-12-08 21:28 ` Jacob Pan
2026-01-05 22:58 ` [PATCH v5 0/3] SMMU v3 CMDQ fix and improvement Will Deacon
3 siblings, 0 replies; 7+ messages in thread
From: Jacob Pan @ 2025-12-08 21:28 UTC (permalink / raw)
To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Nicolin Chen
Cc: Jacob Pan, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
From: Alexander Grest <Alexander.Grest@microsoft.com>
The SMMU CMDQ lock is highly contentious when there are multiple CPUs
issuing commands and the queue is nearly full.
The lock has the following states:
- 0: Unlocked
- >0: Shared lock held with count
- INT_MIN+N: Exclusive lock held, where N is the # of shared waiters
- INT_MIN: Exclusive lock held, no shared waiters
When multiple CPUs are polling for space in the queue, they attempt to
grab the exclusive lock to update the cons pointer from the hardware. If
they fail to get the lock, they spin until the cons pointer is updated
by another CPU.
The current code allows the possibility of shared lock starvation
if there is a constant stream of CPUs trying to grab the exclusive lock.
This leads to severe latency issues and soft lockups.
Consider the following scenario where CPU1's attempt to acquire the
shared lock is starved by CPU2 and CPU0 contending for the exclusive
lock.
CPU0 (exclusive) | CPU1 (shared) | CPU2 (exclusive) | `cmdq->lock`
--------------------------------------------------------------------------
trylock() //takes | | | 0
| shared_lock() | | INT_MIN
| fetch_inc() | | INT_MIN
| no return | | INT_MIN + 1
| spins // VAL >= 0 | | INT_MIN + 1
unlock() | spins... | | INT_MIN + 1
set_release(0) | spins... | | 0 see[NOTE]
(done) | (sees 0) | trylock() // takes | 0
| *exits loop* | cmpxchg(0, INT_MIN) | 0
| | *cuts in* | INT_MIN
| cmpxchg(0, 1) | | INT_MIN
| fails // != 0 | | INT_MIN
| spins // VAL >= 0 | | INT_MIN
| *starved* | | INT_MIN
[NOTE] The current code resets the exclusive lock to 0 regardless of the
state of the lock. This causes two problems:
1. It opens the possibility of back-to-back exclusive locks and the
downstream effect of starving shared lock.
2. The count of shared lock waiters is lost.
To mitigate this, we release the exclusive lock by only clearing the sign
bit while retaining the shared lock waiter count as a way to avoid
starving the shared lock waiters.
Also delete the cmpxchg loop used when acquiring the shared lock, as it
is no longer needed. The waiters can see the positive lock count and
proceed immediately after the exclusive lock is released.
The exclusive lock is not starved, because submitters try the exclusive
lock first whenever new space becomes available.
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
---
v5: - Simplify exclusive lock with atomic_fetch_andnot_release (Will)
v4: - No change
v3:
- Add flow chart for example starvation case (Nicolin)
no code change.
v2:
- Changed shared lock acquire condition from VAL>=0 to VAL>0
(Mostafa)
- Added more comments to explain shared lock change (Nicolin)
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 31 ++++++++++++++-------
1 file changed, 21 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 3467c10be0d0..7a53177885d7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -460,20 +460,26 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
*/
static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
{
- int val;
-
/*
- * We can try to avoid the cmpxchg() loop by simply incrementing the
- * lock counter. When held in exclusive state, the lock counter is set
- * to INT_MIN so these increments won't hurt as the value will remain
- * negative.
+ * When held in exclusive state, the lock counter is set to INT_MIN
+ * so these increments won't hurt as the value will remain negative.
+ * The increment will also signal the exclusive locker that there are
+ * shared waiters.
*/
if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
return;
- do {
- val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
- } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+ /*
+ * Someone else is holding the lock in exclusive state, so wait
+ * for them to finish. Since we already incremented the lock counter,
+ * no exclusive lock can be acquired until we finish. We don't need
+ * the return value since we only care that the exclusive lock is
+ * released (i.e. the lock counter is non-negative).
+ * Once the exclusive locker releases the lock, the sign bit will
+ * be cleared and our increment will make the lock counter positive,
+ * allowing us to proceed.
+ */
+ atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
}
static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -500,9 +506,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
__ret; \
})
+/*
+ * Only clear the sign bit when releasing the exclusive lock; this will
+ * allow any shared_lock() waiters to proceed without the possibility
+ * of entering the exclusive lock in a tight loop.
+ */
#define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags) \
({ \
- atomic_set_release(&cmdq->lock, 0); \
+ atomic_fetch_andnot_release(INT_MIN, &cmdq->lock); \
local_irq_restore(flags); \
})
--
2.43.0
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v5 0/3] SMMU v3 CMDQ fix and improvement
2025-12-08 21:28 [PATCH v5 0/3] SMMU v3 CMDQ fix and improvement Jacob Pan
` (2 preceding siblings ...)
2025-12-08 21:28 ` [PATCH v5 3/3] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
@ 2026-01-05 22:58 ` Will Deacon
3 siblings, 0 replies; 7+ messages in thread
From: Will Deacon @ 2026-01-05 22:58 UTC (permalink / raw)
To: linux-kernel, iommu, Joerg Roedel, Mostafa Saleh, Robin Murphy,
Nicolin Chen, Jason Gunthorpe, Jacob Pan
Cc: catalin.marinas, kernel-team, Will Deacon, Zhang Yu,
Alexander Grest, Jean-Philippe Brucker
On Mon, 08 Dec 2025 13:28:54 -0800, Jacob Pan wrote:
> These two patches address logic issues that occur when SMMU CMDQ spaces
> are nearly exhausted at runtime. The problems become more pronounced
> when multiple CPUs submit to a single queue, a common scenario under SVA
> when shared buffers (used by both CPU and device) are being unmapped.
>
>
> Thanks,
>
> [...]
Applied third patch to iommu (arm/smmu/updates), thanks!
[3/3] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
https://git.kernel.org/iommu/c/df180b1a4cc5
Cheers,
--
Will
https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev
^ permalink raw reply [flat|nested] 7+ messages in thread