* [PATCH 0/2] SMMU v3 CMDQ fix and improvement
@ 2025-09-24 17:54 Jacob Pan
2025-09-24 17:54 ` [PATCH 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
` (4 more replies)
0 siblings, 5 replies; 20+ messages in thread
From: Jacob Pan @ 2025-09-24 17:54 UTC (permalink / raw)
To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Nicolin Chen
Cc: Jacob Pan, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
Hi Will et al,
These two patches are derived from testing the SMMU driver with smaller
CMDQ sizes, where we see soft lockups.
This happens on Hyper-V emulated SMMU v3 as well as on bare-metal ARM
servers with an artificially reduced queue size and a microbenchmark to
stress-test concurrency.
Thanks,
Jacob
Alexander Grest (1):
iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
Jacob Pan (1):
iommu/arm-smmu-v3: Fix CMDQ timeout warning
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 85 +++++++++------------
1 file changed, 35 insertions(+), 50 deletions(-)
--
2.43.0
* [PATCH 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
2025-09-24 17:54 [PATCH 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
@ 2025-09-24 17:54 ` Jacob Pan
2025-10-07 0:44 ` Nicolin Chen
2025-09-24 17:54 ` [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
` (3 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Jacob Pan @ 2025-09-24 17:54 UTC (permalink / raw)
To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Nicolin Chen
Cc: Jacob Pan, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
While polling for n spaces in the cmdq, the current code instead checks
whether the queue is full. If the queue is almost full but does not have
enough space (< n), the CMDQ timeout warning is never triggered even when
the polling has exceeded the timeout limit.
This patch polls for the availability of the exact space needed instead
of a non-full queue, and emits the timeout warning accordingly.
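To illustrate the difference, a simplified sketch of the two polling
conditions (using helpers already defined in this file; the cons refresh
and irq handling are elided):

	/* Before: the warning is tied to a full-queue check, so a wait
	 * for n + sync slots on an almost-full queue never warns.
	 */
	while (!queue_has_space(&llq, n + sync)) {
		if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
			dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
	}

	/* After: the wait for n + sync slots is itself bounded, so any
	 * wait that exceeds the timeout is reported.
	 */
	queue_poll_init(smmu, &qp);
	while (!queue_has_space(&llq, n + sync)) {
		if (queue_poll(&qp) == -ETIMEDOUT) {
			dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
			queue_poll_init(smmu, &qp);
		}
		llq.val = READ_ONCE(cmdq->q.llq.val);
	}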
Fixes: 587e6c10a7ce ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion")
Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 61 +++++++--------------
1 file changed, 21 insertions(+), 40 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2a8b46b948f0..9b63525c13bb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -138,12 +138,6 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
return space >= n;
}
-static bool queue_full(struct arm_smmu_ll_queue *q)
-{
- return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
- Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
-}
-
static bool queue_empty(struct arm_smmu_ll_queue *q)
{
return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
@@ -633,38 +627,6 @@ static void arm_smmu_cmdq_poll_valid_map(struct arm_smmu_cmdq *cmdq,
__arm_smmu_cmdq_poll_set_valid_map(cmdq, sprod, eprod, false);
}
-/* Wait for the command queue to become non-full */
-static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
- struct arm_smmu_cmdq *cmdq,
- struct arm_smmu_ll_queue *llq)
-{
- unsigned long flags;
- struct arm_smmu_queue_poll qp;
- int ret = 0;
-
- /*
- * Try to update our copy of cons by grabbing exclusive cmdq access. If
- * that fails, spin until somebody else updates it for us.
- */
- if (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags)) {
- WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
- arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
- llq->val = READ_ONCE(cmdq->q.llq.val);
- return 0;
- }
-
- queue_poll_init(smmu, &qp);
- do {
- llq->val = READ_ONCE(cmdq->q.llq.val);
- if (!queue_full(llq))
- break;
-
- ret = queue_poll(&qp);
- } while (!ret);
-
- return ret;
-}
-
/*
* Wait until the SMMU signals a CMD_SYNC completion MSI.
* Must be called with the cmdq lock held in some capacity.
@@ -796,6 +758,7 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
unsigned long flags;
bool owner;
struct arm_smmu_ll_queue llq, head;
+ struct arm_smmu_queue_poll qp;
int ret = 0;
llq.max_n_shift = cmdq->q.llq.max_n_shift;
@@ -806,10 +769,28 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
do {
u64 old;
+ queue_poll_init(smmu, &qp);
while (!queue_has_space(&llq, n + sync)) {
local_irq_restore(flags);
- if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
- dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
+ /*
+ * Try to update our copy of cons by grabbing exclusive cmdq access. If
+ * that fails, spin until somebody else updates it for us.
+ */
+ if (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags)) {
+ WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
+ arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
+ llq.val = READ_ONCE(cmdq->q.llq.val);
+ local_irq_save(flags);
+ continue;
+ }
+
+ ret = queue_poll(&qp);
+ if (ret == -ETIMEDOUT) {
+ dev_err_ratelimited(smmu->dev, "CPU %d CMDQ Timeout, C: %08x, CW:%x P: 0x%08x PW: %x cmdq.lock 0x%x\n",
+ smp_processor_id(), Q_IDX(&llq, llq.cons), Q_WRP(&llq, llq.cons), Q_IDX(&llq, llq.prod), Q_WRP(&llq, llq.prod), atomic_read(&cmdq->lock));
+ queue_poll_init(smmu, &qp);
+ }
+ llq.val = READ_ONCE(cmdq->q.llq.val);
local_irq_save(flags);
}
--
2.43.0
* [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
2025-09-24 17:54 [PATCH 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
2025-09-24 17:54 ` [PATCH 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
@ 2025-09-24 17:54 ` Jacob Pan
2025-10-07 1:08 ` Nicolin Chen
2025-10-17 11:04 ` Mostafa Saleh
2025-10-06 15:14 ` [PATCH 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
` (2 subsequent siblings)
4 siblings, 2 replies; 20+ messages in thread
From: Jacob Pan @ 2025-09-24 17:54 UTC (permalink / raw)
To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Nicolin Chen
Cc: Jacob Pan, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
From: Alexander Grest <Alexander.Grest@microsoft.com>
The SMMU CMDQ lock is heavily contended when there are multiple CPUs
issuing commands on hardware with a small queue size, e.g. 256 entries.
The lock has the following states:
- 0: Unlocked
- >0: Shared lock held with count
- INT_MIN+N: Exclusive lock held, where N is the # of shared waiters
- INT_MIN: Exclusive lock held, no shared waiters
When multiple CPUs are polling for space in the queue, they attempt to
grab the exclusive lock to update the cons pointer from the hardware. If
they fail to get the lock, they spin until the cons pointer is updated by
another CPU.
The current code allows the possibility of shared lock starvation
if there is a constant stream of CPUs trying to grab the exclusive lock.
This leads to severe latency issues and soft lockups.
To mitigate this, release the exclusive lock by clearing only the sign
bit while retaining the shared lock waiter count, so that the shared lock
waiters are not starved.
Also delete the cmpxchg() loop used while trying to acquire the shared
lock, as it is not needed: the waiters can see the positive lock count
and proceed immediately once the exclusive lock is released.
The exclusive lock is not starved either, since submitters try the
exclusive lock first when new space becomes available.
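As an illustration, a hypothetical sequence with one exclusive holder
and three shared waiters (a sketch of the state transitions, not code
from the patch):

	/*
	 * lock == INT_MIN         exclusive lock held, no shared waiters
	 * 3 x shared_lock()       each does atomic_fetch_inc(), so
	 *                         lock == INT_MIN + 3 (still negative)
	 *
	 * old unlock: atomic_set_release(&cmdq->lock, 0)
	 *     the waiters' increments are discarded; each must then win a
	 *     cmpxchg() against a possible fresh exclusive locker and can
	 *     be starved indefinitely
	 *
	 * new unlock: atomic_fetch_and_release(~INT_MIN, &cmdq->lock)
	 *     lock == 3; the waiters already hold the shared lock and
	 *     proceed as soon as they observe a positive value, while a
	 *     new exclusive lock cannot be taken until the count drops
	 *     back to 0
	 */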
In a staged test where 32 CPUs issue SVA invalidations simultaneously on
a system with a 256-entry queue, the madvise(MADV_DONTNEED) latency
dropped by 50% with this patch, and the soft lockups disappeared.
Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++---------
1 file changed, 14 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9b63525c13bb..9b7c01b731df 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -481,20 +481,19 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
*/
static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
{
- int val;
-
/*
- * We can try to avoid the cmpxchg() loop by simply incrementing the
- * lock counter. When held in exclusive state, the lock counter is set
- * to INT_MIN so these increments won't hurt as the value will remain
- * negative.
+ * We can simply increment the lock counter. When held in exclusive
+ * state, the lock counter is set to INT_MIN so these increments won't
+ * hurt as the value will remain negative. This will also signal the
+ * exclusive locker that there are shared waiters. Once the exclusive
+ * locker releases the lock, the sign bit will be cleared and our
+ * increment will make the lock counter positive, allowing us to
+ * proceed.
*/
if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
return;
- do {
- val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
- } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+ atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
}
static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -521,9 +520,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
__ret; \
})
+/*
+ * Only clear the sign bit when releasing the exclusive lock this will
+ * allow any shared_lock() waiters to proceed without the possibility
+ * of entering the exclusive lock in a tight loop.
+ */
#define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags) \
({ \
- atomic_set_release(&cmdq->lock, 0); \
+ atomic_fetch_and_release(~INT_MIN, &cmdq->lock); \
local_irq_restore(flags); \
})
--
2.43.0
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-09-24 17:54 [PATCH 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
2025-09-24 17:54 ` [PATCH 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
2025-09-24 17:54 ` [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
@ 2025-10-06 15:14 ` Jacob Pan
2025-10-16 15:31 ` Jacob Pan
2025-10-17 10:57 ` Mostafa Saleh
4 siblings, 0 replies; 20+ messages in thread
From: Jacob Pan @ 2025-10-06 15:14 UTC (permalink / raw)
To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Nicolin Chen
Cc: Zhang Yu, Jean Philippe-Brucker, Alexander Grest, Joerg Roedel
Hi, any thoughts on these?
@Jason @Nicolin
On Wed, 24 Sep 2025 10:54:36 -0700
Jacob Pan <jacob.pan@linux.microsoft.com> wrote:
> Hi Will et al,
>
> These two patches are derived from testing SMMU driver with smaller
> CMDQ sizes where we see soft lockups.
>
> This happens on HyperV emulated SMMU v3 as well as baremetal ARM
> servers with artificially reduced queue size and microbenchmark to
> stress test concurrency.
>
> Thanks,
>
> Jacob
>
>
> Alexander Grest (1):
> iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
>
> Jacob Pan (1):
> iommu/arm-smmu-v3: Fix CMDQ timeout warning
>
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 85
> +++++++++------------ 1 file changed, 35 insertions(+), 50
> deletions(-)
>
* Re: [PATCH 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
2025-09-24 17:54 ` [PATCH 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
@ 2025-10-07 0:44 ` Nicolin Chen
2025-10-07 16:12 ` Jacob Pan
0 siblings, 1 reply; 20+ messages in thread
From: Nicolin Chen @ 2025-10-07 0:44 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
On Wed, Sep 24, 2025 at 10:54:37AM -0700, Jacob Pan wrote:
> While polling for n spaces in the cmdq, the current code instead checks
> if the queue is full. If the queue is almost full but not enough space
> (<n), then the CMDQ timeout warning is never triggered even if the
> polling has exceeded timeout limit.
This does sound like an issue that is missing a warning print.
> This patch polls for the availability of exact space instead of full and
> emit timeout warning accordingly.
And the solution sounds plausible as well.
> @@ -806,10 +769,28 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> do {
> u64 old;
>
> + queue_poll_init(smmu, &qp);
> while (!queue_has_space(&llq, n + sync)) {
> local_irq_restore(flags);
> - if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
> - dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
> + /*
> + * Try to update our copy of cons by grabbing exclusive cmdq access. If
> + * that fails, spin until somebody else updates it for us.
> + */
> + if (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags)) {
> + WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
> + arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
> + llq.val = READ_ONCE(cmdq->q.llq.val);
> + local_irq_save(flags);
> + continue;
> + }
> +
> + ret = queue_poll(&qp);
> + if (ret == -ETIMEDOUT) {
> + dev_err_ratelimited(smmu->dev, "CPU %d CMDQ Timeout, C: %08x, CW:%x P: 0x%08x PW: %x cmdq.lock 0x%x\n",
> + smp_processor_id(), Q_IDX(&llq, llq.cons), Q_WRP(&llq, llq.cons), Q_IDX(&llq, llq.prod), Q_WRP(&llq, llq.prod), atomic_read(&cmdq->lock));
> + queue_poll_init(smmu, &qp);
> + }
> + llq.val = READ_ONCE(cmdq->q.llq.val);
> local_irq_save(flags);
But, couldn't we write a new arm_smmu_cmdq_poll_until_enough_space()
simply replacing arm_smmu_cmdq_exclusive_unlock_irqrestore()?
This whole unwrapped piece is really not easy to read :(
Also, the new error message has too much debugging info, which could
be trimmed away, IMHO. Though kernel coding now does allow a higher
limit per line, that 200-ish-character line is a bit overdone :-/
Nicolin
* Re: [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
2025-09-24 17:54 ` [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
@ 2025-10-07 1:08 ` Nicolin Chen
2025-10-07 18:16 ` Jacob Pan
2025-10-17 11:04 ` Mostafa Saleh
1 sibling, 1 reply; 20+ messages in thread
From: Nicolin Chen @ 2025-10-07 1:08 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
On Wed, Sep 24, 2025 at 10:54:38AM -0700, Jacob Pan wrote:
> static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
> {
> - int val;
> -
> /*
> - * We can try to avoid the cmpxchg() loop by simply incrementing the
> - * lock counter. When held in exclusive state, the lock counter is set
> - * to INT_MIN so these increments won't hurt as the value will remain
> - * negative.
> + * We can simply increment the lock counter. When held in exclusive
> + * state, the lock counter is set to INT_MIN so these increments won't
> + * hurt as the value will remain negative.
It seems to me that the change at the first statement is not very
necessary.
> This will also signal the
> + * exclusive locker that there are shared waiters. Once the exclusive
> + * locker releases the lock, the sign bit will be cleared and our
> + * increment will make the lock counter positive, allowing us to
> + * proceed.
> */
> if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
> return;
>
> - do {
> - val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
> - } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
> + atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
The returned value is not captured for anything. Is this read()
necessary? If so, a line of comments elaborating it?
> +/*
> + * Only clear the sign bit when releasing the exclusive lock this will
> + * allow any shared_lock() waiters to proceed without the possibility
> + * of entering the exclusive lock in a tight loop.
> + */
> #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags) \
> ({ \
> - atomic_set_release(&cmdq->lock, 0); \
> + atomic_fetch_and_release(~INT_MIN, &cmdq->lock); \
By a quick skim, the whole thing looks quite smart to me. But I
need some time to revisit and perhaps test it as well.
It's also important to get feedback from Will. Both patches are
touching his writing that has been running for years already..
Nicolin
* Re: [PATCH 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
2025-10-07 0:44 ` Nicolin Chen
@ 2025-10-07 16:12 ` Jacob Pan
2025-10-07 16:32 ` Nicolin Chen
0 siblings, 1 reply; 20+ messages in thread
From: Jacob Pan @ 2025-10-07 16:12 UTC (permalink / raw)
To: Nicolin Chen
Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
On Mon, 6 Oct 2025 17:44:40 -0700
Nicolin Chen <nicolinc@nvidia.com> wrote:
> On Wed, Sep 24, 2025 at 10:54:37AM -0700, Jacob Pan wrote:
> > While polling for n spaces in the cmdq, the current code instead
> > checks if the queue is full. If the queue is almost full but not
> > enough space (<n), then the CMDQ timeout warning is never triggered
> > even if the polling has exceeded timeout limit.
>
> This does sound like an issue that is missing a warning print.
Yeah, without this warning our debugging has to start from the much
later soft lockup trace.
> > This patch polls for the availability of exact space instead of
> > full and emit timeout warning accordingly.
>
> And the solution sounds plausible as well.
>
> > @@ -806,10 +769,28 @@ int arm_smmu_cmdq_issue_cmdlist(struct
> > arm_smmu_device *smmu, do {
> > u64 old;
> >
> > + queue_poll_init(smmu, &qp);
> > while (!queue_has_space(&llq, n + sync)) {
> > local_irq_restore(flags);
> > - if
> > (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
> > - dev_err_ratelimited(smmu->dev,
> > "CMDQ timeout\n");
> > + /*
> > + * Try to update our copy of cons by
> > grabbing exclusive cmdq access. If
> > + * that fails, spin until somebody else
> > updates it for us.
> > + */
> > + if
> > (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags)) {
> > + WRITE_ONCE(cmdq->q.llq.cons,
> > readl_relaxed(cmdq->q.cons_reg));
> > +
> > arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
> > + llq.val =
> > READ_ONCE(cmdq->q.llq.val);
> > + local_irq_save(flags);
> > + continue;
> > + }
> > +
> > + ret = queue_poll(&qp);
> > + if (ret == -ETIMEDOUT) {
> > + dev_err_ratelimited(smmu->dev,
> > "CPU %d CMDQ Timeout, C: %08x, CW:%x P: 0x%08x PW: %x cmdq.lock
> > 0x%x\n",
> > +
> > smp_processor_id(), Q_IDX(&llq, llq.cons), Q_WRP(&llq, llq.cons),
> > Q_IDX(&llq, llq.prod), Q_WRP(&llq, llq.prod),
> > atomic_read(&cmdq->lock));
> > + queue_poll_init(smmu, &qp);
> > + }
> > + llq.val = READ_ONCE(cmdq->q.llq.val);
> > local_irq_save(flags);
>
> But, couldn't we write a new arm_smmu_cmdq_poll_until_enough_space()
> simply replacing arm_smmu_cmdq_exclusive_unlock_irqrestore()?
>
Yes, I can extract the above polling for queue_has_space() into a new
arm_smmu_cmdq_poll_until_enough_space() function. I agree it will be
more readable.
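Something along these lines, perhaps (a rough sketch only; the actual
helper in v2 may differ):

	static int arm_smmu_cmdq_poll_until_enough_space(struct arm_smmu_device *smmu,
							 struct arm_smmu_cmdq *cmdq,
							 struct arm_smmu_ll_queue *llq,
							 u32 n)
	{
		unsigned long flags;
		struct arm_smmu_queue_poll qp;
		int ret = 0;

		queue_poll_init(smmu, &qp);
		do {
			llq->val = READ_ONCE(cmdq->q.llq.val);
			if (queue_has_space(llq, n))
				break;

			/*
			 * Refresh our copy of cons from the hardware if we
			 * can grab exclusive cmdq access; otherwise poll
			 * until somebody else updates it for us.
			 */
			if (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags)) {
				WRITE_ONCE(cmdq->q.llq.cons,
					   readl_relaxed(cmdq->q.cons_reg));
				arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
				continue;
			}

			ret = queue_poll(&qp);
		} while (!ret);

		return ret;
	}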
But I am not following what you mean by "replacing
arm_smmu_cmdq_exclusive_unlock_irqrestore()", could you elaborate?
> This whole unwrapped piece is really not easy to read :(
>
> Also, the new error message has too much debugging info, which could
> be trimmed away, IMHO. Though kernel coding now does allow a higher
> limit per line, that 200-ish-character line is a bit overdone :-/
agreed. will reduce.
> Nicolin
* Re: [PATCH 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
2025-10-07 16:12 ` Jacob Pan
@ 2025-10-07 16:32 ` Nicolin Chen
0 siblings, 0 replies; 20+ messages in thread
From: Nicolin Chen @ 2025-10-07 16:32 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
On Tue, Oct 07, 2025 at 09:12:20AM -0700, Jacob Pan wrote:
> But I am not following what you mean by "replacing
> arm_smmu_cmdq_exclusive_unlock_irqrestore()", could you elaborate?
Oh, "replacing arm_smmu_cmdq_poll_until_not_full()" I meant.
Nicolin
* Re: [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
2025-10-07 1:08 ` Nicolin Chen
@ 2025-10-07 18:16 ` Jacob Pan
0 siblings, 0 replies; 20+ messages in thread
From: Jacob Pan @ 2025-10-07 18:16 UTC (permalink / raw)
To: Nicolin Chen
Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
On Mon, 6 Oct 2025 18:08:14 -0700
Nicolin Chen <nicolinc@nvidia.com> wrote:
> On Wed, Sep 24, 2025 at 10:54:38AM -0700, Jacob Pan wrote:
> > static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
> > {
> > - int val;
> > -
> > /*
> > - * We can try to avoid the cmpxchg() loop by simply
> > incrementing the
> > - * lock counter. When held in exclusive state, the lock
> > counter is set
> > - * to INT_MIN so these increments won't hurt as the value
> > will remain
> > - * negative.
> > + * We can simply increment the lock counter. When held in
> > exclusive
> > + * state, the lock counter is set to INT_MIN so these
> > increments won't
> > + * hurt as the value will remain negative.
>
> It seems to me that the change at the first statement is not very
> necessary.
>
I can delete "We can simply increment the lock counter." since it is
obvious. But the change to delete the cmpxchg in the comment matches the
code change that follows.
> > This will also signal the
> > + * exclusive locker that there are shared waiters. Once
> > the exclusive
> > + * locker releases the lock, the sign bit will be cleared
> > and our
> > + * increment will make the lock counter positive, allowing
> > us to
> > + * proceed.
> > */
> > if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
> > return;
> >
> > - do {
> > - val = atomic_cond_read_relaxed(&cmdq->lock, VAL >=
> > 0);
> > - } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1)
> > != val);
> > + atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
>
> The returned value is not captured for anything. Is this read()
> necessary? If so, a line of comments elaborating it?
We don't need the return value; how about this explanation?
/*
* Someone else is holding the lock in exclusive state, so wait
* for them to finish. Since we already incremented the lock counter,
* no exclusive lock can be acquired until we finish. We don't need
* the return value since we only care that the exclusive lock is
* released (i.e. the lock counter is non-negative).
*/
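Folding that in, the function would read roughly as below (a sketch of
a possible v2; the final wording may differ):

	static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
	{
		/*
		 * When held in exclusive state, the lock counter is set to
		 * INT_MIN, so this increment won't hurt as the value will
		 * remain negative. It also signals the exclusive locker
		 * that there are shared waiters.
		 */
		if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
			return;

		/*
		 * Someone else is holding the lock in exclusive state, so
		 * wait for them to finish. Since we already incremented the
		 * lock counter, no new exclusive lock can be acquired until
		 * we proceed; we only care that the counter becomes
		 * non-negative, not about the value itself.
		 */
		atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
	}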
> > +/*
> > + * Only clear the sign bit when releasing the exclusive lock this
> > will
> > + * allow any shared_lock() waiters to proceed without the
> > possibility
> > + * of entering the exclusive lock in a tight loop.
> > + */
> > #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq,
> > flags) \ ({
> > \
> > - atomic_set_release(&cmdq->lock, 0);
> > \
> > + atomic_fetch_and_release(~INT_MIN, &cmdq->lock);
> > \
>
> By a quick skim, the whole thing looks quite smart to me. But I
> need some time to revisit and perhaps test it as well.
>
> It's also important to get feedback from Will. Both patches are
> touching his writing that has been running for years already..
Definitely, your review is really appreciated. I think part of the
reason is that the cmdq size is usually quite large, so a full queue is
a rare case.
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-09-24 17:54 [PATCH 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
` (2 preceding siblings ...)
2025-10-06 15:14 ` [PATCH 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
@ 2025-10-16 15:31 ` Jacob Pan
2025-10-17 10:57 ` Mostafa Saleh
4 siblings, 0 replies; 20+ messages in thread
From: Jacob Pan @ 2025-10-16 15:31 UTC (permalink / raw)
To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Nicolin Chen, Joerg Roedel
Cc: Zhang Yu, Jean Philippe-Brucker, Alexander Grest
Hi Will,
Just wondering if you had a chance to review these?
Thanks,
Jacob
On Wed, 24 Sep 2025 10:54:36 -0700
Jacob Pan <jacob.pan@linux.microsoft.com> wrote:
> Hi Will et al,
>
> These two patches are derived from testing SMMU driver with smaller
> CMDQ sizes where we see soft lockups.
>
> This happens on HyperV emulated SMMU v3 as well as baremetal ARM
> servers with artificially reduced queue size and microbenchmark to
> stress test concurrency.
>
> Thanks,
>
> Jacob
>
>
> Alexander Grest (1):
> iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
>
> Jacob Pan (1):
> iommu/arm-smmu-v3: Fix CMDQ timeout warning
>
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 85
> +++++++++------------ 1 file changed, 35 insertions(+), 50
> deletions(-)
>
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-09-24 17:54 [PATCH 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
` (3 preceding siblings ...)
2025-10-16 15:31 ` Jacob Pan
@ 2025-10-17 10:57 ` Mostafa Saleh
2025-10-17 13:51 ` Jason Gunthorpe
4 siblings, 1 reply; 20+ messages in thread
From: Mostafa Saleh @ 2025-10-17 10:57 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Nicolin Chen, Zhang Yu, Jean Philippe-Brucker,
Alexander Grest
On Wed, Sep 24, 2025 at 10:54:36AM -0700, Jacob Pan wrote:
> Hi Will et al,
>
> These two patches are derived from testing SMMU driver with smaller CMDQ
> sizes where we see soft lockups.
>
> This happens on HyperV emulated SMMU v3 as well as baremetal ARM servers
> with artificially reduced queue size and microbenchmark to stress test
> concurrency.
Is it possible to share what the artificial sizes are, and whether the
HW/emulation supports range invalidation (IDR3.RIL)?
I'd expect it would be really hard to overwhelm the command queue, unless the
HW doesn't support range invalidation and/or the queue entries are close to
the number of CPUs.
Thanks,
Mostafa
>
> Thanks,
>
> Jacob
>
>
> Alexander Grest (1):
> iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
>
> Jacob Pan (1):
> iommu/arm-smmu-v3: Fix CMDQ timeout warning
>
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 85 +++++++++------------
> 1 file changed, 35 insertions(+), 50 deletions(-)
>
> --
> 2.43.0
>
* Re: [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
2025-09-24 17:54 ` [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
2025-10-07 1:08 ` Nicolin Chen
@ 2025-10-17 11:04 ` Mostafa Saleh
2025-10-19 5:32 ` Jacob Pan
1 sibling, 1 reply; 20+ messages in thread
From: Mostafa Saleh @ 2025-10-17 11:04 UTC (permalink / raw)
To: Jacob Pan
Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Nicolin Chen, Zhang Yu, Jean Philippe-Brucker,
Alexander Grest
On Wed, Sep 24, 2025 at 10:54:38AM -0700, Jacob Pan wrote:
> From: Alexander Grest <Alexander.Grest@microsoft.com>
>
> The SMMU CMDQ lock is highly contentious when there are multiple CPUs
> issuing commands on an architecture with small queue sizes e.g 256
> entries.
>
> The lock has the following states:
> - 0: Unlocked
> - >0: Shared lock held with count
> - INT_MIN+N: Exclusive lock held, where N is the # of shared waiters
> - INT_MIN: Exclusive lock held, no shared waiters
>
> When multiple CPUs are polling for space in the queue, they attempt to
> grab the exclusive lock to update the cons pointer from the hardware. If
> they fail to get the lock, they will spin until either the cons pointer
> is updated by another CPU.
>
> The current code allows the possibility of shared lock starvation
> if there is a constant stream of CPUs trying to grab the exclusive lock.
> This leads to severe latency issues and soft lockups.
>
> To mitigate this, we release the exclusive lock by only clearing the sign
> bit while retaining the shared lock waiter count as a way to avoid
> starving the shared lock waiters.
>
> Also deleted cmpxchg loop while trying to acquire the shared lock as it
> is not needed. The waiters can see the positive lock count and proceed
> immediately after the exclusive lock is released.
>
> Exclusive lock is not starved in that submitters will try exclusive lock
> first when new spaces become available.
>
> In a staged test where 32 CPUs issue SVA invalidations simultaneously on
> a system with a 256 entry queue, the madvise (MADV_DONTNEED) latency
> dropped by 50% with this patch and without soft lockups.
>
> Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
> Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++---------
> 1 file changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 9b63525c13bb..9b7c01b731df 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -481,20 +481,19 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
> */
> static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
> {
> - int val;
> -
> /*
> - * We can try to avoid the cmpxchg() loop by simply incrementing the
> - * lock counter. When held in exclusive state, the lock counter is set
> - * to INT_MIN so these increments won't hurt as the value will remain
> - * negative.
> + * We can simply increment the lock counter. When held in exclusive
> + * state, the lock counter is set to INT_MIN so these increments won't
> + * hurt as the value will remain negative. This will also signal the
> + * exclusive locker that there are shared waiters. Once the exclusive
> + * locker releases the lock, the sign bit will be cleared and our
> + * increment will make the lock counter positive, allowing us to
> + * proceed.
> */
> if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
> return;
>
> - do {
> - val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
> - } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
> + atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
I think that should be "VAL > 0", as it is guaranteed that we hold the shared
lock at this point.
Otherwise,
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Thanks,
Mostafa
> }
>
> static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
> @@ -521,9 +520,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
> __ret; \
> })
>
> +/*
> + * Only clear the sign bit when releasing the exclusive lock this will
> + * allow any shared_lock() waiters to proceed without the possibility
> + * of entering the exclusive lock in a tight loop.
> + */
> #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags) \
> ({ \
> - atomic_set_release(&cmdq->lock, 0); \
> + atomic_fetch_and_release(~INT_MIN, &cmdq->lock); \
> local_irq_restore(flags); \
> })
>
> --
> 2.43.0
>
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-10-17 10:57 ` Mostafa Saleh
@ 2025-10-17 13:51 ` Jason Gunthorpe
2025-10-17 14:44 ` Robin Murphy
2025-10-17 16:50 ` Jacob Pan
0 siblings, 2 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2025-10-17 13:51 UTC (permalink / raw)
To: Mostafa Saleh
Cc: Jacob Pan, linux-kernel, iommu@lists.linux.dev, Will Deacon,
Robin Murphy, Nicolin Chen, Zhang Yu, Jean Philippe-Brucker,
Alexander Grest
On Fri, Oct 17, 2025 at 10:57:52AM +0000, Mostafa Saleh wrote:
> On Wed, Sep 24, 2025 at 10:54:36AM -0700, Jacob Pan wrote:
> > Hi Will et al,
> >
> > These two patches are derived from testing SMMU driver with smaller CMDQ
> > sizes where we see soft lockups.
> >
> > This happens on HyperV emulated SMMU v3 as well as baremetal ARM servers
> > with artificially reduced queue size and microbenchmark to stress test
> > concurrency.
>
> Is it possible to share what are the artificial sizes and does the HW/emulation
> support range invalidation (IRD3.RIL)?
>
> I'd expect it would be really hard to overwhelm the command queue, unless the
> HW doesn't support range invalidation and/or the queue entries are close to
> the number of CPUs.
At least on Jacob's system there is no RIL and there are 72/144 CPU
cores potentially banging on this.
I think it is a combination of lots of required invalidation commands,
low queue depth and slow retirement of commands that makes it easier to
create a queue-full condition.
Without RIL, one SVA invalidation may take out the entire small queue,
for example.
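For a rough sense of scale (hypothetical numbers, assuming 4KB pages):
without RIL, invalidating a single 2MB chunk takes 2MB / 4KB = 512
per-page TLBI commands plus a CMD_SYNC, i.e. already twice the depth of
a 256-entry queue.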
Jason
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-10-17 13:51 ` Jason Gunthorpe
@ 2025-10-17 14:44 ` Robin Murphy
2025-10-17 16:50 ` Jacob Pan
1 sibling, 0 replies; 20+ messages in thread
From: Robin Murphy @ 2025-10-17 14:44 UTC (permalink / raw)
To: Jason Gunthorpe, Mostafa Saleh
Cc: Jacob Pan, linux-kernel, iommu@lists.linux.dev, Will Deacon,
Nicolin Chen, Zhang Yu, Jean Philippe-Brucker, Alexander Grest
On 2025-10-17 2:51 pm, Jason Gunthorpe wrote:
> On Fri, Oct 17, 2025 at 10:57:52AM +0000, Mostafa Saleh wrote:
>> On Wed, Sep 24, 2025 at 10:54:36AM -0700, Jacob Pan wrote:
>>> Hi Will et al,
>>>
>>> These two patches are derived from testing SMMU driver with smaller CMDQ
>>> sizes where we see soft lockups.
>>>
>>> This happens on HyperV emulated SMMU v3 as well as baremetal ARM servers
>>> with artificially reduced queue size and microbenchmark to stress test
>>> concurrency.
>>
>> Is it possible to share what are the artificial sizes and does the HW/emulation
>> support range invalidation (IRD3.RIL)?
>>
>> I'd expect it would be really hard to overwhelm the command queue, unless the
>> HW doesn't support range invalidation and/or the queue entries are close to
>> the number of CPUs.
>
> At least on Jacob's system there is no RIL and there are 72/144 CPU
> cores potentially banging on this.
>
> I think it is combination of lots of required invalidation commands,
> low queue depth and slow retirement of commands that make it easier to
> create a queue full condition.
>
> Without RIL one SVA invalidation may take out the entire small queue,
> for example.
Indeed once real hardware first started to arrive, we found that even
just 4 NVMe queues doing ~8MB DMA unmaps with a modestly-clocked MMU-600
were capable of keeping a 256-entry CMDQ full enough to occasionally hit
this timeout with the original spinlock. Which is precisely why, as well
as introducing the new lock-free algorithm, we also stopped limiting the
CMDQ to 4KB 6 years ago.
Yes, if the queue is contrived to only be big enough to hold 3 or fewer
commands per CPU, one can expect catastrophic levels of contention even
with RIL. However since that requires going out of the way to hack the
driver (and/or hypervisor emulation) to force a clearly unrealistic
behaviour, I would say the best solution to that particular problem is
"stop doing that".
If significant contention is visible in real-world workloads, that would
be more of a concern of interest.
Thanks,
Robin.
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-10-17 13:51 ` Jason Gunthorpe
2025-10-17 14:44 ` Robin Murphy
@ 2025-10-17 16:50 ` Jacob Pan
2025-10-20 12:02 ` Jason Gunthorpe
1 sibling, 1 reply; 20+ messages in thread
From: Jacob Pan @ 2025-10-17 16:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mostafa Saleh, linux-kernel, iommu@lists.linux.dev, Will Deacon,
Robin Murphy, Nicolin Chen, Zhang Yu, Jean Philippe-Brucker,
Alexander Grest
On Fri, 17 Oct 2025 10:51:45 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Fri, Oct 17, 2025 at 10:57:52AM +0000, Mostafa Saleh wrote:
> > On Wed, Sep 24, 2025 at 10:54:36AM -0700, Jacob Pan wrote:
> > > Hi Will et al,
> > >
> > > These two patches are derived from testing SMMU driver with
> > > smaller CMDQ sizes where we see soft lockups.
> > >
> > > This happens on HyperV emulated SMMU v3 as well as baremetal ARM
> > > servers with artificially reduced queue size and microbenchmark
> > > to stress test concurrency.
> >
> > Is it possible to share what are the artificial sizes and does the
> > HW/emulation support range invalidation (IRD3.RIL)?
> >
> > I'd expect it would be really hard to overwhelm the command queue,
> > unless the HW doesn't support range invalidation and/or the queue
> > entries are close to the number of CPUs.
>
> At least on Jacob's system there is no RIL and there are 72/144 CPU
> cores potentially banging on this.
>
> I think it is combination of lots of required invalidation commands,
> low queue depth and slow retirement of commands that make it easier to
> create a queue full condition.
>
> Without RIL one SVA invalidation may take out the entire small queue,
> for example.
Right, no range invalidation and queue depth is 256 in this case.
* Re: [PATCH 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
2025-10-17 11:04 ` Mostafa Saleh
@ 2025-10-19 5:32 ` Jacob Pan
0 siblings, 0 replies; 20+ messages in thread
From: Jacob Pan @ 2025-10-19 5:32 UTC (permalink / raw)
To: Mostafa Saleh
Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Jason Gunthorpe,
Robin Murphy, Nicolin Chen, Zhang Yu, Jean Philippe-Brucker,
Alexander Grest
On Fri, 17 Oct 2025 11:04:24 +0000
Mostafa Saleh <smostafa@google.com> wrote:
> On Wed, Sep 24, 2025 at 10:54:38AM -0700, Jacob Pan wrote:
> > From: Alexander Grest <Alexander.Grest@microsoft.com>
> >
> > The SMMU CMDQ lock is highly contentious when there are multiple
> > CPUs issuing commands on an architecture with small queue sizes e.g
> > 256 entries.
> >
> > The lock has the following states:
> > - 0: Unlocked
> > - >0: Shared lock held with count
> > - INT_MIN+N: Exclusive lock held, where N is the # of
> > shared waiters
> > - INT_MIN: Exclusive lock held, no shared waiters
> >
> > When multiple CPUs are polling for space in the queue, they attempt
> > to grab the exclusive lock to update the cons pointer from the
> > hardware. If they fail to get the lock, they will spin until either
> > the cons pointer is updated by another CPU.
> >
> > The current code allows the possibility of shared lock starvation
> > if there is a constant stream of CPUs trying to grab the exclusive
> > lock. This leads to severe latency issues and soft lockups.
> >
> > To mitigate this, we release the exclusive lock by only clearing
> > the sign bit while retaining the shared lock waiter count as a way
> > to avoid starving the shared lock waiters.
> >
> > Also deleted cmpxchg loop while trying to acquire the shared lock
> > as it is not needed. The waiters can see the positive lock count
> > and proceed immediately after the exclusive lock is released.
> >
> > Exclusive lock is not starved in that submitters will try exclusive
> > lock first when new spaces become available.
> >
> > In a staged test where 32 CPUs issue SVA invalidations
> > simultaneously on a system with a 256 entry queue, the madvise
> > (MADV_DONTNEED) latency dropped by 50% with this patch and without
> > soft lockups.
> >
> > Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
> > Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24
> > ++++++++++++--------- 1 file changed, 14 insertions(+), 10
> > deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index
> > 9b63525c13bb..9b7c01b731df 100644 ---
> > a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++
> > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -481,20 +481,19 @@
> > static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu) */
> > static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
> > {
> > - int val;
> > -
> > /*
> > - * We can try to avoid the cmpxchg() loop by simply
> > incrementing the
> > - * lock counter. When held in exclusive state, the lock
> > counter is set
> > - * to INT_MIN so these increments won't hurt as the value
> > will remain
> > - * negative.
> > + * We can simply increment the lock counter. When held in
> > exclusive
> > + * state, the lock counter is set to INT_MIN so these
> > increments won't
> > + * hurt as the value will remain negative. This will also
> > signal the
> > + * exclusive locker that there are shared waiters. Once
> > the exclusive
> > + * locker releases the lock, the sign bit will be cleared
> > and our
> > + * increment will make the lock counter positive, allowing
> > us to
> > + * proceed.
> > */
> > if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
> > return;
> >
> > - do {
> > - val = atomic_cond_read_relaxed(&cmdq->lock, VAL >=
> > 0);
> > - } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1)
> > != val);
> > + atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
>
> I think that should be "VAL > 0", as it is guaranteed that we hold
> the shared lock at this point.
>
Indeed, will do.
Though there is no functional difference since we did inc already, VAL
will never be 0 when it comes to this line.
> Otherwise,
> Reviewed-by: Mostafa Saleh <smostafa@google.com>
>
> Thanks,
> Mostafa
>
> > }
> >
> > static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
> > @@ -521,9 +520,14 @@ static bool
> > arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
> > __ret;
> > \ })
> > +/*
> > + * Only clear the sign bit when releasing the exclusive lock this
> > will
> > + * allow any shared_lock() waiters to proceed without the
> > possibility
> > + * of entering the exclusive lock in a tight loop.
> > + */
> > #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq,
> > flags) \ ({
> > \
> > - atomic_set_release(&cmdq->lock, 0);
> > \
> > + atomic_fetch_and_release(~INT_MIN, &cmdq->lock);
> > \ local_irq_restore(flags);
> > \ })
> >
> > --
> > 2.43.0
> >
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-10-17 16:50 ` Jacob Pan
@ 2025-10-20 12:02 ` Jason Gunthorpe
2025-10-20 18:57 ` Jacob Pan
0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2025-10-20 12:02 UTC (permalink / raw)
To: Jacob Pan
Cc: Mostafa Saleh, linux-kernel, iommu@lists.linux.dev, Will Deacon,
Robin Murphy, Nicolin Chen, Zhang Yu, Jean Philippe-Brucker,
Alexander Grest
On Fri, Oct 17, 2025 at 09:50:31AM -0700, Jacob Pan wrote:
> On Fri, 17 Oct 2025 10:51:45 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Fri, Oct 17, 2025 at 10:57:52AM +0000, Mostafa Saleh wrote:
> > > On Wed, Sep 24, 2025 at 10:54:36AM -0700, Jacob Pan wrote:
> > > > Hi Will et al,
> > > >
> > > > These two patches are derived from testing SMMU driver with
> > > > smaller CMDQ sizes where we see soft lockups.
> > > >
> > > > This happens on HyperV emulated SMMU v3 as well as baremetal ARM
> > > > servers with artificially reduced queue size and microbenchmark
> > > > to stress test concurrency.
> > >
> > > Is it possible to share what are the artificial sizes and does the
> > > HW/emulation support range invalidation (IRD3.RIL)?
> > >
> > > I'd expect it would be really hard to overwhelm the command queue,
> > > unless the HW doesn't support range invalidation and/or the queue
> > > entries are close to the number of CPUs.
> >
> > At least on Jacob's system there is no RIL and there are 72/144 CPU
> > cores potentially banging on this.
> >
> > I think it is combination of lots of required invalidation commands,
> > low queue depth and slow retirement of commands that make it easier to
> > create a queue full condition.
> >
> > Without RIL one SVA invalidation may take out the entire small queue,
> > for example.
> Right, no range invalidation and queue depth is 256 in this case.
I think Robin is asking you to justify why the queue depth is 256 when
ARM is recommending much larger depths specifically to fix issues like
this?
Jason
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-10-20 12:02 ` Jason Gunthorpe
@ 2025-10-20 18:57 ` Jacob Pan
2025-10-21 11:45 ` Robin Murphy
0 siblings, 1 reply; 20+ messages in thread
From: Jacob Pan @ 2025-10-20 18:57 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Mostafa Saleh, linux-kernel, iommu@lists.linux.dev, Will Deacon,
Robin Murphy, Nicolin Chen, Zhang Yu, Jean Philippe-Brucker,
Alexander Grest
On Mon, 20 Oct 2025 09:02:40 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Fri, Oct 17, 2025 at 09:50:31AM -0700, Jacob Pan wrote:
> > On Fri, 17 Oct 2025 10:51:45 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Fri, Oct 17, 2025 at 10:57:52AM +0000, Mostafa Saleh wrote:
> > > > On Wed, Sep 24, 2025 at 10:54:36AM -0700, Jacob Pan wrote:
> > > > > Hi Will et al,
> > > > >
> > > > > These two patches are derived from testing SMMU driver with
> > > > > smaller CMDQ sizes where we see soft lockups.
> > > > >
> > > > > This happens on HyperV emulated SMMU v3 as well as baremetal
> > > > > ARM servers with artificially reduced queue size and
> > > > > microbenchmark to stress test concurrency.
> > > >
> > > > Is it possible to share what are the artificial sizes and does
> > > > the HW/emulation support range invalidation (IRD3.RIL)?
> > > >
> > > > I'd expect it would be really hard to overwhelm the command
> > > > queue, unless the HW doesn't support range invalidation and/or
> > > > the queue entries are close to the number of CPUs.
> > >
> > > At least on Jacob's system there is no RIL and there are 72/144
> > > CPU cores potentially banging on this.
> > >
> > > I think it is combination of lots of required invalidation
> > > commands, low queue depth and slow retirement of commands that
> > > make it easier to create a queue full condition.
> > >
> > > Without RIL one SVA invalidation may take out the entire small
> > > queue, for example.
> > Right, no range invalidation and queue depth is 256 in this case.
>
> I think Robin is asking you to justify why the queue depth is 256 when
> ARM is recommending much larger depths specifically to fix issues like
> this?
The smaller queue depth is chosen for CMD_SYNC latency reasons. But I
don't know the implementation details of Hyper-V and the host SMMU
driver.
IMHO, the queue size is orthogonal to what this patch is trying to
address, which is a specific locking problem plus an efficiency
improvement, e.g. the eliminated cmpxchg:
- do {
- val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
- } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+ atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
Even on BM with a restricted queue size, this patch reduces the latency
of concurrent madvise(MADV_DONTNEED) from multiple CPUs (I tested 32
CPUs, cutting latency by 50% when unmapping a 1GB buffer in 2MB chunks
per CPU).
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-10-20 18:57 ` Jacob Pan
@ 2025-10-21 11:45 ` Robin Murphy
2025-10-21 20:37 ` Jacob Pan
0 siblings, 1 reply; 20+ messages in thread
From: Robin Murphy @ 2025-10-21 11:45 UTC (permalink / raw)
To: Jacob Pan
Cc: Mostafa Saleh, linux-kernel, iommu@lists.linux.dev, Will Deacon,
Nicolin Chen, Zhang Yu, Jean Philippe-Brucker, Alexander Grest,
Jason Gunthorpe
On 2025-10-20 7:57 pm, Jacob Pan wrote:
> On Mon, 20 Oct 2025 09:02:40 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Fri, Oct 17, 2025 at 09:50:31AM -0700, Jacob Pan wrote:
>>> On Fri, 17 Oct 2025 10:51:45 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>
>>>> On Fri, Oct 17, 2025 at 10:57:52AM +0000, Mostafa Saleh wrote:
>>>>> On Wed, Sep 24, 2025 at 10:54:36AM -0700, Jacob Pan wrote:
>>>>>> Hi Will et al,
>>>>>>
>>>>>> These two patches are derived from testing SMMU driver with
>>>>>> smaller CMDQ sizes where we see soft lockups.
>>>>>>
>>>>>> This happens on HyperV emulated SMMU v3 as well as baremetal
>>>>>> ARM servers with artificially reduced queue size and
>>>>>> microbenchmark to stress test concurrency.
>>>>>
>>>>> Is it possible to share what are the artificial sizes and does
>>>>> the HW/emulation support range invalidation (IRD3.RIL)?
>>>>>
>>>>> I'd expect it would be really hard to overwhelm the command
>>>>> queue, unless the HW doesn't support range invalidation and/or
>>>>> the queue entries are close to the number of CPUs.
>>>>
>>>> At least on Jacob's system there is no RIL and there are 72/144
>>>> CPU cores potentially banging on this.
>>>>
>>>> I think it is combination of lots of required invalidation
>>>> commands, low queue depth and slow retirement of commands that
>>>> make it easier to create a queue full condition.
>>>>
>>>> Without RIL one SVA invalidation may take out the entire small
>>>> queue, for example.
>>> Right, no range invalidation and queue depth is 256 in this case.
>>
>> I think Robin is asking you to justify why the queue depth is 256 when
>> ARM is recommending much larger depths specifically to fix issues like
>> this?
> The smaller queue depth is chosen for CMD_SYNC latency reasons. But I
> don't know the implementation details of HyperV and host SMMU driver.
TBH that sounds highly dubious. The only way I could imagine CMDQ size
bearing any relation to CMD_SYNC at all is if a hypervisor is emulating
a stage 1 vCMDQ in such a naive and lazy manner that a) performance is
already off the table, and b) it has a good chance of being broken anyway.
For the hardware to actually process, say, 1023 invalidations followed
by a sync takes as long as it takes, based on how busy the SMMU is. The
only difference in issuing that sequence of commands on a 256-entry
queue vs. a 1024-entry queue is that in the latter case, software does
not have to sit waiting for the first 768 to actually be consumed before
it can finish the submission and potentially get on with something else
until the sync completes. Yes, one could claim that technically the time
between *issuing* the CMD_SYNC and its completion is then lower, but
only because that extra time has now been wasted in a polling loop
waiting for CMDQ space instead - it's a meaningless distinction overall.
> IMHO, queue size is orthogonal to what this patch is trying to
> address, which is a specific locking problem and improve efficiency.
> e.g. eliminated cmpxchg
> - do {
> - val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
> - } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
> + atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
>
> Even on BM with restricted queue size, this patch reduces latency of
> concurrent madvise(MADV_DONTNEED) from multiple CPUs (I tested 32 CPUs,
> cutting 50% latency unmap 1GB buffer in 2MB chucks per CPU).
My point is that a 50% improvement on nonsense is likely still nonsense.
With only 256 entries, every single one of those 2MB unmaps needs to
fill the entire CMDQ more than twice over. 32 CPUs all jostling to issue
about 34x as many commands as the queue can hold *each* is a ridiculous
level of contention. If a hypervisor is advertising an SMMU_IDR1.CMDQS
value that is absurdly tiny for the size of the VM then that represents
such an obvious bottleneck that it's hardly mainline Linux's
responsibility to maintain code to help "debug" it. As for "BM with
restricted queue size", like I said, just don't do that.
What is the difference on an un-hacked bare-metal system with a
normally-sized queue? Is it even measurable? That's what's actually
interesting. Furthermore, what exactly does that measurement even mean?
If we're still issuing the same number of commands I struggle to believe
we could lose 50% of the *overall* time just bouncing a cacheline
between shared and exclusive state - is this actually just the *maximum*
per-CPU latency going down, at the cost of the minimum latency
correspondingly increasing just as much (if not comparatively more) due
to better fairness? And if so, how important is that really? I can
imagine there are equally cases where other callers might prefer a lower
minimum/mean latency at the price of some longer outliers.
Note I'm not saying I'm necessarily against making these changes, just
that I'm against making them without a believable justification that it
is actually beneficial to mainline users.
Thanks,
Robin.
* Re: [PATCH 0/2] SMMU v3 CMDQ fix and improvement
2025-10-21 11:45 ` Robin Murphy
@ 2025-10-21 20:37 ` Jacob Pan
0 siblings, 0 replies; 20+ messages in thread
From: Jacob Pan @ 2025-10-21 20:37 UTC (permalink / raw)
To: Robin Murphy
Cc: Mostafa Saleh, linux-kernel, iommu@lists.linux.dev, Will Deacon,
Nicolin Chen, Zhang Yu, Jean Philippe-Brucker, Alexander Grest,
Jason Gunthorpe
On Tue, 21 Oct 2025 12:45:48 +0100
Robin Murphy <robin.murphy@arm.com> wrote:
> On 2025-10-20 7:57 pm, Jacob Pan wrote:
> > On Mon, 20 Oct 2025 09:02:40 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> >> On Fri, Oct 17, 2025 at 09:50:31AM -0700, Jacob Pan wrote:
> >>> On Fri, 17 Oct 2025 10:51:45 -0300
> >>> Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>>
> >>>> On Fri, Oct 17, 2025 at 10:57:52AM +0000, Mostafa Saleh wrote:
> >>>>> On Wed, Sep 24, 2025 at 10:54:36AM -0700, Jacob Pan wrote:
> >>>>>> Hi Will et al,
> >>>>>>
> >>>>>> These two patches are derived from testing SMMU driver with
> >>>>>> smaller CMDQ sizes where we see soft lockups.
> >>>>>>
> >>>>>> This happens on HyperV emulated SMMU v3 as well as baremetal
> >>>>>> ARM servers with artificially reduced queue size and
> >>>>>> microbenchmark to stress test concurrency.
> >>>>>
> >>>>> Is it possible to share what are the artificial sizes and does
> >>>>> the HW/emulation support range invalidation (IRD3.RIL)?
> >>>>>
> >>>>> I'd expect it would be really hard to overwhelm the command
> >>>>> queue, unless the HW doesn't support range invalidation and/or
> >>>>> the queue entries are close to the number of CPUs.
> >>>>
> >>>> At least on Jacob's system there is no RIL and there are 72/144
> >>>> CPU cores potentially banging on this.
> >>>>
> >>>> I think it is combination of lots of required invalidation
> >>>> commands, low queue depth and slow retirement of commands that
> >>>> make it easier to create a queue full condition.
> >>>>
> >>>> Without RIL one SVA invalidation may take out the entire small
> >>>> queue, for example.
> >>> Right, no range invalidation and queue depth is 256 in this case.
> >>>
> >>
> >> I think Robin is asking you to justify why the queue depth is 256
> >> when ARM is recommending much larger depths specifically to fix
> >> issues like this?
> > The smaller queue depth is chosen for CMD_SYNC latency reasons. But
> > I don't know the implementation details of HyperV and host SMMU
> > driver.
>
> TBH that sounds highly dubious. The only way I could imagine CMDQ
> size bearing any relation to CMD_SYNC at all is if a hypervisor is
> emulating a stage 1 vCMDQ in such a naive and lazy manner that a)
> performance is already off the table, and b) it has a good chance of
> being broken anyway.
>
> For the hardware to actually process, say, 1023 invalidations
> followed by a sync takes as long as it takes, based on how busy the
> SMMU is. The only difference in issuing that sequence of commands on
> a 256-entry queue vs. a 1024-entry queue is that in the latter case,
> software does not have to sit waiting for the first 768 to actually
> be consumed before it can finish the submission and potentially get
> on with something else until the sync completes. Yes, one could claim
> that technically the time between *issuing* the CMD_SYNC and its
> completion is then lower, but only because that extra time has now
> been wasted in a polling loop waiting for CMDQ space instead - it's a
> meaningless distinction overall.
I agree that a smaller emulated queue size does not change the time it
takes the physical IOMMU to do the invalidation. I am not defending the
current emulation, which I think can be improved over time, transparently
to the guest.
> > IMHO, queue size is orthogonal to what this patch is trying to
> > address, which is a specific locking problem and improve efficiency.
> > e.g. eliminated cmpxchg
> > - do {
> > - val = atomic_cond_read_relaxed(&cmdq->lock, VAL >=
> > 0);
> > - } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1)
> > != val);
> > + atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
> >
> > Even on BM with restricted queue size, this patch reduces latency of
> > concurrent madvise(MADV_DONTNEED) from multiple CPUs (I tested 32
> > CPUs, cutting 50% latency unmap 1GB buffer in 2MB chucks per CPU).
> My point is that a 50% improvement on nonsense is likely still
> nonsense. With only 256 entries, every single one of those 2MB unmaps
> needs to fill the entire CMDQ more than twice over. 32 CPUs all
> jostling to issue about 34x as many commands as the queue can hold
> *each* is a ridiculous level of contention. If a hypervisor is
> advertising an SMMU_IDR1.CMDQS value that is absurdly tiny for the
> size of the VM then that represents such an obvious bottleneck that
> it's hardly mainline Linux's responsibility to maintain code to help
> "debug" it. As for "BM with restricted queue size", like I said, just
> don't do that.
I don't think we are asking mainline Linux to debug our emulation
problem; quite the contrary, this setup helped to expose a mainline
Linux bug (patch 1/2 clearly shows that the queue-space contention path
has not been adequately tested).
This madvise test is intended to show:
a) the locking change is functionally sound, with no harm to existing
mainline users
b) extreme contention exposes the problems with the current code
c) performance differences
> What is the difference on an un-hacked bare-metal system with a
> normally-sized queue? Is it even measurable?
Not measurable on my BM system with a large cmdq size. The condition
!queue_has_space() is rarely met, so the exclusive lock is almost never
acquired.
> That's what's actually
> interesting. Furthermore, what exactly does that measurement even
> mean?
I agree this is not a measurement of real workload performance, but the
test shows:
a) no more occasional soft lockups, as the shared lock is no longer
starved
b) the shared lock can be taken quickly, as we get rid of the unnecessary
cmpxchg.
> If we're still issuing the same number of commands I struggle
> to believe we could lose 50% of the *overall* time just bouncing a
> cacheline between shared and exclusive state - is this actually just
> the *maximum* per-CPU latency going down, at the cost of the minimum
> latency correspondingly increasing just as much (if not comparatively
> more) due to better fairness? And if so, how important is that
> really? I can imagine there are equally cases where other callers
> might prefer a lower minimum/mean latency at the price of some longer
> outliers.
The importance is that this change avoided a soft lockup where the
exclusive lock is taken all the time. It is not about bouncing a
cacheline between shared and exclusive state. I have tried flushing the
lock cacheline after the shared lock wins the cmpxchg, but it didn't help
avoid the lockup.
> Note I'm not saying I'm necessarily against making these changes,
> just that I'm against making them without a believable justification
> that it is actually beneficial to mainline users.
The benefit to mainline users (I assume you meant SMMUs with a large
cmdq size), at the minimum, is that the shared lock can be taken more
quickly when it is released from the exclusive state.
> > - do {
> > - val = atomic_cond_read_relaxed(&cmdq->lock, VAL >=
> > 0);
> > - } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1)
> > != val);
> > + atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
Here is the assembly code diff :
BEFORE:
    27bc: 14000008     b      27dc <arm_smmu_cmdq_issue_cmdlist+0x364>
          val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
    27c0: 93407c00     sxtw   x0, w0
          __CMPWAIT_CASE(w, , 32);
    27c4: d50320bf     sevl
    27c8: d503205f     wfe
    27cc: 885f7c22     ldxr   w2, [x1]
    27d0: 4a000042     eor    w2, w2, w0
    27d4: 35000042     cbnz   w2, 27dc <arm_smmu_cmdq_issue_cmdlist+0x364>
    27d8: d503205f     wfe
    27dc: b940ce60     ldr    w0, [x19, #204]
    27e0: 37ffff00     tbnz   w0, #31, 27c0 <arm_smmu_cmdq_issue_cmdlist+0x348>
          } while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
    27e4: 11000403     add    w3, w0, #0x1
    27e8: 14000004     b      27f8 <arm_smmu_cmdq_issue_cmdlist+0x380>
          __CMPXCHG_CASE(w, , , 32, )
    27ec: 2a0003e2     mov    w2, w0
    27f0: 88a27c23     cas    w2, w3, [x1]
    27f4: 14000008     b      2814 <arm_smmu_cmdq_issue_cmdlist+0x39c>
          __CMPXCHG_CASE( , 32)
    27f8: 93407c04     sxtw   x4, w0
          __CMPXCHG_CASE(w, , , 32, , , , , K)
    27fc: f9800031     prfm   pstl1strm, [x1]
    2800: 885f7c22     ldxr   w2, [x1]
    2804: 4a040045     eor    w5, w2, w4
    2808: 35000065     cbnz   w5, 2814 <arm_smmu_cmdq_issue_cmdlist+0x39c>
    280c: 88057c23     stxr   w5, w3, [x1]
    2810: 35ffff85     cbnz   w5, 2800 <arm_smmu_cmdq_issue_cmdlist+0x388>
    2814: 6b02001f     cmp    w0, w2
    2818: 54fffe21     b.ne   27dc <arm_smmu_cmdq_issue_cmdlist+0x364>  // b.any
AFTER:
          atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
    27bc: b940ce60     ldr    w0, [x19, #204]
    27c0: 7100001f     cmp    w0, #0x0
    27c4: 5400006d     b.le   27d0 <arm_smmu_cmdq_issue_cmdlist+0x358>
> Thanks,
> Robin.