public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 0/2] SMMU v3 CMDQ fix and improvement
@ 2025-10-20 22:43 Jacob Pan
  2025-10-20 22:43 ` [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jacob Pan @ 2025-10-20 22:43 UTC (permalink / raw)
  To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Nicolin Chen
  Cc: Jacob Pan, Zhang Yu, Jean Philippe-Brucker, Alexander Grest

Hi Will et al,

These two patches are derived from testing the SMMU driver with smaller
CMDQ sizes, where we see soft lockups.

This happens on Hyper-V emulated SMMU v3 as well as on bare-metal ARM
servers with an artificially reduced queue size and a microbenchmark to
stress-test concurrency.

Thanks,

Jacob 

Alexander Grest (1):
  iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency

Jacob Pan (1):
  iommu/arm-smmu-v3: Fix CMDQ timeout warning

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 97 ++++++++++-----------
 1 file changed, 47 insertions(+), 50 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
  2025-10-20 22:43 [PATCH v2 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
@ 2025-10-20 22:43 ` Jacob Pan
  2025-10-30 22:41   ` Nicolin Chen
  2025-10-20 22:43 ` [PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
  2025-10-30 15:42 ` [PATCH v2 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
  2 siblings, 1 reply; 12+ messages in thread
From: Jacob Pan @ 2025-10-20 22:43 UTC (permalink / raw)
  To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Nicolin Chen
  Cc: Jacob Pan, Zhang Yu, Jean Philippe-Brucker, Alexander Grest

While polling for n spaces in the cmdq, the current code instead checks
whether the queue is full. If the queue is almost full but does not have
enough space (<n), the CMDQ timeout warning is never triggered even if
polling has exceeded the timeout limit.

This patch polls for the availability of the exact space needed instead
of merely not-full, and emits the timeout warning accordingly.

Fixes: 587e6c10a7ce ("iommu/arm-smmu-v3: Reduce contention during command-queue insertion")
Co-developed-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Yu Zhang <zhangyu1@linux.microsoft.com>
Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
---
v2: - Reduced debug print info (Nicolin)
    - Use a separate irq flags for exclusive lock
    - Handle queue_poll err code other than -ETIMEDOUT
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 66 ++++++++-------------
 1 file changed, 26 insertions(+), 40 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index bf67d9abc901..6959d99c74a3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -117,12 +117,6 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
 	return space >= n;
 }
 
-static bool queue_full(struct arm_smmu_ll_queue *q)
-{
-	return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
-	       Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
-}
-
 static bool queue_empty(struct arm_smmu_ll_queue *q)
 {
 	return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
@@ -612,38 +606,6 @@ static void arm_smmu_cmdq_poll_valid_map(struct arm_smmu_cmdq *cmdq,
 	__arm_smmu_cmdq_poll_set_valid_map(cmdq, sprod, eprod, false);
 }
 
-/* Wait for the command queue to become non-full */
-static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
-					     struct arm_smmu_cmdq *cmdq,
-					     struct arm_smmu_ll_queue *llq)
-{
-	unsigned long flags;
-	struct arm_smmu_queue_poll qp;
-	int ret = 0;
-
-	/*
-	 * Try to update our copy of cons by grabbing exclusive cmdq access. If
-	 * that fails, spin until somebody else updates it for us.
-	 */
-	if (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, flags)) {
-		WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
-		arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
-		llq->val = READ_ONCE(cmdq->q.llq.val);
-		return 0;
-	}
-
-	queue_poll_init(smmu, &qp);
-	do {
-		llq->val = READ_ONCE(cmdq->q.llq.val);
-		if (!queue_full(llq))
-			break;
-
-		ret = queue_poll(&qp);
-	} while (!ret);
-
-	return ret;
-}
-
 /*
  * Wait until the SMMU signals a CMD_SYNC completion MSI.
  * Must be called with the cmdq lock held in some capacity.
@@ -775,6 +737,7 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	unsigned long flags;
 	bool owner;
 	struct arm_smmu_ll_queue llq, head;
+	struct arm_smmu_queue_poll qp;
 	int ret = 0;
 
 	llq.max_n_shift = cmdq->q.llq.max_n_shift;
@@ -785,10 +748,33 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	do {
 		u64 old;
 
+		queue_poll_init(smmu, &qp);
 		while (!queue_has_space(&llq, n + sync)) {
+			unsigned long iflags;
+
 			local_irq_restore(flags);
-			if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
-				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
+			/*
+			 * Try to update our copy of cons by grabbing exclusive cmdq access. If
+			 * that fails, spin until somebody else updates it for us.
+			 */
+			if (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, iflags)) {
+				WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
+				arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, iflags);
+				llq.val = READ_ONCE(cmdq->q.llq.val);
+				local_irq_save(flags);
+				continue;
+			}
+
+			ret = queue_poll(&qp);
+			if (ret == -ETIMEDOUT) {
+				dev_err_ratelimited(smmu->dev, "CPU %d CMDQ Timeout, Cons: %08x, Prod: 0x%08x Lock 0x%x\n",
+						    smp_processor_id(), llq.cons, llq.prod, atomic_read(&cmdq->lock));
+				queue_poll_init(smmu, &qp);
+			} else if (ret) {
+				dev_err_ratelimited(smmu->dev, "CPU %d CMDQ Poll error %d\n",
+						    smp_processor_id(), ret);
+			}
+			llq.val = READ_ONCE(cmdq->q.llq.val);
 			local_irq_save(flags);
 		}
 
-- 
2.43.0



* [PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
  2025-10-20 22:43 [PATCH v2 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
  2025-10-20 22:43 ` [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
@ 2025-10-20 22:43 ` Jacob Pan
  2025-10-31  2:00   ` Nicolin Chen
  2025-10-30 15:42 ` [PATCH v2 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
  2 siblings, 1 reply; 12+ messages in thread
From: Jacob Pan @ 2025-10-20 22:43 UTC (permalink / raw)
  To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Nicolin Chen
  Cc: Jacob Pan, Zhang Yu, Jean Philippe-Brucker, Alexander Grest

From: Alexander Grest <Alexander.Grest@microsoft.com>

The SMMU CMDQ lock is heavily contended when multiple CPUs issue
commands on a system with a small queue size, e.g. 256 entries.

The lock has the following states:
 - 0:		Unlocked
 - >0:		Shared lock held with count
 - INT_MIN+N:	Exclusive lock held, where N is the # of shared waiters
 - INT_MIN:	Exclusive lock held, no shared waiters

When multiple CPUs are polling for space in the queue, they attempt to
grab the exclusive lock to update the cons pointer from the hardware. If
they fail to get the lock, they spin until the cons pointer is updated
by another CPU.

The current code allows the possibility of shared lock starvation
if there is a constant stream of CPUs trying to grab the exclusive lock.
This leads to severe latency issues and soft lockups.

To mitigate this, release the exclusive lock by clearing only the sign
bit while retaining the shared-lock waiter count, so that the shared
lock waiters are not starved.

Also delete the cmpxchg() loop used to acquire the shared lock, as it is
no longer needed: the waiters can see the positive lock count and
proceed immediately once the exclusive lock is released.

The exclusive lock is not starved either, since submitters try the
exclusive lock first when new space becomes available.

In a staged test where 32 CPUs issue SVA invalidations simultaneously on
a system with a 256-entry queue, the madvise(MADV_DONTNEED) latency
dropped by 50% with this patch, and the soft lockups disappeared.

Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>
---
v2:
	- Changed shared lock acquire condition from VAL>=0 to VAL>0
	  (Mostafa)
	- Added more comments to explain shared lock change (Nicolin)
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 31 ++++++++++++++-------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6959d99c74a3..9e632bb022fe 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -460,20 +460,26 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
  */
 static void arm_smmu_cmdq_shared_lock(struct arm_smmu_cmdq *cmdq)
 {
-	int val;
-
 	/*
-	 * We can try to avoid the cmpxchg() loop by simply incrementing the
-	 * lock counter. When held in exclusive state, the lock counter is set
-	 * to INT_MIN so these increments won't hurt as the value will remain
-	 * negative.
+	 * When held in exclusive state, the lock counter is set to INT_MIN
+	 * so these increments won't hurt as the value will remain negative.
+	 * The increment will also signal the exclusive locker that there are
+	 * shared waiters.
 	 */
 	if (atomic_fetch_inc_relaxed(&cmdq->lock) >= 0)
 		return;
 
-	do {
-		val = atomic_cond_read_relaxed(&cmdq->lock, VAL >= 0);
-	} while (atomic_cmpxchg_relaxed(&cmdq->lock, val, val + 1) != val);
+	/*
+	 * Someone else is holding the lock in exclusive state, so wait
+	 * for them to finish. Since we already incremented the lock counter,
+	 * no exclusive lock can be acquired until we finish. We don't need
+	 * the return value since we only care that the exclusive lock is
+	 * released (i.e. the lock counter is non-negative).
+	 * Once the exclusive locker releases the lock, the sign bit will
+	 * be cleared and our increment will make the lock counter positive,
+	 * allowing us to proceed.
+	 */
+	atomic_cond_read_relaxed(&cmdq->lock, VAL > 0);
 }
 
 static void arm_smmu_cmdq_shared_unlock(struct arm_smmu_cmdq *cmdq)
@@ -500,9 +506,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
 	__ret;								\
 })
 
+/*
+ * Only clear the sign bit when releasing the exclusive lock; this will
+ * allow any shared_lock() waiters to proceed without the possibility
+ * of entering the exclusive lock in a tight loop.
+ */
 #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags)		\
 ({									\
-	atomic_set_release(&cmdq->lock, 0);				\
+	atomic_fetch_and_release(~INT_MIN, &cmdq->lock);				\
 	local_irq_restore(flags);					\
 })
 
-- 
2.43.0



* Re: [PATCH v2 0/2] SMMU v3 CMDQ fix and improvement
  2025-10-20 22:43 [PATCH v2 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
  2025-10-20 22:43 ` [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
  2025-10-20 22:43 ` [PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
@ 2025-10-30 15:42 ` Jacob Pan
  2 siblings, 0 replies; 12+ messages in thread
From: Jacob Pan @ 2025-10-30 15:42 UTC (permalink / raw)
  To: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Nicolin Chen
  Cc: Zhang Yu, Jean Philippe-Brucker, Alexander Grest

Hi Will,

Just a gentle reminder; your review would be appreciated.

@Nicolin, were you able to evaluate these?

By the way, I incorporated your review comments in v2, except for
extracting the function arm_smmu_cmdq_poll_until_enough_space(),
primarily due to concerns with the IRQ flags.

Thanks,

Jacob

On Mon, 20 Oct 2025 15:43:51 -0700, Jacob Pan
<jacob.pan@linux.microsoft.com> wrote:

> Hi Will et al,
> 
> These two patches are derived from testing SMMU driver with smaller
> CMDQ sizes where we see soft lockups.
> 
> This happens on HyperV emulated SMMU v3 as well as baremetal ARM
> servers with artificially reduced queue size and microbenchmark to
> stress test concurrency.
> 
> Thanks,
> 
> Jacob 
> 
> Alexander Grest (1):
>   iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
> 
> Jacob Pan (1):
>   iommu/arm-smmu-v3: Fix CMDQ timeout warning
> 
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 97 ++++++++++-----------
>  1 file changed, 47 insertions(+), 50 deletions(-)
> 



* Re: [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
  2025-10-20 22:43 ` [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
@ 2025-10-30 22:41   ` Nicolin Chen
  2025-11-03 23:16     ` Jacob Pan
  0 siblings, 1 reply; 12+ messages in thread
From: Nicolin Chen @ 2025-10-30 22:41 UTC (permalink / raw)
  To: Jacob Pan
  Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Zhang Yu,
	Jean Philippe-Brucker, Alexander Grest

On Mon, Oct 20, 2025 at 03:43:52PM -0700, Jacob Pan wrote:
> @@ -785,10 +748,33 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>  	do {
>  		u64 old;
>  
> +		queue_poll_init(smmu, &qp);
>  		while (!queue_has_space(&llq, n + sync)) {
> +			unsigned long iflags;
> +
>  			local_irq_restore(flags);
> -			if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
> -				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
> +			/*
> +			 * Try to update our copy of cons by grabbing exclusive cmdq access. If
> +			 * that fails, spin until somebody else updates it for us.
> +			 */
> +			if (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, iflags)) {
> +				WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
> +				arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, iflags);
> +				llq.val = READ_ONCE(cmdq->q.llq.val);
> +				local_irq_save(flags);

I don't quite get the reason why it moves queue_poll_init() and
add local_irq_save(). It's quite different than what the driver
has, so it's nicer to explain in the commit message at least.

I still feel that we could just replace the _until_not_full()
with a _until_has_space()?

Nicolin


* Re: [PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
  2025-10-20 22:43 ` [PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
@ 2025-10-31  2:00   ` Nicolin Chen
  2025-11-04  1:08     ` Jacob Pan
  0 siblings, 1 reply; 12+ messages in thread
From: Nicolin Chen @ 2025-10-31  2:00 UTC (permalink / raw)
  To: Jacob Pan
  Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Zhang Yu,
	Jean Philippe-Brucker, Alexander Grest

On Mon, Oct 20, 2025 at 03:43:53PM -0700, Jacob Pan wrote:
> From: Alexander Grest <Alexander.Grest@microsoft.com>
> 
> The SMMU CMDQ lock is highly contentious when there are multiple CPUs
> issuing commands on an architecture with small queue sizes e.g 256
> entries.

As Robin pointed out, a 256-entry queue itself is not quite normal, so
the justification here might still not be very convincing.

I'd suggest avoiding the phrase "an architecture with small queue
sizes" and focusing on the issue itself -- the potential starvation.
The 256-entry queue can be described as the testing setup used to
reproduce the issue.
> The lock has the following states:
>  - 0:		Unlocked
>  - >0:		Shared lock held with count
>  - INT_MIN+N:	Exclusive lock held, where N is the # of shared waiters
>  - INT_MIN:	Exclusive lock held, no shared waiters
> 
> When multiple CPUs are polling for space in the queue, they attempt to
> grab the exclusive lock to update the cons pointer from the hardware. If
> they fail to get the lock, they will spin until either the cons pointer
> is updated by another CPU.
> 
> The current code allows the possibility of shared lock starvation
> if there is a constant stream of CPUs trying to grab the exclusive lock.
> This leads to severe latency issues and soft lockups.

It'd be nicer to have a graph to show how the starvation might
happen due to a race:

CPU0 (exclusive)  | CPU1 (shared)     | CPU2 (exclusive)    | `cmdq->lock`
--------------------------------------------------------------------------
trylock() //takes |                   |                     | 0
                  | shared_lock()     |                     | INT_MIN
                  | fetch_inc()       |                     | INT_MIN
                  | no return         |                     | INT_MIN + 1
                  | spins // VAL >= 0 |                     | INT_MIN + 1
unlock()          | spins...          |                     | INT_MIN + 1
set_release(0)    | spins...          |                     | 0  <-- BUG?
(done)            | (sees 0)          | trylock() // takes  | 0
                  | *exits loop*      | cmpxchg(0, INT_MIN) | 0
                  |                   | *cuts in*           | INT_MIN
                  | cmpxchg(0, 1)     |                     | INT_MIN
                  | fails // != 0     |                     | INT_MIN
                  | spins // VAL >= 0 |                     | INT_MIN
                  | *starved*         |                     | INT_MIN

And point out that it should have preserved the "+1" from CPU1
instead of nuking the entire cmdq->lock to 0.

> In a staged test where 32 CPUs issue SVA invalidations simultaneously on
> a system with a 256 entry queue, the madvise (MADV_DONTNEED) latency
> dropped by 50% with this patch and without soft lockups.

This might not be very useful per Robin's remarks. I'd drop it.

> Reviewed-by: Mostafa Saleh <smostafa@google.com>
> Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
> Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

> @@ -500,9 +506,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
>  	__ret;								\
>  })
>  
> +/*
> + * Only clear the sign bit when releasing the exclusive lock this will
> + * allow any shared_lock() waiters to proceed without the possibility
> + * of entering the exclusive lock in a tight loop.
> + */
>  #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags)		\
>  ({									\
> -	atomic_set_release(&cmdq->lock, 0);				\
> +	atomic_fetch_and_release(~INT_MIN, &cmdq->lock);				\

Align the trailing spacing with the other lines please.

Nicolin


* Re: [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
  2025-10-30 22:41   ` Nicolin Chen
@ 2025-11-03 23:16     ` Jacob Pan
  2025-11-04  1:23       ` Nicolin Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Jacob Pan @ 2025-11-03 23:16 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Zhang Yu,
	Jean Philippe-Brucker, Alexander Grest

Hi Nicolin,

On Thu, 30 Oct 2025 15:41:57 -0700
Nicolin Chen <nicolinc@nvidia.com> wrote:

> On Mon, Oct 20, 2025 at 03:43:52PM -0700, Jacob Pan wrote:
> > @@ -785,10 +748,33 @@ static int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
> >  	do {
> >  		u64 old;
> >  
> > +		queue_poll_init(smmu, &qp);
> >  		while (!queue_has_space(&llq, n + sync)) {
> > +			unsigned long iflags;
> > +
> >  			local_irq_restore(flags);
> > -			if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
> > -				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
> > +			/*
> > +			 * Try to update our copy of cons by grabbing exclusive cmdq access. If
> > +			 * that fails, spin until somebody else updates it for us.
> > +			 */
> > +			if (arm_smmu_cmdq_exclusive_trylock_irqsave(cmdq, iflags)) {
> > +				WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
> > +				arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, iflags);
> > +				llq.val = READ_ONCE(cmdq->q.llq.val);
> > +				local_irq_save(flags);
> 
> I don't quite get the reason why it moves queue_poll_init() and
> add local_irq_save(). It's quite different than what the driver
> has, so it's nicer to explain in the commit message at least.

Let me add the following to the commit message.

The original code has three nested while loops,
do {
	while (!queue_has_space(&llq, n + sync)) {
		// inside arm_smmu_cmdq_poll_until_not_full
		queue_poll_init(smmu, &qp);
		do {
			if(!queue_full(llq))
				break;
			ret = queue_poll(&qp);

		}while (!ret);
	}
	check exit condition

} while (1);

Now, with this patch, we are down to two nested while loops and call
queue_has_space() only, without checking queue_full():

do {
	queue_poll_init(smmu, &qp);
	while (!queue_has_space(&llq, n + sync)) {
		ret = queue_poll(&qp);
		if (ret == -ETIMEDOUT) {
			dev_err();
			queue_poll_init(smmu, &qp);
		} 
	}
	check exit condition
} while (1);

An additional queue_poll_init() is added outside the inner while loop to
arm the timer. We could merge the two queue_poll_init() calls with a
local bool tracking whether init is needed, but IMHO that is not any
better.

Adding local_irq_save() just makes sure it pairs up with
local_irq_restore(); no functional changes.

> I still feel that we could just replace the _until_not_full()
> with a _until_has_space()?
Since the current code uses three nested while loops, replacing the
inner _until_not_full() function means retaining all three nested
while loops and calling queue_has_space() in two places - once in the
middle while loop and again in this _until_has_space() function.

I tried to extract the inner loop into a function but it requires
passing in irqflags to restore. Not pretty.



* Re: [PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency
  2025-10-31  2:00   ` Nicolin Chen
@ 2025-11-04  1:08     ` Jacob Pan
  0 siblings, 0 replies; 12+ messages in thread
From: Jacob Pan @ 2025-11-04  1:08 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Zhang Yu,
	Jean Philippe-Brucker, Alexander Grest

Hi Nicolin,

On Thu, 30 Oct 2025 19:00:02 -0700
Nicolin Chen <nicolinc@nvidia.com> wrote:

> On Mon, Oct 20, 2025 at 03:43:53PM -0700, Jacob Pan wrote:
> > From: Alexander Grest <Alexander.Grest@microsoft.com>
> > 
> > The SMMU CMDQ lock is highly contentious when there are multiple
> > CPUs issuing commands on an architecture with small queue sizes e.g
> > 256 entries.  
> 
> As Robin pointed out that 256 entry itself is not quite normal,
> the justification here might still not be very convincing..
> 
> I'd suggest to avoid saying "an architecture with a small queue
> sizes, but to focus on the issue itself -- potential starvation.
> "256-entry" can be used a testing setup to reproduce the issue.
> 
> > The lock has the following states:
> >  - 0:		Unlocked  
> >  - >0:		Shared lock held with count  
> >  - INT_MIN+N:	Exclusive lock held, where N is the # of
> > shared waiters
> >  - INT_MIN:	Exclusive lock held, no shared waiters
> > 
> > When multiple CPUs are polling for space in the queue, they attempt
> > to grab the exclusive lock to update the cons pointer from the
> > hardware. If they fail to get the lock, they will spin until either
> > the cons pointer is updated by another CPU.
> > 
> > The current code allows the possibility of shared lock starvation
> > if there is a constant stream of CPUs trying to grab the exclusive
> > lock. This leads to severe latency issues and soft lockups.  
> 
> It'd be nicer to have a graph to show how the starvation might
> happen due to a race:
> 
> CPU0 (exclusive)  | CPU1 (shared)     | CPU2 (exclusive)    | `cmdq->lock`
> --------------------------------------------------------------------------
> trylock() //takes |                   |                     | 0
>                   | shared_lock()     |                     | INT_MIN
>                   | fetch_inc()       |                     | INT_MIN
>                   | no return         |                     | INT_MIN + 1
>                   | spins // VAL >= 0 |                     | INT_MIN + 1
> unlock()          | spins...          |                     | INT_MIN + 1
> set_release(0)    | spins...          |                     | 0  <-- BUG?
Not sure we can call it a bug, but it definitely opens the door to
starving the shared lock.

> (done)            | (sees 0)          | trylock() // takes  | 0
>                   | *exits loop*      | cmpxchg(0, INT_MIN) | 0
>                   |                   | *cuts in*           | INT_MIN
>                   | cmpxchg(0, 1)     |                     | INT_MIN
>                   | fails // != 0     |                     | INT_MIN
>                   | spins // VAL >= 0 |                     | INT_MIN
>                   | *starved*         |                     | INT_MIN
>
Thanks for the graph, will incorporate. The starved shared lock also
prevents advancing the cmdq, which perpetuates the situation of
!queue_has_space(&llq, n + sync).
 
> And point it out that it should have reserved the "+1" from CPU1
> instead of nuking the entire cmdq->lock to 0.
> 
Will do. Preserving the "+1" is useful to prevent back-to-back exclusive
lock acquisitions; nuking the counter to 0 loses that information.

> > In a staged test where 32 CPUs issue SVA invalidations
> > simultaneously on a system with a 256 entry queue, the madvise
> > (MADV_DONTNEED) latency dropped by 50% with this patch and without
> > soft lockups.  
> 
> This might not be very useful per Robin's remarks. I'd drop it.
> 
Will do.

> > Reviewed-by: Mostafa Saleh <smostafa@google.com>
> > Signed-off-by: Alexander Grest <Alexander.Grest@microsoft.com>
> > Signed-off-by: Jacob Pan <jacob.pan@linux.microsoft.com>  
> 
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> 
> > @@ -500,9 +506,14 @@ static bool arm_smmu_cmdq_shared_tryunlock(struct arm_smmu_cmdq *cmdq)
> >  	__ret;								\
> >  })
> >  
> > +/*
> > + * Only clear the sign bit when releasing the exclusive lock this will
> > + * allow any shared_lock() waiters to proceed without the possibility
> > + * of entering the exclusive lock in a tight loop.
> > + */
> >  #define arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags)		\
> >  ({									\
> > -	atomic_set_release(&cmdq->lock, 0);				\
> > +	atomic_fetch_and_release(~INT_MIN, &cmdq->lock);				\
> 
> Align the tailing spacing with other lines please.
> 
> Nicolin



* Re: [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
  2025-11-03 23:16     ` Jacob Pan
@ 2025-11-04  1:23       ` Nicolin Chen
  2025-11-04 18:25         ` Jacob Pan
  0 siblings, 1 reply; 12+ messages in thread
From: Nicolin Chen @ 2025-11-04  1:23 UTC (permalink / raw)
  To: Jacob Pan
  Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Zhang Yu,
	Jean Philippe-Brucker, Alexander Grest

On Mon, Nov 03, 2025 at 03:16:31PM -0800, Jacob Pan wrote:
> On Thu, 30 Oct 2025 15:41:57 -0700 Nicolin Chen <nicolinc@nvidia.com> wrote:
> > On Mon, Oct 20, 2025 at 03:43:52PM -0700, Jacob Pan wrote:
> > I still feel that we could just replace the _until_not_full()
> > with a _until_has_space()?

> Since the current code uses three nested while loops, replacing the
> inner _until_not_full() function means means retaining all three nested
> while loops and calling queue_has_space in two places - once in the
> middle while loop then again in this _until_has_space() function.
> 
> I tried to extract the inner loop into a function but it requires
> passing in irqflags to restore. Not pretty.

I think we could do:

-----------------------------------------------------------------
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2a8b46b948f05..1211e087dedca 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -138,12 +138,6 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
 	return space >= n;
 }
 
-static bool queue_full(struct arm_smmu_ll_queue *q)
-{
-	return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
-	       Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
-}
-
 static bool queue_empty(struct arm_smmu_ll_queue *q)
 {
 	return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
@@ -633,14 +627,13 @@ static void arm_smmu_cmdq_poll_valid_map(struct arm_smmu_cmdq *cmdq,
 	__arm_smmu_cmdq_poll_set_valid_map(cmdq, sprod, eprod, false);
 }
 
-/* Wait for the command queue to become non-full */
-static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
-					     struct arm_smmu_cmdq *cmdq,
-					     struct arm_smmu_ll_queue *llq)
+/* Poll command queue PROD and CONS, using a continued timer */
+static inline void arm_smmu_cmdq_poll(struct arm_smmu_device *smmu,
+				      struct arm_smmu_cmdq *cmdq,
+				      struct arm_smmu_ll_queue *llq,
+				      struct arm_smmu_queue_poll *qp)
 {
 	unsigned long flags;
-	struct arm_smmu_queue_poll qp;
-	int ret = 0;
 
 	/*
 	 * Try to update our copy of cons by grabbing exclusive cmdq access. If
@@ -650,19 +643,18 @@ static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
 		WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
 		arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
 		llq->val = READ_ONCE(cmdq->q.llq.val);
-		return 0;
+		return;
 	}
 
-	queue_poll_init(smmu, &qp);
-	do {
-		llq->val = READ_ONCE(cmdq->q.llq.val);
-		if (!queue_full(llq))
-			break;
-
-		ret = queue_poll(&qp);
-	} while (!ret);
-
-	return ret;
+	/* queue_poll() returns 0 or -ETIMEDOUT only */
+	if (queue_poll(qp)) {
+		dev_err_ratelimited(smmu->dev,
+				    "CMDQ timeout at prod 0x%08x cons 0x%08x\n",
+				    llq->prod, llq->cons);
+		/* Restart the timer */
+		queue_poll_init(smmu, qp);
+	}
+	llq->val = READ_ONCE(cmdq->q.llq.val);
 }
 
 /*
@@ -804,12 +796,13 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
 	local_irq_save(flags);
 	llq.val = READ_ONCE(cmdq->q.llq.val);
 	do {
+		struct arm_smmu_queue_poll qp;
 		u64 old;
 
+		queue_poll_init(smmu, &qp);
 		while (!queue_has_space(&llq, n + sync)) {
 			local_irq_restore(flags);
-			if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
-				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
+			arm_smmu_cmdq_poll(smmu, cmdq, &llq, &qp);
 			local_irq_save(flags);
 		}
 
-----------------------------------------------------------------

And the commit message should point out:

The existing arm_smmu_cmdq_poll_until_not_full() doesn't fit efficiently
or cleanly into its only caller, arm_smmu_cmdq_issue_cmdlist():
 - It starts a new timer on every single call, which fails to limit the
   wait to the preset ARM_SMMU_POLL_TIMEOUT_US per issue.
 - It has a redundant internal queue_full() check, which doesn't detect
   whether there is enough space for n commands.

So, rework it into an inline helper that works with queue_has_space().

Nicolin


* Re: [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
  2025-11-04  1:23       ` Nicolin Chen
@ 2025-11-04 18:25         ` Jacob Pan
  2025-11-04 18:48           ` Nicolin Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Jacob Pan @ 2025-11-04 18:25 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Zhang Yu,
	Jean Philippe-Brucker, Alexander Grest

Hi Nicolin,

On Mon, 3 Nov 2025 17:23:26 -0800
Nicolin Chen <nicolinc@nvidia.com> wrote:

> On Mon, Nov 03, 2025 at 03:16:31PM -0800, Jacob Pan wrote:
> > On Thu, 30 Oct 2025 15:41:57 -0700 Nicolin Chen
> > <nicolinc@nvidia.com> wrote:  
> > > On Mon, Oct 20, 2025 at 03:43:52PM -0700, Jacob Pan wrote:
> > > I still feel that we could just replace the _until_not_full()
> > > with a _until_has_space()?  
> 
> > Since the current code uses three nested while loops, replacing the
> > inner _until_not_full() function means retaining all three nested
> > while loops and calling queue_has_space() in two places - once in
> > the middle while loop and again in this _until_has_space() function.
> > 
> > I tried to extract the inner loop into a function but it requires
> > passing in irqflags to restore. Not pretty.  
> 
> I think we could do:
> 
> -----------------------------------------------------------------
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 2a8b46b948f05..1211e087dedca 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -138,12 +138,6 @@ static bool queue_has_space(struct arm_smmu_ll_queue *q, u32 n)
>  	return space >= n;
>  }
>  
> -static bool queue_full(struct arm_smmu_ll_queue *q)
> -{
> -	return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
> -	       Q_WRP(q, q->prod) != Q_WRP(q, q->cons);
> -}
> -
>  static bool queue_empty(struct arm_smmu_ll_queue *q)
>  {
>  	return Q_IDX(q, q->prod) == Q_IDX(q, q->cons) &&
> @@ -633,14 +627,13 @@ static void arm_smmu_cmdq_poll_valid_map(struct arm_smmu_cmdq *cmdq,
>  	__arm_smmu_cmdq_poll_set_valid_map(cmdq, sprod, eprod, false);
>  }
>  
> -/* Wait for the command queue to become non-full */
> -static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
> -					     struct arm_smmu_cmdq *cmdq,
> -					     struct arm_smmu_ll_queue *llq)
> +/* Poll command queue PROD and CONS, using a continued timer */
> +static inline void arm_smmu_cmdq_poll(struct arm_smmu_device *smmu,
> +				      struct arm_smmu_cmdq *cmdq,
> +				      struct arm_smmu_ll_queue *llq,
> +				      struct arm_smmu_queue_poll *qp)
>  {
>  	unsigned long flags;
> -	struct arm_smmu_queue_poll qp;
> -	int ret = 0;
>  
>  	/*
>  	 * Try to update our copy of cons by grabbing exclusive cmdq access. If
> @@ -650,19 +643,18 @@ static int arm_smmu_cmdq_poll_until_not_full(struct arm_smmu_device *smmu,
>  		WRITE_ONCE(cmdq->q.llq.cons, readl_relaxed(cmdq->q.cons_reg));
>  		arm_smmu_cmdq_exclusive_unlock_irqrestore(cmdq, flags);
>  		llq->val = READ_ONCE(cmdq->q.llq.val);
> -		return 0;
> +		return;
>  	}
>  
> -	queue_poll_init(smmu, &qp);
> -	do {
> -		llq->val = READ_ONCE(cmdq->q.llq.val);
> -		if (!queue_full(llq))
> -			break;
> -
> -		ret = queue_poll(&qp);
> -	} while (!ret);
> -
> -	return ret;
> +	/* queue_poll() returns 0 or -ETIMEDOUT only */
> +	if (queue_poll(qp)) {
I would still prefer a more defensive approach, to prevent a future
change where queue_poll() returns some other error from being silently
treated as -ETIMEDOUT.

> +		dev_err_ratelimited(smmu->dev,
> +				    "CMDQ timeout at prod 0x%08x cons 0x%08x\n",
> +				    llq->prod, llq->cons);
> +		/* Restart the timer */
> +		queue_poll_init(smmu, qp);
> +	}
> +	llq->val = READ_ONCE(cmdq->q.llq.val);
>  }
>  
>  /*
> @@ -804,12 +796,13 @@ int arm_smmu_cmdq_issue_cmdlist(struct arm_smmu_device *smmu,
>  	local_irq_save(flags);
>  	llq.val = READ_ONCE(cmdq->q.llq.val);
>  	do {
> +		struct arm_smmu_queue_poll qp;
>  		u64 old;
>  
> +		queue_poll_init(smmu, &qp);
>  		while (!queue_has_space(&llq, n + sync)) {
>  			local_irq_restore(flags);
> -			if (arm_smmu_cmdq_poll_until_not_full(smmu, cmdq, &llq))
> -				dev_err_ratelimited(smmu->dev, "CMDQ timeout\n");
> +			arm_smmu_cmdq_poll(smmu, cmdq, &llq, &qp);
> +			arm_smmu_cmdq_poll(smmu, cmdq, &llq, &qp);
>  			local_irq_save(flags);
>  		}
>  
Yeah, that should work. It is more readable than open-coding it.

> -----------------------------------------------------------------
> 
> And the commit message should point out:
> 
> The existing arm_smmu_cmdq_poll_until_not_full() doesn't fit
> efficiently nor ideally to the only caller
> arm_smmu_cmdq_issue_cmdlist():
>  - It uses a new timer at every single call, which fails to limit to
> the preset ARM_SMMU_POLL_TIMEOUT_US per issue.
Not following what you mean.
The original code below does honor the timeout of
ARM_SMMU_POLL_TIMEOUT_US

-       queue_poll_init(smmu, &qp);
-       do {
-               llq->val = READ_ONCE(cmdq->q.llq.val);
-               if (!queue_full(llq))
-                       break;
-
-               ret = queue_poll(&qp);
-       } while (!ret);

>  - It has a redundant internal queue_full() check, which doesn't
> detect whether there is enough space for n commands.
Will incorporate, though the same points are already mentioned in the
current commit message.
 
> So, rework it to be an inline helper to work with the
> queue_has_space().
> 

> Nicolin



* Re: [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
  2025-11-04 18:25         ` Jacob Pan
@ 2025-11-04 18:48           ` Nicolin Chen
  2025-11-04 19:37             ` Jacob Pan
  0 siblings, 1 reply; 12+ messages in thread
From: Nicolin Chen @ 2025-11-04 18:48 UTC (permalink / raw)
  To: Jacob Pan
  Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Zhang Yu,
	Jean Philippe-Brucker, Alexander Grest

On Tue, Nov 04, 2025 at 10:25:39AM -0800, Jacob Pan wrote:
> > -----------------------------------------------------------------
> > 
> > And the commit message should point out:
> > 
> > The existing arm_smmu_cmdq_poll_until_not_full() doesn't fit
> > efficiently nor ideally to the only caller
> > arm_smmu_cmdq_issue_cmdlist():
> >  - It uses a new timer at every single call, which fails to limit to
> > the preset ARM_SMMU_POLL_TIMEOUT_US per issue.

> Not following what you mean.
> The original code below does honor the timeout of
> ARM_SMMU_POLL_TIMEOUT_US

It sets the timeout per arm_smmu_cmdq_poll_until_not_full(), not
the entire wait-for-space routine. And that's why you moved the
queue_poll_init() to the caller, right?

Nicolin


* Re: [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning
  2025-11-04 18:48           ` Nicolin Chen
@ 2025-11-04 19:37             ` Jacob Pan
  0 siblings, 0 replies; 12+ messages in thread
From: Jacob Pan @ 2025-11-04 19:37 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: linux-kernel, iommu@lists.linux.dev, Will Deacon, Joerg Roedel,
	Mostafa Saleh, Jason Gunthorpe, Robin Murphy, Zhang Yu,
	Jean Philippe-Brucker, Alexander Grest

Hi Nicolin,

On Tue, 4 Nov 2025 10:48:31 -0800
Nicolin Chen <nicolinc@nvidia.com> wrote:

> On Tue, Nov 04, 2025 at 10:25:39AM -0800, Jacob Pan wrote:
> > > -----------------------------------------------------------------
> > > 
> > > And the commit message should point out:
> > > 
> > > The existing arm_smmu_cmdq_poll_until_not_full() doesn't fit
> > > efficiently nor ideally to the only caller
> > > arm_smmu_cmdq_issue_cmdlist():
> > >  - It uses a new timer at every single call, which fails to limit
> > > to the preset ARM_SMMU_POLL_TIMEOUT_US per issue.  
> 
> > Not following what you mean.
> > The original code below does honor the timeout of
> > ARM_SMMU_POLL_TIMEOUT_US  
> 
> It sets the timeout per arm_smmu_cmdq_poll_until_not_full(), not
> the entire wait-for-space routine. And that's why you moved the
> queue_poll_init() to the caller, right?
> 
Got you! will do.

Thanks,

Jacob



end of thread, other threads:[~2025-11-04 19:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-20 22:43 [PATCH v2 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan
2025-10-20 22:43 ` [PATCH v2 1/2] iommu/arm-smmu-v3: Fix CMDQ timeout warning Jacob Pan
2025-10-30 22:41   ` Nicolin Chen
2025-11-03 23:16     ` Jacob Pan
2025-11-04  1:23       ` Nicolin Chen
2025-11-04 18:25         ` Jacob Pan
2025-11-04 18:48           ` Nicolin Chen
2025-11-04 19:37             ` Jacob Pan
2025-10-20 22:43 ` [PATCH v2 2/2] iommu/arm-smmu-v3: Improve CMDQ lock fairness and efficiency Jacob Pan
2025-10-31  2:00   ` Nicolin Chen
2025-11-04  1:08     ` Jacob Pan
2025-10-30 15:42 ` [PATCH v2 0/2] SMMU v3 CMDQ fix and improvement Jacob Pan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox