* [PATCH 0/3] iommu/arm-smmu-v3: Reduce latency in __arm_smmu_tlb_inv_range()
@ 2023-08-22  8:45 Nicolin Chen
  2023-08-22  8:45 ` [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg Nicolin Chen
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Nicolin Chen @ 2023-08-22  8:45 UTC (permalink / raw)
  To: will, robin.murphy, jgg
  Cc: joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

This series is reworked from the tlb_invalidate_threshold patch:
https://lore.kernel.org/linux-iommu/ZN5oojF6vKOKB%2FeI@Asurada-Nvidia/
The discussion in that thread moved towards adding an arbitrary
max_tlbi_ops (similar to MAX_TLBI_OPS in arm64's tlbflush.h).

The changes here aim to reduce the latency of __arm_smmu_tlb_inv_range()
caused by the large number of TLBI commands issued on an SMMU that lacks
support for range TLBI commands. The solution is to add a threshold on
the number of TLBI commands that one __arm_smmu_tlb_inv_range() call may
issue; once an invalidation exceeds it, the per-granule TLBI commands are
replaced with a single per-asid or per-vmid TLBI command.

Nicolin Chen (3):
  iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper
  iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range()

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 34 +++++++++++++++------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 drivers/iommu/io-pgtable-arm.c              |  3 ++
 include/linux/io-pgtable.h                  |  2 ++
 4 files changed, 30 insertions(+), 10 deletions(-)

-- 
2.41.0



* [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-22  8:45 [PATCH 0/3] iommu/arm-smmu-v3: Reduce latency in __arm_smmu_tlb_inv_range() Nicolin Chen
@ 2023-08-22  8:45 ` Nicolin Chen
  2023-08-22  9:19   ` Robin Murphy
  2023-08-22  8:45 ` [PATCH 2/3] iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper Nicolin Chen
  2023-08-22  8:45 ` [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range() Nicolin Chen
  2 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-22  8:45 UTC (permalink / raw)
  To: will, robin.murphy, jgg
  Cc: joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

Add a new nents_per_pgtable in struct io_pgtable_cfg to store the number
of entries per IO pagetable, so it can be passed back to an IOMMU driver.
It will be used by one of the following changes to set the maximum number
of TLBI operations in the arm-smmu-v3 driver.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/io-pgtable-arm.c | 3 +++
 include/linux/io-pgtable.h     | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 72dcdd468cf3..7583d9ecca2b 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -891,6 +891,7 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
 
 	/* TTBR */
 	cfg->arm_lpae_s1_cfg.ttbr = virt_to_phys(data->pgd);
+	cfg->nents_per_pgtable = 1 << data->bits_per_level;
 	return &data->iop;
 
 out_free_data:
@@ -993,6 +994,7 @@ arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie)
 
 	/* VTTBR */
 	cfg->arm_lpae_s2_cfg.vttbr = virt_to_phys(data->pgd);
+	cfg->nents_per_pgtable = 1 << data->bits_per_level;
 	return &data->iop;
 
 out_free_data:
@@ -1071,6 +1073,7 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
 					  ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
 	if (cfg->coherent_walk)
 		cfg->arm_mali_lpae_cfg.transtab |= ARM_MALI_LPAE_TTBR_SHARE_OUTER;
+	cfg->nents_per_pgtable = 1 << data->bits_per_level;
 
 	return &data->iop;
 
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 1b7a44b35616..4b55a327abc1 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -55,6 +55,7 @@ struct iommu_flush_ops {
  *                 tables.
  * @ias:           Input address (iova) size, in bits.
  * @oas:           Output address (paddr) size, in bits.
+ * @nents_per_pgtable: Number of entries per page table.
  * @coherent_walk  A flag to indicate whether or not page table walks made
  *                 by the IOMMU are coherent with the CPU caches.
  * @tlb:           TLB management callbacks for this set of tables.
@@ -96,6 +97,7 @@ struct io_pgtable_cfg {
 	unsigned long			pgsize_bitmap;
 	unsigned int			ias;
 	unsigned int			oas;
+	unsigned int			nents_per_pgtable;
 	bool				coherent_walk;
 	const struct iommu_flush_ops	*tlb;
 	struct device			*iommu_dev;
-- 
2.41.0



* [PATCH 2/3] iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper
  2023-08-22  8:45 [PATCH 0/3] iommu/arm-smmu-v3: Reduce latency in __arm_smmu_tlb_inv_range() Nicolin Chen
  2023-08-22  8:45 ` [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg Nicolin Chen
@ 2023-08-22  8:45 ` Nicolin Chen
  2023-08-22  9:40   ` Robin Murphy
  2023-08-22  8:45 ` [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range() Nicolin Chen
  2 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-22  8:45 UTC (permalink / raw)
  To: will, robin.murphy, jgg
  Cc: joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

Move the part of per-asid or per-vmid invalidation command issuing into a
new helper function, which will be used in the following change.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 +++++++++++++--------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9b0dc3505601..d6c647e1eb01 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1854,12 +1854,24 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
 	return arm_smmu_cmdq_batch_submit(smmu_domain->smmu, &cmds);
 }
 
+static void arm_smmu_tlb_inv_domain(struct arm_smmu_domain *smmu_domain)
+{
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	struct arm_smmu_cmdq_ent cmd;
+
+	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
+		arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
+	} else {
+		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
+		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
+		arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
+	}
+}
+
 /* IO_PGTABLE API */
 static void arm_smmu_tlb_inv_context(void *cookie)
 {
 	struct arm_smmu_domain *smmu_domain = cookie;
-	struct arm_smmu_device *smmu = smmu_domain->smmu;
-	struct arm_smmu_cmdq_ent cmd;
 
 	/*
 	 * NOTE: when io-pgtable is in non-strict mode, we may get here with
@@ -1868,13 +1880,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
 	 * insertion to guarantee those are observed before the TLBI. Do be
 	 * careful, 007.
 	 */
-	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
-		arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
-	} else {
-		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
-		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
-		arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
-	}
+	arm_smmu_tlb_inv_domain(smmu_domain);
 	arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
 }
 
-- 
2.41.0



* [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range()
  2023-08-22  8:45 [PATCH 0/3] iommu/arm-smmu-v3: Reduce latency in __arm_smmu_tlb_inv_range() Nicolin Chen
  2023-08-22  8:45 ` [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg Nicolin Chen
  2023-08-22  8:45 ` [PATCH 2/3] iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper Nicolin Chen
@ 2023-08-22  8:45 ` Nicolin Chen
  2023-08-22  9:30   ` Robin Murphy
  2 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-22  8:45 UTC (permalink / raw)
  To: will, robin.murphy, jgg
  Cc: joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

When receiving an __arm_smmu_tlb_inv_range() call with a large size, there
could be a long latency at this function call: one part is coming from a
large software overhead in the routine of building commands, and the other
part is coming from CMDQ hardware consuming the large number of commands.
This latency could be significantly large on an SMMU that does not support
range invalidation commands, i.e. no ARM_SMMU_FEAT_RANGE_INV.

One way to optimize this is to replace a large number of VA invalidation
commands with one single per-asid or per-vmid invalidation command, if an
invalidation size reaches a preset threshold using the number of entries
per io-pgtable, similar to the MAX_TLBI_OPS in arm64's tlbflush.h.

Add a max_tlbi_ops in arm_smmu_domain, and convert a large number of per-
granule TLBI commands to one single per-asid or per-vmid TLBI command, on
SMMUs without ARM_SMMU_FEAT_RANGE_INV.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 +++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d6c647e1eb01..3f0db30932bd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
 	if (!size)
 		return;
 
-	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
+	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
+		/*
+		 * When the size reaches a threshold, replace per-granule TLBI
+		 * commands with one single per-asid or per-vmid TLBI command.
+		 */
+		if (size >= granule * smmu_domain->max_tlbi_ops)
+			return arm_smmu_tlb_inv_domain(smmu_domain);
+	} else {
 		/* Get the leaf page size */
 		tg = __ffs(smmu_domain->domain.pgsize_bitmap);
 
@@ -2258,6 +2265,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
 	}
 
 	smmu_domain->pgtbl_ops = pgtbl_ops;
+	smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;
 	return 0;
 }
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index dcab85698a4e..f68c95a2e24f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -721,6 +721,7 @@ struct arm_smmu_domain {
 	struct io_pgtable_ops		*pgtbl_ops;
 	bool				stall_enabled;
 	atomic_t			nr_ats_masters;
+	unsigned long			max_tlbi_ops;
 
 	enum arm_smmu_domain_stage	stage;
 	union {
-- 
2.41.0



* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-22  8:45 ` [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg Nicolin Chen
@ 2023-08-22  9:19   ` Robin Murphy
  2023-08-22 16:42     ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2023-08-22  9:19 UTC (permalink / raw)
  To: Nicolin Chen, will, jgg
  Cc: joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

On 2023-08-22 09:45, Nicolin Chen wrote:
> Add a new nents_per_pgtable in struct io_pgtable_cfg to store the number
> of entries per IO pagetable, so it can be passed back to an IOMMU driver.
> It will be used by one of the following changes to set the maximum number
> of TLBI operations in the arm-smmu-v3 driver.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/io-pgtable-arm.c | 3 +++
>   include/linux/io-pgtable.h     | 2 ++
>   2 files changed, 5 insertions(+)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 72dcdd468cf3..7583d9ecca2b 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -891,6 +891,7 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
>   
>   	/* TTBR */
>   	cfg->arm_lpae_s1_cfg.ttbr = virt_to_phys(data->pgd);
> +	cfg->nents_per_pgtable = 1 << data->bits_per_level;
>   	return &data->iop;
>   
>   out_free_data:
> @@ -993,6 +994,7 @@ arm_64_lpae_alloc_pgtable_s2(struct io_pgtable_cfg *cfg, void *cookie)
>   
>   	/* VTTBR */
>   	cfg->arm_lpae_s2_cfg.vttbr = virt_to_phys(data->pgd);
> +	cfg->nents_per_pgtable = 1 << data->bits_per_level;
>   	return &data->iop;
>   
>   out_free_data:
> @@ -1071,6 +1073,7 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
>   					  ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
>   	if (cfg->coherent_walk)
>   		cfg->arm_mali_lpae_cfg.transtab |= ARM_MALI_LPAE_TTBR_SHARE_OUTER;
> +	cfg->nents_per_pgtable = 1 << data->bits_per_level;

The result of this highly complex and expensive calculation is clearly 
redundant with the existing bits_per_level field, so why do we need to 
waste space storing when the driver could simply use bits_per_level?

Thanks,
Robin.

>   	return &data->iop;
>   
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index 1b7a44b35616..4b55a327abc1 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -55,6 +55,7 @@ struct iommu_flush_ops {
>    *                 tables.
>    * @ias:           Input address (iova) size, in bits.
>    * @oas:           Output address (paddr) size, in bits.
> + * @nents_per_pgtable: Number of entries per page table.
>    * @coherent_walk  A flag to indicate whether or not page table walks made
>    *                 by the IOMMU are coherent with the CPU caches.
>    * @tlb:           TLB management callbacks for this set of tables.
> @@ -96,6 +97,7 @@ struct io_pgtable_cfg {
>   	unsigned long			pgsize_bitmap;
>   	unsigned int			ias;
>   	unsigned int			oas;
> +	unsigned int			nents_per_pgtable;
>   	bool				coherent_walk;
>   	const struct iommu_flush_ops	*tlb;
>   	struct device			*iommu_dev;


* Re: [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range()
  2023-08-22  8:45 ` [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range() Nicolin Chen
@ 2023-08-22  9:30   ` Robin Murphy
  2023-08-22 16:32     ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2023-08-22  9:30 UTC (permalink / raw)
  To: Nicolin Chen, will, jgg
  Cc: joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

On 2023-08-22 09:45, Nicolin Chen wrote:
> When receiving an __arm_smmu_tlb_inv_range() call with a large size, there
> could be a long latency at this function call: one part is coming from a
> large software overhead in the routine of building commands, and the other
> part is coming from CMDQ hardware consuming the large number of commands.
> This latency could be significantly large on an SMMU that does not support
> range invalidation commands, i.e. no ARM_SMMU_FEAT_RANGE_INV.
> 
> One way to optimize this is to replace a large number of VA invalidation
> commands with one single per-asid or per-vmid invalidation command, if an
> invalidation size reaches a preset threshold using the number of entries
> per io-pgtable, similar to the MAX_TLBI_OPS in arm64's tlbflush.h.
> 
> Add a max_tlbi_ops in arm_smmu_domain, and convert a large number of per-
> granule TLBI commands to one single per-asid or per-vmid TLBI command, on
> SMMUs without ARM_SMMU_FEAT_RANGE_INV.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 +++++++++-
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
>   2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index d6c647e1eb01..3f0db30932bd 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
>   	if (!size)
>   		return;
>   
> -	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> +	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> +		/*
> +		 * When the size reaches a threshold, replace per-granule TLBI
> +		 * commands with one single per-asid or per-vmid TLBI command.
> +		 */
> +		if (size >= granule * smmu_domain->max_tlbi_ops)
> +			return arm_smmu_tlb_inv_domain(smmu_domain);

This looks like it's at the wrong level - we should have figured this 
out before we got as far as low-level command-building. I'd have thought 
it would be a case of short-circuiting directly from 
arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().

> +	} else {
>   		/* Get the leaf page size */
>   		tg = __ffs(smmu_domain->domain.pgsize_bitmap);
>   
> @@ -2258,6 +2265,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>   	}
>   
>   	smmu_domain->pgtbl_ops = pgtbl_ops;
> +	smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;

And now we're carrying *three* copies of the same information around 
everywhere? Honestly, just pull cfg->bits_per_level out of the 
io_pgtable_ops at the point where you need it, like the pagetable code 
itself manages to do perfectly happily. Wrap it in an io-pgtable helper 
if you think that's cleaner.

Thanks,
Robin.

>   	return 0;
>   }
>   
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index dcab85698a4e..f68c95a2e24f 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -721,6 +721,7 @@ struct arm_smmu_domain {
>   	struct io_pgtable_ops		*pgtbl_ops;
>   	bool				stall_enabled;
>   	atomic_t			nr_ats_masters;
> +	unsigned long			max_tlbi_ops;
>   
>   	enum arm_smmu_domain_stage	stage;
>   	union {


* Re: [PATCH 2/3] iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper
  2023-08-22  8:45 ` [PATCH 2/3] iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper Nicolin Chen
@ 2023-08-22  9:40   ` Robin Murphy
  2023-08-22 17:03     ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2023-08-22  9:40 UTC (permalink / raw)
  To: Nicolin Chen, will, jgg
  Cc: joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

On 2023-08-22 09:45, Nicolin Chen wrote:
> Move the part of per-asid or per-vmid invalidation command issuing into a
> new helper function, which will be used in the following change.

Why? This achieves nothing except make the code harder to follow and 
disconnect the rather important comment even further from the code it is 
significant to. It's not like we need a specific prototype to take a 
function pointer from, it's just another internal call - see 
arm_smmu_flush_iotlb_all() for instance. We know the cookie is an 
arm_smmu_domain pointer because we put it there, and converting it back 
from a void pointer is exactly the same *at* the function call boundary 
as immediately afterwards.
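
For reference, that existing call site looks roughly like this (paraphrased
from arm-smmu-v3.c, not quoted verbatim):

	static void arm_smmu_flush_iotlb_all(struct iommu_domain *domain)
	{
		struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

		/* Typed pointer passed straight into the void *cookie parameter */
		if (smmu_domain->smmu)
			arm_smmu_tlb_inv_context(smmu_domain);
	}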

Thanks,
Robin.

> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 +++++++++++++--------
>   1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 9b0dc3505601..d6c647e1eb01 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1854,12 +1854,24 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
>   	return arm_smmu_cmdq_batch_submit(smmu_domain->smmu, &cmds);
>   }
>   
> +static void arm_smmu_tlb_inv_domain(struct arm_smmu_domain *smmu_domain)
> +{
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +	struct arm_smmu_cmdq_ent cmd;
> +
> +	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
> +		arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
> +	} else {
> +		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
> +		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
> +		arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> +	}
> +}
> +
>   /* IO_PGTABLE API */
>   static void arm_smmu_tlb_inv_context(void *cookie)
>   {
>   	struct arm_smmu_domain *smmu_domain = cookie;
> -	struct arm_smmu_device *smmu = smmu_domain->smmu;
> -	struct arm_smmu_cmdq_ent cmd;
>   
>   	/*
>   	 * NOTE: when io-pgtable is in non-strict mode, we may get here with
> @@ -1868,13 +1880,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
>   	 * insertion to guarantee those are observed before the TLBI. Do be
>   	 * careful, 007.
>   	 */
> -	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
> -		arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
> -	} else {
> -		cmd.opcode	= CMDQ_OP_TLBI_S12_VMALL;
> -		cmd.tlbi.vmid	= smmu_domain->s2_cfg.vmid;
> -		arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> -	}
> +	arm_smmu_tlb_inv_domain(smmu_domain);
>   	arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
>   }
>   


* Re: [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range()
  2023-08-22  9:30   ` Robin Murphy
@ 2023-08-22 16:32     ` Nicolin Chen
  2023-08-22 23:04       ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-22 16:32 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On Tue, Aug 22, 2023 at 10:30:35AM +0100, Robin Murphy wrote:

> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index d6c647e1eb01..3f0db30932bd 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> >       if (!size)
> >               return;
> > 
> > -     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > +     if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> > +             /*
> > +              * When the size reaches a threshold, replace per-granule TLBI
> > +              * commands with one single per-asid or per-vmid TLBI command.
> > +              */
> > +             if (size >= granule * smmu_domain->max_tlbi_ops)
> > +                     return arm_smmu_tlb_inv_domain(smmu_domain);
> 
> This looks like it's at the wrong level - we should have figured this
> out before we got as far as low-level command-building. I'd have thought
> it would be a case of short-circuiting directly from
> arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().

OK, I could do that. We would have copies of this same routine
though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
cases only, so this function feels convenient to me.

> > +     } else {
> >               /* Get the leaf page size */
> >               tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> > 
> > @@ -2258,6 +2265,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
> >       }
> > 
> >       smmu_domain->pgtbl_ops = pgtbl_ops;
> > +     smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;
> 
> And now we're carrying *three* copies of the same information around
> everywhere? Honestly, just pull cfg->bits_per_level out of the
> io_pgtable_ops at the point where you need it, like the pagetable code
> itself manages to do perfectly happily. Wrap it in an io-pgtable helper
> if you think that's cleaner.

OK. I overlooked io_pgtable_ops_to_pgtable. Will do that.

Thanks
Nic


* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-22  9:19   ` Robin Murphy
@ 2023-08-22 16:42     ` Nicolin Chen
  2023-08-29 15:37       ` Robin Murphy
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-22 16:42 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On Tue, Aug 22, 2023 at 10:19:21AM +0100, Robin Murphy wrote:

> >   out_free_data:
> > @@ -1071,6 +1073,7 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
> >                                         ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
> >       if (cfg->coherent_walk)
> >               cfg->arm_mali_lpae_cfg.transtab |= ARM_MALI_LPAE_TTBR_SHARE_OUTER;
> > +     cfg->nents_per_pgtable = 1 << data->bits_per_level;
> 
> The result of this highly complex and expensive calculation is clearly
> redundant with the existing bits_per_level field, so why do we need to
> waste space storing when the driver could simply use bits_per_level?

bits_per_level is in the private struct arm_lpae_io_pgtable, while
drivers can only access struct io_pgtable_cfg. Are you suggesting
to move bits_per_level out of the private struct arm_lpae_io_pgtable
to the public struct io_pgtable_cfg?

Or am I missing another bits_per_level?

Thanks
Nicolin


* Re: [PATCH 2/3] iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper
  2023-08-22  9:40   ` Robin Murphy
@ 2023-08-22 17:03     ` Nicolin Chen
  2023-08-29 21:54       ` Robin Murphy
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-22 17:03 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On Tue, Aug 22, 2023 at 10:40:18AM +0100, Robin Murphy wrote:
 
> On 2023-08-22 09:45, Nicolin Chen wrote:
> > Move the part of per-asid or per-vmid invalidation command issuing into a
> > new helper function, which will be used in the following change.
> 
> Why? This achieves nothing except make the code harder to follow and
> disconnect the rather important comment even further from the code it is

We need the same if-else routine to issue a per-asid or per-vmid
TLBI command. If making a copy of this same routine feels better
to you, yea, I can change that.

> significant to. It's not like we need a specific prototype to take a
> function pointer from, it's just another internal call - see
> arm_smmu_flush_iotlb_all() for instance. We know the cookie is an
> arm_smmu_domain pointer because we put it there, and converting it back
> from a void pointer is exactly the same *at* the function call boundary
> as immediately afterwards.

Hmm, I am not quite following this. What do you suggest here?

Thanks
Nic

> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > ---
> >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 +++++++++++++--------
> >   1 file changed, 15 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 9b0dc3505601..d6c647e1eb01 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -1854,12 +1854,24 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
> >       return arm_smmu_cmdq_batch_submit(smmu_domain->smmu, &cmds);
> >   }
> > 
> > +static void arm_smmu_tlb_inv_domain(struct arm_smmu_domain *smmu_domain)
> > +{
> > +     struct arm_smmu_device *smmu = smmu_domain->smmu;
> > +     struct arm_smmu_cmdq_ent cmd;
> > +
> > +     if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
> > +             arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
> > +     } else {
> > +             cmd.opcode      = CMDQ_OP_TLBI_S12_VMALL;
> > +             cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
> > +             arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> > +     }
> > +}
> > +
> >   /* IO_PGTABLE API */
> >   static void arm_smmu_tlb_inv_context(void *cookie)
> >   {
> >       struct arm_smmu_domain *smmu_domain = cookie;
> > -     struct arm_smmu_device *smmu = smmu_domain->smmu;
> > -     struct arm_smmu_cmdq_ent cmd;
> > 
> >       /*
> >        * NOTE: when io-pgtable is in non-strict mode, we may get here with
> > @@ -1868,13 +1880,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
> >        * insertion to guarantee those are observed before the TLBI. Do be
> >        * careful, 007.
> >        */
> > -     if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
> > -             arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
> > -     } else {
> > -             cmd.opcode      = CMDQ_OP_TLBI_S12_VMALL;
> > -             cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
> > -             arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> > -     }
> > +     arm_smmu_tlb_inv_domain(smmu_domain);
> >       arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
> >   }
> > 


* Re: [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range()
  2023-08-22 16:32     ` Nicolin Chen
@ 2023-08-22 23:04       ` Nicolin Chen
  2023-08-29 22:40         ` Robin Murphy
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-22 23:04 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

Hi Robin,

On Tue, Aug 22, 2023 at 09:32:26AM -0700, Nicolin Chen wrote:
> On Tue, Aug 22, 2023 at 10:30:35AM +0100, Robin Murphy wrote:
> 
> > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > index d6c647e1eb01..3f0db30932bd 100644
> > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> > >       if (!size)
> > >               return;
> > > 
> > > -     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > > +     if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> > > +             /*
> > > +              * When the size reaches a threshold, replace per-granule TLBI
> > > +              * commands with one single per-asid or per-vmid TLBI command.
> > > +              */
> > > +             if (size >= granule * smmu_domain->max_tlbi_ops)
> > > +                     return arm_smmu_tlb_inv_domain(smmu_domain);
> > 
> > This looks like it's at the wrong level - we should have figured this
> > out before we got as far as low-level command-building. I'd have thought
> > it would be a case of short-circuiting directly from
> > arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
> 
> OK, I could do that. We would have copies of this same routine
> though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
> cases only, so this function feels convenient to me.

I was trying to say that we would need the same piece in both
arm_smmu_tlb_inv_range_domain() and arm_smmu_tlb_inv_range_asid(),
though the latter one only needs to call arm_smmu_tlb_inv_asid().

Also, arm_smmu_tlb_inv_context() does a full range ATC invalidation
instead of a given range like what arm_smmu_tlb_inv_range_domain()
currently does. So, it might be a bit overkill.

Combining all your comments, we'd have something like this:

-------------------------------------------------------------------
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7614739ea2c1..2967a6634c7c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1937,12 +1937,22 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 					  size_t granule, bool leaf,
 					  struct arm_smmu_domain *smmu_domain)
 {
+	struct io_pgtable_cfg *cfg =
+		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
 	struct arm_smmu_cmdq_ent cmd = {
 		.tlbi = {
 			.leaf	= leaf,
 		},
 	};
 
+	/*
+	 * If the given size is so large that it would end up with too many TLBI
+	 * commands in the CMDQ, short-circuit directly to a full invalidation
+	 */
+	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+	    size >= granule * (1UL << cfg->bits_per_level))
+		return arm_smmu_tlb_inv_context(smmu_domain);
+
 	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
 		cmd.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
 				  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
@@ -1964,6 +1974,8 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
 				 size_t granule, bool leaf,
 				 struct arm_smmu_domain *smmu_domain)
 {
+	struct io_pgtable_cfg *cfg =
+		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
 	struct arm_smmu_cmdq_ent cmd = {
 		.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
 			  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
@@ -1973,6 +1985,14 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
 		},
 	};
 
+	/*
+	 * If the given size is so large that it would end up with too many TLBI
+	 * commands in the CMDQ, short-circuit directly to a full invalidation
+	 */
+	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+	    size >= granule * (1UL << cfg->bits_per_level))
+		return arm_smmu_tlb_inv_asid(smmu_domain->smmu, asid);
+
 	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
 }
 
-------------------------------------------------------------------

You're sure that you prefer this, right?

Thanks
Nicolin


* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-22 16:42     ` Nicolin Chen
@ 2023-08-29 15:37       ` Robin Murphy
  2023-08-29 20:15         ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2023-08-29 15:37 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On 2023-08-22 17:42, Nicolin Chen wrote:
> On Tue, Aug 22, 2023 at 10:19:21AM +0100, Robin Murphy wrote:
> 
>>>    out_free_data:
>>> @@ -1071,6 +1073,7 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
>>>                                          ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
>>>        if (cfg->coherent_walk)
>>>                cfg->arm_mali_lpae_cfg.transtab |= ARM_MALI_LPAE_TTBR_SHARE_OUTER;
>>> +     cfg->nents_per_pgtable = 1 << data->bits_per_level;
>>
>> The result of this highly complex and expensive calculation is clearly
>> redundant with the existing bits_per_level field, so why do we need to
>> waste space storing when the driver could simply use bits_per_level?
> 
> bits_per_level is in the private struct arm_lpae_io_pgtable, while
> drivers can only access struct io_pgtable_cfg. Are you suggesting
> to move bits_per_level out of the private struct arm_lpae_io_pgtable
> to the public struct io_pgtable_cfg?
> 
> Or am I missing another bits_per_level?

Bleh, apologies, I always confuse myself trying to remember the fiddly 
design of io-pgtable data. However, I think this then ends up proving 
the opposite point - the number of pages per table only happens to be a 
fixed constant for certain formats like LPAE, but does not necessarily 
generalise. For instance for a single v7s config it would be 1024 or 256 
or 16 depending on what has actually been unmapped.

The mechanism as proposed implicitly assumes LPAE format, so I still 
think we're better off making that assumption explicit. And at that 
point arm-smmu-v3 can then freely admit it already knows the number is 
simply 1/8th of the domain page size.

Thanks,
Robin.


* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-29 15:37       ` Robin Murphy
@ 2023-08-29 20:15         ` Nicolin Chen
  2023-08-29 21:25           ` Robin Murphy
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-29 20:15 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On Tue, Aug 29, 2023 at 04:37:00PM +0100, Robin Murphy wrote:

> On 2023-08-22 17:42, Nicolin Chen wrote:
> > On Tue, Aug 22, 2023 at 10:19:21AM +0100, Robin Murphy wrote:
> > 
> > > >    out_free_data:
> > > > @@ -1071,6 +1073,7 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
> > > >                                          ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
> > > >        if (cfg->coherent_walk)
> > > >                cfg->arm_mali_lpae_cfg.transtab |= ARM_MALI_LPAE_TTBR_SHARE_OUTER;
> > > > +     cfg->nents_per_pgtable = 1 << data->bits_per_level;
> > > 
> > > The result of this highly complex and expensive calculation is clearly
> > > redundant with the existing bits_per_level field, so why do we need to
> > > waste space storing when the driver could simply use bits_per_level?
> > 
> > bits_per_level is in the private struct arm_lpae_io_pgtable, while
> > drivers can only access struct io_pgtable_cfg. Are you suggesting
> > to move bits_per_level out of the private struct arm_lpae_io_pgtable
> > to the public struct io_pgtable_cfg?
> > 
> > Or am I missing another bits_per_level?
> 
> Bleh, apologies, I always confuse myself trying to remember the fiddly
> design of io-pgtable data. However, I think this then ends up proving
> the opposite point - the number of pages per table only happens to be a
> fixed constant for certain formats like LPAE, but does not necessarily
> generalise. For instance for a single v7s config it would be 1024 or 256
> or 16 depending on what has actually been unmapped.
> 
> The mechanism as proposed implicitly assumes LPAE format, so I still
> think we're better off making that assumption explicit. And at that
> point arm-smmu-v3 can then freely admit it already knows the number is
> simply 1/8th of the domain page size.

Hmm, I am not getting that "1/8th" part, would you mind elaborating?

Also, what we need is actually an arbitrary number for max_tlbi_ops.
And I think it could be irrelevant to the page size, i.e. either a
4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
because what eventually impacts the latency is the number of loops
of building/issuing commands.

So, combining your narrative above that nents_per_pgtable isn't so
general as we have in the tlbflush for MMU, perhaps we could just
decouple max_tlbi_ops from the pgtable and pgsize, instead define
something like this in the SMMUv3 driver:
	/*
	 * A request for a large number of TLBI commands could result in a big
	 * overhead and latency on SMMUs without ARM_SMMU_FEAT_RANGE_INV. Set
	 * a threshold to the number, so the driver would switch to one single
	 * full-range command.
	 */
	#define MAX_TLBI_OPS 8192

Any thought?

Thanks
Nicolin


* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-29 20:15         ` Nicolin Chen
@ 2023-08-29 21:25           ` Robin Murphy
  2023-08-29 22:15             ` Nicolin Chen
  2024-01-20 19:59             ` Nicolin Chen
  0 siblings, 2 replies; 34+ messages in thread
From: Robin Murphy @ 2023-08-29 21:25 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On 2023-08-29 21:15, Nicolin Chen wrote:
> On Tue, Aug 29, 2023 at 04:37:00PM +0100, Robin Murphy wrote:
> 
>> On 2023-08-22 17:42, Nicolin Chen wrote:
>>> On Tue, Aug 22, 2023 at 10:19:21AM +0100, Robin Murphy wrote:
>>>
>>>>>     out_free_data:
>>>>> @@ -1071,6 +1073,7 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
>>>>>                                           ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
>>>>>         if (cfg->coherent_walk)
>>>>>                 cfg->arm_mali_lpae_cfg.transtab |= ARM_MALI_LPAE_TTBR_SHARE_OUTER;
>>>>> +     cfg->nents_per_pgtable = 1 << data->bits_per_level;
>>>>
>>>> The result of this highly complex and expensive calculation is clearly
>>>> redundant with the existing bits_per_level field, so why do we need to
>>>> waste space storing when the driver could simply use bits_per_level?
>>>
>>> bits_per_level is in the private struct arm_lpae_io_pgtable, while
>>> drivers can only access struct io_pgtable_cfg. Are you suggesting
>>> to move bits_per_level out of the private struct arm_lpae_io_pgtable
>>> to the public struct io_pgtable_cfg?
>>>
>>> Or am I missing another bits_per_level?
>>
>> Bleh, apologies, I always confuse myself trying to remember the fiddly
>> design of io-pgtable data. However, I think this then ends up proving
>> the opposite point - the number of pages per table only happens to be a
>> fixed constant for certain formats like LPAE, but does not necessarily
>> generalise. For instance for a single v7s config it would be 1024 or 256
>> or 16 depending on what has actually been unmapped.
>>
>> The mechanism as proposed implicitly assumes LPAE format, so I still
>> think we're better off making that assumption explicit. And at that
>> point arm-smmu-v3 can then freely admit it already knows the number is
>> simply 1/8th of the domain page size.
> 
> Hmm, I am not getting that "1/8th" part, would you mind elaborating?

If we know the format is LPAE, then we already know that nearly all 
pagetable levels are one full page, and the PTEs are 64 bits long. No 
magic data conduit required.
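
To spell out the arithmetic (an illustrative sketch, not code from the
series - the variable names here are made up): every LPAE descriptor is 8
bytes and every full table is one granule, so the driver can derive the
entry count from the domain's leaf page size on the spot:

	/* 4K granule -> 512 entries, 16K -> 2048, 64K -> 8192 */
	unsigned long granule = 1UL << __ffs(smmu_domain->domain.pgsize_bitmap);
	unsigned long nents_per_pgtable = granule / sizeof(u64);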

> Also, what we need is actually an arbitrary number for max_tlbi_ops.
> And I think it could be irrelevant to the page size, i.e. either a
> 4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
> because what eventually impacts the latency is the number of loops
> of building/issuing commands.

Although there is perhaps a counter-argument for selective invalidation, 
that if you're using 64K pages to improve TLB hit rates vs. 4K, then you 
have more to lose from overinvalidation, so maybe a single threshold 
isn't so appropriate for both.

Yes, ultimately it all comes down to picking an arbitrary number, but 
the challenge is that we want to pick a *good* one, and ideally have 
some reasoning behind it. As Will suggested, copying what the mm layer 
does gives us an easy line of reasoning, even if it's just in the form 
of passing the buck. And that's actually quite attractive, since 
otherwise we'd then have to get into the question of what really is the 
latency of building and issuing commands, since that clearly depends on 
how fast the CPU is, and how fast the SMMU is, and how busy the SMMU is, 
and how large the command queue is, and how many other CPUs are also 
contending for the command queue... and very quickly it becomes hard to 
believe that any simple constant can be good for all possible systems.
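
For reference, the arm64 heuristic in question boils down to roughly this
shape (a simplified paraphrase of arch/arm64/include/asm/tlbflush.h, not
the exact source, and ignoring the range-TLBI path):

	#define MAX_TLBI_OPS	PTRS_PER_PTE

	/* ...inside __flush_tlb_range(): bail out to a full ASID flush */
	if ((end - start) >= (MAX_TLBI_OPS * stride)) {
		flush_tlb_mm(vma->vm_mm);
		return;
	}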

> So, combining your narrative above that nents_per_pgtable isn't so
> general as we have in the tlbflush for MMU,

FWIW I meant it doesn't generalise well enough to be a common io-pgtable 
interface; I have no issue with it forming the basis of an 
SMMUv3-specific heuristic when it *is* a relevant concept to all the 
pagetable formats SMMUv3 can possibly support.

Thanks,
Robin.


* Re: [PATCH 2/3] iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper
  2023-08-22 17:03     ` Nicolin Chen
@ 2023-08-29 21:54       ` Robin Murphy
  2023-08-29 23:03         ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2023-08-29 21:54 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On 2023-08-22 18:03, Nicolin Chen wrote:
> On Tue, Aug 22, 2023 at 10:40:18AM +0100, Robin Murphy wrote:
>   
>> On 2023-08-22 09:45, Nicolin Chen wrote:
>>> Move the part of per-asid or per-vmid invalidation command issuing into a
>>> new helper function, which will be used in the following change.
>>
>> Why? This achieves nothing except make the code harder to follow and
>> disconnect the rather important comment even further from the code it is
> 
> We need the same if-else routine to issue a per-asid or per-vmid
> TLBI command. If making a copy of this same routine feels better
> to you, yea, I can change that.
> 
>> significant to. It's not like we need a specific prototype to take a
>> function pointer from, it's just another internal call - see
>> arm_smmu_flush_iotlb_all() for instance. We know the cookie is an
>> arm_smmu_domain pointer because we put it there, and converting it back
>> from a void pointer is exactly the same *at* the function call boundary
>> as immediately afterwards.
> 
> Hmm, I am not quite following this. What do you suggest here?

Oh, this is becoming quite the lesson in not reviewing patches in a hurry :(

Apparently I managed to misread the diff and the horribly subtle 
difference between "arm_smmu_tlb_inv_domain" and 
"arm_smmu_atc_inv_domain", and think that arm_smmu_tlb_inv_context() was 
already just dealing with the TLBI command and you were moving the 
entire body into the new helper. Sorry about that.

Still, the part about the comment remains true, and I think it goes to 
show what a thoroughly horrible naming scheme it is to have "tlb_inv" 
denote a function responsible for TLBI commands and "atc_inv" denote a 
function responsible for ATC commands and "tlb_inv" denote a function 
responsible for both TLBI and ATC commands...

Thanks,
Robin.

> 
> Thanks
> Nic
> 
>>> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
>>> ---
>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 +++++++++++++--------
>>>    1 file changed, 15 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> index 9b0dc3505601..d6c647e1eb01 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> @@ -1854,12 +1854,24 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain, int ssid,
>>>        return arm_smmu_cmdq_batch_submit(smmu_domain->smmu, &cmds);
>>>    }
>>>
>>> +static void arm_smmu_tlb_inv_domain(struct arm_smmu_domain *smmu_domain)
>>> +{
>>> +     struct arm_smmu_device *smmu = smmu_domain->smmu;
>>> +     struct arm_smmu_cmdq_ent cmd;
>>> +
>>> +     if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
>>> +             arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
>>> +     } else {
>>> +             cmd.opcode      = CMDQ_OP_TLBI_S12_VMALL;
>>> +             cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
>>> +             arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>>> +     }
>>> +}
>>> +
>>>    /* IO_PGTABLE API */
>>>    static void arm_smmu_tlb_inv_context(void *cookie)
>>>    {
>>>        struct arm_smmu_domain *smmu_domain = cookie;
>>> -     struct arm_smmu_device *smmu = smmu_domain->smmu;
>>> -     struct arm_smmu_cmdq_ent cmd;
>>>
>>>        /*
>>>         * NOTE: when io-pgtable is in non-strict mode, we may get here with
>>> @@ -1868,13 +1880,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
>>>         * insertion to guarantee those are observed before the TLBI. Do be
>>>         * careful, 007.
>>>         */
>>> -     if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
>>> -             arm_smmu_tlb_inv_asid(smmu, smmu_domain->s1_cfg.cd.asid);
>>> -     } else {
>>> -             cmd.opcode      = CMDQ_OP_TLBI_S12_VMALL;
>>> -             cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
>>> -             arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>>> -     }
>>> +     arm_smmu_tlb_inv_domain(smmu_domain);
>>>        arm_smmu_atc_inv_domain(smmu_domain, 0, 0, 0);
>>>    }
>>>


* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-29 21:25           ` Robin Murphy
@ 2023-08-29 22:15             ` Nicolin Chen
  2023-08-30 21:49               ` Will Deacon
  2024-01-20 19:59             ` Nicolin Chen
  1 sibling, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-29 22:15 UTC (permalink / raw)
  To: will, Robin Murphy
  Cc: jgg, joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

On Tue, Aug 29, 2023 at 10:25:10PM +0100, Robin Murphy wrote:

> > > Bleh, apologies, I always confuse myself trying to remember the fiddly
> > > design of io-pgtable data. However, I think this then ends up proving
> > > the opposite point - the number of pages per table only happens to be a
> > > fixed constant for certain formats like LPAE, but does not necessarily
> > > generalise. For instance for a single v7s config it would be 1024 or 256
> > > or 16 depending on what has actually been unmapped.
> > > 
> > > The mechanism as proposed implicitly assumes LPAE format, so I still
> > > think we're better off making that assumption explicit. And at that
> > > point arm-smmu-v3 can then freely admit it already knows the number is
> > > simply 1/8th of the domain page size.
> > 
> > Hmm, I am not getting that "1/8th" part, would you mind elaborating?
> 
> If we know the format is LPAE, then we already know that nearly all
> pagetable levels are one full page, and the PTEs are 64 bits long. No
> magic data conduit required.

Oh, I see!

> > Also, what we need is actually an arbitrary number for max_tlbi_ops.
> > And I think it could be irrelevant to the page size, i.e. either a
> > 4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
> > because what eventually impacts the latency is the number of loops
> > of building/issuing commands.
> 
> Although there is perhaps a counter-argument for selective invalidation,
> that if you're using 64K pages to improve TLB hit rates vs. 4K, then you
> have more to lose from overinvalidation, so maybe a single threshold
> isn't so appropriate for both.
> 
> Yes, ultimately it all comes down to picking an arbitrary number, but
> the challenge is that we want to pick a *good* one, and ideally have
> some reasoning behind it. As Will suggested, copying what the mm layer
> does gives us an easy line of reasoning, even if it's just in the form
> of passing the buck. And that's actually quite attractive, since
> otherwise we'd then have to get into the question of what really is the
> latency of building and issuing commands, since that clearly depends on
> how fast the CPU is, and how fast the SMMU is, and how busy the SMMU is,
> and how large the command queue is, and how many other CPUs are also
> contending for the command queue... and very quickly it becomes hard to
> believe that any simple constant can be good for all possible systems.

Yea, I had trouble deciding the number in the first place, so the
previous solution ended up with a sysfs node. I do agree that copying
the mm layer's solution gives a strong justification for picking an
arbitrary number. My concern here is whether it would end up triggering
a full address-space invalidation overly often.

Meanwhile, re-reading Will's commit log:
    arm64: tlbi: Set MAX_TLBI_OPS to PTRS_PER_PTE

    In order to reduce the possibility of soft lock-ups, we bound the
    maximum number of TLBI operations performed by a single call to
    flush_tlb_range() to an arbitrary constant of 1024.

    Whilst this does the job of avoiding lock-ups, we can actually be a bit
    smarter by defining this as PTRS_PER_PTE. Due to the structure of our
    page tables, using PTRS_PER_PTE means that an outer loop calling
    flush_tlb_range() for entire table entries will end up performing just a
    single TLBI operation for each entry. As an example, mremap()ing a 1GB
    range mapped using 4k pages now requires only 512 TLBI operations when
    moving the page tables as opposed to 262144 operations (512*512) when
    using the current threshold of 1024.

I found that I am actually not quite getting the calculation at the
end for the comparison between 512 and 262144.

For a 4K pgsize setup, MAX_TLBI_OPS is set to 512, calculated from
4096 / 8. Then, any VA range >= 2MB will trigger a flush_tlb_all().
By setting the threshold to 1024, the 2MB size bumps up to 4MB, i.e.
the condition becomes range >= 4MB.

So, it seems to me that requesting a 1GB invalidation will trigger
a flush_tlb_all() in either case of having a 2MB or a 4MB threshold?

I can see that 262144 is the number of 4K pages in 1GB, so there would
be 262144 per-page invalidation operations if there were no threshold
to replace them with a full address-space invalidation. Yet, that
wasn't the case, since we had a 4MB threshold with an arbitrary 1024
for MAX_TLBI_OPS?

> > So, combining your narrative above that nents_per_pgtable isn't so
> > general as we have in the tlbflush for MMU,
> 
> FWIW I meant it doesn't generalise well enough to be a common io-pgtable
> interface; I have no issue with it forming the basis of an
> SMMUv3-specific heuristic when it *is* a relevant concept to all the
> pagetable formats SMMUv3 can possibly support.

OK.

Thanks
Nicolin


* Re: [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range()
  2023-08-22 23:04       ` Nicolin Chen
@ 2023-08-29 22:40         ` Robin Murphy
  2023-08-29 23:14           ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2023-08-29 22:40 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On 2023-08-23 00:04, Nicolin Chen wrote:
> Hi Robin,
> 
> On Tue, Aug 22, 2023 at 09:32:26AM -0700, Nicolin Chen wrote:
>> On Tue, Aug 22, 2023 at 10:30:35AM +0100, Robin Murphy wrote:
>>
>>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> index d6c647e1eb01..3f0db30932bd 100644
>>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>>> @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
>>>>        if (!size)
>>>>                return;
>>>>
>>>> -     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
>>>> +     if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
>>>> +             /*
>>>> +              * When the size reaches a threshold, replace per-granule TLBI
>>>> +              * commands with one single per-asid or per-vmid TLBI command.
>>>> +              */
>>>> +             if (size >= granule * smmu_domain->max_tlbi_ops)
>>>> +                     return arm_smmu_tlb_inv_domain(smmu_domain);
>>>
>>> This looks like it's at the wrong level - we should have figured this
>>> out before we got as far as low-level command-building. I'd have thought
>>> it would be a case of short-circuiting directly from
>>> arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
>>
>> OK, I could do that. We would have copies of this same routine
>> though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
>> cases only, so this function feels convenient to me.
> 
> I was trying to say that we would need the same piece in both
> arm_smmu_tlb_inv_range_domain() and arm_smmu_tlb_inv_range_asid(),
> though the latter one only needs to call arm_smmu_tlb_inv_asid().

It's not like arm_smmu_tlb_inv_range_asid() doesn't already massively 
overlap with arm_smmu_tlb_inv_range_domain() anyway, so a little further 
duplication hardly seems like it would hurt. Checking 
ARM_SMMU_FEAT_RANGE_INV should be cheap (otherwise we'd really want to 
split __arm_smmu_tlb_inv_range() into separate RIL vs. non-RIL versions 
to avoid having it in the loop), and it makes the intent clear. What I 
just really don't like is a flow where we construct a specific command, 
then call the low-level function to issue it, only that function then 
actually jumps back out into another high-level function which 
constructs a *different* command. This code is already a maze of twisty 
little passages...

> Also, arm_smmu_tlb_inv_context() does a full range ATC invalidation
> instead of a given range like what arm_smmu_tlb_inv_range_domain()
> currently does. So, it might be a bit overkill.
> 
> Combining all your comments, we'd have something like this:

TBH I'd be inclined to refactor a bit harder, maybe break out some 
VMID-based helpers for orthogonality, and aim for a flow like:

	if (over threshold)
		tlb_inv_domain()
	else if (stage 1)
		tlb_inv_range_asid()
	else
		tlb_inv_range_vmid()
	atc_inv_range()

or possibly if you prefer:

	if (stage 1) {
		if (over threshold)
			tlb_inv_asid()
		else
			tlb_inv_range_asid()
	} else {
		if (over threshold)
			tlb_inv_vmid()
		else
			tlb_inv_range_vmid()
	}
	atc_inv_range()

where the latter maybe trades more verbosity for less duplication 
overall - I'd probably have to try both to see which looks nicer in the 
end. And obviously if there's any chance of inventing a clear and 
consistent naming scheme in the process, that would be lovely.
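
To make that concrete, here is a minimal standalone sketch of just the
decision logic in the second flow (the helpers are stand-ins named after
the sketch above, each printing the command it would issue; none of this
is the real driver code):

#include <stdbool.h>
#include <stdio.h>

static void tlb_inv_asid(void)       { puts("TLBI NH_ASID"); }
static void tlb_inv_range_asid(void) { puts("TLBI NH_VA/EL2_VA per granule"); }
static void tlb_inv_vmid(void)       { puts("TLBI S2_VMALL"); }
static void tlb_inv_range_vmid(void) { puts("TLBI S2_IPA per granule"); }
static void atc_inv_range(void)      { puts("ATC invalidation for the range"); }

/* Models the proposed flow for the range-invalidation entry point only */
static void tlb_inv_range_domain(bool stage1, bool over_threshold)
{
	if (stage1) {
		if (over_threshold)
			tlb_inv_asid();
		else
			tlb_inv_range_asid();
	} else {
		if (over_threshold)
			tlb_inv_vmid();
		else
			tlb_inv_range_vmid();
	}
	atc_inv_range();
}

int main(void)
{
	tlb_inv_range_domain(true, true);	/* stage 1, over the threshold */
	tlb_inv_range_domain(false, false);	/* stage 2, under the threshold */
	return 0;
}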

Thanks,
Robin.

> -------------------------------------------------------------------
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 7614739ea2c1..2967a6634c7c 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1937,12 +1937,22 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
>   					  size_t granule, bool leaf,
>   					  struct arm_smmu_domain *smmu_domain)
>   {
> +	struct io_pgtable_cfg *cfg =
> +		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
>   	struct arm_smmu_cmdq_ent cmd = {
>   		.tlbi = {
>   			.leaf	= leaf,
>   		},
>   	};
>   
> +	/*
> +	 * If the given size is too large that would end up with too many TLBI
> +	 * commands in CMDQ, short circuit directly to a full invalidation
> +	 */
> +	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
> +	    size >= granule * (1UL << cfg->bits_per_level))
> +		return arm_smmu_tlb_inv_context(smmu_domain);
> +
>   	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
>   		cmd.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
>   				  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
> @@ -1964,6 +1974,8 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
>   				 size_t granule, bool leaf,
>   				 struct arm_smmu_domain *smmu_domain)
>   {
> +	struct io_pgtable_cfg *cfg =
> +		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
>   	struct arm_smmu_cmdq_ent cmd = {
>   		.opcode	= smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
>   			  CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
> @@ -1973,6 +1985,14 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
>   		},
>   	};
>   
> +	/*
> +	 * If the given size is too large that would end up with too many TLBI
> +	 * commands in CMDQ, short circuit directly to a full invalidation
> +	 */
> +	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
> +	    size >= granule * (1UL << cfg->bits_per_level))
> +		return arm_smmu_tlb_inv_asid(smmu_domain->smmu, asid);
> +
>   	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
>   }
>   
> -------------------------------------------------------------------
> 
> You're sure that you prefer this, right?
> 
> Thanks
> Nicolin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/3] iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper
  2023-08-29 21:54       ` Robin Murphy
@ 2023-08-29 23:03         ` Nicolin Chen
  0 siblings, 0 replies; 34+ messages in thread
From: Nicolin Chen @ 2023-08-29 23:03 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On Tue, Aug 29, 2023 at 10:54:00PM +0100, Robin Murphy wrote:
> On 2023-08-22 18:03, Nicolin Chen wrote:
> > On Tue, Aug 22, 2023 at 10:40:18AM +0100, Robin Murphy wrote:
> > 
> > > On 2023-08-22 09:45, Nicolin Chen wrote:
> > > > Move the part of per-asid or per-vmid invalidation command issuing into a
> > > > new helper function, which will be used in the following change.
> > > 
> > > Why? This achieves nothing except make the code harder to follow and
> > > disconnect the rather important comment even further from the code it is
> > 
> > We need the same if-else routine to issue a per-asid or per-vmid
> > TLBI command. If making a copy of this same routine feels better
> > to you, yea, I can change that.
> > 
> > > significant to. It's not like we need a specific prototype to take a
> > > function pointer from, it's just another internal call - see
> > > arm_smmu_flush_iotlb_all() for instance. We know the cookie is an
> > > arm_smmu_domain pointer because we put it there, and converting it back
> > > from a void pointer is exactly the same *at* the function call boundary
> > > as immediately afterwards.
> > 
> > Hmm, I am not quite following this. What do you suggest here?
> 
> Oh, this is becoming quite the lesson in not reviewing patches in a hurry :(
> 
> Apparently I managed to misread the diff and the horribly subtle
> difference between "arm_smmu_tlb_inv_domain" and
> "arm_smmu_atc_inv_domain", and think that arm_smmu_tlb_inv_context() was
> already just dealing with the TLBI command and you were moving the
> entire body into the new helper. Sorry about that.
> 
> Still, the part about the comment remains true, and I think it goes to
> show what a thoroughly horrible naming scheme it is to have "tlb_inv"
> denote a function responsible for TLBI commands and "atc_inv" denote a
> function responsible for ATC commands and "tlb_inv" denote a function
> responsible for both TLBI and ATC commands...

Well, "atc_inv" is quite clear I think. But the"tlb_inv" might
not be, as you pointed out.

So, we have:
 tlb_inv_range_asid: tlbi only (NH_VA/EL2_VA) // used by SVA too
 tlb_inv_range_domain:
 	if (S1)
		tlb_inv_range_asid();	// NH_VA/EL2_VA
	else
		tlbi only (S2_IPA);
	atc();
 tlb_inv_asid: tlbi (NH_ASID)	// only used by tlb_inv_context()
 tlb_inv_context:
 	if (S1)
		tlb_inv_asid();	// NH_ASID
	else
		tlbi only (S2_VMALL);
	atc();

Then, what this patch wants is another non-atc variant:
 tlb_inv_asid: tlbi (NH_ASID)	// only used by tlb_inv_domain()
 tlb_inv_domain: 		// new
 	if (S1)
		tlb_inv_asid();	// NH_ASID
	else
		tlbi only (S2_VMALL);
 tlb_inv_context:
 	tlb_inv_domain();
	atc();

The problem with this is that it conflicts with the naming of the
other helper, tlb_inv_range_domain(), which does an atc().

Perhaps we could rename to the following patterns?
 tlb_inv_range_asid:	// used by SVA too
 tlb_inv_range_domain:
 	if (S1)
		return tlb_inv_range_asid();
	else
		tlbi only (S2_IPA)
 tlb_inv_range_domain_with_atc:
 	tlb_inv_range_domain();
	atc();

 # remove tlb_inv_asid() since it doesn't help much
 tlb_inv_domain:
 	if (S1)
		tlbi only (NH_ASID)
	else
		tlbi only (S2_VMALL)
 tlb_inv_domain_with_atc:
 	tlb_inv_domain();
	atc();

 tlb_inv_context:
 	struct arm_smmu_domain *smmu_domain =
		(struct arm_smmu_domain *)cookie;
	tlb_inv_domain_with_atc(smmu_domain);

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range()
  2023-08-29 22:40         ` Robin Murphy
@ 2023-08-29 23:14           ` Nicolin Chen
  0 siblings, 0 replies; 34+ messages in thread
From: Nicolin Chen @ 2023-08-29 23:14 UTC (permalink / raw)
  To: Robin Murphy
  Cc: will, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On Tue, Aug 29, 2023 at 11:40:29PM +0100, Robin Murphy wrote:

> > > > > -     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > > > > +     if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> > > > > +             /*
> > > > > +              * When the size reaches a threshold, replace per-granule TLBI
> > > > > +              * commands with one single per-asid or per-vmid TLBI command.
> > > > > +              */
> > > > > +             if (size >= granule * smmu_domain->max_tlbi_ops)
> > > > > +                     return arm_smmu_tlb_inv_domain(smmu_domain);
> > > > 
> > > > This looks like it's at the wrong level - we should have figured this
> > > > out before we got as far as low-level command-building. I'd have thought
> > > > it would be a case of short-circuiting directly from
> > > > arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
> > > 
> > > OK, I could do that. We would have copies of this same routine
> > > though. Also, the shortcut applies to !ARM_SMMU_FEAT_RANGE_INV
> > > cases only, so this function feels convenient to me.
> > 
> > I was trying to say that we would need the same piece in both
> > arm_smmu_tlb_inv_range_domain() and arm_smmu_tlb_inv_range_asid(),
> > though the latter one only needs to call arm_smmu_tlb_inv_asid().
> 
> It's not like arm_smmu_tlb_inv_range_asid() doesn't already massively
> overlap with arm_smmu_tlb_inv_range_domain() anyway, so a little further
> duplication hardly seems like it would hurt. Checking
> ARM_SMMU_FEAT_RANGE_INV should be cheap (otherwise we'd really want to
> split __arm_smmu_tlb_inv_range() into separate RIL vs. non-RIL versions
> to avoid having it in the loop), and it makes the intent clear. What I
> just really don't like is a flow where we construct a specific command,
> then call the low-level function to issue it, only that function then
> actually jumps back out into another high-level function which
> constructs a *different* command. This code is already a maze of twisty
> little passages...

OK. That sounds convincing to me. We can do it at the higher level.

> > Also, arm_smmu_tlb_inv_context() does a full range ATC invalidation
> > instead of a given range like what arm_smmu_tlb_inv_range_domain()
> > currently does. So, it might be a bit overkill.
> > 
> > Combining all your comments, we'd have something like this:
> 
> TBH I'd be inclined to refactor a bit harder, maybe break out some
> VMID-based helpers for orthogonality, and aim for a flow like:
> 
>        if (over threshold)
>                tlb_inv_domain()
>        else if (stage 1)
>                tlb_inv_range_asid()
>        else
>                tlb_inv_range_vmid()
>        atc_inv_range()
> 
> or possibly if you prefer:
> 
>        if (stage 1) {
>                if (over threshold)
>                        tlb_inv_asid()
>                else
>                        tlb_inv_range_asid()
>        } else {
>                if (over threshold)
>                        tlb_inv_vmid()
>                else
>                        tlb_inv_range_vmid()
>        }
>        atc_inv_range()
> 
> where the latter maybe trades more verbosity for less duplication
> overall - I'd probably have to try both to see which looks nicer in the
> end. And obviously if there's any chance of inventing a clear and
> consistent naming scheme in the process, that would be lovely.

Oh, I just replied to you in another email, asking about a refactor,
though that didn't include the over-threshold part yet.

Anyway, I think I got your point now and will bear in mind that
we want something clean overall with less duplication among the
"inv" functions.

Let me try some rewriting. And we can see how it looks after.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-29 22:15             ` Nicolin Chen
@ 2023-08-30 21:49               ` Will Deacon
  2023-08-31 17:39                 ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Will Deacon @ 2023-08-30 21:49 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Robin Murphy, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On Tue, Aug 29, 2023 at 03:15:52PM -0700, Nicolin Chen wrote:
> Meanwhile, by re-looking at Will's commit log:
>     arm64: tlbi: Set MAX_TLBI_OPS to PTRS_PER_PTE
> 
>     In order to reduce the possibility of soft lock-ups, we bound the
>     maximum number of TLBI operations performed by a single call to
>     flush_tlb_range() to an arbitrary constant of 1024.
> 
>     Whilst this does the job of avoiding lock-ups, we can actually be a bit
>     smarter by defining this as PTRS_PER_PTE. Due to the structure of our
>     page tables, using PTRS_PER_PTE means that an outer loop calling
>     flush_tlb_range() for entire table entries will end up performing just a
>     single TLBI operation for each entry. As an example, mremap()ing a 1GB
>     range mapped using 4k pages now requires only 512 TLBI operations when
>     moving the page tables as opposed to 262144 operations (512*512) when
>     using the current threshold of 1024.
> 
> I found that I am actually not quite getting the calculation at the
> end for the comparison between 512 and 262144.
> 
> For a 4K pgsize setup, MAX_TLBI_OPS is set to 512, calculated from
> 4096 / 8. Then, any VA range >= 2MB will trigger a flush_tlb_all().
> By setting the threshold to 1024, the 2MB size bumps up to 4MB, i.e.
> the condition becomes range >= 4MB.
> 
> So, it seems to me that requesting a 1GB invalidation will trigger
> a flush_tlb_all() in either case of having a 2MB or a 4MB threshold?
> 
> I can get that the 262144 is the number of pages in a 1GB size, so
> the number of per-page invalidations will be 262144 operations if
> there was no threshold to replace with a full-as invalidation. Yet,
> that wasn't the case since we had a 4MB threshold with an arbitrary
> 1024 for MAX_TLBI_OPS?

I think this is because you can't always batch up the entire range as
you'd like due to things like locking concerns. For example,
move_page_tables() can end up invalidating 2MiB at a time, which is
too low to trigger the old threshold and so you end up doing every single
pte individually.

Will

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-30 21:49               ` Will Deacon
@ 2023-08-31 17:39                 ` Nicolin Chen
  2023-09-01  0:08                   ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-08-31 17:39 UTC (permalink / raw)
  To: Will Deacon
  Cc: Robin Murphy, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On Wed, Aug 30, 2023 at 10:49:59PM +0100, Will Deacon wrote:
 
> On Tue, Aug 29, 2023 at 03:15:52PM -0700, Nicolin Chen wrote:
> > Meanwhile, by re-looking at Will's commit log:
> >     arm64: tlbi: Set MAX_TLBI_OPS to PTRS_PER_PTE
> >
> >     In order to reduce the possibility of soft lock-ups, we bound the
> >     maximum number of TLBI operations performed by a single call to
> >     flush_tlb_range() to an arbitrary constant of 1024.
> >
> >     Whilst this does the job of avoiding lock-ups, we can actually be a bit
> >     smarter by defining this as PTRS_PER_PTE. Due to the structure of our
> >     page tables, using PTRS_PER_PTE means that an outer loop calling
> >     flush_tlb_range() for entire table entries will end up performing just a
> >     single TLBI operation for each entry. As an example, mremap()ing a 1GB
> >     range mapped using 4k pages now requires only 512 TLBI operations when
> >     moving the page tables as opposed to 262144 operations (512*512) when
> >     using the current threshold of 1024.
> >
> > I found that I am actually not quite getting the calculation at the
> > end for the comparison between 512 and 262144.
> >
> > For a 4K pgsize setup, MAX_TLBI_OPS is set to 512, calculated from
> > 4096 / 8. Then, any VA range >= 2MB will trigger a flush_tlb_all().
> > By setting the threshold to 1024, the 2MB size bumps up to 4MB, i.e.
> > the condition becomes range >= 4MB.
> >
> > So, it seems to me that requesting a 1GB invalidation will trigger
> > a flush_tlb_all() in either case of having a 2MB or a 4MB threshold?
> >
> > I can get that the 262144 is the number of pages in a 1GB size, so
> > the number of per-page invalidations will be 262144 operations if
> > there was no threshold to replace with a full-as invalidation. Yet,
> > that wasn't the case since we had a 4MB threshold with an arbitrary
> > 1024 for MAX_TLBI_OPS?
> 
> I think this is because you can't always batch up the entire range as
> you'd like due to things like locking concerns. For example,
> move_page_tables() can end up invalidating 2MiB at a time, which is
> too low to trigger the old threshold and so you end up doing every single
> pte individually.

Thanks for elaborating! So, I see now that it was about the worst
case where 1GB breaks down into 512 pieces of 2MB range invalidation
ops, i.e. 512 flush_tlb_all ops vs. 262144 flush_tlb_range ops.
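
i.e. the arithmetic behind those two numbers, as a trivial standalone
sketch of my understanding above:

#include <stdio.h>

int main(void)
{
	unsigned long total = 1UL << 30;	/* the 1GB mremap() example */
	unsigned long chunk = 2UL << 20;	/* move_page_tables() invalidates 2MiB at a time */
	unsigned long page  = 4UL << 10;	/* 4K pgsize */

	unsigned long chunks = total / chunk;			/* 512 full flushes, one per chunk */
	unsigned long ptes   = chunks * (chunk / page);		/* 262144 per-pte TLBIs otherwise */

	printf("%lu chunks, %lu per-pte TLBIs\n", chunks, ptes);
	return 0;
}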

Though I have not dug into it enough, I assume that this worst case
could happen to SVA too, since the IOTLB invalidation comes from the
MMU code. But the same worst case might not happen to the non-SVA
pathway, i.e. TLBI ops for the IO page table don't really benefit
from this?

With that question still open, I do get Robin's remark that it wouldn't
be easy to decide on the arbitrary number, so we could just take the
worst case from the SVA pathway as the common threshold.

Then, since SVA uses the CPU page table, perhaps we should highlight
that SMMU sets the threshold by directly reusing the MAX_TLBI_OPS of
the CPU page table rather than calculating it from the IO page table,
though both of them are in the same format?

Thank you
Nicolin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-31 17:39                 ` Nicolin Chen
@ 2023-09-01  0:08                   ` Nicolin Chen
  2023-09-01 18:02                     ` Robin Murphy
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2023-09-01  0:08 UTC (permalink / raw)
  To: Robin Murphy, Will Deacon
  Cc: jgg, joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

Hi Will/Robin,

On Thu, Aug 31, 2023 at 10:39:15AM -0700, Nicolin Chen wrote:
 
> Though I have not dug enough, I assume that this worst case could
> happen to SVA too, since the IOTLB invalidation is from MMU code.
> But the same worst case might not happen to non-SVA pathway, i.e.
> TLBI ops for IO Page Table doesn't really benefit from this?
> 
> With that being questioned, I got Robin's remarks that it wouldn't
> be easy to decide the arbitrary number, so we could just take the
> worst case from SVA pathway as the common threshold.
> 
> Then, SVA uses the CPU page table, so perhaps we should highlight
> that SMMU sets the threshold directly reusing the MAX_TLBI_OPS of
> CPU page table rather than calculating from IO page table, though
> both of them are in the same format?

Our test team encountered a soft lockup in this path today:
--------------------------------------------------------------------
watchdog: BUG: soft lockup - CPU#244 stuck for 26s!
pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
pc : arm_smmu_cmdq_issue_cmdlist+0x178/0xa50
lr : arm_smmu_cmdq_issue_cmdlist+0x150/0xa50
sp : ffff8000d83ef290
x29: ffff8000d83ef290 x28: 000000003b9aca00 x27: 0000000000000000
x26: ffff8000d83ef3c0 x25: da86c0812194a0e8 x24: 0000000000000000
x23: 0000000000000040 x22: ffff8000d83ef340 x21: ffff0000c63980c0
x20: 0000000000000001 x19: ffff0000c6398080 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: ffff3000b4a3bbb0
x14: ffff3000b4a30888 x13: ffff3000b4a3cf60 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc08120e4d6bc
x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000048cfa
x5 : 0000000000000000 x4 : 0000000000000001 x3 : 000000000000000a
x2 : 0000000080000000 x1 : 0000000000000000 x0 : 0000000000000001
Call trace:
 arm_smmu_cmdq_issue_cmdlist+0x178/0xa50
 __arm_smmu_tlb_inv_range+0x118/0x254
 arm_smmu_tlb_inv_range_asid+0x6c/0x130
 arm_smmu_mm_invalidate_range+0xa0/0xa4
 __mmu_notifier_invalidate_range_end+0x88/0x120
 unmap_vmas+0x194/0x1e0
 unmap_region+0xb4/0x144
 do_mas_align_munmap+0x290/0x490
 do_mas_munmap+0xbc/0x124
 __vm_munmap+0xa8/0x19c
 __arm64_sys_munmap+0x28/0x50
 invoke_syscall+0x78/0x11c
 el0_svc_common.constprop.0+0x58/0x1c0
 do_el0_svc+0x34/0x60
 el0_svc+0x2c/0xd4
 el0t_64_sync_handler+0x114/0x140
 el0t_64_sync+0x1a4/0x1a8
--------------------------------------------------------------------


I think it is the same problem that we fixed in tlbflush.h using
MAX_TLBI_OPS. So, I plan to send a cleaner bug fix (cc stable):
--------------------------------------------------------------------
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index a5a63b1c947e..e3ea7d2a6308 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -186,6 +186,9 @@ static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd)
        }
 }

+/* Copied from arch/arm64/include/asm/tlbflush.h to avoid similar soft lockups */
+#define MAX_TLBI_OPS   (1 << (PAGE_SHIFT - 3))
+
 static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start, unsigned long end)
@@ -201,9 +204,14 @@ static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
         */
        size = end - start;

-       if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM))
-               arm_smmu_tlb_inv_range_asid(start, size, smmu_mn->cd->asid,
-                                           PAGE_SIZE, false, smmu_domain);
+       if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM)) {
+               if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
+                   size >= MAX_TLBI_OPS * PAGE_SIZE)
+                       arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_mn->cd->asid);
+               else
+                       arm_smmu_tlb_inv_range_asid(start, size, smmu_mn->cd->asid,
+                                                   PAGE_SIZE, false, smmu_domain);
+       }
        arm_smmu_atc_inv_domain(smmu_domain, mm->pasid, start, size);
 }
--------------------------------------------------------------------

What do you think about it?

Thanks
Nic

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-09-01  0:08                   ` Nicolin Chen
@ 2023-09-01 18:02                     ` Robin Murphy
  2023-09-01 18:23                       ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2023-09-01 18:02 UTC (permalink / raw)
  To: Nicolin Chen, Will Deacon
  Cc: jgg, joro, jean-philippe, apopple, linux-kernel, linux-arm-kernel,
	iommu

On 2023-09-01 01:08, Nicolin Chen wrote:
> Hi Will/Robin,
> 
> On Thu, Aug 31, 2023 at 10:39:15AM -0700, Nicolin Chen wrote:
>   
>> Though I have not dug enough, I assume that this worst case could
>> happen to SVA too, since the IOTLB invalidation is from MMU code.
>> But the same worst case might not happen to non-SVA pathway, i.e.
>> TLBI ops for IO Page Table doesn't really benefit from this?
>>
>> With that being questioned, I got Robin's remarks that it wouldn't
>> be easy to decide the arbitrary number, so we could just take the
>> worst case from SVA pathway as the common threshold.
>>
>> Then, SVA uses the CPU page table, so perhaps we should highlight
>> that SMMU sets the threshold directly reusing the MAX_TLBI_OPS of
>> CPU page table rather than calculating from IO page table, though
>> both of them are in the same format?
> 
> Our test team encountered a soft lockup in this path today:
> --------------------------------------------------------------------
> watchdog: BUG: soft lockup - CPU#244 stuck for 26s!

That's a lot of TLBIs!

> pstate: 83400009 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
> pc : arm_smmu_cmdq_issue_cmdlist+0x178/0xa50
> lr : arm_smmu_cmdq_issue_cmdlist+0x150/0xa50
> sp : ffff8000d83ef290
> x29: ffff8000d83ef290 x28: 000000003b9aca00 x27: 0000000000000000
> x26: ffff8000d83ef3c0 x25: da86c0812194a0e8 x24: 0000000000000000
> x23: 0000000000000040 x22: ffff8000d83ef340 x21: ffff0000c63980c0
> x20: 0000000000000001 x19: ffff0000c6398080 x18: 0000000000000000
> x17: 0000000000000000 x16: 0000000000000000 x15: ffff3000b4a3bbb0
> x14: ffff3000b4a30888 x13: ffff3000b4a3cf60 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000000000 x9 : ffffc08120e4d6bc
> x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000048cfa
> x5 : 0000000000000000 x4 : 0000000000000001 x3 : 000000000000000a
> x2 : 0000000080000000 x1 : 0000000000000000 x0 : 0000000000000001
> Call trace:
>   arm_smmu_cmdq_issue_cmdlist+0x178/0xa50
>   __arm_smmu_tlb_inv_range+0x118/0x254
>   arm_smmu_tlb_inv_range_asid+0x6c/0x130
>   arm_smmu_mm_invalidate_range+0xa0/0xa4
>   __mmu_notifier_invalidate_range_end+0x88/0x120
>   unmap_vmas+0x194/0x1e0
>   unmap_region+0xb4/0x144
>   do_mas_align_munmap+0x290/0x490
>   do_mas_munmap+0xbc/0x124
>   __vm_munmap+0xa8/0x19c
>   __arm64_sys_munmap+0x28/0x50
>   invoke_syscall+0x78/0x11c
>   el0_svc_common.constprop.0+0x58/0x1c0
>   do_el0_svc+0x34/0x60
>   el0_svc+0x2c/0xd4
>   el0t_64_sync_handler+0x114/0x140
>   el0t_64_sync+0x1a4/0x1a8
> --------------------------------------------------------------------
> 
> 
> I think it is the same problem that we fixed in tlbflush.h using
> MAX_TLBI_OPS. So, I plan to send a cleaner bug fix (cc stable):
> --------------------------------------------------------------------
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> index a5a63b1c947e..e3ea7d2a6308 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
> @@ -186,6 +186,9 @@ static void arm_smmu_free_shared_cd(struct arm_smmu_ctx_desc *cd)
>          }
>   }
> 
> +/* Copied from arch/arm64/include/asm/tlbflush.h to avoid similar soft lockups */
> +#define MAX_TLBI_OPS   (1 << (PAGE_SHIFT - 3))
> +
>   static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
>                                           struct mm_struct *mm,
>                                           unsigned long start, unsigned long end)
> @@ -201,9 +204,14 @@ static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
>           */
>          size = end - start;
> 
> -       if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM))
> -               arm_smmu_tlb_inv_range_asid(start, size, smmu_mn->cd->asid,
> -                                           PAGE_SIZE, false, smmu_domain);
> +       if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM)) {
> +               if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV) &&
> +                   size >= MAX_TLBI_OPS * PAGE_SIZE)
> +                       arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_mn->cd->asid);
> +               else
> +                       arm_smmu_tlb_inv_range_asid(start, size, smmu_mn->cd->asid,
> +                                                   PAGE_SIZE, false, smmu_domain);
> +       }
>          arm_smmu_atc_inv_domain(smmu_domain, mm->pasid, start, size);
>   }
> --------------------------------------------------------------------
> 
> What do you think about it?

Looks reasonable to me - I think it's the right shape to foreshadow the 
bigger refactoring we discussed, and I can't object to using 
PAGE_{SIZE,SHIFT} for the calculation when it's specifically in the 
context of SVA.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-09-01 18:02                     ` Robin Murphy
@ 2023-09-01 18:23                       ` Nicolin Chen
  0 siblings, 0 replies; 34+ messages in thread
From: Nicolin Chen @ 2023-09-01 18:23 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Will Deacon, jgg, joro, jean-philippe, apopple, linux-kernel,
	linux-arm-kernel, iommu

On Fri, Sep 01, 2023 at 07:02:19PM +0100, Robin Murphy wrote:

> > Our test team encountered a soft lockup in this path today:
> > --------------------------------------------------------------------
> > watchdog: BUG: soft lockup - CPU#244 stuck for 26s!
> 
> That's a lot of TLBIs!

Well, imagining the use cases of NVIDIA Grace, I'd expect it will
soon no longer be a surprise to see more stressful cases :-/

> > I think it is the same problem that we fixed in tlbflush.h using
> > MAX_TLBI_OPS. So, I plan to send a cleaner bug fix (cc stable)

> > What do you think about it?
> 
> Looks reasonable to me - I think it's the right shape to foreshadow the
> bigger refactoring we discussed, 

Cool! Thanks for the quick ack. And I could move the MAX_TLBI_OPS
to the SMMU header so a later change in smmu.c can just use it.

> and I can't object to using
> PAGE_{SIZE,SHIFT} for the calculation when it's specifically in the
> context of SVA.

Yea, the current SVA code passes in PAGE_SIZE as the TLBI granule.

Thanks
Nic

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2023-08-29 21:25           ` Robin Murphy
  2023-08-29 22:15             ` Nicolin Chen
@ 2024-01-20 19:59             ` Nicolin Chen
  2024-01-22 13:01               ` Jason Gunthorpe
  1 sibling, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2024-01-20 19:59 UTC (permalink / raw)
  To: will@kernel.org, Robin Murphy
  Cc: Jason Gunthorpe, joro@8bytes.org, jean-philippe@linaro.org,
	Alistair Popple, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

Hi Robin/Will,

On Tue, Aug 29, 2023 at 02:25:10PM -0700, Robin Murphy wrote:
> > Also, what we need is actually an arbitrary number for max_tlbi_ops.
> > And I think it could be irrelevant to the page size, i.e. either a
> > 4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
> > because what eventually impacts the latency is the number of loops
> > of building/issuing commands.
> 
> Although there is perhaps a counter-argument for selective invalidation,
> that if you're using 64K pages to improve TLB hit rates vs. 4K, then you
> have more to lose from overinvalidation, so maybe a single threshold
> isn't so appropriate for both.
> 
> Yes, ultimately it all comes down to picking an arbitrary number, but
> the challenge is that we want to pick a *good* one, and ideally have
> some reasoning behind it. As Will suggested, copying what the mm layer
> does gives us an easy line of reasoning, even if it's just in the form
> of passing the buck. And that's actually quite attractive, since
> otherwise we'd then have to get into the question of what really is the
> latency of building and issuing commands, since that clearly depends on
> how fast the CPU is, and how fast the SMMU is, and how busy the SMMU is,
> and how large the command queue is, and how many other CPUs are also
> contending for the command queue... and very quickly it becomes hard to
> believe that any simple constant can be good for all possible systems.

So, here we have another request to optimize this number further,
even though the merged arbitrary number copied from the MMU side did
fix the soft lockup. The iommu_unmap latency with a 500MB buffer is
still not satisfactory on our test chip, since the threshold for
max_tlbi_ops is now at 512MB for a 64K pgsize (8192 * 64KB).
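
Just to spell out the arithmetic behind that threshold (a trivial
sketch; MAX_TLBI_OPS is the 1 << (PAGE_SHIFT - 3) value copied from
tlbflush.h, and the cutoff is max_tlbi_ops * pgsize):

#include <stdio.h>

int main(void)
{
	unsigned long pgsizes[] = { 4096, 65536 };

	for (int i = 0; i < 2; i++) {
		unsigned long pgsize = pgsizes[i];
		unsigned long max_tlbi_ops = pgsize / 8;	/* 1 << (PAGE_SHIFT - 3) */
		unsigned long cutoff = max_tlbi_ops * pgsize;	/* size that short-circuits to a full inv */

		printf("%luK pgsize: max_tlbi_ops = %lu, cutoff = %lu MB\n",
		       pgsize / 1024, max_tlbi_ops, cutoff / (1024 * 1024));
	}
	return 0;	/* prints 4K -> 512 / 2 MB, 64K -> 8192 / 512 MB */
}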

As Robin remarked, this could really be a case-by-case situation.
So, I wonder if we can rethink adding a configurable threshold whose
default value matches the current setup copied from the MMU side.

If this is acceptable, what would be the preferable way to expose
the configuration: a per-SMMU or a per-IOMMU-group sysfs node? I am
open to any other option too.
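
For instance, a per-SMMU node might look roughly like the sketch below.
This is purely illustrative: the attribute name, the max_tlbi_ops field
and its wiring via drvdata are all assumptions here, not actual driver
code.

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

/* Stand-in for struct arm_smmu_device with a hypothetical tunable field */
struct arm_smmu_device_example {
	struct device *dev;
	unsigned long max_tlbi_ops;
};

static ssize_t max_tlbi_ops_show(struct device *dev,
				 struct device_attribute *attr, char *buf)
{
	struct arm_smmu_device_example *smmu = dev_get_drvdata(dev);

	return sysfs_emit(buf, "%lu\n", smmu->max_tlbi_ops);
}

static ssize_t max_tlbi_ops_store(struct device *dev,
				  struct device_attribute *attr,
				  const char *buf, size_t count)
{
	struct arm_smmu_device_example *smmu = dev_get_drvdata(dev);
	unsigned long val;
	int ret;

	ret = kstrtoul(buf, 0, &val);
	if (ret)
		return ret;
	if (!val)
		return -EINVAL;

	smmu->max_tlbi_ops = val;	/* read by the invalidation path */
	return count;
}
static DEVICE_ATTR_RW(max_tlbi_ops);
/* The attribute would still need to be created against the SMMU device. */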

Also, this would be added to the arm_smmu_inv_range_too_big() in
Jason's patch here:
https://lore.kernel.org/linux-iommu/20240115153152.GA50608@ziepe.ca/

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2024-01-20 19:59             ` Nicolin Chen
@ 2024-01-22 13:01               ` Jason Gunthorpe
  2024-01-22 17:24                 ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-01-22 13:01 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will@kernel.org, Robin Murphy, joro@8bytes.org,
	jean-philippe@linaro.org, Alistair Popple,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

On Sat, Jan 20, 2024 at 11:59:45AM -0800, Nicolin Chen wrote:
> Hi Robin/Will,
> 
> On Tue, Aug 29, 2023 at 02:25:10PM -0700, Robin Murphy wrote:
> > > Also, what we need is actually an arbitrary number for max_tlbi_ops.
> > > And I think it could be irrelevant to the page size, i.e. either a
> > > 4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
> > > because what eventually impacts the latency is the number of loops
> > > of building/issuing commands.
> > 
> > Although there is perhaps a counter-argument for selective invalidation,
> > that if you're using 64K pages to improve TLB hit rates vs. 4K, then you
> > have more to lose from overinvalidation, so maybe a single threshold
> > isn't so appropriate for both.
> > 
> > Yes, ultimately it all comes down to picking an arbitrary number, but
> > the challenge is that we want to pick a *good* one, and ideally have
> > some reasoning behind it. As Will suggested, copying what the mm layer
> > does gives us an easy line of reasoning, even if it's just in the form
> > of passing the buck. And that's actually quite attractive, since
> > otherwise we'd then have to get into the question of what really is the
> > latency of building and issuing commands, since that clearly depends on
> > how fast the CPU is, and how fast the SMMU is, and how busy the SMMU is,
> > and how large the command queue is, and how many other CPUs are also
> > contending for the command queue... and very quickly it becomes hard to
> > believe that any simple constant can be good for all possible systems.
> 
> So, here we have another request to optimize this number further,
> though the merged arbitrary number copied from MMU side could fix
> the soft lockup. The iommu_unmap delay with a 500MB buffer is not
> quite satisfying on our testing chip, since the threshold now for
> max_tlbi_ops is at 512MB for 64K pgsize (8192 * 64KB).
> 
> As Robin remarked, this could be really a case-by-case situation.
> So, I wonder if we can rethink of adding a configurable threshold
> that has a default value at its current setup matching MMU side.
> 
> If this is acceptable, what can be the preferable way of having a
> configuration: a per-SMMU or a per-IOMMU-group sysfs node? I am
> open for any other option too.

Maybe it should be more dynamic: you get xx ms to push
invalidations, otherwise it gives up and does an invalidate-all?

The busier the system, the broader the invalidation?

Or do we need to measure invalidation performance at boot time and
set a threshold that way?

Also, it seems to me that SVA use cases and, say, DMA API cases are
somewhat different where we may be willing to wait longer for DMA API.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2024-01-22 13:01               ` Jason Gunthorpe
@ 2024-01-22 17:24                 ` Nicolin Chen
  2024-01-22 17:57                   ` Jason Gunthorpe
       [not found]                   ` <098d64da-ecf5-4a23-bff9-a04840726ef0@huawei.com>
  0 siblings, 2 replies; 34+ messages in thread
From: Nicolin Chen @ 2024-01-22 17:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will@kernel.org, Robin Murphy, joro@8bytes.org,
	jean-philippe@linaro.org, Alistair Popple,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

Hi Jason,

Thank you for the ideas!

On Mon, Jan 22, 2024 at 09:01:52AM -0400, Jason Gunthorpe wrote:
> On Sat, Jan 20, 2024 at 11:59:45AM -0800, Nicolin Chen wrote:
> > Hi Robin/Will,
> > 
> > On Tue, Aug 29, 2023 at 02:25:10PM -0700, Robin Murphy wrote:
> > > > Also, what we need is actually an arbitrary number for max_tlbi_ops.
> > > > And I think it could be irrelevant to the page size, i.e. either a
> > > > 4K pgsize or a 64K pgsize could use the same max_tlbi_ops number,
> > > > because what eventually impacts the latency is the number of loops
> > > > of building/issuing commands.
> > > 
> > > Although there is perhaps a counter-argument for selective invalidation,
> > > that if you're using 64K pages to improve TLB hit rates vs. 4K, then you
> > > have more to lose from overinvalidation, so maybe a single threshold
> > > isn't so appropriate for both.
> > > 
> > > Yes, ultimately it all comes down to picking an arbitrary number, but
> > > the challenge is that we want to pick a *good* one, and ideally have
> > > some reasoning behind it. As Will suggested, copying what the mm layer
> > > does gives us an easy line of reasoning, even if it's just in the form
> > > of passing the buck. And that's actually quite attractive, since
> > > otherwise we'd then have to get into the question of what really is the
> > > latency of building and issuing commands, since that clearly depends on
> > > how fast the CPU is, and how fast the SMMU is, and how busy the SMMU is,
> > > and how large the command queue is, and how many other CPUs are also
> > > contending for the command queue... and very quickly it becomes hard to
> > > believe that any simple constant can be good for all possible systems.
> > 
> > So, here we have another request to optimize this number further,
> > though the merged arbitrary number copied from MMU side could fix
> > the soft lockup. The iommu_unmap delay with a 500MB buffer is not
> > quite satisfying on our testing chip, since the threshold now for
> > max_tlbi_ops is at 512MB for 64K pgsize (8192 * 64KB).
> > 
> > As Robin remarked, this could be really a case-by-case situation.
> > So, I wonder if we can rethink of adding a configurable threshold
> > that has a default value at its current setup matching MMU side.
> > 
> > If this is acceptable, what can be the preferable way of having a
> > configuration: a per-SMMU or a per-IOMMU-group sysfs node? I am
> > open for any other option too.
> 
> Maybe it should be more dynamic and you get xx ms to push
> invalidations otherwise it gives up and does invalidate all?
> 
> The busier the system the broader the invalidation?

Yea, I think this could be good one.

> Or do we need to measure at boot time invalidation performance and set
> a threshold that way?

I see. We can run an invalidation at the default max_tlbi_ops to
measure its delay in msec or usec, and then set that as the "xx ms"
threshold in the first idea.

> Also, it seems to me that SVA use cases and, say, DMA API cases are
> somewhat different where we may be willing to wait longer for DMA API.

Hmm, the lockup that my patch fixed was for an SVA case that
doesn't seem to involve DMA API:
https://lore.kernel.org/linux-iommu/20230901203904.4073-1-nicolinc@nvidia.com/

And the other lockup fix for a non-SVA case from Zhang doesn't
seem to involve DMA API either:
https://lore.kernel.org/linux-iommu/e74ea905-d107-4202-97ca-c2c509e7aa1e@huawei.com/

Maybe we can treat the DMA API a bit differently. But I am not sure
about the justification for letting it wait longer. Mind
elaborating?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2024-01-22 17:24                 ` Nicolin Chen
@ 2024-01-22 17:57                   ` Jason Gunthorpe
  2024-01-24  0:11                     ` Nicolin Chen
       [not found]                   ` <098d64da-ecf5-4a23-bff9-a04840726ef0@huawei.com>
  1 sibling, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-01-22 17:57 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will@kernel.org, Robin Murphy, joro@8bytes.org,
	jean-philippe@linaro.org, Alistair Popple,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

On Mon, Jan 22, 2024 at 09:24:08AM -0800, Nicolin Chen wrote:
> > Or do we need to measure at boot time invalidation performance and set
> > a threshold that way?
> 
> I see. We can run an invalidation at default max_tlbi_ops to
> get its delay in msec or usec, and then set as the threshold
> "xx ms" in the idea one.
> 
> > Also, it seems to me that SVA use cases and, say, DMA API cases are
> > somewhat different where we may be willing to wait longer for DMA API.
> 
> Hmm, the lockup that my patch fixed was for an SVA case that
> doesn't seem to involve DMA API:
> https://lore.kernel.org/linux-iommu/20230901203904.4073-1-nicolinc@nvidia.com/
> 
> And the other lockup fix for a non-SVA case from Zhang doesn't
> seem to involve DMA API either:
> https://lore.kernel.org/linux-iommu/e74ea905-d107-4202-97ca-c2c509e7aa1e@huawei.com/
> 
> Maybe we can treat DMA API a bit different. But I am not sure
> about the justification of leaving it to wait longer. Mind
> elaborating?

Well, there are two issues. The first is the soft lockup, which should
just be reliably prevented. The timer, for instance, is a reasonable
stab at making that universally safe.

Then there is the issue of raw invalidation performance, where SVA in
particular is linked to the mm, and the longer invalidation takes,
the slower the apps will be. We don't have any idea where future DMA
might hit the cache, so it is hard to know whether invalidating
everything is not the right thing.

The DMA API is often lazy and the active DMA is a bit more predictable,
so perhaps there is more merit in spending more time narrowing the
invalidation.

The other case was vfio unmap for VM tear down, which ideally would
use whole ASID invalidation.

If your issue is the soft lockup, not performance, then that should be
prevented strongly. Broadly speaking, if SVA is pushing too high an
invalidation workload then we need to aggressively trim it, and do so
dynamically. Certainly we should not have a tunable that has to be set
right to avoid soft lockup.

A tunable to improve performance, perhaps, but not to achieve basic
correctness.

Maybe it is really just a simple thing: compute how many invalidation
commands are needed, and if they don't all fit in the current queue
space, then do an invalidate-all instead?
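
i.e. roughly this check, as a standalone sketch with made-up numbers
(assuming one TLBI command per granule in the non-RIL case):

#include <stdbool.h>
#include <stdio.h>

static bool use_invalidate_all(unsigned long size, unsigned long granule,
			       unsigned long queue_space)
{
	unsigned long ncmds = size / granule;	/* one TLBI per granule without RIL */

	return ncmds > queue_space;
}

int main(void)
{
	/* e.g. 512MB at 64K granule needs 8192 commands */
	printf("%s\n", use_invalidate_all(512UL << 20, 64UL << 10, 1024) ?
	       "invalidate all" : "per-granule TLBIs");
	return 0;
}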

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2024-01-22 17:57                   ` Jason Gunthorpe
@ 2024-01-24  0:11                     ` Nicolin Chen
  2024-01-25 13:55                       ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2024-01-24  0:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will@kernel.org, Robin Murphy, joro@8bytes.org,
	jean-philippe@linaro.org, Alistair Popple,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

On Mon, Jan 22, 2024 at 01:57:00PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 22, 2024 at 09:24:08AM -0800, Nicolin Chen wrote:
> > > Or do we need to measure at boot time invalidation performance and set
> > > a threshold that way?
> > 
> > I see. We can run an invalidation at default max_tlbi_ops to
> > get its delay in msec or usec, and then set as the threshold
> > "xx ms" in the idea one.
> > 
> > > Also, it seems to me that SVA use cases and, say, DMA API cases are
> > > somewhat different where we may be willing to wait longer for DMA API.
> > 
> > Hmm, the lockup that my patch fixed was for an SVA case that
> > doesn't seem to involve DMA API:
> > https://lore.kernel.org/linux-iommu/20230901203904.4073-1-nicolinc@nvidia.com/
> > 
> > And the other lockup fix for a non-SVA case from Zhang doesn't
> > seem to involve DMA API either:
> > https://lore.kernel.org/linux-iommu/e74ea905-d107-4202-97ca-c2c509e7aa1e@huawei.com/
> > 
> > Maybe we can treat DMA API a bit different. But I am not sure
> > about the justification of leaving it to wait longer. Mind
> > elaborating?
> 
> Well, there are two issues.. The first is the soft lockup, that should
> just be reliably prevented. The timer, for instance, is a reasonable
> stab at making that universally safe.
> 
> Then there is the issue of just raw invalidation performance, where
> SVA particularly is linked to the mm and the longer invalidation takes
> the slower the apps will be. We don't have any idea where future DMA
> might hit the cache, so it is hard to know if all invalidation is not
> the right thing..
> 
> DMA api is often lazy and the active DMA is a bit more predictable, so
> perhaps there is more merit in spending more time to narrow the
> invalidation.
> 
> The other case was vfio unmap for VM tear down, which ideally would
> use whole ASID invalidation.

I see! Then we would need a flag to pass into __iommu_dma_unmap or so:
if the caller is in dma-iommu.c, do the longer per-page invalidation.

> If your issue is softlockup, not performance, then that should be

We have both issues.

> prevented strongly. Broadly speaking if SVA is pushing too high an
> invalidation workload then we need to aggressively trim it, and do so
> dynamically. Certainly we should not have a tunable that has to be set
> right to avoid soft lockup.
> 
> A tunable to improve performance, perhaps, but not to achieve basic
> correctness.

So, should we make an optional tunable only for those who care
about performance? Though I think having a tunable would just
fix both issues.

> Maybe it is really just a simple thing - compute how many invalidation
> commands are needed, if they don't all fit in the current queue space,
> then do an invalidate all instead?

The queue could actually have a large space. But one large-size
invalidation would be divided into batches that have to execute
back-to-back. And the batch size is 64 commands in the 64-bit case,
which might be too small as a cap.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
       [not found]                   ` <098d64da-ecf5-4a23-bff9-a04840726ef0@huawei.com>
@ 2024-01-25  5:09                     ` Nicolin Chen
  0 siblings, 0 replies; 34+ messages in thread
From: Nicolin Chen @ 2024-01-25  5:09 UTC (permalink / raw)
  To: zhangzekun (A)
  Cc: Jason Gunthorpe, will@kernel.org, Robin Murphy, joro@8bytes.org,
	jean-philippe@linaro.org, Alistair Popple,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

On Wed, Jan 24, 2024 at 02:20:14PM +0800, zhangzekun (A) wrote:
> Also, it seems to me that SVA use cases and, say, DMA API cases are
> somewhat different where we may be willing to wait longer for DMA API.
 
> Hmm, the lockup that my patch fixed was for an SVA case that
> doesn't seem to involve DMA API:
> https://lore.kernel.org/linux-iommu/20230901203904.4073-1-nicolinc@nvidia.com/
> 
> And the other lockup fix for a non-SVA case from Zhang doesn't
> seem to involve DMA API either:
> https://lore.kernel.org/linux-iommu/e74ea905-d107-4202-97ca-c2c509e7aa1e@huawei.com/
> 
> 
> Hi, Nicolin
> 
> These patches do involve DMA APIs, because they modify arm_smmu_tlb_inv_walk(), which will be called when unmapping DMA address space.
> 
> iommu_dma_unmap_page
>    __iommu_dma_unmap
>        __iommu_unmap
>            arm_smmu_unmap_pages
>                arm_lpae_unmap_pages
>                    io_pgtable_tlb_flush_walk
>                        arm_smmu_tlb_inv_walk
> 

Oh, thanks for clarifying. I was just looking at your trace log
there, but overlooked other places.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2024-01-24  0:11                     ` Nicolin Chen
@ 2024-01-25 13:55                       ` Jason Gunthorpe
  2024-01-25 17:23                         ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 13:55 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will@kernel.org, Robin Murphy, joro@8bytes.org,
	jean-philippe@linaro.org, Alistair Popple,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

On Tue, Jan 23, 2024 at 04:11:09PM -0800, Nicolin Chen wrote:
> > prevented strongly. Broadly speaking if SVA is pushing too high an
> > invalidation workload then we need to aggressively trim it, and do so
> > dynamically. Certainly we should not have a tunable that has to be set
> > right to avoid soft lockup.
> > 
> > A tunable to improve performance, perhaps, but not to achieve basic
> > correctness.
> 
> So, should we make an optional tunable only for those who care
> about performance? Though I think having a tunable would just
> fix both issues.

When the soft lockup issue is solved you can consider whether a tunable
is still interesting.
 
> > Maybe it is really just a simple thing - compute how many invalidation
> > commands are needed, if they don't all fit in the current queue space,
> > then do an invalidate all instead?
> 
> The queue could actually have a large space. But one large-size
> invalidation would be divided into batches that have to execute
> back-to-back. And the batch size is 64 commands in 64-bit case,
> which might be too small as a cap.

Yes, some notable code reorganizing would be needed to implement
something like this.

Broadly I'd sketch sort of:

 - Figure out how fast the HW can execute a lot of commands
 - The above should drive some XX maximum number of commands, maybe we
   need to measure at boot, IDK
 - Strongly time bound SVA invalidation:
   * No more than XX commands, if more needed then push invalidate
     all
   * All commands must fit in the available queue space, if more
     needed then push invalidate all
 - The total queue depth must not be larger than YY based on the
   retire rate so that even a full queue will complete invalidation
   below the target time.

A tunable indicating what the SVA time bound target should be might be
appropriate.
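
As a very rough standalone model of that budget calculation (the rate
and the budget below are made-up numbers; a real implementation would
measure the former and expose the latter as the tunable):

#include <stdbool.h>
#include <stdio.h>

struct inv_budget {
	unsigned long cmds_per_us;	/* measured, e.g. at boot */
	unsigned long target_us;	/* the SVA time-bound tunable */
	unsigned long queue_space;	/* free slots in the command queue */
};

static bool push_invalidate_all(const struct inv_budget *b, unsigned long ncmds)
{
	unsigned long max_cmds = b->cmds_per_us * b->target_us;

	/* Too many commands for the time budget or the queue space:
	 * give up and invalidate the whole ASID/VMID instead. */
	return ncmds > max_cmds || ncmds > b->queue_space;
}

int main(void)
{
	struct inv_budget b = { .cmds_per_us = 16, .target_us = 100,
				.queue_space = 4096 };

	printf("1000 cmds -> %s\n",
	       push_invalidate_all(&b, 1000) ? "invalidate all" : "ranged");
	printf("8192 cmds -> %s\n",
	       push_invalidate_all(&b, 8192) ? "invalidate all" : "ranged");
	return 0;
}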

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2024-01-25 13:55                       ` Jason Gunthorpe
@ 2024-01-25 17:23                         ` Nicolin Chen
  2024-01-25 17:47                           ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Nicolin Chen @ 2024-01-25 17:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will@kernel.org, Robin Murphy, joro@8bytes.org,
	jean-philippe@linaro.org, Alistair Popple,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

On Thu, Jan 25, 2024 at 09:55:37AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 23, 2024 at 04:11:09PM -0800, Nicolin Chen wrote:
> > > prevented strongly. Broadly speaking if SVA is pushing too high an
> > > invalidation workload then we need to aggressively trim it, and do so
> > > dynamically. Certainly we should not have a tunable that has to be set
> > > right to avoid soft lockup.
> > > 
> > > A tunable to improve performance, perhaps, but not to achieve basic
> > > correctness.
> > 
> > So, should we make an optional tunable only for those who care
> > about performance? Though I think having a tunable would just
> > fix both issues.
> 
> When the soft lockup issue is solved you can consider if a tunable is
> still interesting..

Yea, it would be on top of the soft lockup fix. I assume we are
still going with your change, arm_smmu_inv_range_too_big, though
I wonder if we should apply it before your rework series to make it
a bug fix.

> > > Maybe it is really just a simple thing - compute how many invalidation
> > > commands are needed, if they don't all fit in the current queue space,
> > > then do an invalidate all instead?
> > 
> > The queue could actually have a large space. But one large-size
> > invalidation would be divided into batches that have to execute
> > back-to-back. And the batch size is 64 commands in 64-bit case,
> > which might be too small as a cap.
> 
> Yes, some notable code reorganizing would be needed to implement
> something like this
> 
> Broadly I'd sketch sort of:
> 
>  - Figure out how fast the HW can execute a lot of commands
>  - The above should drive some XX maximum number of commands, maybe we
>    need to measure at boot, IDK
>  - Strongly time bound SVA invalidation:
>    * No more than XX commands, if more needed then push invalidate
>      all
>    * All commands must fit in the available queue space, if more
>      needed then push invalidate all
>  - The total queue depth must not be larger than YY based on the
>    retire rate so that even a full queue will complete invalidation
>    below the target time.
> 
> A tunable indicating what the SVA time bound target should be might be
> appropriate..

Thanks for listing it out. I will draft something with that. Also,
should we confine it to SVA only, or to non-DMA callers in general?

Nicolin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2024-01-25 17:23                         ` Nicolin Chen
@ 2024-01-25 17:47                           ` Jason Gunthorpe
  2024-01-25 19:55                             ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 17:47 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will@kernel.org, Robin Murphy, joro@8bytes.org,
	jean-philippe@linaro.org, Alistair Popple,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

On Thu, Jan 25, 2024 at 09:23:00AM -0800, Nicolin Chen wrote:
> > When the soft lockup issue is solved you can consider if a tunable is
> > still interesting..
> 
> Yea, it would be on top of the soft lockup fix. I assume we are
> still going with your change: arm_smmu_inv_range_too_big, though
> I wonder if we should apply before your rework series to make it
> a bug fix..

It depends on what change you settle on.

> > > > Maybe it is really just a simple thing - compute how many invalidation
> > > > commands are needed, if they don't all fit in the current queue space,
> > > > then do an invalidate all instead?
> > > 
> > > The queue could actually have a large space. But one large-size
> > > invalidation would be divided into batches that have to execute
> > > back-to-back. And the batch size is 64 commands in 64-bit case,
> > > which might be too small as a cap.
> > 
> > Yes, some notable code reorganizing would be needed to implement
> > something like this
> > 
> > Broadly I'd sketch sort of:
> > 
> >  - Figure out how fast the HW can execute a lot of commands
> >  - The above should drive some XX maximum number of commands, maybe we
> >    need to measure at boot, IDK
> >  - Strongly time bound SVA invalidation:
> >    * No more than XX commands, if more needed then push invalidate
> >      all
> >    * All commands must fit in the available queue space, if more
> >      needed then push invalidate all
> >  - The total queue depth must not be larger than YY based on the
> >    retire rate so that even a full queue will complete invalidation
> >    below the target time.
> > 
> > A tunable indicating what the SVA time bound target should be might be
> > appropriate..
> 
> Thanks for listing it out. I will draft something with that, and
> should we just confine it to SVA or non DMA callers in general?

Also, how much of this SVA issue is multithreaded? Will multiple
command queues improve anything?

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg
  2024-01-25 17:47                           ` Jason Gunthorpe
@ 2024-01-25 19:55                             ` Nicolin Chen
  0 siblings, 0 replies; 34+ messages in thread
From: Nicolin Chen @ 2024-01-25 19:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will@kernel.org, Robin Murphy, joro@8bytes.org,
	jean-philippe@linaro.org, Alistair Popple,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev

On Thu, Jan 25, 2024 at 01:47:28PM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 25, 2024 at 09:23:00AM -0800, Nicolin Chen wrote:
> > > When the soft lockup issue is solved you can consider if a tunable is
> > > still interesting..
> > 
> > Yea, it would be on top of the soft lockup fix. I assume we are
> > still going with your change: arm_smmu_inv_range_too_big, though
> > I wonder if we should apply before your rework series to make it
> > a bug fix..
> 
> It depends what change you settle on..

I mean your arm_smmu_inv_range_too_big patch. Should it be a bug
fix CCing the stable tree? My previous SVA fix was, by the way.

> > > > > Maybe it is really just a simple thing - compute how many invalidation
> > > > > commands are needed, if they don't all fit in the current queue space,
> > > > > then do an invalidate all instead?
> > > > 
> > > > The queue could actually have a large space. But one large-size
> > > > invalidation would be divided into batches that have to execute
> > > > back-to-back. And the batch size is 64 commands in 64-bit case,
> > > > which might be too small as a cap.
> > > 
> > > Yes, some notable code reorganizing would be needed to implement
> > > something like this
> > > 
> > > Broadly I'd sketch sort of:
> > > 
> > >  - Figure out how fast the HW can execute a lot of commands
> > >  - The above should drive some XX maximum number of commands, maybe we
> > >    need to measure at boot, IDK
> > >  - Strongly time bound SVA invalidation:
> > >    * No more than XX commands, if more needed then push invalidate
> > >      all
> > >    * All commands must fit in the available queue space, if more
> > >      needed then push invalidate all
> > >  - The total queue depth must not be larger than YY based on the
> > >    retire rate so that even a full queue will complete invalidation
> > >    below the target time.
> > > 
> > > A tunable indicating what the SVA time bound target should be might be
> > > appropriate..
> > 
> > Thanks for listing it out. I will draft something with that. And
> > should we confine it just to SVA, or to non-DMA callers in general?
> 
> Also, how much of this SVA issue is multithreaded? Will multiple
> command queues improve anything?

From measurement, the bottleneck is mostly the SMMU consuming the
commands with a single CMDQ in hardware, so multithreading is
unlikely to help. And VCMDQ only provides a multi-queue
interface/wrapper for VM isolation.

Thanks
Nic

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2024-01-25 19:56 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-22  8:45 [PATCH 0/3] iommu/arm-smmu-v3: Reduce latency in __arm_smmu_tlb_inv_range() Nicolin Chen
2023-08-22  8:45 ` [PATCH 1/3] iommu/io-pgtable-arm: Add nents_per_pgtable in struct io_pgtable_cfg Nicolin Chen
2023-08-22  9:19   ` Robin Murphy
2023-08-22 16:42     ` Nicolin Chen
2023-08-29 15:37       ` Robin Murphy
2023-08-29 20:15         ` Nicolin Chen
2023-08-29 21:25           ` Robin Murphy
2023-08-29 22:15             ` Nicolin Chen
2023-08-30 21:49               ` Will Deacon
2023-08-31 17:39                 ` Nicolin Chen
2023-09-01  0:08                   ` Nicolin Chen
2023-09-01 18:02                     ` Robin Murphy
2023-09-01 18:23                       ` Nicolin Chen
2024-01-20 19:59             ` Nicolin Chen
2024-01-22 13:01               ` Jason Gunthorpe
2024-01-22 17:24                 ` Nicolin Chen
2024-01-22 17:57                   ` Jason Gunthorpe
2024-01-24  0:11                     ` Nicolin Chen
2024-01-25 13:55                       ` Jason Gunthorpe
2024-01-25 17:23                         ` Nicolin Chen
2024-01-25 17:47                           ` Jason Gunthorpe
2024-01-25 19:55                             ` Nicolin Chen
     [not found]                   ` <098d64da-ecf5-4a23-bff9-a04840726ef0@huawei.com>
2024-01-25  5:09                     ` Nicolin Chen
2023-08-22  8:45 ` [PATCH 2/3] iommu/arm-smmu-v3: Add an arm_smmu_tlb_inv_domain helper Nicolin Chen
2023-08-22  9:40   ` Robin Murphy
2023-08-22 17:03     ` Nicolin Chen
2023-08-29 21:54       ` Robin Murphy
2023-08-29 23:03         ` Nicolin Chen
2023-08-22  8:45 ` [PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range() Nicolin Chen
2023-08-22  9:30   ` Robin Murphy
2023-08-22 16:32     ` Nicolin Chen
2023-08-22 23:04       ` Nicolin Chen
2023-08-29 22:40         ` Robin Murphy
2023-08-29 23:14           ` Nicolin Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).