public inbox for linux-kernel@vger.kernel.org
* [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
@ 2025-09-27 22:39 Daniel Mentz
  2025-09-27 22:39 ` [PATCH 2/2] drivers/arm-smmu-v3: " Daniel Mentz
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Daniel Mentz @ 2025-09-27 22:39 UTC (permalink / raw)
  To: iommu, linux-kernel
  Cc: Will Deacon, Mostafa Saleh, Pranjal Shrivastava, Liviu Dudau,
	Jason Gunthorpe, Rob Clark, Daniel Mentz

On systems with a non-coherent SMMU, the CPU must perform cache
maintenance operations (CMOs) after updating page table entries (PTEs).
This ensures that the SMMU reads the latest version of the descriptors
and not stale data from memory.

This requirement can lead to significant performance overhead,
especially when mapping long scatter-gather lists where each sg entry
maps into an iommu_map() call that only covers 4KB of address space.

In such scenarios, each small mapping operation modifies a single 8-byte
PTE but triggers a cache clean for the entire cache line (e.g., 64
bytes). This results in the same cache line being cleaned repeatedly,
once for each PTE it contains.

A more efficient implementation performs the cache clean operation only
after updating all descriptors that are co-located in the same cache
line. This patch introduces a mechanism to defer and batch the cache
maintenance:

A new boolean flag, defer_sync_pte, is added to struct io_pgtable_cfg.
When this flag is set, the arm-lpae backend will skip the cache sync
operation for leaf entries within its .map_pages implementation.

A new callback, .iotlb_sync_map, is added to struct io_pgtable_ops.
After performing a series of mapping operations, the caller is
responsible for invoking this callback for the entire IOVA range. This
function then walks the page tables for the specified range and performs
the necessary cache clean operations for all the modified PTEs.

This allows for a single, efficient cache maintenance operation to cover
multiple PTE updates, significantly reducing overhead for workloads that
perform many small, contiguous mappings.

Signed-off-by: Daniel Mentz <danielmentz@google.com>
---
 drivers/iommu/io-pgtable-arm.c | 66 +++++++++++++++++++++++++++++++++-
 include/linux/io-pgtable.h     |  7 ++++
 2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 7e8e2216c294..a970eefb07fb 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -353,7 +353,7 @@ static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
 	for (i = 0; i < num_entries; i++)
 		ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
 
-	if (!cfg->coherent_walk)
+	if (!cfg->coherent_walk && !cfg->defer_sync_pte)
 		__arm_lpae_sync_pte(ptep, num_entries, cfg);
 }
 
@@ -582,6 +582,69 @@ static int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
 	return ret;
 }
 
+static int __arm_lpae_iotlb_sync_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
+			      size_t size, int lvl, arm_lpae_iopte *ptep)
+{
+	struct io_pgtable *iop = &data->iop;
+	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
+	int ret = 0, num_entries, max_entries;
+	unsigned long iova_offset, sync_idx_start, sync_idx_end;
+	int i, shift, synced_entries = 0;
+
+	shift = (ARM_LPAE_LVL_SHIFT(lvl - 1, data) + ARM_LPAE_PGD_IDX(lvl - 1, data));
+	iova_offset = iova & ((1ULL << shift) - 1);
+	sync_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
+	sync_idx_end = (iova_offset + size + block_size - ARM_LPAE_GRANULE(data)) >>
+		ARM_LPAE_LVL_SHIFT(lvl, data);
+	max_entries = arm_lpae_max_entries(sync_idx_start, data);
+	num_entries = min_t(unsigned long, sync_idx_end - sync_idx_start, max_entries);
+	ptep += sync_idx_start;
+
+	if (lvl < (ARM_LPAE_MAX_LEVELS - 1)) {
+		for (i = 0; i < num_entries; i++) {
+			arm_lpae_iopte pte = READ_ONCE(ptep[i]);
+			unsigned long synced;
+
+			WARN_ON(!pte);
+
+			if (iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE) {
+				int n = i - synced_entries;
+
+				if (n) {
+					__arm_lpae_sync_pte(&ptep[synced_entries], n, &iop->cfg);
+					synced_entries += n;
+				}
+				ret = __arm_lpae_iotlb_sync_map(data, iova, size, lvl + 1,
+								iopte_deref(pte, data));
+				synced_entries++;
+			}
+			synced = block_size - (iova & (block_size - 1));
+			size -= synced;
+			iova += synced;
+		}
+	}
+
+	if (synced_entries != num_entries)
+		__arm_lpae_sync_pte(&ptep[synced_entries], num_entries - synced_entries, &iop->cfg);
+
+	return ret;
+}
+
+static int arm_lpae_iotlb_sync_map(struct io_pgtable_ops *ops, unsigned long iova,
+			    size_t size)
+{
+	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
+	struct io_pgtable_cfg *cfg = &data->iop.cfg;
+	arm_lpae_iopte *ptep = data->pgd;
+	int lvl = data->start_level;
+	long iaext = (s64)iova >> cfg->ias;
+
+	WARN_ON(!size);
+	WARN_ON(iaext);
+
+	return __arm_lpae_iotlb_sync_map(data, iova, size, lvl, ptep);
+}
+
 static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
 				    arm_lpae_iopte *ptep)
 {
@@ -949,6 +1012,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
 	data->iop.ops = (struct io_pgtable_ops) {
 		.map_pages	= arm_lpae_map_pages,
 		.unmap_pages	= arm_lpae_unmap_pages,
+		.iotlb_sync_map	= cfg->coherent_walk ? NULL : arm_lpae_iotlb_sync_map,
 		.iova_to_phys	= arm_lpae_iova_to_phys,
 		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
 		.pgtable_walk	= arm_lpae_pgtable_walk,
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index 138fbd89b1e6..c4927dbc0f61 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -57,6 +57,9 @@ struct iommu_flush_ops {
  * @oas:           Output address (paddr) size, in bits.
  * @coherent_walk  A flag to indicate whether or not page table walks made
  *                 by the IOMMU are coherent with the CPU caches.
+ * @defer_sync_pte A flag to indicate whether pte sync operations for leaf
+ *                 entries shall be skipped during calls to .map_pages. A
+ *                 subsequent call to .iotlb_sync_map is required.
  * @tlb:           TLB management callbacks for this set of tables.
  * @iommu_dev:     The device representing the DMA configuration for the
  *                 page table walker.
@@ -110,6 +113,7 @@ struct io_pgtable_cfg {
 	unsigned int			ias;
 	unsigned int			oas;
 	bool				coherent_walk;
+	bool				defer_sync_pte;
 	const struct iommu_flush_ops	*tlb;
 	struct device			*iommu_dev;
 
@@ -204,6 +208,7 @@ struct arm_lpae_io_pgtable_walk_data {
  * @unmap_pages:  Unmap a range of virtually contiguous pages of the same size.
  * @iova_to_phys: Translate iova to physical address.
  * @pgtable_walk: (optional) Perform a page table walk for a given iova.
+ * @iotlb_sync_map: Sync ptes (Required for non-coherent page table walks)
  *
  * These functions map directly onto the iommu_ops member functions with
  * the same names.
@@ -222,6 +227,8 @@ struct io_pgtable_ops {
 				    unsigned long iova, size_t size,
 				    unsigned long flags,
 				    struct iommu_dirty_bitmap *dirty);
+	int (*iotlb_sync_map)(struct io_pgtable_ops *ops, unsigned long iova,
+			      size_t size);
 };
 
 /**
-- 
2.51.0.570.gb178f27e6d-goog


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/2] drivers/arm-smmu-v3: Implement .iotlb_sync_map callback
  2025-09-27 22:39 [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback Daniel Mentz
@ 2025-09-27 22:39 ` Daniel Mentz
  2025-09-29 11:58   ` Jason Gunthorpe
  2025-09-29 12:25   ` Mostafa Saleh
  2025-09-29 12:21 ` [PATCH 1/2] iommu/io-pgtable-arm: " Mostafa Saleh
  2025-09-29 19:57 ` Pranjal Shrivastava
  2 siblings, 2 replies; 17+ messages in thread
From: Daniel Mentz @ 2025-09-27 22:39 UTC (permalink / raw)
  To: iommu, linux-kernel
  Cc: Will Deacon, Mostafa Saleh, Pranjal Shrivastava, Liviu Dudau,
	Jason Gunthorpe, Rob Clark, Daniel Mentz

Implement .iotlb_sync_map callback based on the new callback in the
io-pgtable code.

Signed-off-by: Daniel Mentz <danielmentz@google.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2a8b46b948f0..0ffcc6c8e4bf 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2541,6 +2541,7 @@ static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
 	pgtbl_cfg = (struct io_pgtable_cfg) {
 		.pgsize_bitmap	= smmu->pgsize_bitmap,
 		.coherent_walk	= smmu->features & ARM_SMMU_FEAT_COHERENCY,
+		.defer_sync_pte	= true,
 		.tlb		= &arm_smmu_flush_ops,
 		.iommu_dev	= smmu->dev,
 	};
@@ -3366,6 +3367,18 @@ static int arm_smmu_map_pages(struct iommu_domain *domain, unsigned long iova,
 	return ops->map_pages(ops, iova, paddr, pgsize, pgcount, prot, gfp, mapped);
 }
 
+static int arm_smmu_iotlb_sync_map(struct iommu_domain *domain,
+				    unsigned long iova, size_t size)
+{
+	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+
+	if (!ops || !ops->iotlb_sync_map)
+		return 0;
+
+	ops->iotlb_sync_map(ops, iova, size);
+	return 0;
+}
+
 static size_t arm_smmu_unmap_pages(struct iommu_domain *domain, unsigned long iova,
 				   size_t pgsize, size_t pgcount,
 				   struct iommu_iotlb_gather *gather)
@@ -3700,6 +3713,7 @@ static const struct iommu_ops arm_smmu_ops = {
 		.map_pages		= arm_smmu_map_pages,
 		.unmap_pages		= arm_smmu_unmap_pages,
 		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
+		.iotlb_sync_map		= arm_smmu_iotlb_sync_map,
 		.iotlb_sync		= arm_smmu_iotlb_sync,
 		.iova_to_phys		= arm_smmu_iova_to_phys,
 		.free			= arm_smmu_domain_free_paging,
-- 
2.51.0.570.gb178f27e6d-goog



* Re: [PATCH 2/2] drivers/arm-smmu-v3: Implement .iotlb_sync_map callback
  2025-09-27 22:39 ` [PATCH 2/2] drivers/arm-smmu-v3: " Daniel Mentz
@ 2025-09-29 11:58   ` Jason Gunthorpe
  2025-09-29 12:24     ` Mostafa Saleh
  2025-09-29 12:25   ` Mostafa Saleh
  1 sibling, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2025-09-29 11:58 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: iommu, linux-kernel, Will Deacon, Mostafa Saleh,
	Pranjal Shrivastava, Liviu Dudau, Rob Clark

On Sat, Sep 27, 2025 at 10:39:53PM +0000, Daniel Mentz wrote:
> @@ -3700,6 +3713,7 @@ static const struct iommu_ops arm_smmu_ops = {
>  		.map_pages		= arm_smmu_map_pages,
>  		.unmap_pages		= arm_smmu_unmap_pages,
>  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
> +		.iotlb_sync_map		= arm_smmu_iotlb_sync_map,

Shouldn't this avoid defining the op on coherent systems?

Jason


* Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
  2025-09-27 22:39 [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback Daniel Mentz
  2025-09-27 22:39 ` [PATCH 2/2] drivers/arm-smmu-v3: " Daniel Mentz
@ 2025-09-29 12:21 ` Mostafa Saleh
  2025-09-29 21:00   ` Daniel Mentz
  2025-09-29 19:57 ` Pranjal Shrivastava
  2 siblings, 1 reply; 17+ messages in thread
From: Mostafa Saleh @ 2025-09-29 12:21 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: iommu, linux-kernel, Will Deacon, Pranjal Shrivastava,
	Liviu Dudau, Jason Gunthorpe, Rob Clark

On Sat, Sep 27, 2025 at 10:39:52PM +0000, Daniel Mentz wrote:
> On systems with a non-coherent SMMU, the CPU must perform cache
> maintenance operations (CMOs) after updating page table entries (PTEs).
> This ensures that the SMMU reads the latest version of the descriptors
> and not stale data from memory.
> 
> This requirement can lead to significant performance overhead,
> especially when mapping long scatter-gather lists where each sg entry
> maps into an iommu_map() call that only covers 4KB of address space.
> 
> In such scenarios, each small mapping operation modifies a single 8-byte
> PTE but triggers a cache clean for the entire cache line (e.g., 64
> bytes). This results in the same cache line being cleaned repeatedly,
> once for each PTE it contains.
> 
> A more efficient implementation performs the cache clean operation only
> after updating all descriptors that are co-located in the same cache
> line. This patch introduces a mechanism to defer and batch the cache
> maintenance:
> 
> A new boolean flag, defer_sync_pte, is added to struct io_pgtable_cfg.
> When this flag is set, the arm-lpae backend will skip the cache sync
> operation for leaf entries within its .map_pages implementation.
> 
> A new callback, .iotlb_sync_map, is added to struct io_pgtable_ops.
> After performing a series of mapping operations, the caller is
> responsible for invoking this callback for the entire IOVA range. This
> function then walks the page tables for the specified range and performs
> the necessary cache clean operations for all the modified PTEs.
> 
> This allows for a single, efficient cache maintenance operation to cover
> multiple PTE updates, significantly reducing overhead for workloads that
> perform many small, contiguous mappings.
> 
> Signed-off-by: Daniel Mentz <danielmentz@google.com>
> ---
>  drivers/iommu/io-pgtable-arm.c | 66 +++++++++++++++++++++++++++++++++-
>  include/linux/io-pgtable.h     |  7 ++++
>  2 files changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 7e8e2216c294..a970eefb07fb 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -353,7 +353,7 @@ static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>  	for (i = 0; i < num_entries; i++)
>  		ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
>  
> -	if (!cfg->coherent_walk)
> +	if (!cfg->coherent_walk && !cfg->defer_sync_pte)
>  		__arm_lpae_sync_pte(ptep, num_entries, cfg);
>  }
>  
> @@ -582,6 +582,69 @@ static int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
>  	return ret;
>  }
>  
> +static int __arm_lpae_iotlb_sync_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> +			      size_t size, int lvl, arm_lpae_iopte *ptep)
> +{
> +	struct io_pgtable *iop = &data->iop;
> +	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +	int ret = 0, num_entries, max_entries;
> +	unsigned long iova_offset, sync_idx_start, sync_idx_end;
> +	int i, shift, synced_entries = 0;
> +
> +	shift = (ARM_LPAE_LVL_SHIFT(lvl - 1, data) + ARM_LPAE_PGD_IDX(lvl - 1, data));
> +	iova_offset = iova & ((1ULL << shift) - 1);
> +	sync_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
> +	sync_idx_end = (iova_offset + size + block_size - ARM_LPAE_GRANULE(data)) >>
> +		ARM_LPAE_LVL_SHIFT(lvl, data);
> +	max_entries = arm_lpae_max_entries(sync_idx_start, data);
> +	num_entries = min_t(unsigned long, sync_idx_end - sync_idx_start, max_entries);
> +	ptep += sync_idx_start;
> +
> +	if (lvl < (ARM_LPAE_MAX_LEVELS - 1)) {
> +		for (i = 0; i < num_entries; i++) {
> +			arm_lpae_iopte pte = READ_ONCE(ptep[i]);
> +			unsigned long synced;
> +
> +			WARN_ON(!pte);
> +
> +			if (iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE) {
> +				int n = i - synced_entries;
> +
> +				if (n) {
> +					__arm_lpae_sync_pte(&ptep[synced_entries], n, &iop->cfg);
> +					synced_entries += n;
> +				}
> +				ret = __arm_lpae_iotlb_sync_map(data, iova, size, lvl + 1,
> +								iopte_deref(pte, data));
> +				synced_entries++;
> +			}
> +			synced = block_size - (iova & (block_size - 1));
> +			size -= synced;
> +			iova += synced;
> +		}
> +	}
> +
> +	if (synced_entries != num_entries)
> +		__arm_lpae_sync_pte(&ptep[synced_entries], num_entries - synced_entries, &iop->cfg);
> +
> +	return ret;
> +}

Can't we rely on the existing generic table walker, "__arm_lpae_iopte_walk",
which is already used for iova_to_phys and the dirty bit, instead of writing
a new one?

Thanks,
Mostafa

> +
> +static int arm_lpae_iotlb_sync_map(struct io_pgtable_ops *ops, unsigned long iova,
> +			    size_t size)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = data->start_level;
> +	long iaext = (s64)iova >> cfg->ias;
> +
> +	WARN_ON(!size);
> +	WARN_ON(iaext);
> +
> +	return __arm_lpae_iotlb_sync_map(data, iova, size, lvl, ptep);
> +}
> +
>  static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
>  				    arm_lpae_iopte *ptep)
>  {
> @@ -949,6 +1012,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>  	data->iop.ops = (struct io_pgtable_ops) {
>  		.map_pages	= arm_lpae_map_pages,
>  		.unmap_pages	= arm_lpae_unmap_pages,
> +		.iotlb_sync_map	= cfg->coherent_walk ? NULL : arm_lpae_iotlb_sync_map,
>  		.iova_to_phys	= arm_lpae_iova_to_phys,
>  		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
>  		.pgtable_walk	= arm_lpae_pgtable_walk,
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index 138fbd89b1e6..c4927dbc0f61 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -57,6 +57,9 @@ struct iommu_flush_ops {
>   * @oas:           Output address (paddr) size, in bits.
>   * @coherent_walk  A flag to indicate whether or not page table walks made
>   *                 by the IOMMU are coherent with the CPU caches.
> + * @defer_sync_pte A flag to indicate whether pte sync operations for leaf
> + *                 entries shall be skipped during calls to .map_pages. A
> + *                 subsequent call to .iotlb_sync_map is required.
>   * @tlb:           TLB management callbacks for this set of tables.
>   * @iommu_dev:     The device representing the DMA configuration for the
>   *                 page table walker.
> @@ -110,6 +113,7 @@ struct io_pgtable_cfg {
>  	unsigned int			ias;
>  	unsigned int			oas;
>  	bool				coherent_walk;
> +	bool				defer_sync_pte;
>  	const struct iommu_flush_ops	*tlb;
>  	struct device			*iommu_dev;
>  
> @@ -204,6 +208,7 @@ struct arm_lpae_io_pgtable_walk_data {
>   * @unmap_pages:  Unmap a range of virtually contiguous pages of the same size.
>   * @iova_to_phys: Translate iova to physical address.
>   * @pgtable_walk: (optional) Perform a page table walk for a given iova.
> + * @iotlb_sync_map: Sync ptes (Required for non-coherent page table walks)
>   *
>   * These functions map directly onto the iommu_ops member functions with
>   * the same names.
> @@ -222,6 +227,8 @@ struct io_pgtable_ops {
>  				    unsigned long iova, size_t size,
>  				    unsigned long flags,
>  				    struct iommu_dirty_bitmap *dirty);
> +	int (*iotlb_sync_map)(struct io_pgtable_ops *ops, unsigned long iova,
> +			      size_t size);
>  };
>  
>  /**
> -- 
> 2.51.0.570.gb178f27e6d-goog
> 


* Re: [PATCH 2/2] drivers/arm-smmu-v3: Implement .iotlb_sync_map callback
  2025-09-29 11:58   ` Jason Gunthorpe
@ 2025-09-29 12:24     ` Mostafa Saleh
  2025-09-29 12:47       ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Mostafa Saleh @ 2025-09-29 12:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Mentz, iommu, linux-kernel, Will Deacon,
	Pranjal Shrivastava, Liviu Dudau, Rob Clark

On Mon, Sep 29, 2025 at 08:58:03AM -0300, Jason Gunthorpe wrote:
> On Sat, Sep 27, 2025 at 10:39:53PM +0000, Daniel Mentz wrote:
> > @@ -3700,6 +3713,7 @@ static const struct iommu_ops arm_smmu_ops = {
> >  		.map_pages		= arm_smmu_map_pages,
> >  		.unmap_pages		= arm_smmu_unmap_pages,
> >  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
> > +		.iotlb_sync_map		= arm_smmu_iotlb_sync_map,
> 
> Shouldn't this avoid defining the op on coherent systems?

Does that mean we need to have two iommu_ops, one each for coherent and
non-coherent SMMUs, since both can be mixed in the same system?

The current implementation would skip the page table call on coherent
SMMUs, As in the first patch it only sets the pointer for them:
 .iotlb_sync_map = cfg->coherent_walk ? NULL : arm_lpae_iotlb_sync_map.

Thanks,
Mostafa

> 
> Jason


* Re: [PATCH 2/2] drivers/arm-smmu-v3: Implement .iotlb_sync_map callback
  2025-09-27 22:39 ` [PATCH 2/2] drivers/arm-smmu-v3: " Daniel Mentz
  2025-09-29 11:58   ` Jason Gunthorpe
@ 2025-09-29 12:25   ` Mostafa Saleh
  1 sibling, 0 replies; 17+ messages in thread
From: Mostafa Saleh @ 2025-09-29 12:25 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: iommu, linux-kernel, Will Deacon, Pranjal Shrivastava,
	Liviu Dudau, Jason Gunthorpe, Rob Clark

On Sat, Sep 27, 2025 at 10:39:53PM +0000, Daniel Mentz wrote:
> Implement .iotlb_sync_map callback based on the new callback in the
> io-pgtable code.
> 
> Signed-off-by: Daniel Mentz <danielmentz@google.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 2a8b46b948f0..0ffcc6c8e4bf 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2541,6 +2541,7 @@ static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
>  	pgtbl_cfg = (struct io_pgtable_cfg) {
>  		.pgsize_bitmap	= smmu->pgsize_bitmap,
>  		.coherent_walk	= smmu->features & ARM_SMMU_FEAT_COHERENCY,
> +		.defer_sync_pte	= true,
>  		.tlb		= &arm_smmu_flush_ops,
>  		.iommu_dev	= smmu->dev,
>  	};
> @@ -3366,6 +3367,18 @@ static int arm_smmu_map_pages(struct iommu_domain *domain, unsigned long iova,
>  	return ops->map_pages(ops, iova, paddr, pgsize, pgcount, prot, gfp, mapped);
>  }
>  
> +static int arm_smmu_iotlb_sync_map(struct iommu_domain *domain,
> +				    unsigned long iova, size_t size)
> +{
> +	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
> +
> +	if (!ops || !ops->iotlb_sync_map)
> +		return 0;
> +
> +	ops->iotlb_sync_map(ops, iova, size);
> +	return 0;

Shouldn't that be
	return ops->iotlb_sync_map(ops, iova, size);

As it stands, we ignore the op's return value and mask it to 0.

Thanks,
Mostafa

> +}
> +
>  static size_t arm_smmu_unmap_pages(struct iommu_domain *domain, unsigned long iova,
>  				   size_t pgsize, size_t pgcount,
>  				   struct iommu_iotlb_gather *gather)
> @@ -3700,6 +3713,7 @@ static const struct iommu_ops arm_smmu_ops = {
>  		.map_pages		= arm_smmu_map_pages,
>  		.unmap_pages		= arm_smmu_unmap_pages,
>  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
> +		.iotlb_sync_map		= arm_smmu_iotlb_sync_map,
>  		.iotlb_sync		= arm_smmu_iotlb_sync,
>  		.iova_to_phys		= arm_smmu_iova_to_phys,
>  		.free			= arm_smmu_domain_free_paging,
> -- 
> 2.51.0.570.gb178f27e6d-goog
> 


* Re: [PATCH 2/2] drivers/arm-smmu-v3: Implement .iotlb_sync_map callback
  2025-09-29 12:24     ` Mostafa Saleh
@ 2025-09-29 12:47       ` Jason Gunthorpe
  2025-09-30  0:23         ` Pranjal Shrivastava
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Gunthorpe @ 2025-09-29 12:47 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Daniel Mentz, iommu, linux-kernel, Will Deacon,
	Pranjal Shrivastava, Liviu Dudau, Rob Clark

On Mon, Sep 29, 2025 at 12:24:28PM +0000, Mostafa Saleh wrote:
> On Mon, Sep 29, 2025 at 08:58:03AM -0300, Jason Gunthorpe wrote:
> > On Sat, Sep 27, 2025 at 10:39:53PM +0000, Daniel Mentz wrote:
> > > @@ -3700,6 +3713,7 @@ static const struct iommu_ops arm_smmu_ops = {
> > >  		.map_pages		= arm_smmu_map_pages,
> > >  		.unmap_pages		= arm_smmu_unmap_pages,
> > >  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
> > > +		.iotlb_sync_map		= arm_smmu_iotlb_sync_map,
> > 
> > Shouldn't this avoid defining the op on coherent systems?
> 
> Does that mean we need to have 2 iommu_ops, one for
> coherent/non-coherent SMMUs, as both can be mixed in the same system.

Yes, I think you'd have to do it with two ops.

It just seems wrong to penalize the normal fast case for these
systems.

Jason


* Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
  2025-09-27 22:39 [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback Daniel Mentz
  2025-09-27 22:39 ` [PATCH 2/2] drivers/arm-smmu-v3: " Daniel Mentz
  2025-09-29 12:21 ` [PATCH 1/2] iommu/io-pgtable-arm: " Mostafa Saleh
@ 2025-09-29 19:57 ` Pranjal Shrivastava
  2025-09-29 20:42   ` Daniel Mentz
  2 siblings, 1 reply; 17+ messages in thread
From: Pranjal Shrivastava @ 2025-09-29 19:57 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: iommu, linux-kernel, Will Deacon, Mostafa Saleh, Liviu Dudau,
	Jason Gunthorpe, Rob Clark

On Sat, Sep 27, 2025 at 10:39:52PM +0000, Daniel Mentz wrote:
> On systems with a non-coherent SMMU, the CPU must perform cache
> maintenance operations (CMOs) after updating page table entries (PTEs).
> This ensures that the SMMU reads the latest version of the descriptors
> and not stale data from memory.
> 
> This requirement can lead to significant performance overhead,
> especially when mapping long scatter-gather lists where each sg entry
> maps into an iommu_map() call that only covers 4KB of address space.
> 
> In such scenarios, each small mapping operation modifies a single 8-byte
> PTE but triggers a cache clean for the entire cache line (e.g., 64
> bytes). This results in the same cache line being cleaned repeatedly,
> once for each PTE it contains.
> 
> A more efficient implementation performs the cache clean operation only
> after updating all descriptors that are co-located in the same cache
> line. This patch introduces a mechanism to defer and batch the cache
> maintenance:
> 
> A new boolean flag, defer_sync_pte, is added to struct io_pgtable_cfg.
> When this flag is set, the arm-lpae backend will skip the cache sync
> operation for leaf entries within its .map_pages implementation.
> 
> A new callback, .iotlb_sync_map, is added to struct io_pgtable_ops.
> After performing a series of mapping operations, the caller is
> responsible for invoking this callback for the entire IOVA range. This
> function then walks the page tables for the specified range and performs
> the necessary cache clean operations for all the modified PTEs.
> 
> This allows for a single, efficient cache maintenance operation to cover
> multiple PTE updates, significantly reducing overhead for workloads that
> perform many small, contiguous mappings.
> 
> Signed-off-by: Daniel Mentz <danielmentz@google.com>
> ---
>  drivers/iommu/io-pgtable-arm.c | 66 +++++++++++++++++++++++++++++++++-
>  include/linux/io-pgtable.h     |  7 ++++
>  2 files changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 7e8e2216c294..a970eefb07fb 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -353,7 +353,7 @@ static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
>  	for (i = 0; i < num_entries; i++)
>  		ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
>  
> -	if (!cfg->coherent_walk)
> +	if (!cfg->coherent_walk && !cfg->defer_sync_pte)
>  		__arm_lpae_sync_pte(ptep, num_entries, cfg);
>  }
>  
> @@ -582,6 +582,69 @@ static int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
>  	return ret;
>  }
>  
> +static int __arm_lpae_iotlb_sync_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> +			      size_t size, int lvl, arm_lpae_iopte *ptep)
> +{
> +	struct io_pgtable *iop = &data->iop;
> +	size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> +	int ret = 0, num_entries, max_entries;
> +	unsigned long iova_offset, sync_idx_start, sync_idx_end;
> +	int i, shift, synced_entries = 0;
> +
> +	shift = (ARM_LPAE_LVL_SHIFT(lvl - 1, data) + ARM_LPAE_PGD_IDX(lvl - 1, data));
> +	iova_offset = iova & ((1ULL << shift) - 1);
> +	sync_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
> +	sync_idx_end = (iova_offset + size + block_size - ARM_LPAE_GRANULE(data)) >>
> +		ARM_LPAE_LVL_SHIFT(lvl, data);
> +	max_entries = arm_lpae_max_entries(sync_idx_start, data);
> +	num_entries = min_t(unsigned long, sync_idx_end - sync_idx_start, max_entries);
> +	ptep += sync_idx_start;
> +
> +	if (lvl < (ARM_LPAE_MAX_LEVELS - 1)) {
> +		for (i = 0; i < num_entries; i++) {
> +			arm_lpae_iopte pte = READ_ONCE(ptep[i]);
> +			unsigned long synced;
> +
> +			WARN_ON(!pte);
> +
> +			if (iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE) {
> +				int n = i - synced_entries;
> +
> +				if (n) {
> +					__arm_lpae_sync_pte(&ptep[synced_entries], n, &iop->cfg);
> +					synced_entries += n;
> +				}
> +				ret = __arm_lpae_iotlb_sync_map(data, iova, size, lvl + 1,
> +								iopte_deref(pte, data));

I think we must check the return value here and break out of the loop on
error. Otherwise, we might bury a failure by continuing the loop.
We should add something like:

if (ret)
	break;

> +				synced_entries++;
> +			}
> +			synced = block_size - (iova & (block_size - 1));
> +			size -= synced;
> +			iova += synced;
> +		}
> +	}
> +
> +	if (synced_entries != num_entries)
> +		__arm_lpae_sync_pte(&ptep[synced_entries], num_entries - synced_entries, &iop->cfg);
> +
> +	return ret;
> +}
> +
> +static int arm_lpae_iotlb_sync_map(struct io_pgtable_ops *ops, unsigned long iova,
> +			    size_t size)
> +{
> +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> +	arm_lpae_iopte *ptep = data->pgd;
> +	int lvl = data->start_level;
> +	long iaext = (s64)iova >> cfg->ias;
> +
> +	WARN_ON(!size);
> +	WARN_ON(iaext);
> +
> +	return __arm_lpae_iotlb_sync_map(data, iova, size, lvl, ptep);
> +}
> +
>  static void __arm_lpae_free_pgtable(struct arm_lpae_io_pgtable *data, int lvl,
>  				    arm_lpae_iopte *ptep)
>  {
> @@ -949,6 +1012,7 @@ arm_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg)
>  	data->iop.ops = (struct io_pgtable_ops) {
>  		.map_pages	= arm_lpae_map_pages,
>  		.unmap_pages	= arm_lpae_unmap_pages,
> +		.iotlb_sync_map	= cfg->coherent_walk ? NULL : arm_lpae_iotlb_sync_map,
>  		.iova_to_phys	= arm_lpae_iova_to_phys,
>  		.read_and_clear_dirty = arm_lpae_read_and_clear_dirty,
>  		.pgtable_walk	= arm_lpae_pgtable_walk,
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index 138fbd89b1e6..c4927dbc0f61 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -57,6 +57,9 @@ struct iommu_flush_ops {
>   * @oas:           Output address (paddr) size, in bits.
>   * @coherent_walk  A flag to indicate whether or not page table walks made
>   *                 by the IOMMU are coherent with the CPU caches.
> + * @defer_sync_pte A flag to indicate whether pte sync operations for leaf
> + *                 entries shall be skipped during calls to .map_pages. A
> + *                 subsequent call to .iotlb_sync_map is required.
>   * @tlb:           TLB management callbacks for this set of tables.
>   * @iommu_dev:     The device representing the DMA configuration for the
>   *                 page table walker.
> @@ -110,6 +113,7 @@ struct io_pgtable_cfg {
>  	unsigned int			ias;
>  	unsigned int			oas;
>  	bool				coherent_walk;
> +	bool				defer_sync_pte;
>  	const struct iommu_flush_ops	*tlb;
>  	struct device			*iommu_dev;
>  
> @@ -204,6 +208,7 @@ struct arm_lpae_io_pgtable_walk_data {
>   * @unmap_pages:  Unmap a range of virtually contiguous pages of the same size.
>   * @iova_to_phys: Translate iova to physical address.
>   * @pgtable_walk: (optional) Perform a page table walk for a given iova.
> + * @iotlb_sync_map: Sync ptes (Required for non-coherent page table walks)
>   *
>   * These functions map directly onto the iommu_ops member functions with
>   * the same names.
> @@ -222,6 +227,8 @@ struct io_pgtable_ops {
>  				    unsigned long iova, size_t size,
>  				    unsigned long flags,
>  				    struct iommu_dirty_bitmap *dirty);
> +	int (*iotlb_sync_map)(struct io_pgtable_ops *ops, unsigned long iova,
> +			      size_t size);
>  };
>  
>  /**
> -- 
> 2.51.0.570.gb178f27e6d-goog
> 

Thanks,
Praan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
  2025-09-29 19:57 ` Pranjal Shrivastava
@ 2025-09-29 20:42   ` Daniel Mentz
  2025-09-29 22:24     ` Pranjal Shrivastava
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Mentz @ 2025-09-29 20:42 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: iommu, linux-kernel, Will Deacon, Mostafa Saleh, Liviu Dudau,
	Jason Gunthorpe, Rob Clark

On Mon, Sep 29, 2025 at 12:57 PM Pranjal Shrivastava <praan@google.com> wrote:
>
> On Sat, Sep 27, 2025 at 10:39:52PM +0000, Daniel Mentz wrote:
> > +static int __arm_lpae_iotlb_sync_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> > +                           size_t size, int lvl, arm_lpae_iopte *ptep)
> > +{
> > +     struct io_pgtable *iop = &data->iop;
> > +     size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> > +     int ret = 0, num_entries, max_entries;
> > +     unsigned long iova_offset, sync_idx_start, sync_idx_end;
> > +     int i, shift, synced_entries = 0;
> > +
> > +     shift = (ARM_LPAE_LVL_SHIFT(lvl - 1, data) + ARM_LPAE_PGD_IDX(lvl - 1, data));
> > +     iova_offset = iova & ((1ULL << shift) - 1);
> > +     sync_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
> > +     sync_idx_end = (iova_offset + size + block_size - ARM_LPAE_GRANULE(data)) >>
> > +             ARM_LPAE_LVL_SHIFT(lvl, data);
> > +     max_entries = arm_lpae_max_entries(sync_idx_start, data);
> > +     num_entries = min_t(unsigned long, sync_idx_end - sync_idx_start, max_entries);
> > +     ptep += sync_idx_start;
> > +
> > +     if (lvl < (ARM_LPAE_MAX_LEVELS - 1)) {
> > +             for (i = 0; i < num_entries; i++) {
> > +                     arm_lpae_iopte pte = READ_ONCE(ptep[i]);
> > +                     unsigned long synced;
> > +
> > +                     WARN_ON(!pte);
> > +
> > +                     if (iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE) {
> > +                             int n = i - synced_entries;
> > +
> > +                             if (n) {
> > +                                     __arm_lpae_sync_pte(&ptep[synced_entries], n, &iop->cfg);
> > +                                     synced_entries += n;
> > +                             }
> > +                             ret = __arm_lpae_iotlb_sync_map(data, iova, size, lvl + 1,
> > +                                                             iopte_deref(pte, data));
>
> I think we must check the returned value here and break the loop on
> error. Otherwise, we might bury a failure by continuing the loop.
> We should add something like:
>
> if (ret)
>         break;
>

Now, I'm realising that __arm_lpae_iotlb_sync_map always returns 0, in
which case I could change the return type to void. Would that work for
you?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
  2025-09-29 12:21 ` [PATCH 1/2] iommu/io-pgtable-arm: " Mostafa Saleh
@ 2025-09-29 21:00   ` Daniel Mentz
  2025-09-30  9:10     ` Mostafa Saleh
  0 siblings, 1 reply; 17+ messages in thread
From: Daniel Mentz @ 2025-09-29 21:00 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, linux-kernel, Will Deacon, Pranjal Shrivastava,
	Liviu Dudau, Jason Gunthorpe, Rob Clark

On Mon, Sep 29, 2025 at 5:21 AM Mostafa Saleh <smostafa@google.com> wrote:
>
> On Sat, Sep 27, 2025 at 10:39:52PM +0000, Daniel Mentz wrote:
> > On systems with a non-coherent SMMU, the CPU must perform cache
> > maintenance operations (CMOs) after updating page table entries (PTEs).
> > This ensures that the SMMU reads the latest version of the descriptors
> > and not stale data from memory.
> >
> > This requirement can lead to significant performance overhead,
> > especially when mapping long scatter-gather lists where each sg entry
> > maps into an iommu_map() call that only covers 4KB of address space.
> >
> > In such scenarios, each small mapping operation modifies a single 8-byte
> > PTE but triggers a cache clean for the entire cache line (e.g., 64
> > bytes). This results in the same cache line being cleaned repeatedly,
> > once for each PTE it contains.
> >
> > A more efficient implementation performs the cache clean operation only
> > after updating all descriptors that are co-located in the same cache
> > line. This patch introduces a mechanism to defer and batch the cache
> > maintenance:
> >
> > A new boolean flag, defer_sync_pte, is added to struct io_pgtable_cfg.
> > When this flag is set, the arm-lpae backend will skip the cache sync
> > operation for leaf entries within its .map_pages implementation.
> >
> > A new callback, .iotlb_sync_map, is added to struct io_pgtable_ops.
> > After performing a series of mapping operations, the caller is
> > responsible for invoking this callback for the entire IOVA range. This
> > function then walks the page tables for the specified range and performs
> > the necessary cache clean operations for all the modified PTEs.
> >
> > This allows for a single, efficient cache maintenance operation to cover
> > multiple PTE updates, significantly reducing overhead for workloads that
> > perform many small, contiguous mappings.
> >
> > Signed-off-by: Daniel Mentz <danielmentz@google.com>
> > ---
> >  drivers/iommu/io-pgtable-arm.c | 66 +++++++++++++++++++++++++++++++++-
> >  include/linux/io-pgtable.h     |  7 ++++
> >  2 files changed, 72 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > index 7e8e2216c294..a970eefb07fb 100644
> > --- a/drivers/iommu/io-pgtable-arm.c
> > +++ b/drivers/iommu/io-pgtable-arm.c
> > @@ -353,7 +353,7 @@ static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> >       for (i = 0; i < num_entries; i++)
> >               ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
> >
> > -     if (!cfg->coherent_walk)
> > +     if (!cfg->coherent_walk && !cfg->defer_sync_pte)
> >               __arm_lpae_sync_pte(ptep, num_entries, cfg);
> >  }
> >
> > @@ -582,6 +582,69 @@ static int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
> >       return ret;
> >  }
> >
> > +static int __arm_lpae_iotlb_sync_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> > +                           size_t size, int lvl, arm_lpae_iopte *ptep)
> > +{
> > +     struct io_pgtable *iop = &data->iop;
> > +     size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> > +     int ret = 0, num_entries, max_entries;
> > +     unsigned long iova_offset, sync_idx_start, sync_idx_end;
> > +     int i, shift, synced_entries = 0;
> > +
> > +     shift = (ARM_LPAE_LVL_SHIFT(lvl - 1, data) + ARM_LPAE_PGD_IDX(lvl - 1, data));
> > +     iova_offset = iova & ((1ULL << shift) - 1);
> > +     sync_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
> > +     sync_idx_end = (iova_offset + size + block_size - ARM_LPAE_GRANULE(data)) >>
> > +             ARM_LPAE_LVL_SHIFT(lvl, data);
> > +     max_entries = arm_lpae_max_entries(sync_idx_start, data);
> > +     num_entries = min_t(unsigned long, sync_idx_end - sync_idx_start, max_entries);
> > +     ptep += sync_idx_start;
> > +
> > +     if (lvl < (ARM_LPAE_MAX_LEVELS - 1)) {
> > +             for (i = 0; i < num_entries; i++) {
> > +                     arm_lpae_iopte pte = READ_ONCE(ptep[i]);
> > +                     unsigned long synced;
> > +
> > +                     WARN_ON(!pte);
> > +
> > +                     if (iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE) {
> > +                             int n = i - synced_entries;
> > +
> > +                             if (n) {
> > +                                     __arm_lpae_sync_pte(&ptep[synced_entries], n, &iop->cfg);
> > +                                     synced_entries += n;
> > +                             }
> > +                             ret = __arm_lpae_iotlb_sync_map(data, iova, size, lvl + 1,
> > +                                                             iopte_deref(pte, data));
> > +                             synced_entries++;
> > +                     }
> > +                     synced = block_size - (iova & (block_size - 1));
> > +                     size -= synced;
> > +                     iova += synced;
> > +             }
> > +     }
> > +
> > +     if (synced_entries != num_entries)
> > +             __arm_lpae_sync_pte(&ptep[synced_entries], num_entries - synced_entries, &iop->cfg);
> > +
> > +     return ret;
> > +}
>
> Can't we rely on the existing generic table walker "__arm_lpae_iopte_walk",
> which is already used for iova_to_phys and the dirty bit, instead of writing a new one?

The performance gains of .iotlb_sync_map are achieved by performing
CMOs on a range of descriptors as opposed to individually on each
descriptor in isolation. The function __arm_lpae_iopte_walk is
inherently incompatible with this, because it calls the .visit
callback once for each descriptor it finds in the specified range. I
guess I could work around this limitation by saving some state in
io_pgtable_walk_data and developing a .visit function that tries to
coalesce individual descriptors into contiguous ranges and delays CMOs
until it finds a break in continuity. I'm afraid, though, that that
might hurt performance significantly.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
  2025-09-29 20:42   ` Daniel Mentz
@ 2025-09-29 22:24     ` Pranjal Shrivastava
  0 siblings, 0 replies; 17+ messages in thread
From: Pranjal Shrivastava @ 2025-09-29 22:24 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: iommu, linux-kernel, Will Deacon, Mostafa Saleh, Liviu Dudau,
	Jason Gunthorpe, Rob Clark

On Mon, Sep 29, 2025 at 01:42:29PM -0700, Daniel Mentz wrote:
> On Mon, Sep 29, 2025 at 12:57 PM Pranjal Shrivastava <praan@google.com> wrote:
> >
> > On Sat, Sep 27, 2025 at 10:39:52PM +0000, Daniel Mentz wrote:
> > > +static int __arm_lpae_iotlb_sync_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> > > +                           size_t size, int lvl, arm_lpae_iopte *ptep)
> > > +{
> > > +     struct io_pgtable *iop = &data->iop;
> > > +     size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> > > +     int ret = 0, num_entries, max_entries;
> > > +     unsigned long iova_offset, sync_idx_start, sync_idx_end;
> > > +     int i, shift, synced_entries = 0;
> > > +
> > > +     shift = (ARM_LPAE_LVL_SHIFT(lvl - 1, data) + ARM_LPAE_PGD_IDX(lvl - 1, data));
> > > +     iova_offset = iova & ((1ULL << shift) - 1);
> > > +     sync_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
> > > +     sync_idx_end = (iova_offset + size + block_size - ARM_LPAE_GRANULE(data)) >>
> > > +             ARM_LPAE_LVL_SHIFT(lvl, data);
> > > +     max_entries = arm_lpae_max_entries(sync_idx_start, data);
> > > +     num_entries = min_t(unsigned long, sync_idx_end - sync_idx_start, max_entries);
> > > +     ptep += sync_idx_start;
> > > +
> > > +     if (lvl < (ARM_LPAE_MAX_LEVELS - 1)) {
> > > +             for (i = 0; i < num_entries; i++) {
> > > +                     arm_lpae_iopte pte = READ_ONCE(ptep[i]);
> > > +                     unsigned long synced;
> > > +
> > > +                     WARN_ON(!pte);
> > > +
> > > +                     if (iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE) {
> > > +                             int n = i - synced_entries;
> > > +
> > > +                             if (n) {
> > > +                                     __arm_lpae_sync_pte(&ptep[synced_entries], n, &iop->cfg);
> > > +                                     synced_entries += n;
> > > +                             }
> > > +                             ret = __arm_lpae_iotlb_sync_map(data, iova, size, lvl + 1,
> > > +                                                             iopte_deref(pte, data));
> >
> > I think we must check the returned value here and break the loop on
> > error. Otherwise, we might bury a failure by continuing the loop.
> > We should add something like:
> >
> > if (ret)
> >         break;
> >
> 
> Now, I'm realising that __arm_lpae_iotlb_sync_map always returns 0, in
> which case I could change the return type to void. Would that work for
> you?

Ah, yes, I just realized that too. I guess we can simply change the
return type to void.

Thanks,
Praan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] drivers/arm-smmu-v3: Implement .iotlb_sync_map callback
  2025-09-29 12:47       ` Jason Gunthorpe
@ 2025-09-30  0:23         ` Pranjal Shrivastava
  2025-09-30  9:27           ` Mostafa Saleh
  0 siblings, 1 reply; 17+ messages in thread
From: Pranjal Shrivastava @ 2025-09-30  0:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mostafa Saleh, Daniel Mentz, iommu, linux-kernel, Will Deacon,
	Liviu Dudau, Rob Clark

On Mon, Sep 29, 2025 at 09:47:19AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 29, 2025 at 12:24:28PM +0000, Mostafa Saleh wrote:
> > On Mon, Sep 29, 2025 at 08:58:03AM -0300, Jason Gunthorpe wrote:
> > > On Sat, Sep 27, 2025 at 10:39:53PM +0000, Daniel Mentz wrote:
> > > > @@ -3700,6 +3713,7 @@ static const struct iommu_ops arm_smmu_ops = {
> > > >  		.map_pages		= arm_smmu_map_pages,
> > > >  		.unmap_pages		= arm_smmu_unmap_pages,
> > > >  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
> > > > +		.iotlb_sync_map		= arm_smmu_iotlb_sync_map,
> > > 
> > > Shouldn't this avoid defining the op on coherent systems?
> > 
> > Does that mean we need to have 2 iommu_ops, one for
> > coherent/non-coherent SMMUs, as both can be mixed in the same system.
> 
> Yes, I think you'd have to do it with two ops..
> 
> It just seems wrong to penalize the normal fast case for these
> systems.
> 

I see we plan to set defer_sync_pte = true always. What if we invoke the
ops->iotlb_sync_map() only for incoherent IOMMUs? Maybe something like:

static int arm_smmu_iotlb_sync_map(struct iommu_domain *domain,
				    unsigned long iova, size_t size)
{
	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
	bool is_coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;


	if (!ops || !ops->iotlb_sync_map || is_coherent)
		return 0;

	ops->iotlb_sync_map(ops, iova, size);
	return 0;
}

If needed we can push the coherency check to the io-pgtable op
iotlb_sync_map() as well. Just an idea..

> Jason

Thanks,
Praan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
  2025-09-29 21:00   ` Daniel Mentz
@ 2025-09-30  9:10     ` Mostafa Saleh
  2025-11-04 13:28       ` Will Deacon
  0 siblings, 1 reply; 17+ messages in thread
From: Mostafa Saleh @ 2025-09-30  9:10 UTC (permalink / raw)
  To: Daniel Mentz
  Cc: iommu, linux-kernel, Will Deacon, Pranjal Shrivastava,
	Liviu Dudau, Jason Gunthorpe, Rob Clark

On Mon, Sep 29, 2025 at 02:00:09PM -0700, Daniel Mentz wrote:
> On Mon, Sep 29, 2025 at 5:21 AM Mostafa Saleh <smostafa@google.com> wrote:
> >
> > On Sat, Sep 27, 2025 at 10:39:52PM +0000, Daniel Mentz wrote:
> > > On systems with a non-coherent SMMU, the CPU must perform cache
> > > maintenance operations (CMOs) after updating page table entries (PTEs).
> > > This ensures that the SMMU reads the latest version of the descriptors
> > > and not stale data from memory.
> > >
> > > This requirement can lead to significant performance overhead,
> > > especially when mapping long scatter-gather lists where each sg entry
> > > maps into an iommu_map() call that only covers 4KB of address space.
> > >
> > > In such scenarios, each small mapping operation modifies a single 8-byte
> > > PTE but triggers a cache clean for the entire cache line (e.g., 64
> > > bytes). This results in the same cache line being cleaned repeatedly,
> > > once for each PTE it contains.
> > >
> > > A more efficient implementation performs the cache clean operation only
> > > after updating all descriptors that are co-located in the same cache
> > > line. This patch introduces a mechanism to defer and batch the cache
> > > maintenance:
> > >
> > > A new boolean flag, defer_sync_pte, is added to struct io_pgtable_cfg.
> > > When this flag is set, the arm-lpae backend will skip the cache sync
> > > operation for leaf entries within its .map_pages implementation.
> > >
> > > A new callback, .iotlb_sync_map, is added to struct io_pgtable_ops.
> > > After performing a series of mapping operations, the caller is
> > > responsible for invoking this callback for the entire IOVA range. This
> > > function then walks the page tables for the specified range and performs
> > > the necessary cache clean operations for all the modified PTEs.
> > >
> > > This allows for a single, efficient cache maintenance operation to cover
> > > multiple PTE updates, significantly reducing overhead for workloads that
> > > perform many small, contiguous mappings.
> > >
> > > Signed-off-by: Daniel Mentz <danielmentz@google.com>
> > > ---
> > >  drivers/iommu/io-pgtable-arm.c | 66 +++++++++++++++++++++++++++++++++-
> > >  include/linux/io-pgtable.h     |  7 ++++
> > >  2 files changed, 72 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > > index 7e8e2216c294..a970eefb07fb 100644
> > > --- a/drivers/iommu/io-pgtable-arm.c
> > > +++ b/drivers/iommu/io-pgtable-arm.c
> > > @@ -353,7 +353,7 @@ static void __arm_lpae_init_pte(struct arm_lpae_io_pgtable *data,
> > >       for (i = 0; i < num_entries; i++)
> > >               ptep[i] = pte | paddr_to_iopte(paddr + i * sz, data);
> > >
> > > -     if (!cfg->coherent_walk)
> > > +     if (!cfg->coherent_walk && !cfg->defer_sync_pte)
> > >               __arm_lpae_sync_pte(ptep, num_entries, cfg);
> > >  }
> > >
> > > @@ -582,6 +582,69 @@ static int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
> > >       return ret;
> > >  }
> > >
> > > +static int __arm_lpae_iotlb_sync_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> > > +                           size_t size, int lvl, arm_lpae_iopte *ptep)
> > > +{
> > > +     struct io_pgtable *iop = &data->iop;
> > > +     size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> > > +     int ret = 0, num_entries, max_entries;
> > > +     unsigned long iova_offset, sync_idx_start, sync_idx_end;
> > > +     int i, shift, synced_entries = 0;
> > > +
> > > +     shift = (ARM_LPAE_LVL_SHIFT(lvl - 1, data) + ARM_LPAE_PGD_IDX(lvl - 1, data));
> > > +     iova_offset = iova & ((1ULL << shift) - 1);
> > > +     sync_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
> > > +     sync_idx_end = (iova_offset + size + block_size - ARM_LPAE_GRANULE(data)) >>
> > > +             ARM_LPAE_LVL_SHIFT(lvl, data);
> > > +     max_entries = arm_lpae_max_entries(sync_idx_start, data);
> > > +     num_entries = min_t(unsigned long, sync_idx_end - sync_idx_start, max_entries);
> > > +     ptep += sync_idx_start;
> > > +
> > > +     if (lvl < (ARM_LPAE_MAX_LEVELS - 1)) {
> > > +             for (i = 0; i < num_entries; i++) {
> > > +                     arm_lpae_iopte pte = READ_ONCE(ptep[i]);
> > > +                     unsigned long synced;
> > > +
> > > +                     WARN_ON(!pte);
> > > +
> > > +                     if (iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE) {
> > > +                             int n = i - synced_entries;
> > > +
> > > +                             if (n) {
> > > +                                     __arm_lpae_sync_pte(&ptep[synced_entries], n, &iop->cfg);
> > > +                                     synced_entries += n;
> > > +                             }
> > > +                             ret = __arm_lpae_iotlb_sync_map(data, iova, size, lvl + 1,
> > > +                                                             iopte_deref(pte, data));
> > > +                             synced_entries++;
> > > +                     }
> > > +                     synced = block_size - (iova & (block_size - 1));
> > > +                     size -= synced;
> > > +                     iova += synced;
> > > +             }
> > > +     }
> > > +
> > > +     if (synced_entries != num_entries)
> > > +             __arm_lpae_sync_pte(&ptep[synced_entries], num_entries - synced_entries, &iop->cfg);
> > > +
> > > +     return ret;
> > > +}
> >
> > Can't we rely on the existing generic table walker "__arm_lpae_iopte_walk",
> > which is already used for iova_to_phys and the dirty bit, instead of writing a new one?
> 
> The performance gains of .iotlb_sync_map are achieved by performing
> CMOs on a range of descriptors as opposed to individually on each
> descriptor in isolation. The function __arm_lpae_iopte_walk is
> inherently incompatible with this, because it calls the .visit
> callback once for each descriptor it finds in the specified range. I
> guess I could work around this limitation by saving some state in
> io_pgtable_walk_data and developing a .visit function that tries to
> coalesce individual descriptors into contiguous ranges and delays CMOs
> until it finds a break in continuity. I'm afraid, though, that that
> might hurt performance significantly.

Exactly, I think that would be the way. I don’t have a strong opinion,
but I’d avoid open coding a new walker unless it’s necessary.
Also, the current walker won’t do ranges; it needs some more changes.
I did that as part of the patch below (half of it doesn’t apply for this case):
https://lore.kernel.org/all/20241212180423.1578358-38-smostafa@google.com/

Thanks,
Mostafa


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] drivers/arm-smmu-v3: Implement .iotlb_sync_map callback
  2025-09-30  0:23         ` Pranjal Shrivastava
@ 2025-09-30  9:27           ` Mostafa Saleh
  2025-09-30 14:56             ` Pranjal Shrivastava
  0 siblings, 1 reply; 17+ messages in thread
From: Mostafa Saleh @ 2025-09-30  9:27 UTC (permalink / raw)
  To: Pranjal Shrivastava
  Cc: Jason Gunthorpe, Daniel Mentz, iommu, linux-kernel, Will Deacon,
	Liviu Dudau, Rob Clark

On Tue, Sep 30, 2025 at 12:23:50AM +0000, Pranjal Shrivastava wrote:
> On Mon, Sep 29, 2025 at 09:47:19AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 29, 2025 at 12:24:28PM +0000, Mostafa Saleh wrote:
> > > On Mon, Sep 29, 2025 at 08:58:03AM -0300, Jason Gunthorpe wrote:
> > > > On Sat, Sep 27, 2025 at 10:39:53PM +0000, Daniel Mentz wrote:
> > > > > @@ -3700,6 +3713,7 @@ static const struct iommu_ops arm_smmu_ops = {
> > > > >  		.map_pages		= arm_smmu_map_pages,
> > > > >  		.unmap_pages		= arm_smmu_unmap_pages,
> > > > >  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
> > > > > +		.iotlb_sync_map		= arm_smmu_iotlb_sync_map,
> > > > 
> > > > Shouldn't this avoid defining the op on coherent systems?
> > > 
> > > Does that mean we need to have 2 iommu_ops, one for
> > > coherent/non-coherent SMMUs, as both can be mixed in the same system.
> > 
> > Yes, I think you'd have to do it with two ops..
> > 
> > It just seems wrong to penalize the normal fast case for these
> > systems.
> > 
> 
> I see we plan to set defer_sync_pte = true always. What if we invoke the
> ops->iotlb_sync_map() only for incoherent IOMMUs? Maybe something like:
> 
> static int arm_smmu_iotlb_sync_map(struct iommu_domain *domain,
> 				    unsigned long iova, size_t size)
> {
> 	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
> 	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
> 	bool is_coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
> 
> 
> 	if (!ops || !ops->iotlb_sync_map || is_coherent)
> 		return 0;
> 
> 	ops->iotlb_sync_map(ops, iova, size);
> 	return 0;
> }
> 
> If needed we can push the coherency check to the io-pgtable op
> iotlb_sync_map() as well. Just an idea..
> 

iotlb_sync_map is already NULL for coherent SMMUs. I believe Jason's
point is that iommu_ops.default_domain_ops will still carry the extra
pointer, which the core code calls anyway and which immediately
returns, wasting some cycles.
To avoid this we can have 2 sets of the default_domain_ops for
coherent and non-coherent devices, to be chosen at domain alloc time.

Though, it would be interesting to measure how much overhead the
current approach has in practice, maybe through dma_map_benchmark?

Thanks,
Mostafa

> > Jason
> 
> Thanks,
> Praan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/2] drivers/arm-smmu-v3: Implement .iotlb_sync_map callback
  2025-09-30  9:27           ` Mostafa Saleh
@ 2025-09-30 14:56             ` Pranjal Shrivastava
  0 siblings, 0 replies; 17+ messages in thread
From: Pranjal Shrivastava @ 2025-09-30 14:56 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Jason Gunthorpe, Daniel Mentz, iommu, linux-kernel, Will Deacon,
	Liviu Dudau, Rob Clark

On Tue, Sep 30, 2025 at 09:27:21AM +0000, Mostafa Saleh wrote:
> On Tue, Sep 30, 2025 at 12:23:50AM +0000, Pranjal Shrivastava wrote:
> > On Mon, Sep 29, 2025 at 09:47:19AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 29, 2025 at 12:24:28PM +0000, Mostafa Saleh wrote:
> > > > On Mon, Sep 29, 2025 at 08:58:03AM -0300, Jason Gunthorpe wrote:
> > > > > On Sat, Sep 27, 2025 at 10:39:53PM +0000, Daniel Mentz wrote:
> > > > > > @@ -3700,6 +3713,7 @@ static const struct iommu_ops arm_smmu_ops = {
> > > > > >  		.map_pages		= arm_smmu_map_pages,
> > > > > >  		.unmap_pages		= arm_smmu_unmap_pages,
> > > > > >  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
> > > > > > +		.iotlb_sync_map		= arm_smmu_iotlb_sync_map,
> > > > > 
> > > > > Shouldn't this avoid defining the op on coherent systems?
> > > > 
> > > > Does that mean we need to have 2 iommu_ops, one for
> > > > coherent/non-coherent SMMUs, as both can be mixed in the same system.
> > > 
> > > Yes, I think you'd have to do it with two ops..
> > > 
> > > It just seems wrong to penalize the normal fast case for these
> > > systems.
> > > 
> > 
> > I see we plan to set defer_sync_pte = true always. What if we invoke the
> > ops->iotlb_sync_map() only for incoherent IOMMUs? Maybe something like:
> > 
> > static int arm_smmu_iotlb_sync_map(struct iommu_domain *domain,
> > 				    unsigned long iova, size_t size)
> > {
> > 	struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
> > 	struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
> > 	bool is_coherent = smmu->features & ARM_SMMU_FEAT_COHERENCY;
> > 
> > 
> > 	if (!ops || !ops->iotlb_sync_map || is_coherent)
> > 		return 0;
> > 
> > 	ops->iotlb_sync_map(ops, iova, size);
> > 	return 0;
> > }
> > 
> > If needed we can push the coherency check to the io-pgtable op
> > iotlb_sync_map() as well. Just an idea..
> > 
> 
> iotlb_sync_map is already NULL for coherent SMMUs. I believe Jason's
> point is that iommu_ops.default_domain_ops will still carry the extra
> pointer, which the core code calls anyway and which immediately
> returns, wasting some cycles.

Ohh okay, I see.

> To avoid this we can have 2 sets of the default_domain_ops for
> coherent and non-coherent devices, to be chosen at domain alloc time.
> 

I guess it'd be better to have non-coherent def domain ops then.

> Though, it would be interesting to measure how much overhead the
> current approach has in practice, maybe through dma_map_benchmark?
>

Yes, dma_map_benchmark can be used, but its results won't reflect the
impact on scatter-gather workloads since the benchmark doesn't cover
dma_map_sg, IIRC. Even a small per-call regression may get amplified
at scale, though.

Thanks,
Praan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
  2025-09-30  9:10     ` Mostafa Saleh
@ 2025-11-04 13:28       ` Will Deacon
  2025-11-04 13:37         ` Jason Gunthorpe
  0 siblings, 1 reply; 17+ messages in thread
From: Will Deacon @ 2025-11-04 13:28 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Daniel Mentz, iommu, linux-kernel, Pranjal Shrivastava,
	Liviu Dudau, Jason Gunthorpe, Rob Clark

On Tue, Sep 30, 2025 at 09:10:44AM +0000, Mostafa Saleh wrote:
> On Mon, Sep 29, 2025 at 02:00:09PM -0700, Daniel Mentz wrote:
> > On Mon, Sep 29, 2025 at 5:21 AM Mostafa Saleh <smostafa@google.com> wrote:
> > > On Sat, Sep 27, 2025 at 10:39:52PM +0000, Daniel Mentz wrote:
> > > > @@ -582,6 +582,69 @@ static int arm_lpae_map_pages(struct io_pgtable_ops *ops, unsigned long iova,
> > > >       return ret;
> > > >  }
> > > >
> > > > +static int __arm_lpae_iotlb_sync_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
> > > > +                           size_t size, int lvl, arm_lpae_iopte *ptep)
> > > > +{
> > > > +     struct io_pgtable *iop = &data->iop;
> > > > +     size_t block_size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> > > > +     int ret = 0, num_entries, max_entries;
> > > > +     unsigned long iova_offset, sync_idx_start, sync_idx_end;
> > > > +     int i, shift, synced_entries = 0;
> > > > +
> > > > +     shift = (ARM_LPAE_LVL_SHIFT(lvl - 1, data) + ARM_LPAE_PGD_IDX(lvl - 1, data));
> > > > +     iova_offset = iova & ((1ULL << shift) - 1);
> > > > +     sync_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
> > > > +     sync_idx_end = (iova_offset + size + block_size - ARM_LPAE_GRANULE(data)) >>
> > > > +             ARM_LPAE_LVL_SHIFT(lvl, data);
> > > > +     max_entries = arm_lpae_max_entries(sync_idx_start, data);
> > > > +     num_entries = min_t(unsigned long, sync_idx_end - sync_idx_start, max_entries);
> > > > +     ptep += sync_idx_start;
> > > > +
> > > > +     if (lvl < (ARM_LPAE_MAX_LEVELS - 1)) {
> > > > +             for (i = 0; i < num_entries; i++) {
> > > > +                     arm_lpae_iopte pte = READ_ONCE(ptep[i]);
> > > > +                     unsigned long synced;
> > > > +
> > > > +                     WARN_ON(!pte);
> > > > +
> > > > +                     if (iopte_type(pte) == ARM_LPAE_PTE_TYPE_TABLE) {
> > > > +                             int n = i - synced_entries;
> > > > +
> > > > +                             if (n) {
> > > > +                                     __arm_lpae_sync_pte(&ptep[synced_entries], n, &iop->cfg);
> > > > +                                     synced_entries += n;
> > > > +                             }
> > > > +                             ret = __arm_lpae_iotlb_sync_map(data, iova, size, lvl + 1,
> > > > +                                                             iopte_deref(pte, data));
> > > > +                             synced_entries++;
> > > > +                     }
> > > > +                     synced = block_size - (iova & (block_size - 1));
> > > > +                     size -= synced;
> > > > +                     iova += synced;
> > > > +             }
> > > > +     }
> > > > +
> > > > +     if (synced_entries != num_entries)
> > > > +             __arm_lpae_sync_pte(&ptep[synced_entries], num_entries - synced_entries, &iop->cfg);
> > > > +
> > > > +     return ret;
> > > > +}
> > >
> > > Can't we rely on the existing generic table walker "__arm_lpae_iopte_walk",
> > > which is already used for iova_to_phys and the dirty bit, instead of writing a new one?
> > 
> > The performance gains of .iotlb_sync_map are achieved by performing
> > CMOs on a range of descriptors as opposed to individually on each
> > descriptor in isolation. The function __arm_lpae_iopte_walk is
> > inherently incompatible with this, because it calls the .visit
> > callback once for each descriptor it finds in the specified range. I
> > guess I could work around this limitation by saving some state in
> > io_pgtable_walk_data and developing a .visit function that tries to
> > coalesce individual descriptors into contiguous ranges and delays CMOs
> > until it finds a break in continuity. I'm afraid, though, that that
> > might hurt performance significantly.
> 
> Exactly, I think that would be the way to go. I don't have a strong
> opinion, but I'd avoid open-coding a new walker unless it's necessary.
> Also, the current walker can't handle ranges yet; it needs some more
> changes. I did that as part of the patch below (half of which doesn't
> apply to this case):
> https://lore.kernel.org/all/20241212180423.1578358-38-smostafa@google.com/

I'm inclined to agree that it would be better to avoid open-coding a
new walker here, and if we're able to reuse or extend the generic
walker, that would be cleaner.

If that's not workable (due to Daniel's performance worries), another
option is to bring back the ->map_sg() hook (removed by d88e61faad52
("iommu: Remove the ->map_sg indirection")) and implement an optimised
version of that, preferably sharing as much code as possible with the
existing map path.

Will

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback
  2025-11-04 13:28       ` Will Deacon
@ 2025-11-04 13:37         ` Jason Gunthorpe
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Gunthorpe @ 2025-11-04 13:37 UTC (permalink / raw)
  To: Will Deacon
  Cc: Mostafa Saleh, Daniel Mentz, iommu, linux-kernel,
	Pranjal Shrivastava, Liviu Dudau, Rob Clark

On Tue, Nov 04, 2025 at 01:28:35PM +0000, Will Deacon wrote:
> If that's not workable (due to Daniel's performance worries), another
> option is to bring back the ->map_sg() hook (removed by d88e61faad52
> ("iommu: Remove the ->map_sg indirection")) and implement an optimised
> version of that, preferably sharing as much code as possible with the
> existing map path.

I've been broadly thinking of doing this. The benchmarks I've run
suggest there are decent gains in avoiding the re-walks, so we can
speed up the map_sg operation by working directly on the scatterlist
instead of re-walking the page tables for each successive sgent. That
cleanly solves the CMO issue as well.

I think the page table rework will land this cycle; that would be the
right time to attempt this, as we should not be trying to make the
duplicated page table code any more complicated.

Jason


end of thread

Thread overview: 17+ messages
2025-09-27 22:39 [PATCH 1/2] iommu/io-pgtable-arm: Implement .iotlb_sync_map callback Daniel Mentz
2025-09-27 22:39 ` [PATCH 2/2] drivers/arm-smmu-v3: " Daniel Mentz
2025-09-29 11:58   ` Jason Gunthorpe
2025-09-29 12:24     ` Mostafa Saleh
2025-09-29 12:47       ` Jason Gunthorpe
2025-09-30  0:23         ` Pranjal Shrivastava
2025-09-30  9:27           ` Mostafa Saleh
2025-09-30 14:56             ` Pranjal Shrivastava
2025-09-29 12:25   ` Mostafa Saleh
2025-09-29 12:21 ` [PATCH 1/2] iommu/io-pgtable-arm: " Mostafa Saleh
2025-09-29 21:00   ` Daniel Mentz
2025-09-30  9:10     ` Mostafa Saleh
2025-11-04 13:28       ` Will Deacon
2025-11-04 13:37         ` Jason Gunthorpe
2025-09-29 19:57 ` Pranjal Shrivastava
2025-09-29 20:42   ` Daniel Mentz
2025-09-29 22:24     ` Pranjal Shrivastava
