* [PATCH v9 1/7] iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA
2025-12-19 20:11 [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
@ 2025-12-19 20:11 ` Nicolin Chen
2026-01-23 9:49 ` Pranjal Shrivastava
2025-12-19 20:11 ` [PATCH v9 2/7] iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free() Nicolin Chen
` (6 subsequent siblings)
7 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2025-12-19 20:11 UTC (permalink / raw)
To: will
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
Both the ARM_SMMU_DOMAIN_S1 case and the SVA case use an ASID, requiring
ASID-based invalidation commands to flush the TLB.
Define an ARM_SMMU_DOMAIN_SVA stage to make it clear that the SVA case
shares the same path as the ARM_SMMU_DOMAIN_S1 case, which will become
part of the routine that builds a new per-domain invalidation array.
There is no functional change.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
3 files changed, 5 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index ae23aacc3840..5c0b38595d20 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -858,6 +858,7 @@ struct arm_smmu_master {
enum arm_smmu_domain_stage {
ARM_SMMU_DOMAIN_S1 = 0,
ARM_SMMU_DOMAIN_S2,
+ ARM_SMMU_DOMAIN_SVA,
};
struct arm_smmu_domain {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 59a480974d80..6097f1f540d8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -346,6 +346,7 @@ struct iommu_domain *arm_smmu_sva_domain_alloc(struct device *dev,
* ARM_SMMU_FEAT_RANGE_INV is present
*/
smmu_domain->domain.pgsize_bitmap = PAGE_SIZE;
+ smmu_domain->stage = ARM_SMMU_DOMAIN_SVA;
smmu_domain->smmu = smmu;
ret = xa_alloc(&arm_smmu_asid_xa, &asid, smmu_domain,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d16d35c78c06..9d9f24a08062 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3065,6 +3065,9 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev,
arm_smmu_install_ste_for_dev(master, &target);
arm_smmu_clear_cd(master, IOMMU_NO_PASID);
break;
+ default:
+ WARN_ON(true);
+ break;
}
arm_smmu_attach_commit(&state);
--
2.43.0
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v9 1/7] iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA
2025-12-19 20:11 ` [PATCH v9 1/7] iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA Nicolin Chen
@ 2026-01-23 9:49 ` Pranjal Shrivastava
0 siblings, 0 replies; 45+ messages in thread
From: Pranjal Shrivastava @ 2026-01-23 9:49 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, jean-philippe, robin.murphy, joro, jgg, balbirs,
miko.lenczewski, peterz, kevin.tian, linux-arm-kernel, iommu,
linux-kernel
On Fri, Dec 19, 2025 at 12:11:23PM -0800, Nicolin Chen wrote:
> Both the ARM_SMMU_DOMAIN_S1 case and the SVA case use an ASID, requiring
> ASID-based invalidation commands to flush the TLB.
>
> Define an ARM_SMMU_DOMAIN_SVA stage to make it clear that the SVA case
> shares the same path as the ARM_SMMU_DOMAIN_S1 case, which will become
> part of the routine that builds a new per-domain invalidation array.
>
> There is no functional change.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Acked-by: Balbir Singh <balbirs@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 1 +
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
> 3 files changed, 5 insertions(+)
>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Thanks,
Praan
^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v9 2/7] iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free()
2025-12-19 20:11 [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
2025-12-19 20:11 ` [PATCH v9 1/7] iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA Nicolin Chen
@ 2025-12-19 20:11 ` Nicolin Chen
2026-01-23 9:50 ` Pranjal Shrivastava
2025-12-19 20:11 ` [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
` (5 subsequent siblings)
7 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2025-12-19 20:11 UTC (permalink / raw)
To: will
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
There will soon be a few more things to free than the smmu_domain itself,
so keep a simple inline function in the header to share across files.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 5 +++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 5 +++--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 4 ++--
3 files changed, 10 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 5c0b38595d20..96a23ca633cb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -954,6 +954,11 @@ extern struct mutex arm_smmu_asid_lock;
struct arm_smmu_domain *arm_smmu_domain_alloc(void);
+static inline void arm_smmu_domain_free(struct arm_smmu_domain *smmu_domain)
+{
+ kfree(smmu_domain);
+}
+
void arm_smmu_clear_cd(struct arm_smmu_master *master, ioasid_t ssid);
struct arm_smmu_cd *arm_smmu_get_cd_ptr(struct arm_smmu_master *master,
u32 ssid);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 6097f1f540d8..440ad8cc07de 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -197,7 +197,8 @@ static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
static void arm_smmu_mmu_notifier_free(struct mmu_notifier *mn)
{
- kfree(container_of(mn, struct arm_smmu_domain, mmu_notifier));
+ arm_smmu_domain_free(
+ container_of(mn, struct arm_smmu_domain, mmu_notifier));
}
static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
@@ -365,6 +366,6 @@ struct iommu_domain *arm_smmu_sva_domain_alloc(struct device *dev,
err_asid:
xa_erase(&arm_smmu_asid_xa, smmu_domain->cd.asid);
err_free:
- kfree(smmu_domain);
+ arm_smmu_domain_free(smmu_domain);
return ERR_PTR(ret);
}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9d9f24a08062..e9759e8af0c0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2492,7 +2492,7 @@ static void arm_smmu_domain_free_paging(struct iommu_domain *domain)
ida_free(&smmu->vmid_map, cfg->vmid);
}
- kfree(smmu_domain);
+ arm_smmu_domain_free(smmu_domain);
}
static int arm_smmu_domain_finalise_s1(struct arm_smmu_device *smmu,
@@ -3359,7 +3359,7 @@ arm_smmu_domain_alloc_paging_flags(struct device *dev, u32 flags,
return &smmu_domain->domain;
err_free:
- kfree(smmu_domain);
+ arm_smmu_domain_free(smmu_domain);
return ERR_PTR(ret);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v9 2/7] iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free()
2025-12-19 20:11 ` [PATCH v9 2/7] iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free() Nicolin Chen
@ 2026-01-23 9:50 ` Pranjal Shrivastava
0 siblings, 0 replies; 45+ messages in thread
From: Pranjal Shrivastava @ 2026-01-23 9:50 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, jean-philippe, robin.murphy, joro, jgg, balbirs,
miko.lenczewski, peterz, kevin.tian, linux-arm-kernel, iommu,
linux-kernel
On Fri, Dec 19, 2025 at 12:11:24PM -0800, Nicolin Chen wrote:
> There will soon be a few more things to free than the smmu_domain itself,
> so keep a simple inline function in the header to share across files.
>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Acked-by: Balbir Singh <balbirs@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 5 +++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 5 +++--
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 4 ++--
> 3 files changed, 10 insertions(+), 4 deletions(-)
>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Thanks
Praan
^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-12-19 20:11 [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
2025-12-19 20:11 ` [PATCH v9 1/7] iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA Nicolin Chen
2025-12-19 20:11 ` [PATCH v9 2/7] iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free() Nicolin Chen
@ 2025-12-19 20:11 ` Nicolin Chen
2026-01-23 9:53 ` Pranjal Shrivastava
2026-01-23 17:03 ` Will Deacon
2025-12-19 20:11 ` [PATCH v9 4/7] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array Nicolin Chen
` (4 subsequent siblings)
7 siblings, 2 replies; 45+ messages in thread
From: Nicolin Chen @ 2025-12-19 20:11 UTC (permalink / raw)
To: will
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
From: Jason Gunthorpe <jgg@nvidia.com>
Create a new data structure to hold an array of invalidations that need to
be performed for the domain based on what masters are attached, to replace
the single smmu pointer and linked list of masters in the current design.
Each array entry holds one of the invalidation actions - S1_ASID, S2_VMID,
ATS or their variant with information to feed invalidation commands to HW.
It is structured so that multiple SMMUs can participate in the same array,
removing one key limitation of the current system.
To maximize performance, a sorted array is used as the data structure. It
allows grouping SYNCs together to parallelize invalidations. For instance,
it will group all the ATS entries after the ASID/VMID entry, so they will
all be pushed to the PCI devices in parallel with one SYNC.
To minimize the locking cost on the invalidation fast path (reader of the
invalidation array), the array is managed with RCU.
Provide a set of APIs to add/delete entries to/from an array, which cover
cannot-fail attach cases, e.g. attaching to arm_smmu_blocked_domain. Also
add kunit coverage for those APIs.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 98 +++++++
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c | 92 +++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 258 ++++++++++++++++++
3 files changed, 448 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 96a23ca633cb..a9441dc9070e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -649,6 +649,93 @@ struct arm_smmu_cmdq_batch {
int num;
};
+/*
+ * The order here also determines the sequence in which commands are sent to the
+ * command queue. E.g. TLBI must be done before ATC_INV.
+ */
+enum arm_smmu_inv_type {
+ INV_TYPE_S1_ASID,
+ INV_TYPE_S2_VMID,
+ INV_TYPE_S2_VMID_S1_CLEAR,
+ INV_TYPE_ATS,
+ INV_TYPE_ATS_FULL,
+};
+
+struct arm_smmu_inv {
+ struct arm_smmu_device *smmu;
+ u8 type;
+ u8 size_opcode;
+ u8 nsize_opcode;
+ u32 id; /* ASID or VMID or SID */
+ union {
+ size_t pgsize; /* ARM_SMMU_FEAT_RANGE_INV */
+ u32 ssid; /* INV_TYPE_ATS */
+ };
+
+ refcount_t users; /* users=0 to mark as a trash to be purged */
+};
+
+static inline bool arm_smmu_inv_is_ats(const struct arm_smmu_inv *inv)
+{
+ return inv->type == INV_TYPE_ATS || inv->type == INV_TYPE_ATS_FULL;
+}
+
+/**
+ * struct arm_smmu_invs - Per-domain invalidation array
+ * @max_invs: maximum capacity of the flexible array
+ * @num_invs: number of invalidations in the flexible array. May be smaller than
+ * @max_invs after a trailing trash entry is excluded, but must not be
+ * greater than @max_invs
+ * @num_trashes: number of trash entries in the array for arm_smmu_invs_purge().
+ * Must not be greater than @num_invs
+ * @rwlock: optional rwlock to fence ATS operations
+ * @has_ats: flag if the array contains an INV_TYPE_ATS or INV_TYPE_ATS_FULL
+ * @rcu: rcu head for kfree_rcu()
+ * @inv: flexible invalidation array
+ *
+ * The arm_smmu_invs is an RCU data structure. During a ->attach_dev callback,
+ * arm_smmu_invs_merge(), arm_smmu_invs_unref() and arm_smmu_invs_purge() will
+ * be used to allocate a new copy of an old array for addition and deletion in
+ * the old domain's and new domain's invs arrays.
+ *
+ * arm_smmu_invs_unref() mutates a given array by internally reducing the
+ * user counts of some given entries. This exists to support no-fail routines
+ * like attaching to an IOMMU_DOMAIN_BLOCKED, and can pair with a follow-up
+ * arm_smmu_invs_purge() call to generate a new clean array.
+ *
+ * Concurrent invalidation threads will push every invalidation described in the
+ * array into the command queue for each invalidation event. It is designed like
+ * this to optimize the invalidation fast path by avoiding locks.
+ *
+ * A domain can be shared across SMMU instances. When an instance gets removed,
+ * it would delete all the entries that belong to that SMMU instance. Then, a
+ * synchronize_rcu() would have to be called to sync the array, to prevent any
+ * concurrent invalidation thread accessing the old array from issuing commands
+ * to the command queue of a removed SMMU instance.
+ */
+struct arm_smmu_invs {
+ size_t max_invs;
+ size_t num_invs;
+ size_t num_trashes;
+ rwlock_t rwlock;
+ bool has_ats;
+ struct rcu_head rcu;
+ struct arm_smmu_inv inv[] __counted_by(max_invs);
+};
+
+static inline struct arm_smmu_invs *arm_smmu_invs_alloc(size_t num_invs)
+{
+ struct arm_smmu_invs *new_invs;
+
+ new_invs = kzalloc(struct_size(new_invs, inv, num_invs), GFP_KERNEL);
+ if (!new_invs)
+ return NULL;
+ new_invs->max_invs = num_invs;
+ new_invs->num_invs = num_invs;
+ rwlock_init(&new_invs->rwlock);
+ return new_invs;
+}
+
struct arm_smmu_evtq {
struct arm_smmu_queue q;
struct iopf_queue *iopf;
@@ -875,6 +962,8 @@ struct arm_smmu_domain {
struct iommu_domain domain;
+ struct arm_smmu_invs __rcu *invs;
+
/* List of struct arm_smmu_master_domain */
struct list_head devices;
spinlock_t devices_lock;
@@ -923,6 +1012,13 @@ void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
void arm_smmu_make_sva_cd(struct arm_smmu_cd *target,
struct arm_smmu_master *master, struct mm_struct *mm,
u16 asid);
+
+struct arm_smmu_invs *arm_smmu_invs_merge(struct arm_smmu_invs *invs,
+ struct arm_smmu_invs *to_merge);
+void arm_smmu_invs_unref(struct arm_smmu_invs *invs,
+ struct arm_smmu_invs *to_unref,
+ void (*free_fn)(struct arm_smmu_inv *inv));
+struct arm_smmu_invs *arm_smmu_invs_purge(struct arm_smmu_invs *invs);
#endif
struct arm_smmu_master_domain {
@@ -956,6 +1052,8 @@ struct arm_smmu_domain *arm_smmu_domain_alloc(void);
static inline void arm_smmu_domain_free(struct arm_smmu_domain *smmu_domain)
{
+ /* No concurrency with invalidation is possible at this point */
+ kfree(rcu_dereference_protected(smmu_domain->invs, true));
kfree(smmu_domain);
}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
index d2671bfd3798..238bfd328b5b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
@@ -567,6 +567,97 @@ static void arm_smmu_v3_write_cd_test_sva_release(struct kunit *test)
NUM_EXPECTED_SYNCS(2));
}
+static void arm_smmu_v3_invs_test_verify(struct kunit *test,
+ struct arm_smmu_invs *invs, int num,
+ const int *ids, const int *users)
+{
+ KUNIT_EXPECT_EQ(test, invs->num_invs, num);
+ while (num--) {
+ KUNIT_EXPECT_EQ(test, invs->inv[num].id, ids[num]);
+ KUNIT_EXPECT_EQ(test, refcount_read(&invs->inv[num].users),
+ users[num]);
+ }
+}
+
+static struct arm_smmu_invs invs1 = {
+ .num_invs = 3,
+ .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, },
+ { .type = INV_TYPE_S2_VMID_S1_CLEAR, .id = 1, },
+ { .type = INV_TYPE_ATS, .id = 3, }, },
+};
+
+static struct arm_smmu_invs invs2 = {
+ .num_invs = 3,
+ .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
+ { .type = INV_TYPE_ATS, .id = 4, },
+ { .type = INV_TYPE_ATS, .id = 5, }, },
+};
+
+static struct arm_smmu_invs invs3 = {
+ .num_invs = 3,
+ .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
+ { .type = INV_TYPE_ATS, .id = 5, }, /* recover a trash */
+ { .type = INV_TYPE_ATS, .id = 6, }, },
+};
+
+static void arm_smmu_v3_invs_test(struct kunit *test)
+{
+ const int results1[2][3] = { { 1, 1, 3, }, { 1, 1, 1, }, };
+ const int results2[2][5] = { { 1, 1, 3, 4, 5, }, { 2, 1, 1, 1, 1, }, };
+ const int results3[2][3] = { { 1, 1, 3, }, { 1, 1, 1, }, };
+ const int results4[2][5] = { { 1, 1, 3, 5, 6, }, { 2, 1, 1, 1, 1, }, };
+ const int results5[2][5] = { { 1, 1, 3, 5, 6, }, { 1, 0, 0, 1, 1, }, };
+ const int results6[2][3] = { { 1, 5, 6, }, { 1, 1, 1, }, };
+ struct arm_smmu_invs *test_a, *test_b;
+
+ /* New array */
+ test_a = arm_smmu_invs_alloc(0);
+ KUNIT_EXPECT_EQ(test, test_a->num_invs, 0);
+
+ /* Test1: merge invs1 (new array) */
+ test_b = arm_smmu_invs_merge(test_a, &invs1);
+ kfree(test_a);
+ arm_smmu_v3_invs_test_verify(test, test_b, ARRAY_SIZE(results1[0]),
+ results1[0], results1[1]);
+
+ /* Test2: merge invs2 (new array) */
+ test_a = arm_smmu_invs_merge(test_b, &invs2);
+ kfree(test_b);
+ arm_smmu_v3_invs_test_verify(test, test_a, ARRAY_SIZE(results2[0]),
+ results2[0], results2[1]);
+
+ /* Test3: unref invs2 (same array) */
+ arm_smmu_invs_unref(test_a, &invs2, NULL);
+ arm_smmu_v3_invs_test_verify(test, test_a, ARRAY_SIZE(results3[0]),
+ results3[0], results3[1]);
+ KUNIT_EXPECT_EQ(test, test_a->num_trashes, 0);
+
+ /* Test4: merge invs3 (new array) */
+ test_b = arm_smmu_invs_merge(test_a, &invs3);
+ kfree(test_a);
+ arm_smmu_v3_invs_test_verify(test, test_b, ARRAY_SIZE(results4[0]),
+ results4[0], results4[1]);
+
+ /* Test5: unref invs1 (same array) */
+ arm_smmu_invs_unref(test_b, &invs1, NULL);
+ arm_smmu_v3_invs_test_verify(test, test_b, ARRAY_SIZE(results5[0]),
+ results5[0], results5[1]);
+ KUNIT_EXPECT_EQ(test, test_b->num_trashes, 2);
+
+ /* Test6: purge test_b (new array) */
+ test_a = arm_smmu_invs_purge(test_b);
+ kfree(test_b);
+ arm_smmu_v3_invs_test_verify(test, test_a, ARRAY_SIZE(results6[0]),
+ results6[0], results6[1]);
+
+ /* Test7: unref invs3 (same array) */
+ arm_smmu_invs_unref(test_a, &invs3, NULL);
+ KUNIT_EXPECT_EQ(test, test_a->num_invs, 0);
+ KUNIT_EXPECT_EQ(test, test_a->num_trashes, 0);
+
+ kfree(test_a);
+}
+
static struct kunit_case arm_smmu_v3_test_cases[] = {
KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_abort),
KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_bypass),
@@ -590,6 +681,7 @@ static struct kunit_case arm_smmu_v3_test_cases[] = {
KUNIT_CASE(arm_smmu_v3_write_ste_test_s2_to_s1_stall),
KUNIT_CASE(arm_smmu_v3_write_cd_test_sva_clear),
KUNIT_CASE(arm_smmu_v3_write_cd_test_sva_release),
+ KUNIT_CASE(arm_smmu_v3_invs_test),
{},
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index e9759e8af0c0..39e89176fa87 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -26,6 +26,7 @@
#include <linux/pci.h>
#include <linux/pci-ats.h>
#include <linux/platform_device.h>
+#include <linux/sort.h>
#include <linux/string_choices.h>
#include <kunit/visibility.h>
#include <uapi/linux/iommufd.h>
@@ -1015,6 +1016,255 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
*/
}
+/* Invalidation array manipulation functions */
+static inline struct arm_smmu_inv *
+arm_smmu_invs_iter_next(struct arm_smmu_invs *invs, size_t next, size_t *idx)
+{
+ while (true) {
+ if (next >= invs->num_invs) {
+ *idx = next;
+ return NULL;
+ }
+ if (!refcount_read(&invs->inv[next].users)) {
+ next++;
+ continue;
+ }
+ *idx = next;
+ return &invs->inv[next];
+ }
+}
+
+/**
+ * arm_smmu_invs_for_each_entry - Iterate over all non-trash entries in invs
+ * @invs: the base invalidation array
+ * @idx: a stack variable of 'size_t', to store the array index
+ * @cur: a stack variable of 'struct arm_smmu_inv *'
+ */
+#define arm_smmu_invs_for_each_entry(invs, idx, cur) \
+ for (cur = arm_smmu_invs_iter_next(invs, 0, &(idx)); cur; \
+ cur = arm_smmu_invs_iter_next(invs, idx + 1, &(idx)))
+
+static int arm_smmu_inv_cmp(const struct arm_smmu_inv *inv_l,
+ const struct arm_smmu_inv *inv_r)
+{
+ if (inv_l->smmu != inv_r->smmu)
+ return cmp_int((uintptr_t)inv_l->smmu, (uintptr_t)inv_r->smmu);
+ if (inv_l->type != inv_r->type)
+ return cmp_int(inv_l->type, inv_r->type);
+ return cmp_int(inv_l->id, inv_r->id);
+}
+
+static inline int arm_smmu_invs_iter_next_cmp(struct arm_smmu_invs *invs_l,
+ size_t next_l, size_t *idx_l,
+ struct arm_smmu_invs *invs_r,
+ size_t next_r, size_t *idx_r)
+{
+ struct arm_smmu_inv *cur_l =
+ arm_smmu_invs_iter_next(invs_l, next_l, idx_l);
+
+ /*
+ * We have to update the idx_r manually, because the invs_r cannot call
+ * arm_smmu_invs_iter_next() as the invs_r never sets any users counter.
+ */
+ *idx_r = next_r;
+
+ /*
+ * Compare items of two sorted arrays. If one side is past the end of
+ * its array, return the other side to let it run out the iteration.
+ *
+ * If the left entry is empty, return 1 to pick the right entry.
+ * If the right entry is empty, return -1 to pick the left entry.
+ */
+ if (!cur_l)
+ return 1;
+ if (next_r >= invs_r->num_invs)
+ return -1;
+ return arm_smmu_inv_cmp(cur_l, &invs_r->inv[next_r]);
+}
+
+/**
+ * arm_smmu_invs_for_each_cmp - Iterate over two sorted arrays, comparing
+ * entries for arm_smmu_invs_merge() or arm_smmu_invs_unref()
+ * @invs_l: the base invalidation array
+ * @idx_l: a stack variable of 'size_t', to store the base array index
+ * @invs_r: the build_invs array as to_merge or to_unref
+ * @idx_r: a stack variable of 'size_t', to store the build_invs index
+ * @cmp: a stack variable of 'int', to store return value (-1, 0, or 1)
+ */
+#define arm_smmu_invs_for_each_cmp(invs_l, idx_l, invs_r, idx_r, cmp) \
+ for (idx_l = idx_r = 0, \
+ cmp = arm_smmu_invs_iter_next_cmp(invs_l, 0, &(idx_l), \
+ invs_r, 0, &(idx_r)); \
+ idx_l < invs_l->num_invs || idx_r < invs_r->num_invs; \
+ cmp = arm_smmu_invs_iter_next_cmp( \
+ invs_l, idx_l + (cmp <= 0 ? 1 : 0), &(idx_l), \
+ invs_r, idx_r + (cmp >= 0 ? 1 : 0), &(idx_r)))
+
+/**
+ * arm_smmu_invs_merge() - Merge @to_merge into @invs and generate a new array
+ * @invs: the base invalidation array
+ * @to_merge: an array of invalidations to merge
+ *
+ * Return: a newly allocated array on success, or ERR_PTR
+ *
+ * This function must be locked and serialized with arm_smmu_invs_unref() and
+ * arm_smmu_invs_purge(), but does not lockdep on any lock, for KUNIT tests.
+ *
+ * Both @invs and @to_merge must be sorted, to ensure the returned array will be
+ * sorted as well.
+ *
+ * Caller is responsible for freeing both @invs and the returned new array.
+ *
+ * Entries marked as trash will be purged in the returned array.
+ */
+VISIBLE_IF_KUNIT
+struct arm_smmu_invs *arm_smmu_invs_merge(struct arm_smmu_invs *invs,
+ struct arm_smmu_invs *to_merge)
+{
+ struct arm_smmu_invs *new_invs;
+ struct arm_smmu_inv *new;
+ size_t num_invs = 0;
+ size_t i, j;
+ int cmp;
+
+ arm_smmu_invs_for_each_cmp(invs, i, to_merge, j, cmp)
+ num_invs++;
+
+ new_invs = arm_smmu_invs_alloc(num_invs);
+ if (!new_invs)
+ return ERR_PTR(-ENOMEM);
+
+ new = new_invs->inv;
+ arm_smmu_invs_for_each_cmp(invs, i, to_merge, j, cmp) {
+ if (cmp < 0) {
+ *new = invs->inv[i];
+ } else if (cmp == 0) {
+ *new = invs->inv[i];
+ refcount_inc(&new->users);
+ } else {
+ *new = to_merge->inv[j];
+ refcount_set(&new->users, 1);
+ }
+
+ /*
+ * Check that the new array is sorted. This also validates that
+ * to_merge is sorted.
+ */
+ if (new != new_invs->inv)
+ WARN_ON_ONCE(arm_smmu_inv_cmp(new - 1, new) == 1);
+ if (arm_smmu_inv_is_ats(new))
+ new_invs->has_ats = true;
+ new++;
+ }
+
+ WARN_ON(new != new_invs->inv + new_invs->num_invs);
+
+ return new_invs;
+}
+EXPORT_SYMBOL_IF_KUNIT(arm_smmu_invs_merge);
+
+/**
+ * arm_smmu_invs_unref() - Find in @invs for all entries in @to_unref, decrease
+ * the user counts without deletions
+ * @invs: the base invalidation array
+ * @to_unref: an array of invalidations whose user counts will be decreased
+ * @free_fn: a callback function to invoke when an entry's user count drops
+ * to 0
+ *
+ * Return: the number of trash entries in the array, for arm_smmu_invs_purge()
+ *
+ * This function will not fail. Any entry with users=0 will be marked as trash.
+ * All trailing trash entries in the array will be dropped, and the size of the
+ * array will be trimmed accordingly. All trash entries in-between will remain
+ * in @invs until being completely deleted by the next arm_smmu_invs_merge()
+ * or arm_smmu_invs_purge() call.
+ *
+ * This function must be locked and serialized with arm_smmu_invs_merge() and
+ * arm_smmu_invs_purge(), but does not lockdep on any mutex, for KUNIT tests.
+ *
+ * Note that the final @invs->num_invs might not reflect the actual number of
+ * invalidations due to trash entries. Any reader should take the read lock to
+ * iterate each entry and check its users counter till the last entry.
+ */
+VISIBLE_IF_KUNIT
+void arm_smmu_invs_unref(struct arm_smmu_invs *invs,
+ struct arm_smmu_invs *to_unref,
+ void (*free_fn)(struct arm_smmu_inv *inv))
+{
+ unsigned long flags;
+ size_t num_invs = 0;
+ size_t i, j;
+ int cmp;
+
+ arm_smmu_invs_for_each_cmp(invs, i, to_unref, j, cmp) {
+ if (cmp < 0) {
+ /* not found in to_unref, leave alone */
+ num_invs = i + 1;
+ } else if (cmp == 0) {
+ /* same item */
+ if (!refcount_dec_and_test(&invs->inv[i].users)) {
+ num_invs = i + 1;
+ continue;
+ }
+
+ /* KUNIT test doesn't pass in a free_fn */
+ if (free_fn)
+ free_fn(&invs->inv[i]);
+ invs->num_trashes++;
+ } else {
+ /* item in to_unref is not in invs or already a trash */
+ WARN_ON(true);
+ }
+ }
+
+ /* Exclude any trailing trash */
+ invs->num_trashes -= invs->num_invs - num_invs;
+
+ /* The lock is required to fence concurrent ATS operations. */
+ write_lock_irqsave(&invs->rwlock, flags);
+ WRITE_ONCE(invs->num_invs, num_invs); /* Remove trailing trash entries */
+ write_unlock_irqrestore(&invs->rwlock, flags);
+}
+EXPORT_SYMBOL_IF_KUNIT(arm_smmu_invs_unref);
+
+/**
+ * arm_smmu_invs_purge() - Purge all the trash entries in the @invs
+ * @invs: the base invalidation array
+ *
+ * Return: a newly allocated array on success removing all the trash entries, or
+ * NULL if there is no trash entry in the array or there is a bug
+ *
+ * This function must be locked and serialized with arm_smmu_invs_merge() and
+ * arm_smmu_invs_unref(), but does not lockdep on any lock, for KUNIT tests.
+ *
+ * Caller is responsible for freeing both @invs and the returned new array.
+ */
+VISIBLE_IF_KUNIT
+struct arm_smmu_invs *arm_smmu_invs_purge(struct arm_smmu_invs *invs)
+{
+ struct arm_smmu_invs *new_invs;
+ struct arm_smmu_inv *inv;
+ size_t i, num_invs = 0;
+
+ if (WARN_ON(invs->num_invs < invs->num_trashes))
+ return NULL;
+ if (!invs->num_invs || !invs->num_trashes)
+ return NULL;
+
+ new_invs = arm_smmu_invs_alloc(invs->num_invs - invs->num_trashes);
+ if (!new_invs)
+ return ERR_PTR(-ENOMEM);
+
+ arm_smmu_invs_for_each_entry(invs, i, inv) {
+ new_invs->inv[num_invs] = *inv;
+ num_invs++;
+ }
+
+ WARN_ON(num_invs != new_invs->num_invs);
+ return new_invs;
+}
+EXPORT_SYMBOL_IF_KUNIT(arm_smmu_invs_purge);
+
/* Context descriptor manipulation functions */
void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
{
@@ -2462,13 +2712,21 @@ static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
struct arm_smmu_domain *arm_smmu_domain_alloc(void)
{
struct arm_smmu_domain *smmu_domain;
+ struct arm_smmu_invs *new_invs;
smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
if (!smmu_domain)
return ERR_PTR(-ENOMEM);
+ new_invs = arm_smmu_invs_alloc(0);
+ if (!new_invs) {
+ kfree(smmu_domain);
+ return ERR_PTR(-ENOMEM);
+ }
+
INIT_LIST_HEAD(&smmu_domain->devices);
spin_lock_init(&smmu_domain->devices_lock);
+ rcu_assign_pointer(smmu_domain->invs, new_invs);
return smmu_domain;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-12-19 20:11 ` [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
@ 2026-01-23 9:53 ` Pranjal Shrivastava
2026-01-23 17:03 ` Will Deacon
1 sibling, 0 replies; 45+ messages in thread
From: Pranjal Shrivastava @ 2026-01-23 9:53 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, jean-philippe, robin.murphy, joro, jgg, balbirs,
miko.lenczewski, peterz, kevin.tian, linux-arm-kernel, iommu,
linux-kernel
On Fri, Dec 19, 2025 at 12:11:25PM -0800, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> Create a new data structure to hold an array of invalidations that need to
> be performed for the domain based on what masters are attached, to replace
> the single smmu pointer and linked list of masters in the current design.
>
> Each array entry holds one of the invalidation actions - S1_ASID, S2_VMID,
> ATS or their variant with information to feed invalidation commands to HW.
> It is structured so that multiple SMMUs can participate in the same array,
> removing one key limitation of the current system.
>
> To maximize performance, a sorted array is used as the data structure. It
> allows grouping SYNCs together to parallelize invalidations. For instance,
> it will group all the ATS entries after the ASID/VMID entry, so they will
> all be pushed to the PCI devices in parallel with one SYNC.
>
> To minimize the locking cost on the invalidation fast path (reader of the
> invalidation array), the array is managed with RCU.
>
> Provide a set of APIs to add/delete entries to/from an array, which cover
> cannot-fail attach cases, e.g. attaching to arm_smmu_blocked_domain. Also
> add kunit coverage for those APIs.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 98 +++++++
> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c | 92 +++++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 258 ++++++++++++++++++
> 3 files changed, 448 insertions(+)
>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Thanks,
Praan
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-12-19 20:11 ` [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
2026-01-23 9:53 ` Pranjal Shrivastava
@ 2026-01-23 17:03 ` Will Deacon
2026-01-23 17:35 ` Nicolin Chen
1 sibling, 1 reply; 45+ messages in thread
From: Will Deacon @ 2026-01-23 17:03 UTC (permalink / raw)
To: Nicolin Chen
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
On Fri, Dec 19, 2025 at 12:11:25PM -0800, Nicolin Chen wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
>
> Create a new data structure to hold an array of invalidations that need to
> be performed for the domain based on what masters are attached, to replace
> the single smmu pointer and linked list of masters in the current design.
>
> Each array entry holds one of the invalidation actions - S1_ASID, S2_VMID,
> ATS or their variant with information to feed invalidation commands to HW.
> It is structured so that multiple SMMUs can participate in the same array,
> removing one key limitation of the current system.
>
> To maximize performance, a sorted array is used as the data structure. It
> allows grouping SYNCs together to parallelize invalidations. For instance,
> it will group all the ATS entries after the ASID/VMID entry, so they will
> all be pushed to the PCI devices in parallel with one SYNC.
>
> To minimize the locking cost on the invalidation fast path (reader of the
> invalidation array), the array is managed with RCU.
>
> Provide a set of APIs to add/delete entries to/from an array, which cover
> cannot-fail attach cases, e.g. attaching to arm_smmu_blocked_domain. Also
> add kunit coverage for those APIs.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 98 +++++++
> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c | 92 +++++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 258 ++++++++++++++++++
> 3 files changed, 448 insertions(+)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 96a23ca633cb..a9441dc9070e 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -649,6 +649,93 @@ struct arm_smmu_cmdq_batch {
> int num;
> };
>
> +/*
> + * The order here also determines the sequence in which commands are sent to the
> + * command queue. E.g. TLBI must be done before ATC_INV.
> + */
> +enum arm_smmu_inv_type {
> + INV_TYPE_S1_ASID,
> + INV_TYPE_S2_VMID,
> + INV_TYPE_S2_VMID_S1_CLEAR,
> + INV_TYPE_ATS,
> + INV_TYPE_ATS_FULL,
> +};
> +
> +struct arm_smmu_inv {
> + struct arm_smmu_device *smmu;
> + u8 type;
> + u8 size_opcode;
> + u8 nsize_opcode;
> + u32 id; /* ASID or VMID or SID */
> + union {
> + size_t pgsize; /* ARM_SMMU_FEAT_RANGE_INV */
> + u32 ssid; /* INV_TYPE_ATS */
> + };
> +
> + refcount_t users; /* users=0 to mark as a trash to be purged */
The refcount_t API uses atomics with barrier semantics. Do we actually
need those properties when updating the refcounts here? The ASID lock
gives us pretty strong serialisation even after this patch series and
we rely heavily on that.
> +static void arm_smmu_v3_invs_test(struct kunit *test)
> +{
> + const int results1[2][3] = { { 1, 1, 3, }, { 1, 1, 1, }, };
> + const int results2[2][5] = { { 1, 1, 3, 4, 5, }, { 2, 1, 1, 1, 1, }, };
> + const int results3[2][3] = { { 1, 1, 3, }, { 1, 1, 1, }, };
> + const int results4[2][5] = { { 1, 1, 3, 5, 6, }, { 2, 1, 1, 1, 1, }, };
> + const int results5[2][5] = { { 1, 1, 3, 5, 6, }, { 1, 0, 0, 1, 1, }, };
> + const int results6[2][3] = { { 1, 5, 6, }, { 1, 1, 1, }, };
> + struct arm_smmu_invs *test_a, *test_b;
> +
> + /* New array */
> + test_a = arm_smmu_invs_alloc(0);
> + KUNIT_EXPECT_EQ(test, test_a->num_invs, 0);
> +
> + /* Test1: merge invs1 (new array) */
> + test_b = arm_smmu_invs_merge(test_a, &invs1);
> + kfree(test_a);
> + arm_smmu_v3_invs_test_verify(test, test_b, ARRAY_SIZE(results1[0]),
> + results1[0], results1[1]);
> +
> + /* Test2: merge invs2 (new array) */
> + test_a = arm_smmu_invs_merge(test_b, &invs2);
> + kfree(test_b);
> + arm_smmu_v3_invs_test_verify(test, test_a, ARRAY_SIZE(results2[0]),
> + results2[0], results2[1]);
> +
> + /* Test3: unref invs2 (same array) */
> + arm_smmu_invs_unref(test_a, &invs2, NULL);
> + arm_smmu_v3_invs_test_verify(test, test_a, ARRAY_SIZE(results3[0]),
> + results3[0], results3[1]);
> + KUNIT_EXPECT_EQ(test, test_a->num_trashes, 0);
> +
> + /* Test4: merge invs3 (new array) */
> + test_b = arm_smmu_invs_merge(test_a, &invs3);
> + kfree(test_a);
> + arm_smmu_v3_invs_test_verify(test, test_b, ARRAY_SIZE(results4[0]),
> + results4[0], results4[1]);
> +
> + /* Test5: unref invs1 (same array) */
> + arm_smmu_invs_unref(test_b, &invs1, NULL);
> + arm_smmu_v3_invs_test_verify(test, test_b, ARRAY_SIZE(results5[0]),
> + results5[0], results5[1]);
> + KUNIT_EXPECT_EQ(test, test_b->num_trashes, 2);
> +
> + /* Test6: purge test_b (new array) */
> + test_a = arm_smmu_invs_purge(test_b);
> + kfree(test_b);
> + arm_smmu_v3_invs_test_verify(test, test_a, ARRAY_SIZE(results6[0]),
> + results6[0], results6[1]);
> +
> + /* Test7: unref invs3 (same array) */
> + arm_smmu_invs_unref(test_a, &invs3, NULL);
> + KUNIT_EXPECT_EQ(test, test_a->num_invs, 0);
> + KUNIT_EXPECT_EQ(test, test_a->num_trashes, 0);
Wouldn't we be better off testing num_trashes == 0 after test 6 has
completed?
> +/**
> + * arm_smmu_invs_for_each_entry - Iterate over two sorted arrays computing for
> + * arm_smmu_invs_merge() or arm_smmu_invs_unref()
> + * @invs_l: the base invalidation array
> + * @idx_l: a stack variable of 'size_t', to store the base array index
> + * @invs_r: the build_invs array as to_merge or to_unref
> + * @idx_r: a stack variable of 'size_t', to store the build_invs index
> + * @cmp: a stack variable of 'int', to store return value (-1, 0, or 1)
> + */
> +#define arm_smmu_invs_for_each_cmp(invs_l, idx_l, invs_r, idx_r, cmp) \
nit: the kerneldoc comment doesn't match the name of this function.
Will
* Re: [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2026-01-23 17:03 ` Will Deacon
@ 2026-01-23 17:35 ` Nicolin Chen
2026-01-23 17:51 ` Will Deacon
0 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2026-01-23 17:35 UTC (permalink / raw)
To: Will Deacon
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
On Fri, Jan 23, 2026 at 05:03:10PM +0000, Will Deacon wrote:
> On Fri, Dec 19, 2025 at 12:11:25PM -0800, Nicolin Chen wrote:
> > +struct arm_smmu_inv {
> > + struct arm_smmu_device *smmu;
> > + u8 type;
> > + u8 size_opcode;
> > + u8 nsize_opcode;
> > + u32 id; /* ASID or VMID or SID */
> > + union {
> > + size_t pgsize; /* ARM_SMMU_FEAT_RANGE_INV */
> > + u32 ssid; /* INV_TYPE_ATS */
> > + };
> > +
> > + refcount_t users; /* users=0 to mark as a trash to be purged */
>
> The refcount_t API uses atomics with barrier semantics. Do we actually
> need those properties when updating the refcounts here? The ASID lock
> gives us pretty strong serialisation even after this patch series and
> we rely heavily on that.
But we can't use that mutex in the invalidation function, which
might run in IRQ context?
> > + /* Test6: purge test_b (new array) */
> > + test_a = arm_smmu_invs_purge(test_b);
> > + kfree(test_b);
> > + arm_smmu_v3_invs_test_verify(test, test_a, ARRAY_SIZE(results6[0]),
> > + results6[0], results6[1]);
> > +
> > + /* Test7: unref invs3 (same array) */
> > + arm_smmu_invs_unref(test_a, &invs3, NULL);
> > + KUNIT_EXPECT_EQ(test, test_a->num_invs, 0);
> > + KUNIT_EXPECT_EQ(test, test_a->num_trashes, 0);
>
> Wouldn't we be better off testing num_trashes == 0 after test 6 has
> completed?
OK.
> > +/**
> > + * arm_smmu_invs_for_each_entry - Iterate over two sorted arrays computing for
> > + * arm_smmu_invs_merge() or arm_smmu_invs_unref()
> > + * @invs_l: the base invalidation array
> > + * @idx_l: a stack variable of 'size_t', to store the base array index
> > + * @invs_r: the build_invs array as to_merge or to_unref
> > + * @idx_r: a stack variable of 'size_t', to store the build_invs index
> > + * @cmp: a stack variable of 'int', to store return value (-1, 0, or 1)
> > + */
> > +#define arm_smmu_invs_for_each_cmp(invs_l, idx_l, invs_r, idx_r, cmp) \
>
> nit: the kerneldoc comment doesn't match the name of this function.
I will fix this.
Thanks
Nicolin
* Re: [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2026-01-23 17:35 ` Nicolin Chen
@ 2026-01-23 17:51 ` Will Deacon
2026-01-23 17:56 ` Nicolin Chen
0 siblings, 1 reply; 45+ messages in thread
From: Will Deacon @ 2026-01-23 17:51 UTC (permalink / raw)
To: Nicolin Chen
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
On Fri, Jan 23, 2026 at 09:35:58AM -0800, Nicolin Chen wrote:
> On Fri, Jan 23, 2026 at 05:03:10PM +0000, Will Deacon wrote:
> > On Fri, Dec 19, 2025 at 12:11:25PM -0800, Nicolin Chen wrote:
> > > +struct arm_smmu_inv {
> > > + struct arm_smmu_device *smmu;
> > > + u8 type;
> > > + u8 size_opcode;
> > > + u8 nsize_opcode;
> > > + u32 id; /* ASID or VMID or SID */
> > > + union {
> > > + size_t pgsize; /* ARM_SMMU_FEAT_RANGE_INV */
> > > + u32 ssid; /* INV_TYPE_ATS */
> > > + };
> > > +
> > > + refcount_t users; /* users=0 to mark as a trash to be purged */
> >
> > The refcount_t API uses atomics with barrier semantics. Do we actually
> > need those properties when updating the refcounts here? The ASID lock
> > gives us pretty strong serialisation even after this patch series and
> > we rely heavily on that.
>
> But we can't use that mutex in the invalidation function that
> might be an IRQ context?
My question, really, is why do you need the atomic properties in this patch
series? It just looks like overhead at the moment.
Will
* Re: [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2026-01-23 17:51 ` Will Deacon
@ 2026-01-23 17:56 ` Nicolin Chen
2026-01-23 19:16 ` Jason Gunthorpe
0 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2026-01-23 17:56 UTC (permalink / raw)
To: Will Deacon
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
On Fri, Jan 23, 2026 at 05:51:52PM +0000, Will Deacon wrote:
> On Fri, Jan 23, 2026 at 09:35:58AM -0800, Nicolin Chen wrote:
> > On Fri, Jan 23, 2026 at 05:03:10PM +0000, Will Deacon wrote:
> > > On Fri, Dec 19, 2025 at 12:11:25PM -0800, Nicolin Chen wrote:
> > > > +struct arm_smmu_inv {
> > > > + struct arm_smmu_device *smmu;
> > > > + u8 type;
> > > > + u8 size_opcode;
> > > > + u8 nsize_opcode;
> > > > + u32 id; /* ASID or VMID or SID */
> > > > + union {
> > > > + size_t pgsize; /* ARM_SMMU_FEAT_RANGE_INV */
> > > > + u32 ssid; /* INV_TYPE_ATS */
> > > > + };
> > > > +
> > > > + refcount_t users; /* users=0 to mark as a trash to be purged */
> > >
> > > The refcount_t API uses atomics with barrier semantics. Do we actually
> > > need those properties when updating the refcounts here? The ASID lock
> > > gives us pretty strong serialisation even after this patch series and
> > > we rely heavily on that.
> >
> > But we can't use that mutex in the invalidation function that
> > might be an IRQ context?
>
> My question, really, is why do you need the atomic properties in this patch
> series? It just looks like overhead at the moment.
Hmm, shouldn't it be atomic, since..
(might be IRQ, no mutex) __arm_smmu_domain_inv_range() reads it.
(mutex protected) arm_smmu_attach_dev() writes it.
..?
Nicolin
* Re: [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2026-01-23 17:56 ` Nicolin Chen
@ 2026-01-23 19:16 ` Jason Gunthorpe
2026-01-23 19:18 ` Nicolin Chen
2026-01-26 14:54 ` Will Deacon
0 siblings, 2 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-23 19:16 UTC (permalink / raw)
To: Nicolin Chen
Cc: Will Deacon, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Fri, Jan 23, 2026 at 09:56:01AM -0800, Nicolin Chen wrote:
> On Fri, Jan 23, 2026 at 05:51:52PM +0000, Will Deacon wrote:
> > On Fri, Jan 23, 2026 at 09:35:58AM -0800, Nicolin Chen wrote:
> > > On Fri, Jan 23, 2026 at 05:03:10PM +0000, Will Deacon wrote:
> > > > On Fri, Dec 19, 2025 at 12:11:25PM -0800, Nicolin Chen wrote:
> > > > > +struct arm_smmu_inv {
> > > > > + struct arm_smmu_device *smmu;
> > > > > + u8 type;
> > > > > + u8 size_opcode;
> > > > > + u8 nsize_opcode;
> > > > > + u32 id; /* ASID or VMID or SID */
> > > > > + union {
> > > > > + size_t pgsize; /* ARM_SMMU_FEAT_RANGE_INV */
> > > > > + u32 ssid; /* INV_TYPE_ATS */
> > > > > + };
> > > > > +
> > > > > + refcount_t users; /* users=0 to mark as a trash to be purged */
> > > >
> > > > The refcount_t API uses atomics with barrier semantics. Do we actually
> > > > need those properties when updating the refcounts here? The ASID lock
> > > > gives us pretty strong serialisation even after this patch series and
> > > > we rely heavily on that.
> > >
> > > But we can't use that mutex in the invalidation function that
> > > might be an IRQ context?
> >
> > My question, really, is why do you need the atomic properties in this patch
> > series? It just looks like overhead at the moment.
>
> Hmm, shouldn't it be atomic, since..
>
> (might be IRQ, no mutex) __arm_smmu_domain_inv_range() reads it.
> (mutex protected) arm_smmu_attach_dev() writes it.
>
> ..?
It doesn't need to be a full atomic, you can use WRITE_ONCE/READ_ONCE
instead for this case.
The general argument of this scheme is it doesn't matter if the
invalidation side is doing extra invalidation, that is supposed to
always be safe except for ATS. For ATS we hold locks, so it
doesn't need to be atomic because of the lock.
So I think Will has it right, just use READ_ONCE/WRITE_ONCE on a naked
unsigned int.
Jason
* Re: [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2026-01-23 19:16 ` Jason Gunthorpe
@ 2026-01-23 19:18 ` Nicolin Chen
2026-01-26 14:54 ` Will Deacon
1 sibling, 0 replies; 45+ messages in thread
From: Nicolin Chen @ 2026-01-23 19:18 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Fri, Jan 23, 2026 at 03:16:54PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 23, 2026 at 09:56:01AM -0800, Nicolin Chen wrote:
> > On Fri, Jan 23, 2026 at 05:51:52PM +0000, Will Deacon wrote:
> > > On Fri, Jan 23, 2026 at 09:35:58AM -0800, Nicolin Chen wrote:
> > > > On Fri, Jan 23, 2026 at 05:03:10PM +0000, Will Deacon wrote:
> > > > > On Fri, Dec 19, 2025 at 12:11:25PM -0800, Nicolin Chen wrote:
> > > > > > +struct arm_smmu_inv {
> > > > > > + struct arm_smmu_device *smmu;
> > > > > > + u8 type;
> > > > > > + u8 size_opcode;
> > > > > > + u8 nsize_opcode;
> > > > > > + u32 id; /* ASID or VMID or SID */
> > > > > > + union {
> > > > > > + size_t pgsize; /* ARM_SMMU_FEAT_RANGE_INV */
> > > > > > + u32 ssid; /* INV_TYPE_ATS */
> > > > > > + };
> > > > > > +
> > > > > > + refcount_t users; /* users=0 to mark as a trash to be purged */
> > > > >
> > > > > The refcount_t API uses atomics with barrier semantics. Do we actually
> > > > > need those properties when updating the refcounts here? The ASID lock
> > > > > gives us pretty strong serialisation even after this patch series and
> > > > > we rely heavily on that.
> > > >
> > > > But we can't use that mutex in the invalidation function that
> > > > might be an IRQ context?
> > >
> > > My question, really, is why do you need the atomic properties in this patch
> > > series? It just looks like overhead at the moment.
> >
> > Hmm, shouldn't it be atomic, since..
> >
> > (might be IRQ, no mutex) __arm_smmu_domain_inv_range() reads it.
> > (mutex protected) arm_smmu_attach_dev() writes it.
> >
> > ..?
>
> It doesn't need to be a full atomic, you can use WRITE_ONCE/READ_ONCE
> instead for this case.
>
> The general argument of this scheme is it doesn't matter if the
> invalidation side is doing extra invalidation, that is supposed to
> always be safe except for ATS. For ATS we hold locks, so it
> doesn't need to be atomic because of the lock.
>
> So I think Will has it right, just use READ_ONCE/WRITE_ONCE on a naked
> unsigned int.
OK. I'll change that.
Thanks
Nicolin
* Re: [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2026-01-23 19:16 ` Jason Gunthorpe
2026-01-23 19:18 ` Nicolin Chen
@ 2026-01-26 14:54 ` Will Deacon
2026-01-26 15:21 ` Jason Gunthorpe
1 sibling, 1 reply; 45+ messages in thread
From: Will Deacon @ 2026-01-26 14:54 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Fri, Jan 23, 2026 at 03:16:54PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 23, 2026 at 09:56:01AM -0800, Nicolin Chen wrote:
> > On Fri, Jan 23, 2026 at 05:51:52PM +0000, Will Deacon wrote:
> > > On Fri, Jan 23, 2026 at 09:35:58AM -0800, Nicolin Chen wrote:
> > > > On Fri, Jan 23, 2026 at 05:03:10PM +0000, Will Deacon wrote:
> > > > > On Fri, Dec 19, 2025 at 12:11:25PM -0800, Nicolin Chen wrote:
> > > > > > +struct arm_smmu_inv {
> > > > > > + struct arm_smmu_device *smmu;
> > > > > > + u8 type;
> > > > > > + u8 size_opcode;
> > > > > > + u8 nsize_opcode;
> > > > > > + u32 id; /* ASID or VMID or SID */
> > > > > > + union {
> > > > > > + size_t pgsize; /* ARM_SMMU_FEAT_RANGE_INV */
> > > > > > + u32 ssid; /* INV_TYPE_ATS */
> > > > > > + };
> > > > > > +
> > > > > > + refcount_t users; /* users=0 to mark as a trash to be purged */
> > > > >
> > > > > The refcount_t API uses atomics with barrier semantics. Do we actually
> > > > > need those properties when updating the refcounts here? The ASID lock
> > > > > gives us pretty strong serialisation even after this patch series and
> > > > > we rely heavily on that.
> > > >
> > > > But we can't use that mutex in the invalidation function that
> > > > might be an IRQ context?
> > >
> > > My question, really, is why do you need the atomic properties in this patch
> > > series? It just looks like overhead at the moment.
> >
> > Hmm, shouldn't it be atomic, since..
> >
> > (might be IRQ, no mutex) __arm_smmu_domain_inv_range() reads it.
> > (mutex protected) arm_smmu_attach_dev() writes it.
> >
> > ..?
>
> It doesn't need to be a full atomic, you can use WRITE_ONCE/READ_ONCE
> instead for this case.
>
> The general argument of this scheme is it doesn't matter if the
> invalidation side is doing extra invalidation, that is supposed to
> always be safe except for ATS. For ATS we hold locks, so it
> doesn't need to be atomic because of the lock.
>
> So I think Will has it right, just use READ_ONCE/WRITE_ONCE on a naked
> unsigned int.
There's the usual scary ordering worries with this stuff, but they exist
with refcount_read() too. We're ok because cur->users is address dependent
on the read of the invalidation array pointer, which is RCU protected.
(I think!)
Will
* Re: [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2026-01-26 14:54 ` Will Deacon
@ 2026-01-26 15:21 ` Jason Gunthorpe
0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-26 15:21 UTC (permalink / raw)
To: Will Deacon
Cc: Nicolin Chen, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Mon, Jan 26, 2026 at 02:54:38PM +0000, Will Deacon wrote:
> There's the usual scary ordering worries with this stuff, but they exist
> with refcount_read() too. We're ok because cur->users is address dependent
> on the read of the invalidation array pointer, which is RCU protected.
>
> (I think!)
This has been my understanding at least :\
Jason
* [PATCH v9 4/7] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array
2025-12-19 20:11 [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (2 preceding siblings ...)
2025-12-19 20:11 ` [PATCH v9 3/7] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
@ 2025-12-19 20:11 ` Nicolin Chen
2026-01-23 9:54 ` Pranjal Shrivastava
2025-12-19 20:11 ` [PATCH v9 5/7] iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters Nicolin Chen
` (3 subsequent siblings)
7 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2025-12-19 20:11 UTC (permalink / raw)
To: will
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
When a master is attached from an old domain to a new domain, it needs to
build invalidation arrays to add entries onto the new domain's array and
delete entries from the old domain's array, passed via the to_merge and
to_unref arguments into arm_smmu_invs_merge/unref() respectively.
Since master->num_streams might differ across masters, memory would have
to be allocated when building a to_merge/to_unref array, which might fail
with -ENOMEM.
On the other hand, an attachment to arm_smmu_blocked_domain must not fail,
so it is best to avoid any memory allocation in that path.
Pre-allocate a fixed-size invalidation array for every master. This array
will be used as scratch space, filled dynamically when building a to_merge
or to_unref invs array. Sort fwspec->ids in ascending order to fit the
arm_smmu_invs_merge() function.
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 8 ++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 41 +++++++++++++++++++--
2 files changed, 45 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index a9441dc9070e..f98774962012 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -930,6 +930,14 @@ struct arm_smmu_master {
struct arm_smmu_device *smmu;
struct device *dev;
struct arm_smmu_stream *streams;
+ /*
+ * Scratch memory for a to_merge or to_unref array to build a per-domain
+ * invalidation array. It'll be pre-allocated with enough entries for all
+ * possible build scenarios. It can be used by only one caller at a time
+ * until the arm_smmu_invs_merge/unref() finishes. Must be locked by the
+ * iommu_group mutex.
+ */
+ struct arm_smmu_invs *build_invs;
struct arm_smmu_vmaster *vmaster; /* use smmu->streams_mutex */
/* Locked by the iommu core using the group mutex */
struct arm_smmu_ctx_desc_cfg cd_table;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 39e89176fa87..cc85e7a10ea8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3709,12 +3709,22 @@ static int arm_smmu_init_sid_strtab(struct arm_smmu_device *smmu, u32 sid)
return 0;
}
+static int arm_smmu_stream_id_cmp(const void *_l, const void *_r)
+{
+ const typeof_member(struct arm_smmu_stream, id) *l = _l;
+ const typeof_member(struct arm_smmu_stream, id) *r = _r;
+
+ return cmp_int(*l, *r);
+}
+
static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
struct arm_smmu_master *master)
{
int i;
int ret = 0;
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
+ bool ats_supported = dev_is_pci(master->dev) &&
+ pci_ats_supported(to_pci_dev(master->dev));
master->streams = kcalloc(fwspec->num_ids, sizeof(*master->streams),
GFP_KERNEL);
@@ -3722,14 +3732,35 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
return -ENOMEM;
master->num_streams = fwspec->num_ids;
- mutex_lock(&smmu->streams_mutex);
+ if (!ats_supported) {
+ /* Base case has 1 ASID entry or maximum 2 VMID entries */
+ master->build_invs = arm_smmu_invs_alloc(2);
+ } else {
+ /* ATS case adds num_ids of entries, on top of the base case */
+ master->build_invs = arm_smmu_invs_alloc(2 + fwspec->num_ids);
+ }
+ if (!master->build_invs) {
+ kfree(master->streams);
+ return -ENOMEM;
+ }
+
for (i = 0; i < fwspec->num_ids; i++) {
struct arm_smmu_stream *new_stream = &master->streams[i];
- struct rb_node *existing;
- u32 sid = fwspec->ids[i];
- new_stream->id = sid;
+ new_stream->id = fwspec->ids[i];
new_stream->master = master;
+ }
+
+ /* Put the ids into order for sorted to_merge/to_unref arrays */
+ sort_nonatomic(master->streams, master->num_streams,
+ sizeof(master->streams[0]), arm_smmu_stream_id_cmp,
+ NULL);
+
+ mutex_lock(&smmu->streams_mutex);
+ for (i = 0; i < fwspec->num_ids; i++) {
+ struct arm_smmu_stream *new_stream = &master->streams[i];
+ struct rb_node *existing;
+ u32 sid = new_stream->id;
ret = arm_smmu_init_sid_strtab(smmu, sid);
if (ret)
@@ -3759,6 +3790,7 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
for (i--; i >= 0; i--)
rb_erase(&master->streams[i].node, &smmu->streams);
kfree(master->streams);
+ kfree(master->build_invs);
}
mutex_unlock(&smmu->streams_mutex);
@@ -3780,6 +3812,7 @@ static void arm_smmu_remove_master(struct arm_smmu_master *master)
mutex_unlock(&smmu->streams_mutex);
kfree(master->streams);
+ kfree(master->build_invs);
}
static struct iommu_device *arm_smmu_probe_device(struct device *dev)
--
2.43.0
* Re: [PATCH v9 4/7] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array
2025-12-19 20:11 ` [PATCH v9 4/7] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array Nicolin Chen
@ 2026-01-23 9:54 ` Pranjal Shrivastava
0 siblings, 0 replies; 45+ messages in thread
From: Pranjal Shrivastava @ 2026-01-23 9:54 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, jean-philippe, robin.murphy, joro, jgg, balbirs,
miko.lenczewski, peterz, kevin.tian, linux-arm-kernel, iommu,
linux-kernel
On Fri, Dec 19, 2025 at 12:11:26PM -0800, Nicolin Chen wrote:
> When a master is attached from an old domain to a new domain, it needs to
> build invalidation arrays to add entries onto the new domain's array and
> delete entries from the old domain's array, passed via the to_merge and
> to_unref arguments into arm_smmu_invs_merge/unref() respectively.
>
> Since master->num_streams might differ across masters, memory would have
> to be allocated when building a to_merge/to_unref array, which might fail
> with -ENOMEM.
>
> On the other hand, an attachment to arm_smmu_blocked_domain must not fail,
> so it is best to avoid any memory allocation in that path.
>
> Pre-allocate a fixed-size invalidation array for every master. This array
> will be used as scratch space, filled dynamically when building a to_merge
> or to_unref invs array. Sort fwspec->ids in ascending order to fit the
> arm_smmu_invs_merge() function.
>
> Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 8 ++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 41 +++++++++++++++++++--
> 2 files changed, 45 insertions(+), 4 deletions(-)
>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Thanks
Praan
* [PATCH v9 5/7] iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters
2025-12-19 20:11 [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (3 preceding siblings ...)
2025-12-19 20:11 ` [PATCH v9 4/7] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array Nicolin Chen
@ 2025-12-19 20:11 ` Nicolin Chen
2025-12-19 20:11 ` [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range() Nicolin Chen
` (2 subsequent siblings)
7 siblings, 0 replies; 45+ messages in thread
From: Nicolin Chen @ 2025-12-19 20:11 UTC (permalink / raw)
To: will
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
Update the invs array with the invalidations required by each domain type
during attachment operations.
Only an SVA domain or a paging domain will have an invs array:
a. SVA domain will add an INV_TYPE_S1_ASID per SMMU and an INV_TYPE_ATS
per SID
b. Non-nesting-parent paging domain with no ATS-enabled master will add
a single INV_TYPE_S1_ASID or INV_TYPE_S2_VMID per SMMU
c. Non-nesting-parent paging domain with ATS-enabled master(s) will do
(b) and add an INV_TYPE_ATS per SID
d. Nesting-parent paging domain will add an INV_TYPE_S2_VMID followed by
an INV_TYPE_S2_VMID_S1_CLEAR per vSMMU. For an ATS-enabled master, it
will add an INV_TYPE_ATS_FULL per SID
Note that case #d prepares for a future implementation of VMID allocation,
which requires a followup series for S2 domain sharing. So, when a nesting
parent domain is attached through a vSMMU instance using a nested domain,
the VMID will be allocated per vSMMU instance vs. currently per S2 domain.
The per-domain invalidation is not needed until the domain is attached to
a master (when it may start to use the TLB). This will make it possible
to attach the domain to multiple SMMUs and avoid unnecessary invalidation
overhead during teardown if no STEs/CDs refer to the domain. It also means
that when the last device is detached, the old domain must flush its ASID
or VMID, since any new iommu_unmap() call would not trigger invalidations
given an empty domain->invs array.
Introduce some arm_smmu_invs helper functions for building scratch arrays,
preparing and installing old/new domain's invalidation arrays.
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 17 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 258 +++++++++++++++++++-
2 files changed, 274 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index f98774962012..f8dc96476c43 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1101,6 +1101,21 @@ static inline bool arm_smmu_master_canwbs(struct arm_smmu_master *master)
IOMMU_FWSPEC_PCI_RC_CANWBS;
}
+/**
+ * struct arm_smmu_inv_state - Per-domain invalidation array state
+ * @invs_ptr: points to the domain->invs (unwinding nesting/etc.) or is NULL if
+ * no change should be made
+ * @old_invs: the original invs array
+ * @new_invs: for new domain, this is the new invs array to update domain->invs;
+ * for old domain, this is the master->build_invs to pass in as the
+ * to_unref argument to an arm_smmu_invs_unref() call
+ */
+struct arm_smmu_inv_state {
+ struct arm_smmu_invs __rcu **invs_ptr;
+ struct arm_smmu_invs *old_invs;
+ struct arm_smmu_invs *new_invs;
+};
+
struct arm_smmu_attach_state {
/* Inputs */
struct iommu_domain *old_domain;
@@ -1110,6 +1125,8 @@ struct arm_smmu_attach_state {
ioasid_t ssid;
/* Resulting state */
struct arm_smmu_vmaster *vmaster;
+ struct arm_smmu_inv_state old_domain_invst;
+ struct arm_smmu_inv_state new_domain_invst;
bool ats_enabled;
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index cc85e7a10ea8..fb45359680d2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3068,6 +3068,121 @@ static void arm_smmu_disable_iopf(struct arm_smmu_master *master,
iopf_queue_remove_device(master->smmu->evtq.iopf, master->dev);
}
+static struct arm_smmu_inv *
+arm_smmu_master_build_inv(struct arm_smmu_master *master,
+ enum arm_smmu_inv_type type, u32 id, ioasid_t ssid,
+ size_t pgsize)
+{
+ struct arm_smmu_invs *build_invs = master->build_invs;
+ struct arm_smmu_inv *cur, inv = {
+ .smmu = master->smmu,
+ .type = type,
+ .id = id,
+ .pgsize = pgsize,
+ };
+
+ if (WARN_ON(build_invs->num_invs >= build_invs->max_invs))
+ return NULL;
+ cur = &build_invs->inv[build_invs->num_invs];
+ build_invs->num_invs++;
+
+ *cur = inv;
+ switch (type) {
+ case INV_TYPE_S1_ASID:
+ /*
+ * For S1 page tables the driver always uses VMID=0, and the
+ * invalidation logic for this type will set it as well.
+ */
+ if (master->smmu->features & ARM_SMMU_FEAT_E2H) {
+ cur->size_opcode = CMDQ_OP_TLBI_EL2_VA;
+ cur->nsize_opcode = CMDQ_OP_TLBI_EL2_ASID;
+ } else {
+ cur->size_opcode = CMDQ_OP_TLBI_NH_VA;
+ cur->nsize_opcode = CMDQ_OP_TLBI_NH_ASID;
+ }
+ break;
+ case INV_TYPE_S2_VMID:
+ cur->size_opcode = CMDQ_OP_TLBI_S2_IPA;
+ cur->nsize_opcode = CMDQ_OP_TLBI_S12_VMALL;
+ break;
+ case INV_TYPE_S2_VMID_S1_CLEAR:
+ cur->size_opcode = cur->nsize_opcode = CMDQ_OP_TLBI_NH_ALL;
+ break;
+ case INV_TYPE_ATS:
+ case INV_TYPE_ATS_FULL:
+ cur->size_opcode = cur->nsize_opcode = CMDQ_OP_ATC_INV;
+ cur->ssid = ssid;
+ break;
+ }
+
+ return cur;
+}
+
+/*
+ * Use the preallocated scratch array at master->build_invs to build a to_merge
+ * or to_unref array to pass into a subsequent arm_smmu_invs_merge/unref() call.
+ *
+ * Do not free the returned invs array. It is reused, and will be overwritten by
+ * the next arm_smmu_master_build_invs() call.
+ */
+static struct arm_smmu_invs *
+arm_smmu_master_build_invs(struct arm_smmu_master *master, bool ats_enabled,
+ ioasid_t ssid, struct arm_smmu_domain *smmu_domain)
+{
+ const bool nesting = smmu_domain->nest_parent;
+ size_t pgsize = 0, i;
+
+ iommu_group_mutex_assert(master->dev);
+
+ master->build_invs->num_invs = 0;
+
+ /* Range-based invalidation requires the leaf pgsize for calculation */
+ if (master->smmu->features & ARM_SMMU_FEAT_RANGE_INV)
+ pgsize = __ffs(smmu_domain->domain.pgsize_bitmap);
+
+ switch (smmu_domain->stage) {
+ case ARM_SMMU_DOMAIN_SVA:
+ case ARM_SMMU_DOMAIN_S1:
+ if (!arm_smmu_master_build_inv(master, INV_TYPE_S1_ASID,
+ smmu_domain->cd.asid,
+ IOMMU_NO_PASID, pgsize))
+ return NULL;
+ break;
+ case ARM_SMMU_DOMAIN_S2:
+ if (!arm_smmu_master_build_inv(master, INV_TYPE_S2_VMID,
+ smmu_domain->s2_cfg.vmid,
+ IOMMU_NO_PASID, pgsize))
+ return NULL;
+ break;
+ default:
+ WARN_ON(true);
+ return NULL;
+ }
+
+ /* All the nested S1 ASIDs have to be flushed when S2 parent changes */
+ if (nesting) {
+ if (!arm_smmu_master_build_inv(
+ master, INV_TYPE_S2_VMID_S1_CLEAR,
+ smmu_domain->s2_cfg.vmid, IOMMU_NO_PASID, 0))
+ return NULL;
+ }
+
+ for (i = 0; ats_enabled && i < master->num_streams; i++) {
+ /*
+ * If an S2 used as a nesting parent is changed we have no
+ * option but to completely flush the ATC.
+ */
+ if (!arm_smmu_master_build_inv(
+ master, nesting ? INV_TYPE_ATS_FULL : INV_TYPE_ATS,
+ master->streams[i].id, ssid, 0))
+ return NULL;
+ }
+
+ /* Note: the build_invs array constructed above must be in sorted order */
+
+ return master->build_invs;
+}
+
static void arm_smmu_remove_master_domain(struct arm_smmu_master *master,
struct iommu_domain *domain,
ioasid_t ssid)
@@ -3097,6 +3212,131 @@ static void arm_smmu_remove_master_domain(struct arm_smmu_master *master,
kfree(master_domain);
}
+/*
+ * During attachment, the updates of the two domain->invs arrays are sequenced:
+ * 1. new domain updates its invs array, merging master->build_invs
+ * 2. new domain starts to include the master during its invalidation
+ * 3. master updates its STE switching from the old domain to the new domain
+ * 4. old domain still includes the master during its invalidation
+ * 5. old domain updates its invs array, unreferencing master->build_invs
+ *
+ * For 1 and 5, prepare the two updated arrays in advance, handling any changes
+ * that can possibly fail, so the actual update in step 1 or 5 cannot fail.
+ * arm_smmu_asid_lock ensures that the old invs in the domains are intact while
+ * we are sequencing to update them.
+ */
+static int arm_smmu_attach_prepare_invs(struct arm_smmu_attach_state *state,
+ struct arm_smmu_domain *new_smmu_domain)
+{
+ struct arm_smmu_domain *old_smmu_domain =
+ to_smmu_domain_devices(state->old_domain);
+ struct arm_smmu_master *master = state->master;
+ ioasid_t ssid = state->ssid;
+
+ /*
+ * At this point a NULL domain indicates the domain doesn't use the
+ * IOTLB, see to_smmu_domain_devices().
+ */
+ if (new_smmu_domain) {
+ struct arm_smmu_inv_state *invst = &state->new_domain_invst;
+ struct arm_smmu_invs *build_invs;
+
+ invst->invs_ptr = &new_smmu_domain->invs;
+ invst->old_invs = rcu_dereference_protected(
+ new_smmu_domain->invs,
+ lockdep_is_held(&arm_smmu_asid_lock));
+ build_invs = arm_smmu_master_build_invs(
+ master, state->ats_enabled, ssid, new_smmu_domain);
+ if (!build_invs)
+ return -EINVAL;
+
+ invst->new_invs =
+ arm_smmu_invs_merge(invst->old_invs, build_invs);
+ if (IS_ERR(invst->new_invs))
+ return PTR_ERR(invst->new_invs);
+ }
+
+ if (old_smmu_domain) {
+ struct arm_smmu_inv_state *invst = &state->old_domain_invst;
+
+ invst->invs_ptr = &old_smmu_domain->invs;
+ /* A re-attach case might have a different ats_enabled state */
+ if (new_smmu_domain == old_smmu_domain)
+ invst->old_invs = state->new_domain_invst.new_invs;
+ else
+ invst->old_invs = rcu_dereference_protected(
+ old_smmu_domain->invs,
+ lockdep_is_held(&arm_smmu_asid_lock));
+ /* For old_smmu_domain, new_invs points to master->build_invs */
+ invst->new_invs = arm_smmu_master_build_invs(
+ master, master->ats_enabled, ssid, old_smmu_domain);
+ }
+
+ return 0;
+}
+
+/* Must be installed before arm_smmu_install_ste_for_dev() */
+static void
+arm_smmu_install_new_domain_invs(struct arm_smmu_attach_state *state)
+{
+ struct arm_smmu_inv_state *invst = &state->new_domain_invst;
+
+ if (!invst->invs_ptr)
+ return;
+
+ rcu_assign_pointer(*invst->invs_ptr, invst->new_invs);
+ kfree_rcu(invst->old_invs, rcu);
+}
+
+/*
+ * When an array entry's users count reaches zero, it means the ASID/VMID is no
+ * longer being invalidated by map/unmap and must be cleaned. The rule is that
+ * all ASIDs/VMIDs not in an invalidation array are left cleared in the IOTLB.
+ */
+static void arm_smmu_inv_flush_iotlb_tag(struct arm_smmu_inv *inv)
+{
+ struct arm_smmu_cmdq_ent cmd = {};
+
+ switch (inv->type) {
+ case INV_TYPE_S1_ASID:
+ cmd.tlbi.asid = inv->id;
+ break;
+ case INV_TYPE_S2_VMID:
+ /* S2_VMID using nsize_opcode covers S2_VMID_S1_CLEAR */
+ cmd.tlbi.vmid = inv->id;
+ break;
+ default:
+ return;
+ }
+
+ cmd.opcode = inv->nsize_opcode;
+ arm_smmu_cmdq_issue_cmd_with_sync(inv->smmu, &cmd);
+}
+
+/* Should be installed after arm_smmu_install_ste_for_dev() */
+static void
+arm_smmu_install_old_domain_invs(struct arm_smmu_attach_state *state)
+{
+ struct arm_smmu_inv_state *invst = &state->old_domain_invst;
+ struct arm_smmu_invs *old_invs = invst->old_invs;
+ struct arm_smmu_invs *new_invs;
+
+ lockdep_assert_held(&arm_smmu_asid_lock);
+
+ if (!invst->invs_ptr)
+ return;
+
+ arm_smmu_invs_unref(old_invs, invst->new_invs,
+ arm_smmu_inv_flush_iotlb_tag);
+
+ new_invs = arm_smmu_invs_purge(old_invs);
+ if (!new_invs)
+ return;
+
+ rcu_assign_pointer(*invst->invs_ptr, new_invs);
+ kfree_rcu(old_invs, rcu);
+}
+
/*
* Start the sequence to attach a domain to a master. The sequence contains three
* steps:
@@ -3154,12 +3394,16 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
arm_smmu_ats_supported(master);
}
+ ret = arm_smmu_attach_prepare_invs(state, smmu_domain);
+ if (ret)
+ return ret;
+
if (smmu_domain) {
if (new_domain->type == IOMMU_DOMAIN_NESTED) {
ret = arm_smmu_attach_prepare_vmaster(
state, to_smmu_nested_domain(new_domain));
if (ret)
- return ret;
+ goto err_unprepare_invs;
}
master_domain = kzalloc(sizeof(*master_domain), GFP_KERNEL);
@@ -3207,6 +3451,8 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
atomic_inc(&smmu_domain->nr_ats_masters);
list_add(&master_domain->devices_elm, &smmu_domain->devices);
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
+ arm_smmu_install_new_domain_invs(state);
}
if (!state->ats_enabled && master->ats_enabled) {
@@ -3226,6 +3472,8 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
kfree(master_domain);
err_free_vmaster:
kfree(state->vmaster);
+err_unprepare_invs:
+ kfree(state->new_domain_invst.new_invs);
return ret;
}
@@ -3257,6 +3505,7 @@ void arm_smmu_attach_commit(struct arm_smmu_attach_state *state)
}
arm_smmu_remove_master_domain(master, state->old_domain, state->ssid);
+ arm_smmu_install_old_domain_invs(state);
master->ats_enabled = state->ats_enabled;
}
@@ -3438,12 +3687,19 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
{
struct arm_smmu_domain *smmu_domain = to_smmu_domain(old_domain);
struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct arm_smmu_attach_state state = {
+ .master = master,
+ .old_domain = old_domain,
+ .ssid = pasid,
+ };
mutex_lock(&arm_smmu_asid_lock);
+ arm_smmu_attach_prepare_invs(&state, NULL);
arm_smmu_clear_cd(master, pasid);
if (master->ats_enabled)
arm_smmu_atc_inv_master(master, pasid);
arm_smmu_remove_master_domain(master, &smmu_domain->domain, pasid);
+ arm_smmu_install_old_domain_invs(&state);
mutex_unlock(&arm_smmu_asid_lock);
/*
--
2.43.0
* [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2025-12-19 20:11 [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (4 preceding siblings ...)
2025-12-19 20:11 ` [PATCH v9 5/7] iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters Nicolin Chen
@ 2025-12-19 20:11 ` Nicolin Chen
2026-01-23 9:48 ` Pranjal Shrivastava
2026-01-23 17:05 ` Will Deacon
2025-12-19 20:11 ` [PATCH v9 7/7] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs Nicolin Chen
2026-01-19 17:10 ` [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
7 siblings, 2 replies; 45+ messages in thread
From: Nicolin Chen @ 2025-12-19 20:11 UTC (permalink / raw)
To: will
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
Each smmu_domain now has an arm_smmu_invs that specifies the invalidation
steps to perform after any change to the IOPTEs. This includes support for
basic ASID/VMID invalidation, the special nesting case, and ATC invalidations.
Introduce a new arm_smmu_domain_inv helper that iterates smmu_domain->invs
to convert the invalidation array into commands. Any invalidation request
with no size specified means an entire flush rather than a range-based one.
Take advantage of the sorted array to batch compatible operations targeting
the same SMMU. For instance, ATC invalidations for multiple SIDs can be
pushed as a single batch.
ATC invalidations must be completed before the driver disables ATS; otherwise
the device is permitted to ignore any racing invalidation, which would cause
an SMMU timeout. The sequencing is done with an rwlock, where holding the
write side means there are no outstanding ATC invalidations. If ATS is not
used, the rwlock is skipped, similar to the existing code.
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 9 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 258 +++++++++++++++++++-
2 files changed, 254 insertions(+), 13 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index f8dc96476c43..c3fee7f14480 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1086,6 +1086,15 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
unsigned long iova, size_t size);
+void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
+ unsigned long iova, size_t size,
+ unsigned int granule, bool leaf);
+
+static inline void arm_smmu_domain_inv(struct arm_smmu_domain *smmu_domain)
+{
+ arm_smmu_domain_inv_range(smmu_domain, 0, 0, 0, false);
+}
+
void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq *cmdq);
int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fb45359680d2..6e1082e6d164 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2516,23 +2516,19 @@ static void arm_smmu_tlb_inv_context(void *cookie)
arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
}
-static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
- unsigned long iova, size_t size,
- size_t granule,
- struct arm_smmu_domain *smmu_domain)
+static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq_batch *cmds,
+ struct arm_smmu_cmdq_ent *cmd,
+ unsigned long iova, size_t size,
+ size_t granule, size_t pgsize)
{
- struct arm_smmu_device *smmu = smmu_domain->smmu;
- unsigned long end = iova + size, num_pages = 0, tg = 0;
+ unsigned long end = iova + size, num_pages = 0, tg = pgsize;
size_t inv_range = granule;
- struct arm_smmu_cmdq_batch cmds;
if (!size)
return;
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
- /* Get the leaf page size */
- tg = __ffs(smmu_domain->domain.pgsize_bitmap);
-
num_pages = size >> tg;
/* Convert page size of 12,14,16 (log2) to 1,2,3 */
@@ -2552,8 +2548,6 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
num_pages++;
}
- arm_smmu_cmdq_batch_init(smmu, &cmds, cmd);
-
while (iova < end) {
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
/*
@@ -2581,9 +2575,26 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
}
cmd->tlbi.addr = iova;
- arm_smmu_cmdq_batch_add(smmu, &cmds, cmd);
+ arm_smmu_cmdq_batch_add(smmu, cmds, cmd);
iova += inv_range;
}
+}
+
+static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
+ unsigned long iova, size_t size,
+ size_t granule,
+ struct arm_smmu_domain *smmu_domain)
+{
+ struct arm_smmu_device *smmu = smmu_domain->smmu;
+ struct arm_smmu_cmdq_batch cmds;
+ size_t pgsize;
+
+ /* Get the leaf page size */
+ pgsize = __ffs(smmu_domain->domain.pgsize_bitmap);
+
+ arm_smmu_cmdq_batch_init(smmu, &cmds, cmd);
+ arm_smmu_cmdq_batch_add_range(smmu, &cmds, cmd, iova, size, granule,
+ pgsize);
arm_smmu_cmdq_batch_submit(smmu, &cmds);
}
@@ -2639,6 +2650,193 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
}
+static bool arm_smmu_inv_size_too_big(struct arm_smmu_device *smmu, size_t size,
+ size_t granule)
+{
+ size_t max_tlbi_ops;
+
+ /* 0 size means invalidate all */
+ if (!size || size == SIZE_MAX)
+ return true;
+
+ if (smmu->features & ARM_SMMU_FEAT_RANGE_INV)
+ return false;
+
+ /*
+ * Borrowed from the MAX_TLBI_OPS in arch/arm64/include/asm/tlbflush.h,
+ * this is used as a threshold to replace "size_opcode" commands with a
+ * single "nsize_opcode" command, when SMMU doesn't implement the range
+ * invalidation feature, where there can be too many per-granule TLBIs,
+ * resulting in a soft lockup.
+ */
+ max_tlbi_ops = 1 << (ilog2(granule) - 3);
+ return size >= max_tlbi_ops * granule;
+}
+
+/* Used by non INV_TYPE_ATS* invalidations */
+static void arm_smmu_inv_to_cmdq_batch(struct arm_smmu_inv *inv,
+ struct arm_smmu_cmdq_batch *cmds,
+ struct arm_smmu_cmdq_ent *cmd,
+ unsigned long iova, size_t size,
+ unsigned int granule)
+{
+ if (arm_smmu_inv_size_too_big(inv->smmu, size, granule)) {
+ cmd->opcode = inv->nsize_opcode;
+ arm_smmu_cmdq_batch_add(inv->smmu, cmds, cmd);
+ return;
+ }
+
+ cmd->opcode = inv->size_opcode;
+ arm_smmu_cmdq_batch_add_range(inv->smmu, cmds, cmd, iova, size, granule,
+ inv->pgsize);
+}
+
+static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
+ struct arm_smmu_inv *next)
+{
+ /* Changing smmu means changing command queue */
+ if (cur->smmu != next->smmu)
+ return true;
+ /* The batch for S2 TLBI must be done before nested S1 ASIDs */
+ if (cur->type != INV_TYPE_S2_VMID_S1_CLEAR &&
+ next->type == INV_TYPE_S2_VMID_S1_CLEAR)
+ return true;
+ /* ATS must be after a sync of the S1/S2 invalidations */
+ if (!arm_smmu_inv_is_ats(cur) && arm_smmu_inv_is_ats(next))
+ return true;
+ return false;
+}
+
+static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
+ unsigned long iova, size_t size,
+ unsigned int granule, bool leaf)
+{
+ struct arm_smmu_cmdq_batch cmds = {};
+ struct arm_smmu_inv *cur;
+ struct arm_smmu_inv *end;
+
+ cur = invs->inv;
+ end = cur + READ_ONCE(invs->num_invs);
+ /* Skip any leading entry marked as trash */
+ for (; cur != end; cur++)
+ if (refcount_read(&cur->users))
+ break;
+ while (cur != end) {
+ struct arm_smmu_device *smmu = cur->smmu;
+ struct arm_smmu_cmdq_ent cmd = {
+ /*
+ * Pick size_opcode to run arm_smmu_get_cmdq(). This can
+ * be changed to nsize_opcode, which would result in the
+ * same CMDQ pointer.
+ */
+ .opcode = cur->size_opcode,
+ };
+ struct arm_smmu_inv *next;
+
+ if (!cmds.num)
+ arm_smmu_cmdq_batch_init(smmu, &cmds, &cmd);
+
+ switch (cur->type) {
+ case INV_TYPE_S1_ASID:
+ cmd.tlbi.asid = cur->id;
+ cmd.tlbi.leaf = leaf;
+ arm_smmu_inv_to_cmdq_batch(cur, &cmds, &cmd, iova, size,
+ granule);
+ break;
+ case INV_TYPE_S2_VMID:
+ cmd.tlbi.vmid = cur->id;
+ cmd.tlbi.leaf = leaf;
+ arm_smmu_inv_to_cmdq_batch(cur, &cmds, &cmd, iova, size,
+ granule);
+ break;
+ case INV_TYPE_S2_VMID_S1_CLEAR:
+ /* CMDQ_OP_TLBI_S12_VMALL already flushed S1 entries */
+ if (arm_smmu_inv_size_too_big(cur->smmu, size, granule))
+ continue;
+ cmd.tlbi.vmid = cur->id;
+ arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
+ break;
+ case INV_TYPE_ATS:
+ arm_smmu_atc_inv_to_cmd(cur->ssid, iova, size, &cmd);
+ cmd.atc.sid = cur->id;
+ arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
+ break;
+ case INV_TYPE_ATS_FULL:
+ arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
+ cmd.atc.sid = cur->id;
+ arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ continue;
+ }
+
+ /* Skip any trash entry in-between */
+ for (next = cur + 1; next != end; next++)
+ if (refcount_read(&next->users))
+ break;
+
+ if (cmds.num &&
+ (next == end || arm_smmu_invs_end_batch(cur, next))) {
+ arm_smmu_cmdq_batch_submit(smmu, &cmds);
+ cmds.num = 0;
+ }
+ cur = next;
+ }
+}
+
+void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
+ unsigned long iova, size_t size,
+ unsigned int granule, bool leaf)
+{
+ struct arm_smmu_invs *invs;
+
+ /*
+ * An invalidation request must follow some IOPTE change and then load
+ * an invalidation array. In the meantime, a domain attachment mutates
+ * the array and then stores an STE/CD asking SMMU HW to acquire those
+ * changed IOPTEs. In other words, these two are interdependent and can
+ * race.
+ *
+ * In a race, the RCU design (with its underlying memory barriers) can
+ * ensure the invalidation array to always get updated before loaded.
+ *
+ * smp_mb() is used here, paired with the smp_mb() following the array
+ * update in a concurrent attach, to ensure:
+ * - HW sees the new IOPTEs if it walks after STE installation
+ * - Invalidation thread sees the updated array with the new ASID.
+ *
+ * [CPU0] | [CPU1]
+ * |
+ * change IOPTEs and TLB flush: |
+ * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs {
+ * ... | rcu_assign_pointer(new_invs);
+ * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
+ * ... | kfree_rcu(old_invs, rcu);
+ * // load invalidation array | }
+ * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
+ * | STE = TTB0 // read new IOPTEs
+ */
+ smp_mb();
+
+ rcu_read_lock();
+ invs = rcu_dereference(smmu_domain->invs);
+
+ /*
+ * Avoid locking unless ATS is being used. No ATC invalidation can be
+ * going on after a domain is detached.
+ */
+ if (invs->has_ats) {
+ read_lock(&invs->rwlock);
+ __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
+ read_unlock(&invs->rwlock);
+ } else {
+ __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
+ }
+
+ rcu_read_unlock();
+}
+
static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
unsigned long iova, size_t granule,
void *cookie)
@@ -3285,6 +3483,23 @@ arm_smmu_install_new_domain_invs(struct arm_smmu_attach_state *state)
return;
rcu_assign_pointer(*invst->invs_ptr, invst->new_invs);
+ /*
+ * We are committed to updating the STE. Ensure the invalidation array
+ * is visible to concurrent map/unmap threads, and acquire any racing
+ * IOPTE updates.
+ *
+ * [CPU0] | [CPU1]
+ * |
+ * change IOPTEs and TLB flush: |
+ * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs {
+ * ... | rcu_assign_pointer(new_invs);
+ * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
+ * ... | kfree_rcu(old_invs, rcu);
+ * // load invalidation array | }
+ * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
+ * | STE = TTB0 // read new IOPTEs
+ */
+ smp_mb();
kfree_rcu(invst->old_invs, rcu);
}
@@ -3334,6 +3549,23 @@ arm_smmu_install_old_domain_invs(struct arm_smmu_attach_state *state)
return;
rcu_assign_pointer(*invst->invs_ptr, new_invs);
+ /*
+ * We are committed to updating the STE. Ensure the invalidation array
+ * is visible to concurrent map/unmap threads, and acquire any racing
+ * IOPTE updates.
+ *
+ * [CPU0] | [CPU1]
+ * |
+ * change IOPTEs and TLB flush: |
+ * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
+ * ... | rcu_assign_pointer(new_invs);
+ * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
+ * ... | kfree_rcu(old_invs, rcu);
+ * // load invalidation array | }
+ * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
+ * | STE = TTB0 // read new IOPTEs
+ */
+ smp_mb();
kfree_rcu(old_invs, rcu);
}
--
2.43.0
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2025-12-19 20:11 ` [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range() Nicolin Chen
@ 2026-01-23 9:48 ` Pranjal Shrivastava
2026-01-23 13:56 ` Jason Gunthorpe
2026-01-27 16:38 ` Nicolin Chen
2026-01-23 17:05 ` Will Deacon
1 sibling, 2 replies; 45+ messages in thread
From: Pranjal Shrivastava @ 2026-01-23 9:48 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, jean-philippe, robin.murphy, joro, jgg, balbirs,
miko.lenczewski, peterz, kevin.tian, linux-arm-kernel, iommu,
linux-kernel
On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> Each smmu_domain now has an arm_smmu_invs that specifies the invalidation
> steps to perform after any change to the IOPTEs. This includes support for
> basic ASID/VMID invalidation, the special nesting case, and ATC invalidations.
>
> Introduce a new arm_smmu_domain_inv helper that iterates smmu_domain->invs
> to convert the invalidation array into commands. Any invalidation request
> with no size specified means an entire flush rather than a range-based one.
>
> Take advantage of the sorted array to batch compatible operations targeting
> the same SMMU. For instance, ATC invalidations for multiple SIDs can be
> pushed as a single batch.
>
> ATC invalidations must be completed before the driver disables ATS;
> otherwise the device is permitted to ignore any racing invalidation, which
> would cause an SMMU timeout. The sequencing is done with an rwlock, where
> holding the write side means there are no outstanding ATC invalidations.
> If ATS is not used, the rwlock is skipped, similar to the existing code.
>
> Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 9 +
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 258 +++++++++++++++++++-
> 2 files changed, 254 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index f8dc96476c43..c3fee7f14480 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -1086,6 +1086,15 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
> int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
> unsigned long iova, size_t size);
>
> +void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
> + unsigned long iova, size_t size,
> + unsigned int granule, bool leaf);
> +
> +static inline void arm_smmu_domain_inv(struct arm_smmu_domain *smmu_domain)
> +{
> + arm_smmu_domain_inv_range(smmu_domain, 0, 0, 0, false);
> +}
> +
> void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
> struct arm_smmu_cmdq *cmdq);
> int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index fb45359680d2..6e1082e6d164 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2516,23 +2516,19 @@ static void arm_smmu_tlb_inv_context(void *cookie)
> arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
> }
>
> -static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> - unsigned long iova, size_t size,
> - size_t granule,
> - struct arm_smmu_domain *smmu_domain)
> +static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu,
> + struct arm_smmu_cmdq_batch *cmds,
> + struct arm_smmu_cmdq_ent *cmd,
> + unsigned long iova, size_t size,
> + size_t granule, size_t pgsize)
> {
> - struct arm_smmu_device *smmu = smmu_domain->smmu;
> - unsigned long end = iova + size, num_pages = 0, tg = 0;
> + unsigned long end = iova + size, num_pages = 0, tg = pgsize;
> size_t inv_range = granule;
> - struct arm_smmu_cmdq_batch cmds;
>
> if (!size)
> return;
>
> if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> - /* Get the leaf page size */
> - tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> -
> num_pages = size >> tg;
>
> /* Convert page size of 12,14,16 (log2) to 1,2,3 */
> @@ -2552,8 +2548,6 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> num_pages++;
> }
>
> - arm_smmu_cmdq_batch_init(smmu, &cmds, cmd);
> -
> while (iova < end) {
> if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> /*
> @@ -2581,9 +2575,26 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> }
>
> cmd->tlbi.addr = iova;
> - arm_smmu_cmdq_batch_add(smmu, &cmds, cmd);
> + arm_smmu_cmdq_batch_add(smmu, cmds, cmd);
> iova += inv_range;
> }
> +}
> +
> +static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
> + unsigned long iova, size_t size,
> + size_t granule,
> + struct arm_smmu_domain *smmu_domain)
> +{
> + struct arm_smmu_device *smmu = smmu_domain->smmu;
> + struct arm_smmu_cmdq_batch cmds;
> + size_t pgsize;
> +
> + /* Get the leaf page size */
> + pgsize = __ffs(smmu_domain->domain.pgsize_bitmap);
> +
> + arm_smmu_cmdq_batch_init(smmu, &cmds, cmd);
> + arm_smmu_cmdq_batch_add_range(smmu, &cmds, cmd, iova, size, granule,
> + pgsize);
> arm_smmu_cmdq_batch_submit(smmu, &cmds);
> }
>
> @@ -2639,6 +2650,193 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
> __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
> }
>
> +static bool arm_smmu_inv_size_too_big(struct arm_smmu_device *smmu, size_t size,
> + size_t granule)
> +{
> + size_t max_tlbi_ops;
> +
> + /* 0 size means invalidate all */
> + if (!size || size == SIZE_MAX)
> + return true;
> +
> + if (smmu->features & ARM_SMMU_FEAT_RANGE_INV)
> + return false;
> +
> + /*
> + * Borrowed from the MAX_TLBI_OPS in arch/arm64/include/asm/tlbflush.h,
> + * this is used as a threshold to replace "size_opcode" commands with a
> + * single "nsize_opcode" command, when SMMU doesn't implement the range
> + * invalidation feature, where there can be too many per-granule TLBIs,
> + * resulting in a soft lockup.
> + */
> + max_tlbi_ops = 1 << (ilog2(granule) - 3);
> + return size >= max_tlbi_ops * granule;
> +}
> +
> +/* Used by non INV_TYPE_ATS* invalidations */
> +static void arm_smmu_inv_to_cmdq_batch(struct arm_smmu_inv *inv,
> + struct arm_smmu_cmdq_batch *cmds,
> + struct arm_smmu_cmdq_ent *cmd,
> + unsigned long iova, size_t size,
> + unsigned int granule)
> +{
> + if (arm_smmu_inv_size_too_big(inv->smmu, size, granule)) {
> + cmd->opcode = inv->nsize_opcode;
> + arm_smmu_cmdq_batch_add(inv->smmu, cmds, cmd);
> + return;
> + }
> +
> + cmd->opcode = inv->size_opcode;
> + arm_smmu_cmdq_batch_add_range(inv->smmu, cmds, cmd, iova, size, granule,
> + inv->pgsize);
> +}
> +
> +static inline bool arm_smmu_invs_end_batch(struct arm_smmu_inv *cur,
> + struct arm_smmu_inv *next)
> +{
> + /* Changing smmu means changing command queue */
> + if (cur->smmu != next->smmu)
> + return true;
> + /* The batch for S2 TLBI must be done before nested S1 ASIDs */
> + if (cur->type != INV_TYPE_S2_VMID_S1_CLEAR &&
> + next->type == INV_TYPE_S2_VMID_S1_CLEAR)
> + return true;
> + /* ATS must be after a sync of the S1/S2 invalidations */
> + if (!arm_smmu_inv_is_ats(cur) && arm_smmu_inv_is_ats(next))
> + return true;
> + return false;
> +}
> +
> +static void __arm_smmu_domain_inv_range(struct arm_smmu_invs *invs,
> + unsigned long iova, size_t size,
> + unsigned int granule, bool leaf)
> +{
> + struct arm_smmu_cmdq_batch cmds = {};
> + struct arm_smmu_inv *cur;
> + struct arm_smmu_inv *end;
> +
> + cur = invs->inv;
> + end = cur + READ_ONCE(invs->num_invs);
> + /* Skip any leading entry marked as trash */
> + for (; cur != end; cur++)
> + if (refcount_read(&cur->users))
> + break;
> + while (cur != end) {
> + struct arm_smmu_device *smmu = cur->smmu;
> + struct arm_smmu_cmdq_ent cmd = {
> + /*
> + * Pick size_opcode to run arm_smmu_get_cmdq(). This can
> + * be changed to nsize_opcode, which would result in the
> + * same CMDQ pointer.
> + */
> + .opcode = cur->size_opcode,
> + };
> + struct arm_smmu_inv *next;
> +
> + if (!cmds.num)
> + arm_smmu_cmdq_batch_init(smmu, &cmds, &cmd);
> +
> + switch (cur->type) {
> + case INV_TYPE_S1_ASID:
> + cmd.tlbi.asid = cur->id;
> + cmd.tlbi.leaf = leaf;
> + arm_smmu_inv_to_cmdq_batch(cur, &cmds, &cmd, iova, size,
> + granule);
> + break;
> + case INV_TYPE_S2_VMID:
> + cmd.tlbi.vmid = cur->id;
> + cmd.tlbi.leaf = leaf;
> + arm_smmu_inv_to_cmdq_batch(cur, &cmds, &cmd, iova, size,
> + granule);
> + break;
> + case INV_TYPE_S2_VMID_S1_CLEAR:
> + /* CMDQ_OP_TLBI_S12_VMALL already flushed S1 entries */
> + if (arm_smmu_inv_size_too_big(cur->smmu, size, granule))
> + continue;
> + cmd.tlbi.vmid = cur->id;
> + arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
> + break;
> + case INV_TYPE_ATS:
> + arm_smmu_atc_inv_to_cmd(cur->ssid, iova, size, &cmd);
> + cmd.atc.sid = cur->id;
> + arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
> + break;
> + case INV_TYPE_ATS_FULL:
> + arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
> + cmd.atc.sid = cur->id;
> + arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + continue;
> + }
> +
> + /* Skip any trash entry in-between */
> + for (next = cur + 1; next != end; next++)
> + if (refcount_read(&next->users))
> + break;
> +
> + if (cmds.num &&
> + (next == end || arm_smmu_invs_end_batch(cur, next))) {
> + arm_smmu_cmdq_batch_submit(smmu, &cmds);
> + cmds.num = 0;
> + }
> + cur = next;
> + }
> +}
> +
> +void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
> + unsigned long iova, size_t size,
> + unsigned int granule, bool leaf)
> +{
> + struct arm_smmu_invs *invs;
> +
> + /*
> + * An invalidation request must follow some IOPTE change and then load
> + * an invalidation array. In the meantime, a domain attachment mutates
> + * the array and then stores an STE/CD asking SMMU HW to acquire those
> + * changed IOPTEs. In other word, these two are interdependent and can
> + * race.
> + *
> + * In a race, the RCU design (with its underlying memory barriers) can
> + * ensure the invalidation array to always get updated before loaded.
> + *
> + * smp_mb() is used here, paired with the smp_mb() following the array
> + * update in a concurrent attach, to ensure:
> + * - HW sees the new IOPTEs if it walks after STE installation
> + * - Invalidation thread sees the updated array with the new ASID.
> + *
> + * [CPU0] | [CPU1]
> + * |
> + * change IOPTEs and TLB flush: |
> + * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs {
> + * ... | rcu_assign_pointer(new_invs);
> + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> + * ... | kfree_rcu(old_invs, rcu);
> + * // load invalidation array | }
> + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> + * | STE = TTB0 // read new IOPTEs
> + */
> + smp_mb();
> +
> + rcu_read_lock();
> + invs = rcu_dereference(smmu_domain->invs);
> +
> + /*
> + * Avoid locking unless ATS is being used. No ATC invalidation can be
> + * going on after a domain is detached.
> + */
> + if (invs->has_ats) {
> + read_lock(&invs->rwlock);
Shouldn't these be read_lock_irqsave for all rwlock variants here?
Invalidations might happen in IRQ context as well..
> + __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
> + read_unlock(&invs->rwlock);
> + } else {
> + __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
> + }
> +
> + rcu_read_unlock();
> +}
> +
> static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
> unsigned long iova, size_t granule,
> void *cookie)
> @@ -3285,6 +3483,23 @@ arm_smmu_install_new_domain_invs(struct arm_smmu_attach_state *state)
> return;
>
> rcu_assign_pointer(*invst->invs_ptr, invst->new_invs);
> + /*
> + * We are committed to updating the STE. Ensure the invalidation array
> + * is visible to concurrent map/unmap threads, and acquire any racing
> + * IOPTE updates.
> + *
> + * [CPU0] | [CPU1]
> + * |
> + * change IOPTEs and TLB flush: |
> + * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs {
> + * ... | rcu_assign_pointer(new_invs);
> + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> + * ... | kfree_rcu(old_invs, rcu);
> + * // load invalidation array | }
> + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> + * | STE = TTB0 // read new IOPTEs
> + */
> + smp_mb();
> kfree_rcu(invst->old_invs, rcu);
> }
>
> @@ -3334,6 +3549,23 @@ arm_smmu_install_old_domain_invs(struct arm_smmu_attach_state *state)
> return;
>
> rcu_assign_pointer(*invst->invs_ptr, new_invs);
> + /*
> + * We are committed to updating the STE. Ensure the invalidation array
> + * is visible to concurrent map/unmap threads, and acquire any racing
> + * IOPTE updates.
> + *
> + * [CPU0] | [CPU1]
> + * |
> + * change IOPTEs and TLB flush: |
> + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> + * ... | rcu_assign_pointer(new_invs);
> + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> + * ... | kfree_rcu(old_invs, rcu);
> + * // load invalidation array | }
> + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> + * | STE = TTB0 // read new IOPTEs
> + */
> + smp_mb();
> kfree_rcu(old_invs, rcu);
> }
>
For INV_TYPE_S1_ASID, the new code loops and checks size_too_big
via arm_smmu_inv_to_cmdq_batch. However, for INV_TYPE_ATS, it issues
a single command for the entire range. While this matches the current
driver, are we confident arm_smmu_atc_inv_to_cmd handles all massive
sizes correctly without needing a similar loop or "too big" fallback?
Thanks,
Praan
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-23 9:48 ` Pranjal Shrivastava
@ 2026-01-23 13:56 ` Jason Gunthorpe
2026-01-27 16:38 ` Nicolin Chen
1 sibling, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-23 13:56 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: Nicolin Chen, will, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, linux-arm-kernel, iommu,
linux-kernel
On Fri, Jan 23, 2026 at 09:48:37AM +0000, Pranjal Shrivastava wrote:
> For INV_TYPE_S1_ASID, the new code loops and checks size_too_big
> via arm_smmu_inv_to_cmdq_batch. However, for INV_TYPE_ATS, it issues
> a single command for the entire range. While this matches the current
> driver, are we confident arm_smmu_atc_inv_to_cmd handles all massive
> sizes correctly without needing a similar loop or "too big" fallback?
At the PCI level ATS invalidations accept an arbitrary byte range, so
all ranges should be converted into a single CMDQ command and pushed
as a single TLP.
The issue the "too big" stuff is solving is for the IOTLB itself, which
can require an invalidation range to be exploded into many smaller
commands.
Jason
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-23 9:48 ` Pranjal Shrivastava
2026-01-23 13:56 ` Jason Gunthorpe
@ 2026-01-27 16:38 ` Nicolin Chen
2026-01-27 17:08 ` Jason Gunthorpe
1 sibling, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2026-01-27 16:38 UTC (permalink / raw)
To: jgg, Pranjal Shrivastava
Cc: will, jean-philippe, robin.murphy, joro, balbirs, miko.lenczewski,
peterz, kevin.tian, linux-arm-kernel, iommu, linux-kernel
Hi Pranjal,
Sorry, I missed this!
On Fri, Jan 23, 2026 at 09:48:37AM +0000, Pranjal Shrivastava wrote:
> On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > + /*
> > + * Avoid locking unless ATS is being used. No ATC invalidation can be
> > + * going on after a domain is detached.
> > + */
> > + if (invs->has_ats) {
> > + read_lock(&invs->rwlock);
>
> Shouldn't these be read_lock_irqsave for all rwlock variants here?
> Invalidations might happen in IRQ context as well..
>
> > + __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
> > + read_unlock(&invs->rwlock);
It was kept from the older versions where we had a trylock. Jason
had an insight about this, mainly for less latency on invalidation
threads.
Yet, now we have plain locking. TBH, I can't find a good reason
justifying this. And it does look a bit unsafe to me. So, I think
I will just change to the _irqsave version. (Jason?)
Thanks
Nicolin
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-27 16:38 ` Nicolin Chen
@ 2026-01-27 17:08 ` Jason Gunthorpe
2026-01-27 18:07 ` Nicolin Chen
0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-27 17:08 UTC (permalink / raw)
To: Nicolin Chen
Cc: Pranjal Shrivastava, will, jean-philippe, robin.murphy, joro,
balbirs, miko.lenczewski, peterz, kevin.tian, linux-arm-kernel,
iommu, linux-kernel
On Tue, Jan 27, 2026 at 08:38:31AM -0800, Nicolin Chen wrote:
> Hi Pranjal,
>
> Sorry, I missed this!
>
> On Fri, Jan 23, 2026 at 09:48:37AM +0000, Pranjal Shrivastava wrote:
> > On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > > + /*
> > > + * Avoid locking unless ATS is being used. No ATC invalidation can be
> > > + * going on after a domain is detached.
> > > + */
> > > + if (invs->has_ats) {
> > > + read_lock(&invs->rwlock);
> >
> > Shouldn't these be read_lock_irqsave for all rwlock variants here?
> > Invalidations might happen in IRQ context as well..
> >
> > > + __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
> > > + read_unlock(&invs->rwlock);
>
> It was kept from the older versions where we had a trylock. Jason
> had an insight about this, mainly for less latency on invalidation
> threads.
>
> Yet, now we have a plain locking. TBH, I can't find a good reason
> justifying this. And it does look a bit unsafe to me. So, I think
> I will just change to the _irqsave version. (Jason?)
My understanding has been that this invalidation can run from an IRQ
context - we permit the use of the DMA API from an interrupt handler?
I thought that for rwsem the read side does not require the _irqsave,
even if it is in an irq context, unless the write side runs from an
IRQ.
Here the write side always runs from a process context.
So the write side will block the IRQ which ensures we don't spin
during read in an IRQ.
IOW I think this is OK? Pranjal do you know otherwise?
Jason
2026-01-27 17:08 ` Jason Gunthorpe
@ 2026-01-27 18:07 ` Nicolin Chen
2026-01-27 18:23 ` Jason Gunthorpe
0 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2026-01-27 18:07 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pranjal Shrivastava, will, jean-philippe, robin.murphy, joro,
balbirs, miko.lenczewski, peterz, kevin.tian, linux-arm-kernel,
iommu, linux-kernel
On Tue, Jan 27, 2026 at 01:08:37PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 27, 2026 at 08:38:31AM -0800, Nicolin Chen wrote:
> > Hi Pranjal,
> >
> > Sorry, I missed this!
> >
> > On Fri, Jan 23, 2026 at 09:48:37AM +0000, Pranjal Shrivastava wrote:
> > > On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > > > + /*
> > > > + * Avoid locking unless ATS is being used. No ATC invalidation can be
> > > > + * going on after a domain is detached.
> > > > + */
> > > > + if (invs->has_ats) {
> > > > + read_lock(&invs->rwlock);
> > >
> > > Shouldn't these be read_lock_irqsave for all rwlock variants here?
> > > Invalidations might happen in IRQ context as well..
> > >
> > > > + __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
> > > > + read_unlock(&invs->rwlock);
> >
> > It was kept from the older versions where we had a trylock. Jason
> > had an insight about this, mainly for less latency on invalidation
> > threads.
> >
> > Yet, now we have a plain locking. TBH, I can't find a good reason
> > justifying this. And it does look a bit unsafe to me. So, I think
> > I will just change to the _irqsave version. (Jason?)
>
> My understanding has been that this invalidation can run from an IRQ
> context - we permit the use of the DMA API from an interrupt handler?
>
> I though that for rwsem the read side does not require the _irqsave,
> even if it is in an irq context, unless the write side runs from an
> IRQ.
Hmm, is "rwsem" a typo? Because it's rwlock_t, which is spinlock :-/
> Here the write side always runs from a process context.
>
> So the write side will block the IRQ which ensures we don't spin
> during read in an IRQ.
And, does write_lock_irqsave() disable global IRQ or local IRQ only?
Documentation/locking/locktypes.rst mentions "local_irq_disable()"..
Thanks
Nicolin
2026-01-27 18:07 ` Nicolin Chen
@ 2026-01-27 18:23 ` Jason Gunthorpe
2026-01-27 18:37 ` Nicolin Chen
0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-27 18:23 UTC (permalink / raw)
To: Nicolin Chen
Cc: Pranjal Shrivastava, will, jean-philippe, robin.murphy, joro,
balbirs, miko.lenczewski, peterz, kevin.tian, linux-arm-kernel,
iommu, linux-kernel
On Tue, Jan 27, 2026 at 10:07:09AM -0800, Nicolin Chen wrote:
> > My understanding has been that this invalidation can run from an IRQ
> > context - we permit the use of the DMA API from an interrupt handler?
> >
> > I though that for rwsem the read side does not require the _irqsave,
> > even if it is in an irq context, unless the write side runs from an
> > IRQ.
>
> Hmm, is "rwsem" a typo? Because it's rwlock_t, which is spinlock :-/
Yeah, sorry
> > Here the write side always runs from a process context.
> >
> > So the write side will block the IRQ which ensures we don't spin
> > during read in an IRQ.
>
> And, does write_lock_irqsave() disable global IRQ or local IRQ only?
>
> Documentation/locking/locktypes.rst mentions "local_irq_disable()"..
It will only disable the local IRQ; since it is a spin-type lock, an IRQ
on another CPU can spin until it is unlocked.
The main issue is if this CPU takes an IRQ while the write side is
locked and spins, then it will never unlock.
Jason
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-27 18:23 ` Jason Gunthorpe
@ 2026-01-27 18:37 ` Nicolin Chen
2026-01-27 19:19 ` Jason Gunthorpe
0 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2026-01-27 18:37 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pranjal Shrivastava, will, jean-philippe, robin.murphy, joro,
balbirs, miko.lenczewski, peterz, kevin.tian, linux-arm-kernel,
iommu, linux-kernel
On Tue, Jan 27, 2026 at 02:23:48PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 27, 2026 at 10:07:09AM -0800, Nicolin Chen wrote:
> > > My understanding has been that this invalidation can run from an IRQ
> > > context - we permit the use of the DMA API from an interrupt handler?
> > >
> > > I though that for rwsem the read side does not require the _irqsave,
> > > even if it is in an irq context, unless the write side runs from an
> > > IRQ.
> >
> > Hmm, is "rwsem" a typo? Because it's rwlock_t, which is spinlock :-/
>
> Yeah, sorry
>
> > > Here the write side always runs from a process context.
> > >
> > > So the write side will block the IRQ which ensures we don't spin
> > > during read in an IRQ.
> >
> > And, does write_lock_irqsave() disable global IRQ or local IRQ only?
> >
> > Documentation/locking/locktypes.rst mentions "local_irq_disable()"..
>
> It will only disable the local IRQ, since it is a spin type lock an IRQ on
> another CPU can spin until it is unlocked.
>
> The main issue is if this CPU takes an IRQ while the write side is
> locked and spins, then it will never unlock.
Yea, that sounds unsafe. I'll send a v11 with read_lock_irqsave().
We can also pass iommu_domain into arm_smmu_attach_prepare_invs(),
which you pointed out in the follow-up series.
Thanks!
Nicolin
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-27 18:37 ` Nicolin Chen
@ 2026-01-27 19:19 ` Jason Gunthorpe
2026-01-27 20:14 ` Nicolin Chen
0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-27 19:19 UTC (permalink / raw)
To: Nicolin Chen
Cc: Pranjal Shrivastava, will, jean-philippe, robin.murphy, joro,
balbirs, miko.lenczewski, peterz, kevin.tian, linux-arm-kernel,
iommu, linux-kernel
On Tue, Jan 27, 2026 at 10:37:44AM -0800, Nicolin Chen wrote:
> On Tue, Jan 27, 2026 at 02:23:48PM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 27, 2026 at 10:07:09AM -0800, Nicolin Chen wrote:
> > > > My understanding has been that this invalidation can run from an IRQ
> > > > context - we permit the use of the DMA API from an interrupt handler?
> > > >
> > > > I though that for rwsem the read side does not require the _irqsave,
> > > > even if it is in an irq context, unless the write side runs from an
> > > > IRQ.
> > >
> > > Hmm, is "rwsem" a typo? Because it's rwlock_t, which is spinlock :-/
> >
> > Yeah, sorry
> >
> > > > Here the write side always runs from a process context.
> > > >
> > > > So the write side will block the IRQ which ensures we don't spin
> > > > during read in an IRQ.
> > >
> > > And, does write_lock_irqsave() disable global IRQ or local IRQ only?
> > >
> > > Documentation/locking/locktypes.rst mentions "local_irq_disable()"..
> >
> > It will only disable the local IRQ, since it is a spin type lock an IRQ on
> > another CPU can spin until it is unlocked.
> >
> > The main issue is if this CPU takes an IRQ while the write side is
> > locked and spins, then it will never unlock.
>
> Yea, that sounds unsafe. I'll send a v11 with read_lock_irqsave().
I'm explaining why it is safe now: the write side takes the irqsave, so
the above can't happen.
There is no case where the read side needs to block IRQs: if the read
side succeeds, and an IRQ then happens and tries to take another read
side, it will succeed rather than spin.
But it is a very very small optimization so let's just go ahead with
the thing that is obviously safe.
Jason
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-27 19:19 ` Jason Gunthorpe
@ 2026-01-27 20:14 ` Nicolin Chen
2026-01-28 0:05 ` Jason Gunthorpe
0 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2026-01-27 20:14 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pranjal Shrivastava, will, jean-philippe, robin.murphy, joro,
balbirs, miko.lenczewski, peterz, kevin.tian, linux-arm-kernel,
iommu, linux-kernel
On Tue, Jan 27, 2026 at 03:19:38PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 27, 2026 at 10:37:44AM -0800, Nicolin Chen wrote:
> > On Tue, Jan 27, 2026 at 02:23:48PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 27, 2026 at 10:07:09AM -0800, Nicolin Chen wrote:
> > > > > My understanding has been that this invalidation can run from an IRQ
> > > > > context - we permit the use of the DMA API from an interrupt handler?
> > > > >
> > > > > I though that for rwsem the read side does not require the _irqsave,
> > > > > even if it is in an irq context, unless the write side runs from an
> > > > > IRQ.
> > > >
> > > > Hmm, is "rwsem" a typo? Because it's rwlock_t, which is spinlock :-/
> > >
> > > Yeah, sorry
> > >
> > > > > Here the write side always runs from a process context.
> > > > >
> > > > > So the write side will block the IRQ which ensures we don't spin
> > > > > during read in an IRQ.
> > > >
> > > > And, does write_lock_irqsave() disable global IRQ or local IRQ only?
> > > >
> > > > Documentation/locking/locktypes.rst mentions "local_irq_disable()"..
> > >
> > > It will only disable the local IRQ, since it is a spin type lock an IRQ on
> > > another CPU can spin until it is unlocked.
> > >
> > > The main issue is if this CPU takes an IRQ while the write side is
> > > locked and spins, then it will never unlock.
> >
> > Yea, that sounds unsafe. I'll send a v11 with read_lock_irqsave().
>
> I'm explaining why it is safe now, the write side takes the irqsave so
> the above can't happen.
Sorry, I misunderstood..
> There is no case where the read side needs to block IRQ because if the
> read side succeeds, an IRQ happens and tries to take another read
> side, it will succeed not spin.
Yea, I also went a bit deeper.
It seems to depend on the CONFIG_QUEUED_RWLOCKS (ARM sets =y)
252 config QUEUED_RWLOCKS
253 def_bool y if ARCH_USE_QUEUED_RWLOCKS
254 depends on SMP && !PREEMPT_RT
where a reader will not get blocked in our particular use case:
21 void __lockfunc queued_read_lock_slowpath(struct qrwlock *lock)
22 {
23 /*
24 * Readers come here when they cannot get the lock without waiting
25 */
26 if (unlikely(in_interrupt())) {
27 /*
28 * Readers in interrupt context will get the lock immediately
29 * if the writer is just waiting (not holding the lock yet),
30 * so spin with ACQUIRE semantics until the lock is available
31 * without waiting in the queue.
32 */
33 atomic_cond_read_acquire(&lock->cnts, !(VAL & _QW_LOCKED));
34 return;
And I don't see any non-hackable way for CONFIG_QUEUED_RWLOCKS=n
unless CONFIG_PREEMPT_RT=y, which would be a different ball game
that I assume SMMUv3 might not be completely compatible with.
Thanks
Nicolin
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-27 20:14 ` Nicolin Chen
@ 2026-01-28 0:05 ` Jason Gunthorpe
0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-28 0:05 UTC (permalink / raw)
To: Nicolin Chen
Cc: Pranjal Shrivastava, will, jean-philippe, robin.murphy, joro,
balbirs, miko.lenczewski, peterz, kevin.tian, linux-arm-kernel,
iommu, linux-kernel
On Tue, Jan 27, 2026 at 12:14:37PM -0800, Nicolin Chen wrote:
> And I don't see any non-hackable way for CONFIG_QUEUED_RWLOCKS=n
> unless CONFIG_PREEMPT_RT=y, which would be a different ball game
> that I assume SMMUv3 might not be completely compatible with.
SMMUv3 should work with PREEMPT_RT=y
Preempt is a good point, I don't know what the rules are for preempt
at all. So it is OK to be conservative here, I doubt we could measure
the difference anyhow
Jason
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2025-12-19 20:11 ` [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range() Nicolin Chen
2026-01-23 9:48 ` Pranjal Shrivastava
@ 2026-01-23 17:05 ` Will Deacon
2026-01-23 17:10 ` Will Deacon
1 sibling, 1 reply; 45+ messages in thread
From: Will Deacon @ 2026-01-23 17:05 UTC (permalink / raw)
To: Nicolin Chen
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> +void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
> + unsigned long iova, size_t size,
> + unsigned int granule, bool leaf)
> +{
> + struct arm_smmu_invs *invs;
> +
> + /*
> + * An invalidation request must follow some IOPTE change and then load
> + * an invalidation array. In the meantime, a domain attachment mutates
> + * the array and then stores an STE/CD asking SMMU HW to acquire those
> + * changed IOPTEs. In other word, these two are interdependent and can
> + * race.
> + *
> + * In a race, the RCU design (with its underlying memory barriers) can
> + * ensure the invalidation array to always get updated before loaded.
> + *
> + * smp_mb() is used here, paired with the smp_mb() following the array
> + * update in a concurrent attach, to ensure:
> + * - HW sees the new IOPTEs if it walks after STE installation
> + * - Invalidation thread sees the updated array with the new ASID.
> + *
> + * [CPU0] | [CPU1]
> + * |
> + * change IOPTEs and TLB flush: |
> + * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs {
> + * ... | rcu_assign_pointer(new_invs);
> + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> + * ... | kfree_rcu(old_invs, rcu);
> + * // load invalidation array | }
> + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> + * | STE = TTB0 // read new IOPTEs
> + */
> + smp_mb();
> +
> + rcu_read_lock();
> + invs = rcu_dereference(smmu_domain->invs);
> +
> + /*
> + * Avoid locking unless ATS is being used. No ATC invalidation can be
> + * going on after a domain is detached.
> + */
> + if (invs->has_ats) {
> + read_lock(&invs->rwlock);
> + __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
> + read_unlock(&invs->rwlock);
> + } else {
> + __arm_smmu_domain_inv_range(invs, iova, size, granule, leaf);
> + }
> +
> + rcu_read_unlock();
> +}
> +
> static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
> unsigned long iova, size_t granule,
> void *cookie)
> @@ -3285,6 +3483,23 @@ arm_smmu_install_new_domain_invs(struct arm_smmu_attach_state *state)
> return;
>
> rcu_assign_pointer(*invst->invs_ptr, invst->new_invs);
> + /*
> + * We are committed to updating the STE. Ensure the invalidation array
> + * is visible to concurrent map/unmap threads, and acquire any racing
> + * IOPTE updates.
> + *
> + * [CPU0] | [CPU1]
> + * |
> + * change IOPTEs and TLB flush: |
> + * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs {
> + * ... | rcu_assign_pointer(new_invs);
> + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> + * ... | kfree_rcu(old_invs, rcu);
> + * // load invalidation array | }
> + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> + * | STE = TTB0 // read new IOPTEs
> + */
> + smp_mb();
> kfree_rcu(invst->old_invs, rcu);
> }
>
> @@ -3334,6 +3549,23 @@ arm_smmu_install_old_domain_invs(struct arm_smmu_attach_state *state)
> return;
>
> rcu_assign_pointer(*invst->invs_ptr, new_invs);
> + /*
> + * We are committed to updating the STE. Ensure the invalidation array
> + * is visible to concurrent map/unmap threads, and acquire any racing
> + * IOPTE updates.
> + *
> + * [CPU0] | [CPU1]
> + * |
> + * change IOPTEs and TLB flush: |
> + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> + * ... | rcu_assign_pointer(new_invs);
> + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> + * ... | kfree_rcu(old_invs, rcu);
> + * // load invalidation array | }
> + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> + * | STE = TTB0 // read new IOPTEs
> + */
> + smp_mb();
I don't think we need to duplicate this comment three times, you can just
refer to the first function (e.g. "See ordering comment in
arm_smmu_domain_inv_range()").
However, isn't the comment above misleading for this case?
arm_smmu_install_old_domain_invs() has the sequencing the other way
around on CPU 1: we should update the STE first.
Will
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-23 17:05 ` Will Deacon
@ 2026-01-23 17:10 ` Will Deacon
2026-01-23 17:43 ` Nicolin Chen
2026-01-23 20:03 ` Jason Gunthorpe
0 siblings, 2 replies; 45+ messages in thread
From: Will Deacon @ 2026-01-23 17:10 UTC (permalink / raw)
To: Nicolin Chen
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
On Fri, Jan 23, 2026 at 05:05:31PM +0000, Will Deacon wrote:
> On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > + /*
> > + * We are committed to updating the STE. Ensure the invalidation array
> > + * is visible to concurrent map/unmap threads, and acquire any racing
> > + * IOPTE updates.
> > + *
> > + * [CPU0] | [CPU1]
> > + * |
> > + * change IOPTEs and TLB flush: |
> > + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> > + * ... | rcu_assign_pointer(new_invs);
> > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> > + * ... | kfree_rcu(old_invs, rcu);
> > + * // load invalidation array | }
> > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> > + * | STE = TTB0 // read new IOPTEs
> > + */
> > + smp_mb();
>
> I don't think we need to duplicate this comment three times, you can just
> refer to the first function (e.g. "See ordering comment in
> arm_smmu_domain_inv_range()").
>
> However, isn't the comment above misleading for this case?
> arm_smmu_install_old_domain_invs() has the sequencing the other way
> around on CPU 1: we should update the STE first.
I also think we probably want a dma_mb() instead of an smp_mb() for all
of these examples? It won't make any practical difference but I think it
helps readability given that one of the readers is the PTW.
Will
2026-01-23 17:10 ` Will Deacon
@ 2026-01-23 17:43 ` Nicolin Chen
2026-01-23 20:03 ` Jason Gunthorpe
1 sibling, 0 replies; 45+ messages in thread
From: Nicolin Chen @ 2026-01-23 17:43 UTC (permalink / raw)
To: Will Deacon
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
On Fri, Jan 23, 2026 at 05:10:52PM +0000, Will Deacon wrote:
> On Fri, Jan 23, 2026 at 05:05:31PM +0000, Will Deacon wrote:
> > On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > > + /*
> > > + * We are committed to updating the STE. Ensure the invalidation array
> > > + * is visible to concurrent map/unmap threads, and acquire any racing
> > > + * IOPTE updates.
> > > + *
> > > + * [CPU0] | [CPU1]
> > > + * |
> > > + * change IOPTEs and TLB flush: |
> > > + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> > > + * ... | rcu_assign_pointer(new_invs);
> > > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> > > + * ... | kfree_rcu(old_invs, rcu);
> > > + * // load invalidation array | }
> > > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> > > + * | STE = TTB0 // read new IOPTEs
> > > + */
> > > + smp_mb();
> >
> > I don't think we need to duplicate this comment three times, you can just
> > refer to the first function (e.g. "See ordering comment in
> > arm_smmu_domain_inv_range()").
I'll drop the duplicates.
> > However, isn't the comment above misleading for this case?
> > arm_smmu_install_old_domain_invs() has the sequencing the other way
> > around on CPU 1: we should update the STE first.
Ah, that's true. It installs new domain invs first then ste, and
lastly updates the old domain invs.
> I also think we probably want a dma_mb() instead of an smp_mb() for all
> of these examples? It won't make any practical difference but I think it
> helps readability given that one of the readers is the PTW.
OK. I'll change to dma_mb().
Nicolin
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-23 17:10 ` Will Deacon
2026-01-23 17:43 ` Nicolin Chen
@ 2026-01-23 20:03 ` Jason Gunthorpe
2026-01-26 13:01 ` Will Deacon
1 sibling, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-23 20:03 UTC (permalink / raw)
To: Will Deacon
Cc: Nicolin Chen, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Fri, Jan 23, 2026 at 05:10:52PM +0000, Will Deacon wrote:
> On Fri, Jan 23, 2026 at 05:05:31PM +0000, Will Deacon wrote:
> > On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > > + /*
> > > + * We are committed to updating the STE. Ensure the invalidation array
> > > + * is visible to concurrent map/unmap threads, and acquire any racing
> > > + * IOPTE updates.
> > > + *
> > > + * [CPU0] | [CPU1]
> > > + * |
> > > + * change IOPTEs and TLB flush: |
> > > + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> > > + * ... | rcu_assign_pointer(new_invs);
> > > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> > > + * ... | kfree_rcu(old_invs, rcu);
> > > + * // load invalidation array | }
> > > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> > > + * | STE = TTB0 // read new IOPTEs
> > > + */
> > > + smp_mb();
> >
> > I don't think we need to duplicate this comment three times, you can just
> > refer to the first function (e.g. "See ordering comment in
> > arm_smmu_domain_inv_range()").
> >
> > However, isn't the comment above misleading for this case?
> > arm_smmu_install_old_domain_invs() has the sequencing the other way
> > around on CPU 1: we should update the STE first.
>
> I also think we probably want a dma_mb() instead of an smp_mb() for all
> of these examples? It won't make any practical difference but I think it
> helps readability given that one of the readers is the PTW.
The only actual dma_wmb() is inside arm_smmu_install_ste_for_dev()
after updating the STE. Adding that line explicitly would help as that
is the only point where we must have the writes actually visible to
the DMA HW.
The ones written here as smp_mb() are not required to be DMA ones and
could all be NOPs on UP.
Jason
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-23 20:03 ` Jason Gunthorpe
@ 2026-01-26 13:01 ` Will Deacon
2026-01-26 15:20 ` Jason Gunthorpe
0 siblings, 1 reply; 45+ messages in thread
From: Will Deacon @ 2026-01-26 13:01 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Fri, Jan 23, 2026 at 04:03:27PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 23, 2026 at 05:10:52PM +0000, Will Deacon wrote:
> > On Fri, Jan 23, 2026 at 05:05:31PM +0000, Will Deacon wrote:
> > > On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > > > + /*
> > > > + * We are committed to updating the STE. Ensure the invalidation array
> > > > + * is visible to concurrent map/unmap threads, and acquire any racing
> > > > + * IOPTE updates.
> > > > + *
> > > > + * [CPU0] | [CPU1]
> > > > + * |
> > > > + * change IOPTEs and TLB flush: |
> > > > + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> > > > + * ... | rcu_assign_pointer(new_invs);
> > > > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> > > > + * ... | kfree_rcu(old_invs, rcu);
> > > > + * // load invalidation array | }
> > > > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> > > > + * | STE = TTB0 // read new IOPTEs
> > > > + */
> > > > + smp_mb();
> > >
> > > I don't think we need to duplicate this comment three times, you can just
> > > refer to the first function (e.g. "See ordering comment in
> > > arm_smmu_domain_inv_range()").
> > >
> > > However, isn't the comment above misleading for this case?
> > > arm_smmu_install_old_domain_invs() has the sequencing the other way
> > > around on CPU 1: we should update the STE first.
> >
> > I also think we probably want a dma_mb() instead of an smp_mb() for all
> > of these examples? It won't make any practical difference but I think it
> > helps readability given that one of the readers is the PTW.
>
> The only actual dma_wmb() is inside arm_smmu_install_ste_for_dev()
> after updating the STE. Adding that line explicitly would help as that
> is the only point where we must have the writes actually visible to
> the DMA HW.
>
> The ones written here as smp_mb() are not required to be DMA ones and
> > could all be NOPs on UP.
Hmm, I'm not sure about that.
If we've written a new (i.e. previously invalid) valid PTE to a
page-table and then we install that page-table into an STE hitlessly
(let's say we write the S2TTB field) then isn't there a window before we
do the STE invalidation where the page-table might be accessible to the
SMMU but the new PTE is still sitting in the CPU?
i.e. we can't rely on the command insertion barrier for that.
Will
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-26 13:01 ` Will Deacon
@ 2026-01-26 15:20 ` Jason Gunthorpe
2026-01-26 16:02 ` Will Deacon
2026-01-26 17:50 ` Nicolin Chen
0 siblings, 2 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-26 15:20 UTC (permalink / raw)
To: Will Deacon
Cc: Nicolin Chen, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Mon, Jan 26, 2026 at 01:01:16PM +0000, Will Deacon wrote:
> On Fri, Jan 23, 2026 at 04:03:27PM -0400, Jason Gunthorpe wrote:
> > On Fri, Jan 23, 2026 at 05:10:52PM +0000, Will Deacon wrote:
> > > On Fri, Jan 23, 2026 at 05:05:31PM +0000, Will Deacon wrote:
> > > > On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > > > > + /*
> > > > > + * We are committed to updating the STE. Ensure the invalidation array
> > > > > + * is visible to concurrent map/unmap threads, and acquire any racing
> > > > > + * IOPTE updates.
> > > > > + *
> > > > > + * [CPU0] | [CPU1]
> > > > > + * |
> > > > > + * change IOPTEs and TLB flush: |
> > > > > + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> > > > > + * ... | rcu_assign_pointer(new_invs);
> > > > > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> > > > > + * ... | kfree_rcu(old_invs, rcu);
> > > > > + * // load invalidation array | }
> > > > > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> > > > > + * | STE = TTB0 // read new IOPTEs
> > > > > + */
> > > > > + smp_mb();
> > > >
> > > > I don't think we need to duplicate this comment three times, you can just
> > > > refer to the first function (e.g. "See ordering comment in
> > > > arm_smmu_domain_inv_range()").
> > > >
> > > > However, isn't the comment above misleading for this case?
> > > > arm_smmu_install_old_domain_invs() has the sequencing the other way
> > > > around on CPU 1: we should update the STE first.
> > >
> > > I also think we probably want a dma_mb() instead of an smp_mb() for all
> > > of these examples? It won't make any practical difference but I think it
> > > helps readability given that one of the readers is the PTW.
> >
> > The only actual dma_wmb() is inside arm_smmu_install_ste_for_dev()
> > after updating the STE. Adding that line explicitly would help as that
> > is the only point where we must have the writes actually visible to
> > the DMA HW.
> >
> > The ones written here as smp_mb() are not required to be DMA ones and
> > > could all be NOPs on UP.
>
> Hmm, I'm not sure about that.
>
> If we've written a new (i.e. previously invalid) valid PTE to a
> page-table and then we install that page-table into an STE hitlessly
> (let's say we write the S2TTB field) then isn't there a window before we
> do the STE invalidation where the page-table might be accessible to the
> SMMU but the new PTE is still sitting in the CPU?
Hmm! Yes seems like it.
However, that seems like a general bug: if we allocate an
iommu_domain and immediately hitlessly install it, then there would be
no dma_wmb() for the page table memory prior to the earliest point the
HW is able to read the STE.
What I wrote is how things are intended to work, so let's fix it
with this?
@@ -1173,6 +1173,13 @@ void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer, __le64 *entry,
__le64 unused_update[NUM_ENTRY_QWORDS];
u8 used_qword_diff;
+ /*
+ * Many of the entry structures have pointers to other structures that
+ * need to have their updates be visible before any writes of the entry
+ * happen.
+ */
+ dma_wmb();
+
used_qword_diff =
arm_smmu_entry_qword_diff(writer, entry, target, unused_update);
if (hweight8(used_qword_diff) == 1) {
Jason
^ permalink raw reply [flat|nested] 45+ messages in thread* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-26 15:20 ` Jason Gunthorpe
@ 2026-01-26 16:02 ` Will Deacon
2026-01-26 16:09 ` Jason Gunthorpe
2026-01-26 17:50 ` Nicolin Chen
1 sibling, 1 reply; 45+ messages in thread
From: Will Deacon @ 2026-01-26 16:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Mon, Jan 26, 2026 at 11:20:19AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 26, 2026 at 01:01:16PM +0000, Will Deacon wrote:
> > On Fri, Jan 23, 2026 at 04:03:27PM -0400, Jason Gunthorpe wrote:
> > > On Fri, Jan 23, 2026 at 05:10:52PM +0000, Will Deacon wrote:
> > > > On Fri, Jan 23, 2026 at 05:05:31PM +0000, Will Deacon wrote:
> > > > > On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > > > > > + /*
> > > > > > + * We are committed to updating the STE. Ensure the invalidation array
> > > > > > + * is visible to concurrent map/unmap threads, and acquire any racing
> > > > > > + * IOPTE updates.
> > > > > > + *
> > > > > > + * [CPU0] | [CPU1]
> > > > > > + * |
> > > > > > + * change IOPTEs and TLB flush: |
> > > > > > + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> > > > > > + * ... | rcu_assign_pointer(new_invs);
> > > > > > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> > > > > > + * ... | kfree_rcu(old_invs, rcu);
> > > > > > + * // load invalidation array | }
> > > > > > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> > > > > > + * | STE = TTB0 // read new IOPTEs
> > > > > > + */
> > > > > > + smp_mb();
> > > > >
> > > > > I don't think we need to duplicate this comment three times, you can just
> > > > > refer to the first function (e.g. "See ordering comment in
> > > > > arm_smmu_domain_inv_range()").
> > > > >
> > > > > However, isn't the comment above misleading for this case?
> > > > > arm_smmu_install_old_domain_invs() has the sequencing the other way
> > > > > around on CPU 1: we should update the STE first.
> > > >
> > > > I also think we probably want a dma_mb() instead of an smp_mb() for all
> > > > of these examples? It won't make any practical difference but I think it
> > > > helps readability given that one of the readers is the PTW.
> > >
> > > The only actual dma_wmb() is inside arm_smmu_install_ste_for_dev()
> > > after updating the STE. Adding that line explicitly would help as that
> > > is the only point where we must have the writes actually visible to
> > > the DMA HW.
> > >
> > > The ones written here as smp_mb() are not required to be DMA ones and
> > > could all be NOPs on UP.
> >
> > Hmm, I'm not sure about that.
> >
> > If we've written a new (i.e. previously invalid) valid PTE to a
> > page-table and then we install that page-table into an STE hitlessly
> > (let's say we write the S2TTB field) then isn't there a window before we
> > do the STE invalidation where the page-table might be accessible to the
> > SMMU but the new PTE is still sitting in the CPU?
>
> Hmm! Yes seems like it.
>
> However, that seems like a general bug: if we allocate an
> iommu_domain and immediately hitlessly install it, then there would be
> no dma_wmb() for the page table memory prior to the earliest point the
> HW is able to read the STE.
>
> What I wrote is how things are intended to work, so let's fix it
> with this?
>
> @@ -1173,6 +1173,13 @@ void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer, __le64 *entry,
> __le64 unused_update[NUM_ENTRY_QWORDS];
> u8 used_qword_diff;
>
> + /*
> + * Many of the entry structures have pointers to other structures that
> + * need to have their updates be visible before any writes of the entry
> + * happen.
> + */
> + dma_wmb();
If we do that, can we drop the smp_mb()s from
arm_smmu_install_{old,new}_domain_invs()?
Then the barrier in arm_smmu_domain_inv_range() is kind of irritating
but it demonstrates the need for additional ordering (beyond what we
get from the pgtable code) to ensure that a prior update to the
page-table is ordered before the RCU-protected read of the invalidation
array pointer. So that becomes the special case, with everybody else
relying on a dma_wmb() from either the cmdq insertion or the CD/STE
update.
Will
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-26 16:02 ` Will Deacon
@ 2026-01-26 16:09 ` Jason Gunthorpe
2026-01-26 18:56 ` Will Deacon
0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-26 16:09 UTC (permalink / raw)
To: Will Deacon
Cc: Nicolin Chen, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Mon, Jan 26, 2026 at 04:02:19PM +0000, Will Deacon wrote:
> On Mon, Jan 26, 2026 at 11:20:19AM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 26, 2026 at 01:01:16PM +0000, Will Deacon wrote:
> > > On Fri, Jan 23, 2026 at 04:03:27PM -0400, Jason Gunthorpe wrote:
> > > > On Fri, Jan 23, 2026 at 05:10:52PM +0000, Will Deacon wrote:
> > > > > On Fri, Jan 23, 2026 at 05:05:31PM +0000, Will Deacon wrote:
> > > > > > On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > > > > > > + /*
> > > > > > > + * We are committed to updating the STE. Ensure the invalidation array
> > > > > > > + * is visible to concurrent map/unmap threads, and acquire any racing
> > > > > > > + * IOPTE updates.
> > > > > > > + *
> > > > > > > + * [CPU0] | [CPU1]
> > > > > > > + * |
> > > > > > > + * change IOPTEs and TLB flush: |
> > > > > > > + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> > > > > > > + * ... | rcu_assign_pointer(new_invs);
> > > > > > > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> > > > > > > + * ... | kfree_rcu(old_invs, rcu);
> > > > > > > + * // load invalidation array | }
> > > > > > > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> > > > > > > + * | STE = TTB0 // read new IOPTEs
> > > > > > > + */
> > > > > > > + smp_mb();
> If we do that, can we drop the smp_mb()s from
> arm_smmu_install_{old,new}_domain_invs()?
I suppose so, but domain attach isn't a performance path so it depends
on your preference for strict pairing of barriers. Currently the two
smp_mb()s are paired. Can we reliably pair smp_mb() with dma_wmb()?
Are you happy with that clarity?
My view is attach isn't a performance path, so having extra barriers
is fine if it helps understandability.
Jason
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-26 16:09 ` Jason Gunthorpe
@ 2026-01-26 18:56 ` Will Deacon
2026-01-27 3:14 ` Nicolin Chen
0 siblings, 1 reply; 45+ messages in thread
From: Will Deacon @ 2026-01-26 18:56 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Mon, Jan 26, 2026 at 12:09:40PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 26, 2026 at 04:02:19PM +0000, Will Deacon wrote:
> > On Mon, Jan 26, 2026 at 11:20:19AM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 26, 2026 at 01:01:16PM +0000, Will Deacon wrote:
> > > > On Fri, Jan 23, 2026 at 04:03:27PM -0400, Jason Gunthorpe wrote:
> > > > > On Fri, Jan 23, 2026 at 05:10:52PM +0000, Will Deacon wrote:
> > > > > > On Fri, Jan 23, 2026 at 05:05:31PM +0000, Will Deacon wrote:
> > > > > > > On Fri, Dec 19, 2025 at 12:11:28PM -0800, Nicolin Chen wrote:
> > > > > > > > + /*
> > > > > > > > + * We are committed to updating the STE. Ensure the invalidation array
> > > > > > > > + * is visible to concurrent map/unmap threads, and acquire any racing
> > > > > > > > + * IOPTE updates.
> > > > > > > > + *
> > > > > > > > + * [CPU0] | [CPU1]
> > > > > > > > + * |
> > > > > > > > + * change IOPTEs and TLB flush: |
> > > > > > > > + * arm_smmu_domain_inv_range() { | arm_smmu_install_old_domain_invs {
> > > > > > > > + * ... | rcu_assign_pointer(new_invs);
> > > > > > > > + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> > > > > > > > + * ... | kfree_rcu(old_invs, rcu);
> > > > > > > > + * // load invalidation array | }
> > > > > > > > + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> > > > > > > > + * | STE = TTB0 // read new IOPTEs
> > > > > > > > + */
> > > > > > > > + smp_mb();
>
> > If we do that, can we drop the smp_mb()s from
> > arm_smmu_install_{old,new}_domain_invs()?
>
> I suppose so, but domain attach isn't a performance path so it depends
> on your preference for strict pairing of barriers. Currently the two
> smp_mb()s are paired. Can we reliably pair smp_mb() with dma_wmb()?
> Are you happy with that clarity?
Yeah, I think that's ok.
> My view is attach isn't a performance path, so having extra barriers
> is fine if it helps understandability.
I think that the more barriers we have, the harder the code is to
understand so I would prefer just to have the smp_mb() for the
write->read case in arm_smmu_domain_inv_range() along with a comment
explaining why it's needed.
I think that also means that my concern about the old comments on the
other patch largely disappears.
Will
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-26 18:56 ` Will Deacon
@ 2026-01-27 3:14 ` Nicolin Chen
0 siblings, 0 replies; 45+ messages in thread
From: Nicolin Chen @ 2026-01-27 3:14 UTC (permalink / raw)
To: Will Deacon
Cc: Jason Gunthorpe, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Mon, Jan 26, 2026 at 06:56:34PM +0000, Will Deacon wrote:
> On Mon, Jan 26, 2026 at 12:09:40PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 26, 2026 at 04:02:19PM +0000, Will Deacon wrote:
> > > If we do that, can we drop the smp_mb()s from
> > > arm_smmu_install_{old,new}_domain_invs()?
> >
> > I suppose so, but domain attach isn't a performance path so it depends
> > on your preference for strict pairing of barriers. Currently the two
> > smp_mbs() are paired. Can we reliably pair smp_mb() with dma_wmb()?
> > Are you happy with that clarity?
>
> Yeah, I think that's ok.
>
> > My view is attach isn't a performance path, so having extra barriers
> > is fine if it helps understandability.
>
> I think that the more barriers we have, the harder the code is to
> understand so I would prefer just to have the smp_mb() for the
> write->read case in arm_smmu_domain_inv_range() along with a comment
> explaining why it's needed.
>
> I think that also means that my concern about the old comments on the
> other patch largely disappears.
I sent v10. Thanks for the insightful reviews!
Nicolin
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2026-01-26 15:20 ` Jason Gunthorpe
2026-01-26 16:02 ` Will Deacon
@ 2026-01-26 17:50 ` Nicolin Chen
1 sibling, 0 replies; 45+ messages in thread
From: Nicolin Chen @ 2026-01-26 17:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Will Deacon, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Mon, Jan 26, 2026 at 11:20:19AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 26, 2026 at 01:01:16PM +0000, Will Deacon wrote:
> > If we've written a new (i.e. previously invalid) valid PTE to a
> > page-table and then we install that page-table into an STE hitlessly
> > (let's say we write the S2TTB field) then isn't there a window before we
> > do the STE invalidation where the page-table might be accessible to the
> > SMMU but the new PTE is still sitting in the CPU?
>
> Hmm! Yes seems like it.
>
> However, that seems like a general bug: if we allocate an
> iommu_domain and immediately hitlessly install it, then there would be
> no dma_wmb() for the page table memory prior to the earliest point the
> HW is able to read the STE.
>
> What I wrote is how things are intended to work, so let's fix it
> with this?
>
> @@ -1173,6 +1173,13 @@ void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer, __le64 *entry,
> __le64 unused_update[NUM_ENTRY_QWORDS];
> u8 used_qword_diff;
>
> + /*
> + * Many of the entry structures have pointers to other structures that
> + * need to have their updates be visible before any writes of the entry
> + * happen.
> + */
> + dma_wmb();
> +
> used_qword_diff =
> arm_smmu_entry_qword_diff(writer, entry, target, unused_update);
> if (hweight8(used_qword_diff) == 1) {
I will attach this patch as PATCH-1 in v10.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 45+ messages in thread
* [PATCH v9 7/7] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs
2025-12-19 20:11 [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (5 preceding siblings ...)
2025-12-19 20:11 ` [PATCH v9 6/7] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range() Nicolin Chen
@ 2025-12-19 20:11 ` Nicolin Chen
2026-01-23 17:07 ` Will Deacon
2026-01-19 17:10 ` [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
7 siblings, 1 reply; 45+ messages in thread
From: Nicolin Chen @ 2025-12-19 20:11 UTC (permalink / raw)
To: will
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
Replace the old invalidation functions with arm_smmu_domain_inv_range() in
all the existing invalidation routines. And deprecate the old functions.
The new arm_smmu_domain_inv_range() handles the CMDQ_MAX_TLBI_OPS as well,
so drop it in the SVA function.
Since arm_smmu_cmdq_batch_add_range() has only one caller now, and it must
be given a valid size, add a WARN_ON_ONCE to catch any missed case.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 -
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 29 +--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 165 +-----------------
3 files changed, 11 insertions(+), 190 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index c3fee7f14480..4f104c1baa67 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1079,13 +1079,6 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
struct arm_smmu_cd *cd, struct iommu_domain *old);
-void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid);
-void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
- size_t granule, bool leaf,
- struct arm_smmu_domain *smmu_domain);
-int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
- unsigned long iova, size_t size);
-
void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
unsigned long iova, size_t size,
unsigned int granule, bool leaf);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 440ad8cc07de..f1f8e01a7e91 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -122,15 +122,6 @@ void arm_smmu_make_sva_cd(struct arm_smmu_cd *target,
}
EXPORT_SYMBOL_IF_KUNIT(arm_smmu_make_sva_cd);
-/*
- * Cloned from the MAX_TLBI_OPS in arch/arm64/include/asm/tlbflush.h, this
- * is used as a threshold to replace per-page TLBI commands to issue in the
- * command queue with an address-space TLBI command, when SMMU w/o a range
- * invalidation feature handles too many per-page TLBI commands, which will
- * otherwise result in a soft lockup.
- */
-#define CMDQ_MAX_TLBI_OPS (1 << (PAGE_SHIFT - 3))
-
static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
@@ -146,21 +137,8 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
* range. So do a simple translation here by calculating size correctly.
*/
size = end - start;
- if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
- if (size >= CMDQ_MAX_TLBI_OPS * PAGE_SIZE)
- size = 0;
- } else {
- if (size == ULONG_MAX)
- size = 0;
- }
-
- if (!size)
- arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
- else
- arm_smmu_tlb_inv_range_asid(start, size, smmu_domain->cd.asid,
- PAGE_SIZE, false, smmu_domain);
- arm_smmu_atc_inv_domain(smmu_domain, start, size);
+ arm_smmu_domain_inv_range(smmu_domain, start, size, PAGE_SIZE, false);
}
static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
@@ -191,8 +169,7 @@ static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
}
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
- arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
- arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
+ arm_smmu_domain_inv(smmu_domain);
}
static void arm_smmu_mmu_notifier_free(struct mmu_notifier *mn)
@@ -302,7 +279,7 @@ static void arm_smmu_sva_domain_free(struct iommu_domain *domain)
/*
* Ensure the ASID is empty in the iommu cache before allowing reuse.
*/
- arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
+ arm_smmu_domain_inv(smmu_domain);
/*
* Notice that the arm_smmu_mm_arch_invalidate_secondary_tlbs op can
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6e1082e6d164..9912262a0e3b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1266,16 +1266,6 @@ struct arm_smmu_invs *arm_smmu_invs_purge(struct arm_smmu_invs *invs)
EXPORT_SYMBOL_IF_KUNIT(arm_smmu_invs_purge);
/* Context descriptor manipulation functions */
-void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
-{
- struct arm_smmu_cmdq_ent cmd = {
- .opcode = smmu->features & ARM_SMMU_FEAT_E2H ?
- CMDQ_OP_TLBI_EL2_ASID : CMDQ_OP_TLBI_NH_ASID,
- .tlbi.asid = asid,
- };
-
- arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
-}
/*
* Based on the value of ent report which bits of the STE the HW will access. It
@@ -2430,74 +2420,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
}
-int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
- unsigned long iova, size_t size)
-{
- struct arm_smmu_master_domain *master_domain;
- int i;
- unsigned long flags;
- struct arm_smmu_cmdq_ent cmd = {
- .opcode = CMDQ_OP_ATC_INV,
- };
- struct arm_smmu_cmdq_batch cmds;
-
- if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
- return 0;
-
- /*
- * Ensure that we've completed prior invalidation of the main TLBs
- * before we read 'nr_ats_masters' in case of a concurrent call to
- * arm_smmu_enable_ats():
- *
- * // unmap() // arm_smmu_enable_ats()
- * TLBI+SYNC atomic_inc(&nr_ats_masters);
- * smp_mb(); [...]
- * atomic_read(&nr_ats_masters); pci_enable_ats() // writel()
- *
- * Ensures that we always see the incremented 'nr_ats_masters' count if
- * ATS was enabled at the PCI device before completion of the TLBI.
- */
- smp_mb();
- if (!atomic_read(&smmu_domain->nr_ats_masters))
- return 0;
-
- arm_smmu_cmdq_batch_init(smmu_domain->smmu, &cmds, &cmd);
-
- spin_lock_irqsave(&smmu_domain->devices_lock, flags);
- list_for_each_entry(master_domain, &smmu_domain->devices,
- devices_elm) {
- struct arm_smmu_master *master = master_domain->master;
-
- if (!master->ats_enabled)
- continue;
-
- if (master_domain->nested_ats_flush) {
- /*
- * If a S2 used as a nesting parent is changed we have
- * no option but to completely flush the ATC.
- */
- arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
- } else {
- arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size,
- &cmd);
- }
-
- for (i = 0; i < master->num_streams; i++) {
- cmd.atc.sid = master->streams[i].id;
- arm_smmu_cmdq_batch_add(smmu_domain->smmu, &cmds, &cmd);
- }
- }
- spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
-
- return arm_smmu_cmdq_batch_submit(smmu_domain->smmu, &cmds);
-}
-
/* IO_PGTABLE API */
static void arm_smmu_tlb_inv_context(void *cookie)
{
struct arm_smmu_domain *smmu_domain = cookie;
- struct arm_smmu_device *smmu = smmu_domain->smmu;
- struct arm_smmu_cmdq_ent cmd;
/*
* NOTE: when io-pgtable is in non-strict mode, we may get here with
@@ -2506,14 +2432,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
* insertion to guarantee those are observed before the TLBI. Do be
* careful, 007.
*/
- if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
- arm_smmu_tlb_inv_asid(smmu, smmu_domain->cd.asid);
- } else {
- cmd.opcode = CMDQ_OP_TLBI_S12_VMALL;
- cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
- arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
- }
- arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
+ arm_smmu_domain_inv(smmu_domain);
}
static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu,
@@ -2525,7 +2444,7 @@ static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu,
unsigned long end = iova + size, num_pages = 0, tg = pgsize;
size_t inv_range = granule;
- if (!size)
+ if (WARN_ON_ONCE(!size))
return;
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
@@ -2580,76 +2499,6 @@ static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu,
}
}
-static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
- unsigned long iova, size_t size,
- size_t granule,
- struct arm_smmu_domain *smmu_domain)
-{
- struct arm_smmu_device *smmu = smmu_domain->smmu;
- struct arm_smmu_cmdq_batch cmds;
- size_t pgsize;
-
- /* Get the leaf page size */
- pgsize = __ffs(smmu_domain->domain.pgsize_bitmap);
-
- arm_smmu_cmdq_batch_init(smmu, &cmds, cmd);
- arm_smmu_cmdq_batch_add_range(smmu, &cmds, cmd, iova, size, granule,
- pgsize);
- arm_smmu_cmdq_batch_submit(smmu, &cmds);
-}
-
-static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
- size_t granule, bool leaf,
- struct arm_smmu_domain *smmu_domain)
-{
- struct arm_smmu_cmdq_ent cmd = {
- .tlbi = {
- .leaf = leaf,
- },
- };
-
- if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
- cmd.opcode = smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
- CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
- cmd.tlbi.asid = smmu_domain->cd.asid;
- } else {
- cmd.opcode = CMDQ_OP_TLBI_S2_IPA;
- cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
- }
- __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
-
- if (smmu_domain->nest_parent) {
- /*
- * When the S2 domain changes all the nested S1 ASIDs have to be
- * flushed too.
- */
- cmd.opcode = CMDQ_OP_TLBI_NH_ALL;
- arm_smmu_cmdq_issue_cmd_with_sync(smmu_domain->smmu, &cmd);
- }
-
- /*
- * Unfortunately, this can't be leaf-only since we may have
- * zapped an entire table.
- */
- arm_smmu_atc_inv_domain(smmu_domain, iova, size);
-}
-
-void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
- size_t granule, bool leaf,
- struct arm_smmu_domain *smmu_domain)
-{
- struct arm_smmu_cmdq_ent cmd = {
- .opcode = smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
- CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
- .tlbi = {
- .asid = asid,
- .leaf = leaf,
- },
- };
-
- __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
-}
-
static bool arm_smmu_inv_size_too_big(struct arm_smmu_device *smmu, size_t size,
size_t granule)
{
@@ -2850,7 +2699,9 @@ static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
static void arm_smmu_tlb_inv_walk(unsigned long iova, size_t size,
size_t granule, void *cookie)
{
- arm_smmu_tlb_inv_range_domain(iova, size, granule, false, cookie);
+ struct arm_smmu_domain *smmu_domain = cookie;
+
+ arm_smmu_domain_inv_range(smmu_domain, iova, size, granule, false);
}
static const struct iommu_flush_ops arm_smmu_flush_ops = {
@@ -4150,9 +4001,9 @@ static void arm_smmu_iotlb_sync(struct iommu_domain *domain,
if (!gather->pgsize)
return;
- arm_smmu_tlb_inv_range_domain(gather->start,
- gather->end - gather->start + 1,
- gather->pgsize, true, smmu_domain);
+ arm_smmu_domain_inv_range(smmu_domain, gather->start,
+ gather->end - gather->start + 1,
+ gather->pgsize, true);
}
static phys_addr_t
--
2.43.0
^ permalink raw reply related [flat|nested] 45+ messages in thread
* Re: [PATCH v9 7/7] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs
2025-12-19 20:11 ` [PATCH v9 7/7] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs Nicolin Chen
@ 2026-01-23 17:07 ` Will Deacon
2026-01-23 17:47 ` Nicolin Chen
2026-01-23 19:59 ` Jason Gunthorpe
0 siblings, 2 replies; 45+ messages in thread
From: Will Deacon @ 2026-01-23 17:07 UTC (permalink / raw)
To: Nicolin Chen
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
On Fri, Dec 19, 2025 at 12:11:29PM -0800, Nicolin Chen wrote:
> Replace the old invalidation functions with arm_smmu_domain_inv_range() in
> all the existing invalidation routines. And deprecate the old functions.
>
> The new arm_smmu_domain_inv_range() handles the CMDQ_MAX_TLBI_OPS as well,
> so drop it in the SVA function.
>
> Since arm_smmu_cmdq_batch_add_range() has only one caller now, and it must
> be given a valid size, add a WARN_ON_ONCE to catch any missed case.
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 -
> .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 29 +--
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 165 +-----------------
> 3 files changed, 11 insertions(+), 190 deletions(-)
It's one thing replacing the invalidation implementation but I think you
need to update some of the old ordering comments, too. In particular,
the old code relies on the dma_wmb() during cmdq insertion to order
updates to in-memory structures, which includes the pgtable in non-strict
mode.
I don't think any of that is true now?
Will
* Re: [PATCH v9 7/7] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs
2026-01-23 17:07 ` Will Deacon
@ 2026-01-23 17:47 ` Nicolin Chen
2026-01-23 19:59 ` Jason Gunthorpe
1 sibling, 0 replies; 45+ messages in thread
From: Nicolin Chen @ 2026-01-23 17:47 UTC (permalink / raw)
To: Will Deacon
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
On Fri, Jan 23, 2026 at 05:07:15PM +0000, Will Deacon wrote:
> On Fri, Dec 19, 2025 at 12:11:29PM -0800, Nicolin Chen wrote:
> > Replace the old invalidation functions with arm_smmu_domain_inv_range() in
> > all the existing invalidation routines. And deprecate the old functions.
> >
> > The new arm_smmu_domain_inv_range() handles the CMDQ_MAX_TLBI_OPS as well,
> > so drop it in the SVA function.
> >
> > Since arm_smmu_cmdq_batch_add_range() has only one caller now, and it must
> > be given a valid size, add a WARN_ON_ONCE to catch any missed case.
> >
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 -
> > .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 29 +--
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 165 +-----------------
> > 3 files changed, 11 insertions(+), 190 deletions(-)
>
> It's one thing replacing the invalidation implementation but I think you
> need to update some of the old ordering comments, too. In particular,
> the old code relies on the dma_wmb() during cmdq insertion to order
> updates to in-memory structures, which includes the pgtable in non-strict
> mode.
>
> I don't think any of that is true now?
OK. I'll update those ordering comments, combining your latest
suggestion of using dma_mb() vs. smp_mb(). I assume this will
just be doing s/dma_wmb/dma_mb in those comments.
Nicolin
* Re: [PATCH v9 7/7] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs
2026-01-23 17:07 ` Will Deacon
2026-01-23 17:47 ` Nicolin Chen
@ 2026-01-23 19:59 ` Jason Gunthorpe
1 sibling, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2026-01-23 19:59 UTC (permalink / raw)
To: Will Deacon
Cc: Nicolin Chen, jean-philippe, robin.murphy, joro, balbirs,
miko.lenczewski, peterz, kevin.tian, praan, linux-arm-kernel,
iommu, linux-kernel
On Fri, Jan 23, 2026 at 05:07:15PM +0000, Will Deacon wrote:
> On Fri, Dec 19, 2025 at 12:11:29PM -0800, Nicolin Chen wrote:
> > Replace the old invalidation functions with arm_smmu_domain_inv_range() in
> > all the existing invalidation routines. And deprecate the old functions.
> >
> > The new arm_smmu_domain_inv_range() handles the CMDQ_MAX_TLBI_OPS as well,
> > so drop it in the SVA function.
> >
> > Since arm_smmu_cmdq_batch_add_range() has only one caller now, and it must
> > be given a valid size, add a WARN_ON_ONCE to catch any missed case.
> >
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > ---
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 -
> > .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 29 +--
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 165 +-----------------
> > 3 files changed, 11 insertions(+), 190 deletions(-)
>
> It's one thing replacing the invalidation implementation but I think you
> need to update some of the old ordering comments, too. In particular,
> the old code relies on the dma_wmb() during cmdq insertion to order
> updates to in-memory structures, which includes the pgtable in non-strict
> mode.
>
> I don't think any of that is true now?
You are talking about this comment?
/*
* NOTE: when io-pgtable is in non-strict mode, we may get here with
* PTEs previously cleared by unmaps on the current CPU not yet visible
* to the SMMU. We are relying on the dma_wmb() implicit during cmd
* insertion to guarantee those are observed before the TLBI. Do be
* careful, 007.
*/
Maybe we can restate that a little bit:
/*
* If the DMA API is running in non-strict mode then another CPU could
* have changed the page table and not invoked any flush op. Instead the
* other CPU will do an atomic_read() and this CPU will have done an
* atomic_write(). That handshake is enough to acquire the page table
* writes from the other CPU.
*
* All command execution has a dma_wmb() to release all the in-memory
* structures written by this CPU, that barrier must also release the
* writes acquired from all the other CPUs too.
*
* There are other barriers and atomics on this path, but the above is
* the essential mechanism for ensuring that HW sees the page table
* writes from another CPU before it executes the IOTLB invalidation.
*/
I'm sure this series adds more barriers that might move things earlier
in the sequence, but that isn't why those barriers exist.
I think the original documentation was on to something important so
I'd like to keep the information, though perhaps the comment belongs
in dma-iommu.c as it really applies to all iommu drivers on all
architectures.
It is also why things were done as smp_XX and not dma_XX. The
intention was not to release things all the way to DMA, but just to
release things enough that the thread writing the STE can acquire
them and it will ensure they are released to DMA.
For instance on UP it is fine to have no barriers at all in the
invalidation code, the dma_wmb() in the STE store command is
sufficient.
Jason
* Re: [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array
2025-12-19 20:11 [PATCH v9 0/7] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (6 preceding siblings ...)
2025-12-19 20:11 ` [PATCH v9 7/7] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs Nicolin Chen
@ 2026-01-19 17:10 ` Nicolin Chen
7 siblings, 0 replies; 45+ messages in thread
From: Nicolin Chen @ 2026-01-19 17:10 UTC (permalink / raw)
To: will
Cc: jean-philippe, robin.murphy, joro, jgg, balbirs, miko.lenczewski,
peterz, kevin.tian, praan, linux-arm-kernel, iommu, linux-kernel
Hi Will,
On Fri, Dec 19, 2025 at 12:11:22PM -0800, Nicolin Chen wrote:
> * NVIDIA is building systems with > 10 SMMU instances where > 8 are being
> used concurrently in a single VM. So having 8 copies of an identical S2
> page table is not efficient. Instead, all vSMMU instances should check
> compatibility on a shared S2 iopt, to eliminate 7 copies.
>
> Previous attempt based on the list/spinlock design:
> iommu/arm-smmu-v3: Allocate vmid per vsmmu instead of s2_parent
> https://lore.kernel.org/all/cover.1744692494.git.nicolinc@nvidia.com/
> now can adopt this invs array, avoiding adding complex lists/locks.
>
> * The guest support for BTM requires temporarily invalidating two ASIDs
> for a single instance. When it renumbers ASIDs this can now be done via
> the invs array.
>
> * SVA with multiple devices being used by a single process (NVIDIA today
> has 4-8) sequentially iterates the invalidations through all instances.
> This ignores the HW concurrency available in each instance. It would be
> nice to not spin on each sync but go forward and issue batches to other
> instances also. Reducing to a single SVA domain shared across instances
> is required to look at this.
>
> This is on Github:
> https://github.com/nicolinc/iommufd/commits/arm_smmu_invs-v9
>
> Changelog
> v9:
> * Set cur->ssid correctly in arm_smmu_master_build_inv()
> * Add comments for INV_TYPE_S1_ASID in arm_smmu_master_build_inv()
Sorry for the gentle push; this has been sitting here for a while,
and we are at rc6. Any chance it can make it into this cycle?
We have a dependent series for which I am about to send v2:
https://lore.kernel.org/linux-iommu/cover.1766088962.git.nicolinc@nvidia.com/
If there is anything you think I should improve, please let me know.
Thanks!
Nicolin