* [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array
@ 2025-08-14 1:25 Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 1/8] iommu/arm-smmu-v3: Clear cmds->num after arm_smmu_cmdq_batch_submit Nicolin Chen
` (7 more replies)
0 siblings, 8 replies; 18+ messages in thread
From: Nicolin Chen @ 2025-08-14 1:25 UTC (permalink / raw)
To: jgg, will, robin.murphy
Cc: joro, jean-philippe, miko.lenczewski, balbirs, peterz, smostafa,
kevin.tian, praan, zhangzekun11, linux-arm-kernel, iommu,
linux-kernel, patches
This work is based on Jason's design and algorithm; the implementation also
follows his initial draft.
The new arm_smmu_invs array is an RCU-protected array, mutated when a device
attaches to the domain and iterated when an invalidation is required for
IOPTE changes in this domain. This keeps the current invalidation efficiency:
an smp_mb() followed by a conditional rwlock replaces the atomic/spinlock
combination.
A new data structure is defined for the array and its entries, each entry
representing an invalidation operation such as S1_ASID, S2_VMID, or ATS. The
algorithm adds and deletes array entries efficiently and keeps the array
sorted, so that similar invalidations can be grouped into batches.
During an invalidation, a new invalidation function iterates domain->invs,
and converts each entry to the corresponding invalidation command(s). This
new function is fully compatible with all the existing use cases, allowing
a simple rework/replacement.
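As an illustration, the rework in patch 8 turns an io-pgtable flush hook into
roughly the sketch below (simplified; the exact hunks are in patch 8):

  static void arm_smmu_tlb_inv_walk(unsigned long iova, size_t size,
                                    size_t granule, void *cookie)
  {
          struct arm_smmu_domain *smmu_domain = cookie;

          /* One call covers ASID/VMID TLBIs, SYNCs and ATC invalidations */
          arm_smmu_domain_inv_range(smmu_domain, iova, size, granule, false);
  }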
Some races to keep in mind:
1) A domain can be shared across SMMU instances. When an SMMU instance is
removed, the updated invs array has to be synced via synchronize_rcu()
to prevent a concurrent invalidation routine that is still accessing the
old array from issuing commands to the removed SMMU instance.
2) When there are concurrent IOPTE changes (followed by invalidations) and
a domain attachment, the new attachment must not become out of sync at
the HW level, meaning that an STE store and invalidation array load must
be sequenced by the CPU's memory model.
3) When an ATS-enabled device attaches to a blocking domain, the core code
requires a hard fence to ensure all ATS invalidations to the device are
completed. Relying on RCU alone would require calling synchronize_rcu(),
which can be too slow. Instead, when ATS is in use, hold a conditional
rwlock until all concurrent invalidations are finished.
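To make the fast path concrete, here is a condensed view of the reader side
added in patch 7 (declarations, command building and the empty-array respin
elided):

  smp_mb();      /* pairs with the smp_mb() after a concurrent array update */
  rcu_read_lock();
  again:
  invs = rcu_dereference(smmu_domain->invs);
  if (!read_trylock(&invs->rwlock))          /* array swapped by an attach */
          goto again;
  for (i = 0; i < invs->num_invs; i++) {
          if (!refcount_read(&invs->inv[i].users))
                  continue;                  /* skip trash entries */
          /* convert invs->inv[i] into TLBI/ATC_INV commands ... */
  }
  read_unlock(&invs->rwlock);
  rcu_read_unlock();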
Related future work and dependent projects:
* NVIDIA is building systems with > 10 SMMU instances where > 8 are being
used concurrently in a single VM. So having 8 copies of an identical S2
page table is not efficient. Instead, all vSMMU instances should check
compatibility on a shared S2 iopt, to eliminate 7 copies.
A previous attempt, based on the list/spinlock design:
iommu/arm-smmu-v3: Allocate vmid per vsmmu instead of s2_parent
https://lore.kernel.org/all/cover.1744692494.git.nicolinc@nvidia.com/
can now adopt this invs array, avoiding the extra complexity of lists/locks.
* Guest support for BTM requires temporarily invalidating two ASIDs on a
single instance. When it renumbers ASIDs, this can now be done via the
invs array.
* SVA with multiple devices used by a single process (NVIDIA today has
4-8) sequentially iterates the invalidations through all instances, which
ignores the HW concurrency available in each instance. It would be nice
not to spin on each sync but to move on and issue batches to the other
instances as well. Reducing to a single SVA domain shared across instances
is a prerequisite for looking at this.
This is on Github:
https://github.com/nicolinc/iommufd/commits/arm_smmu_invs-rfcv1
Thanks
Nicolin
Jason Gunthorpe (1):
iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
Nicolin Chen (7):
iommu/arm-smmu-v3: Clear cmds->num after arm_smmu_cmdq_batch_submit
iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA
iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free()
iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array
iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters
iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
iommu/arm-smmu-v3: Perform per-domain invalidations using
arm_smmu_invs
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 106 ++-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 32 +-
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c | 85 ++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 829 +++++++++++++++---
4 files changed, 876 insertions(+), 176 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH rfcv1 1/8] iommu/arm-smmu-v3: Clear cmds->num after arm_smmu_cmdq_batch_submit
2025-08-14 1:25 [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
@ 2025-08-14 1:25 ` Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 2/8] iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA Nicolin Chen
` (6 subsequent siblings)
7 siblings, 0 replies; 18+ messages in thread
From: Nicolin Chen @ 2025-08-14 1:25 UTC (permalink / raw)
To: jgg, will, robin.murphy
Cc: joro, jean-philippe, miko.lenczewski, balbirs, peterz, smostafa,
kevin.tian, praan, zhangzekun11, linux-arm-kernel, iommu,
linux-kernel, patches
None of the callers of arm_smmu_cmdq_batch_submit() cares about the batch
after a submission, so it is safe to clear cmds->num, at least upon a
successful submission. This will slightly simplify a new wrapper function.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2a8b46b948f05..cccf8f52ee0d5 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -974,11 +974,17 @@ static void arm_smmu_cmdq_batch_add(struct arm_smmu_device *smmu,
cmds->num++;
}
+/* Clears cmds->num after a successful submission */
static int arm_smmu_cmdq_batch_submit(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq_batch *cmds)
{
- return arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
- cmds->num, true);
+ int ret;
+
+ ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmds->cmdq, cmds->cmds,
+ cmds->num, true);
+ if (!ret)
+ cmds->num = 0;
+ return ret;
}
static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused,
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH rfcv1 2/8] iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA
2025-08-14 1:25 [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 1/8] iommu/arm-smmu-v3: Clear cmds->num after arm_smmu_cmdq_batch_submit Nicolin Chen
@ 2025-08-14 1:25 ` Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 3/8] iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free() Nicolin Chen
` (5 subsequent siblings)
7 siblings, 0 replies; 18+ messages in thread
From: Nicolin Chen @ 2025-08-14 1:25 UTC (permalink / raw)
To: jgg, will, robin.murphy
Cc: joro, jean-philippe, miko.lenczewski, balbirs, peterz, smostafa,
kevin.tian, praan, zhangzekun11, linux-arm-kernel, iommu,
linux-kernel, patches
Both the ARM_SMMU_DOMAIN_S1 case and the SVA case use an ASID, requiring
ASID-based invalidation commands to flush the TLB.
Define an ARM_SMMU_DOMAIN_SVA stage to make it clear that the SVA case shares
the same path as the ARM_SMMU_DOMAIN_S1 case, which will be part of the
routine that builds the new per-domain invalidation array.
There is no functional change.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 +++
3 files changed, 5 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index ae23aacc38402..5c0b38595d209 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -858,6 +858,7 @@ struct arm_smmu_master {
enum arm_smmu_domain_stage {
ARM_SMMU_DOMAIN_S1 = 0,
ARM_SMMU_DOMAIN_S2,
+ ARM_SMMU_DOMAIN_SVA,
};
struct arm_smmu_domain {
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 59a480974d80f..6097f1f540d87 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -346,6 +346,7 @@ struct iommu_domain *arm_smmu_sva_domain_alloc(struct device *dev,
* ARM_SMMU_FEAT_RANGE_INV is present
*/
smmu_domain->domain.pgsize_bitmap = PAGE_SIZE;
+ smmu_domain->stage = ARM_SMMU_DOMAIN_SVA;
smmu_domain->smmu = smmu;
ret = xa_alloc(&arm_smmu_asid_xa, &asid, smmu_domain,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index cccf8f52ee0d5..0016ec699acfe 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3070,6 +3070,9 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
arm_smmu_install_ste_for_dev(master, &target);
arm_smmu_clear_cd(master, IOMMU_NO_PASID);
break;
+ default:
+ WARN_ON(true);
+ break;
}
arm_smmu_attach_commit(&state);
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH rfcv1 3/8] iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free()
2025-08-14 1:25 [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 1/8] iommu/arm-smmu-v3: Clear cmds->num after arm_smmu_cmdq_batch_submit Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 2/8] iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA Nicolin Chen
@ 2025-08-14 1:25 ` Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
` (4 subsequent siblings)
7 siblings, 0 replies; 18+ messages in thread
From: Nicolin Chen @ 2025-08-14 1:25 UTC (permalink / raw)
To: jgg, will, robin.murphy
Cc: joro, jean-philippe, miko.lenczewski, balbirs, peterz, smostafa,
kevin.tian, praan, zhangzekun11, linux-arm-kernel, iommu,
linux-kernel, patches
There will be a bit more to free than the smmu_domain itself. So keep a
simple inline function in the header to share across files.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 5 +++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 4 ++--
3 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 5c0b38595d209..96a23ca633cb6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -954,6 +954,11 @@ extern struct mutex arm_smmu_asid_lock;
struct arm_smmu_domain *arm_smmu_domain_alloc(void);
+static inline void arm_smmu_domain_free(struct arm_smmu_domain *smmu_domain)
+{
+ kfree(smmu_domain);
+}
+
void arm_smmu_clear_cd(struct arm_smmu_master *master, ioasid_t ssid);
struct arm_smmu_cd *arm_smmu_get_cd_ptr(struct arm_smmu_master *master,
u32 ssid);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index 6097f1f540d87..fc601b494e0af 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -365,6 +365,6 @@ struct iommu_domain *arm_smmu_sva_domain_alloc(struct device *dev,
err_asid:
xa_erase(&arm_smmu_asid_xa, smmu_domain->cd.asid);
err_free:
- kfree(smmu_domain);
+ arm_smmu_domain_free(smmu_domain);
return ERR_PTR(ret);
}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0016ec699acfe..08af5f2d1235a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2498,7 +2498,7 @@ static void arm_smmu_domain_free_paging(struct iommu_domain *domain)
ida_free(&smmu->vmid_map, cfg->vmid);
}
- kfree(smmu_domain);
+ arm_smmu_domain_free(smmu_domain);
}
static int arm_smmu_domain_finalise_s1(struct arm_smmu_device *smmu,
@@ -3359,7 +3359,7 @@ arm_smmu_domain_alloc_paging_flags(struct device *dev, u32 flags,
return &smmu_domain->domain;
err_free:
- kfree(smmu_domain);
+ arm_smmu_domain_free(smmu_domain);
return ERR_PTR(ret);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-08-14 1:25 [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (2 preceding siblings ...)
2025-08-14 1:25 ` [PATCH rfcv1 3/8] iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free() Nicolin Chen
@ 2025-08-14 1:25 ` Nicolin Chen
2025-08-26 19:50 ` Jason Gunthorpe
` (2 more replies)
2025-08-14 1:25 ` [PATCH rfcv1 5/8] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array Nicolin Chen
` (3 subsequent siblings)
7 siblings, 3 replies; 18+ messages in thread
From: Nicolin Chen @ 2025-08-14 1:25 UTC (permalink / raw)
To: jgg, will, robin.murphy
Cc: joro, jean-philippe, miko.lenczewski, balbirs, peterz, smostafa,
kevin.tian, praan, zhangzekun11, linux-arm-kernel, iommu,
linux-kernel, patches
From: Jason Gunthorpe <jgg@nvidia.com>
Create a new data structure to hold an array of invalidations that need to
be performed for the domain based on what masters are attached, to replace
the single smmu pointer and linked list of masters in the current design.
Each array entry holds one of the invalidation actions - S1_ASID, S2_VMID,
ATS or their variants - with the information needed to feed invalidation
commands to the HW.
It is structured so that multiple SMMUs can participate in the same array,
removing one key limitation of the current system.
To maximize performance, a sorted array is used as the data structure. It
allows grouping SYNCs together to parallelize invalidations. For instance,
it will group all the ATS entries after the ASID/VMID entry, so they will
all be pushed to the PCI devices in parallel with one SYNC.
To minimize the locking cost on the invalidation fast path (reader of the
invalidation array), the array is managed with RCU.
Provide a set of APIs to add/delete entries to/from an array, including a
special no-fail helper for paths that cannot fail, e.g. attaching to
arm_smmu_blocked_domain. Also, add kunit coverage for those APIs.
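For context, the attach path (patch 6) uses these helpers roughly as in the
simplified fragment below, with error handling dropped:

  /* Serialized by arm_smmu_asid_lock; may return ERR_PTR(-ENOMEM) */
  old_invs = rcu_dereference_protected(smmu_domain->invs,
                                       lockdep_is_held(&arm_smmu_asid_lock));
  new_invs = arm_smmu_invs_add(old_invs, master->invs);
  ...
  /* Publish the new array before installing the STE/CD */
  rcu_assign_pointer(smmu_domain->invs, new_invs);
  smp_mb();
  kfree_rcu(old_invs, rcu);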
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Co-developed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 79 ++++++
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c | 85 ++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 266 ++++++++++++++++++
3 files changed, 430 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 96a23ca633cb6..d7421b56e3598 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -649,6 +649,82 @@ struct arm_smmu_cmdq_batch {
int num;
};
+enum arm_smmu_inv_type {
+ INV_TYPE_S1_ASID,
+ INV_TYPE_S2_VMID,
+ INV_TYPE_S2_VMID_S1_CLEAR,
+ INV_TYPE_ATS,
+ INV_TYPE_ATS_FULL,
+};
+
+struct arm_smmu_inv {
+ /* invalidation items */
+ struct arm_smmu_device *smmu;
+ u8 type;
+ u8 size_opcode;
+ u8 nsize_opcode;
+ u32 id; /* ASID or VMID or SID */
+ union {
+ size_t pgsize; /* ARM_SMMU_FEAT_RANGE_INV */
+ u32 ssid; /* INV_TYPE_ATS */
+ };
+
+ /* infrastructure items */
+ refcount_t users; /* users=0 to mark as a trash */
+ bool todel : 1; /* set for a pending deletion */
+};
+
+/**
+ * struct arm_smmu_invs - Per-domain invalidation array
+ * @num_invs: number of invalidations in @invs
+ * @rwlock: Optional rwlock to fence arm_smmu_invs_dec()
+ * @rcu: rcu head for kfree_rcu()
+ * @inv: flexible invalidation array
+ *
+ * The arm_smmu_invs is an RCU data structure. During a ->attach_dev callback,
+ * arm_smmu_invs_add() and arm_smmu_invs_del() will be used to allocate a new
+ * copy of an old array for addition and deletion.
+ *
+ * The arm_smmu_invs_dec() is a special function to mutate a given array, by
+ * internally reducing the users counts of some given entries. This exists to
+ * support a no-fail routine like attaching to an IOMMU_DOMAIN_BLOCKED. Use it
+ * carefully as it can impact performance from the extra rwlock.
+ *
+ * A concurrent invalidation thread will push all the invalidations described
+ * in the array into the command queue for each invalidation event. It is designed
+ * like this to optimize the invalidation fast path by avoiding locks.
+ *
+ * A domain can be shared across SMMU instances. When an instance gets removed
+ * it would delete all the entries that belong to that SMMU instance. Then, a
+ * synchronize_rcu() would have to be called to sync the array, to prevent any
+ * concurrent invalidation thread accessing the old array from issuing commands
+ * to the command queue of a removed SMMU instance.
+ */
+struct arm_smmu_invs {
+ size_t num_invs;
+ rwlock_t rwlock;
+ struct rcu_head rcu;
+ struct arm_smmu_inv inv[];
+};
+
+static inline struct arm_smmu_invs *arm_smmu_invs_alloc(size_t num_invs)
+{
+ struct arm_smmu_invs *new_invs;
+
+ new_invs = kzalloc(struct_size(new_invs, inv, num_invs), GFP_KERNEL);
+ if (!new_invs)
+ return ERR_PTR(-ENOMEM);
+ rwlock_init(&new_invs->rwlock);
+ return new_invs;
+}
+
+struct arm_smmu_invs *arm_smmu_invs_add(struct arm_smmu_invs *old_invs,
+ struct arm_smmu_invs *add_invs);
+struct arm_smmu_invs *arm_smmu_invs_del(struct arm_smmu_invs *old_invs,
+ struct arm_smmu_invs *del_invs);
+size_t arm_smmu_invs_dec(struct arm_smmu_invs *invs,
+ struct arm_smmu_invs *dec_invs);
+
struct arm_smmu_evtq {
struct arm_smmu_queue q;
struct iopf_queue *iopf;
@@ -875,6 +951,8 @@ struct arm_smmu_domain {
struct iommu_domain domain;
+ struct arm_smmu_invs *invs;
+
/* List of struct arm_smmu_master_domain */
struct list_head devices;
spinlock_t devices_lock;
@@ -956,6 +1034,7 @@ struct arm_smmu_domain *arm_smmu_domain_alloc(void);
static inline void arm_smmu_domain_free(struct arm_smmu_domain *smmu_domain)
{
+ kfree_rcu(smmu_domain->invs, rcu);
kfree(smmu_domain);
}
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
index d2671bfd37981..2008a4b55ef70 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
@@ -567,6 +567,90 @@ static void arm_smmu_v3_write_cd_test_sva_release(struct kunit *test)
NUM_EXPECTED_SYNCS(2));
}
+static void arm_smmu_v3_invs_test_verify(struct kunit *test,
+ struct arm_smmu_invs *invs, int num,
+ const int *ids, const int *users)
+{
+ KUNIT_EXPECT_EQ(test, invs->num_invs, num);
+ while (num--) {
+ KUNIT_EXPECT_EQ(test, invs->inv[num].id, ids[num]);
+ KUNIT_EXPECT_EQ(test, refcount_read(&invs->inv[num].users),
+ users[num]);
+ }
+}
+
+static struct arm_smmu_invs invs1 = {
+ .num_invs = 3,
+ .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, },
+ { .type = INV_TYPE_S2_VMID, .id = 2, },
+ { .type = INV_TYPE_S2_VMID, .id = 3, }, },
+};
+
+static struct arm_smmu_invs invs2 = {
+ .num_invs = 3,
+ .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
+ { .type = INV_TYPE_ATS, .id = 5, },
+ { .type = INV_TYPE_ATS, .id = 4, }, },
+};
+
+static struct arm_smmu_invs invs3 = {
+ .num_invs = 3,
+ .inv = { { .type = INV_TYPE_S2_VMID, .id = 1, }, /* duplicated */
+ { .type = INV_TYPE_ATS, .id = 7, },
+ { .type = INV_TYPE_ATS, .id = 6, }, },
+};
+
+static void arm_smmu_v3_invs_test(struct kunit *test)
+{
+ const int results1[2][3] = { { 1, 2, 3, }, { 1, 1, 1, }, };
+ const int results2[2][5] = { { 1, 2, 3, 4, 5, }, { 2, 1, 1, 1, 1, }, };
+ const int results3[2][5] = { { 1, 2, 3, 4, 5, }, { 1, 1, 1, 0, 0, }, };
+ const int results4[2][5] = { { 1, 2, 3, 6, 7, }, { 2, 1, 1, 1, 1, }, };
+ const int results5[2][5] = { { 1, 2, 3, 6, 7, }, { 1, 0, 0, 1, 1, }, };
+ struct arm_smmu_invs *test_a, *test_b;
+ size_t num_invs;
+
+ /* New array */
+ test_a = arm_smmu_invs_alloc(0);
+ KUNIT_EXPECT_EQ(test, test_a->num_invs, 0);
+
+ /* Test1: add invs1 (new array) */
+ test_b = arm_smmu_invs_add(test_a, &invs1);
+ arm_smmu_v3_invs_test_verify(test, test_b, ARRAY_SIZE(results1[0]),
+ results1[0], results1[1]);
+ kfree(test_a);
+
+ /* Test2: add invs2 (new array) */
+ test_a = arm_smmu_invs_add(test_b, &invs2);
+ arm_smmu_v3_invs_test_verify(test, test_a, ARRAY_SIZE(results2[0]),
+ results2[0], results2[1]);
+ kfree(test_b);
+
+ /* Test3: decrease invs2 (same array) */
+ num_invs = arm_smmu_invs_dec(test_a, &invs2);
+ arm_smmu_v3_invs_test_verify(test, test_a, ARRAY_SIZE(results3[0]),
+ results3[0], results3[1]);
+ KUNIT_EXPECT_EQ(test, num_invs, 3);
+
+ /* Test4: add invs3 (new array) */
+ test_b = arm_smmu_invs_add(test_a, &invs3);
+ arm_smmu_v3_invs_test_verify(test, test_b, ARRAY_SIZE(results4[0]),
+ results4[0], results4[1]);
+ kfree(test_a);
+
+ /* Test5: decrease invs1 (same array) */
+ num_invs = arm_smmu_invs_dec(test_b, &invs1);
+ arm_smmu_v3_invs_test_verify(test, test_b, ARRAY_SIZE(results5[0]),
+ results5[0], results5[1]);
+ KUNIT_EXPECT_EQ(test, num_invs, 3);
+
+ /* Test6: delete invs3 (new array) */
+ test_a = arm_smmu_invs_del(test_b, &invs3);
+ KUNIT_EXPECT_EQ(test, test_a->num_invs, 0);
+ kfree(test_b);
+ kfree(test_a);
+}
+
static struct kunit_case arm_smmu_v3_test_cases[] = {
KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_abort),
KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_bypass),
@@ -590,6 +674,7 @@ static struct kunit_case arm_smmu_v3_test_cases[] = {
KUNIT_CASE(arm_smmu_v3_write_ste_test_s2_to_s1_stall),
KUNIT_CASE(arm_smmu_v3_write_cd_test_sva_clear),
KUNIT_CASE(arm_smmu_v3_write_cd_test_sva_release),
+ KUNIT_CASE(arm_smmu_v3_invs_test),
{},
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 08af5f2d1235a..73f3b411ff7ef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -26,6 +26,7 @@
#include <linux/pci.h>
#include <linux/pci-ats.h>
#include <linux/platform_device.h>
+#include <linux/sort.h>
#include <linux/string_choices.h>
#include <kunit/visibility.h>
#include <uapi/linux/iommufd.h>
@@ -1033,6 +1034,263 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
}
+static int arm_smmu_invs_cmp(const void *_l, const void *_r)
+{
+ const struct arm_smmu_inv *l = _l;
+ const struct arm_smmu_inv *r = _r;
+
+ if (l->smmu != r->smmu)
+ return cmp_int((uintptr_t)l->smmu, (uintptr_t)r->smmu);
+ if (l->type != r->type)
+ return cmp_int(l->type, r->type);
+ return cmp_int(l->id, r->id);
+}
+
+static inline bool same_op(const struct arm_smmu_inv *a,
+ const struct arm_smmu_inv *b)
+{
+ return a->smmu == b->smmu && a->type == b->type && a->id == b->id;
+}
+
+/**
+ * arm_smmu_invs_add() - Combine @old_invs with @add_invs to a new array
+ * @old_invs: the old invalidation array
+ * @add_invs: an array of invalidations to add
+ *
+ * Return: a newly allocated and sorted invalidation array on success, or an
+ * ERR_PTR.
+ *
+ * This function must be serialized (locked) against arm_smmu_invs_del/dec(),
+ * but it does not lockdep-assert on any lock, so the KUNIT tests can call it.
+ *
+ * Caller is responsible for freeing @old_invs and the returned array.
+ *
+ * Entries marked as trash can be reused if @add_invs wants to add them back.
+ * Otherwise, they will be completely removed in the returned array.
+ */
+VISIBLE_IF_KUNIT
+struct arm_smmu_invs *arm_smmu_invs_add(struct arm_smmu_invs *old_invs,
+ struct arm_smmu_invs *add_invs)
+{
+ size_t need = old_invs->num_invs + add_invs->num_invs;
+ struct arm_smmu_invs *new_invs;
+ size_t deletes = 0, i, j;
+ u64 existed = 0;
+
+ /* Max of add_invs->num_invs is 64 */
+ if (WARN_ON(add_invs->num_invs > sizeof(existed) * 8))
+ return ERR_PTR(-EINVAL);
+
+ for (i = 0; i != old_invs->num_invs; i++) {
+ struct arm_smmu_inv *cur = &old_invs->inv[i];
+ /* Count the trash entries to delete */
+ if (cur->todel) {
+ WARN_ON_ONCE(refcount_read(&cur->users));
+ deletes++;
+ }
+ for (j = 0; j != add_invs->num_invs; j++) {
+ if (!same_op(cur, &add_invs->inv[j]))
+ continue;
+ /* Found duplicated entries in add_invs */
+ if (WARN_ON_ONCE(existed & BIT_ULL(j)))
+ continue;
+ /* Revert the todel marker for reuse */
+ if (cur->todel) {
+ cur->todel = false;
+ deletes--;
+ }
+ /* Store the new location of this existing op in id */
+ add_invs->inv[j].id = i - deletes;
+ existed |= BIT_ULL(j);
+ need--;
+ break;
+ }
+ }
+
+ need -= deletes;
+
+ new_invs = arm_smmu_invs_alloc(need);
+ if (IS_ERR(new_invs)) {
+ /* Don't forget to revert all the todel markers */
+ for (i = 0; i != old_invs->num_invs; i++) {
+ if (refcount_read(&old_invs->inv[i].users) == 0)
+ old_invs->inv[i].todel = true;
+ }
+ return new_invs;
+ }
+
+ /* Copy the entire array less all the todel entries */
+ for (i = 0; i != old_invs->num_invs; i++) {
+ if (old_invs->inv[i].todel)
+ continue;
+ new_invs->inv[new_invs->num_invs++] = old_invs->inv[i];
+ }
+
+ for (j = 0; j != add_invs->num_invs; j++) {
+ if (existed & BIT_ULL(j)) {
+ unsigned int idx = add_invs->inv[j].id;
+
+ refcount_inc(&new_invs->inv[idx].users);
+
+ /* Restore the id of the passed in add_invs->inv[j] */
+ add_invs->inv[j].id = new_invs->inv[idx].id;
+ } else {
+ unsigned int idx = new_invs->num_invs;
+
+ new_invs->inv[idx] = add_invs->inv[j];
+ refcount_set(&new_invs->inv[idx].users, 1);
+ new_invs->num_invs++;
+ }
+ }
+
+ WARN_ON(new_invs->num_invs != need);
+
+ /*
+ * A sorted array allows batching invalidations together for fewer SYNCs.
+ * Also, ATS must follow the ASID/VMID invalidation SYNC.
+ */
+ sort_nonatomic(new_invs->inv, new_invs->num_invs,
+ sizeof(add_invs->inv[0]), arm_smmu_invs_cmp, NULL);
+ return new_invs;
+}
+EXPORT_SYMBOL_IF_KUNIT(arm_smmu_invs_add);
+
+/**
+ * arm_smmu_invs_del() - Remove @del_invs from @old_invs
+ * @old_invs: the old invalidation array
+ * @del_invs: an array of invlidations to delete
+ *
+ * Return: a newly allocated and sorted invalidation array on success, or an
+ * ERR_PTR.
+ *
+ * This function must be serialized (locked) against arm_smmu_invs_add/dec(),
+ * but it does not lockdep-assert on any lock, so the KUNIT tests can call it.
+ *
+ * Caller is responsible for freeing @old_invs and the returned array.
+ *
+ * Entries marked as trash will be completely removed in the returned array.
+ */
+VISIBLE_IF_KUNIT
+struct arm_smmu_invs *arm_smmu_invs_del(struct arm_smmu_invs *old_invs,
+ struct arm_smmu_invs *del_invs)
+{
+ size_t need = old_invs->num_invs;
+ struct arm_smmu_invs *new_invs;
+ size_t i, j;
+
+ if (WARN_ON(old_invs->num_invs < del_invs->num_invs))
+ return ERR_PTR(-EINVAL);
+
+ for (i = 0; i != old_invs->num_invs; i++) {
+ struct arm_smmu_inv *cur = &old_invs->inv[i];
+ /* Skip any trash entry */
+ if (cur->todel) {
+ WARN_ON_ONCE(refcount_read(&cur->users));
+ need--;
+ continue;
+ }
+ for (j = 0; j != del_invs->num_invs; j++) {
+ if (!same_op(cur, &del_invs->inv[j]))
+ continue;
+ /* Found duplicated entries in del_invs */
+ if (WARN_ON_ONCE(cur->todel))
+ continue;
+ /* Mark todel. The deletion part will take care of it */
+ cur->todel = true;
+ if (refcount_read(&cur->users) == 1)
+ need--;
+ }
+ }
+
+ new_invs = arm_smmu_invs_alloc(need);
+ if (IS_ERR(new_invs)) {
+ /* Don't forget to revert all the todel markers */
+ for (i = 0; i != old_invs->num_invs; i++) {
+ if (refcount_read(&old_invs->inv[i].users) != 0)
+ old_invs->inv[i].todel = false;
+ }
+ return new_invs;
+ }
+
+ for (i = 0; i != old_invs->num_invs; i++) {
+ struct arm_smmu_inv *cur = &old_invs->inv[i];
+ unsigned int idx = new_invs->num_invs;
+
+ /* Either a trash entry or a matched entry for a dec-and-test */
+ if (cur->todel) {
+ /* Can't do refcount_dec_and_test() on a trash entry */
+ if (refcount_read(&cur->users) <= 1)
+ continue;
+ refcount_dec(&cur->users);
+ cur->todel = false;
+ }
+ new_invs->inv[idx] = *cur;
+ new_invs->num_invs++;
+ }
+
+ WARN_ON(new_invs->num_invs != need);
+
+ /* Still sorted */
+ return new_invs;
+}
+EXPORT_SYMBOL_IF_KUNIT(arm_smmu_invs_del);
+
+/**
+ * arm_smmu_invs_dec() - Find in @invs for all entries in @del_invs, decrease
+ * the user counts without deletions
+ * @invs: a given invalidation array
+ * @dec_invs: an array of invalidations to decrease their user counts
+ *
+ * Return: the actual number of invs in the array, excluding all trash entries
+ *
+ * This function will not fail. Any entry with users=0 will be marked as trash.
+ * All trash entries will remain in the @invs until being completely deleted by
+ * the next arm_smmu_invs_add() or arm_smmu_invs_del() function call.
+ *
+ * This function must be serialized (locked) against arm_smmu_invs_add/del(),
+ * but it does not lockdep-assert on any lock, so the KUNIT tests can call it.
+ *
+ * Note that @invs->num_invs will not be updated, even if the actual number
+ * of invalidations is decreased. Readers should take the read lock to iterate
+ * each entry up to @invs->num_invs and check its users counter.
+ */
+VISIBLE_IF_KUNIT
+size_t arm_smmu_invs_dec(struct arm_smmu_invs *invs,
+ struct arm_smmu_invs *dec_invs)
+{
+ size_t num_invs = 0, i, j;
+ unsigned long flags;
+
+ /* Driver bug. Must be fixed, but do not fail here */
+ if (WARN_ON(invs->num_invs < dec_invs->num_invs)) {
+ for (i = 0; i != invs->num_invs; i++) {
+ if (!invs->inv[i].todel)
+ num_invs++;
+ }
+ return num_invs;
+ }
+
+ /* We have no choice but to lock the array while editing it in place */
+ write_lock_irqsave(&invs->rwlock, flags);
+
+ for (i = 0; i != invs->num_invs; i++) {
+ for (j = 0; j != dec_invs->num_invs; j++) {
+ if (same_op(&invs->inv[i], &dec_invs->inv[j]) &&
+ refcount_dec_and_test(&invs->inv[i].users)) {
+ /* Set the todel marker for deletion */
+ invs->inv[i].todel = true;
+ break;
+ }
+ }
+ if (!invs->inv[i].todel)
+ num_invs++;
+ }
+
+ write_unlock_irqrestore(&invs->rwlock, flags);
+ return num_invs;
+}
+EXPORT_SYMBOL_IF_KUNIT(arm_smmu_invs_dec);
+
/*
* Based on the value of ent report which bits of the STE the HW will access. It
* would be nice if this was complete according to the spec, but minimally it
@@ -2468,13 +2726,21 @@ static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
struct arm_smmu_domain *arm_smmu_domain_alloc(void)
{
struct arm_smmu_domain *smmu_domain;
+ struct arm_smmu_invs *new_invs;
smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
if (!smmu_domain)
return ERR_PTR(-ENOMEM);
+ new_invs = arm_smmu_invs_alloc(0);
+ if (IS_ERR(new_invs)) {
+ kfree(smmu_domain);
+ return ERR_CAST(new_invs);
+ }
+
INIT_LIST_HEAD(&smmu_domain->devices);
spin_lock_init(&smmu_domain->devices_lock);
+ rcu_assign_pointer(smmu_domain->invs, new_invs);
return smmu_domain;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH rfcv1 5/8] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array
2025-08-14 1:25 [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (3 preceding siblings ...)
2025-08-14 1:25 ` [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
@ 2025-08-14 1:25 ` Nicolin Chen
2025-08-26 19:56 ` Jason Gunthorpe
2025-08-14 1:25 ` [PATCH rfcv1 6/8] iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters Nicolin Chen
` (2 subsequent siblings)
7 siblings, 1 reply; 18+ messages in thread
From: Nicolin Chen @ 2025-08-14 1:25 UTC (permalink / raw)
To: jgg, will, robin.murphy
Cc: joro, jean-philippe, miko.lenczewski, balbirs, peterz, smostafa,
kevin.tian, praan, zhangzekun11, linux-arm-kernel, iommu,
linux-kernel, patches
When a master is moved from an old domain to a new domain, it needs to build
invalidation arrays describing the entries to delete from and add to the
invalidation arrays of those two domains, passed via the del_invs and
add_invs arguments to arm_smmu_invs_del/add() respectively.
Since master->num_streams might differ across masters, memory would have to
be allocated when building an add_invs/del_invs array, which might fail with
-ENOMEM.
On the other hand, an attachment to arm_smmu_blocked_domain must not fail, so
it is best to avoid any memory allocation in that path.
Pre-allocate a fixed-size invalidation array for every master. This array
will be filled dynamically when building an add_invs or del_invs array to
attach or detach an smmu_domain.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 ++++++++++
2 files changed, 11 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index d7421b56e3598..0330444bef45f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -919,6 +919,7 @@ struct arm_smmu_master {
struct arm_smmu_device *smmu;
struct device *dev;
struct arm_smmu_stream *streams;
+ struct arm_smmu_invs *invs;
struct arm_smmu_vmaster *vmaster; /* use smmu->streams_mutex */
/* Locked by the iommu core using the group mutex */
struct arm_smmu_ctx_desc_cfg cd_table;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 73f3b411ff7ef..fb5429d8ebb29 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3723,6 +3723,7 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
int i;
int ret = 0;
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(master->dev);
+ size_t num_ats = dev_is_pci(master->dev) ? master->num_streams : 0;
master->streams = kcalloc(fwspec->num_ids, sizeof(*master->streams),
GFP_KERNEL);
@@ -3730,6 +3731,13 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
return -ENOMEM;
master->num_streams = fwspec->num_ids;
+ /* Max possible num_invs: two for ASID/VMIDs and num_ats for ATC_INVs */
+ master->invs = arm_smmu_invs_alloc(2 + num_ats);
+ if (IS_ERR(master->invs)) {
+ kfree(master->streams);
+ return PTR_ERR(master->invs);
+ }
+
mutex_lock(&smmu->streams_mutex);
for (i = 0; i < fwspec->num_ids; i++) {
struct arm_smmu_stream *new_stream = &master->streams[i];
@@ -3767,6 +3775,7 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
for (i--; i >= 0; i--)
rb_erase(&master->streams[i].node, &smmu->streams);
kfree(master->streams);
+ kfree(master->invs);
}
mutex_unlock(&smmu->streams_mutex);
@@ -3788,6 +3797,7 @@ static void arm_smmu_remove_master(struct arm_smmu_master *master)
mutex_unlock(&smmu->streams_mutex);
kfree(master->streams);
+ kfree(master->invs);
}
static struct iommu_device *arm_smmu_probe_device(struct device *dev)
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH rfcv1 6/8] iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters
2025-08-14 1:25 [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (4 preceding siblings ...)
2025-08-14 1:25 ` [PATCH rfcv1 5/8] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array Nicolin Chen
@ 2025-08-14 1:25 ` Nicolin Chen
2025-08-27 18:21 ` Jason Gunthorpe
2025-08-14 1:25 ` [PATCH rfcv1 7/8] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range() Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 8/8] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs Nicolin Chen
7 siblings, 1 reply; 18+ messages in thread
From: Nicolin Chen @ 2025-08-14 1:25 UTC (permalink / raw)
To: jgg, will, robin.murphy
Cc: joro, jean-philippe, miko.lenczewski, balbirs, peterz, smostafa,
kevin.tian, praan, zhangzekun11, linux-arm-kernel, iommu,
linux-kernel, patches
Update the invs array with the invalidations required by each domain type
during attachment operations.
Only an SVA domain or a paging domain will have an invs array:
a. SVA domain will add an INV_TYPE_S1_ASID per SMMU and an INV_TYPE_ATS
per SID
b. Non-nesting-parent paging domain with no ATS-enabled master will add
a single INV_TYPE_S1_ASID or INV_TYPE_S2_VMID per SMMU
c. Non-nesting-parent paging domain with ATS-enabled master(s) will do
(b) and add an INV_TYPE_ATS per SID
d. Nesting-parent paging domain will add an INV_TYPE_S2_VMID followed by
an INV_TYPE_S2_VMID_S1_CLEAR per vSMMU. For an ATS-enabled master, it
will add an INV_TYPE_ATS_FULL per SID
Per-domain invalidation is not needed until the domain is attached to a
master, i.e. until a translation request becomes possible. Deferring it
clears the way to attaching the domain to many SMMUs, and avoids pointless
invalidation overhead during a teardown when there are no STE/CDs referring
to the domain. This also means that, when the last device is detached, the
old domain must flush its ASID or VMID, because any iommu_unmap() call after
that would not initiate any invalidation given an empty domain invs array.
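As an illustration of case (d) above, a nesting-parent S2 domain attached to
one ATS-enabled master with two SIDs on a single SMMU ends up with a sorted
invs array roughly like this (vmid/sid0/sid1 being placeholders):

  { .smmu = smmu, .type = INV_TYPE_S2_VMID,          .id = vmid },
  { .smmu = smmu, .type = INV_TYPE_S2_VMID_S1_CLEAR, .id = vmid },
  { .smmu = smmu, .type = INV_TYPE_ATS_FULL,         .id = sid0 },
  { .smmu = smmu, .type = INV_TYPE_ATS_FULL,         .id = sid1 },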
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 6 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 220 +++++++++++++++++++-
2 files changed, 225 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 0330444bef45f..715179249eced 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1084,6 +1084,12 @@ struct arm_smmu_attach_state {
ioasid_t ssid;
/* Resulting state */
struct arm_smmu_vmaster *vmaster;
+ struct arm_smmu_invs **old_domain_invs;
+ struct arm_smmu_invs *old_domain_oinvs;
+ struct arm_smmu_invs *old_domain_ninvs;
+ struct arm_smmu_invs **new_domain_invs;
+ struct arm_smmu_invs *new_domain_oinvs;
+ struct arm_smmu_invs *new_domain_ninvs;
bool ats_enabled;
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index fb5429d8ebb29..95615525b0ab8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3082,6 +3082,76 @@ static void arm_smmu_disable_iopf(struct arm_smmu_master *master,
iopf_queue_remove_device(master->smmu->evtq.iopf, master->dev);
}
+typedef struct arm_smmu_invs *(*invs_fn)(struct arm_smmu_invs *old_invs,
+ struct arm_smmu_invs *invs);
+
+static struct arm_smmu_invs *arm_smmu_build_invs(
+ struct arm_smmu_invs *old_invs, struct arm_smmu_domain *smmu_domain,
+ struct arm_smmu_master *master, bool ats, ioasid_t ssid, invs_fn fn)
+{
+ const bool e2h = master->smmu->features & ARM_SMMU_FEAT_E2H;
+ const bool nesting = smmu_domain->nest_parent;
+ struct arm_smmu_inv *cur = master->invs->inv;
+ size_t num_invs = 1;
+ size_t i;
+
+ switch (smmu_domain->stage) {
+ case ARM_SMMU_DOMAIN_SVA:
+ case ARM_SMMU_DOMAIN_S1:
+ cur->smmu = master->smmu;
+ cur->type = INV_TYPE_S1_ASID;
+ cur->id = smmu_domain->cd.asid;
+ cur->size_opcode = e2h ? CMDQ_OP_TLBI_EL2_VA :
+ CMDQ_OP_TLBI_NH_VA;
+ cur->nsize_opcode = e2h ? CMDQ_OP_TLBI_EL2_ASID :
+ CMDQ_OP_TLBI_NH_ASID;
+ break;
+ case ARM_SMMU_DOMAIN_S2:
+ cur->smmu = master->smmu;
+ cur->type = INV_TYPE_S2_VMID;
+ cur->id = smmu_domain->s2_cfg.vmid;
+ cur->size_opcode = CMDQ_OP_TLBI_S2_IPA;
+ cur->nsize_opcode = CMDQ_OP_TLBI_S12_VMALL;
+ break;
+ default:
+ WARN_ON(true);
+ return old_invs;
+ }
+
+ /* Range-based invalidation requires the leaf pgsize for calculation */
+ if (master->smmu->features & ARM_SMMU_FEAT_RANGE_INV)
+ cur->pgsize = __ffs(smmu_domain->domain.pgsize_bitmap);
+
+ /* All the nested S1 ASIDs have to be flushed when S2 parent changes */
+ if (nesting) {
+ cur = &master->invs->inv[num_invs++];
+ cur->smmu = master->smmu;
+ cur->type = INV_TYPE_S2_VMID_S1_CLEAR;
+ cur->id = smmu_domain->s2_cfg.vmid;
+ cur->size_opcode = CMDQ_OP_TLBI_NH_ALL;
+ cur->nsize_opcode = CMDQ_OP_TLBI_NH_ALL;
+ }
+
+ if (ats) {
+ for (i = 0, cur++; i < master->num_streams; i++) {
+ cur->smmu = master->smmu;
+ /*
+ * If an S2 used as a nesting parent is changed we have
+ * no option but to completely flush the ATC.
+ */
+ cur->type = nesting ? INV_TYPE_ATS_FULL : INV_TYPE_ATS;
+ cur->id = master->streams[i].id;
+ cur->ssid = ssid;
+ cur->size_opcode = CMDQ_OP_ATC_INV;
+ cur->nsize_opcode = CMDQ_OP_ATC_INV;
+ }
+ num_invs += master->num_streams;
+ }
+
+ master->invs->num_invs = num_invs;
+ return fn(old_invs, master->invs);
+}
+
static void arm_smmu_remove_master_domain(struct arm_smmu_master *master,
struct iommu_domain *domain,
ioasid_t ssid)
@@ -3111,6 +3181,144 @@ static void arm_smmu_remove_master_domain(struct arm_smmu_master *master,
kfree(master_domain);
}
+static int arm_smmu_attach_prepare_invs(struct arm_smmu_attach_state *state,
+ struct arm_smmu_domain *new_smmu_domain)
+{
+ struct arm_smmu_domain *old_smmu_domain =
+ to_smmu_domain_devices(state->old_domain);
+ struct arm_smmu_master *master = state->master;
+ bool blocking = false;
+
+ /* A re-attach case doesn't need to update invs array */
+ if (new_smmu_domain == old_smmu_domain)
+ return 0;
+
+ if (new_smmu_domain) {
+ state->new_domain_oinvs = rcu_dereference_protected(
+ new_smmu_domain->invs,
+ lockdep_is_held(&arm_smmu_asid_lock));
+ state->new_domain_ninvs = arm_smmu_build_invs(
+ state->new_domain_oinvs, new_smmu_domain, master,
+ state->ats_enabled, state->ssid, arm_smmu_invs_add);
+ if (IS_ERR(state->new_domain_ninvs))
+ return PTR_ERR(state->new_domain_ninvs);
+ state->new_domain_invs = &new_smmu_domain->invs;
+ blocking = new_smmu_domain->domain.type == IOMMU_DOMAIN_BLOCKED;
+ }
+
+ if (old_smmu_domain) {
+ state->old_domain_oinvs = rcu_dereference_protected(
+ old_smmu_domain->invs,
+ lockdep_is_held(&arm_smmu_asid_lock));
+ state->old_domain_ninvs = arm_smmu_build_invs(
+ state->old_domain_oinvs, old_smmu_domain, master,
+ master->ats_enabled, state->ssid, arm_smmu_invs_del);
+ if (IS_ERR(state->old_domain_ninvs)) {
+ /* An attachment to the blocked_domain must not fail */
+ if (blocking) {
+ state->old_domain_ninvs = NULL;
+ } else {
+ kfree(state->new_domain_ninvs);
+ return PTR_ERR(state->old_domain_ninvs);
+ }
+ }
+ state->old_domain_invs = &old_smmu_domain->invs;
+ /* master->invs is retaining the del_invs for the old domain */
+ }
+
+ return 0;
+}
+
+/* Must be installed before arm_smmu_install_ste_for_dev() */
+static void
+arm_smmu_install_new_domain_invs(struct arm_smmu_attach_state *state)
+{
+ if (!state->new_domain_invs)
+ return;
+
+ rcu_assign_pointer(*state->new_domain_invs, state->new_domain_ninvs);
+ /*
+ * Committed to updating the STE, using the new invalidation array, and
+ * acquiring any racing IOPTE updates.
+ */
+ smp_mb();
+ kfree_rcu(state->new_domain_oinvs, rcu);
+}
+
+/* Should be installed after arm_smmu_install_ste_for_dev() */
+static void
+arm_smmu_install_old_domain_invs(struct arm_smmu_attach_state *state)
+{
+ struct arm_smmu_invs *old_domain_oinvs = state->old_domain_oinvs;
+ struct arm_smmu_invs *old_domain_ninvs = state->old_domain_ninvs;
+ struct arm_smmu_master *master = state->master;
+ unsigned long flags;
+ size_t num_invs;
+
+ if (!state->old_domain_invs)
+ return;
+
+ /* Activate the no-fail protocol upon an allocation failure */
+ if (!old_domain_ninvs) {
+ /*
+ * Notes:
+ * - The array will be edited in place while holding its rwlock
+ * which has a tradeoff that any concurrent invalidation will
+ * fail at read_trylock() until arm_smmu_invs_dec() returns.
+ * - arm_smmu_invs_dec() doesn't update the array's num_invs; it only
+ * decreases the users counters. So, get num_invs from the
+ * returned value.
+ * - The master->invs retains the del_invs for the old domain.
+ */
+ num_invs = arm_smmu_invs_dec(old_domain_oinvs, master->invs);
+ } else {
+ rcu_assign_pointer(*state->old_domain_invs, old_domain_ninvs);
+ /*
+ * Fake an empty old array that a concurrent invalidation thread
+ * races at. It either lets the reader quickly respin for a new
+ * array with fewer num_invs (avoiding deleted invalidations) or
+ * blocks the writer till the reader flushes the array (avoiding
+ * ATC invalidation timeouts for ATS invalidations being sent to
+ * a resetting PCI device).
+ */
+ write_lock_irqsave(&old_domain_oinvs->rwlock, flags);
+ old_domain_oinvs->num_invs = 0;
+ write_unlock_irqrestore(&old_domain_oinvs->rwlock, flags);
+
+ kfree_rcu(old_domain_oinvs, rcu);
+ num_invs = state->old_domain_ninvs->num_invs;
+ }
+
+ /*
+ * The domain invs array was filled when the first device attaches to it
+ * and emptied when the last device detaches. So, the invs array doesn't
+ * synchronize with iommu_unmap() calls, which might come after the last
+ * detach and end up with a NOP. This would result in missing a critical
+ * TLB maintenance. Thus, when the last device is detached (indicated by
+ * an empty invs array), flush all TLBs using the removed ASID or VMID.
+ */
+ if (!num_invs) {
+ struct arm_smmu_inv *inv = &master->invs->inv[0];
+ struct arm_smmu_cmdq_ent cmd = {
+ .opcode = inv->nsize_opcode,
+ };
+
+ switch (inv->type) {
+ case INV_TYPE_S1_ASID:
+ cmd.tlbi.asid = inv->id;
+ arm_smmu_cmdq_issue_cmd_with_sync(inv->smmu, &cmd);
+ break;
+ case INV_TYPE_S2_VMID:
+ cmd.tlbi.vmid = inv->id;
+ arm_smmu_cmdq_issue_cmd_with_sync(inv->smmu, &cmd);
+ break;
+ default:
+ WARN_ON(true);
+ break;
+ }
+ }
+}
+
/*
* Start the sequence to attach a domain to a master. The sequence contains three
* steps:
@@ -3168,12 +3376,16 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
arm_smmu_ats_supported(master);
}
+ ret = arm_smmu_attach_prepare_invs(state, smmu_domain);
+ if (ret)
+ return ret;
+
if (smmu_domain) {
if (new_domain->type == IOMMU_DOMAIN_NESTED) {
ret = arm_smmu_attach_prepare_vmaster(
state, to_smmu_nested_domain(new_domain));
if (ret)
- return ret;
+ goto err_unprepare_invs;
}
master_domain = kzalloc(sizeof(*master_domain), GFP_KERNEL);
@@ -3221,6 +3433,8 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
atomic_inc(&smmu_domain->nr_ats_masters);
list_add(&master_domain->devices_elm, &smmu_domain->devices);
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+
+ arm_smmu_install_new_domain_invs(state);
}
if (!state->ats_enabled && master->ats_enabled) {
@@ -3240,6 +3454,9 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
kfree(master_domain);
err_free_vmaster:
kfree(state->vmaster);
+err_unprepare_invs:
+ kfree(state->old_domain_ninvs);
+ kfree(state->new_domain_ninvs);
return ret;
}
@@ -3271,6 +3488,7 @@ void arm_smmu_attach_commit(struct arm_smmu_attach_state *state)
}
arm_smmu_remove_master_domain(master, state->old_domain, state->ssid);
+ arm_smmu_install_old_domain_invs(state);
master->ats_enabled = state->ats_enabled;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH rfcv1 7/8] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2025-08-14 1:25 [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (5 preceding siblings ...)
2025-08-14 1:25 ` [PATCH rfcv1 6/8] iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters Nicolin Chen
@ 2025-08-14 1:25 ` Nicolin Chen
2025-08-27 18:49 ` Jason Gunthorpe
2025-08-14 1:25 ` [PATCH rfcv1 8/8] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs Nicolin Chen
7 siblings, 1 reply; 18+ messages in thread
From: Nicolin Chen @ 2025-08-14 1:25 UTC (permalink / raw)
To: jgg, will, robin.murphy
Cc: joro, jean-philippe, miko.lenczewski, balbirs, peterz, smostafa,
kevin.tian, praan, zhangzekun11, linux-arm-kernel, iommu,
linux-kernel, patches
Now, each smmu_domain is built with an invs array that keeps all the IDs
(asid/vmid) and its attached device SIDs, following the exact pattern of
all the existing invalidation functions.
Introduce a new arm_smmu_domain_inv_range() helper, plus an inline
arm_smmu_domain_inv() wrapper, that iterates smmu_domain->invs to convert
the invalidation array into commands. An invalidation request with no size
specified results in an entire flush instead of a range-based one.
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 9 +
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 215 ++++++++++++++++++--
2 files changed, 211 insertions(+), 13 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 715179249eced..69271beb54527 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1060,6 +1060,15 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
unsigned long iova, size_t size);
+void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
+ unsigned long iova, size_t size,
+ unsigned int granule, bool leaf);
+
+static inline void arm_smmu_domain_inv(struct arm_smmu_domain *smmu_domain)
+{
+ arm_smmu_domain_inv_range(smmu_domain, 0, 0, 0, false);
+}
+
void __arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu,
struct arm_smmu_cmdq *cmdq);
int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 95615525b0ab8..aa770275029e2 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2530,23 +2530,19 @@ static void arm_smmu_tlb_inv_context(void *cookie)
arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
}
-static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
- unsigned long iova, size_t size,
- size_t granule,
- struct arm_smmu_domain *smmu_domain)
+static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu,
+ struct arm_smmu_cmdq_batch *cmds,
+ struct arm_smmu_cmdq_ent *cmd,
+ unsigned long iova, size_t size,
+ size_t granule, size_t pgsize)
{
- struct arm_smmu_device *smmu = smmu_domain->smmu;
- unsigned long end = iova + size, num_pages = 0, tg = 0;
+ unsigned long end = iova + size, num_pages = 0, tg = pgsize;
size_t inv_range = granule;
- struct arm_smmu_cmdq_batch cmds;
if (!size)
return;
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
- /* Get the leaf page size */
- tg = __ffs(smmu_domain->domain.pgsize_bitmap);
-
num_pages = size >> tg;
/* Convert page size of 12,14,16 (log2) to 1,2,3 */
@@ -2566,8 +2562,6 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
num_pages++;
}
- arm_smmu_cmdq_batch_init(smmu, &cmds, cmd);
-
while (iova < end) {
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
/*
@@ -2595,9 +2589,26 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
}
cmd->tlbi.addr = iova;
- arm_smmu_cmdq_batch_add(smmu, &cmds, cmd);
+ arm_smmu_cmdq_batch_add(smmu, cmds, cmd);
iova += inv_range;
}
+}
+
+static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
+ unsigned long iova, size_t size,
+ size_t granule,
+ struct arm_smmu_domain *smmu_domain)
+{
+ struct arm_smmu_device *smmu = smmu_domain->smmu;
+ struct arm_smmu_cmdq_batch cmds;
+ size_t pgsize;
+
+ /* Get the leaf page size */
+ pgsize = __ffs(smmu_domain->domain.pgsize_bitmap);
+
+ arm_smmu_cmdq_batch_init(smmu, &cmds, cmd);
+ arm_smmu_cmdq_batch_add_range(smmu, &cmds, cmd, iova, size, granule,
+ pgsize);
arm_smmu_cmdq_batch_submit(smmu, &cmds);
}
@@ -2653,6 +2664,184 @@ void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
}
+static bool arm_smmu_inv_size_too_big(struct arm_smmu_device *smmu, size_t size,
+ size_t granule)
+{
+ size_t max_tlbi_ops;
+
+ /* 0 size means invalidate all */
+ if (!size || size == SIZE_MAX)
+ return true;
+
+ if (smmu->features & ARM_SMMU_FEAT_RANGE_INV)
+ return false;
+
+ /*
+ * Borrowed from the MAX_TLBI_OPS in arch/arm64/include/asm/tlbflush.h,
+ * this is used as a threshold to replace "size_opcode" commands with a
+ * single "nsize_opcode" command, when SMMU doesn't implement the range
+ * invalidation feature, where there can be too many per-granule TLBIs,
+ * resulting in a soft lockup.
+ */
+ max_tlbi_ops = 1 << (ilog2(granule) - 3);
+ return size >= max_tlbi_ops * granule;
+}
+
+/* Used by non INV_TYPE_ATS* invalidations */
+static void arm_smmu_inv_to_cmdq_batch(struct arm_smmu_inv *inv,
+ struct arm_smmu_cmdq_batch *cmds,
+ struct arm_smmu_cmdq_ent *cmd,
+ unsigned long iova, size_t size,
+ unsigned int granule)
+{
+ if (arm_smmu_inv_size_too_big(inv->smmu, size, granule)) {
+ cmd->opcode = inv->nsize_opcode;
+ /* nsize_opcode always needs a sync, no batching */
+ arm_smmu_cmdq_issue_cmd_with_sync(inv->smmu, cmd);
+ return;
+ }
+
+ cmd->opcode = inv->size_opcode;
+ arm_smmu_cmdq_batch_add_range(inv->smmu, cmds, cmd, iova, size, granule,
+ inv->pgsize);
+}
+
+static bool arm_smmu_invs_end_batch(struct arm_smmu_invs *invs, size_t idx)
+{
+ struct arm_smmu_inv *cur = &invs->inv[idx];
+ struct arm_smmu_inv *next;
+
+ /* Last array entry ends */
+ if (idx + 1 == invs->num_invs)
+ return true;
+
+ next = &cur[1];
+ /* Changing smmu means changing command queue */
+ if (cur->smmu != next->smmu)
+ return true;
+ /* The batch for S2 TLBI must be done before nested S1 ASIDs */
+ if (next->type == INV_TYPE_S2_VMID_S1_CLEAR)
+ return true;
+ /* ATS must be after a sync of the S1/S2 invalidations */
+ if (cur->type != INV_TYPE_ATS && cur->type != INV_TYPE_ATS_FULL &&
+ (next->type == INV_TYPE_ATS || next->type == INV_TYPE_ATS_FULL))
+ return true;
+ return false;
+}
+
+void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
+ unsigned long iova, size_t size,
+ unsigned int granule, bool leaf)
+{
+ struct arm_smmu_cmdq_batch cmds = {};
+ struct arm_smmu_invs *invs;
+ bool retried = false;
+ size_t i;
+
+ /*
+ * An invalidation request must follow some IOPTE change and then load
+ * the invalidation array. In the meantime, a domain attachment mutates
+ * the array and then stores an STE/CD asking SMMU HW to acquire those
+ * changed IOPTEs. In other words, these two are interdependent and can
+ * race.
+ *
+ * In a race, the RCU design (with its underlying memory barriers) can
+ * ensure that the invalidation array is always updated before it is loaded.
+ *
+ * smp_mb() is used here, paired with the smp_mb() following the array
+ * update in a concurrent attach, to ensure:
+ * - HW sees the new IOPTEs if it walks after STE installation
+ * - Invalidation thread sees the updated array with the new ASID.
+ *
+ * [CPU0] | [CPU1]
+ * |
+ * change IOPTEs and TLB flush: |
+ * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs {
+ * ... | rcu_assign_pointer(new_invs);
+ * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
+ * ... | kfree_rcu(old_invs, rcu);
+ * // load invalidation array | }
+ * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
+ * | STE = TTB0 // read new IOPTEs
+ */
+ smp_mb();
+
+ rcu_read_lock();
+again:
+ invs = rcu_dereference(smmu_domain->invs);
+
+ /* A concurrent attachment might have changed the array. Do a respin */
+ if (unlikely(!read_trylock(&invs->rwlock)))
+ goto again;
+ /* Only one retry. Otherwise, it would soft lockup on an empty array */
+ if (!retried && unlikely(!invs->num_invs)) {
+ read_unlock(&invs->rwlock);
+ retried = true;
+ goto again;
+ }
+
+ for (i = 0; i < invs->num_invs; i++) {
+ struct arm_smmu_inv *cur = &invs->inv[i];
+ struct arm_smmu_device *smmu = cur->smmu;
+ struct arm_smmu_cmdq_ent cmd = {
+ /*
+ * Pick size_opcode to run arm_smmu_get_cmdq(). This can
+ * be changed to nsize_opcode, which would result in the
+ * same CMDQ pointer.
+ */
+ .opcode = cur->size_opcode,
+ };
+
+ /* Do not throw any trash to the command queue */
+ if (refcount_read(&cur->users) == 0)
+ continue;
+
+ if (!cmds.num)
+ arm_smmu_cmdq_batch_init(smmu, &cmds, &cmd);
+
+ switch (cur->type) {
+ case INV_TYPE_S1_ASID:
+ cmd.tlbi.asid = cur->id;
+ cmd.tlbi.leaf = leaf;
+ arm_smmu_inv_to_cmdq_batch(cur, &cmds, &cmd, iova, size,
+ granule);
+ break;
+ case INV_TYPE_S2_VMID:
+ cmd.tlbi.vmid = cur->id;
+ cmd.tlbi.leaf = leaf;
+ arm_smmu_inv_to_cmdq_batch(cur, &cmds, &cmd, iova, size,
+ granule);
+ break;
+ case INV_TYPE_S2_VMID_S1_CLEAR:
+ /* CMDQ_OP_TLBI_S12_VMALL already flushed S1 entries */
+ if (arm_smmu_inv_size_too_big(cur->smmu, size, granule))
+ continue;
+ /* Just a single CMDQ_OP_TLBI_NH_ALL, no batching */
+ cmd.tlbi.vmid = cur->id;
+ arm_smmu_cmdq_issue_cmd_with_sync(cur->smmu, &cmd);
+ continue;
+ case INV_TYPE_ATS:
+ arm_smmu_atc_inv_to_cmd(cur->ssid, iova, size, &cmd);
+ cmd.atc.sid = cur->id;
+ arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
+ break;
+ case INV_TYPE_ATS_FULL:
+ arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
+ cmd.atc.sid = cur->id;
+ arm_smmu_cmdq_batch_add(smmu, &cmds, &cmd);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ continue;
+ }
+
+ if (cmds.num && arm_smmu_invs_end_batch(invs, i))
+ arm_smmu_cmdq_batch_submit(smmu, &cmds);
+ }
+ read_unlock(&invs->rwlock);
+ rcu_read_unlock();
+}
+
static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
unsigned long iova, size_t granule,
void *cookie)
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH rfcv1 8/8] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs
2025-08-14 1:25 [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
` (6 preceding siblings ...)
2025-08-14 1:25 ` [PATCH rfcv1 7/8] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range() Nicolin Chen
@ 2025-08-14 1:25 ` Nicolin Chen
7 siblings, 0 replies; 18+ messages in thread
From: Nicolin Chen @ 2025-08-14 1:25 UTC (permalink / raw)
To: jgg, will, robin.murphy
Cc: joro, jean-philippe, miko.lenczewski, balbirs, peterz, smostafa,
kevin.tian, praan, zhangzekun11, linux-arm-kernel, iommu,
linux-kernel, patches
Replace the old invalidation functions with arm_smmu_domain_inv_range() in
all the existing invalidation routines, and delete the now-unused helpers.
The new arm_smmu_domain_inv_range() handles the CMDQ_MAX_TLBI_OPS threshold
as well, so drop it from the SVA function.
Since arm_smmu_cmdq_batch_add_range() has only one caller now, and it must
be given a valid size, add a WARN_ON_ONCE to catch any missed case.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 -
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 29 +--
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 165 +-----------------
3 files changed, 11 insertions(+), 190 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 69271beb54527..4ccb03c4a69d0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1053,13 +1053,6 @@ int arm_smmu_set_pasid(struct arm_smmu_master *master,
struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
struct arm_smmu_cd *cd, struct iommu_domain *old);
-void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid);
-void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
- size_t granule, bool leaf,
- struct arm_smmu_domain *smmu_domain);
-int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
- unsigned long iova, size_t size);
-
void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
unsigned long iova, size_t size,
unsigned int granule, bool leaf);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index fc601b494e0af..048b53f79b144 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -122,15 +122,6 @@ void arm_smmu_make_sva_cd(struct arm_smmu_cd *target,
}
EXPORT_SYMBOL_IF_KUNIT(arm_smmu_make_sva_cd);
-/*
- * Cloned from the MAX_TLBI_OPS in arch/arm64/include/asm/tlbflush.h, this
- * is used as a threshold to replace per-page TLBI commands to issue in the
- * command queue with an address-space TLBI command, when SMMU w/o a range
- * invalidation feature handles too many per-page TLBI commands, which will
- * otherwise result in a soft lockup.
- */
-#define CMDQ_MAX_TLBI_OPS (1 << (PAGE_SHIFT - 3))
-
static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
@@ -146,21 +137,8 @@ static void arm_smmu_mm_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
* range. So do a simple translation here by calculating size correctly.
*/
size = end - start;
- if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
- if (size >= CMDQ_MAX_TLBI_OPS * PAGE_SIZE)
- size = 0;
- } else {
- if (size == ULONG_MAX)
- size = 0;
- }
-
- if (!size)
- arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
- else
- arm_smmu_tlb_inv_range_asid(start, size, smmu_domain->cd.asid,
- PAGE_SIZE, false, smmu_domain);
- arm_smmu_atc_inv_domain(smmu_domain, start, size);
+ arm_smmu_domain_inv_range(smmu_domain, start, size, PAGE_SIZE, false);
}
static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
@@ -191,8 +169,7 @@ static void arm_smmu_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
}
spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
- arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
- arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
+ arm_smmu_domain_inv(smmu_domain);
}
static void arm_smmu_mmu_notifier_free(struct mmu_notifier *mn)
@@ -301,7 +278,7 @@ static void arm_smmu_sva_domain_free(struct iommu_domain *domain)
/*
* Ensure the ASID is empty in the iommu cache before allowing reuse.
*/
- arm_smmu_tlb_inv_asid(smmu_domain->smmu, smmu_domain->cd.asid);
+ arm_smmu_domain_inv(smmu_domain);
/*
* Notice that the arm_smmu_mm_arch_invalidate_secondary_tlbs op can
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index aa770275029e2..8aec471af4316 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1023,16 +1023,6 @@ static void arm_smmu_page_response(struct device *dev, struct iopf_fault *unused
}
/* Context descriptor manipulation functions */
-void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
-{
- struct arm_smmu_cmdq_ent cmd = {
- .opcode = smmu->features & ARM_SMMU_FEAT_E2H ?
- CMDQ_OP_TLBI_EL2_ASID : CMDQ_OP_TLBI_NH_ASID,
- .tlbi.asid = asid,
- };
-
- arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
-}
static int arm_smmu_invs_cmp(const void *_l, const void *_r)
{
@@ -2444,74 +2434,10 @@ static int arm_smmu_atc_inv_master(struct arm_smmu_master *master,
return arm_smmu_cmdq_batch_submit(master->smmu, &cmds);
}
-int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
- unsigned long iova, size_t size)
-{
- struct arm_smmu_master_domain *master_domain;
- int i;
- unsigned long flags;
- struct arm_smmu_cmdq_ent cmd = {
- .opcode = CMDQ_OP_ATC_INV,
- };
- struct arm_smmu_cmdq_batch cmds;
-
- if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_ATS))
- return 0;
-
- /*
- * Ensure that we've completed prior invalidation of the main TLBs
- * before we read 'nr_ats_masters' in case of a concurrent call to
- * arm_smmu_enable_ats():
- *
- * // unmap() // arm_smmu_enable_ats()
- * TLBI+SYNC atomic_inc(&nr_ats_masters);
- * smp_mb(); [...]
- * atomic_read(&nr_ats_masters); pci_enable_ats() // writel()
- *
- * Ensures that we always see the incremented 'nr_ats_masters' count if
- * ATS was enabled at the PCI device before completion of the TLBI.
- */
- smp_mb();
- if (!atomic_read(&smmu_domain->nr_ats_masters))
- return 0;
-
- arm_smmu_cmdq_batch_init(smmu_domain->smmu, &cmds, &cmd);
-
- spin_lock_irqsave(&smmu_domain->devices_lock, flags);
- list_for_each_entry(master_domain, &smmu_domain->devices,
- devices_elm) {
- struct arm_smmu_master *master = master_domain->master;
-
- if (!master->ats_enabled)
- continue;
-
- if (master_domain->nested_ats_flush) {
- /*
- * If a S2 used as a nesting parent is changed we have
- * no option but to completely flush the ATC.
- */
- arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
- } else {
- arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size,
- &cmd);
- }
-
- for (i = 0; i < master->num_streams; i++) {
- cmd.atc.sid = master->streams[i].id;
- arm_smmu_cmdq_batch_add(smmu_domain->smmu, &cmds, &cmd);
- }
- }
- spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
-
- return arm_smmu_cmdq_batch_submit(smmu_domain->smmu, &cmds);
-}
-
/* IO_PGTABLE API */
static void arm_smmu_tlb_inv_context(void *cookie)
{
struct arm_smmu_domain *smmu_domain = cookie;
- struct arm_smmu_device *smmu = smmu_domain->smmu;
- struct arm_smmu_cmdq_ent cmd;
/*
* NOTE: when io-pgtable is in non-strict mode, we may get here with
@@ -2520,14 +2446,7 @@ static void arm_smmu_tlb_inv_context(void *cookie)
* insertion to guarantee those are observed before the TLBI. Do be
* careful, 007.
*/
- if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
- arm_smmu_tlb_inv_asid(smmu, smmu_domain->cd.asid);
- } else {
- cmd.opcode = CMDQ_OP_TLBI_S12_VMALL;
- cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
- arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
- }
- arm_smmu_atc_inv_domain(smmu_domain, 0, 0);
+ arm_smmu_domain_inv(smmu_domain);
}
static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu,
@@ -2539,7 +2458,7 @@ static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu,
unsigned long end = iova + size, num_pages = 0, tg = pgsize;
size_t inv_range = granule;
- if (!size)
+ if (WARN_ON_ONCE(!size))
return;
if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
@@ -2594,76 +2513,6 @@ static void arm_smmu_cmdq_batch_add_range(struct arm_smmu_device *smmu,
}
}
-static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
- unsigned long iova, size_t size,
- size_t granule,
- struct arm_smmu_domain *smmu_domain)
-{
- struct arm_smmu_device *smmu = smmu_domain->smmu;
- struct arm_smmu_cmdq_batch cmds;
- size_t pgsize;
-
- /* Get the leaf page size */
- pgsize = __ffs(smmu_domain->domain.pgsize_bitmap);
-
- arm_smmu_cmdq_batch_init(smmu, &cmds, cmd);
- arm_smmu_cmdq_batch_add_range(smmu, &cmds, cmd, iova, size, granule,
- pgsize);
- arm_smmu_cmdq_batch_submit(smmu, &cmds);
-}
-
-static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
- size_t granule, bool leaf,
- struct arm_smmu_domain *smmu_domain)
-{
- struct arm_smmu_cmdq_ent cmd = {
- .tlbi = {
- .leaf = leaf,
- },
- };
-
- if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
- cmd.opcode = smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
- CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA;
- cmd.tlbi.asid = smmu_domain->cd.asid;
- } else {
- cmd.opcode = CMDQ_OP_TLBI_S2_IPA;
- cmd.tlbi.vmid = smmu_domain->s2_cfg.vmid;
- }
- __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
-
- if (smmu_domain->nest_parent) {
- /*
- * When the S2 domain changes all the nested S1 ASIDs have to be
- * flushed too.
- */
- cmd.opcode = CMDQ_OP_TLBI_NH_ALL;
- arm_smmu_cmdq_issue_cmd_with_sync(smmu_domain->smmu, &cmd);
- }
-
- /*
- * Unfortunately, this can't be leaf-only since we may have
- * zapped an entire table.
- */
- arm_smmu_atc_inv_domain(smmu_domain, iova, size);
-}
-
-void arm_smmu_tlb_inv_range_asid(unsigned long iova, size_t size, int asid,
- size_t granule, bool leaf,
- struct arm_smmu_domain *smmu_domain)
-{
- struct arm_smmu_cmdq_ent cmd = {
- .opcode = smmu_domain->smmu->features & ARM_SMMU_FEAT_E2H ?
- CMDQ_OP_TLBI_EL2_VA : CMDQ_OP_TLBI_NH_VA,
- .tlbi = {
- .asid = asid,
- .leaf = leaf,
- },
- };
-
- __arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
-}
-
static bool arm_smmu_inv_size_too_big(struct arm_smmu_device *smmu, size_t size,
size_t granule)
{
@@ -2855,7 +2704,9 @@ static void arm_smmu_tlb_inv_page_nosync(struct iommu_iotlb_gather *gather,
static void arm_smmu_tlb_inv_walk(unsigned long iova, size_t size,
size_t granule, void *cookie)
{
- arm_smmu_tlb_inv_range_domain(iova, size, granule, false, cookie);
+ struct arm_smmu_domain *smmu_domain = cookie;
+
+ arm_smmu_domain_inv_range(smmu_domain, iova, size, granule, false);
}
static const struct iommu_flush_ops arm_smmu_flush_ops = {
@@ -4077,9 +3928,9 @@ static void arm_smmu_iotlb_sync(struct iommu_domain *domain,
if (!gather->pgsize)
return;
- arm_smmu_tlb_inv_range_domain(gather->start,
- gather->end - gather->start + 1,
- gather->pgsize, true, smmu_domain);
+ arm_smmu_domain_inv_range(smmu_domain, gather->start,
+ gather->end - gather->start + 1,
+ gather->pgsize, true);
}
static phys_addr_t
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-08-14 1:25 ` [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
@ 2025-08-26 19:50 ` Jason Gunthorpe
2025-08-27 0:49 ` Nicolin Chen
2025-08-27 16:48 ` Jason Gunthorpe
2025-08-27 20:00 ` Jason Gunthorpe
2 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2025-08-26 19:50 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, joro, jean-philippe, miko.lenczewski, balbirs,
peterz, smostafa, kevin.tian, praan, zhangzekun11,
linux-arm-kernel, iommu, linux-kernel, patches
On Wed, Aug 13, 2025 at 06:25:35PM -0700, Nicolin Chen wrote:
> +/**
> + * arm_smmu_invs_add() - Combine @old_invs with @add_invs to a new array
> + * @old_invs: the old invalidation array
> + * @add_invs: an array of invalidations to add
> + *
> + * Return: a newly allocated and sorted invalidation array on success, or an
> + * ERR_PTR.
> + *
> + * This function must be locked and serialized with arm_smmu_invs_del/dec(),
> + * but does not lockdep-assert any lock, for the sake of the KUNIT test.
> + *
> + * Caller is responsible for freeing the @old_invs and the returned one.
> + *
> + * Entries marked as trash can be reused if @add_invs wants to add them back.
> + * Otherwise, they will be completely removed in the returned array.
> + */
> +VISIBLE_IF_KUNIT
> +struct arm_smmu_invs *arm_smmu_invs_add(struct arm_smmu_invs *old_invs,
> + struct arm_smmu_invs *add_invs)
> +{
> + size_t need = old_invs->num_invs + add_invs->num_invs;
> + struct arm_smmu_invs *new_invs;
> + size_t deletes = 0, i, j;
> + u64 existed = 0;
> +
> + /* Max of add_invs->num_invs is 64 */
> + if (WARN_ON(add_invs->num_invs > sizeof(existed) * 8))
> + return ERR_PTR(-EINVAL);
Since this is driven off of num_streams, using a fixed bitmap doesn't
seem great, since I suppose the DT isn't limited to 64.
Given how this is working now I think you can just add a new member to
the struct:
struct arm_smmu_inv {
/* invalidation items */
struct arm_smmu_device *smmu;
u8 type;
u8 size_opcode;
u8 nsize_opcode;
/* Temporary bits for add/del functions */
u8 reuse:1;
u8 todel:1;
And use reuse as the temporary instead of the bitmap.
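Roughly, as a sketch only (untested; this assumes same_op() from the patch
stays as-is and the reuse bit above is added), the duplicate tracking in
arm_smmu_invs_add() would then become:

	for (i = 0; i != old_invs->num_invs; i++) {
		struct arm_smmu_inv *cur = &old_invs->inv[i];

		/* Dead entries (users == 0) are simply dropped, never reused */
		if (!refcount_read(&cur->users))
			continue;

		for (j = 0; j != add_invs->num_invs; j++) {
			/* Mark add_invs entries that already exist in old_invs */
			if (same_op(cur, &add_invs->inv[j])) {
				add_invs->inv[j].reuse = true;
				break;
			}
		}
	}

and the copy loop below only checks add_invs->inv[j].reuse instead of
existed & BIT_ULL(j), so the 64 entry limit goes away.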
> + for (i = 0; i != old_invs->num_invs; i++) {
> + struct arm_smmu_inv *cur = &old_invs->inv[i];
missing newline
> + /* Count the trash entries to delete */
> + if (cur->todel) {
> + WARN_ON_ONCE(refcount_read(&cur->users));
> + deletes++;
> + }
Just do continue here.
todel should only be used as a temporary. Use refcount_read() ==
0. Then you don't need a WARN either.
> + for (j = 0; j != add_invs->num_invs; j++) {
> + if (!same_op(cur, &add_invs->inv[j]))
> + continue;
> + /* Found duplicated entries in add_invs */
> + if (WARN_ON_ONCE(existed & BIT_ULL(j)))
> + continue;
inv[j].reuse
> + /* Revert the todel marker for reuse */
> + if (cur->todel) {
> + cur->todel = false;
> + deletes--;
This will blow up the refcount_inc() below because users is 0.
There is no point in trying to optimize like this; just discard the
old entry and add a new one.
> + new_invs = arm_smmu_invs_alloc(need);
> + if (IS_ERR(new_invs)) {
> + /* Don't forget to revert all the todel markers */
> + for (i = 0; i != old_invs->num_invs; i++) {
> + if (refcount_read(&old_invs->inv[i].users) == 0)
> + old_invs->inv[i].todel = true;
> + }
Drop as well
> + return new_invs;
> + }
> +
> + /* Copy the entire array less all the todel entries */
> + for (i = 0; i != old_invs->num_invs; i++) {
> + if (old_invs->inv[i].todel)
> + continue;
refcount_read
> + new_invs->inv[new_invs->num_invs++] = old_invs->inv[i];
> + }
> +
> + for (j = 0; j != add_invs->num_invs; j++) {
> + if (existed & BIT_ULL(j)) {
inv[j]->reuse
> + unsigned int idx = add_invs->inv[j].id;
Similar remarks for del, use users to set todel, don't expect it to be
valid coming into the function.
Jason
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH rfcv1 5/8] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array
2025-08-14 1:25 ` [PATCH rfcv1 5/8] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array Nicolin Chen
@ 2025-08-26 19:56 ` Jason Gunthorpe
0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2025-08-26 19:56 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, joro, jean-philippe, miko.lenczewski, balbirs,
peterz, smostafa, kevin.tian, praan, zhangzekun11,
linux-arm-kernel, iommu, linux-kernel, patches
On Wed, Aug 13, 2025 at 06:25:36PM -0700, Nicolin Chen wrote:
> @@ -3730,6 +3731,13 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
> return -ENOMEM;
> master->num_streams = fwspec->num_ids;
>
> + /* Max possible num_invs: two for ASID/VMIDs and num_ats for ATC_INVs */
> + master->invs = arm_smmu_invs_alloc(2 + num_ats);
> + if (IS_ERR(master->invs)) {
> + kfree(master->streams);
> + return PTR_ERR(master->invs);
> + }
This seems like a nice solution, but I would add a comment here that
it is locked by the group mutex, and check if ATS is supported:
/*
* Scratch memory to build the per-domain invalidation list. locked by
* the group_mutex. Max possible num_invs: two for ASID/VMIDs and
* num_streams for ATC_INVs
*/
if (dev_is_pci(master->dev) &&
pci_ats_supported(to_pci_dev(master->dev)))
master->invs = arm_smmu_invs_alloc(2 + master->num_streams);
else
master->invs = arm_smmu_invs_alloc(2);
And probably rename it scratch_invs or something to indicate it is
temporary memory.
I'm not sure there is any case where fwspec->num_ids >1 &&
ats_supported, or at least it should be really rare.
Jason
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-08-26 19:50 ` Jason Gunthorpe
@ 2025-08-27 0:49 ` Nicolin Chen
0 siblings, 0 replies; 18+ messages in thread
From: Nicolin Chen @ 2025-08-27 0:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: will, robin.murphy, joro, jean-philippe, miko.lenczewski, balbirs,
peterz, smostafa, kevin.tian, praan, zhangzekun11,
linux-arm-kernel, iommu, linux-kernel, patches
On Tue, Aug 26, 2025 at 04:50:03PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 13, 2025 at 06:25:35PM -0700, Nicolin Chen wrote:
> > +struct arm_smmu_invs *arm_smmu_invs_add(struct arm_smmu_invs *old_invs,
> > + struct arm_smmu_invs *add_invs)
> > +{
> > + size_t need = old_invs->num_invs + add_invs->num_invs;
> > + struct arm_smmu_invs *new_invs;
> > + size_t deletes = 0, i, j;
> > + u64 existed = 0;
> > +
> > + /* Max of add_invs->num_invs is 64 */
> > + if (WARN_ON(add_invs->num_invs > sizeof(existed) * 8))
> > + return ERR_PTR(-EINVAL);
>
> Since this is driven off of num_streams using a fixed bitmap doesn't
> seem great since I suppose the dt isn't limited to 64.
In the other patch, you noted that it's likely very rare to have
an ATS-supported device with multiple SIDs. Also, this function is
called per device. So, 64 should be enough?
With that being said...
> Given how this is working now I think you can just add a new member to
> the struct:
>
> struct arm_smmu_inv {
> /* invalidation items */
> struct arm_smmu_device *smmu;
> u8 type;
> u8 size_opcode;
> u8 nsize_opcode;
> /* Temporary bits for add/del functions */
> u8 reuse:1;
> u8 todel:1;
>
> And use reuse as the temporary instead of the bitmap.
... I do like this reuse flag. I will give it a try.
> > + /* Count the trash entries to deletes */
> > + if (cur->todel) {
> > + WARN_ON_ONCE(refcount_read(&cur->users));
> > + deletes++;
> > + }
>
> Just do continue here.
>
> todel should only be used as a temporary. Use refcount_read() ==
> 0. Then you don't need a WARN either.
I did so until my last local pre-v1 version, as it seemed cleaner
to mark them using the todel flag. I'll try again and see how
it goes.
> > + /* Revert the todel marker for reuse */
> > + if (cur->todel) {
> > + cur->todel = false;
> > + deletes--;
>
> This wil blow up the refcount_inc() below because users is 0..
> There is no point in trying to optimize like this just discard the
> old entry and add a new one.
Oh right. refcount == 0 can't increase...
> > + unsigned int idx = add_invs->inv[j].id;
>
>
> Similar remarks for del, use users to set todel, don't expect it to be
> valid coming into the function.
OK.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-08-14 1:25 ` [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
2025-08-26 19:50 ` Jason Gunthorpe
@ 2025-08-27 16:48 ` Jason Gunthorpe
2025-08-27 17:19 ` Nicolin Chen
2025-08-27 20:00 ` Jason Gunthorpe
2 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2025-08-27 16:48 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, joro, jean-philippe, miko.lenczewski, balbirs,
peterz, smostafa, kevin.tian, praan, zhangzekun11,
linux-arm-kernel, iommu, linux-kernel, patches
On Wed, Aug 13, 2025 at 06:25:35PM -0700, Nicolin Chen wrote:
> +/**
> + * arm_smmu_invs_del() - Remove @del_invs from @old_invs
> + * @old_invs: the old invalidation array
> + * @del_invs: an array of invalidations to delete
> + *
> + * Return: a newly allocated and sorted invalidation array on success, or an
> + * ERR_PTR.
> + *
> + * This function must be locked and serialized with arm_smmu_invs_add/dec(),
> + * but does not lockdep-assert any lock, for the sake of the KUNIT test.
> + *
> + * Caller is responsible for freeing the @old_invs and the returned one.
> + *
> + * Entries marked as trash will be completely removed in the returned array.
> + */
> +VISIBLE_IF_KUNIT
> +struct arm_smmu_invs *arm_smmu_invs_del(struct arm_smmu_invs *old_invs,
> + struct arm_smmu_invs *del_invs)
> +{
Having looked at this more completely, I think we should drop this
function.
Just always do decr, then have a simple function to compact the list
after the decr:
struct arm_smmu_invs *arm_smmu_invs_cleanup(struct arm_smmu_invs *invs,
size_t to_del)
{
struct arm_smmu_invs *new_invs;
size_t i, j;
if (WARN_ON(invs->num_invs < to_del))
return NULL;
new_invs = arm_smmu_invs_alloc(invs->num_invs - to_del);
if (IS_ERR(new_invs))
return NULL;
for (i = 0, j = 0; i != invs->num_invs; i++) {
if (!refcount_read(&invs->inv[i].users))
continue;
new_invs->inv[j] = invs->inv[i];
j++;
}
return new_invs;
}
If this returns NULL then just leave the list alone, it is OK to sit
there with the 0 users left behind.
No need for the complex _del function and the _decr function.
This also means the memory doesn't need to be preallocated, and it
significantly simplifies a lot of the logic.
Jason
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-08-27 16:48 ` Jason Gunthorpe
@ 2025-08-27 17:19 ` Nicolin Chen
2025-08-28 12:37 ` Jason Gunthorpe
0 siblings, 1 reply; 18+ messages in thread
From: Nicolin Chen @ 2025-08-27 17:19 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: will, robin.murphy, joro, jean-philippe, miko.lenczewski, balbirs,
peterz, smostafa, kevin.tian, praan, zhangzekun11,
linux-arm-kernel, iommu, linux-kernel, patches
On Wed, Aug 27, 2025 at 01:48:04PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 13, 2025 at 06:25:35PM -0700, Nicolin Chen wrote:
> > +/**
> > + * arm_smmu_invs_del() - Remove @del_invs from @old_invs
> > + * @old_invs: the old invalidation array
> > + * @del_invs: an array of invalidations to delete
> > + *
> > + * Return: a newly allocated and sorted invalidation array on success, or an
> > + * ERR_PTR.
> > + *
> > + * This function must be locked and serialized with arm_smmu_invs_add/dec(),
> > + * but does not lockdep-assert any lock, for the sake of the KUNIT test.
> > + *
> > + * Caller is responsible for freeing the @old_invs and the returned one.
> > + *
> > + * Entries marked as trash will be completely removed in the returned array.
> > + */
> > +VISIBLE_IF_KUNIT
> > +struct arm_smmu_invs *arm_smmu_invs_del(struct arm_smmu_invs *old_invs,
> > + struct arm_smmu_invs *del_invs)
> > +{
>
> Having looked at this more completely, I think we should drop this
> function.
>
> Just always do decr, then have a simple function to compact the list
> after the decr:
But the _dec function will always take the write lock, which seems
to lose the benefit of using an RCU array?
Given the complexity of the _dec function, it wouldn't release the
lock very quickly either. The current version only invokes it upon
kmalloc failure, which can be seen as a very rare case with minimal
impact.
> struct arm_smmu_invs *arm_smmu_invs_cleanup(struct arm_smmu_invs *invs,
> size_t to_del)
> {
> struct arm_smmu_invs *new_invs;
> size_t i, j;
>
> if (WARN_ON(invs->num_invs < to_del))
> return NULL;
>
> new_invs = arm_smmu_invs_alloc(invs->num_invs - to_del);
> if (IS_ERR(new_invs))
> return NULL;
>
> for (i = 0, j = 0; i != invs->num_invs; i++) {
> if (!refcount_read(&invs->inv[i].users))
> continue;
> new_invs->inv[j] = invs->inv[i];
> j++;
> }
> return new_invs;
> }
>
> If this returns NULL then just leave the list alone, it is OK to sit
> there with the 0 users left behind.
Yea, it's better than waiting for the next _add function to compact
the list.
> No need for the complex _del function and the _decr function..
>
> This also means the memory doesn't need to be preallocated and it
> significantly simplifies alot of the logic.
By "preallocated" you mean "master->scratch_invs"? I think it will
be still needed for the "dec_invs" argument in:
size_t arm_smmu_invs_dec(struct arm_smmu_invs *invs,
struct arm_smmu_invs *dec_invs);
?
Or you mean the allocation in arm_smmu_invs_del()? Yes, this drops
the kmalloc in the detach path.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH rfcv1 6/8] iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters
2025-08-14 1:25 ` [PATCH rfcv1 6/8] iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters Nicolin Chen
@ 2025-08-27 18:21 ` Jason Gunthorpe
0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2025-08-27 18:21 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, joro, jean-philippe, miko.lenczewski, balbirs,
peterz, smostafa, kevin.tian, praan, zhangzekun11,
linux-arm-kernel, iommu, linux-kernel, patches
On Wed, Aug 13, 2025 at 06:25:37PM -0700, Nicolin Chen wrote:
> +typedef struct arm_smmu_invs *(*invs_fn)(struct arm_smmu_invs *old_invs,
> + struct arm_smmu_invs *invs);
No reason to pass in fn; this always just calls it as the last thing,
so the caller can do it.
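i.e. roughly (a sketch only; the exact arm_smmu_build_invs() arguments
after dropping fn and old_invs are a guess):

	/* Fill the scratch array master->invs for this master/domain pair */
	arm_smmu_build_invs(new_smmu_domain, master, state->ats_enabled,
			    state->ssid);
	state->new_domain_ninvs =
		arm_smmu_invs_add(state->new_domain_oinvs, master->invs);
	if (IS_ERR(state->new_domain_ninvs))
		return PTR_ERR(state->new_domain_ninvs);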
> +static struct arm_smmu_invs *arm_smmu_build_invs(
> + struct arm_smmu_invs *old_invs, struct arm_smmu_domain *smmu_domain,
> + struct arm_smmu_master *master, bool ats, ioasid_t ssid, invs_fn fn)
> +{
> + const bool e2h = master->smmu->features & ARM_SMMU_FEAT_E2H;
> + const bool nesting = smmu_domain->nest_parent;
> + struct arm_smmu_inv *cur = master->invs->inv;
> + size_t num_invs = 1;
> + size_t i;
> +
> + switch (smmu_domain->stage) {
> + case ARM_SMMU_DOMAIN_SVA:
> + case ARM_SMMU_DOMAIN_S1:
> + cur->smmu = master->smmu;
> + cur->type = INV_TYPE_S1_ASID;
> + cur->id = smmu_domain->cd.asid;
> + cur->size_opcode = e2h ? CMDQ_OP_TLBI_EL2_VA :
> + CMDQ_OP_TLBI_NH_VA;
> + cur->nsize_opcode = e2h ? CMDQ_OP_TLBI_EL2_ASID :
> + CMDQ_OP_TLBI_NH_ASID;
> + break;
> + case ARM_SMMU_DOMAIN_S2:
> + cur->smmu = master->smmu;
> + cur->type = INV_TYPE_S2_VMID;
> + cur->id = smmu_domain->s2_cfg.vmid;
> + cur->size_opcode = CMDQ_OP_TLBI_S2_IPA;
> + cur->nsize_opcode = CMDQ_OP_TLBI_S12_VMALL;
> + break;
> + default:
> + WARN_ON(true);
> + return old_invs;
Return an ERR_PTR; returning something that shouldn't be freed makes the
error flows possibly wrong, or at least overly complex.
> + }
> +
> + /* Range-based invalidation requires the leaf pgsize for calculation */
> + if (master->smmu->features & ARM_SMMU_FEAT_RANGE_INV)
> + cur->pgsize = __ffs(smmu_domain->domain.pgsize_bitmap);
> +
> + /* All the nested S1 ASIDs have to be flushed when S2 parent changes */
> + if (nesting) {
> + cur = &master->invs->inv[num_invs++];
Don't do both 'cur as an iterator' and 'num_invs as the
location'. Delete num_invs entirely and just use cur.
> + cur->smmu = master->smmu;
> + cur->type = INV_TYPE_S2_VMID_S1_CLEAR;
> + cur->id = smmu_domain->s2_cfg.vmid;
> + cur->size_opcode = CMDQ_OP_TLBI_NH_ALL;
> + cur->nsize_opcode = CMDQ_OP_TLBI_NH_ALL;
> + }
> +
> + if (ats) {
> + for (i = 0, cur++; i < master->num_streams; i++) {
> + cur->smmu = master->smmu;
> + /*
> + * If an S2 used as a nesting parent is changed we have
> + * no option but to completely flush the ATC.
> + */
> + cur->type = nesting ? INV_TYPE_ATS_FULL : INV_TYPE_ATS;
> + cur->id = master->streams[i].id;
> + cur->ssid = ssid;
> + cur->size_opcode = CMDQ_OP_ATC_INV;
> + cur->nsize_opcode = CMDQ_OP_ATC_INV;
> + }
> + num_invs += master->num_streams;
> + }
> +
> + master->invs->num_invs = num_invs;
Like this:
master->invs->num_invs = cur - master->invs->inv;
> +static int arm_smmu_attach_prepare_invs(struct arm_smmu_attach_state *state,
> + struct arm_smmu_domain *new_smmu_domain)
> +{
How about a comment:
/*
* During attachment the invalidation lists on the two domains are sequenced:
* 1. old domain is invalidating master
* 2. new and old domain are invalidating master
* 3. new domain is invalidating master
*
* This uses two updated invalidation lists, one with master added to new domain
* and one with master removed from old domain. Prepare these lists in advance
* of changing anything. arm_smmu_asid_lock ensures that the invalidation list
* in the domains doesn't change while we are sequencing to update it.
*/
> + struct arm_smmu_domain *old_smmu_domain =
> + to_smmu_domain_devices(state->old_domain);
> + struct arm_smmu_master *master = state->master;
> + bool blocking = false;
> +
> + /* A re-attach case doesn't need to update invs array */
> + if (new_smmu_domain == old_smmu_domain)
> + return 0;
> +
> + if (new_smmu_domain) {
This if wants a comment, it is tricky:
/*
* At this point a NULL domain indicates the domain doesn't use the
* IOTLB, see to_smmu_domain_devices().
*/
> + state->new_domain_oinvs = rcu_dereference_protected(
> + new_smmu_domain->invs,
> + lockdep_is_held(&arm_smmu_asid_lock));
> + state->new_domain_ninvs = arm_smmu_build_invs(
> + state->new_domain_oinvs, new_smmu_domain, master,
> + state->ats_enabled, state->ssid, arm_smmu_invs_add);
> + if (IS_ERR(state->new_domain_ninvs))
> + return PTR_ERR(state->new_domain_ninvs);
> + state->new_domain_invs = &new_smmu_domain->invs;
> + blocking = new_smmu_domain->domain.type == IOMMU_DOMAIN_BLOCKED;
> + }
> +
> + if (old_smmu_domain) {
> + state->old_domain_oinvs = rcu_dereference_protected(
> + old_smmu_domain->invs,
> + lockdep_is_held(&arm_smmu_asid_lock));
> + state->old_domain_ninvs = arm_smmu_build_invs(
> + state->old_domain_oinvs, old_smmu_domain, master,
> + master->ats_enabled, state->ssid, arm_smmu_invs_del);
> + if (IS_ERR(state->old_domain_ninvs)) {
Then here, as per the last email, just get rid of invs_del and use
the scratch list master->invs for the next step. So all this goes away:
Jason
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH rfcv1 7/8] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range()
2025-08-14 1:25 ` [PATCH rfcv1 7/8] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range() Nicolin Chen
@ 2025-08-27 18:49 ` Jason Gunthorpe
0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2025-08-27 18:49 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, joro, jean-philippe, miko.lenczewski, balbirs,
peterz, smostafa, kevin.tian, praan, zhangzekun11,
linux-arm-kernel, iommu, linux-kernel, patches
On Wed, Aug 13, 2025 at 06:25:38PM -0700, Nicolin Chen wrote:
> +void arm_smmu_domain_inv_range(struct arm_smmu_domain *smmu_domain,
> + unsigned long iova, size_t size,
> + unsigned int granule, bool leaf)
> +{
> + struct arm_smmu_cmdq_batch cmds = {};
> + struct arm_smmu_invs *invs;
> + bool retried = false;
> + size_t i;
> +
> + /*
> + * An invalidation request must follow some IOPTE change and then load
> + * the invalidation array. In the meantime, a domain attachment mutates
> + * the array and then stores an STE/CD asking SMMU HW to acquire those
> + * changed IOPTEs. In other words, these two are interdependent and can
> + * race.
> + *
> + * In a race, the RCU design (with its underlying memory barriers)
> + * ensures that the invalidation array is always updated before it is
> + * loaded.
> + *
> + * smp_mb() is used here, paired with the smp_mb() following the array
> + * update in a concurrent attach, to ensure:
> + * - HW sees the new IOPTEs if it walks after STE installation
> + * - Invalidation thread sees the updated array with the new ASID.
> + *
> + * [CPU0] | [CPU1]
> + * |
> + * change IOPTEs and TLB flush: |
> + * arm_smmu_domain_inv_range() { | arm_smmu_install_new_domain_invs {
> + * ... | rcu_assign_pointer(new_invs);
> + * smp_mb(); // ensure IOPTEs | smp_mb(); // ensure new_invs
> + * ... | kfree_rcu(old_invs, rcu);
> + * // load invalidation array | }
> + * invs = rcu_dereference(); | arm_smmu_install_ste_for_dev {
> + * | STE = TTB0 // read new IOPTEs
> + */
> + smp_mb();
> +
> + rcu_read_lock();
> +again:
> + invs = rcu_dereference(smmu_domain->invs);
> +
> + /* A concurrent attachment might have changed the array. Do a respin */
> + if (unlikely(!read_trylock(&invs->rwlock)))
> + goto again;
> + /* Only one retry. Otherwise, it would soft lockup on an empty array */
> + if (!retried && unlikely(!invs->num_invs)) {
> + read_unlock(&invs->rwlock);
> + retried = true;
> + goto again;
> + }
This has missed the point; it was to not take the lock unless we have
ATS. Something like this:
rcu_read_lock();
while (true) {
invs = rcu_dereference(smmu_domain->invs);
/*
* Avoid locking unless ATS is being used. No ATS invalidate can
* be going on after a domain is detached.
*/
locked = false;
if (invs->has_ats || READ_ONCE(invs->old)) {
read_lock(&invs->rwlock);
if (invs->old) {
read_unlock(&invs->rwlock);
continue;
}
locked = true;
}
break;
}
num_invs = READ_ONCE(invs->num_invs);
for (i = 0; i != num_invs; i++) {
> + case INV_TYPE_S2_VMID_S1_CLEAR:
> + /* CMDQ_OP_TLBI_S12_VMALL already flushed S1 entries */
> + if (arm_smmu_inv_size_too_big(cur->smmu, size, granule))
> + continue;
> + /* Just a single CMDQ_OP_TLBI_NH_ALL, no batching */
> + cmd.tlbi.vmid = cur->id;
> + arm_smmu_cmdq_issue_cmd_with_sync(cur->smmu, &cmd);
This should just stick it in the batch, the batch was already flushed:
/* The batch for S2 TLBI must be done before nested S1 ASIDs */
if (next->type == INV_TYPE_S2_VMID_S1_CLEAR)
return true;
That needs a fixup too:
/* The batch for S2 TLBI must be done before nested S1 ASIDs */
if (cur->type != INV_TYPE_S2_VMID_S1_CLEAR &&
next->type == INV_TYPE_S2_VMID_S1_CLEAR)
return true;
It makes sense to bundle all the NH_ALL into one batch if there is more
than one.
But this arm_smmu_invs_end_batch() no longer works right since the
if (refcount_read(&cur->users) == 0)
continue;
was added.
Jason
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-08-14 1:25 ` [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
2025-08-26 19:50 ` Jason Gunthorpe
2025-08-27 16:48 ` Jason Gunthorpe
@ 2025-08-27 20:00 ` Jason Gunthorpe
2 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2025-08-27 20:00 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, joro, jean-philippe, miko.lenczewski, balbirs,
peterz, smostafa, kevin.tian, praan, zhangzekun11,
linux-arm-kernel, iommu, linux-kernel, patches
On Wed, Aug 13, 2025 at 06:25:35PM -0700, Nicolin Chen wrote:
> +struct arm_smmu_invs *arm_smmu_invs_add(struct arm_smmu_invs *old_invs,
> + struct arm_smmu_invs *add_invs)
> +{
It turns out it is fairly easy and cheap to sort add_invs by sorting
the ids during probe:
@@ -3983,6 +3989,14 @@ static int arm_smmu_init_sid_strtab(struct arm_smmu_device *smmu, u32 sid)
return 0;
}
+static int arm_smmu_ids_cmp(const void *_l, const void *_r)
+{
+ const typeof_member(struct iommu_fwspec, ids[0]) *l = _l;
+ const typeof_member(struct iommu_fwspec, ids[0]) *r = _r;
+
+ return cmp_int(*l, *r);
+}
+
static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
struct arm_smmu_master *master)
{
@@ -4011,6 +4025,13 @@ static int arm_smmu_insert_master(struct arm_smmu_device *smmu,
return PTR_ERR(master->invs);
}
+ /*
+ * Put the ids into order so that arm_smmu_build_invs() can trivially
+ * generate sorted lists.
+ */
+ sort_nonatomic(fwspec->ids, fwspec->num_ids, sizeof(fwspec->ids[0]),
+ arm_smmu_ids_cmp, NULL);
+
mutex_lock(&smmu->streams_mutex);
for (i = 0; i < fwspec->num_ids; i++) {
struct arm_smmu_stream *new_stream = &master->streams[i];
Then arm_smmu_build_invs() trivially makes sorted lists.
So if old_invs and add_invs are both sorted lists, we can use a
variation on a merge algorithm for sorted lists, which is both simpler
to understand and runs faster:
/*
* Compare used for merging two sorted lists. Merge compare of two sorted list
* items. If one side is past the end of the list then return the other side to
* let it run out the iteration.
*/
static inline int arm_smmu_invs_merge_cmp(const struct arm_smmu_invs *lhs,
size_t lhs_idx,
const struct arm_smmu_invs *rhs,
size_t rhs_idx)
{
if (lhs_idx != lhs->num_invs && rhs_idx != rhs->num_invs)
return arm_smmu_invs_cmp(&lhs->inv[lhs_idx],
&rhs->inv[rhs_idx]);
if (lhs_idx != lhs->num_invs)
return -1;
return 1;
}
struct arm_smmu_invs *arm_smmu_invs_add(struct arm_smmu_invs *invs,
struct arm_smmu_invs *add_invs)
{
struct arm_smmu_invs *new_invs;
struct arm_smmu_inv *new;
size_t to_add = 0;
size_t to_del = 0;
size_t i, j;
for (i = 0, j = 0; i != invs->num_invs || j != add_invs->num_invs;) {
int cmp = arm_smmu_invs_merge_cmp(invs, i, add_invs, j);
if (cmp < 0) {
/* not found in add_invs, leave alone */
if (refcount_read(&invs->inv[i].users))
i++;
else
to_del++;
} else if (cmp == 0) {
/* same item */
i++;
j++;
} else {
/* unique to add_invs */
to_add++;
j++;
}
}
new_invs = arm_smmu_invs_alloc(invs->num_invs + to_add - to_del);
if (IS_ERR(new_invs))
return new_invs;
new = new_invs->inv;
for (i = 0, j = 0; i != invs->num_invs || j != add_invs->num_invs;) {
int cmp = arm_smmu_invs_merge_cmp(invs, i, add_invs, j);
if (cmp <= 0 && !refcount_read(&invs->inv[i].users)) {
i++;
continue;
}
if (cmp < 0) {
*new = invs->inv[i];
i++;
} else if (cmp == 0) {
*new = invs->inv[i];
refcount_inc(&new->users);
i++;
j++;
} else {
*new = add_invs->inv[j];
refcount_set(&new->users, 1);
j++;
}
if (arm_smmu_inv_is_ats(new))
new_invs->has_ats = true;
new++;
}
WARN_ON(new != new_invs->inv + new_invs->num_invs);
/*
* A sorted array allows batching invalidations together for fewer SYNCs.
* Also, ATS must follow the ASID/VMID invalidation SYNC.
*/
sort_nonatomic(new_invs->inv, new_invs->num_invs,
sizeof(add_invs->inv[0]), arm_smmu_invs_cmp, NULL);
return new_invs;
}
size_t arm_smmu_invs_dec(struct arm_smmu_invs *invs,
struct arm_smmu_invs *dec_invs)
{
size_t to_del = 0;
size_t i, j;
for (i = 0, j = 0; i != invs->num_invs || j != dec_invs->num_invs;) {
int cmp = arm_smmu_invs_merge_cmp(invs, i, dec_invs, j);
if (cmp < 0) {
/* not found in dec_invs, leave alone */
i++;
} else if (cmp == 0) {
/* same item */
if (refcount_dec_and_test(&invs->inv[i].users)) {
dec_invs->inv[j].todel = true;
to_del++;
}
i++;
j++;
} else {
/* item in dec_invs is not in invs? */
WARN_ON(true);
j++;
}
}
return to_del;
}
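Then the detach side is roughly (a sketch, reusing the
arm_smmu_invs_cleanup() idea from the earlier mail; the exact call site
wiring is a guess):

	to_del = arm_smmu_invs_dec(old_invs, master->invs);
	if (to_del) {
		new_invs = arm_smmu_invs_cleanup(old_invs, to_del);
		if (new_invs) {
			rcu_assign_pointer(smmu_domain->invs, new_invs);
			kfree_rcu(old_invs, rcu);
		}
		/* on allocation failure leave the users == 0 entries behind */
	}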
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array
2025-08-27 17:19 ` Nicolin Chen
@ 2025-08-28 12:37 ` Jason Gunthorpe
0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2025-08-28 12:37 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, robin.murphy, joro, jean-philippe, miko.lenczewski, balbirs,
peterz, smostafa, kevin.tian, praan, zhangzekun11,
linux-arm-kernel, iommu, linux-kernel, patches
On Wed, Aug 27, 2025 at 10:19:18AM -0700, Nicolin Chen wrote:
> > Just always do decr, then have a simple function to compact the list
> > after the decr:
>
> But the _dec function will always take the write lock, which seems
> to lose the benefit of using an RCU array?
The lock isn't needed; all it does is a refcount decrement, which is atomic.
And all the lists are locked by the asid lock on the write side
anyhow.
> > If this returns NULL then just leave the list alone, it is OK to sit
> > there with the 0 users left behind.
>
> Yea, it's better than waiting for the next _add function to compact
> the list.
I was thinking you could call the add function with an empty list
instead of adding another function..
> > No need for the complex _del function and the _decr function..
> >
> > This also means the memory doesn't need to be preallocated and it
> > significantly simplifies alot of the logic.
>
> By "preallocated" you mean "master->scratch_invs"?
No, I mean the second list of invalidations, i.e. the old domain's new list.
Jason
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2025-08-28 17:25 UTC | newest]
Thread overview: 18+ messages
2025-08-14 1:25 [PATCH rfcv1 0/8] iommu/arm-smmu-v3: Introduce an RCU-protected invalidation array Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 1/8] iommu/arm-smmu-v3: Clear cmds->num after arm_smmu_cmdq_batch_submit Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 2/8] iommu/arm-smmu-v3: Explicitly set smmu_domain->stage for SVA Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 3/8] iommu/arm-smmu-v3: Add an inline arm_smmu_domain_free() Nicolin Chen
2025-08-14 1:25 ` [PATCH rfcv1 4/8] iommu/arm-smmu-v3: Introduce a per-domain arm_smmu_invs array Nicolin Chen
2025-08-26 19:50 ` Jason Gunthorpe
2025-08-27 0:49 ` Nicolin Chen
2025-08-27 16:48 ` Jason Gunthorpe
2025-08-27 17:19 ` Nicolin Chen
2025-08-28 12:37 ` Jason Gunthorpe
2025-08-27 20:00 ` Jason Gunthorpe
2025-08-14 1:25 ` [PATCH rfcv1 5/8] iommu/arm-smmu-v3: Pre-allocate a per-master invalidation array Nicolin Chen
2025-08-26 19:56 ` Jason Gunthorpe
2025-08-14 1:25 ` [PATCH rfcv1 6/8] iommu/arm-smmu-v3: Populate smmu_domain->invs when attaching masters Nicolin Chen
2025-08-27 18:21 ` Jason Gunthorpe
2025-08-14 1:25 ` [PATCH rfcv1 7/8] iommu/arm-smmu-v3: Add arm_smmu_invs based arm_smmu_domain_inv_range() Nicolin Chen
2025-08-27 18:49 ` Jason Gunthorpe
2025-08-14 1:25 ` [PATCH rfcv1 8/8] iommu/arm-smmu-v3: Perform per-domain invalidations using arm_smmu_invs Nicolin Chen