* [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17
@ 2025-07-14  4:50 Lu Baolu
  2025-07-14  4:50 ` [PATCH 01/11] iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook Lu Baolu
                   ` (11 more replies)
  0 siblings, 12 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

Hi Joerg,

The following changes have been queued for v6.17-rc1. They include new
features and code refactoring:

 - Reorganize Intel VT-d to be ready for iommupt
 - Optimize iotlb_sync_map for non-caching/non-RWBF modes
 - Fix missed PASID in dev TLB invalidation in cache_tag_flush_all()
 - Miscellaneous cleanups

These patches are based on v6.16-rc6. Please consider them for the
iommu/vt-d branch.

Best regards,
baolu

Ethan Milon (2):
  iommu/vt-d: Fix missing PASID in dev TLB flush with
    cache_tag_flush_all
  iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range

Jason Gunthorpe (7):
  iommu/vt-d: Lift the __pa to
    domain_setup_first_level/intel_svm_set_dev_pasid()
  iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free()
  iommu/vt-d: Do not wipe out the page table NID when devices detach
  iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
  iommu/vt-d: Create unique domain ops for each stage
  iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
  iommu/vt-d: Split paging_domain_compatible()

Lu Baolu (1):
  iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes

Vineeth Pillai (Google) (1):
  iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook

 drivers/iommu/intel/cache.c  |  55 ++----
 drivers/iommu/intel/dmar.c   |   3 -
 drivers/iommu/intel/iommu.c  | 336 ++++++++++++++++++++++-------------
 drivers/iommu/intel/iommu.h  |  22 ++-
 drivers/iommu/intel/nested.c |   4 +-
 drivers/iommu/intel/pasid.c  |  17 +-
 drivers/iommu/intel/pasid.h  |  11 +-
 drivers/iommu/intel/svm.c    |   3 +-
 drivers/iommu/intel/trace.h  |   5 -
 9 files changed, 259 insertions(+), 197 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 01/11] iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14  4:50 ` [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes Lu Baolu
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: "Vineeth Pillai (Google)" <vineeth@bitbyteword.org>

The iommu init hook is wrapped in CONFIG_X86 and is a remnant of the time
when dmar.c was common code living in "drivers/pci/dmar.c". The wrapping was
added in commit 9d5ce73a64be2 ("x86: intel-iommu: Convert detect_intel_iommu
to use iommu_init hook").

Now that this code is built only for x86, the config wrap can be removed.

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250616131740.3499289-1-vineeth@bitbyteword.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/dmar.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c
index b61d9ea27aa9..ec975c73cfe6 100644
--- a/drivers/iommu/intel/dmar.c
+++ b/drivers/iommu/intel/dmar.c
@@ -935,14 +935,11 @@ void __init detect_intel_iommu(void)
 		pci_request_acs();
 	}
 
-#ifdef CONFIG_X86
 	if (!ret) {
 		x86_init.iommu.iommu_init = intel_iommu_init;
 		x86_platform.iommu_shutdown = intel_iommu_shutdown;
 	}
 
-#endif
-
 	if (dmar_tbl) {
 		acpi_put_table(dmar_tbl);
 		dmar_tbl = NULL;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
  2025-07-14  4:50 ` [PATCH 01/11] iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-16 14:12   ` Jason Gunthorpe
  2025-07-14  4:50 ` [PATCH 03/11] iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid() Lu Baolu
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

The iotlb_sync_map iommu op allows drivers to perform necessary cache
flushes when new mappings are established. For the Intel iommu driver,
this callback serves two purposes:

- To flush caches when a second-stage page table is attached to a device
  whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-
  resident remapping structures are visible to hardware (CAP_REG.RWBF==1).

However, in scenarios where neither caching mode nor the RWBF flag is
active, the cache_tag_flush_range_np() helper, which is called in the
iotlb_sync_map path, effectively becomes a no-op.

Despite being a no-op, cache_tag_flush_range_np() still iterates through
all cache tags of the IOMMUs attached to the domain, under a spinlock. This
unnecessary execution path introduces overhead, leading to a measurable I/O
performance regression. On systems with NVMe devices under the same bridge,
performance was observed to drop from ~6150 MiB/s to ~4985 MiB/s.

Introduce a flag in the dmar_domain structure that is set only when
iotlb_sync_map is required (i.e., when CM or RWBF is set), and call
cache_tag_flush_range_np() only for domains where this flag is set. Once
set, the flag is immutable, given that real-world systems will not mix
configurations where some IOMMUs operate in caching mode while others do
not. Theoretically, the immutability of this flag does not impact
functionality.
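
A condensed sketch of the intent, taken from the hunks below (the helper and
the flag are the ones this patch adds):

	/* Decide, at attach time, whether iotlb_sync_map can ever matter. */
	static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
					       struct intel_iommu *iommu)
	{
		/* Caching mode only affects second-stage page tables. */
		if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
			return true;

		/* Write buffer flushing required by hardware or by quirk. */
		if (rwbf_quirk || cap_rwbf(iommu->cap))
			return true;

		return false;
	}

	/* iotlb_sync_map then becomes a cheap test in the common case. */
	if (dmar_domain->iotlb_sync_map)
		cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);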

Reported-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738
Link: https://lore.kernel.org/r/20250701171154.52435-1-ioanna-maria.alifieraki@canonical.com
Fixes: 129dab6e1286 ("iommu/vt-d: Use cache_tag_flush_range_np() in iotlb_sync_map")
Cc: stable@vger.kernel.org
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Link: https://lore.kernel.org/r/20250703031545.3378602-1-baolu.lu@linux.intel.com
---
 drivers/iommu/intel/iommu.c | 19 ++++++++++++++++++-
 drivers/iommu/intel/iommu.h |  3 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 148b944143b8..b23efb70b52c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1796,6 +1796,18 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 					  (pgd_t *)pgd, flags, old);
 }
 
+static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
+				       struct intel_iommu *iommu)
+{
+	if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
+		return true;
+
+	if (rwbf_quirk || cap_rwbf(iommu->cap))
+		return true;
+
+	return false;
+}
+
 static int dmar_domain_attach_device(struct dmar_domain *domain,
 				     struct device *dev)
 {
@@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 	if (ret)
 		goto out_block_translation;
 
+	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
+
 	return 0;
 
 out_block_translation:
@@ -3954,7 +3968,10 @@ static bool risky_device(struct pci_dev *pdev)
 static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
 				      unsigned long iova, size_t size)
 {
-	cache_tag_flush_range_np(to_dmar_domain(domain), iova, iova + size - 1);
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+
+	if (dmar_domain->iotlb_sync_map)
+		cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
 
 	return 0;
 }
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 2d1afab5eedc..61f42802fe9e 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -614,6 +614,9 @@ struct dmar_domain {
 	u8 has_mappings:1;		/* Has mappings configured through
 					 * iommu_map() interface.
 					 */
+	u8 iotlb_sync_map:1;		/* Need to flush IOTLB cache or write
+					 * buffer when creating mappings.
+					 */
 
 	spinlock_t lock;		/* Protect device tracking lists */
 	struct list_head devices;	/* all devices' list */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 03/11] iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid()
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
  2025-07-14  4:50 ` [PATCH 01/11] iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook Lu Baolu
  2025-07-14  4:50 ` [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14  4:50 ` [PATCH 04/11] iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free() Lu Baolu
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: Jason Gunthorpe <jgg@nvidia.com>

Pass the phys_addr_t down through the call chain from the top instead of
passing a pgd_t * KVA. This moves the __pa() into
domain_setup_first_level(), which is the first function to obtain the pgd
from the IOMMU page table in this call chain.

The SVA flow is also adjusted to get the physical address of mm->pgd.

iommupt will move the __pa() into the iommupt code; it never shares the KVA
of the page table with the driver.
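
The effect at the SVA call site, condensed from the diff below (before vs.
after):

	/* Before: the KVA of the pgd was passed down. */
	ret = __domain_setup_first_level(iommu, dev, pasid,
					 FLPT_DEFAULT_DID, mm->pgd,
					 sflags, old);

	/* After: the physical address is computed at the top of the chain. */
	ret = __domain_setup_first_level(iommu, dev, pasid,
					 FLPT_DEFAULT_DID, __pa(mm->pgd),
					 sflags, old);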

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 15 +++++++--------
 drivers/iommu/intel/iommu.h |  7 +++----
 drivers/iommu/intel/pasid.c | 17 +++++++++--------
 drivers/iommu/intel/pasid.h | 11 +++++------
 drivers/iommu/intel/svm.c   |  2 +-
 5 files changed, 25 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index b23efb70b52c..7c2e5e682a41 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1736,15 +1736,14 @@ static void domain_context_clear_one(struct device_domain_info *info, u8 bus, u8
 	intel_context_flush_no_pasid(info, context, did);
 }
 
-int __domain_setup_first_level(struct intel_iommu *iommu,
-			       struct device *dev, ioasid_t pasid,
-			       u16 did, pgd_t *pgd, int flags,
-			       struct iommu_domain *old)
+int __domain_setup_first_level(struct intel_iommu *iommu, struct device *dev,
+			       ioasid_t pasid, u16 did, phys_addr_t fsptptr,
+			       int flags, struct iommu_domain *old)
 {
 	if (!old)
-		return intel_pasid_setup_first_level(iommu, dev, pgd,
-						     pasid, did, flags);
-	return intel_pasid_replace_first_level(iommu, dev, pgd, pasid, did,
+		return intel_pasid_setup_first_level(iommu, dev, fsptptr, pasid,
+						     did, flags);
+	return intel_pasid_replace_first_level(iommu, dev, fsptptr, pasid, did,
 					       iommu_domain_did(old, iommu),
 					       flags);
 }
@@ -1793,7 +1792,7 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 
 	return __domain_setup_first_level(iommu, dev, pasid,
 					  domain_id_iommu(domain, iommu),
-					  (pgd_t *)pgd, flags, old);
+					  __pa(pgd), flags, old);
 }
 
 static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 61f42802fe9e..50d69cc88a1f 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1255,10 +1255,9 @@ domain_add_dev_pasid(struct iommu_domain *domain,
 void domain_remove_dev_pasid(struct iommu_domain *domain,
 			     struct device *dev, ioasid_t pasid);
 
-int __domain_setup_first_level(struct intel_iommu *iommu,
-			       struct device *dev, ioasid_t pasid,
-			       u16 did, pgd_t *pgd, int flags,
-			       struct iommu_domain *old);
+int __domain_setup_first_level(struct intel_iommu *iommu, struct device *dev,
+			       ioasid_t pasid, u16 did, phys_addr_t fsptptr,
+			       int flags, struct iommu_domain *old);
 
 int dmar_ir_support(void);
 
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index ac67a056b6c8..52f678975da7 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -348,14 +348,15 @@ static void intel_pasid_flush_present(struct intel_iommu *iommu,
  */
 static void pasid_pte_config_first_level(struct intel_iommu *iommu,
 					 struct pasid_entry *pte,
-					 pgd_t *pgd, u16 did, int flags)
+					 phys_addr_t fsptptr, u16 did,
+					 int flags)
 {
 	lockdep_assert_held(&iommu->lock);
 
 	pasid_clear_entry(pte);
 
 	/* Setup the first level page table pointer: */
-	pasid_set_flptr(pte, (u64)__pa(pgd));
+	pasid_set_flptr(pte, fsptptr);
 
 	if (flags & PASID_FLAG_FL5LP)
 		pasid_set_flpm(pte, 1);
@@ -372,9 +373,9 @@ static void pasid_pte_config_first_level(struct intel_iommu *iommu,
 	pasid_set_present(pte);
 }
 
-int intel_pasid_setup_first_level(struct intel_iommu *iommu,
-				  struct device *dev, pgd_t *pgd,
-				  u32 pasid, u16 did, int flags)
+int intel_pasid_setup_first_level(struct intel_iommu *iommu, struct device *dev,
+				  phys_addr_t fsptptr, u32 pasid, u16 did,
+				  int flags)
 {
 	struct pasid_entry *pte;
 
@@ -402,7 +403,7 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 		return -EBUSY;
 	}
 
-	pasid_pte_config_first_level(iommu, pte, pgd, did, flags);
+	pasid_pte_config_first_level(iommu, pte, fsptptr, did, flags);
 
 	spin_unlock(&iommu->lock);
 
@@ -412,7 +413,7 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
 }
 
 int intel_pasid_replace_first_level(struct intel_iommu *iommu,
-				    struct device *dev, pgd_t *pgd,
+				    struct device *dev, phys_addr_t fsptptr,
 				    u32 pasid, u16 did, u16 old_did,
 				    int flags)
 {
@@ -430,7 +431,7 @@ int intel_pasid_replace_first_level(struct intel_iommu *iommu,
 		return -EINVAL;
 	}
 
-	pasid_pte_config_first_level(iommu, &new_pte, pgd, did, flags);
+	pasid_pte_config_first_level(iommu, &new_pte, fsptptr, did, flags);
 
 	spin_lock(&iommu->lock);
 	pte = intel_pasid_get_entry(dev, pasid);
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index fd0fd1a0df84..a771a77d4239 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -288,9 +288,9 @@ extern unsigned int intel_pasid_max_id;
 int intel_pasid_alloc_table(struct device *dev);
 void intel_pasid_free_table(struct device *dev);
 struct pasid_table *intel_pasid_get_table(struct device *dev);
-int intel_pasid_setup_first_level(struct intel_iommu *iommu,
-				  struct device *dev, pgd_t *pgd,
-				  u32 pasid, u16 did, int flags);
+int intel_pasid_setup_first_level(struct intel_iommu *iommu, struct device *dev,
+				  phys_addr_t fsptptr, u32 pasid, u16 did,
+				  int flags);
 int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 				   struct dmar_domain *domain,
 				   struct device *dev, u32 pasid);
@@ -302,9 +302,8 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 int intel_pasid_setup_nested(struct intel_iommu *iommu, struct device *dev,
 			     u32 pasid, struct dmar_domain *domain);
 int intel_pasid_replace_first_level(struct intel_iommu *iommu,
-				    struct device *dev, pgd_t *pgd,
-				    u32 pasid, u16 did, u16 old_did,
-				    int flags);
+				    struct device *dev, phys_addr_t fsptptr,
+				    u32 pasid, u16 did, u16 old_did, int flags);
 int intel_pasid_replace_second_level(struct intel_iommu *iommu,
 				     struct dmar_domain *domain,
 				     struct device *dev, u16 old_did,
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index f3da596410b5..8c0bed36c587 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -171,7 +171,7 @@ static int intel_svm_set_dev_pasid(struct iommu_domain *domain,
 	/* Setup the pasid table: */
 	sflags = cpu_feature_enabled(X86_FEATURE_LA57) ? PASID_FLAG_FL5LP : 0;
 	ret = __domain_setup_first_level(iommu, dev, pasid,
-					 FLPT_DEFAULT_DID, mm->pgd,
+					 FLPT_DEFAULT_DID, __pa(mm->pgd),
 					 sflags, old);
 	if (ret)
 		goto out_unwind_iopf;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 04/11] iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free()
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
                   ` (2 preceding siblings ...)
  2025-07-14  4:50 ` [PATCH 03/11] iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid() Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14  4:50 ` [PATCH 05/11] iommu/vt-d: Do not wipe out the page table NID when devices detach Lu Baolu
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: Jason Gunthorpe <jgg@nvidia.com>

It has only one caller; there is no need for two functions.

Correct the WARN_ON() error handling to leak the entire page table if the
HW is still referencing it, so we don't UAF during WARN_ON recovery.

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 38 ++++++++++++++++++-------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7c2e5e682a41..5c798f30dbc4 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1396,23 +1396,6 @@ void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
 	}
 }
 
-static void domain_exit(struct dmar_domain *domain)
-{
-	if (domain->pgd) {
-		struct iommu_pages_list freelist =
-			IOMMU_PAGES_LIST_INIT(freelist);
-
-		domain_unmap(domain, 0, DOMAIN_MAX_PFN(domain->gaw), &freelist);
-		iommu_put_pages_list(&freelist);
-	}
-
-	if (WARN_ON(!list_empty(&domain->devices)))
-		return;
-
-	kfree(domain->qi_batch);
-	kfree(domain);
-}
-
 /*
  * For kdump cases, old valid entries may be cached due to the
  * in-flight DMA and copied pgtable, but there is no unmapping
@@ -3406,9 +3389,24 @@ static void intel_iommu_domain_free(struct iommu_domain *domain)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 
-	WARN_ON(dmar_domain->nested_parent &&
-		!list_empty(&dmar_domain->s1_domains));
-	domain_exit(dmar_domain);
+	if (WARN_ON(dmar_domain->nested_parent &&
+		    !list_empty(&dmar_domain->s1_domains)))
+		return;
+
+	if (WARN_ON(!list_empty(&dmar_domain->devices)))
+		return;
+
+	if (dmar_domain->pgd) {
+		struct iommu_pages_list freelist =
+			IOMMU_PAGES_LIST_INIT(freelist);
+
+		domain_unmap(dmar_domain, 0, DOMAIN_MAX_PFN(dmar_domain->gaw),
+			     &freelist);
+		iommu_put_pages_list(&freelist);
+	}
+
+	kfree(dmar_domain->qi_batch);
+	kfree(dmar_domain);
 }
 
 int paging_domain_compatible(struct iommu_domain *domain, struct device *dev)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 05/11] iommu/vt-d: Do not wipe out the page table NID when devices detach
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
                   ` (3 preceding siblings ...)
  2025-07-14  4:50 ` [PATCH 04/11] iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free() Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14  4:50 ` [PATCH 06/11] iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags() Lu Baolu
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: Jason Gunthorpe <jgg@nvidia.com>

The NID is used to control which NUMA node the memory for the page table
is allocated from. It should be a permanent property set when the page
table is allocated and should not change during attach/detach of devices.

Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 5c798f30dbc4..55e0ba4d20ae 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1391,7 +1391,6 @@ void domain_detach_iommu(struct dmar_domain *domain, struct intel_iommu *iommu)
 	if (--info->refcnt == 0) {
 		ida_free(&iommu->domain_ida, info->did);
 		xa_erase(&domain->iommu_array, iommu->seq_id);
-		domain->nid = NUMA_NO_NODE;
 		kfree(info);
 	}
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 06/11] iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
                   ` (4 preceding siblings ...)
  2025-07-14  4:50 ` [PATCH 05/11] iommu/vt-d: Do not wipe out the page table NID when devices detach Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14  4:50 ` [PATCH 07/11] iommu/vt-d: Create unique domain ops for each stage Lu Baolu
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: Jason Gunthorpe <jgg@nvidia.com>

Create stage-specific functions that check the stage-specific conditions
to decide whether each stage can be supported.

Have intel_iommu_domain_alloc_paging_flags() try both stages in sequence,
stopping at the first that does not return EOPNOTSUPP, and prefer the first
stage when it is available and suitable for the requested flags.

Move second-stage-only operations like nested_parent and dirty_tracking
into the second-stage function for clarity.

Move initialization of the iommu_domain members into paging_domain_alloc().

Drop initialization of domain->owner as the callers all do it.
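
A condensed sketch of the resulting selection policy (matching the new
intel_iommu_domain_alloc_paging_flags() in the diff below); only
-EOPNOTSUPP falls through to the second stage, any other error is returned
as is:

	/* Prefer first stage if possible by default. */
	domain = intel_iommu_domain_alloc_first_stage(dev, iommu, flags);
	if (domain != ERR_PTR(-EOPNOTSUPP))
		return domain;	/* success, or a hard error */
	return intel_iommu_domain_alloc_second_stage(dev, iommu, flags);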

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 98 +++++++++++++++++++++----------------
 1 file changed, 57 insertions(+), 41 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 55e0ba4d20ae..0ac3c3a6d9e7 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3281,10 +3281,15 @@ static struct dmar_domain *paging_domain_alloc(struct device *dev, bool first_st
 	spin_lock_init(&domain->lock);
 	spin_lock_init(&domain->cache_lock);
 	xa_init(&domain->iommu_array);
+	INIT_LIST_HEAD(&domain->s1_domains);
+	spin_lock_init(&domain->s1_lock);
 
 	domain->nid = dev_to_node(dev);
 	domain->use_first_level = first_stage;
 
+	domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
+	domain->domain.ops = intel_iommu_ops.default_domain_ops;
+
 	/* calculate the address width */
 	addr_width = agaw_to_width(iommu->agaw);
 	if (addr_width > cap_mgaw(iommu->cap))
@@ -3326,62 +3331,73 @@ static struct dmar_domain *paging_domain_alloc(struct device *dev, bool first_st
 }
 
 static struct iommu_domain *
-intel_iommu_domain_alloc_paging_flags(struct device *dev, u32 flags,
-				      const struct iommu_user_data *user_data)
+intel_iommu_domain_alloc_first_stage(struct device *dev,
+				     struct intel_iommu *iommu, u32 flags)
+{
+	struct dmar_domain *dmar_domain;
+
+	if (flags & ~IOMMU_HWPT_ALLOC_PASID)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	/* Only SL is available in legacy mode */
+	if (!sm_supported(iommu) || !ecap_flts(iommu->ecap))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	dmar_domain = paging_domain_alloc(dev, true);
+	if (IS_ERR(dmar_domain))
+		return ERR_CAST(dmar_domain);
+	return &dmar_domain->domain;
+}
+
+static struct iommu_domain *
+intel_iommu_domain_alloc_second_stage(struct device *dev,
+				      struct intel_iommu *iommu, u32 flags)
 {
-	struct device_domain_info *info = dev_iommu_priv_get(dev);
-	bool dirty_tracking = flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
-	bool nested_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;
-	struct intel_iommu *iommu = info->iommu;
 	struct dmar_domain *dmar_domain;
-	struct iommu_domain *domain;
-	bool first_stage;
 
 	if (flags &
 	    (~(IOMMU_HWPT_ALLOC_NEST_PARENT | IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
 	       IOMMU_HWPT_ALLOC_PASID)))
 		return ERR_PTR(-EOPNOTSUPP);
-	if (nested_parent && !nested_supported(iommu))
-		return ERR_PTR(-EOPNOTSUPP);
-	if (user_data || (dirty_tracking && !ssads_supported(iommu)))
+
+	if (((flags & IOMMU_HWPT_ALLOC_NEST_PARENT) &&
+	     !nested_supported(iommu)) ||
+	    ((flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING) &&
+	     !ssads_supported(iommu)))
 		return ERR_PTR(-EOPNOTSUPP);
 
-	/*
-	 * Always allocate the guest compatible page table unless
-	 * IOMMU_HWPT_ALLOC_NEST_PARENT or IOMMU_HWPT_ALLOC_DIRTY_TRACKING
-	 * is specified.
-	 */
-	if (nested_parent || dirty_tracking) {
-		if (!sm_supported(iommu) || !ecap_slts(iommu->ecap))
-			return ERR_PTR(-EOPNOTSUPP);
-		first_stage = false;
-	} else {
-		first_stage = first_level_by_default(iommu);
-	}
+	/* Legacy mode always supports second stage */
+	if (sm_supported(iommu) && !ecap_slts(iommu->ecap))
+		return ERR_PTR(-EOPNOTSUPP);
 
-	dmar_domain = paging_domain_alloc(dev, first_stage);
+	dmar_domain = paging_domain_alloc(dev, false);
 	if (IS_ERR(dmar_domain))
 		return ERR_CAST(dmar_domain);
-	domain = &dmar_domain->domain;
-	domain->type = IOMMU_DOMAIN_UNMANAGED;
-	domain->owner = &intel_iommu_ops;
-	domain->ops = intel_iommu_ops.default_domain_ops;
 
-	if (nested_parent) {
-		dmar_domain->nested_parent = true;
-		INIT_LIST_HEAD(&dmar_domain->s1_domains);
-		spin_lock_init(&dmar_domain->s1_lock);
-	}
+	dmar_domain->nested_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;
 
-	if (dirty_tracking) {
-		if (dmar_domain->use_first_level) {
-			iommu_domain_free(domain);
-			return ERR_PTR(-EOPNOTSUPP);
-		}
-		domain->dirty_ops = &intel_dirty_ops;
-	}
+	if (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING)
+		dmar_domain->domain.dirty_ops = &intel_dirty_ops;
 
-	return domain;
+	return &dmar_domain->domain;
+}
+
+static struct iommu_domain *
+intel_iommu_domain_alloc_paging_flags(struct device *dev, u32 flags,
+				      const struct iommu_user_data *user_data)
+{
+	struct device_domain_info *info = dev_iommu_priv_get(dev);
+	struct intel_iommu *iommu = info->iommu;
+	struct iommu_domain *domain;
+
+	if (user_data)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	/* Prefer first stage if possible by default. */
+	domain = intel_iommu_domain_alloc_first_stage(dev, iommu, flags);
+	if (domain != ERR_PTR(-EOPNOTSUPP))
+		return domain;
+	return intel_iommu_domain_alloc_second_stage(dev, iommu, flags);
 }
 
 static void intel_iommu_domain_free(struct iommu_domain *domain)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 07/11] iommu/vt-d: Create unique domain ops for each stage
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
                   ` (5 preceding siblings ...)
  2025-07-14  4:50 ` [PATCH 06/11] iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags() Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14  4:50 ` [PATCH 08/11] iommu/vt-d: Split intel_iommu_enforce_cache_coherency() Lu Baolu
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: Jason Gunthorpe <jgg@nvidia.com>

Use the domain ops pointer to tell what kind of domain it is instead of
the internal use_first_level indication. This also protects against
wrongly using an SVA/nested/IDENTITY/BLOCKED domain type in places where it
should not be used.

The only remaining uses of use_first_level outside the paging domain are in
paging_domain_compatible() and intel_iommu_enforce_cache_coherency().

Thus, remove the useless sets of use_first_level in
intel_svm_domain_alloc() and intel_iommu_domain_alloc_nested(). None of
the unique ops for these domain types ever reference it on their call
chains.

Add a WARN_ON() check in domain_context_mapping_one() as it only works
with second stage.

This is preparation for iommupt which will have different ops for each of
the stages.
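
A minimal sketch of the new idiom, condensed from the hunks below; the
stage is identified by the installed domain ops rather than by
use_first_level:

	static inline bool intel_domain_is_fs_paging(struct dmar_domain *domain)
	{
		return domain->domain.ops == &intel_fs_paging_domain_ops;
	}

	/* Callers branch on the ops-derived type and WARN on anything else. */
	if (intel_domain_is_fs_paging(dmar_domain))
		ret = domain_setup_first_level(iommu, dmar_domain, dev, pasid, old);
	else if (intel_domain_is_ss_paging(dmar_domain))
		ret = domain_setup_second_level(iommu, dmar_domain, dev, pasid, old);
	else if (WARN_ON(true))
		ret = -EINVAL;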

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/cache.c  |  5 +--
 drivers/iommu/intel/iommu.c  | 60 +++++++++++++++++++++++++-----------
 drivers/iommu/intel/iommu.h  | 12 ++++++++
 drivers/iommu/intel/nested.c |  4 +--
 drivers/iommu/intel/svm.c    |  1 -
 5 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/drivers/iommu/intel/cache.c b/drivers/iommu/intel/cache.c
index 47692cbfaabd..876630e10849 100644
--- a/drivers/iommu/intel/cache.c
+++ b/drivers/iommu/intel/cache.c
@@ -370,7 +370,7 @@ static void cache_tag_flush_iotlb(struct dmar_domain *domain, struct cache_tag *
 	struct intel_iommu *iommu = tag->iommu;
 	u64 type = DMA_TLB_PSI_FLUSH;
 
-	if (domain->use_first_level) {
+	if (intel_domain_is_fs_paging(domain)) {
 		qi_batch_add_piotlb(iommu, tag->domain_id, tag->pasid, addr,
 				    pages, ih, domain->qi_batch);
 		return;
@@ -545,7 +545,8 @@ void cache_tag_flush_range_np(struct dmar_domain *domain, unsigned long start,
 			qi_batch_flush_descs(iommu, domain->qi_batch);
 		iommu = tag->iommu;
 
-		if (!cap_caching_mode(iommu->cap) || domain->use_first_level) {
+		if (!cap_caching_mode(iommu->cap) ||
+		    intel_domain_is_fs_paging(domain)) {
 			iommu_flush_write_buffer(iommu);
 			continue;
 		}
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 0ac3c3a6d9e7..b7b1a3d2cbfc 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1462,6 +1462,9 @@ static int domain_context_mapping_one(struct dmar_domain *domain,
 	struct context_entry *context;
 	int ret;
 
+	if (WARN_ON(!intel_domain_is_ss_paging(domain)))
+		return -EINVAL;
+
 	pr_debug("Set context mapping for %02x:%02x.%d\n",
 		bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
 
@@ -1780,7 +1783,7 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
 				       struct intel_iommu *iommu)
 {
-	if (cap_caching_mode(iommu->cap) && !domain->use_first_level)
+	if (cap_caching_mode(iommu->cap) && intel_domain_is_ss_paging(domain))
 		return true;
 
 	if (rwbf_quirk || cap_rwbf(iommu->cap))
@@ -1812,12 +1815,14 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 
 	if (!sm_supported(iommu))
 		ret = domain_context_mapping(domain, dev);
-	else if (domain->use_first_level)
+	else if (intel_domain_is_fs_paging(domain))
 		ret = domain_setup_first_level(iommu, domain, dev,
 					       IOMMU_NO_PASID, NULL);
-	else
+	else if (intel_domain_is_ss_paging(domain))
 		ret = domain_setup_second_level(iommu, domain, dev,
 						IOMMU_NO_PASID, NULL);
+	else if (WARN_ON(true))
+		ret = -EINVAL;
 
 	if (ret)
 		goto out_block_translation;
@@ -3288,7 +3293,6 @@ static struct dmar_domain *paging_domain_alloc(struct device *dev, bool first_st
 	domain->use_first_level = first_stage;
 
 	domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
-	domain->domain.ops = intel_iommu_ops.default_domain_ops;
 
 	/* calculate the address width */
 	addr_width = agaw_to_width(iommu->agaw);
@@ -3346,6 +3350,8 @@ intel_iommu_domain_alloc_first_stage(struct device *dev,
 	dmar_domain = paging_domain_alloc(dev, true);
 	if (IS_ERR(dmar_domain))
 		return ERR_CAST(dmar_domain);
+
+	dmar_domain->domain.ops = &intel_fs_paging_domain_ops;
 	return &dmar_domain->domain;
 }
 
@@ -3374,6 +3380,7 @@ intel_iommu_domain_alloc_second_stage(struct device *dev,
 	if (IS_ERR(dmar_domain))
 		return ERR_CAST(dmar_domain);
 
+	dmar_domain->domain.ops = &intel_ss_paging_domain_ops;
 	dmar_domain->nested_parent = flags & IOMMU_HWPT_ALLOC_NEST_PARENT;
 
 	if (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING)
@@ -4107,12 +4114,15 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
 	if (ret)
 		goto out_remove_dev_pasid;
 
-	if (dmar_domain->use_first_level)
+	if (intel_domain_is_fs_paging(dmar_domain))
 		ret = domain_setup_first_level(iommu, dmar_domain,
 					       dev, pasid, old);
-	else
+	else if (intel_domain_is_ss_paging(dmar_domain))
 		ret = domain_setup_second_level(iommu, dmar_domain,
 						dev, pasid, old);
+	else if (WARN_ON(true))
+		ret = -EINVAL;
+
 	if (ret)
 		goto out_unwind_iopf;
 
@@ -4387,6 +4397,32 @@ static struct iommu_domain identity_domain = {
 	},
 };
 
+const struct iommu_domain_ops intel_fs_paging_domain_ops = {
+	.attach_dev = intel_iommu_attach_device,
+	.set_dev_pasid = intel_iommu_set_dev_pasid,
+	.map_pages = intel_iommu_map_pages,
+	.unmap_pages = intel_iommu_unmap_pages,
+	.iotlb_sync_map = intel_iommu_iotlb_sync_map,
+	.flush_iotlb_all = intel_flush_iotlb_all,
+	.iotlb_sync = intel_iommu_tlb_sync,
+	.iova_to_phys = intel_iommu_iova_to_phys,
+	.free = intel_iommu_domain_free,
+	.enforce_cache_coherency = intel_iommu_enforce_cache_coherency,
+};
+
+const struct iommu_domain_ops intel_ss_paging_domain_ops = {
+	.attach_dev = intel_iommu_attach_device,
+	.set_dev_pasid = intel_iommu_set_dev_pasid,
+	.map_pages = intel_iommu_map_pages,
+	.unmap_pages = intel_iommu_unmap_pages,
+	.iotlb_sync_map = intel_iommu_iotlb_sync_map,
+	.flush_iotlb_all = intel_flush_iotlb_all,
+	.iotlb_sync = intel_iommu_tlb_sync,
+	.iova_to_phys = intel_iommu_iova_to_phys,
+	.free = intel_iommu_domain_free,
+	.enforce_cache_coherency = intel_iommu_enforce_cache_coherency,
+};
+
 const struct iommu_ops intel_iommu_ops = {
 	.blocked_domain		= &blocking_domain,
 	.release_domain		= &blocking_domain,
@@ -4405,18 +4441,6 @@ const struct iommu_ops intel_iommu_ops = {
 	.def_domain_type	= device_def_domain_type,
 	.pgsize_bitmap		= SZ_4K,
 	.page_response		= intel_iommu_page_response,
-	.default_domain_ops = &(const struct iommu_domain_ops) {
-		.attach_dev		= intel_iommu_attach_device,
-		.set_dev_pasid		= intel_iommu_set_dev_pasid,
-		.map_pages		= intel_iommu_map_pages,
-		.unmap_pages		= intel_iommu_unmap_pages,
-		.iotlb_sync_map		= intel_iommu_iotlb_sync_map,
-		.flush_iotlb_all        = intel_flush_iotlb_all,
-		.iotlb_sync		= intel_iommu_tlb_sync,
-		.iova_to_phys		= intel_iommu_iova_to_phys,
-		.free			= intel_iommu_domain_free,
-		.enforce_cache_coherency = intel_iommu_enforce_cache_coherency,
-	}
 };
 
 static void quirk_iommu_igfx(struct pci_dev *dev)
diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h
index 50d69cc88a1f..d09b92871659 100644
--- a/drivers/iommu/intel/iommu.h
+++ b/drivers/iommu/intel/iommu.h
@@ -1380,6 +1380,18 @@ struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus,
 					 u8 devfn, int alloc);
 
 extern const struct iommu_ops intel_iommu_ops;
+extern const struct iommu_domain_ops intel_fs_paging_domain_ops;
+extern const struct iommu_domain_ops intel_ss_paging_domain_ops;
+
+static inline bool intel_domain_is_fs_paging(struct dmar_domain *domain)
+{
+	return domain->domain.ops == &intel_fs_paging_domain_ops;
+}
+
+static inline bool intel_domain_is_ss_paging(struct dmar_domain *domain)
+{
+	return domain->domain.ops == &intel_ss_paging_domain_ops;
+}
 
 #ifdef CONFIG_INTEL_IOMMU
 extern int intel_iommu_sm;
diff --git a/drivers/iommu/intel/nested.c b/drivers/iommu/intel/nested.c
index fc312f649f9e..1b6ad9c900a5 100644
--- a/drivers/iommu/intel/nested.c
+++ b/drivers/iommu/intel/nested.c
@@ -216,8 +216,7 @@ intel_iommu_domain_alloc_nested(struct device *dev, struct iommu_domain *parent,
 	/* Must be nested domain */
 	if (user_data->type != IOMMU_HWPT_DATA_VTD_S1)
 		return ERR_PTR(-EOPNOTSUPP);
-	if (parent->ops != intel_iommu_ops.default_domain_ops ||
-	    !s2_domain->nested_parent)
+	if (!intel_domain_is_ss_paging(s2_domain) || !s2_domain->nested_parent)
 		return ERR_PTR(-EINVAL);
 
 	ret = iommu_copy_struct_from_user(&vtd, user_data,
@@ -229,7 +228,6 @@ intel_iommu_domain_alloc_nested(struct device *dev, struct iommu_domain *parent,
 	if (!domain)
 		return ERR_PTR(-ENOMEM);
 
-	domain->use_first_level = true;
 	domain->s2_domain = s2_domain;
 	domain->s1_cfg = vtd;
 	domain->domain.ops = &intel_nested_domain_ops;
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 8c0bed36c587..e147f71f91b7 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -214,7 +214,6 @@ struct iommu_domain *intel_svm_domain_alloc(struct device *dev,
 		return ERR_PTR(-ENOMEM);
 
 	domain->domain.ops = &intel_svm_domain_ops;
-	domain->use_first_level = true;
 	INIT_LIST_HEAD(&domain->dev_pasids);
 	INIT_LIST_HEAD(&domain->cache_tags);
 	spin_lock_init(&domain->cache_lock);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 08/11] iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
                   ` (6 preceding siblings ...)
  2025-07-14  4:50 ` [PATCH 07/11] iommu/vt-d: Create unique domain ops for each stage Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14  4:50 ` [PATCH 09/11] iommu/vt-d: Split paging_domain_compatible() Lu Baolu
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: Jason Gunthorpe <jgg@nvidia.com>

First stage and second stage have very different ways to deny no-snoop.
The first stage uses the PGSNP bit, which applies globally per PASID entry,
so enabling it requires loading new PASID entries for all the attached
devices.

The second stage uses a bit per PTE, so enabling it just requires telling
future maps to set the bit.

Since we now have two domain ops, we can have two functions that directly
code their required actions instead of a bunch of logic dancing around
use_first_level.

Combine domain_set_force_snooping() into the new functions since they are
its only callers.
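
Condensed from the hunks below, the two paths end up doing quite different
work:

	/* First stage: reprogram the PASID entries of every attached device. */
	dmar_domain->force_snooping = true;
	list_for_each_entry(info, &dmar_domain->devices, link)
		intel_pasid_setup_page_snoop_control(info->iommu, info->dev,
						     IOMMU_NO_PASID);

	/* Second stage: only future mappings need the per-PTE SNP bit. */
	dmar_domain->set_pte_snp = true;
	dmar_domain->force_snooping = true;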

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 47 +++++++++++++++++--------------------
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index b7b1a3d2cbfc..95619640b027 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3643,44 +3643,41 @@ static bool domain_support_force_snooping(struct dmar_domain *domain)
 	return support;
 }
 
-static void domain_set_force_snooping(struct dmar_domain *domain)
+static bool intel_iommu_enforce_cache_coherency_fs(struct iommu_domain *domain)
 {
+	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	struct device_domain_info *info;
 
-	assert_spin_locked(&domain->lock);
-	/*
-	 * Second level page table supports per-PTE snoop control. The
-	 * iommu_map() interface will handle this by setting SNP bit.
-	 */
-	if (!domain->use_first_level) {
-		domain->set_pte_snp = true;
-		return;
-	}
+	guard(spinlock_irqsave)(&dmar_domain->lock);
 
-	list_for_each_entry(info, &domain->devices, link)
+	if (dmar_domain->force_snooping)
+		return true;
+
+	if (!domain_support_force_snooping(dmar_domain))
+		return false;
+
+	dmar_domain->force_snooping = true;
+	list_for_each_entry(info, &dmar_domain->devices, link)
 		intel_pasid_setup_page_snoop_control(info->iommu, info->dev,
 						     IOMMU_NO_PASID);
+	return true;
 }
 
-static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
+static bool intel_iommu_enforce_cache_coherency_ss(struct iommu_domain *domain)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
-	unsigned long flags;
 
-	if (dmar_domain->force_snooping)
-		return true;
-
-	spin_lock_irqsave(&dmar_domain->lock, flags);
+	guard(spinlock_irqsave)(&dmar_domain->lock);
 	if (!domain_support_force_snooping(dmar_domain) ||
-	    (!dmar_domain->use_first_level && dmar_domain->has_mappings)) {
-		spin_unlock_irqrestore(&dmar_domain->lock, flags);
+	    dmar_domain->has_mappings)
 		return false;
-	}
 
-	domain_set_force_snooping(dmar_domain);
+	/*
+	 * Second level page table supports per-PTE snoop control. The
+	 * iommu_map() interface will handle this by setting SNP bit.
+	 */
+	dmar_domain->set_pte_snp = true;
 	dmar_domain->force_snooping = true;
-	spin_unlock_irqrestore(&dmar_domain->lock, flags);
-
 	return true;
 }
 
@@ -4407,7 +4404,7 @@ const struct iommu_domain_ops intel_fs_paging_domain_ops = {
 	.iotlb_sync = intel_iommu_tlb_sync,
 	.iova_to_phys = intel_iommu_iova_to_phys,
 	.free = intel_iommu_domain_free,
-	.enforce_cache_coherency = intel_iommu_enforce_cache_coherency,
+	.enforce_cache_coherency = intel_iommu_enforce_cache_coherency_fs,
 };
 
 const struct iommu_domain_ops intel_ss_paging_domain_ops = {
@@ -4420,7 +4417,7 @@ const struct iommu_domain_ops intel_ss_paging_domain_ops = {
 	.iotlb_sync = intel_iommu_tlb_sync,
 	.iova_to_phys = intel_iommu_iova_to_phys,
 	.free = intel_iommu_domain_free,
-	.enforce_cache_coherency = intel_iommu_enforce_cache_coherency,
+	.enforce_cache_coherency = intel_iommu_enforce_cache_coherency_ss,
 };
 
 const struct iommu_ops intel_iommu_ops = {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 09/11] iommu/vt-d: Split paging_domain_compatible()
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
                   ` (7 preceding siblings ...)
  2025-07-14  4:50 ` [PATCH 08/11] iommu/vt-d: Split intel_iommu_enforce_cache_coherency() Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14  4:50 ` [PATCH 10/11] iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all Lu Baolu
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: Jason Gunthorpe <jgg@nvidia.com>

Make first/second stage specific functions that follow the same pattern as
intel_iommu_domain_alloc_first/second_stage() for computing EOPNOTSUPP.
This makes the code easier to understand: if we couldn't create a domain
with these parameters for this IOMMU instance, then we certainly are not
compatible with it.

Check superpage support directly against the per-stage cap bits and the
pgsize_bitmap.

Add a note that force_snooping is read without locking. The locking needs
to cover the compatibility check and the addition of the device to the
list.
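
For example, the second-stage helper in the diff below checks the
superpage capability bits against the domain's pgsize_bitmap directly:

	unsigned int sslps = cap_super_page_val(iommu->cap);

	if (!(sslps & BIT(0)) && (dmar_domain->domain.pgsize_bitmap & SZ_2M))
		return -EINVAL;
	if (!(sslps & BIT(1)) && (dmar_domain->domain.pgsize_bitmap & SZ_1G))
		return -EINVAL;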

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/iommu.c | 66 ++++++++++++++++++++++++++++++-------
 1 file changed, 54 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 95619640b027..ddb981ad0e19 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3431,33 +3431,75 @@ static void intel_iommu_domain_free(struct iommu_domain *domain)
 	kfree(dmar_domain);
 }
 
+static int paging_domain_compatible_first_stage(struct dmar_domain *dmar_domain,
+						struct intel_iommu *iommu)
+{
+	if (WARN_ON(dmar_domain->domain.dirty_ops ||
+		    dmar_domain->nested_parent))
+		return -EINVAL;
+
+	/* Only SL is available in legacy mode */
+	if (!sm_supported(iommu) || !ecap_flts(iommu->ecap))
+		return -EINVAL;
+
+	/* Same page size support */
+	if (!cap_fl1gp_support(iommu->cap) &&
+	    (dmar_domain->domain.pgsize_bitmap & SZ_1G))
+		return -EINVAL;
+	return 0;
+}
+
+static int
+paging_domain_compatible_second_stage(struct dmar_domain *dmar_domain,
+				      struct intel_iommu *iommu)
+{
+	unsigned int sslps = cap_super_page_val(iommu->cap);
+
+	if (dmar_domain->domain.dirty_ops && !ssads_supported(iommu))
+		return -EINVAL;
+	if (dmar_domain->nested_parent && !nested_supported(iommu))
+		return -EINVAL;
+
+	/* Legacy mode always supports second stage */
+	if (sm_supported(iommu) && !ecap_slts(iommu->ecap))
+		return -EINVAL;
+
+	/* Same page size support */
+	if (!(sslps & BIT(0)) && (dmar_domain->domain.pgsize_bitmap & SZ_2M))
+		return -EINVAL;
+	if (!(sslps & BIT(1)) && (dmar_domain->domain.pgsize_bitmap & SZ_1G))
+		return -EINVAL;
+	return 0;
+}
+
 int paging_domain_compatible(struct iommu_domain *domain, struct device *dev)
 {
 	struct device_domain_info *info = dev_iommu_priv_get(dev);
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
 	struct intel_iommu *iommu = info->iommu;
+	int ret = -EINVAL;
 	int addr_width;
 
-	if (WARN_ON_ONCE(!(domain->type & __IOMMU_DOMAIN_PAGING)))
-		return -EPERM;
+	if (intel_domain_is_fs_paging(dmar_domain))
+		ret = paging_domain_compatible_first_stage(dmar_domain, iommu);
+	else if (intel_domain_is_ss_paging(dmar_domain))
+		ret = paging_domain_compatible_second_stage(dmar_domain, iommu);
+	else if (WARN_ON(true))
+		ret = -EINVAL;
+	if (ret)
+		return ret;
 
+	/*
+	 * FIXME this is locked wrong, it needs to be under the
+	 * dmar_domain->lock
+	 */
 	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
 		return -EINVAL;
 
-	if (domain->dirty_ops && !ssads_supported(iommu))
-		return -EINVAL;
-
 	if (dmar_domain->iommu_coherency !=
 			iommu_paging_structure_coherency(iommu))
 		return -EINVAL;
 
-	if (dmar_domain->iommu_superpage !=
-			iommu_superpage_capability(iommu, dmar_domain->use_first_level))
-		return -EINVAL;
-
-	if (dmar_domain->use_first_level &&
-	    (!sm_supported(iommu) || !ecap_flts(iommu->ecap)))
-		return -EINVAL;
 
 	/* check if this iommu agaw is sufficient for max mapped address */
 	addr_width = agaw_to_width(iommu->agaw);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 10/11] iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
                   ` (8 preceding siblings ...)
  2025-07-14  4:50 ` [PATCH 09/11] iommu/vt-d: Split paging_domain_compatible() Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14  4:50 ` [PATCH 11/11] iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range Lu Baolu
  2025-07-14 11:00 ` [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Will Deacon
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: Ethan Milon <ethan.milon@eviden.com>

cache_tag_flush_all() was originally implemented, in commit c4d27ffaa8eb
("iommu/vt-d: Add cache tag invalidation helpers"), with incorrect device
TLB invalidation logic that does not handle PASID.

This causes regressions where full address space TLB invalidations occur
with a PASID attached, such as during transparent hugepage unmapping in
SVA configurations or when calling iommu_flush_iotlb_all(). In these
cases, the device receives a TLB invalidation that lacks the PASID.

This incorrect logic was later extracted into
cache_tag_flush_devtlb_all() in commit 3297d047cd7f ("iommu/vt-d:
Refactor IOTLB and Dev-IOTLB flush for batching").

Fix this by replacing the call to cache_tag_flush_devtlb_all() with
cache_tag_flush_devtlb_psi(), which properly handles the PASID.

Fixes: 4f609dbff51b ("iommu/vt-d: Use cache helpers in arch_invalidate_secondary_tlbs")
Fixes: 4e589a53685c ("iommu/vt-d: Use cache_tag_flush_all() in flush_iotlb_all")
Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Link: https://lore.kernel.org/r/20250708214821.30967-1-ethan.milon@eviden.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/cache.c | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/drivers/iommu/intel/cache.c b/drivers/iommu/intel/cache.c
index 876630e10849..071f78e67fcb 100644
--- a/drivers/iommu/intel/cache.c
+++ b/drivers/iommu/intel/cache.c
@@ -422,22 +422,6 @@ static void cache_tag_flush_devtlb_psi(struct dmar_domain *domain, struct cache_
 					     domain->qi_batch);
 }
 
-static void cache_tag_flush_devtlb_all(struct dmar_domain *domain, struct cache_tag *tag)
-{
-	struct intel_iommu *iommu = tag->iommu;
-	struct device_domain_info *info;
-	u16 sid;
-
-	info = dev_iommu_priv_get(tag->dev);
-	sid = PCI_DEVID(info->bus, info->devfn);
-
-	qi_batch_add_dev_iotlb(iommu, sid, info->pfsid, info->ats_qdep, 0,
-			       MAX_AGAW_PFN_WIDTH, domain->qi_batch);
-	if (info->dtlb_extra_inval)
-		qi_batch_add_dev_iotlb(iommu, sid, info->pfsid, info->ats_qdep, 0,
-				       MAX_AGAW_PFN_WIDTH, domain->qi_batch);
-}
-
 /*
  * Invalidates a range of IOVA from @start (inclusive) to @end (inclusive)
  * when the memory mappings in the target domain have been modified.
@@ -508,7 +492,7 @@ void cache_tag_flush_all(struct dmar_domain *domain)
 			break;
 		case CACHE_TAG_DEVTLB:
 		case CACHE_TAG_NESTING_DEVTLB:
-			cache_tag_flush_devtlb_all(domain, tag);
+			cache_tag_flush_devtlb_psi(domain, tag, 0, MAX_AGAW_PFN_WIDTH);
 			break;
 		}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 11/11] iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
                   ` (9 preceding siblings ...)
  2025-07-14  4:50 ` [PATCH 10/11] iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all Lu Baolu
@ 2025-07-14  4:50 ` Lu Baolu
  2025-07-14 11:00 ` [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Will Deacon
  11 siblings, 0 replies; 19+ messages in thread
From: Lu Baolu @ 2025-07-14  4:50 UTC (permalink / raw)
  To: Joerg Roedel; +Cc: iommu, linux-kernel

From: Ethan Milon <ethan.milon@eviden.com>

The logic in cache_tag_flush_all() to iterate over cache tags and issue
TLB invalidations is largely duplicated in cache_tag_flush_range(), with
the only difference being the range parameters.

Extend cache_tag_flush_range() to handle a full address space flush when
called with start = 0 and end = ULONG_MAX. This allows
cache_tag_flush_all() to simply delegate to cache_tag_flush_range().
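
After this change the full flush is just the degenerate range case
(condensed from the diff below):

	void cache_tag_flush_all(struct dmar_domain *domain)
	{
		cache_tag_flush_range(domain, 0, ULONG_MAX, 0);
	}

	/*
	 * Inside cache_tag_flush_range(), the (0, ULONG_MAX) pair selects the
	 * whole-address-space encoding: addr = 0, pages = -1 and
	 * mask = MAX_AGAW_PFN_WIDTH, exactly what the old flush_all used.
	 */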

Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Link: https://lore.kernel.org/r/20250708214821.30967-2-ethan.milon@eviden.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
---
 drivers/iommu/intel/cache.c | 34 ++++++++--------------------------
 drivers/iommu/intel/trace.h |  5 -----
 2 files changed, 8 insertions(+), 31 deletions(-)

diff --git a/drivers/iommu/intel/cache.c b/drivers/iommu/intel/cache.c
index 071f78e67fcb..265e7290256b 100644
--- a/drivers/iommu/intel/cache.c
+++ b/drivers/iommu/intel/cache.c
@@ -434,7 +434,13 @@ void cache_tag_flush_range(struct dmar_domain *domain, unsigned long start,
 	struct cache_tag *tag;
 	unsigned long flags;
 
-	addr = calculate_psi_aligned_address(start, end, &pages, &mask);
+	if (start == 0 && end == ULONG_MAX) {
+		addr = 0;
+		pages = -1;
+		mask = MAX_AGAW_PFN_WIDTH;
+	} else {
+		addr = calculate_psi_aligned_address(start, end, &pages, &mask);
+	}
 
 	spin_lock_irqsave(&domain->cache_lock, flags);
 	list_for_each_entry(tag, &domain->cache_tags, node) {
@@ -475,31 +481,7 @@ void cache_tag_flush_range(struct dmar_domain *domain, unsigned long start,
  */
 void cache_tag_flush_all(struct dmar_domain *domain)
 {
-	struct intel_iommu *iommu = NULL;
-	struct cache_tag *tag;
-	unsigned long flags;
-
-	spin_lock_irqsave(&domain->cache_lock, flags);
-	list_for_each_entry(tag, &domain->cache_tags, node) {
-		if (iommu && iommu != tag->iommu)
-			qi_batch_flush_descs(iommu, domain->qi_batch);
-		iommu = tag->iommu;
-
-		switch (tag->type) {
-		case CACHE_TAG_IOTLB:
-		case CACHE_TAG_NESTING_IOTLB:
-			cache_tag_flush_iotlb(domain, tag, 0, -1, 0, 0);
-			break;
-		case CACHE_TAG_DEVTLB:
-		case CACHE_TAG_NESTING_DEVTLB:
-			cache_tag_flush_devtlb_psi(domain, tag, 0, MAX_AGAW_PFN_WIDTH);
-			break;
-		}
-
-		trace_cache_tag_flush_all(tag);
-	}
-	qi_batch_flush_descs(iommu, domain->qi_batch);
-	spin_unlock_irqrestore(&domain->cache_lock, flags);
+	cache_tag_flush_range(domain, 0, ULONG_MAX, 0);
 }
 
 /*
diff --git a/drivers/iommu/intel/trace.h b/drivers/iommu/intel/trace.h
index 9defdae6ebae..6311ba3f1691 100644
--- a/drivers/iommu/intel/trace.h
+++ b/drivers/iommu/intel/trace.h
@@ -130,11 +130,6 @@ DEFINE_EVENT(cache_tag_log, cache_tag_unassign,
 	TP_ARGS(tag)
 );
 
-DEFINE_EVENT(cache_tag_log, cache_tag_flush_all,
-	TP_PROTO(struct cache_tag *tag),
-	TP_ARGS(tag)
-);
-
 DECLARE_EVENT_CLASS(cache_tag_flush,
 	TP_PROTO(struct cache_tag *tag, unsigned long start, unsigned long end,
 		 unsigned long addr, unsigned long pages, unsigned long mask),
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17
  2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
                   ` (10 preceding siblings ...)
  2025-07-14  4:50 ` [PATCH 11/11] iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range Lu Baolu
@ 2025-07-14 11:00 ` Will Deacon
  11 siblings, 0 replies; 19+ messages in thread
From: Will Deacon @ 2025-07-14 11:00 UTC (permalink / raw)
  To: Joerg Roedel, Lu Baolu
  Cc: catalin.marinas, kernel-team, Will Deacon, iommu, linux-kernel

On Mon, 14 Jul 2025 12:50:17 +0800, Lu Baolu wrote:
> The following changes have been queued for v6.17-rc1. They are about new
> features and code refactoring, including:
> 
>  - Reorganize Intel VT-d to be ready for iommupt
>  - Optimize iotlb_sync_map for non-caching/non-RWBF modes
>  - Fix missed PASID in dev TLB invalidation in cache_tag_flush_all()
>  - Miscellaneous cleanups
> 
> [...]

Applied to iommu (intel/vt-d), thanks!

[01/11] iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook
        https://git.kernel.org/iommu/c/bd26cd9d815a
[02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
        https://git.kernel.org/iommu/c/12724ce3fe1a
[03/11] iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid()
        https://git.kernel.org/iommu/c/cd0d0e4e48d8
[04/11] iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free()
        https://git.kernel.org/iommu/c/00939bebe51c
[05/11] iommu/vt-d: Do not wipe out the page table NID when devices detach
        https://git.kernel.org/iommu/c/5c3687d5789c
[06/11] iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
        https://git.kernel.org/iommu/c/b9434ba97c44
[07/11] iommu/vt-d: Create unique domain ops for each stage
        https://git.kernel.org/iommu/c/b33125296b50
[08/11] iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
        https://git.kernel.org/iommu/c/0fa6f0893466
[09/11] iommu/vt-d: Split paging_domain_compatible()
        https://git.kernel.org/iommu/c/85cfaacc9937
[10/11] iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all
        https://git.kernel.org/iommu/c/3141153816bf
[11/11] iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range
        https://git.kernel.org/iommu/c/e934464e098e

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  2025-07-14  4:50 ` [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes Lu Baolu
@ 2025-07-16 14:12   ` Jason Gunthorpe
  2025-07-17  2:40     ` Baolu Lu
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2025-07-16 14:12 UTC (permalink / raw)
  To: Lu Baolu; +Cc: Joerg Roedel, iommu, linux-kernel

On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>  	if (ret)
>  		goto out_block_translation;
>  
> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);

This has no locking and is in the wrong order anyhow :(

Any change to how invalidation works has to be done before attaching
the HW so that the required invalidations are already happening before
the HW can walk the page table.

And you need to serialize somehow with concurrent map/unmap as iommufd
doesn't prevent userspace from racing attach with map/unmap.

The cache_tag_assign_domain() looks similarly wrong: it needs to
start invalidating the cache tag of the new domain, then change the
context, and only then stop invalidating the cache tag of the old
domain. Otherwise there are invalidation races.
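
Roughly this ordering (sketch only; switch_hw_context() is a made-up
stand-in for the real context/PASID entry update):

static int switch_domain(struct dmar_domain *new, struct dmar_domain *old,
			 struct device *dev, ioasid_t pasid)
{
	int ret;

	/* 1. Start invalidating on behalf of the new domain. */
	ret = cache_tag_assign_domain(new, dev, pasid);
	if (ret)
		return ret;

	/* 2. Only now make the new page table visible to the HW. */
	switch_hw_context(dev, pasid, new);

	/* 3. Finally stop invalidating for the old domain. */
	cache_tag_unassign_domain(old, dev, pasid);
	return 0;
}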

Finally, if the HW needs RWBF then this also needs to do the buffer
flush in this thread before installing the context to prevent a race.

Overall this dynamic behavior may just be a bad idea, and perhaps you
can live with domains having the domain->iotlb_sync_map as a static
property set once during paging domain allocation.

If the iommu requires iotlb_sync_map but the domain does not have it
then the attach is rejected. This reduces domain sharing
possibilities, but maybe that is just fine??
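
Something along these lines at attach time, I mean (sketch only, the
helper name here is made up; domain_need_iotlb_sync_map() is the one
this patch adds):

static int check_iotlb_sync_map(struct dmar_domain *domain,
				struct intel_iommu *iommu)
{
	/*
	 * The iommu needs flush-on-map but the domain was not created
	 * with it: reject the attach.
	 */
	if (domain_need_iotlb_sync_map(domain, iommu) &&
	    !domain->iotlb_sync_map)
		return -EINVAL;

	return 0;
}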

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  2025-07-16 14:12   ` Jason Gunthorpe
@ 2025-07-17  2:40     ` Baolu Lu
  2025-07-17 11:55       ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Baolu Lu @ 2025-07-17  2:40 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Joerg Roedel, iommu, linux-kernel

On 7/16/25 22:12, Jason Gunthorpe wrote:
> On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
>> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>   	if (ret)
>>   		goto out_block_translation;
>>   
>> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
> 
> This has no locking and is in the wrong order anyhow :(
> 
> Any change to how invalidation works has to be done before attaching
> the HW so that the required invalidations are already happening before
> the HW can walk the page table.
> 
> And you need to serialize somehow with concurrent map/unmap as iommufd
> doesn't prevent userspace from racing attach with map/unmap.

domain->iotlb_sync_map does not change the driver's behavior. It simply
indicates that there's no need to waste time calling
cache_tag_flush_range_np(), as it's just a no-op.
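
In other words the callback just short-circuits; roughly (a sketch of
the idea, not the exact hunk -- only the flag test is new here):

static int intel_iommu_iotlb_sync_map(struct iommu_domain *domain,
				      unsigned long iova, size_t size)
{
	struct dmar_domain *dmar_domain = to_dmar_domain(domain);

	/* Nothing to flush on map for non-caching/non-RWBF setups. */
	if (!dmar_domain->iotlb_sync_map)
		return 0;

	cache_tag_flush_range_np(dmar_domain, iova, iova + size - 1);
	return 0;
}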

> 
> The cache_tag_assign_domain() looks similarly wrong: it needs to
> start invalidating the cache tag of the new domain, then change the
> context, and only then stop invalidating the cache tag of the old
> domain. Otherwise there are invalidation races.
> 
> Finally, if the HW needs RWBF then this also needs to do the buffer
> flush in this thread before installing the context to prevent a race.
> 
> Overall this dynamic behavior may just be a bad idea, and perhaps you
> can live with domains having the domain->iotlb_sync_map as a static
> property set once during paging domain allocation.
> 
> If the iommu requires iotlb_sync_map but the domain does not have it
> then the attach is rejected. This reduces domain sharing
> possibilities, but maybe that is just fine??

I previously discussed this with Kevin, and we agreed on a phase-by-
phase approach. As I mentioned, domain->iotlb_sync_map is merely a hint
for the driver, preventing it from looping through all cache tags to
determine if any cache invalidation work needs to be performed. We
already know up front that no work needs to be done.

RWBF is only required on some early implementations where memory
coherence was not yet implemented by the VT-d engine. It should be
difficult to find such systems in modern environments. Thus,
iotlb_sync_map is primarily relevant for nested translation that
utilizes S2 shadowing page tables. This, too, is a legacy feature, as
Intel has supported hardware-assisted nested translation for years.

Making iotlb_sync_map static is a feature, not an optimization. We are
still evaluating the value of this, as it's only truly helpful if there
are real use cases.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  2025-07-17  2:40     ` Baolu Lu
@ 2025-07-17 11:55       ` Jason Gunthorpe
  2025-07-18  2:56         ` Baolu Lu
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2025-07-17 11:55 UTC (permalink / raw)
  To: Baolu Lu; +Cc: Joerg Roedel, iommu, linux-kernel

On Thu, Jul 17, 2025 at 10:40:01AM +0800, Baolu Lu wrote:
> On 7/16/25 22:12, Jason Gunthorpe wrote:
> > On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
> > > @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
> > >   	if (ret)
> > >   		goto out_block_translation;
> > > +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
> > 
> > This has no locking and is in the wrong order anyhow :(
> > 
> > Any change to how invalidation works has to be done before attaching
> > the HW so that the required invalidations are already happening before
> > the HW can walk the page table.
> > 
> > And you need to serialize somehow with concurrent map/unmap as iommufd
> > doesn't prevent userspace from racing attach with map/unmap.
> 
> domain->iotlb_sync_map does not change the driver's behavior. It simply
> indicates that there's no need to waste time calling
> cache_tag_flush_range_np(), as it's just a no-op.

Of course it changes the behavior: it changes what the invalidation
callback does.

Without locking you have a race where a PGD that requires extra
flushing is visible to the HW while the SW is not doing that extra
flushing.

Before any PGD is made visible to the HW the software must ensure all
the required invalidations are happening.

> I previously discussed this with Kevin, and we agreed on a phase-by-
> phase approach. As I mentioned, domain->iotlb_sync_map is merely a hint
> for the driver, preventing it from looping through all cache tags to
> determine if any cache invalidation work needs to be performed. We
> already know up front that no work needs to be done.

The iteration through the cache tags is done inside a lock so it
doesn't have this race (it does have the issue I mentioned with
setting up the cache tag list though).

> RWBF is only required on some early implementations where memory
> coherence was not yet implemented by the VT-d engine. It should be
> difficult to find such systems in modern environments.

Then I would set it at domain creation time, check it during attach,
and remove this race.

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  2025-07-17 11:55       ` Jason Gunthorpe
@ 2025-07-18  2:56         ` Baolu Lu
  2025-07-18 13:29           ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Baolu Lu @ 2025-07-18  2:56 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Joerg Roedel, iommu, linux-kernel

On 7/17/25 19:55, Jason Gunthorpe wrote:
> On Thu, Jul 17, 2025 at 10:40:01AM +0800, Baolu Lu wrote:
>> On 7/16/25 22:12, Jason Gunthorpe wrote:
>>> On Mon, Jul 14, 2025 at 12:50:19PM +0800, Lu Baolu wrote:
>>>> @@ -1833,6 +1845,8 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
>>>>    	if (ret)
>>>>    		goto out_block_translation;
>>>> +	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
>>>
>>> This has no locking and is in the wrong order anyhow :(
>>>
>>> Any change to how invalidation works has to be done before attaching
>>> the HW so that the required invalidations are already happening before
>>> the HW can walk the page table.
>>>
>>> And you need to serialize somehow with concurrent map/unmap as iommufd
>>> doesn't prevent userspace from racing attach with map/unmap.
>>
>> domain->iotlb_sync_map does not change the driver's behavior. It simply
>> indicates that there's no need to waste time calling
>> cache_tag_flush_range_np(), as it's just a no-op.
> 
> Of course it changes the behavior: it changes what the invalidation
> callback does.
> 
> Without locking you have a race where a PGD that requires extra
> flushing is visible to the HW while the SW is not doing that extra
> flushing.
> 
> Before any PGD is made visible to the HW the software must ensure all
> the required invalidations are happening.

Oh, I understand now. If there is no synchronization between attach/
detach and map/unmap operations, the cache invalidation behavior must be
determined when a domain is allocated.

> 
>> I previously discussed this with Kevin, and we agreed on a phase-by-
>> phase approach. As I mentioned, domain->iotlb_sync_map is merely a hint
>> for the driver, preventing it from looping through all cache tags to
>> determine if any cache invalidation work needs to be performed. We
>> already know up front that no work needs to be done.
> 
> The iteration through the cache tags is done inside a lock so it
> doesn't have this race (it does have the issue I mentioned with
> setting up the cache tag list though).
> 
>> RWBF is only required on some early implementations where memory
>> coherence was not yet implemented by the VT-d engine. It should be
>> difficult to find such systems in modern environments.
> 
> Then I would set it at domain creation time, check it during attach,
> and remove this race.

How about the following changes (compiled but not tested)?

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 8db8be9b7e7d..bb00dc14275d 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1780,18 +1780,6 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
  					  __pa(pgd), flags, old);
  }

-static bool domain_need_iotlb_sync_map(struct dmar_domain *domain,
-				       struct intel_iommu *iommu)
-{
-	if (cap_caching_mode(iommu->cap) && intel_domain_is_ss_paging(domain))
-		return true;
-
-	if (rwbf_quirk || cap_rwbf(iommu->cap))
-		return true;
-
-	return false;
-}
-
  static int dmar_domain_attach_device(struct dmar_domain *domain,
  				     struct device *dev)
  {
@@ -1831,8 +1819,6 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
  	if (ret)
  		goto out_block_translation;

-	domain->iotlb_sync_map |= domain_need_iotlb_sync_map(domain, iommu);
-
  	return 0;

  out_block_translation:
@@ -3352,6 +3338,14 @@ intel_iommu_domain_alloc_first_stage(struct device *dev,
  		return ERR_CAST(dmar_domain);

  	dmar_domain->domain.ops = &intel_fs_paging_domain_ops;
+	/*
+	 * iotlb sync for map is only needed for legacy implementations that
+	 * explicitly require flushing internal write buffers to ensure memory
+	 * coherence.
+	 */
+	if (rwbf_quirk || cap_rwbf(iommu->cap))
+		dmar_domain->iotlb_sync_map = true;
+
  	return &dmar_domain->domain;
  }

@@ -3386,6 +3380,14 @@ intel_iommu_domain_alloc_second_stage(struct device *dev,
  	if (flags & IOMMU_HWPT_ALLOC_DIRTY_TRACKING)
  		dmar_domain->domain.dirty_ops = &intel_dirty_ops;

+	/*
+	 * Besides the internal write buffer flush, the caching mode used for
+	 * legacy nested translation (which utilizes shadowing page tables)
+	 * also requires iotlb sync on map.
+	 */
+	if (rwbf_quirk || cap_rwbf(iommu->cap) || cap_caching_mode(iommu->cap))
+		dmar_domain->iotlb_sync_map = true;
+
  	return &dmar_domain->domain;
  }

@@ -3446,6 +3448,11 @@ static int paging_domain_compatible_first_stage(struct dmar_domain *dmar_domain,
  	if (!cap_fl1gp_support(iommu->cap) &&
  	    (dmar_domain->domain.pgsize_bitmap & SZ_1G))
  		return -EINVAL;
+
+	/* iotlb sync on map requirement */
+	if ((rwbf_quirk || cap_rwbf(iommu->cap)) && !dmar_domain->iotlb_sync_map)
+		return -EINVAL;
+
  	return 0;
  }

@@ -3469,6 +3476,12 @@ paging_domain_compatible_second_stage(struct dmar_domain *dmar_domain,
  		return -EINVAL;
  	if (!(sslps & BIT(1)) && (dmar_domain->domain.pgsize_bitmap & SZ_1G))
  		return -EINVAL;
+
+	/* iotlb sync on map requirement */
+	if ((rwbf_quirk || cap_rwbf(iommu->cap) || cap_caching_mode(iommu->cap)) &&
+	    !dmar_domain->iotlb_sync_map)
+		return -EINVAL;
+
  	return 0;
  }

-- 
2.43.0

Thanks,
baolu


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  2025-07-18  2:56         ` Baolu Lu
@ 2025-07-18 13:29           ` Jason Gunthorpe
  2025-07-21  1:57             ` Baolu Lu
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2025-07-18 13:29 UTC (permalink / raw)
  To: Baolu Lu; +Cc: Joerg Roedel, iommu, linux-kernel

On Fri, Jul 18, 2025 at 10:56:24AM +0800, Baolu Lu wrote:

> How about the following changes (compiled but not tested)?

Yes, exactly what I was thinking!

That leaves the cache tag linked list race...

Nicolin is working on a neat solution for SMMU; I wonder if we can
reuse it here too.

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  2025-07-18 13:29           ` Jason Gunthorpe
@ 2025-07-21  1:57             ` Baolu Lu
  0 siblings, 0 replies; 19+ messages in thread
From: Baolu Lu @ 2025-07-21  1:57 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Joerg Roedel, iommu, linux-kernel

On 7/18/25 21:29, Jason Gunthorpe wrote:
> On Fri, Jul 18, 2025 at 10:56:24AM +0800, Baolu Lu wrote:
> 
>> How about the following changes (compiled but not tested)?
> Yes, exactly what I was thinking!

I will submit a formal patch.

> 
> That leaves the cache tag linked list race...
> 
> Nicolin is working on a neat solution for SMMU; I wonder if we can
> reuse it here too.

Okay, let's wait and see.

Thanks,
baolu

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-07-21  1:59 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-14  4:50 [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Lu Baolu
2025-07-14  4:50 ` [PATCH 01/11] iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook Lu Baolu
2025-07-14  4:50 ` [PATCH 02/11] iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes Lu Baolu
2025-07-16 14:12   ` Jason Gunthorpe
2025-07-17  2:40     ` Baolu Lu
2025-07-17 11:55       ` Jason Gunthorpe
2025-07-18  2:56         ` Baolu Lu
2025-07-18 13:29           ` Jason Gunthorpe
2025-07-21  1:57             ` Baolu Lu
2025-07-14  4:50 ` [PATCH 03/11] iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid() Lu Baolu
2025-07-14  4:50 ` [PATCH 04/11] iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free() Lu Baolu
2025-07-14  4:50 ` [PATCH 05/11] iommu/vt-d: Do not wipe out the page table NID when devices detach Lu Baolu
2025-07-14  4:50 ` [PATCH 06/11] iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags() Lu Baolu
2025-07-14  4:50 ` [PATCH 07/11] iommu/vt-d: Create unique domain ops for each stage Lu Baolu
2025-07-14  4:50 ` [PATCH 08/11] iommu/vt-d: Split intel_iommu_enforce_cache_coherency() Lu Baolu
2025-07-14  4:50 ` [PATCH 09/11] iommu/vt-d: Split paging_domain_compatible() Lu Baolu
2025-07-14  4:50 ` [PATCH 10/11] iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all Lu Baolu
2025-07-14  4:50 ` [PATCH 11/11] iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range Lu Baolu
2025-07-14 11:00 ` [PATCH 00/11] [PULL REQUEST] Intel IOMMU updates for v6.17 Will Deacon
