public inbox for linux-kernel@vger.kernel.org
* [PATCH RFCv1 0/3] Allow ATS to be always on for certain ATS-capable devices
@ 2026-01-17  4:56 Nicolin Chen
  2026-01-17  4:56 ` [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices Nicolin Chen
                   ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Nicolin Chen @ 2026-01-17  4:56 UTC (permalink / raw)
  To: jgg, will, robin.murphy, bhelgaas
  Cc: joro, praan, baolu.lu, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-pci

The PCI ATS function is controlled by the IOMMU driver, which calls the
pci_enable_ats() and pci_disable_ats() helpers. In general, the IOMMU driver
enables ATS only when a translation channel is enabled on a PASID, typically
for an SVA use case. When a device's RID is IOMMU bypassed and no active
PASID is running an SVA use case, ATS is always disabled.

However, certain PCIe devices support non-PASID ATS on their RID, even if
the RID is IOMMU bypassed. E.g. the CXL.cache capability requires ATS to
access physical memory; some NVIDIA GPUs in non-CXL configurations also
support ATS on a bypassed RID.

Provide a helper function to detect CXL.cache capability and scan through a
device ID list.

As the initial use case, call the helper in the ARM SMMUv3 driver and adapt
the driver accordingly with a per-device ats_always_on flag.

This is on GitHub:
https://github.com/nicolinc/iommufd/commits/pci_ats_always_on-rfcv1/

Nicolin Chen (3):
  PCI: Allow ATS to be always on for CXL.cache capable devices
  PCI: Allow ATS to be always on for non-CXL NVIDIA GPUs
  iommu/arm-smmu-v3: Allow ATS to be always on

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 drivers/pci/pci.h                           |  9 +++
 include/linux/pci-ats.h                     |  3 +
 include/uapi/linux/pci_regs.h               |  5 ++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 74 ++++++++++++++++++---
 drivers/pci/ats.c                           | 45 +++++++++++++
 drivers/pci/quirks.c                        | 23 +++++++
 7 files changed, 149 insertions(+), 11 deletions(-)

-- 
2.43.0



* [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-17  4:56 [PATCH RFCv1 0/3] Allow ATS to be always on for certain ATS-capable devices Nicolin Chen
@ 2026-01-17  4:56 ` Nicolin Chen
  2026-01-19 17:58   ` Jason Gunthorpe
  2026-01-21  8:01   ` Tian, Kevin
  2026-01-17  4:56 ` [PATCH RFCv1 2/3] PCI: Allow ATS to be always on for non-CXL NVIDIA GPUs Nicolin Chen
  2026-01-17  4:56 ` [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on Nicolin Chen
  2 siblings, 2 replies; 60+ messages in thread
From: Nicolin Chen @ 2026-01-17  4:56 UTC (permalink / raw)
  To: jgg, will, robin.murphy, bhelgaas
  Cc: joro, praan, baolu.lu, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-pci

Controlled by the IOMMU driver, ATS is usually enabled "on demand", when a
device requests a translation service from its associated IOMMU HW running
on the channel of a given PASID. This works even when a device has no
translation on its RID, i.e. when the RID is IOMMU bypassed.

On the other hand, certain PCIe devices require non-PASID ATS even when
their RID stream is IOMMU bypassed. Call this "always on".

For instance, the CXL spec notes in "3.2.5.13 Memory Type on CXL.cache":
"To source requests on CXL.cache, devices need to get the Host Physical
 Address (HPA) from the Host by means of an ATS request on CXL.io."
In other words, the CXL.cache capability relies on ATS. Without ATS, such a
device has no access to host physical memory.

Introduce a new pci_ats_always_on() helper that lets an IOMMU driver check
a PCI device and choose between the "on demand" and "always on" ATS
policies.

Add support for CXL.cache devices first. Non-CXL devices will be added in
quirks.c.

Suggested-by: Vikram Sethi <vsethi@nvidia.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/pci-ats.h       |  3 +++
 include/uapi/linux/pci_regs.h |  5 ++++
 drivers/pci/ats.c             | 44 +++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
index 75c6c86cf09d..d14ba727d38b 100644
--- a/include/linux/pci-ats.h
+++ b/include/linux/pci-ats.h
@@ -12,6 +12,7 @@ int pci_prepare_ats(struct pci_dev *dev, int ps);
 void pci_disable_ats(struct pci_dev *dev);
 int pci_ats_queue_depth(struct pci_dev *dev);
 int pci_ats_page_aligned(struct pci_dev *dev);
+bool pci_ats_always_on(struct pci_dev *dev);
 #else /* CONFIG_PCI_ATS */
 static inline bool pci_ats_supported(struct pci_dev *d)
 { return false; }
@@ -24,6 +25,8 @@ static inline int pci_ats_queue_depth(struct pci_dev *d)
 { return -ENODEV; }
 static inline int pci_ats_page_aligned(struct pci_dev *dev)
 { return 0; }
+static inline bool pci_ats_always_on(struct pci_dev *dev)
+{ return false; }
 #endif /* CONFIG_PCI_ATS */
 
 #ifdef CONFIG_PCI_PRI
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 3add74ae2594..84da6d7645a3 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1258,6 +1258,11 @@
 #define PCI_DVSEC_CXL_PORT_CTL				0x0c
 #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
 
+/* CXL 2.0 8.1.3: PCIe DVSEC for CXL Device */
+#define CXL_DVSEC_PCIE_DEVICE				0
+#define   CXL_DVSEC_CAP_OFFSET				0xA
+#define     CXL_DVSEC_CACHE_CAPABLE			BIT(0)
+
 /* Integrity and Data Encryption Extended Capability */
 #define PCI_IDE_CAP			0x04
 #define  PCI_IDE_CAP_LINK		0x1  /* Link IDE Stream Supported */
diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index ec6c8dbdc5e9..1795131f0697 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -205,6 +205,50 @@ int pci_ats_page_aligned(struct pci_dev *pdev)
 	return 0;
 }
 
+/*
+ * CXL r4.0, sec 3.2.5.13 Memory Type on CXL.cache notes: to source requests on
+ * CXL.cache, devices need to get the Host Physical Address (HPA) from the Host
+ * by means of an ATS request on CXL.io.
+ *
+ * In other words, CXL.cache devices cannot access physical memory without ATS.
+ */
+static bool pci_cxl_ats_always_on(struct pci_dev *pdev)
+{
+	int offset;
+	u16 cap;
+
+	offset = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
+					   CXL_DVSEC_PCIE_DEVICE);
+	if (!offset)
+		return false;
+
+	pci_read_config_word(pdev, offset + CXL_DVSEC_CAP_OFFSET, &cap);
+	if (cap & CXL_DVSEC_CACHE_CAPABLE)
+		return true;
+
+	return false;
+}
+
+/**
+ * pci_ats_always_on - Whether the PCI device requires ATS to be always enabled
+ * @pdev: the PCI device
+ *
+ * Returns true, if the PCI device requires non-PASID ATS function on an IOMMU
+ * bypassed configuration.
+ */
+bool pci_ats_always_on(struct pci_dev *pdev)
+{
+	if (pci_ats_disabled() || !pci_ats_supported(pdev))
+		return false;
+
+	/* A VF inherits its PF's requirement for ATS function */
+	if (pdev->is_virtfn)
+		pdev = pci_physfn(pdev);
+
+	return pci_cxl_ats_always_on(pdev);
+}
+EXPORT_SYMBOL_GPL(pci_ats_always_on);
+
 #ifdef CONFIG_PCI_PRI
 void pci_pri_init(struct pci_dev *pdev)
 {
-- 
2.43.0
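
For readers without a kernel tree at hand, the DVSEC lookup done by
pci_cxl_ats_always_on() can be modelled in plain C against an in-memory copy
of config space. This is only a sketch under stated assumptions: cfg_read16(),
cfg_read32(), cfg_write16(), find_dvsec(), cxl_cache_capable() and
put_cxl_dvsec() are hypothetical names that do not exist in the kernel, the
byte array stands in for real config accesses, and the header layout
(capability ID 0x23, vendor ID at +4, DVSEC ID at +8) is taken from the PCIe
DVSEC definition rather than from this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CFG_SIZE          4096   /* PCIe extended config space size */
#define EXT_CAP_START     0x100  /* offset of the first extended capability */
#define EXT_CAP_ID_DVSEC  0x23   /* Designated Vendor-Specific Extended Capability */
#define VENDOR_ID_CXL     0x1e98
#define CXL_DVSEC_DEVICE  0      /* DVSEC ID of "PCIe DVSEC for CXL Device" */
#define CXL_CAP_OFFSET    0xA    /* mirrors CXL_DVSEC_CAP_OFFSET in the patch */
#define CXL_CACHE_CAPABLE 0x1    /* mirrors CXL_DVSEC_CACHE_CAPABLE, bit 0 */

/* Little-endian accessors over a byte-array copy of config space. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned int off)
{
	return (uint16_t)(cfg[off] | ((unsigned int)cfg[off + 1] << 8));
}

static uint32_t cfg_read32(const uint8_t *cfg, unsigned int off)
{
	return (uint32_t)cfg_read16(cfg, off) |
	       ((uint32_t)cfg_read16(cfg, off + 2) << 16);
}

static void cfg_write16(uint8_t *cfg, unsigned int off, uint16_t val)
{
	cfg[off] = (uint8_t)(val & 0xff);
	cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Walk the extended capability list; return the DVSEC offset, or 0. */
static unsigned int find_dvsec(const uint8_t *cfg, uint16_t vendor, uint16_t id)
{
	unsigned int pos = EXT_CAP_START;
	int ttl = (CFG_SIZE - EXT_CAP_START) / 8;	/* guards against loops */

	while (pos >= EXT_CAP_START && pos + 12 <= CFG_SIZE && ttl-- > 0) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if ((hdr & 0xffff) == EXT_CAP_ID_DVSEC &&
		    cfg_read16(cfg, pos + 4) == vendor &&
		    cfg_read16(cfg, pos + 8) == id)
			return pos;

		pos = hdr >> 20;	/* next capability offset, bits 31:20 */
	}
	return 0;
}

/* Model of the check in the patch: DVSEC present and Cache_Capable set. */
static bool cxl_cache_capable(const uint8_t *cfg)
{
	unsigned int pos = find_dvsec(cfg, VENDOR_ID_CXL, CXL_DVSEC_DEVICE);

	if (!pos)
		return false;
	return (cfg_read16(cfg, pos + CXL_CAP_OFFSET) & CXL_CACHE_CAPABLE) != 0;
}

/* Test helper: place a CXL device DVSEC at @pos, terminating the list. */
static void put_cxl_dvsec(uint8_t *cfg, unsigned int pos, uint16_t cap)
{
	assert(pos >= EXT_CAP_START && pos + 12 <= CFG_SIZE);
	cfg_write16(cfg, pos, EXT_CAP_ID_DVSEC);	/* capability ID */
	cfg_write16(cfg, pos + 2, 0x0001);		/* version 1, next = 0 */
	cfg_write16(cfg, pos + 4, VENDOR_ID_CXL);	/* DVSEC vendor ID */
	cfg_write16(cfg, pos + 8, CXL_DVSEC_DEVICE);	/* DVSEC ID */
	cfg_write16(cfg, pos + CXL_CAP_OFFSET, cap);	/* CXL capability */
}
```

Filling a zeroed buffer with put_cxl_dvsec(cfg, 0x100, 0x1) and calling
cxl_cache_capable(cfg) exercises the same two steps as the patch: find the
DVSEC, then test bit 0 of the 16-bit capability register at offset 0xA.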



* [PATCH RFCv1 2/3] PCI: Allow ATS to be always on for non-CXL NVIDIA GPUs
  2026-01-17  4:56 [PATCH RFCv1 0/3] Allow ATS to be always on for certain ATS-capable devices Nicolin Chen
  2026-01-17  4:56 ` [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices Nicolin Chen
@ 2026-01-17  4:56 ` Nicolin Chen
  2026-01-19 18:00   ` Jason Gunthorpe
  2026-01-17  4:56 ` [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on Nicolin Chen
  2 siblings, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-01-17  4:56 UTC (permalink / raw)
  To: jgg, will, robin.murphy, bhelgaas
  Cc: joro, praan, baolu.lu, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-pci

Some non-CXL NVIDIA GPU devices support non-PASID ATS function when their
RIDs are IOMMU bypassed. This differs slightly from the default ATS policy,
which enables ATS only on demand, i.e. when a non-zero PASID line is enabled
in SVA use cases.

Introduce a pci_dev_specific_ats_always_on() quirk function to support a
list of IDs for these devices. Then, call it from pci_ats_always_on().

Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/pci/pci.h    |  9 +++++++++
 drivers/pci/ats.c    |  3 ++-
 drivers/pci/quirks.c | 23 +++++++++++++++++++++++
 3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 0e67014aa001..1391df064983 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1032,6 +1032,15 @@ static inline int pci_dev_specific_reset(struct pci_dev *dev, bool probe)
 }
 #endif
 
+#if defined(CONFIG_PCI_QUIRKS) && defined(CONFIG_PCI_ATS)
+bool pci_dev_specific_ats_always_on(struct pci_dev *dev);
+#else
+static inline bool pci_dev_specific_ats_always_on(struct pci_dev *dev)
+{
+	return false;
+}
+#endif
+
 #if defined(CONFIG_PCI_QUIRKS) && defined(CONFIG_ARM64)
 int acpi_get_rc_resources(struct device *dev, const char *hid, u16 segment,
 			  struct resource *res);
diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
index 1795131f0697..6db45ae2cc8e 100644
--- a/drivers/pci/ats.c
+++ b/drivers/pci/ats.c
@@ -245,7 +245,8 @@ bool pci_ats_always_on(struct pci_dev *pdev)
 	if (pdev->is_virtfn)
 		pdev = pci_physfn(pdev);
 
-	return pci_cxl_ats_always_on(pdev);
+	return pci_cxl_ats_always_on(pdev) ||
+	       pci_dev_specific_ats_always_on(pdev);
 }
 EXPORT_SYMBOL_GPL(pci_ats_always_on);
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index b9c252aa6fe0..afc1d2adb13a 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5654,6 +5654,29 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1457, quirk_intel_e2000_no_ats);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x1459, quirk_intel_e2000_no_ats);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x145a, quirk_intel_e2000_no_ats);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, 0x145c, quirk_intel_e2000_no_ats);
+
+static const struct pci_dev_ats_always_on {
+	u16 vendor;
+	u16 device;
+} pci_dev_ats_always_on[] = {
+	{ PCI_VENDOR_ID_NVIDIA, 0x2e12, },
+	{ PCI_VENDOR_ID_NVIDIA, 0x2e2a, },
+	{ PCI_VENDOR_ID_NVIDIA, 0x2e2b, },
+	{ 0 }
+};
+
+/* Some non-CXL devices support ATS on RID when it is IOMMU-bypassed */
+bool pci_dev_specific_ats_always_on(struct pci_dev *pdev)
+{
+	const struct pci_dev_ats_always_on *i;
+
+	for (i = pci_dev_ats_always_on; i->vendor; i++) {
+		if (i->vendor == pdev->vendor && i->device == pdev->device)
+			return true;
+	}
+
+	return false;
+}
 #endif /* CONFIG_PCI_ATS */
 
 /* Freescale PCIe doesn't support MSI in RC mode */
-- 
2.43.0
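
The combined policy after this patch (global ATS switch, per-device ATS
support, VF-to-PF redirection, then CXL.cache or the ID table) can be
sketched outside the kernel as follows. This is an illustration, not the
kernel code: struct fake_pci_dev, dev_specific_ats_always_on() and
ats_always_on() are hypothetical stand-ins, and the two boolean fields simply
model the results of pci_ats_supported() and pci_cxl_ats_always_on(). Only
the three device IDs are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_ID_NVIDIA 0x10de

/* Stand-in for the few struct pci_dev properties the policy looks at. */
struct fake_pci_dev {
	uint16_t vendor;
	uint16_t device;
	bool ats_supported;			/* models pci_ats_supported() */
	bool cxl_cache_capable;			/* models pci_cxl_ats_always_on() */
	const struct fake_pci_dev *physfn;	/* non-NULL only for a VF */
};

struct ats_always_on_id {
	uint16_t vendor;
	uint16_t device;
};

/* Device IDs copied from the quirk table in this patch. */
static const struct ats_always_on_id ats_always_on_ids[] = {
	{ VENDOR_ID_NVIDIA, 0x2e12 },
	{ VENDOR_ID_NVIDIA, 0x2e2a },
	{ VENDOR_ID_NVIDIA, 0x2e2b },
};

/* Linear scan of the ID table, as in the quirk function. */
static bool dev_specific_ats_always_on(const struct fake_pci_dev *dev)
{
	size_t n = sizeof(ats_always_on_ids) / sizeof(ats_always_on_ids[0]);
	size_t i;

	for (i = 0; i < n; i++) {
		if (ats_always_on_ids[i].vendor == dev->vendor &&
		    ats_always_on_ids[i].device == dev->device)
			return true;
	}
	return false;
}

/* @ats_disabled models the global pci_ats_disabled() switch. */
static bool ats_always_on(const struct fake_pci_dev *dev, bool ats_disabled)
{
	assert(dev);

	if (ats_disabled || !dev->ats_supported)
		return false;

	/* A VF inherits its PF's requirement for ATS. */
	if (dev->physfn)
		dev = dev->physfn;

	return dev->cxl_cache_capable || dev_specific_ats_always_on(dev);
}
```

Note the ordering this models: ATS support is checked on the function that
was passed in, while the CXL.cache and ID-table checks run on its PF.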



* [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-17  4:56 [PATCH RFCv1 0/3] Allow ATS to be always on for certain ATS-capable devices Nicolin Chen
  2026-01-17  4:56 ` [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices Nicolin Chen
  2026-01-17  4:56 ` [PATCH RFCv1 2/3] PCI: Allow ATS to be always on for non-CXL NVIDIA GPUs Nicolin Chen
@ 2026-01-17  4:56 ` Nicolin Chen
  2026-01-19 20:06   ` Jason Gunthorpe
  2026-01-26 12:39   ` Will Deacon
  2 siblings, 2 replies; 60+ messages in thread
From: Nicolin Chen @ 2026-01-17  4:56 UTC (permalink / raw)
  To: jgg, will, robin.murphy, bhelgaas
  Cc: joro, praan, baolu.lu, kevin.tian, miko.lenczewski,
	linux-arm-kernel, iommu, linux-kernel, linux-pci

When a device's default substream attaches to an identity domain, the SMMU
driver currently sets the device's STE between two modes:

  Mode 1: Cfg=Translate, S1DSS=Bypass, EATS=1
  Mode 2: Cfg=bypass (EATS is ignored by HW)

Mode 1 is used when there is an active PASID (non-default substream). Mode 2
is used when there is no PASID support or no active PASID.

The driver will also downgrade an STE from mode 1 to mode 2 when the last
active substream becomes inactive.

However, there are PCIe devices that demand ATS to be always on. The STEs of
these devices have to use mode 1, because HW ignores EATS in mode 2.

Change the driver accordingly:
  - always use mode 1
  - never downgrade to mode 2
  - allocate and retain a CD table (see note below)

Note that these devices might not support PASID, i.e. they do non-PASID
ATS. In such a case, ssid_bits is set to 0. However, s1cdmax must be set to
a non-zero value to keep the S1DSS field effective. Thus, when a master
requires ats_always_on, set its s1cdmax to at least 1, meaning the CD table
will have a dummy entry (SSID=1) that will never be used.

Now, for these devices, arm_smmu_cdtab_allocated() will always return true,
vs. false prior to this change. When the default substream of such a device
is attached to an IDENTITY domain, its first CD is NULL in the table, which
is a perfectly valid case. Thus, drop the WARN_ON().

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 74 ++++++++++++++++++---
 2 files changed, 64 insertions(+), 11 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index ae23aacc3840..2ed68f43347e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -850,6 +850,7 @@ struct arm_smmu_master {
 	bool				ats_enabled : 1;
 	bool				ste_ats_enabled : 1;
 	bool				stall_enabled;
+	bool				ats_always_on;
 	unsigned int			ssid_bits;
 	unsigned int			iopf_refcount;
 };
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d16d35c78c06..5b7deb708636 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1422,7 +1422,7 @@ void arm_smmu_clear_cd(struct arm_smmu_master *master, ioasid_t ssid)
 	if (!arm_smmu_cdtab_allocated(&master->cd_table))
 		return;
 	cdptr = arm_smmu_get_cd_ptr(master, ssid);
-	if (WARN_ON(!cdptr))
+	if (!cdptr)
 		return;
 	arm_smmu_write_cd_entry(master, ssid, cdptr, &target);
 }
@@ -1436,6 +1436,22 @@ static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master)
 	struct arm_smmu_ctx_desc_cfg *cd_table = &master->cd_table;
 
 	cd_table->s1cdmax = master->ssid_bits;
+
+	/*
+	 * When a device doesn't support PASID (non default SSID), ssid_bits is
+	 * set to 0. This also sets S1CDMAX to 0, which disables the substreams
+	 * and ignores the S1DSS field.
+	 *
+	 * On the other hand, if a device demands ATS to be always on even when
+	 * its default substream is IOMMU bypassed, it has to use EATS that is
+	 * only effective with an STE (CFG=S1translate, S1DSS=Bypass). For such
+	 * use cases, S1CDMAX has to be !0, in order to make use of S1DSS/EATS.
+	 *
+	 * Set S1CDMAX no lower than 1. This would add a dummy substream in the
+	 * CD table but it should never be used by an actual CD.
+	 */
+	if (master->ats_always_on)
+		cd_table->s1cdmax = max_t(u8, cd_table->s1cdmax, 1);
 	max_contexts = 1 << cd_table->s1cdmax;
 
 	if (!(smmu->features & ARM_SMMU_FEAT_2_LVL_CDTAB) ||
@@ -3189,7 +3205,8 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
 	 * When the last user of the CD table goes away downgrade the STE back
 	 * to a non-cd_table one, by re-attaching its sid_domain.
 	 */
-	if (!arm_smmu_ssids_in_use(&master->cd_table)) {
+	if (!master->ats_always_on &&
+	    !arm_smmu_ssids_in_use(&master->cd_table)) {
 		struct iommu_domain *sid_domain =
 			iommu_get_domain_for_dev(master->dev);
 
@@ -3205,7 +3222,7 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
 				    struct iommu_domain *old_domain,
 				    struct device *dev,
 				    struct arm_smmu_ste *ste,
-				    unsigned int s1dss)
+				    unsigned int s1dss, bool ats_always_on)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
 	struct arm_smmu_attach_state state = {
@@ -3224,7 +3241,7 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
 	 * If the CD table is not in use we can use the provided STE, otherwise
 	 * we use a cdtable STE with the provided S1DSS.
 	 */
-	if (arm_smmu_ssids_in_use(&master->cd_table)) {
+	if (ats_always_on || arm_smmu_ssids_in_use(&master->cd_table)) {
 		/*
 		 * If a CD table has to be present then we need to run with ATS
 		 * on because we have to assume a PASID is using ATS. For
@@ -3260,7 +3277,8 @@ static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
 	arm_smmu_master_clear_vmaster(master);
 	arm_smmu_make_bypass_ste(master->smmu, &ste);
 	arm_smmu_attach_dev_ste(domain, old_domain, dev, &ste,
-				STRTAB_STE_1_S1DSS_BYPASS);
+				STRTAB_STE_1_S1DSS_BYPASS,
+				master->ats_always_on);
 	return 0;
 }
 
@@ -3283,7 +3301,7 @@ static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
 	arm_smmu_master_clear_vmaster(master);
 	arm_smmu_make_abort_ste(&ste);
 	arm_smmu_attach_dev_ste(domain, old_domain, dev, &ste,
-				STRTAB_STE_1_S1DSS_TERMINATE);
+				STRTAB_STE_1_S1DSS_TERMINATE, false);
 	return 0;
 }
 
@@ -3521,6 +3539,40 @@ static void arm_smmu_remove_master(struct arm_smmu_master *master)
 	kfree(master->streams);
 }
 
+static int arm_smmu_master_prepare_ats(struct arm_smmu_master *master)
+{
+	bool s1p = master->smmu->features & ARM_SMMU_FEAT_TRANS_S1;
+	unsigned int stu = __ffs(master->smmu->pgsize_bitmap);
+	struct pci_dev *pdev = to_pci_dev(master->dev);
+	int ret;
+
+	if (!arm_smmu_ats_supported(master))
+		return 0;
+
+	if (!pci_ats_always_on(pdev))
+		goto out_prepare;
+
+	/*
+	 * S1DSS is required for ATS to be always on for identity domain cases.
+	 * However, the S1DSS field is ignored if !IDR0_S1P or !IDR1_SSIDSIZE.
+	 */
+	if (!s1p || !master->smmu->ssid_bits) {
+		dev_info_once(master->dev,
+			      "SMMU doesn't support ATS to be always on\n");
+		goto out_prepare;
+	}
+
+	master->ats_always_on = true;
+
+	ret = arm_smmu_alloc_cd_tables(master);
+	if (ret)
+		return ret;
+
+out_prepare:
+	pci_prepare_ats(pdev, stu);
+	return 0;
+}
+
 static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 {
 	int ret;
@@ -3569,14 +3621,14 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 	    smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
 		master->stall_enabled = true;
 
-	if (dev_is_pci(dev)) {
-		unsigned int stu = __ffs(smmu->pgsize_bitmap);
-
-		pci_prepare_ats(to_pci_dev(dev), stu);
-	}
+	ret = arm_smmu_master_prepare_ats(master);
+	if (ret)
+		goto err_disable_pasid;
 
 	return &smmu->iommu;
 
+err_disable_pasid:
+	arm_smmu_disable_pasid(master);
 err_free_master:
 	kfree(master);
 	return ERR_PTR(ret);
-- 
2.43.0
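
The three decisions this patch makes (whether the SMMU can honor "always on"
at all, how large the CD table must be, and which STE mode an identity
attach ends up in) can be summarized in a standalone model. This is a sketch
under stated assumptions, not driver code: struct fake_master, enum ste_mode
and all four functions are hypothetical names, and the model ignores
everything else the real attach path does.

```c
#include <assert.h>
#include <stdbool.h>

enum ste_mode {
	STE_MODE_S1_S1DSS_BYPASS,	/* mode 1: Cfg=Translate, S1DSS=Bypass, EATS=1 */
	STE_MODE_CFG_BYPASS,		/* mode 2: Cfg=bypass, EATS ignored by HW */
};

/* Stand-in for the arm_smmu_master fields that matter here. */
struct fake_master {
	unsigned int ssid_bits;		/* 0 when the device has no PASID support */
	unsigned int ssids_in_use;	/* number of active non-default substreams */
	bool ats_always_on;
};

/*
 * Probe-time gate: S1DSS is ignored by HW unless stage 1 and substreams
 * are both implemented, so "always on" cannot be honored without them.
 */
static bool smmu_can_do_ats_always_on(bool s1p, unsigned int smmu_ssid_bits)
{
	return s1p && smmu_ssid_bits != 0;
}

/* S1CDMAX is clamped to at least 1 so that S1DSS stays effective. */
static unsigned int cd_table_s1cdmax(const struct fake_master *m)
{
	unsigned int s1cdmax;

	assert(m);
	s1cdmax = m->ssid_bits;
	if (m->ats_always_on && s1cdmax < 1)
		s1cdmax = 1;
	return s1cdmax;
}

/* Number of CD table entries implied by S1CDMAX. */
static unsigned int cd_table_max_contexts(const struct fake_master *m)
{
	return 1u << cd_table_s1cdmax(m);
}

/* STE mode chosen when the default substream attaches to an identity domain. */
static enum ste_mode identity_ste_mode(const struct fake_master *m)
{
	assert(m);
	if (m->ats_always_on || m->ssids_in_use)
		return STE_MODE_S1_S1DSS_BYPASS;
	return STE_MODE_CFG_BYPASS;
}
```

For a master with ssid_bits of 0 and ats_always_on set, the model yields an
S1CDMAX of 1 and a two-entry CD table, which corresponds to the dummy SSID=1
entry described in the commit message.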



* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-17  4:56 ` [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices Nicolin Chen
@ 2026-01-19 17:58   ` Jason Gunthorpe
  2026-01-21  8:01   ` Tian, Kevin
  1 sibling, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-19 17:58 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, bhelgaas, joro, praan, baolu.lu, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel, linux-pci

On Fri, Jan 16, 2026 at 08:56:40PM -0800, Nicolin Chen wrote:
> Controlled by the IOMMU driver, ATS is usually enabled "on demand", when a
> device requests a translation service from its associated IOMMU HW running
> on the channel of a given PASID. This is working even when a device has no
> translation on its RID, i.e. RID is IOMMU bypassed.

I would add here that this is done to allow optimizing devices running
in IDENTITY translation, as there is no point in using ATS to return
the same value the device already has.

> For instance, the CXL spec notes in "3.2.5.13 Memory Type on CXL.cache":
> "To source requests on CXL.cache, devices need to get the Host Physical
>  Address (HPA) from the Host by means of an ATS request on CXL.io."
> In other words, the CXL.cache capability relies on ATS. Otherwise, it won't
> have access to the host physical memory.
> 
> Introduce a new pci_ats_always_on() for IOMMU driver to scan a PCI device,
> to shift ATS policies between "on demand" and "always on".
> 
> Add the support for CXL.cache devices first. Non-CXL devices will be added
> in quirks.c file.
> 
> Suggested-by: Vikram Sethi <vsethi@nvidia.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  include/linux/pci-ats.h       |  3 +++
>  include/uapi/linux/pci_regs.h |  5 ++++
>  drivers/pci/ats.c             | 44 +++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+)

This implementation looks OK to me

Thanks,
Jason


* Re: [PATCH RFCv1 2/3] PCI: Allow ATS to be always on for non-CXL NVIDIA GPUs
  2026-01-17  4:56 ` [PATCH RFCv1 2/3] PCI: Allow ATS to be always on for non-CXL NVIDIA GPUs Nicolin Chen
@ 2026-01-19 18:00   ` Jason Gunthorpe
  2026-01-19 18:09     ` Nicolin Chen
  0 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-19 18:00 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, bhelgaas, joro, praan, baolu.lu, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel, linux-pci

On Fri, Jan 16, 2026 at 08:56:41PM -0800, Nicolin Chen wrote:
> Some non-CXL NVIDIA GPU devices support non-PASID ATS function when their
> RIDs are IOMMU bypassed. This is slightly different than the default ATS
> policy which would only enable ATS on demand: when a non-zero PASID line
> is enabled in SVA use cases.

Not support, require non-PASID ATS.

I've been describing these devices as pre-CXL, in that they have many
CXL like properties, including what motivated the prior patch, but do
not implement the CXL config space.

> +/* Some non-CXL devices support ATS on RID when it is IOMMU-bypassed */

Require not support

This also looks OK to me

Jason


* Re: [PATCH RFCv1 2/3] PCI: Allow ATS to be always on for non-CXL NVIDIA GPUs
  2026-01-19 18:00   ` Jason Gunthorpe
@ 2026-01-19 18:09     ` Nicolin Chen
  0 siblings, 0 replies; 60+ messages in thread
From: Nicolin Chen @ 2026-01-19 18:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: will, robin.murphy, bhelgaas, joro, praan, baolu.lu, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel, linux-pci

On Mon, Jan 19, 2026 at 02:00:26PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 16, 2026 at 08:56:41PM -0800, Nicolin Chen wrote:
> > Some non-CXL NVIDIA GPU devices support non-PASID ATS function when their
> > RIDs are IOMMU bypassed. This is slightly different than the default ATS
> > policy which would only enable ATS on demand: when a non-zero PASID line
> > is enabled in SVA use cases.
> 
> Not support, require non-PASID ATS.
> 
> I've been describing these devices as pre-CXL, in that they have many
> CXL like properties, including what motivated the prior patch, but do
> not implement the CXL config space.
> 
> > +/* Some non-CXL devices support ATS on RID when it is IOMMU-bypassed */
> 
> Require not support

I will fix these.

Thanks!
Nicolin


* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-17  4:56 ` [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on Nicolin Chen
@ 2026-01-19 20:06   ` Jason Gunthorpe
  2026-01-26 12:39   ` Will Deacon
  1 sibling, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-19 20:06 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: will, robin.murphy, bhelgaas, joro, praan, baolu.lu, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel, linux-pci

On Fri, Jan 16, 2026 at 08:56:42PM -0800, Nicolin Chen wrote:
> +static int arm_smmu_master_prepare_ats(struct arm_smmu_master *master)
> +{
> +	bool s1p = master->smmu->features & ARM_SMMU_FEAT_TRANS_S1;
> +	unsigned int stu = __ffs(master->smmu->pgsize_bitmap);
> +	struct pci_dev *pdev = to_pci_dev(master->dev);
> +	int ret;
> +
> +	if (!arm_smmu_ats_supported(master))
> +		return 0;
> +
> +	if (!pci_ats_always_on(pdev))
> +		goto out_prepare;
> +
> +	/*
> +	 * S1DSS is required for ATS to be always on for identity domain cases.
> +	 * However, the S1DSS field is ignored if !IDR0_S1P or !IDR1_SSIDSIZE.
> +	 */
> +	if (!s1p || !master->smmu->ssid_bits) {
> +		dev_info_once(master->dev,
> +			      "SMMU doesn't support ATS to be always on\n");
> +		goto out_prepare;
> +	}

It looks right, IDK if Will would prefer a formal ARM_SMMU_FEAT_S1DSS
though.

Jason


* RE: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-17  4:56 ` [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices Nicolin Chen
  2026-01-19 17:58   ` Jason Gunthorpe
@ 2026-01-21  8:01   ` Tian, Kevin
  2026-01-21 10:03     ` Jonathan Cameron
  1 sibling, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2026-01-21  8:01 UTC (permalink / raw)
  To: Nicolin Chen, jgg@nvidia.com, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, Williams, Dan J
  Cc: joro@8bytes.org, praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org

+Dan. I recall an offline discussion in which he raised a concern about
having the kernel blindly enable ATS for CXL.cache devices instead of
creating a knob for the admin to configure from userspace (in case
security is viewed as more important than functionality, given that this
allows DMA to read data out of CPU caches)...

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, January 17, 2026 12:57 PM
> 
> Controlled by the IOMMU driver, ATS is usually enabled "on demand", when a
> device requests a translation service from its associated IOMMU HW running
> on the channel of a given PASID. This is working even when a device has no
> translation on its RID, i.e. RID is IOMMU bypassed.
> 
> On the other hand, certain PCIe devices require non-PASID ATS, when their RID
> stream is IOMMU bypassed. Call this "always on".
> 
> For instance, the CXL spec notes in "3.2.5.13 Memory Type on CXL.cache":
> "To source requests on CXL.cache, devices need to get the Host Physical
>  Address (HPA) from the Host by means of an ATS request on CXL.io."
> In other words, the CXL.cache capability relies on ATS. Otherwise, it won't
> have access to the host physical memory.
> 
> Introduce a new pci_ats_always_on() for IOMMU driver to scan a PCI device,
> to shift ATS policies between "on demand" and "always on".
> 
> Add the support for CXL.cache devices first. Non-CXL devices will be added
> in quirks.c file.
> 
> Suggested-by: Vikram Sethi <vsethi@nvidia.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  include/linux/pci-ats.h       |  3 +++
>  include/uapi/linux/pci_regs.h |  5 ++++
>  drivers/pci/ats.c             | 44 +++++++++++++++++++++++++++++++++++
>  3 files changed, 52 insertions(+)
> 
> diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
> index 75c6c86cf09d..d14ba727d38b 100644
> --- a/include/linux/pci-ats.h
> +++ b/include/linux/pci-ats.h
> @@ -12,6 +12,7 @@ int pci_prepare_ats(struct pci_dev *dev, int ps);
>  void pci_disable_ats(struct pci_dev *dev);
>  int pci_ats_queue_depth(struct pci_dev *dev);
>  int pci_ats_page_aligned(struct pci_dev *dev);
> +bool pci_ats_always_on(struct pci_dev *dev);
>  #else /* CONFIG_PCI_ATS */
>  static inline bool pci_ats_supported(struct pci_dev *d)
>  { return false; }
> @@ -24,6 +25,8 @@ static inline int pci_ats_queue_depth(struct pci_dev *d)
>  { return -ENODEV; }
>  static inline int pci_ats_page_aligned(struct pci_dev *dev)
>  { return 0; }
> +static inline bool pci_ats_always_on(struct pci_dev *dev)
> +{ return false; }
>  #endif /* CONFIG_PCI_ATS */
> 
>  #ifdef CONFIG_PCI_PRI
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 3add74ae2594..84da6d7645a3 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1258,6 +1258,11 @@
>  #define PCI_DVSEC_CXL_PORT_CTL				0x0c
>  #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
> 
> +/* CXL 2.0 8.1.3: PCIe DVSEC for CXL Device */
> +#define CXL_DVSEC_PCIE_DEVICE				0
> +#define   CXL_DVSEC_CAP_OFFSET				0xA
> +#define     CXL_DVSEC_CACHE_CAPABLE			BIT(0)
> +
>  /* Integrity and Data Encryption Extended Capability */
>  #define PCI_IDE_CAP			0x04
>  #define  PCI_IDE_CAP_LINK		0x1  /* Link IDE Stream Supported */
> diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> index ec6c8dbdc5e9..1795131f0697 100644
> --- a/drivers/pci/ats.c
> +++ b/drivers/pci/ats.c
> @@ -205,6 +205,50 @@ int pci_ats_page_aligned(struct pci_dev *pdev)
>  	return 0;
>  }
> 
> +/*
> + * CXL r4.0, sec 3.2.5.13 Memory Type on CXL.cache notes: to source requests on
> + * CXL.cache, devices need to get the Host Physical Address (HPA) from the Host
> + * by means of an ATS request on CXL.io.
> + *
> + * In other words, CXL.cache devices cannot access physical memory without ATS.
> + */
> +static bool pci_cxl_ats_always_on(struct pci_dev *pdev)
> +{
> +	int offset;
> +	u16 cap;
> +
> +	offset = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> +					   CXL_DVSEC_PCIE_DEVICE);
> +	if (!offset)
> +		return false;
> +
> +	pci_read_config_word(pdev, offset + CXL_DVSEC_CAP_OFFSET, &cap);
> +	if (cap & CXL_DVSEC_CACHE_CAPABLE)
> +		return true;
> +
> +	return false;
> +}
> +
> +/**
> + * pci_ats_always_on - Whether the PCI device requires ATS to be always enabled
> + * @pdev: the PCI device
> + *
> + * Returns true, if the PCI device requires non-PASID ATS function on an IOMMU
> + * bypassed configuration.
> + */
> +bool pci_ats_always_on(struct pci_dev *pdev)
> +{
> +	if (pci_ats_disabled() || !pci_ats_supported(pdev))
> +		return false;
> +
> +	/* A VF inherits its PF's requirement for ATS function */
> +	if (pdev->is_virtfn)
> +		pdev = pci_physfn(pdev);
> +
> +	return pci_cxl_ats_always_on(pdev);
> +}
> +EXPORT_SYMBOL_GPL(pci_ats_always_on);
> +
>  #ifdef CONFIG_PCI_PRI
>  void pci_pri_init(struct pci_dev *pdev)
>  {
> --
> 2.43.0



* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-21  8:01   ` Tian, Kevin
@ 2026-01-21 10:03     ` Jonathan Cameron
  2026-01-21 13:03       ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Jonathan Cameron @ 2026-01-21 10:03 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Nicolin Chen, jgg@nvidia.com, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, Williams, Dan J,
	joro@8bytes.org, praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl

On Wed, 21 Jan 2026 08:01:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> +Dan. I recalled an offline discussion in which he raised concern on
> having the kernel blindly enable ATS for cxl.cache device instead of
> creating a knob for admin to configure from userspace (in case
> security is viewed more important than functionality, upon allowing
> DMA to read data out of CPU caches)...
> 

+CC Linux-cxl

Jonathan


> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, January 17, 2026 12:57 PM
> > 
> > Controlled by the IOMMU driver, ATS is usually enabled "on demand", when a
> > device requests a translation service from its associated IOMMU HW running
> > on the channel of a given PASID. This is working even when a device has no
> > translation on its RID, i.e. RID is IOMMU bypassed.
> > 
> > On the other hand, certain PCIe devices require non-PASID ATS when their RID
> > stream is IOMMU bypassed. Call this "always on".
> > 
> > For instance, the CXL spec notes in "3.2.5.13 Memory Type on CXL.cache":
> > "To source requests on CXL.cache, devices need to get the Host Physical
> >  Address (HPA) from the Host by means of an ATS request on CXL.io."
> > In other words, the CXL.cache capability relies on ATS. Otherwise, it won't
> > have access to the host physical memory.
> > 
> > Introduce a new pci_ats_always_on() for IOMMU driver to scan a PCI device,
> > to shift ATS policies between "on demand" and "always on".
> > 
> > Add the support for CXL.cache devices first. Non-CXL devices will be added
> > in quirks.c file.
> > 
> > Suggested-by: Vikram Sethi <vsethi@nvidia.com>
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> > ---
> >  include/linux/pci-ats.h       |  3 +++
> >  include/uapi/linux/pci_regs.h |  5 ++++
> >  drivers/pci/ats.c             | 44 +++++++++++++++++++++++++++++++++++
> >  3 files changed, 52 insertions(+)
> > 
> > diff --git a/include/linux/pci-ats.h b/include/linux/pci-ats.h
> > index 75c6c86cf09d..d14ba727d38b 100644
> > --- a/include/linux/pci-ats.h
> > +++ b/include/linux/pci-ats.h
> > @@ -12,6 +12,7 @@ int pci_prepare_ats(struct pci_dev *dev, int ps);
> >  void pci_disable_ats(struct pci_dev *dev);
> >  int pci_ats_queue_depth(struct pci_dev *dev);
> >  int pci_ats_page_aligned(struct pci_dev *dev);
> > +bool pci_ats_always_on(struct pci_dev *dev);
> >  #else /* CONFIG_PCI_ATS */
> >  static inline bool pci_ats_supported(struct pci_dev *d)
> >  { return false; }
> > @@ -24,6 +25,8 @@ static inline int pci_ats_queue_depth(struct pci_dev *d)
> >  { return -ENODEV; }
> >  static inline int pci_ats_page_aligned(struct pci_dev *dev)
> >  { return 0; }
> > +static inline bool pci_ats_always_on(struct pci_dev *dev)
> > +{ return false; }
> >  #endif /* CONFIG_PCI_ATS */
> > 
> >  #ifdef CONFIG_PCI_PRI
> > diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> > index 3add74ae2594..84da6d7645a3 100644
> > --- a/include/uapi/linux/pci_regs.h
> > +++ b/include/uapi/linux/pci_regs.h
> > @@ -1258,6 +1258,11 @@
> >  #define PCI_DVSEC_CXL_PORT_CTL				0x0c
> >  #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
> > 
> > +/* CXL 2.0 8.1.3: PCIe DVSEC for CXL Device */
> > +#define CXL_DVSEC_PCIE_DEVICE				0
> > +#define   CXL_DVSEC_CAP_OFFSET				0xA
> > +#define     CXL_DVSEC_CACHE_CAPABLE			BIT(0)
> > +
> >  /* Integrity and Data Encryption Extended Capability */
> >  #define PCI_IDE_CAP			0x04
> >  #define  PCI_IDE_CAP_LINK		0x1  /* Link IDE Stream Supported */
> > diff --git a/drivers/pci/ats.c b/drivers/pci/ats.c
> > index ec6c8dbdc5e9..1795131f0697 100644
> > --- a/drivers/pci/ats.c
> > +++ b/drivers/pci/ats.c
> > @@ -205,6 +205,50 @@ int pci_ats_page_aligned(struct pci_dev *pdev)
> >  	return 0;
> >  }
> > 
> > +/*
> > + * CXL r4.0, sec 3.2.5.13 Memory Type on CXL.cache notes: to source requests on
> > + * CXL.cache, devices need to get the Host Physical Address (HPA) from the Host
> > + * by means of an ATS request on CXL.io.
> > + *
> > + * In other words, CXL.cache devices cannot access physical memory without ATS.
> > + */
> > +static bool pci_cxl_ats_always_on(struct pci_dev *pdev)
> > +{
> > +	int offset;
> > +	u16 cap;
> > +
> > +	offset = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
> > +					   CXL_DVSEC_PCIE_DEVICE);
> > +	if (!offset)
> > +		return false;
> > +
> > +	pci_read_config_word(pdev, offset + CXL_DVSEC_CAP_OFFSET, &cap);
> > +	if (cap & CXL_DVSEC_CACHE_CAPABLE)
> > +		return true;
> > +
> > +	return false;
> > +}
> > +
> > +/**
> > + * pci_ats_always_on - Whether the PCI device requires ATS to be always enabled
> > + * @pdev: the PCI device
> > + *
> > + * Returns true, if the PCI device requires non-PASID ATS function on an IOMMU
> > + * bypassed configuration.
> > + */
> > +bool pci_ats_always_on(struct pci_dev *pdev)
> > +{
> > +	if (pci_ats_disabled() || !pci_ats_supported(pdev))
> > +		return false;
> > +
> > +	/* A VF inherits its PF's requirement for ATS function */
> > +	if (pdev->is_virtfn)
> > +		pdev = pci_physfn(pdev);
> > +
> > +	return pci_cxl_ats_always_on(pdev);
> > +}
> > +EXPORT_SYMBOL_GPL(pci_ats_always_on);
> > +
> >  #ifdef CONFIG_PCI_PRI
> >  void pci_pri_init(struct pci_dev *pdev)
> >  {
> > --
> > 2.43.0  
> 
> 


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-21 10:03     ` Jonathan Cameron
@ 2026-01-21 13:03       ` Jason Gunthorpe
  2026-01-22  1:17         ` Baolu Lu
                           ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-21 13:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Tian, Kevin, Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, Williams, Dan J, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl

On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> On Wed, 21 Jan 2026 08:01:36 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > +Dan. I recalled an offline discussion in which he raised concern on
> > having the kernel blindly enable ATS for cxl.cache device instead of
> > creating a knob for admin to configure from userspace (in case
> > security is viewed more important than functionality, upon allowing
> > DMA to read data out of CPU caches)...
> > 
> 
> +CC Linux-cxl

A cxl.cache device supporting ATS will automatically enable ATS today
if the kernel option to enable translation is set.

Even if the device is marked untrusted by the PCI layer (eg an
external port).

Yes this is effectively a security issue, but it is not really a CXL
specific problem.

We might prefer to not enable ATS for untrusted devices and then fail to
load drivers for "ats always on" cases.

Or maybe we can enable one of the ATS security features someday,
though I wonder if those work for CXL..

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-21 13:03       ` Jason Gunthorpe
@ 2026-01-22  1:17         ` Baolu Lu
  2026-01-22 13:15           ` Jason Gunthorpe
  2026-01-22  5:44         ` dan.j.williams
  2026-01-22 10:24         ` Alejandro Lucero Palau
  2 siblings, 1 reply; 60+ messages in thread
From: Baolu Lu @ 2026-01-22  1:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Jonathan Cameron
  Cc: Tian, Kevin, Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, Williams, Dan J, joro@8bytes.org,
	praan@google.com, miko.lenczewski@arm.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-cxl

On 1/21/26 21:03, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
>> On Wed, 21 Jan 2026 08:01:36 +0000
>> "Tian, Kevin"<kevin.tian@intel.com> wrote:
>>
>>> +Dan. I recalled an offline discussion in which he raised concern on
>>> having the kernel blindly enable ATS for cxl.cache device instead of
>>> creating a knob for admin to configure from userspace (in case
>>> security is viewed more important than functionality, upon allowing
>>> DMA to read data out of CPU caches)...
>>>
>> +CC Linux-cxl
> A cxl.cache device supporting ATS will automatically enable ATS today
> if the kernel option to enable translation is set.
> 
> Even if the device is marked untrusted by the PCI layer (eg an
> external port).

I don't follow here. The untrusted check is now in pci_ats_supported():

/**
  * pci_ats_supported - check if the device can use ATS
  * @dev: the PCI device
  *
  * Returns true if the device supports ATS and is allowed to use it, false
  * otherwise.
  */
bool pci_ats_supported(struct pci_dev *dev)
{
         if (!dev->ats_cap)
                 return false;

         return (dev->untrusted == 0);
}
EXPORT_SYMBOL_GPL(pci_ats_supported);

The iommu drivers (intel/amd/arm-smmuv3) all call pci_ats_supported()
before enabling ATS on a device. Anything I missed?

> 
> Yes this is effectively a security issue, but it is not really a CXL
> specific problem.
> 
> We might prefer to not enable ATS for untrusted devices and then fail to
> load drivers for "ats always on" cases.
> 
> Or maybe we can enable one of the ATS security features someday,
> though I wonder if those work for CXL..

Thanks,
baolu

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-21 13:03       ` Jason Gunthorpe
  2026-01-22  1:17         ` Baolu Lu
@ 2026-01-22  5:44         ` dan.j.williams
  2026-01-22 13:14           ` Jason Gunthorpe
  2026-01-22 10:24         ` Alejandro Lucero Palau
  2 siblings, 1 reply; 60+ messages in thread
From: dan.j.williams @ 2026-01-22  5:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Jonathan Cameron
  Cc: Tian, Kevin, Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, Williams, Dan J, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl

Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> > On Wed, 21 Jan 2026 08:01:36 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > 
> > > +Dan. I recalled an offline discussion in which he raised concern on
> > > having the kernel blindly enable ATS for cxl.cache device instead of
> > > creating a knob for admin to configure from userspace (in case
> > > security is viewed more important than functionality, upon allowing
> > > DMA to read data out of CPU caches)...
> > > 
> > 
> > +CC Linux-cxl
> 
> A cxl.cache device supporting ATS will automatically enable ATS today
> if the kernel option to enable translation is set.
> 
> Even if the device is marked untrusted by the PCI layer (eg an
> external port).
> 
> Yes this is effectively a security issue, but it is not really a CXL
> specific problem.

My contention is that it is a worse or at least different problem in the
CXL case because now you have a new toolkit in an attack that wants to
exfiltrate data from CPU caches.

> We might prefer to not enable ATS for untrusted devices and then fail to
> load drivers for "ats always on" cases.

The current PCI untrusted flag is not fit for purpose in this new age of
PCI device authentication and CXL.cache capable devices.

> Or maybe we can enable one of the ATS security features someday,
> though I wonder if those work for CXL..

It should work, but before that I do not see the justification to say
effectively:

"We have a less than perfect legacy way (PCI untrusted flag) to nod at
 ATS security problems. Let us ignore even that for a new class of
 devices that advertise they can trigger all the old security problems
 plus new ones."

I do not immediately see what is wrong with requiring userspace policy
opt-in. That naturally gets replaced by installing the device's
certificate (for native PCI CMA), authenticating the device with the
TSM (for PCI IDE), or obviated by secure-ATS if that arrives.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-21 13:03       ` Jason Gunthorpe
  2026-01-22  1:17         ` Baolu Lu
  2026-01-22  5:44         ` dan.j.williams
@ 2026-01-22 10:24         ` Alejandro Lucero Palau
  2 siblings, 0 replies; 60+ messages in thread
From: Alejandro Lucero Palau @ 2026-01-22 10:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Jonathan Cameron
  Cc: Tian, Kevin, Nicolin Chen, will@kernel.org, robin.murphy@arm.com,
	bhelgaas@google.com, Williams, Dan J, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl


On 1/21/26 13:03, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
>> On Wed, 21 Jan 2026 08:01:36 +0000
>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>
>>> +Dan. I recalled an offline discussion in which he raised concern on
>>> having the kernel blindly enable ATS for cxl.cache device instead of
>>> creating a knob for admin to configure from userspace (in case
>>> security is viewed more important than functionality, upon allowing
>>> DMA to read data out of CPU caches)...
>>>
>> +CC Linux-cxl
> A cxl.cache device supporting ATS will automatically enable ATS today
> if the kernel option to enable translation is set.
>
> Even if the device is marked untrusted by the PCI layer (eg an
> external port).
>
> Yes this is effectively a security issue, but it is not really a CXL
> specific problem.
>
> We might prefer to not enable ATS for untrusted devices and then fail to
> load drivers for "ats always on" cases.
>
> Or maybe we can enable one of the ATS security features someday,
> though I wonder if those work for CXL..


I raised my concerns about CXL.cache and virtualization at LPC: 
https://lpc.events/event/19/contributions/2173/attachments/1842/3940/LPC_2025_CXL_CACHE.pdf


I lay out some concerns there, although I admit some could stem from my
own misreading of what the CXL spec states. Regarding IOMMU/ATS, my view
is that ATS is not safe enough, which I guess is a matter of opinion:
trusting a device because the vendor confirms it is an "official" device
is not enough for paranoid mode, given that vendors are subject to
actions/pressure from government "agencies". This links to what I think
Jason points out about ATS security features, where the IOMMU hardware
can be configured to check those translated PCIe accesses as well, if
the paranoid host owner/admin decides so. With CXL.cache that is not
possible, since the route is through a different link and, AFAIK,
current implementations do not support anything like this. I think it
could be implemented without impacting the gains from CXL.cache, but
that is another story.


So, FWIW, I think it should not be enabled by default.


Thank you,

Alejandro



> Jason
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-22  5:44         ` dan.j.williams
@ 2026-01-22 13:14           ` Jason Gunthorpe
  2026-01-22 16:29             ` Nicolin Chen
  2026-01-22 19:46             ` dan.j.williams
  0 siblings, 2 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-22 13:14 UTC (permalink / raw)
  To: dan.j.williams
  Cc: Jonathan Cameron, Tian, Kevin, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl

On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com wrote:
> Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> > > On Wed, 21 Jan 2026 08:01:36 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > 
> > > > +Dan. I recalled an offline discussion in which he raised concern on
> > > > having the kernel blindly enable ATS for cxl.cache device instead of
> > > > creating a knob for admin to configure from userspace (in case
> > > > security is viewed more important than functionality, upon allowing
> > > > DMA to read data out of CPU caches)...
> > > > 
> > > 
> > > +CC Linux-cxl
> > 
> > A cxl.cache device supporting ATS will automatically enable ATS today
> > if the kernel option to enable translation is set.
> > 
> > Even if the device is marked untrusted by the PCI layer (eg an
> > external port).
> > 
> > Yes this is effectively a security issue, but it is not really a CXL
> > specific problem.
> 
> My contention is that it is a worse or at least different problem in the
> CXL case because now you have a new toolkit in an attack that wants to
> exfiltrate data from CPU caches.

?? I don't see CXL as meaningfully different than PCI in terms of what
data can be accessed with Translated requests. If the IOMMU doesn't
block Translated requests the whole system is open. CXL doesn't make
it more open.

> "We have a less than perfect legacy way (PCI untrusted flag) to nod at
>  ATS security problems. Let us ignore even that for a new class of
>  devices that advertise they can trigger all the old security problems
>  plus new ones."

Ah, I missed that we are already force disabling ATS in this untrusted
case, so we should ensure that continues to be the case here
too. Nicolin does it need a change?

> I do not immediately see what is wrong with requiring userspace policy
> opt-in. That naturally gets replaced by installing the device's
> certificate (for native PCI CMA), authenticating the device with the
> TSM (for PCI IDE), or obviated by secure-ATS if that arrives.

I think that goes back to the discussion about not loading drivers
before validating the device.

It would also make a lot of sense to leave the IOMMU blocking until the
driver is loaded for these secure situations. The blocking translation
should block ATS too.

Then the flow you are describing will work well:

1) At pre-boot the IOMMU will block all DMA including Translated.
2) The OS activates the IOMMU driver and keeps blocking.
3) Instead of immediately binding a default domain the IOMMU core
   leaves the translation blocking.
4) The OS defers loading the driver to userspace.
5) Userspace measures the device and "accepts" it by loading the
   driver
6) IOMMU core attaches a non-blocking default domain and activates ATS
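
Roughly, the ordering above could be modelled in userspace like this.
It is only a sketch of the flow, not kernel code: every state and
function name below is made up for illustration, and none of this
exists in the IOMMU core today.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device states for the deferred-acceptance flow. */
enum dev_dma_state {
	DMA_BLOCKED_PREBOOT,	/* step 1: pre-boot leaves the IOMMU blocking */
	DMA_BLOCKED_OS,		/* steps 2-4: IOMMU driver active, still blocking */
	DMA_ACCEPTED,		/* step 5: userspace measured and accepted */
	DMA_ATTACHED,		/* step 6: non-blocking default domain attached */
};

struct dev_model {
	enum dev_dma_state state;
	bool ats_capable;
	bool ats_enabled;
};

/* Steps 2-3: the OS takes over but keeps the blocking translation. */
static void os_iommu_probe(struct dev_model *d)
{
	assert(d->state == DMA_BLOCKED_PREBOOT);
	d->state = DMA_BLOCKED_OS;
}

/* Step 5: userspace policy accepts the device after measuring it. */
static bool userspace_accept(struct dev_model *d, bool measurement_ok)
{
	if (d->state != DMA_BLOCKED_OS || !measurement_ok)
		return false;
	d->state = DMA_ACCEPTED;
	return true;
}

/* Step 6: only an accepted device gets a default domain and ATS. */
static bool attach_default_domain(struct dev_model *d)
{
	if (d->state != DMA_ACCEPTED)
		return false;
	d->state = DMA_ATTACHED;
	d->ats_enabled = d->ats_capable;
	return true;
}

/* All DMA, Translated included, stays blocked until step 6 completes. */
static bool dma_allowed(const struct dev_model *d)
{
	return d->state == DMA_ATTACHED;
}
```

The point of the model is that there is no path to ats_enabled that
skips the userspace acceptance step.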

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-22  1:17         ` Baolu Lu
@ 2026-01-22 13:15           ` Jason Gunthorpe
  0 siblings, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-22 13:15 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Jonathan Cameron, Tian, Kevin, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, Williams, Dan J,
	joro@8bytes.org, praan@google.com, miko.lenczewski@arm.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-cxl

On Thu, Jan 22, 2026 at 09:17:27AM +0800, Baolu Lu wrote:
> On 1/21/26 21:03, Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> > > On Wed, 21 Jan 2026 08:01:36 +0000
> > > "Tian, Kevin"<kevin.tian@intel.com> wrote:
> > > 
> > > > +Dan. I recalled an offline discussion in which he raised concern on
> > > > having the kernel blindly enable ATS for cxl.cache device instead of
> > > > creating a knob for admin to configure from userspace (in case
> > > > security is viewed more important than functionality, upon allowing
> > > > DMA to read data out of CPU caches)...
> > > > 
> > > +CC Linux-cxl
> > A cxl.cache device supporting ATS will automatically enable ATS today
> > if the kernel option to enable translation is set.
> > 
> > Even if the device is marked untrusted by the PCI layer (eg an
> > external port).
> 
> I don't follow here. The untrusted check is now in pci_ats_supported():
> 
> /**
>  * pci_ats_supported - check if the device can use ATS
>  * @dev: the PCI device
>  *
>  * Returns true if the device supports ATS and is allowed to use it, false
>  * otherwise.
>  */
> bool pci_ats_supported(struct pci_dev *dev)
> {
>         if (!dev->ats_cap)
>                 return false;
> 
>         return (dev->untrusted == 0);
> }
> EXPORT_SYMBOL_GPL(pci_ats_supported);
> 
> The iommu drivers (intel/amd/arm-smmuv3) all call pci_ats_supported()
> before enabling ATS on a device. Anything I missed?

No, not at all, I forgot about this!

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-22 13:14           ` Jason Gunthorpe
@ 2026-01-22 16:29             ` Nicolin Chen
  2026-01-22 16:58               ` Jason Gunthorpe
  2026-01-22 19:46             ` dan.j.williams
  1 sibling, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-01-22 16:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Jonathan Cameron, Tian, Kevin, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl

On Thu, Jan 22, 2026 at 09:14:32AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com wrote:
> > "We have a less than perfect legacy way (PCI untrusted flag) to nod at
> >  ATS security problems. Let us ignore even that for a new class of
> >  devices that advertise they can trigger all the old security problems
> >  plus new ones."
> 
> Ah, I missed that we are already force disabling ATS in this untrusted
> case, so we should ensure that continues to be the case here
> too. Nicolin does it need a change?

pci_ats_always_on() returns false when !pci_ats_supported(pdev), so
untrusted devices can never be always on.
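
To spell out the ordering, here is a small userspace model of the
check. It is a sketch only: the struct and the model_*() names are
hypothetical stand-ins for struct pci_dev and the real helpers, and
the VF-to-PF step is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few struct pci_dev fields that matter. */
struct pdev_model {
	bool ats_cap;		/* device has an ATS capability */
	bool untrusted;		/* e.g. behind an external port */
	bool cxl_cache_capable;	/* CXL DVSEC advertises Cache_Capable */
};

/* Mirrors pci_ats_supported(): ATS capable and not untrusted. */
static bool model_ats_supported(const struct pdev_model *p)
{
	assert(p);
	return p->ats_cap && !p->untrusted;
}

/*
 * Mirrors the order of checks in pci_ats_always_on(): the global
 * disable and the untrusted gate come before the CXL.cache test.
 */
static bool model_ats_always_on(const struct pdev_model *p, bool ats_disabled)
{
	if (ats_disabled || !model_ats_supported(p))
		return false;
	return p->cxl_cache_capable;
}
```

So an untrusted CXL.cache device falls out at the first check and is
never reported as always on.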

Perhaps we should highlight this in the commit message, since it came up?

Nicolin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-22 16:29             ` Nicolin Chen
@ 2026-01-22 16:58               ` Jason Gunthorpe
  0 siblings, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-22 16:58 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: dan.j.williams, Jonathan Cameron, Tian, Kevin, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl

On Thu, Jan 22, 2026 at 08:29:10AM -0800, Nicolin Chen wrote:
> On Thu, Jan 22, 2026 at 09:14:32AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com wrote:
> > > "We have a less than perfect legacy way (PCI untrusted flag) to nod at
> > >  ATS security problems. Let us ignore even that for a new class of
> > >  devices that advertise they can trigger all the old security problems
> > >  plus new ones."
> > 
> > Ah, I missed that we are already force disabling ATS in this untrusted
> > case, so we should ensure that continues to be the case here
> > too. Nicolin does it need a change?
> 
> pci_ats_always_on() validates against !pci_ats_supported(pdev), so
> we ensured that untrusted devices would not be always on.
> 
> Perhaps we should highlight in the commit message, as it's a topic?

Yes

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-22 13:14           ` Jason Gunthorpe
  2026-01-22 16:29             ` Nicolin Chen
@ 2026-01-22 19:46             ` dan.j.williams
  2026-01-27  8:10               ` Tian, Kevin
  1 sibling, 1 reply; 60+ messages in thread
From: dan.j.williams @ 2026-01-22 19:46 UTC (permalink / raw)
  To: Jason Gunthorpe, dan.j.williams
  Cc: Jonathan Cameron, Tian, Kevin, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl

Jason Gunthorpe wrote:
> On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com wrote:
> > Jason Gunthorpe wrote:
> > > On Wed, Jan 21, 2026 at 10:03:07AM +0000, Jonathan Cameron wrote:
> > > > On Wed, 21 Jan 2026 08:01:36 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > 
> > > > > +Dan. I recalled an offline discussion in which he raised concern on
> > > > > having the kernel blindly enable ATS for cxl.cache device instead of
> > > > > creating a knob for admin to configure from userspace (in case
> > > > > security is viewed more important than functionality, upon allowing
> > > > > DMA to read data out of CPU caches)...
> > > > > 
> > > > 
> > > > +CC Linux-cxl
> > > 
> > > A cxl.cache device supporting ATS will automatically enable ATS today
> > > if the kernel option to enable translation is set.
> > > 
> > > Even if the device is marked untrusted by the PCI layer (eg an
> > > external port).
> > > 
> > > Yes this is effectively a security issue, but it is not really a CXL
> > > specific problem.
> > 
> > My contention is that it is a worse or at least different problem in the
> > CXL case because now you have a new toolkit in an attack that wants to
> > exfiltrate data from CPU caches.
> 
> ?? I don't see CXL as meaningfully different than PCI in terms of what
> data can be accessed with Translated requests. If the IOMMU doesn't
> block Translated requests the whole system is open. CXL doesn't make
> it more open.

Right, the game is mostly over in the current case, but CXL.cache still
deserves to be treated carefully. Consider a world where we do have limitations
against requests to HPAs that were never translated for the device.  In that
scenario the device can help side channel the contents of HPAs it does not
otherwise have access to by messing with aliased lines it does have access to.

At a minimum CXL.cache is not improving the security story, so no time like the
present to put a policy mechanism in place that improves upon the PCI untrusted
flag.

> > "We have a less than perfect legacy way (PCI untrusted flag) to nod at
> >  ATS security problems. Let us ignore even that for a new class of
> >  devices that advertise they can trigger all the old security problems
> >  plus new ones."
> 
> Ah, I missed that we are already force disabling ATS in this untrusted
> case, so we should ensure that continues to be the case here
> too. Nicolin does it need a change?
> 
> > I do not immediately see what is wrong with requiring userspace policy
> > opt-in. That naturally gets replaced by installing the device's
> > certificate (for native PCI CMA), authenticating the device with the
> > TSM (for PCI IDE), or obviated by secure-ATS if that arrives.
> 
> I think that goes back to the discussion about not loading drivers
> before validating the device.
> 
> It would also make a lot of sense to leave the IOMMU blocking until the
> driver is loaded for these secure situations. The blocking translation
> should block ATS too.
> 
> Then the flow you are describing will work well:
> 
> 1) At pre-boot the IOMMU will block all DMA including Translated.
> 2) The OS activates the IOMMU driver and keeps blocking.
> 3) Instead of immediately binding a default domain the IOMMU core
>    leaves the translation blocking.
> 4) The OS defers loading the driver to userspace.
> 5) Userspace measures the device and "accepts" it by loading the
>    driver
> 6) IOMMU core attaches a non-blocking default domain and activates ATS

That works for me. Give the paranoid the ability to have a point where they can
be assured that the shields were not lowered prematurely.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-17  4:56 ` [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on Nicolin Chen
  2026-01-19 20:06   ` Jason Gunthorpe
@ 2026-01-26 12:39   ` Will Deacon
  2026-01-26 17:20     ` Jason Gunthorpe
  2026-01-26 18:21     ` Nicolin Chen
  1 sibling, 2 replies; 60+ messages in thread
From: Will Deacon @ 2026-01-26 12:39 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: jgg, robin.murphy, bhelgaas, joro, praan, baolu.lu, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel, linux-pci

On Fri, Jan 16, 2026 at 08:56:42PM -0800, Nicolin Chen wrote:
> When a device's default substream attaches to an identity domain, the SMMU
> driver currently sets the device's STE between two modes:
> 
>   Mode 1: Cfg=Translate, S1DSS=Bypass, EATS=1
>   Mode 2: Cfg=bypass (EATS is ignored by HW)
> 
> When there is an active PASID (non-default substream), mode 1 is used. And
> when there is no PASID support or no active PASID, mode 2 is used.
> 
> The driver will also downgrade an STE from mode 1 to mode 2, when the last
> active substream becomes inactive.
> 
> However, there are PCIe devices that demand ATS to be always on. For these
> devices, their STEs have to use the mode 1 as HW ignores EATS with mode 2.
> 
> Change the driver accordingly:
>   - always use the mode 1
>   - never downgrade to mode 2
>   - allocate and retain a CD table (see note below)
> 
> Note that these devices might not support PASID, i.e. doing non-PASID ATS.
> In such a case, the ssid_bits is set to 0. However, s1cdmax must be set to
> a !0 value in order to keep the S1DSS field effective. Thus, when a master
> requires ats_always_on, set its s1cdmax to at least 1, meaning the CD table
> will have a dummy entry (SSID=1) that will never be used.
> 
> Now, for these devices, arm_smmu_cdtab_allocated() will always return true,
> vs. false prior to this change. When the default substream is attached to
> an IDENTITY domain, its first CD is NULL in the table, which is a totally
> valid case. Thus, drop the WARN_ON().
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 74 ++++++++++++++++++---
>  2 files changed, 64 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index ae23aacc3840..2ed68f43347e 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -850,6 +850,7 @@ struct arm_smmu_master {
>  	bool				ats_enabled : 1;
>  	bool				ste_ats_enabled : 1;
>  	bool				stall_enabled;
> +	bool				ats_always_on;
>  	unsigned int			ssid_bits;
>  	unsigned int			iopf_refcount;
>  };
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index d16d35c78c06..5b7deb708636 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1422,7 +1422,7 @@ void arm_smmu_clear_cd(struct arm_smmu_master *master, ioasid_t ssid)
>  	if (!arm_smmu_cdtab_allocated(&master->cd_table))
>  		return;
>  	cdptr = arm_smmu_get_cd_ptr(master, ssid);
> -	if (WARN_ON(!cdptr))
> +	if (!cdptr)
>  		return;

Should we still warn if !master->ats_always_on?
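
Something along these lines, modelled in userspace so it stands alone.
The names are hypothetical stand-ins, and the counter stands in for
WARN_ON(); it is only meant to show the shape of the check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant arm_smmu_master state. */
struct master_model {
	bool ats_always_on;
	unsigned int warnings;	/* counts what would be WARN_ON() hits */
};

/*
 * Models the early return in arm_smmu_clear_cd(): a missing CD pointer
 * is expected for an always-on master, and still a bug otherwise.
 * Returns true if the CD entry would go on to be cleared.
 */
static bool model_clear_cd(struct master_model *m, const void *cdptr)
{
	assert(m);
	if (!cdptr) {
		if (!m->ats_always_on)
			m->warnings++;
		return false;
	}
	return true;
}
```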

>  	arm_smmu_write_cd_entry(master, ssid, cdptr, &target);
>  }
> @@ -1436,6 +1436,22 @@ static int arm_smmu_alloc_cd_tables(struct arm_smmu_master *master)
>  	struct arm_smmu_ctx_desc_cfg *cd_table = &master->cd_table;
>  
>  	cd_table->s1cdmax = master->ssid_bits;
> +
> +	/*
> +	 * When a device doesn't support PASID (non default SSID), ssid_bits is
> +	 * set to 0. This also sets S1CDMAX to 0, which disables the substreams
> +	 * and ignores the S1DSS field.
> +	 *
> +	 * On the other hand, if a device demands ATS to be always on even when
> +	 * its default substream is IOMMU bypassed, it has to use EATS that is
> +	 * only effective with an STE (CFG=S1translate, S1DSS=Bypass). For such
> +	 * use cases, S1CDMAX has to be !0, in order to make use of S1DSS/EATS.
> +	 *
> +	 * Set S1CDMAX no lower than 1. This would add a dummy substream in the
> +	 * CD table but it should never be used by an actual CD.
> +	 */
> +	if (master->ats_always_on)
> +		cd_table->s1cdmax = max_t(u8, cd_table->s1cdmax, 1);
>  	max_contexts = 1 << cd_table->s1cdmax;
>  
>  	if (!(smmu->features & ARM_SMMU_FEAT_2_LVL_CDTAB) ||
> @@ -3189,7 +3205,8 @@ static int arm_smmu_blocking_set_dev_pasid(struct iommu_domain *new_domain,
>  	 * When the last user of the CD table goes away downgrade the STE back
>  	 * to a non-cd_table one, by re-attaching its sid_domain.
>  	 */
> -	if (!arm_smmu_ssids_in_use(&master->cd_table)) {
> +	if (!master->ats_always_on &&
> +	    !arm_smmu_ssids_in_use(&master->cd_table)) {
>  		struct iommu_domain *sid_domain =
>  			iommu_get_domain_for_dev(master->dev);
>  
> @@ -3205,7 +3222,7 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
>  				    struct iommu_domain *old_domain,
>  				    struct device *dev,
>  				    struct arm_smmu_ste *ste,
> -				    unsigned int s1dss)
> +				    unsigned int s1dss, bool ats_always_on)
>  {
>  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);

Can we avoid the 'bool' parameter if possible, please? They tend to make the
callsites pretty horrible to read and you're already passing the 'struct
device *' so you should have the master in hand?

>  	struct arm_smmu_attach_state state = {
> @@ -3224,7 +3241,7 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
>  	 * If the CD table is not in use we can use the provided STE, otherwise
>  	 * we use a cdtable STE with the provided S1DSS.
>  	 */
> -	if (arm_smmu_ssids_in_use(&master->cd_table)) {
> +	if (ats_always_on || arm_smmu_ssids_in_use(&master->cd_table)) {
>  		/*
>  		 * If a CD table has to be present then we need to run with ATS
>  		 * on because we have to assume a PASID is using ATS. For
> @@ -3260,7 +3277,8 @@ static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
>  	arm_smmu_master_clear_vmaster(master);
>  	arm_smmu_make_bypass_ste(master->smmu, &ste);
>  	arm_smmu_attach_dev_ste(domain, old_domain, dev, &ste,
> -				STRTAB_STE_1_S1DSS_BYPASS);
> +				STRTAB_STE_1_S1DSS_BYPASS,
> +				master->ats_always_on);
>  	return 0;
>  }
>  
> @@ -3283,7 +3301,7 @@ static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
>  	arm_smmu_master_clear_vmaster(master);
>  	arm_smmu_make_abort_ste(&ste);
>  	arm_smmu_attach_dev_ste(domain, old_domain, dev, &ste,
> -				STRTAB_STE_1_S1DSS_TERMINATE);
> +				STRTAB_STE_1_S1DSS_TERMINATE, false);
>  	return 0;
>  }
>  
> @@ -3521,6 +3539,40 @@ static void arm_smmu_remove_master(struct arm_smmu_master *master)
>  	kfree(master->streams);
>  }
>  
> +static int arm_smmu_master_prepare_ats(struct arm_smmu_master *master)
> +{
> +	bool s1p = master->smmu->features & ARM_SMMU_FEAT_TRANS_S1;
> +	unsigned int stu = __ffs(master->smmu->pgsize_bitmap);
> +	struct pci_dev *pdev = to_pci_dev(master->dev);
> +	int ret;
> +
> +	if (!arm_smmu_ats_supported(master))
> +		return 0;
> +
> +	if (!pci_ats_always_on(pdev))
> +		goto out_prepare;
> +
> +	/*
> +	 * S1DSS is required for ATS to be always on for identity domain cases.
> +	 * However, the S1DSS field is ignored if !IDR0_S1P or !IDR1_SSIDSIZE.
> +	 */
> +	if (!s1p || !master->smmu->ssid_bits) {
> +		dev_info_once(master->dev,
> +			      "SMMU doesn't support ATS to be always on\n");
> +		goto out_prepare;
> +	}
> +
> +	master->ats_always_on = true;
> +
> +	ret = arm_smmu_alloc_cd_tables(master);
> +	if (ret)
> +		return ret;

Where do you allocate the second level entry for ssid 0 if we're using
2-level cd tables?

Will

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-26 12:39   ` Will Deacon
@ 2026-01-26 17:20     ` Jason Gunthorpe
  2026-01-26 18:40       ` Nicolin Chen
  2026-01-26 18:49       ` Robin Murphy
  2026-01-26 18:21     ` Nicolin Chen
  1 sibling, 2 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-26 17:20 UTC (permalink / raw)
  To: Will Deacon
  Cc: Nicolin Chen, robin.murphy, bhelgaas, joro, praan, baolu.lu,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-pci

On Mon, Jan 26, 2026 at 12:39:50PM +0000, Will Deacon wrote:
> > +	ret = arm_smmu_alloc_cd_tables(master);
> > +	if (ret)
> > +		return ret;
> 
> Were do you allocate the second level entry for ssid 0 if we're using
> 2-level cd tables?

I don't think we need to. The entire design here has a non-valid CD entry
for SSID 0.

The spec is really weird here, on one hand it explicitly says that with
S1DSS the CD entry is ignored.

On the other hand, you are also required to have a CD table pointer of
at least size one for some reason.

So, I think a CD table pointer to a fully invalid L1 table of at least
size 1 should be OK?

Or stated another way, why would it be OK to have a 1 level table with
a non-valid CD table entry for SSID0 but not OK to have a 2 level
table that returns non-valid at the first walk?

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-26 12:39   ` Will Deacon
  2026-01-26 17:20     ` Jason Gunthorpe
@ 2026-01-26 18:21     ` Nicolin Chen
  1 sibling, 0 replies; 60+ messages in thread
From: Nicolin Chen @ 2026-01-26 18:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: jgg, robin.murphy, bhelgaas, joro, praan, baolu.lu, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel, linux-pci

On Mon, Jan 26, 2026 at 12:39:50PM +0000, Will Deacon wrote:
> On Fri, Jan 16, 2026 at 08:56:42PM -0800, Nicolin Chen wrote:
> > @@ -1422,7 +1422,7 @@ void arm_smmu_clear_cd(struct arm_smmu_master *master, ioasid_t ssid)
> >  	if (!arm_smmu_cdtab_allocated(&master->cd_table))
> >  		return;
> >  	cdptr = arm_smmu_get_cd_ptr(master, ssid);
> > -	if (WARN_ON(!cdptr))
> > +	if (!cdptr)
> >  		return;
> 
> Should we still warn if !master->ats_always_on?

Hmm, yes. I'll fix this.

> > @@ -3205,7 +3222,7 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
> >  				    struct iommu_domain *old_domain,
> >  				    struct device *dev,
> >  				    struct arm_smmu_ste *ste,
> > -				    unsigned int s1dss)
> > +				    unsigned int s1dss, bool ats_always_on)
> >  {
> >  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> 
> Can we avoid the 'bool' parameter if possible, please? They tend to make the
> callsites pretty horrible to read and you're already passing the 'struct
> device *' so you should have the master in hand?

Trying to set ats_always_on=false for blocked domain here:

@@ -3260,7 +3277,8 @@ static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
	arm_smmu_attach_dev_ste(domain, old_domain, dev, &ste,
				STRTAB_STE_1_S1DSS_BYPASS,
				master->ats_always_on);
@@ -3283,7 +3301,7 @@ static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
	arm_smmu_attach_dev_ste(domain, old_domain, dev, &ste,
				STRTAB_STE_1_S1DSS_TERMINATE, false);

But I think we could do that by combining master->ats_always_on
with the s1dss. I will drop the "bool".
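
One possible reading of "combining master->ats_always_on with the s1dss"
is sketched below as standalone C. The sketch_* names are hypothetical
and the S1DSS encodings are restated locally as assumptions. It only
illustrates how the identity and blocked callers could be told apart
without a bool parameter, not what the next revision will actually do:

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the STRTAB_STE_1_S1DSS_* encodings (assumed values). */
#define SKETCH_S1DSS_TERMINATE	0u
#define SKETCH_S1DSS_BYPASS	1u

/*
 * Decide whether the attach path should build a cdtable STE.
 * The blocked domain passes S1DSS=TERMINATE, so it never takes the
 * always-on path even when the master has ats_always_on set.
 */
static inline bool sketch_needs_cdtable_ste(bool ats_always_on,
					    bool ssids_in_use,
					    unsigned int s1dss)
{
	return ssids_in_use ||
	       (ats_always_on && s1dss == SKETCH_S1DSS_BYPASS);
}
```

With this shape the callers keep passing only the s1dss value, and the
helper reads ats_always_on from the master it already has.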

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-26 17:20     ` Jason Gunthorpe
@ 2026-01-26 18:40       ` Nicolin Chen
  2026-01-26 19:16         ` Jason Gunthorpe
  2026-01-26 18:49       ` Robin Murphy
  1 sibling, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-01-26 18:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Will Deacon, robin.murphy, bhelgaas, joro, praan, baolu.lu,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-pci

On Mon, Jan 26, 2026 at 01:20:20PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 26, 2026 at 12:39:50PM +0000, Will Deacon wrote:
> > > +	ret = arm_smmu_alloc_cd_tables(master);
> > > +	if (ret)
> > > +		return ret;
> > 
> > Were do you allocate the second level entry for ssid 0 if we're using
> > 2-level cd tables?
> 
> I don't think we need to. The entire design here has a non-valid CD entry
> for SSID 0.

Hmm, whether we allocate a 2-level cd table would actually depend on
the "1 << cd_table->s1cdmax" v.s. CTXDESC_L2_ENTRIES, right?

If the device supports PASID and s1cdmax is large, we should prepare
a 2-level cd table, even if only SSID0 is used at this moment since
we have to support !0 pasids via potential SVA domains.

In all other cases, we would prepare a linear one.
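
As a rough standalone sketch of that decision (the sketch_* names are
hypothetical, the 1024-entry threshold is an assumed stand-in for
CTXDESC_L2_ENTRIES, and the second half of the condition is inferred
from the truncated hunk rather than copied from the driver):

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed stand-in for the driver's CTXDESC_L2_ENTRIES. */
#define SKETCH_CTXDESC_L2_ENTRIES	1024u

enum sketch_cd_format {
	SKETCH_CD_LINEAR,
	SKETCH_CD_2LVL,
};

/*
 * Pick the CD table format from the number of contexts implied by
 * s1cdmax. A small table, or an SMMU without 2-level support, stays
 * linear; only a large PASID space gets a 2-level table.
 */
static inline enum sketch_cd_format
sketch_pick_cd_format(bool smmu_has_2lvl_cdtab, unsigned int s1cdmax)
{
	unsigned int max_contexts = 1u << s1cdmax;

	if (!smmu_has_2lvl_cdtab ||
	    max_contexts <= SKETCH_CTXDESC_L2_ENTRIES)
		return SKETCH_CD_LINEAR;
	return SKETCH_CD_2LVL;
}
```

Under these assumptions an ats_always_on device without PASID
(s1cdmax clamped to 1, so two contexts) always ends up linear.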

Nicolin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-26 17:20     ` Jason Gunthorpe
  2026-01-26 18:40       ` Nicolin Chen
@ 2026-01-26 18:49       ` Robin Murphy
  2026-01-26 19:09         ` Jason Gunthorpe
  1 sibling, 1 reply; 60+ messages in thread
From: Robin Murphy @ 2026-01-26 18:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Will Deacon
  Cc: Nicolin Chen, bhelgaas, joro, praan, baolu.lu, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel, linux-pci

On 2026-01-26 5:20 pm, Jason Gunthorpe wrote:
> On Mon, Jan 26, 2026 at 12:39:50PM +0000, Will Deacon wrote:
>>> +	ret = arm_smmu_alloc_cd_tables(master);
>>> +	if (ret)
>>> +		return ret;
>>
>> Were do you allocate the second level entry for ssid 0 if we're using
>> 2-level cd tables?
> 
> I don't think we need to. The entire design here has a non-valid CD entry
> for SSID 0.
> 
> The spec is really weird here, on one hand it explicitly says that with
> S1DSS the CD entry is ignored.
> 
> On the other hand, you are also required to have a CD table pointer of
> at least size one for some reason.

Because it is not possible to enable 0 SubStreams, since that wouldn't 
make any sense, hence S1CDMax also acts as the "enable SubStreams" 
control (assuming SSIDSIZE > 0 and it does anything at all - note that 
strictly we cannot assume this bypass trick is *always* possible, since 
an SMMU is permitted to support ATS without supporting SubStreams).
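
The two conditions above can be modelled in a few lines of standalone
C. This is only a sketch: the sketch_* helpers are hypothetical, with
has_s1p standing for IDR0.S1P and ssidsize for IDR1.SSIDSIZE, and the
clamp mirrors the max_t() in the patch:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * The S1DSS=Bypass trick needs stage 1 support and a non-zero
 * SSIDSIZE; otherwise the S1DSS field is ignored by the SMMU.
 */
static inline bool sketch_can_ats_always_on(bool has_s1p,
					    unsigned int ssidsize)
{
	return has_s1p && ssidsize > 0;
}

/*
 * S1CDMax doubles as the "enable SubStreams" control, so an
 * always-on device must never program 0 here.
 */
static inline unsigned int sketch_s1cdmax(unsigned int ssid_bits,
					  bool ats_always_on)
{
	if (ats_always_on && ssid_bits < 1)
		return 1;
	return ssid_bits;
}
```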

> So, I think a CD table pointer to a fully invalid L1 table of at least
> size 1 should be OK?
> 
> Or stated another way, why would ie be OK to have a 1 level table with
> an non-valid CD table entry for SSID0 but not OK to have a 2 level
> table that returns non-valid at the first walk?

S1ContextPtr itself is reachable since S1 is enabled, so it cannot point 
to nonsense. But the S1DSS==Bypass behaviour does state:

"Note: Such a transaction does not fetch a CD, and therefore does not 
report F_CD_FETCH, C_BAD_CD or a stage 2 Translation-related fault with 
CLASS == CD."

So if we're not intending to actually allow traffic on the SubStream(s), 
then it should be fine to use either a 1-level table of invalid CDs, or 
a 2-level format with an empty L1CD table to gracefully terminate any 
config prefetches.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-26 18:49       ` Robin Murphy
@ 2026-01-26 19:09         ` Jason Gunthorpe
  2026-01-27 13:10           ` Will Deacon
  0 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-26 19:09 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Will Deacon, Nicolin Chen, bhelgaas, joro, praan, baolu.lu,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-pci

On Mon, Jan 26, 2026 at 06:49:07PM +0000, Robin Murphy wrote:
> (assuming SSIDSIZE > 0 and it does anything at all - note that strictly we
> cannot assume this bypass trick is *always* possible, since an SMMU is
> permitted to support ATS without supporting SubStreams).

Yes, I think Nicolin has captured those conditions in computing
it... We don't have logic to disable bypass in that case though.

> > So, I think a CD table pointer to a fully invalid L1 table of at least
> > size 1 should be OK?
> > 
> > Or stated another way, why would ie be OK to have a 1 level table with
> > an non-valid CD table entry for SSID0 but not OK to have a 2 level
> > table that returns non-valid at the first walk?
> 
> S1ContextPtr itself is reachable since S1 is enabled, so it cannot point to
> nonsense. But the S1DSS==Bypass behaviour does state:

> "Note: Such a transaction does not fetch a CD, and therefore does not report
> F_CD_FETCH, C_BAD_CD or a stage 2 Translation-related fault with CLASS ==
> CD."

Yes

However, taken together:
 * S1CDMax is set to substream 0 only
 * S1DSS is set such that "does not fetch a CD" for SSID = 0
 * SSID >0 doesn't fetch CDs because of S1CDMax

Then it seems to be saying that it will never use S1ContextPtr? ie it
is IGNORED?

> So if we're not intending to actually allow traffic on the SubStream(s),
> then it should be fine to use either a 1-level table of invalid CDs, or a
> 2-level format with an empty L1CD table to gracefully terminate any config
> prefetches.

Yes, so arm_smmu_alloc_cd_tables() is fine since it creates a valid
value for S1ContextPtr such that any future use can happen without
changing S1ContextPtr.

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-26 18:40       ` Nicolin Chen
@ 2026-01-26 19:16         ` Jason Gunthorpe
  0 siblings, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-26 19:16 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Will Deacon, robin.murphy, bhelgaas, joro, praan, baolu.lu,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-pci

On Mon, Jan 26, 2026 at 10:40:39AM -0800, Nicolin Chen wrote:
> On Mon, Jan 26, 2026 at 01:20:20PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 26, 2026 at 12:39:50PM +0000, Will Deacon wrote:
> > > > +	ret = arm_smmu_alloc_cd_tables(master);
> > > > +	if (ret)
> > > > +		return ret;
> > > 
> > > Were do you allocate the second level entry for ssid 0 if we're using
> > > 2-level cd tables?
> > 
> > I don't think we need to. The entire design here has a non-valid CD entry
> > for SSID 0.
> 
> Hmm, whether we allocate a 2-level cd table would actually depend on
> the "1 << cd_table->s1cdmax" v.s. CTXDESC_L2_ENTRIES, right?
> 
> If the device supports PASID and s1cdmax is large, we should prepare
> a 2-level cd tables, even if only SSID0 is used at this moment since
> we have to support !0 pasids via potential SVA domains.
> 
> In all Other cases, we would prepare a linear one.

Yes, this is what arm_smmu_alloc_cd_tables() is doing.

I think Will was questioning if this needs to be
arm_smmu_alloc_cd_ptr(master, 0);

To ensure there is some memory under the SSID=0 case, but it seems we
don't need that.

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-22 19:46             ` dan.j.williams
@ 2026-01-27  8:10               ` Tian, Kevin
  2026-01-27 15:04                 ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2026-01-27  8:10 UTC (permalink / raw)
  To: Williams, Dan J, Jason Gunthorpe
  Cc: Jonathan Cameron, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

> From: Williams, Dan J <dan.j.williams@intel.com>
> Sent: Friday, January 23, 2026 3:46 AM
> 
> Jason Gunthorpe wrote:
> > On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com
> wrote:
> > > I do not immediately see what is wrong with requiring userspace policy
> > > opt-in. That naturally gets replaced by installing the device's
> > > certificate (for native PCI CMA), authenticating the device with the
> > > TSM (for PCI IDE), or obviated by secure-ATS if that arrives.
> >
> > I think that goes back to the discussion about not loading drivers
> > before validating the device.
> >
> > It would also make alot of sense to leave the IOMMU blocking until the
> > driver is loaded for these secure situations. The blocking translation
> > should block ATS too.
> >
> > Then the flow you are describing will work well:
> >
> > 1) At pre-boot the IOMMU will block all DMA including Translated.
> > 2) The OS activates the IOMMU driver and keeps blocking.
> > 3) Instead of immediately binding a default domain the IOMMU core
> >    leaves the translation blocking.
> > 4) The OS defers loading the driver to userspace.
> > 5) Userspace measures the device and "accepts" it by loading the
> >    driver
> > 6) IOMMU core attaches a non-blocking default domain and activates ATS
> 
> That works for me. Give the paranoid the ability to have a point where they
> can
> be assured that the shields were not lowered prematurely.

Jason described the flow as "for these secure situations", i.e. not a general
requirement for cxl.cache, but iiuc Dan may instead want userspace policy
opt-in to be default (and with CMA/TSM etc. it gets easier)?

Better to clarify the agreement here as the outcome decides whether to
continue what this series tries to do...

At a glance, cxl.cache devices already get ATS enabled automatically in
most cases (same as for all other ATS-capable PCI devices):

- ARM: ATS is enabled automatically when attaching the default domain
  to the device in certain configurations, and this series tries to auto
  enable it in a missing configuration

- AMD: ATS is enabled at domain attach time

- Intel: ATS is enabled when a device is probed by intel-iommu driver
  (incompatible with the suggested flow)

Given the above has already shipped in distributions, we probably have to
keep them for compatibility (implying this series makes sense to fix a gap
in existing policy), then treat the suggested flow as an enhancement
for the future?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-26 19:09         ` Jason Gunthorpe
@ 2026-01-27 13:10           ` Will Deacon
  2026-01-27 13:26             ` Robin Murphy
  0 siblings, 1 reply; 60+ messages in thread
From: Will Deacon @ 2026-01-27 13:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Robin Murphy, Nicolin Chen, bhelgaas, joro, praan, baolu.lu,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-pci

On Mon, Jan 26, 2026 at 03:09:35PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 26, 2026 at 06:49:07PM +0000, Robin Murphy wrote:
> > (assuming SSIDSIZE > 0 and it does anything at all - note that strictly we
> > cannot assume this bypass trick is *always* possible, since an SMMU is
> > permitted to support ATS without supporting SubStreams).
> 
> Yes, I think Nicolin has captured those conditions in computing
> it... We don't have a logic to disable bypass in that case though.
> 
> > > So, I think a CD table pointer to a fully invalid L1 table of at least
> > > size 1 should be OK?
> > > 
> > > Or stated another way, why would ie be OK to have a 1 level table with
> > > an non-valid CD table entry for SSID0 but not OK to have a 2 level
> > > table that returns non-valid at the first walk?
> > 
> > S1ContextPtr itself is reachable since S1 is enabled, so it cannot point to
> > nonsense. But the S1DSS==Bypass behaviour does state:
> 
> > "Note: Such a transaction does not fetch a CD, and therefore does not report
> > F_CD_FETCH, C_BAD_CD or a stage 2 Translation-related fault with CLASS ==
> > CD."
> 
> Yes
> 
> However, taken together:
>  * S1CDMax is set to substream 0 only
>  * S1DSS is set such that "does not fetch a CD" for SSID = 0
>  * SSID >0 doesn't fetch CDs because of S1CDMax
> 
> Then it seems to be saying that it will never use S1ContextPtr? ie it
> is IGNORED?

Right, I think the critical question is whether that setting of S1DSS
(0b01) means that STE.S1ContextPtr is considered "invalid". The spec
doesn't call this out explicitly but the "translation procedure charts"
seem to indicate that it doesn't use the CD for anything...

It would be good to get some clarification from Arm about this
particular case.

Will

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-27 13:10           ` Will Deacon
@ 2026-01-27 13:26             ` Robin Murphy
  2026-01-27 13:50               ` Will Deacon
  0 siblings, 1 reply; 60+ messages in thread
From: Robin Murphy @ 2026-01-27 13:26 UTC (permalink / raw)
  To: Will Deacon, Jason Gunthorpe
  Cc: Nicolin Chen, bhelgaas, joro, praan, baolu.lu, kevin.tian,
	miko.lenczewski, linux-arm-kernel, iommu, linux-kernel, linux-pci

On 2026-01-27 1:10 pm, Will Deacon wrote:
> On Mon, Jan 26, 2026 at 03:09:35PM -0400, Jason Gunthorpe wrote:
>> On Mon, Jan 26, 2026 at 06:49:07PM +0000, Robin Murphy wrote:
>>> (assuming SSIDSIZE > 0 and it does anything at all - note that strictly we
>>> cannot assume this bypass trick is *always* possible, since an SMMU is
>>> permitted to support ATS without supporting SubStreams).
>>
>> Yes, I think Nicolin has captured those conditions in computing
>> it... We don't have a logic to disable bypass in that case though.
>>
>>>> So, I think a CD table pointer to a fully invalid L1 table of at least
>>>> size 1 should be OK?
>>>>
>>>> Or stated another way, why would ie be OK to have a 1 level table with
>>>> an non-valid CD table entry for SSID0 but not OK to have a 2 level
>>>> table that returns non-valid at the first walk?
>>>
>>> S1ContextPtr itself is reachable since S1 is enabled, so it cannot point to
>>> nonsense. But the S1DSS==Bypass behaviour does state:
>>
>>> "Note: Such a transaction does not fetch a CD, and therefore does not report
>>> F_CD_FETCH, C_BAD_CD or a stage 2 Translation-related fault with CLASS ==
>>> CD."
>>
>> Yes
>>
>> However, taken together:
>>   * S1CDMax is set to substream 0 only
>>   * S1DSS is set such that "does not fetch a CD" for SSID = 0
>>   * SSID >0 doesn't fetch CDs because of S1CDMax
>>
>> Then it seems to be saying that it will never use S1ContextPtr? ie it
>> is IGNORED?
> 
> Right, I think the critical question is whether that setting of S1DSS
> (0b01) means that STE.S1ContextPtr is considered "invalid". The spec
> doesn't call this out explicitly but the "translation procedure charts"
> seem to indicate that it doesn't use the CD for anything...
> 
> It would be good to get some clarification from Arm about this
> particular case.

No, STE.S1ContextPtr itself is "valid" since S1 is enabled. No CD fetch 
will occur for no-SubStreamID transactions that are bypassed by S1DSS, 
but the SMMU is permitted to attempt to speculatively fetch CDs for the 
enabled SubStreamID(s). Those fetches do not have to reach a valid CD if 
the SubStream is not actually in use, much like we don't have to fully 
populate a 2-level Stream table for StreamID ranges we don't care about 
either.

Don't confuse S1DSS==1 (bypass) with the S1DSS==2 behaviour we use in 
other cases - the latter is "Use CD 0 for no-SubstreamID traffic" which 
makes SubStreamID 0 invalid to use. However in the bypass case (and also 
S1DSS==0 where no-SubstreamID traffic is blocked entirely), SubStreamID 
0 remains perfectly valid and usable (we just still won't ever use it in 
Linux due to the middle case).
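
To make the three S1DSS cases concrete, here is a small standalone C
model of what is described above. It is a sketch only: the sketch_*
and SKETCH_* names are hypothetical, and it assumes SubStreams are
enabled (S1CDMax != 0) so that S1DSS is not ignored:

```c
#include <assert.h>
#include <stdbool.h>

/* STE.S1DSS encodings as used in this thread. */
enum sketch_s1dss {
	SKETCH_S1DSS_TERMINATE	= 0,	/* no-SubstreamID traffic blocked */
	SKETCH_S1DSS_BYPASS	= 1,	/* no-SubstreamID traffic bypasses S1 */
	SKETCH_S1DSS_SSID0	= 2,	/* no-SubstreamID traffic uses CD 0 */
};

enum sketch_untagged {
	SKETCH_UNTAGGED_BLOCKED,
	SKETCH_UNTAGGED_BYPASS,		/* no CD fetch at all */
	SKETCH_UNTAGGED_USES_CD0,
};

/* What happens to a transaction that carries no SubstreamID. */
static inline enum sketch_untagged
sketch_untagged_action(enum sketch_s1dss s1dss)
{
	switch (s1dss) {
	case SKETCH_S1DSS_BYPASS:
		return SKETCH_UNTAGGED_BYPASS;
	case SKETCH_S1DSS_SSID0:
		return SKETCH_UNTAGGED_USES_CD0;
	default:
		return SKETCH_UNTAGGED_BLOCKED;
	}
}

/* Whether a transaction may carry an explicit SubstreamID of 0. */
static inline bool sketch_ssid0_usable(enum sketch_s1dss s1dss)
{
	return s1dss != SKETCH_S1DSS_SSID0;
}
```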

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-27 13:26             ` Robin Murphy
@ 2026-01-27 13:50               ` Will Deacon
  2026-01-27 14:49                 ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Will Deacon @ 2026-01-27 13:50 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jason Gunthorpe, Nicolin Chen, bhelgaas, joro, praan, baolu.lu,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-pci

On Tue, Jan 27, 2026 at 01:26:02PM +0000, Robin Murphy wrote:
> On 2026-01-27 1:10 pm, Will Deacon wrote:
> > On Mon, Jan 26, 2026 at 03:09:35PM -0400, Jason Gunthorpe wrote:
> > > On Mon, Jan 26, 2026 at 06:49:07PM +0000, Robin Murphy wrote:
> > > > (assuming SSIDSIZE > 0 and it does anything at all - note that strictly we
> > > > cannot assume this bypass trick is *always* possible, since an SMMU is
> > > > permitted to support ATS without supporting SubStreams).
> > > 
> > > Yes, I think Nicolin has captured those conditions in computing
> > > it... We don't have a logic to disable bypass in that case though.
> > > 
> > > > > So, I think a CD table pointer to a fully invalid L1 table of at least
> > > > > size 1 should be OK?
> > > > > 
> > > > > Or stated another way, why would ie be OK to have a 1 level table with
> > > > > an non-valid CD table entry for SSID0 but not OK to have a 2 level
> > > > > table that returns non-valid at the first walk?
> > > > 
> > > > S1ContextPtr itself is reachable since S1 is enabled, so it cannot point to
> > > > nonsense. But the S1DSS==Bypass behaviour does state:
> > > 
> > > > "Note: Such a transaction does not fetch a CD, and therefore does not report
> > > > F_CD_FETCH, C_BAD_CD or a stage 2 Translation-related fault with CLASS ==
> > > > CD."
> > > 
> > > Yes
> > > 
> > > However, taken together:
> > >   * S1CDMax is set to substream 0 only
> > >   * S1DSS is set such that "does not fetch a CD" for SSID = 0
> > >   * SSID >0 doesn't fetch CDs because of S1CDMax
> > > 
> > > Then it seems to be saying that it will never use S1ContextPtr? ie it
> > > is IGNORED?
> > 
> > Right, I think the critical question is whether that setting of S1DSS
> > (0b01) means that STE.S1ContextPtr is considered "invalid". The spec
> > doesn't call this out explicitly but the "translation procedure charts"
> > seem to indicate that it doesn't use the CD for anything...
> > 
> > It would be good to get some clarification from Arm about this
> > particular case.
> 
> No, STE.S1ContextPtr itself is "valid" since S1 is enabled. No CD fetch will
> occur for no-SubStreamID transactions that are bypassed by S1DSS, but the
> SMMU is permitted to attempt to speculatively fetch CDs for the enabled
> SubStreamID(s). Those fetches do not have to reach a valid CD if the
> SubStream is not actually in use, much like we don't have to fully populate
> a 2-level Stream table for StreamID ranges we don't care about either.
> 
> Don't confuse S1DSS==1 (bypass) with the S1DSS==2 behaviour we use in other
> cases - the latter is "Use CD 0 for no-SubstreamID traffic" which makes
> SubStreamID 0 invalid to use. However in the bypass case (and also S1DSS==0
> where no-SubstreamID traffic is blocked entirely), SubStreamID 0 remains
> perfectly valid and usable (we just still won't ever use it in Linux due to
> the middle case).

Argh, I had conflated a transaction using SSID 0 vs a transaction
without a substream at all. So I think this makes sense now...

Thanks,

Will

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on
  2026-01-27 13:50               ` Will Deacon
@ 2026-01-27 14:49                 ` Jason Gunthorpe
  0 siblings, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-27 14:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Robin Murphy, Nicolin Chen, bhelgaas, joro, praan, baolu.lu,
	kevin.tian, miko.lenczewski, linux-arm-kernel, iommu,
	linux-kernel, linux-pci

On Tue, Jan 27, 2026 at 01:50:54PM +0000, Will Deacon wrote:
> Argh, I had conflated a transaction using SSID 0 vs a transaction
> without a substream at all. So I think this makes sense now...

Yeah, it is a bit subtle, but as a SW choice the iommu subsystem
reserves PASID 0/SSID 0 as the "untagged" translation.

Several HWs force this in their implementation (e.g. AMD)

ARM however includes a "Substream Valid" in the input bus. Linux
doesn't use the combination "Substream Valid, SSID=0", that should
never occur.

If it wrongly does happen then IDENTITY will generate a fault, either
C_BAD_CD (due to it being non-valid) or C_BAD_SUBSTREAMID (due to
S1CDMax disabling substreams).

While PAGING will either fault with C_BAD_SUBSTREAMID (S2 paging
domain) or succeed when S1DSS=b10.

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-27  8:10               ` Tian, Kevin
@ 2026-01-27 15:04                 ` Jason Gunthorpe
  2026-01-28  0:49                   ` dan.j.williams
  2026-01-28  0:57                   ` Tian, Kevin
  0 siblings, 2 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-27 15:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Williams, Dan J, Jonathan Cameron, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Tue, Jan 27, 2026 at 08:10:06AM +0000, Tian, Kevin wrote:
> > From: Williams, Dan J <dan.j.williams@intel.com>
> > Sent: Friday, January 23, 2026 3:46 AM
> > 
> > Jason Gunthorpe wrote:
> > > On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com
> > wrote:
> > > > I do not immediately see what is wrong with requiring userspace policy
> > > > opt-in. That naturally gets replaced by installing the device's
> > > > certificate (for native PCI CMA), authenticating the device with the
> > > > TSM (for PCI IDE), or obviated by secure-ATS if that arrives.
> > >
> > > I think that goes back to the discussion about not loading drivers
> > > before validating the device.
> > >
> > > It would also make alot of sense to leave the IOMMU blocking until the
> > > driver is loaded for these secure situations. The blocking translation
> > > should block ATS too.
> > >
> > > Then the flow you are describing will work well:
> > >
> > > 1) At pre-boot the IOMMU will block all DMA including Translated.
> > > 2) The OS activates the IOMMU driver and keeps blocking.
> > > 3) Instead of immediately binding a default domain the IOMMU core
> > >    leaves the translation blocking.
> > > 4) The OS defers loading the driver to userspace.
> > > 5) Userspace measures the device and "accepts" it by loading the
> > >    driver
> > > 6) IOMMU core attaches a non-blocking default domain and activates ATS
> > 
> > That works for me. Give the paranoid the ability to have a point where they
> > can
> > be assured that the shields were not lowered prematurely.
> 
> Jason described the flow as "for these secure situations", i.e. not a general
> requirement for cxl.cache, but iiuc Dan may instead want userspace policy
> opt-in to be default (and with CMA/TSM etc. it gets easier)?

I think the general strategy has been to push userspace to do security
decisions before binding drivers. So we have a plan for confidential
compute VMs, and if there is interest then we can probably re-use that
plan in all other cases.

> At a glance cxl.cache devices have gained ATS enabled automatically in
> most cases (same as for all other ats-capable PCI devices):

Yes.
 
> - ARM: ATS is enabled automatically when attaching the default domain
>   to the device in certain configurations, and this series tries to auto
>   enable it in a missing configuration

Yes, ARM took the position that ATS should be left disabled for
IDENTITY both because of SMMU constraints and also because it made
some sense that you wouldn't want ATS overhead just to get a 1:1
translation.

> - AMD: ATS is enabled at domain attach time

I'd argue this is an error and it should work like ARM

> - Intel: ATS is enabled when a device is probed by intel-iommu driver
>   (incompatible with the suggested flow)

This is definitely not a good choice :)

IMHO security requires that the IOMMU driver block Translated
requests while a BLOCKED domain is attached, and while the IOMMU is
refusing ATS the device's ATS enable should be cleared.

> Given above already shipped in distributions, probably we have to keep
> them for compatibility (implying this series makes sense to fix a gap
> in existing policy), then treat the suggested flow as an enhancement
> for future?

I don't think we have a compatibility issue here, just a security
one.

Drivers need to ensure that ATS is disabled at PCI and Translated
requests blocked in IOMMU HW while a BLOCKED domain is attached.

Drivers can choose if they want to enable ATS for IDENTITY or not,
(recommend not for performance and consistency).
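
To make the two rules above concrete, here is a minimal sketch in plain C.
It is not code from this series or from any IOMMU driver. Every name in it
(sketch_domain_type, sketch_ats_state, sketch_ats_state_for) is hypothetical,
and enabling ATS for a translating domain is an assumption about an
ATS-capable device, not something the rules above require:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's domain types (not kernel enums). */
enum sketch_domain_type {
	SKETCH_DOMAIN_BLOCKED,
	SKETCH_DOMAIN_IDENTITY,
	SKETCH_DOMAIN_PAGING,
};

/* The two knobs that have to agree with each other. */
struct sketch_ats_state {
	bool pci_ats_enable;		/* ATS enable in the device's PCI capability */
	bool iommu_accepts_translated;	/* IOMMU HW lets Translated requests through */
};

/*
 * identity_wants_ats models the per-driver choice for IDENTITY; the
 * recommendation in this thread corresponds to passing false.
 */
static struct sketch_ats_state
sketch_ats_state_for(enum sketch_domain_type type, bool identity_wants_ats)
{
	struct sketch_ats_state s = { false, false };

	switch (type) {
	case SKETCH_DOMAIN_BLOCKED:
		/* Mandatory: ATS off at PCI and Translated requests blocked. */
		break;
	case SKETCH_DOMAIN_IDENTITY:
		/* Optional: left to the IOMMU driver. */
		s.pci_ats_enable = identity_wants_ats;
		s.iommu_accepts_translated = identity_wants_ats;
		break;
	case SKETCH_DOMAIN_PAGING:
		/* Assumption: an ATS-capable device gets ATS with a translating domain. */
		s.pci_ats_enable = true;
		s.iommu_accepts_translated = true;
		break;
	}
	return s;
}
```

Passing identity_wants_ats=false reproduces the ARM position described
earlier in this mail; the BLOCKED case ignores the flag entirely.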

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-27 15:04                 ` Jason Gunthorpe
@ 2026-01-28  0:49                   ` dan.j.williams
  2026-01-28 13:05                     ` Jason Gunthorpe
  2026-01-28  0:57                   ` Tian, Kevin
  1 sibling, 1 reply; 60+ messages in thread
From: dan.j.williams @ 2026-01-28  0:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Williams, Dan J, Jonathan Cameron, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

Jason Gunthorpe wrote:
[..]
> > Jason described the flow as "for these secure situations", i.e. not a general
> > requirement for cxl.cache, but iiuc Dan may instead want userspace policy
> > opt-in to be default (and with CMA/TSM etc. it gets easier)?
> 
> I think the general strategy has been to push userspace to do security
> decisions before binding drivers. So we have a plan for confidential
> compute VMs, and if there is interest then we can probably re-use that
> plan in all other cases.

Right, if you want to configure a kernel to automatically enable ATS
that is a choice. But, as distros get more security conscious about devices
for confidential compute, it would be nice to be able to rely on the
same opt-in model for other security concerns like ATS security.

> > At a glance cxl.cache devices have gained ATS enabled automatically in
> > most cases (same as for all other ats-capable PCI devices):
> 
> Yes.
>  
> > - ARM: ATS is enabled automatically when attaching the default domain
> >   to the device in certain configurations, and this series tries to auto
> >   enable it in a missing configuration
> 
> Yes, ARM took the position that ATS should be left disabled for
> IDENTITY both because of SMMU constraints and also because it made
> some sense that you wouldn't want ATS overhead just to get a 1:1
> translation.

Does this mean that ARM already today does not enable ATS until driver
attach, or is incremental work needed for that capability?

> > - AMD: ATS is enabled at domain attach time
> 
> I'd argue this is an error and it should work like ARM
> 
> > - Intel: ATS is enabled when a device is probed by intel-iommu driver
> >   (incompatible with the suggested flow)
> 
> This is definitely not a good choice :)
> 
> IMHO it is security required that the IOMMU driver block Translated
> requests while a BLOCKED domain is attached, and while the IOMMU is
> refusing ATS then device's ATS enable should be disabled.
> 
> > Given above already shipped in distributions, probably we have to keep
> > them for compatibility (implying this series makes sense to fix a gap
> > in existing policy), then treat the suggested flow as an enhancement
> > for future?
> 
> I don't think we have a compatibility issue here, just a security
> one.
> 
> Drivers need to ensure that ATS is disabled at PCI and Translated
> requests blocked in IOMMU HW while a BLOCKED domain is attached.

"Drivers" here meaning IOMMU drivers, right?

> Drivers can choose if they want to enable ATS for IDENTITY or not,
> (recommend not for performance and consistency).
> 
> Jason



^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-27 15:04                 ` Jason Gunthorpe
  2026-01-28  0:49                   ` dan.j.williams
@ 2026-01-28  0:57                   ` Tian, Kevin
  2026-01-28 13:11                     ` Jason Gunthorpe
  1 sibling, 1 reply; 60+ messages in thread
From: Tian, Kevin @ 2026-01-28  0:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Williams, Dan J, Jonathan Cameron, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, January 27, 2026 11:05 PM
> 
> On Tue, Jan 27, 2026 at 08:10:06AM +0000, Tian, Kevin wrote:
> > > From: Williams, Dan J <dan.j.williams@intel.com>
> > > Sent: Friday, January 23, 2026 3:46 AM
> > >
> > > Jason Gunthorpe wrote:
> > > > On Wed, Jan 21, 2026 at 09:44:32PM -0800, dan.j.williams@intel.com wrote:
> > > > > I do not immediately see what is wrong with requiring userspace policy
> > > > > opt-in. That naturally gets replaced by installing the device's
> > > > > certificate (for native PCI CMA), authenticating the device with the
> > > > > TSM (for PCI IDE), or obviated by secure-ATS if that arrives.
> > > >
> > > > I think that goes back to the discussion about not loading drivers
> > > > before validating the device.
> > > >
> > > > It would also make a lot of sense to leave the IOMMU blocking until the
> > > > driver is loaded for these secure situations. The blocking translation
> > > > should block ATS too.
> > > >
> > > > Then the flow you are describing will work well:
> > > >
> > > > 1) At pre-boot the IOMMU will block all DMA including Translated.
> > > > 2) The OS activates the IOMMU driver and keeps blocking.
> > > > 3) Instead of immediately binding a default domain the IOMMU core
> > > >    leaves the translation blocking.
> > > > 4) The OS defers loading the driver to userspace.
> > > > 5) Userspace measures the device and "accepts" it by loading the
> > > >    driver
> > > > 6) IOMMU core attaches a non-blocking default domain and activates ATS
> > >
> > > That works for me. Give the paranoid the ability to have a point where
> > > they can be assured that the shields were not lowered prematurely.
> >
> > Jason described the flow as "for these secure situations", i.e. not a general
> > requirement for cxl.cache, but iiuc Dan may instead want userspace policy
> > opt-in to be default (and with CMA/TSM etc. it gets easier)?
> 
> I think the general strategy has been to push userspace to do security
> decisions before binding drivers. So we have a plan for confidential
> compute VMs, and if there is interest then we can probably re-use that
> plan in all other cases.

Makes sense.

> 
> > At a glance cxl.cache devices have gained ATS enabled automatically in
> > most cases (same as for all other ats-capable PCI devices):
> 
> Yes.
> 
> > - ARM: ATS is enabled automatically when attaching the default domain
> >   to the device in certain configurations, and this series tries to auto
> >   enable it in a missing configuration
> 
> Yes, ARM took the position that ATS should be left disabled for
> IDENTITY both because of SMMU constraints and also because it made
> some sense that you wouldn't want ATS overhead just to get a 1:1
> translation.
> 
> > - AMD: ATS is enabled at domain attach time
> 
> I'd argue this is an error and it should work like ARM
> 
> > - Intel: ATS is enabled when a device is probed by intel-iommu driver
> >   (incompatible with the suggested flow)
> 
> This is definitely not a good choice :)
> 
> IMHO it is security required that the IOMMU driver block Translated
> requests while a BLOCKED domain is attached, and while the IOMMU is
> refusing ATS then device's ATS enable should be disabled.

It was made that way by commit 5518f239aff1 ("iommu/vt-d: Move
scalable mode ATS enablement to probe path"). The old policy was the
same as on the AMD side; it was changed to the current one so that a
domain change on the RID won't affect the ATS requirement for PASIDs.

But I agree BLOCKED is special. Ideally there is no reason to have
RID attached to a BLOCKED domain while PASIDs are not
blocked... oh, wait

there is one scenario, e.g. VFIO allows domains attached to RID and
PASIDs being changed independently. It's a sane situation to have
userspace change RID domain via attach/detach/re-attach, while
PASID domains are still active. 'detach' will attach RID to a BLOCKED
domain, then disabling ATS in that window may break PASIDs.

How does ARM address this scenario? Is it more suitable to have
a new interface specific for driver bind/unbind to enable/disable
ATS instead of toggling it based on BLOCKED?

> 
> > Given above already shipped in distributions, probably we have to keep
> > them for compatibility (implying this series makes sense to fix a gap
> > in existing policy), then treat the suggested flow as an enhancement
> > for future?
> 
> I don't think we have a compatibility issue here, just a security
> one.

'compatibility issue' if auto-enabling is completely removed with
userspace opt-in as the only way. But since it's not the case, then
yes it's more a security one.

> 
> Drivers need to ensure that ATS is disabled at PCI and Translated
> requests blocked in IOMMU HW while a BLOCKED domain is attached.
> 
> Drivers can choose if they want to enable ATS for IDENTITY or not,
> (recommend not for performance and consistency).
> 
> Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-28  0:49                   ` dan.j.williams
@ 2026-01-28 13:05                     ` Jason Gunthorpe
  2026-02-03  5:13                       ` Nicolin Chen
  0 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-28 13:05 UTC (permalink / raw)
  To: dan.j.williams
  Cc: Tian, Kevin, Jonathan Cameron, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Tue, Jan 27, 2026 at 04:49:07PM -0800, dan.j.williams@intel.com wrote:
> > Yes, ARM took the position that ATS should be left disabled for
> > IDENTITY both because of SMMU constraints and also because it made
> > some sense that you wouldn't want ATS overhead just to get a 1:1
> > translation.
> 
> Does this mean that ARM already today does not enable ATS until driver
> attach, or is incremental work needed for that capability?

All of the iommu drivers set up an iommu translation and enable ATS
before any driver is bound.

We would need to do more work in the core to leave the translation
blocked when there is no driver. I don't think it is that difficult

> > Drivers need to ensure that ATS is disabled at PCI and Translated
> > requests blocked in IOMMU HW while a BLOCKED domain is attached.
> 
> "Drivers" here meaning IOMMU drivers, right?

Yes

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-28  0:57                   ` Tian, Kevin
@ 2026-01-28 13:11                     ` Jason Gunthorpe
  2026-01-29  3:28                       ` Tian, Kevin
  0 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2026-01-28 13:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Williams, Dan J, Jonathan Cameron, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Wed, Jan 28, 2026 at 12:57:59AM +0000, Tian, Kevin wrote:
> > > - Intel: ATS is enabled when a device is probed by intel-iommu driver
> > >   (incompatible with the suggested flow)
> > 
> > This is definitely not a good choice :)
> > 
> > IMHO it is security required that the IOMMU driver block Translated
> > requests while a BLOCKED domain is attached, and while the IOMMU is
> > refusing ATS then device's ATS enable should be disabled.
> 
> It was made that way by commit 5518f239aff1 ("iommu/vt-d: Move
> scalable mode ATS enablement to probe path "). The old policy was
> same as AMD side, and changed to current way so domain change 
> in RID won't affect the ATS requirement for PASIDs.

That's a legitimate thing, but always on is a heavy-handed solution.

The driver should track what is going on with the PASID and enable ATS
if required.

Which also solves this:

> there is one scenario, e.g. VFIO allows domains attached to RID and
> PASIDs being changed independently. It's a sane situation to have
> userspace change RID domain via attach/detach/re-attach, while
> PASID domains are still active. 'detach' will attach RID to a BLOCKED
> domain, then disabling ATS in that window may break PASIDs.
 
> How does ARM address this scenario? Is it more suitable to have
> a new interface specific for driver bind/unbind to enable/disable
> ATS instead of toggling it based on BLOCKED?

And that is what SMMUv3 is doing already. With an IDENTITY translation on
the RID it starts out with ATS disabled and switches to IDENTITY with
ATS enabled when the first PASID appears. Switches back when the PASID
goes away.
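
The switching described above amounts to reference counting. A minimal
sketch in plain C, covering only the case where the RID stays on IDENTITY.
This is not the arm-smmu-v3 code, and the struct and function names are made
up for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-device state; the real driver tracks far more. */
struct sketch_master {
	unsigned int pasid_count;	/* PASIDs that currently have a domain */
	bool ats_enabled;		/* RID is IDENTITY with or without ATS */
};

/* The first PASID to appear flips the RID to "IDENTITY with ATS enabled". */
static void sketch_pasid_attach(struct sketch_master *m)
{
	if (m->pasid_count++ == 0)
		m->ats_enabled = true;
}

/* The last PASID to go away flips it back to "IDENTITY with ATS disabled". */
static void sketch_pasid_detach(struct sketch_master *m)
{
	assert(m->pasid_count > 0);
	if (--m->pasid_count == 0)
		m->ats_enabled = false;
}
```

A zero-initialised struct sketch_master starts with ATS off, matching the
starting point described above; attaching and detaching further PASIDs in
between does not toggle ATS.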

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* RE: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-28 13:11                     ` Jason Gunthorpe
@ 2026-01-29  3:28                       ` Tian, Kevin
  0 siblings, 0 replies; 60+ messages in thread
From: Tian, Kevin @ 2026-01-29  3:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Williams, Dan J, Jonathan Cameron, Nicolin Chen, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, January 28, 2026 9:12 PM
> 
> On Wed, Jan 28, 2026 at 12:57:59AM +0000, Tian, Kevin wrote:
> > > > - Intel: ATS is enabled when a device is probed by intel-iommu driver
> > > >   (incompatible with the suggested flow)
> > >
> > > This is definitely not a good choice :)
> > >
> > > IMHO it is security required that the IOMMU driver block Translated
> > > requests while a BLOCKED domain is attached, and while the IOMMU is
> > > refusing ATS then device's ATS enable should be disabled.
> >
> > It was made that way by commit 5518f239aff1 ("iommu/vt-d: Move
> > scalable mode ATS enablement to probe path "). The old policy was
> > same as AMD side, and changed to current way so domain change
> > in RID won't affect the ATS requirement for PASIDs.
> 
> That's a legitimate thing, but always on is a heavy-handed solution.

yes. at that point the rationale was made purely based on functionality
instead of security.

> 
> The driver should track what is going on with the PASID and enable ATS
> if required.
> 
> Which also solves this:
> 
> > there is one scenario, e.g. VFIO allows domains attached to RID and
> > PASIDs being changed independently. It's a sane situation to have
> > userspace change RID domain via attach/detach/re-attach, while
> > PASID domains are still active. 'detach' will attach RID to a BLOCKED
> > domain, then disabling ATS in that window may break PASIDs.
> 
> > How does ARM address this scenario? Is it more suitable to have
> > a new interface specific for driver bind/unbind to enable/disable
> > ATS instead of toggling it based on BLOCKED?
> 
> And is what SMMUv3 is doing already. With an IDENTITY translation on
> the RID it starts out with ATS disabled and switches to IDENTITY with
> ATS enabled when the first PASID appears. Switches back when the PASID
> goes away.
> 

yes, that should work. In that way the driver binding flow is covered
automatically.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-01-28 13:05                     ` Jason Gunthorpe
@ 2026-02-03  5:13                       ` Nicolin Chen
  2026-02-03 14:33                         ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-02-03  5:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Wed, Jan 28, 2026 at 09:05:20AM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 27, 2026 at 04:49:07PM -0800, dan.j.williams@intel.com wrote:
> > > Yes, ARM took the position that ATS should be left disabled for
> > > IDENTITY both because of SMMU constraints and also because it made
> > > some sense that you wouldn't want ATS overhead just to get a 1:1
> > > translation.
> > 
> > Does this mean that ARM already today does not enable ATS until driver
> > attach, or is incremental work needed for that capability?
> 
> All of the iommu drivers set up an iommu translation and enable ATS
> before any driver is bound.
> 
> We would need to do more work in the core to leave the translation
> blocked when there is no driver. I don't think it is that difficult

Hmm, not sure if we could use group->domain=NULL as "blocked"..

Otherwise, I made a draft:
-----------------------------------------------------------------
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 349f31bedfa17..8ed15d5ea1f51 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -437,8 +437,6 @@ static int driver_sysfs_add(struct device *dev)
 {
 	int ret;
 
-	bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
-
 	ret = sysfs_create_link(&dev->driver->p->kobj, &dev->kobj,
 				kobject_name(&dev->kobj));
 	if (ret)
@@ -638,10 +636,12 @@ static int really_probe(struct device *dev, const struct device_driver *drv)
 	if (ret)
 		goto pinctrl_bind_failed;
 
+	bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
+
 	if (dev->bus->dma_configure) {
 		ret = dev->bus->dma_configure(dev);
 		if (ret)
-			goto pinctrl_bind_failed;
+			goto bus_notify_bind_failed;
 	}
 
 	ret = driver_sysfs_add(dev);
@@ -717,9 +717,10 @@ static int really_probe(struct device *dev, const struct device_driver *drv)
 probe_failed:
 	driver_sysfs_remove(dev);
 sysfs_failed:
-	bus_notify(dev, BUS_NOTIFY_DRIVER_NOT_BOUND);
 	if (dev->bus && dev->bus->dma_cleanup)
 		dev->bus->dma_cleanup(dev);
+bus_notify_bind_failed:
+	bus_notify(dev, BUS_NOTIFY_DRIVER_NOT_BOUND);
 pinctrl_bind_failed:
 	device_links_no_driver(dev);
 	device_unbind_cleanup(dev);
@@ -1275,8 +1276,6 @@ static void __device_release_driver(struct device *dev, struct device *parent)
 
 		driver_sysfs_remove(dev);
 
-		bus_notify(dev, BUS_NOTIFY_UNBIND_DRIVER);
-
 		pm_runtime_put_sync(dev);
 
 		device_remove(dev);
@@ -1284,6 +1283,8 @@ static void __device_release_driver(struct device *dev, struct device *parent)
 		if (dev->bus && dev->bus->dma_cleanup)
 			dev->bus->dma_cleanup(dev);
 
+		bus_notify(dev, BUS_NOTIFY_UNBIND_DRIVER);
+
 		device_unbind_cleanup(dev);
 		device_links_driver_cleanup(dev);
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2ca990dfbb884..af53dce00e29b 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -106,6 +106,7 @@ static int __iommu_attach_group(struct iommu_domain *domain,
 static struct iommu_domain *__iommu_paging_domain_alloc_flags(struct device *dev,
 						       unsigned int type,
 						       unsigned int flags);
+static int __iommu_group_alloc_blocking_domain(struct iommu_group *group);
 
 enum {
 	IOMMU_SET_DOMAIN_MUST_SUCCEED = 1 << 0,
@@ -618,12 +619,6 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 	ret = iommu_init_device(dev);
 	if (ret)
 		return ret;
-	/*
-	 * And if we do now see any replay calls, they would indicate someone
-	 * misusing the dma_configure path outside bus code.
-	 */
-	if (dev->driver)
-		dev_WARN(dev, "late IOMMU probe at driver bind, something fishy here!\n");
 
 	group = dev->iommu_group;
 	gdev = iommu_group_alloc_device(group, dev);
@@ -641,6 +636,15 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 	WARN_ON(group->default_domain && !group->domain);
 	if (group->default_domain)
 		iommu_create_device_direct_mappings(group->default_domain, dev);
+
+	/* Block translation requests from a device without driver */
+	if (!dev->driver) {
+		ret = __iommu_group_alloc_blocking_domain(group);
+		if (ret)
+			goto err_remove_gdev;
+		group->domain = group->blocking_domain;
+	}
+
 	if (group->domain) {
 		ret = __iommu_device_set_domain(group, dev, group->domain, NULL,
 						0);
@@ -1781,19 +1785,70 @@ static int probe_iommu_group(struct device *dev, void *data)
 	return ret;
 }
 
+static int iommu_attach_default_domain(struct device *dev)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+	int ret = 0;
+
+	if (!group)
+		return 0;
+
+	mutex_lock(&group->mutex);
+
+	if (group->blocking_domain) {
+		if (!group->default_domain) {
+			ret = iommu_setup_default_domain(group, 0);
+			if (!ret)
+				iommu_setup_dma_ops(dev);
+		} else if (group->domain == group->blocking_domain) {
+			ret = __iommu_group_set_domain(
+				group, group->default_domain);
+		}
+	}
+
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+	return ret;
+}
+
+static void iommu_detach_default_domain(struct device *dev)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+
+	if (!group)
+		return;
+
+	mutex_lock(&group->mutex);
+
+	if (group->blocking_domain && group->domain != group->blocking_domain) {
+		__iommu_attach_device(group->blocking_domain, dev,
+				      group->domain);
+		group->domain = group->blocking_domain;
+	}
+
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+}
+
 static int iommu_bus_notifier(struct notifier_block *nb,
 			      unsigned long action, void *data)
 {
 	struct device *dev = data;
+	int ret;
 
 	if (action == BUS_NOTIFY_ADD_DEVICE) {
-		int ret;
-
 		ret = iommu_probe_device(dev);
 		return (ret) ? NOTIFY_DONE : NOTIFY_OK;
 	} else if (action == BUS_NOTIFY_REMOVED_DEVICE) {
 		iommu_release_device(dev);
 		return NOTIFY_OK;
+	} else if (action == BUS_NOTIFY_BIND_DRIVER) {
+		ret = iommu_attach_default_domain(dev);
+		return ret ? NOTIFY_DONE : NOTIFY_OK;
+	} else if (action == BUS_NOTIFY_UNBOUND_DRIVER ||
+		   action == BUS_NOTIFY_DRIVER_NOT_BOUND) {
+		iommu_detach_default_domain(dev);
+		return NOTIFY_OK;
 	}
 
 	return 0;
-----------------------------------------------------------------

Thanks
Nicolin

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03  5:13                       ` Nicolin Chen
@ 2026-02-03 14:33                         ` Jason Gunthorpe
  2026-02-03 17:45                           ` Nicolin Chen
  0 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-03 14:33 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Mon, Feb 02, 2026 at 09:13:50PM -0800, Nicolin Chen wrote:
> On Wed, Jan 28, 2026 at 09:05:20AM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 27, 2026 at 04:49:07PM -0800, dan.j.williams@intel.com wrote:
> > > > Yes, ARM took the position that ATS should be left disabled for
> > > > IDENTITY both because of SMMU constraints and also because it made
> > > > some sense that you wouldn't want ATS overhead just to get a 1:1
> > > > translation.
> > > 
> > > Does this mean that ARM already today does not enable ATS until driver
> > > attach, or is incremental work needed for that capability?
> > 
> > > All of the iommu drivers set up an iommu translation and enable ATS
> > before any driver is bound.
> > 
> > We would need to do more work in the core to leave the translation
> > blocked when there is no driver. I don't think it is that difficult
> 
> > Hmm, not sure if we could use group->domain=NULL as "blocked"..

Definitely not, we need to use a proper blocked domain.

> @@ -437,8 +437,6 @@ static int driver_sysfs_add(struct device *dev)
>  {
>  	int ret;
>  
> -	bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
> -
>  	ret = sysfs_create_link(&dev->driver->p->kobj, &dev->kobj,
>  				kobject_name(&dev->kobj));
>  	if (ret)
> @@ -638,10 +636,12 @@ static int really_probe(struct device *dev, const struct device_driver *drv)
>  	if (ret)
>  		goto pinctrl_bind_failed;
>  
> +	bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
> +
>  	if (dev->bus->dma_configure) {
>  		ret = dev->bus->dma_configure(dev);
>  		if (ret)
> -			goto pinctrl_bind_failed;
> +			goto bus_notify_bind_failed;
>  	}

We shouldn't need any of these? The dma_configure callback already
gets into the iommu code to validate the domain and restrict VFIO,
no further callbacks should be needed.

When the iommu driver is probed to the device we can assume no driver
is bound and immediately attach the blocked domain.

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03 14:33                         ` Jason Gunthorpe
@ 2026-02-03 17:45                           ` Nicolin Chen
  2026-02-03 17:55                             ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-02-03 17:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Tue, Feb 03, 2026 at 10:33:48AM -0400, Jason Gunthorpe wrote:
> On Mon, Feb 02, 2026 at 09:13:50PM -0800, Nicolin Chen wrote:
> > On Wed, Jan 28, 2026 at 09:05:20AM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 27, 2026 at 04:49:07PM -0800, dan.j.williams@intel.com wrote:
> > > > > Yes, ARM took the position that ATS should be left disabled for
> > > > > IDENTITY both because of SMMU constraints and also because it made
> > > > > some sense that you wouldn't want ATS overhead just to get a 1:1
> > > > > translation.
> > > > 
> > > > Does this mean that ARM already today does not enable ATS until driver
> > > > attach, or is incremental work needed for that capability?
> > > 
> > > > All of the iommu drivers set up an iommu translation and enable ATS
> > > before any driver is bound.
> > > 
> > > We would need to do more work in the core to leave the translation
> > > blocked when there is no driver. I don't think it is that difficult
> > 
> > > Hmm, not sure if we could use group->domain=NULL as "blocked"..
> 
> > Definitely not, we need to use a proper blocked domain.

Yea, I suspected so.

> > @@ -437,8 +437,6 @@ static int driver_sysfs_add(struct device *dev)
> >  {
> >  	int ret;
> >  
> > -	bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
> > -
> >  	ret = sysfs_create_link(&dev->driver->p->kobj, &dev->kobj,
> >  				kobject_name(&dev->kobj));
> >  	if (ret)
> > @@ -638,10 +636,12 @@ static int really_probe(struct device *dev, const struct device_driver *drv)
> >  	if (ret)
> >  		goto pinctrl_bind_failed;
> >  
> > +	bus_notify(dev, BUS_NOTIFY_BIND_DRIVER);
> > +
> >  	if (dev->bus->dma_configure) {
> >  		ret = dev->bus->dma_configure(dev);
> >  		if (ret)
> > -			goto pinctrl_bind_failed;
> > +			goto bus_notify_bind_failed;
> >  	}
> 
> We shouldn't need any of these? The dma_configure callback already
> gets into the iommu code to validate the domain and restrict VFIO,
> no further callbacks should be needed.
>
> When the iommu driver is probed to the device we can assume no driver
> is bound and immediately attach the blocked domain.

I was trying to use dev->driver that gets set before dma_configure()
and unset after dma_cleanup(). But it looks like we could just keep
track of group->owner_cnt in iommu_device_use/unuse_default_domain().

Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
allowed in general if require_direct=true. I assume this case can be
an exception since there's no point in allowing a device that has no
driver yet to access any reserved region?

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03 17:45                           ` Nicolin Chen
@ 2026-02-03 17:55                             ` Jason Gunthorpe
  2026-02-03 18:50                               ` Nicolin Chen
                                                 ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-03 17:55 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> allowed in general if require_direct=true. I assume this case can be
> an exception since there's no point in allowing a device that has no
> driver yet to access any reserved region?

If require_direct is set then we have to disable this mechanism..

I'm not sure exactly what to do about this as the require_direct comes
from the hypervisor in a CC VM and we probably don't want to give the
hypervisor this kind of escape hatch.

Perhaps we need to lock off to failure on CC VMs if this ever
happens..

But baremetal should just keep working how it always worked in this
case..

Jason

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03 17:55                             ` Jason Gunthorpe
@ 2026-02-03 18:50                               ` Nicolin Chen
  2026-02-04 13:21                                 ` Jason Gunthorpe
  2026-02-03 18:59                               ` Robin Murphy
  2026-02-18 22:56                               ` Nicolin Chen
  2 siblings, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-02-03 18:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Tue, Feb 03, 2026 at 01:55:40PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> > Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> > allowed in general if require_direct=true. I assume this case can be
> > an exception since there's no point in allowing a device that has no
> > driver yet to access any reserved region?
> 
> If require_direct is set then we have to disable this mechanism..
> 
> I'm not sure exactly what to do about this as the require_direct comes
> from the hypervisor in a CC VM and we probably don't want to give the
> hypervisor this kind of escape hatch.
> 
> Perhaps we need to lock off to failure on CC VMs if this ever
> happens..
> 
> But baremetal should just keep working how it always worked in this
> case..

OK. I will put a note in the patch, since it would literally skip
any VM case at this moment.

I just realized a corner case, as iommu_probe_device() may attach
the device to group->domain if it's set:
https://lore.kernel.org/all/9-v5-1b99ae392328+44574-iommu_err_unwind_jgg@nvidia.com/

I am not sure about the use case, but I assume we should skip the
blocking_domain as well in this case?

Then, this makes the condition be:
+	if (!dev->driver && !group->domain && !dev->iommu->require_direct) {
+		ret = __iommu_group_alloc_blocking_domain(group);
+		if (ret)
+			goto err_remove_gdev;
+		group->domain = group->blocking_domain;
+	}

Thanks
Nicolin


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03 17:55                             ` Jason Gunthorpe
  2026-02-03 18:50                               ` Nicolin Chen
@ 2026-02-03 18:59                               ` Robin Murphy
  2026-02-03 19:24                                 ` Nicolin Chen
  2026-02-03 23:16                                 ` Jason Gunthorpe
  2026-02-18 22:56                               ` Nicolin Chen
  2 siblings, 2 replies; 60+ messages in thread
From: Robin Murphy @ 2026-02-03 18:59 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	bhelgaas@google.com, joro@8bytes.org, praan@google.com,
	baolu.lu@linux.intel.com, miko.lenczewski@arm.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-cxl@vger.kernel.org

On 2026-02-03 5:55 pm, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
>> Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
>> allowed in general if require_direct=true. I assume this case can be
>> an exception since there's no point in allowing a device that has no
>> driver yet to access any reserved region?

No, the point of RMRs in general is that the device can be assumed to 
already be accessing them, and that access must be preserved, regardless 
of whether an OS driver may or may not take over the device later. In 
fact RMRs with the "Remapping Permitted" flag are only strictly needed 
*until* an OS driver has taken control over whatever it was that 
firmware left them doing.

> If require_direct is set then we have to disable this mechanism..
> 
> I'm not sure exactly what to do about this as the require_direct comes
> from the hypervisor in a CC VM and we probably don't want to give the
> hypervisor this kind of escape hatch.
> 
> Perhaps we need to lock off to failure on CC VMs if this ever
> happens..
> 
> But baremetal should just keep working how it always worked in this
> case..

Realistically this combination cannot exist bare-metal, since if the 
device requires to send ATS TT's to access an RMR then the SMMU would 
have to be enabled pre-boot, so then the RMR means we cannot ever 
disable it to reconfigure, so we'd be stuffed from the start...

Even though it's potentially a little more plausible in a VM where the 
underlying S2 would satisfy ATS, for much the same reason it's still 
going to look highly suspect from the VM's point of view to be presented 
with a device whose apparent ability to perform ATS traffic through a 
supposedly-disabled S1 SMMU must not be disrupted. However I think there 
would be no point exposing the ATS details to the VM to begin with. It's 
the host's decision to trust the device to play in the translated PA 
space and system cache coherency protocol, and no guest would be allowed 
to mess with those aspects either way, so there seems no obvious good 
reason for them to know at all.

Thanks,
Robin.


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03 18:59                               ` Robin Murphy
@ 2026-02-03 19:24                                 ` Nicolin Chen
  2026-02-03 23:16                                 ` Jason Gunthorpe
  1 sibling, 0 replies; 60+ messages in thread
From: Nicolin Chen @ 2026-02-03 19:24 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Jason Gunthorpe, dan.j.williams, Tian, Kevin, Jonathan Cameron,
	will@kernel.org, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Tue, Feb 03, 2026 at 06:59:35PM +0000, Robin Murphy wrote:
> On 2026-02-03 5:55 pm, Jason Gunthorpe wrote:
> > On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> > > Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> > > allowed in general if require_direct=true. I assume this case can be
> > > an exception since there's no point in allowing a device that has no
> > > driver yet to access any reserved region?
> 
> No, the point of RMRs in general is that the device can be assumed to
> already be accessing them, and that access must be preserved, regardless of
> whether an OS driver may or may not take over the device later.

I see. Thanks for the input.

> In fact RMRs
> with the "Remapping Permitted" flag are only strictly needed *until* an OS
> driver has taken control over whatever it was that firmware left them doing.

Yes. I see that doesn't set require_direct.

Nicolin


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03 18:59                               ` Robin Murphy
  2026-02-03 19:24                                 ` Nicolin Chen
@ 2026-02-03 23:16                                 ` Jason Gunthorpe
  2026-02-04 12:18                                   ` Robin Murphy
  1 sibling, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-03 23:16 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, dan.j.williams, Tian, Kevin, Jonathan Cameron,
	will@kernel.org, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Tue, Feb 03, 2026 at 06:59:35PM +0000, Robin Murphy wrote:
> Realistically this combination cannot exist bare-metal, since if the device
> requires to send ATS TT's to access an RMR then the SMMU would have to be
> enabled pre-boot, so then the RMR means we cannot ever disable it to
> reconfigure, so we'd be stuffed from the start...

This thread has gotten mixed up..

First this series as it is has nothing to do with RMRs.

What the latter part is discussing is a future series to implement
what I think MS calls "boot DMA security". Meaning we don't get into a
position of allowing a device access to OS memory, even through ATS
translated requests, until after userspace has approved the device.

This is something that should combine with Dynamic Root of Trust for
Measurement, as DRTM is much less useful if DMA can mutate the OS code
after the DRTM returns.

It is also meaningful for systems with encrypted PCI where the OS can
measure the PCI device before permitting it access to anything.

So... When we do implement this new security mode, what should it do
if FW attempts to attack the kernel with these nonsensical RMR
configurations? With DRTM we explicitly don't trust the FW for
security anymore, so it is a problem.

I strongly suspect the answer is that RMR has to be ignored in this
more secure mode.

> However I think there would be no point exposing the ATS details to
> the VM to begin with. It's the host's decision to trust the device
> to play in the translated PA space and system cache coherency
> protocol, and no guest would be allowed to mess with those aspects
> either way, so there seems no obvious good reason for them to know
> at all.

If the vSMMU is presented then the guest must be aware of the ATS
because only the guest can generate the ATC invalidations for changes
in the S1.

Without a vSMMU the ATS can be hidden from the guest.

Jason


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03 23:16                                 ` Jason Gunthorpe
@ 2026-02-04 12:18                                   ` Robin Murphy
  2026-02-04 13:20                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Robin Murphy @ 2026-02-04 12:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, dan.j.williams, Tian, Kevin, Jonathan Cameron,
	will@kernel.org, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On 2026-02-03 11:16 pm, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 06:59:35PM +0000, Robin Murphy wrote:
>> Realistically this combination cannot exist bare-metal, since if the device
>> requires to send ATS TT's to access an RMR then the SMMU would have to be
>> enabled pre-boot, so then the RMR means we cannot ever disable it to
>> reconfigure, so we'd be stuffed from the start...
> 
> This thread has gotten mixed up..
> 
> First this series as it is has nothing to do with RMRs.

I know, but you brought up require_direct so I figured it was worth
clarifying that it should not in fact impact ATS decisions, since the
combination of a device requiring ATS while *also* requiring an RMR would
be essentially impossible to support given the SMMU architecture, thus
we can reasonably assume nobody will do such a thing (or at least do it
with any expectation of it ever working).

> What the latter part is discussing is a future series to implement
> what I think MS calls "boot DMA security". Meaning we don't get into a
> position of allowing a device access to OS memory, even through ATS
> translated requests, until after userspace has approved the device.

Pre-boot protection is in the same boat, as things currently stand - an 
OS could either preserve security (by using GBPA to keep traffic blocked 
while reconfiguring the rest of the SMMU), *or* have ongoing DMA for a 
splash screen or whatever; it cannot realistically have both.

> This is something that should combine with Dynamic Root of Trust for
> Measurement, as DRTM is much less useful if DMA can mutate the OS code
> after the DRTM returns.
> 
> It is also meaningful for systems with encrypted PCI where the OS can
> measure the PCI device before permitting it access to anything.
> 
> So... When we do implement this new security mode, what should it do
> if FW attempts to attack the kernel with these nonsensical RMR
> configurations? With DRTM we explicitly don't trust the FW for
> security anymore, so it is a problem.
> 
> I strongly suspect the answer is that RMR has to be ignored in this
> more secure mode.

Yes, I think the only valid case for having an RMR and expecting it to 
work in combination with these other things is if the device has some 
firmware or preloaded configuration in memory which it will still need 
to access at that address once an OS driver starts using it, but does 
not need to access *during* the boot-time handover. Thus it seems fair 
to still honour the reserved regions upon attaching to a default domain, 
but not worry too much about being in a transient blocking state in the 
interim if it's unavoidable for other reasons (at worst maybe just log a 
warning that we're doing so).

>> However I think there would be no point exposing the ATS details to
>> the VM to begin with. It's the host's decision to trust the device
>> to play in the translated PA space and system cache coherency
>> protocol, and no guest would be allowed to mess with those aspects
>> either way, so there seems no obvious good reason for them to know
>> at all.
> 
> If the vSMMU is presented then the guest must be aware of the ATS
> because only the guest can generate the ATC invalidations for changes
> in the S1.

Only if you assume DVM or some other mechanism for the guest to issue S1 
invalidations directly to the hardware - with an emulated CMDQ we can do 
whatever we like.

And in fact, I think we actually *have* to if the host has enabled ATS 
itself, since we cannot assume that a guest is going to choose to use 
it, thus we cannot rely on the guest issuing ATCIs in order to get the 
correct behaviour it expects unless and until we've seen it set EATS 
appropriately in all the corresponding vSTEs. So we would have to forbid 
DVM, and only allow other direct mechanisms that can be dynamically 
enabled for as long as the guest configuration matches... Fun.

Thanks,
Robin.


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-04 12:18                                   ` Robin Murphy
@ 2026-02-04 13:20                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-04 13:20 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, dan.j.williams, Tian, Kevin, Jonathan Cameron,
	will@kernel.org, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Wed, Feb 04, 2026 at 12:18:15PM +0000, Robin Murphy wrote:
> > I strongly suspect the answer is that RMR has to be ignored in this
> > more secure mode.
> 
> Yes, I think the only valid case for having an RMR and expecting it to work
> in combination with these other things is if the device has some firmware or
> preloaded configuration in memory which it will still need to access at that
> address once an OS driver starts using it, but does not need to access
> *during* the boot-time handover. 

Splash screens are the most obvious case here where the framebuffer
may be in DMA'able memory and must go through the iommu..

At least we are already shipping products where the GPU has a DRAM-based
framebuffer, the GPU requires ATS for a lot of functions, but the
framebuffer scan out does not use ATS.

Sigh. So that will be exciting to make work at some point.

> Thus it seems fair to still honour the
> reserved regions upon attaching to a default domain, but not worry too much
> about being in a transient blocking state in the interim if it's unavoidable
> for other reasons (at worst maybe just log a warning that we're
> doing so).

The interest in the blocking state was to disable ATS.

Maybe another approach would be to have a "RMR blocking" domain which is a
paging domain that tells the driver explicitly not to enable ATS for
it.

Then we could validate the RMR range is OK and install this special
domain and still have security against translated TLPs..

> > > However I think there would be no point exposing the ATS details to
> > > the VM to begin with. It's the host's decision to trust the device
> > > to play in the translated PA space and system cache coherency
> > > protocol, and no guest would be allowed to mess with those aspects
> > > either way, so there seems no obvious good reason for them to know
> > > at all.
> > 
> > If the vSMMU is presented then the guest must be aware of the ATS
> > because only the guest can generate the ATC invalidations for changes
> > in the S1.
> 
> Only if you assume DVM or some other mechanism for the guest to issue S1
> invalidations directly to the hardware - with an emulated CMDQ we can do
> whatever we like.

With a lot of work yes, but that is not the model that is implemented
today.

If the hypervisor has to generate an ATC invalidation from an IOTLB
invalidation then it also needs a map of ASID to RID&PASID, which it
can only build by inspecting all the CD tables. The VMMs in nesting
mode don't read the CD tables at all today, so they don't implement
this option.
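
To make that bookkeeping concrete, here is a rough user-space sketch of
the reverse map a hypervisor would need. Everything in it (asid_map,
asid_map_add(), asid_map_fanout(), the fixed table size) is hypothetical
and not taken from any VMM or from the kernel; it only illustrates why
each guest CD table would have to be walked to populate the map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_MAX 64	/* arbitrary bound for the sketch */

/* One context descriptor seen while walking a guest CD table. */
struct cd_entry {
	uint16_t asid;
	uint32_t rid;	/* requester ID owning the CD table */
	uint32_t pasid;	/* index of the CD within that table */
};

struct asid_map {
	struct cd_entry e[MAP_MAX];
	size_t n;
};

/* Record one (ASID, RID, PASID) tuple; returns 0 on success, -1 if full. */
static int asid_map_add(struct asid_map *m, uint16_t asid,
			uint32_t rid, uint32_t pasid)
{
	assert(m);
	if (m->n == MAP_MAX)
		return -1;
	m->e[m->n++] = (struct cd_entry){ asid, rid, pasid };
	return 0;
}

/*
 * Given an IOTLB invalidation by ASID, collect every RID/PASID that would
 * also need an ATC invalidation. Returns the number of targets written.
 */
static size_t asid_map_fanout(const struct asid_map *m, uint16_t asid,
			      struct cd_entry *out, size_t out_max)
{
	size_t i, k = 0;

	assert(m && out);
	for (i = 0; i < m->n && k < out_max; i++)
		if (m->e[i].asid == asid)
			out[k++] = m->e[i];
	return k;
}
```

One ASID can appear under several RIDs and PASIDs, so a single IOTLB
invalidation may fan out into multiple ATC invalidations.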

> And in fact, I think we actually *have* to if the host has enabled ATS
> itself, since we cannot assume that a guest is going to choose to use it,
> thus we cannot rely on the guest issuing ATCIs in order to get the correct
> behaviour it expects unless and until we've seen it set EATS appropriately
> in all the corresponding vSTEs. 

Due to the above we've done the reverse, the host does not get to
unilaterally decide ATS policy, it follows the guest's vEATS setting
so that we never have a situation where the hypervisor has to generate
ATC invalidations.

The kernel offers the VMM the freedom to do it either way, but today
all the VMMs I'm aware of choose the above path.

Jason


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03 18:50                               ` Nicolin Chen
@ 2026-02-04 13:21                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-04 13:21 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Tue, Feb 03, 2026 at 10:50:39AM -0800, Nicolin Chen wrote:
> On Tue, Feb 03, 2026 at 01:55:40PM -0400, Jason Gunthorpe wrote:
> > On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> > > Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> > > allowed in general if require_direct=true. I assume this case can be
> > > an exception since there's no point in allowing a device that has no
> > > driver yet to access any reserved region?
> > 
> > If require_direct is set then we have to disable this mechanism..
> > 
> > I'm not sure exactly what to do about this as the require_direct comes
> > from the hypervisor in a CC VM and we probably don't want to give the
> > hypervisor this kind of escape hatch.
> > 
> > Perhaps we need to lock off to failure on CC VMs if this ever
> > happens..
> > 
> > But baremetal should just keep working how it always worked in this
> > case..
> 
> OK. I will put a note in the patch, since it would literally skip
> any VM case at this moment.
> 
> I just realized a corner case, as iommu_probe_device() may attach
> the device to group->domain if it's set:
> https://lore.kernel.org/all/9-v5-1b99ae392328+44574-iommu_err_unwind_jgg@nvidia.com/
> 
> I am not sure about the use case, but I assume we should skip the
> blocking_domain as well in this case?
> 
> Then, this makes the condition be:
> +	if (!dev->driver && !group->domain && !dev->iommu->require_direct) {
> +               ret = __iommu_group_alloc_blocking_domain(group);
> +               if (ret)
> +                       goto err_remove_gdev;
> +               group->domain = group->blocking_domain;
> +	}

Just to be clear this is some other project "DMA boot security" and
IDK if we need to do it until the CC patches land for user space
device binding policy, or someone seriously implements DRTM..

Jason


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-03 17:55                             ` Jason Gunthorpe
  2026-02-03 18:50                               ` Nicolin Chen
  2026-02-03 18:59                               ` Robin Murphy
@ 2026-02-18 22:56                               ` Nicolin Chen
  2026-02-19 14:37                                 ` Jason Gunthorpe
  2 siblings, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-02-18 22:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Tue, Feb 03, 2026 at 01:55:40PM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> > Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> > allowed in general if require_direct=true. I assume this case can be
> > an exception since there's no point in allowing a device that has no
> > driver yet to access any reserved region?
> 
> If require_direct is set then we have to disable this mechanism..

I found a corner case, which might be another exception here?

Most dma_configure callback functions don't use the default domain
when driver_managed_dma is set. And this breaks MSI on pcieports.

So, I am thinking of doing this:
	bool is_pci_bridge = dev_is_pci(dev) && pci_is_bridge(to_pci_dev(dev));
	[...]
	/*
	 * Block translation requests from a device not bound to a driver yet,
	 * with two exceptions:
	 *  1. IOMMU_RESV_DIRECT (require_direct) must guarantee that the
	 *     device always has access to reserved region(s)
	 *  2. PCI bridges (pcieport, CXL) skip default domain setup in their
	 *     dma_configure callback function because driver_managed_dma is
	 *     set. On the other hand, they require the default domain for MSIs.
	 */
	if (!dev->driver && !group->domain && !dev->iommu->require_direct &&
	    !is_pci_bridge) {
		ret = __iommu_group_alloc_blocking_domain(group);
		if (ret)
			goto err_remove_gdev;
		group->domain = group->blocking_domain;
	}
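
For illustration, the condition above can be modeled in user space as a
pure predicate. This is only a sketch: the mock_* structs and
should_block_unbound() are hypothetical stand-ins, not kernel
definitions, and carry just the fields the check reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the few fields the check reads. */
struct mock_group {
	const void *domain;	/* non-NULL once a domain is attached */
};

struct mock_dev {
	bool has_driver;	/* models dev->driver != NULL */
	bool require_direct;	/* models dev->iommu->require_direct */
	bool is_pci_bridge;	/* models dev_is_pci() && pci_is_bridge() */
};

/* Return true when the sketch would park the device in a blocking domain. */
static bool should_block_unbound(const struct mock_dev *dev,
				 const struct mock_group *group)
{
	assert(dev && group);
	return !dev->has_driver && !group->domain &&
	       !dev->require_direct && !dev->is_pci_bridge;
}
```

Any one of the four inputs being set is enough to skip the blocking
domain, which is what makes each new exception widen the unprotected
set.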

But I am very unsure about the other cases because it could lead
to some "regression" due to this new restriction.

That being said, setting driver_managed_dma while relying on the
default domain somewhat seems like a bug to me. So it feels that
we should fix those drivers instead of making an exception here?

Thanks
Nicolin


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-18 22:56                               ` Nicolin Chen
@ 2026-02-19 14:37                                 ` Jason Gunthorpe
  2026-02-19 16:53                                   ` Nicolin Chen
  0 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-19 14:37 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Wed, Feb 18, 2026 at 02:56:35PM -0800, Nicolin Chen wrote:
> On Tue, Feb 03, 2026 at 01:55:40PM -0400, Jason Gunthorpe wrote:
> > On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> > > Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> > > allowed in general if require_direct=true. I assume this case can be
> > > an exception since there's no point in allowing a device that has no
> > > driver yet to access any reserved region?
> > 
> > If require_direct is set then we have to disable this mechanism..
> 
> I found a corner case, which might be another exception here?

I don't think this blocking security work needs to be part of this
series. We just need to disable the mechanism for untrusted devices.

> Most of dma_configure callback functions don't use default domain
> when driver_managed_dma is set. And this breaks MSI on pcieports.

The ARM MSI aperture need is some special case here. Those drivers
don't use DMA at all so of course they don't have the DMA API setup,
but they do use the MSI aperture on ARM.

Broadly here we were talking about blocked domains for unattached
drivers, but an empty DMA domain is the same thing and still continues
to allow the MSI vectors to work.

So we can reframe this a little bit into more like

if the user requests IDENTITY then the IDENTITY domain is not
installed until just before the driver binds. Up until then it is in
the DMA domain. Meaning if userspace controls driver binding then
unbound drivers have their DMA access blocked by an empty DMA domain.

i.e. we dynamically shift modes for security.

This would only work on singleton groups.
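
Roughly, that policy could be sketched as the user-space function below.
The names are hypothetical, and the fallback for non-singleton groups is
an assumption; the text above only says the scheme is limited to
singleton groups.

```c
#include <assert.h>
#include <stdbool.h>

enum mock_dom { MOCK_DOM_DMA, MOCK_DOM_IDENTITY };

/* Pick the domain a group would sit in under the deferred-IDENTITY idea. */
static enum mock_dom effective_domain(enum mock_dom requested,
				      bool driver_bound, bool singleton)
{
	/* Assumption: groups with several devices keep today's behavior. */
	if (!singleton)
		return requested;
	/* IDENTITY is deferred until a driver is about to bind. */
	if (requested == MOCK_DOM_IDENTITY && !driver_bound)
		return MOCK_DOM_DMA;	/* empty DMA domain: no mappings yet */
	return requested;
}
```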

Jason


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-19 14:37                                 ` Jason Gunthorpe
@ 2026-02-19 16:53                                   ` Nicolin Chen
  2026-02-19 17:41                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-02-19 16:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Thu, Feb 19, 2026 at 10:37:37AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 18, 2026 at 02:56:35PM -0800, Nicolin Chen wrote:
> > On Tue, Feb 03, 2026 at 01:55:40PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Feb 03, 2026 at 09:45:17AM -0800, Nicolin Chen wrote:
> > > > Btw, attaching to IOMMU_DOMAIN_BLOCKED/group->blocking_domain is not
> > > > allowed in general if require_direct=true. I assume this case can be
> > > > an exception since there's no point in allowing a device that has no
> > > > driver yet to access any reserved region?
> > > 
> > > If require_direct is set then we have to disable this mechanism..
> > 
> > I found a corner case, which might be another exception here?
> 
> I don't think this blocking security work needs to be part of this
> series. We just need to disable the mechanism for untrusted devices.

Oh, I thought it should be a prerequisite. I'll separate the patch
then.

> > Most of dma_configure callback functions don't use default domain
> > when driver_managed_dma is set. And this breaks MSI on pcieports.
> 
> The ARM MSI aperture need is some special case here. Those drivers
> don't use DMA at all so of course they don't have the DMA API setup,
> but they do use the MSI aperture on ARM.
> 
> Broadly here we were talking about blocked domains for unattached
> drivers, but an empty DMA domain is the same thing and still continues
> to allow the MSI vectors to work.

I see.

> So we can reframe this a little bit into more like
> 
> if the user requests IDENTITY then the IDENTITY domain is not
> installed until just before the driver binds. Up until then it is in
> the DMA domain. Meaning if userspace controls driver binding then
> unbound drivers have their DMA access blocked by an empty DMA domain.

The thing is that those driver_managed_dma callbacks don't call
iommu_device_use_default_domain(). So, the iommu core loses the
trigger to switch domain from BLOCKED/empty-DMA to DMA/IDENTITY.

The pcieports case via pci_dma_configure() is an example.

My previous approach using BUS_NOTIFY could likely work, but it
needed to adjust the timing between BUS_NOTIFY and dev->driver
setting.

Thanks
Nicolin


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-19 16:53                                   ` Nicolin Chen
@ 2026-02-19 17:41                                     ` Jason Gunthorpe
  2026-02-20  4:52                                       ` Nicolin Chen
  0 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-19 17:41 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Thu, Feb 19, 2026 at 08:53:19AM -0800, Nicolin Chen wrote:
> The thing is that those driver_managed_dma callbacks don't call
> iommu_device_use_default_domain(). So, the iommu core loses the
> trigger to switch domain from BLOCKED/empty-DMA to DMA/IDENTITY.

But they don't use DMA API at all so it doesn't matter to them.

Your issue is that BLOCKED breaks MSI on ARM. That is fixed by using
an empty-DMA API domain as default.

What is missing is to bring back the IDENTITY performance optimization
in a secure way.

Jason


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-19 17:41                                     ` Jason Gunthorpe
@ 2026-02-20  4:52                                       ` Nicolin Chen
  2026-02-20 12:50                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-02-20  4:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Thu, Feb 19, 2026 at 01:41:39PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 19, 2026 at 08:53:19AM -0800, Nicolin Chen wrote:
> > The thing is that those driver_managed_dma callbacks don't call
> > iommu_device_use_default_domain(). So, the iommu core loses the
> > trigger to switch domain from BLOCKED/empty-DMA to DMA/IDENTITY.
> 
> But they don't use DMA API at all so it doesn't matter to them.
> 
> Your issue is that BLOCKED breaks MSI on ARM.

Thanks for the hint!

It actually failed in iommu_dma_prepare_msi() due to having an
IOMMU_COOKIE_NONE in the blocking_domain.

My implementation sets group->domain to group->blocking_domain,
and keeps group->default_domain=NULL to retain the EPROBE_DEFER
validation in iommu_device_use_default_domain().

Then in iommu_dma_prepare_msi(), the group->domain now becomes
valid so it failed due to its unsupported iommu cookie, which
I entirely missed.

I could simply fix this by adding:

@@ -3892,7 +3892,8 @@ int iommu_dma_prepare_msi(struct msi_desc *desc, phys_addr_t msi_addr)

        mutex_lock(&group->mutex);
        /* An IDENTITY domain must pass through */
-       if (group->domain && group->domain->type != IOMMU_DOMAIN_IDENTITY) {
+       if (group->default_domain && group->domain &&
+           group->domain->type != IOMMU_DOMAIN_IDENTITY) {
                switch (group->domain->cookie_type) {
                case IOMMU_COOKIE_DMA_MSI:
                case IOMMU_COOKIE_DMA_IOVA:
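
To make the effect of the extra default_domain test easier to see, here is a minimal userspace sketch of the condition. It is only a model under stated assumptions: the model_* types and model_prepare_msi() are hypothetical stand-ins, not the kernel structures or the real iommu_dma_prepare_msi(), and the -1 return merely stands for "unsupported cookie".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
enum model_domain_type { MODEL_BLOCKED, MODEL_IDENTITY, MODEL_DMA };
enum model_cookie_type {
	MODEL_COOKIE_NONE,
	MODEL_COOKIE_DMA_IOVA,
	MODEL_COOKIE_DMA_MSI,
};

struct model_domain {
	enum model_domain_type type;
	enum model_cookie_type cookie_type;
};

struct model_group {
	struct model_domain *default_domain;	/* NULL while the device is parked */
	struct model_domain *domain;		/* currently attached domain */
};

/*
 * Returns 0 when the cookie handling is skipped or supported, and -1 when
 * the attached domain carries a cookie type that cannot hold an MSI mapping.
 * check_default selects the proposed extra default_domain test.
 */
static int model_prepare_msi(const struct model_group *group, bool check_default)
{
	assert(group != NULL);

	if (check_default && !group->default_domain)
		return 0;	/* parked device: skip the cookie handling */

	if (group->domain && group->domain->type != MODEL_IDENTITY) {
		switch (group->domain->cookie_type) {
		case MODEL_COOKIE_DMA_MSI:
		case MODEL_COOKIE_DMA_IOVA:
			return 0;
		default:
			return -1;	/* e.g. a blocking domain with no cookie */
		}
	}
	return 0;	/* no domain attached, or IDENTITY: pass through */
}
```

In this model, a group parked on a blocking domain with default_domain == NULL fails without the extra test and is skipped with it. Skipping only avoids the error; it does not by itself install any MSI mapping.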

> That is fixed by using
> an empty-DMA API domain as default.

Hmm, even if we set arm_smmu_blocked_domain.type to an empty DMA
(IOMMU_DOMAIN_UNMANAGED?), it still doesn't allocate a cookie?

> What is missing is to bring back the IDENTITY performance optimization
> in a secure way.

I might have got it wrong (from the last part below).
https://lore.kernel.org/linux-iommu/20260127150440.GF1134360@nvidia.com/.

You mean to disable ATS on IDENTITY domains? 

Thanks
Nicolin

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-20  4:52                                       ` Nicolin Chen
@ 2026-02-20 12:50                                         ` Jason Gunthorpe
  2026-02-20 13:22                                           ` Robin Murphy
  2026-02-20 18:49                                           ` Nicolin Chen
  0 siblings, 2 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-20 12:50 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Thu, Feb 19, 2026 at 08:52:56PM -0800, Nicolin Chen wrote:
> > What is missing is to bring back the IDENTITY performance optimization
> > in a secure way.
> 
> I might have got it wrong (from the last part below).
> https://lore.kernel.org/linux-iommu/20260127150440.GF1134360@nvidia.com/.
> 
> You mean to disable ATS on IDENTITY domains? 

The objective of this security step is to keep ATS blocked and
IDENTITY domains disabled until the userspace has "accepted" the
device by binding a driver to it.

The off the cuff suggestion was to just park the device BLOCKED until
a driver is bound. This disables ATS and blocks translation.

That doesn't work on ARM because of the MSI issue.

The next suggestion is to park the device in a real DMA domain with an
actual page table and DMA API hooked up. Now interrupts will work and
the domain is empty so there is no translation. The issue here is the
domain doesn't block ATS. We could fix this with some "disable ATS"
domain flag.

In either case, when the driver is bound and requests that the DMA API
start working, if the user requested IDENTITY then it has to be
switched away from the parked domain to IDENTITY.

A final thought would be to change around the driver managed DMA
mechanism a bit to allow drivers to indicate they use IRQs but not
DMA, then the bind step could switch from a BLOCKED domain to an empty
DMA API domain to allow MSI to work.

Jason

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-20 12:50                                         ` Jason Gunthorpe
@ 2026-02-20 13:22                                           ` Robin Murphy
  2026-02-20 13:51                                             ` Jason Gunthorpe
  2026-02-20 18:49                                           ` Nicolin Chen
  1 sibling, 1 reply; 60+ messages in thread
From: Robin Murphy @ 2026-02-20 13:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Nicolin Chen
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	bhelgaas@google.com, joro@8bytes.org, praan@google.com,
	baolu.lu@linux.intel.com, miko.lenczewski@arm.com,
	linux-arm-kernel@lists.infradead.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-cxl@vger.kernel.org

On 2026-02-20 12:50 pm, Jason Gunthorpe wrote:
> On Thu, Feb 19, 2026 at 08:52:56PM -0800, Nicolin Chen wrote:
>>> What is missing is to bring back the IDENTITY performance optimization
>>> in a secure way.
>>
>> I might have got it wrong (from the last part below).
>> https://lore.kernel.org/linux-iommu/20260127150440.GF1134360@nvidia.com/.
>>
>> You mean to disable ATS on IDENTITY domains?
> 
> The objective of this security step is to keep ATS blocked and
> IDENTITY domains disabled until the userspace has "accepted" the
> device by binding a driver to it.
> 
> The off the cuff suggestion was to just park the device BLOCKED until
> a driver is bound. This disables ATS and blocks translation.
> 
> That doesn't work on ARM because of the MSI issue.

But is that an issue? Until the device has a driver, surely it shouldn't 
be expected to send interrupts at all, much less depend on them being 
received and understood by Linux? The MSI cookie is only populated once 
a driver actually requests some MSI vectors (since it doesn't know what 
ITS address(es) may or may not need mapping), so an empty DMA domain is 
still no better than a true blocking domain in this regard anyway.

Thanks,
Robin.

> The next suggestion is to park the device in a real DMA domain with an
> actual page table and DMA API hooked up. Now interrupts will work and
> the domain is empty so there is no translation. The issue here is the
> domain doesn't block ATS. We could fix this with some "disable ATS"
> domain flag.
> 
> In either case when the driver is bound and requests that the DMA API
> start working if the user requested IDENTITY then it has to be
> switched away from the parked domain to IDENTITY.
> 
> A final thought would be to change around the driver managed DMA
> mechanism a bit to allow drivers to indicate they use IRQs but not
> DMA, then the bind step could switch from a BLOCKED domain to an empty
> DMA API domain to allow MSI to work.
> 
> Jason


* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-20 13:22                                           ` Robin Murphy
@ 2026-02-20 13:51                                             ` Jason Gunthorpe
  2026-02-20 14:45                                               ` Robin Murphy
  0 siblings, 1 reply; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-20 13:51 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, dan.j.williams, Tian, Kevin, Jonathan Cameron,
	will@kernel.org, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Fri, Feb 20, 2026 at 01:22:49PM +0000, Robin Murphy wrote:

> But is that an issue? Until the device has a driver, surely it shouldn't be
> expected to send interrupts at all, much less depend on them being received
> and understood by Linux? The MSI cookie is only populated once a driver
> actually requests some MSI vectors (since it doesn't know what ITS
> address(es) may or may not need mapping), so an empty DMA domain is still no
> better than a true blocking domain in this regard anyway.

Oh, the issue is the driver_managed_dma flag.

In this mode we do bind a driver but the iommu callbacks at driver
bind are not called anymore because that flag says the driver itself
will call them later.

Things like PCI port driver that never issue DMA at all will set the
flag and never make any calls, while still expecting interrupts to
work.

This is why the other option is to rework this somewhat so these
drivers still make a call into the iommu and can get an interrupt
set up.

All of this is only for multi-device groups where we want to ignore
some bad grouping with VFIO on old HW without sufficient ACS. Thinking
about it some more I suspect this entire concept has been broken from
day 1 in VFIO on ARM. If the iommu_group has two members, port driver
and a VFIO device then:

 The port driver will start first, install the ITS page in the DMA
 domain, VFIO will start second and switch the domain to BLOCKED, then
 to PAGING, and the ITS mapping used by the port driver will be lost.

And nobody will notice this has happened because the interrupts in the
port driver are only used for RAS IIRC so the net effect is your
system doesn't print AERs anymore.
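
The sequence above can be modelled in a few lines of userspace C. This is only an illustrative sketch: the model_* names are hypothetical, a single boolean stands in for "the ITS page is mapped in this domain", and nothing here is the kernel's or VFIO's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one flag per domain records the ITS page mapping. */
struct model_domain {
	bool its_page_mapped;
};

/* All members of the group share whatever domain is attached. */
struct model_group {
	struct model_domain *domain;
};

/* A driver requesting MSI vectors installs the ITS page in the current domain. */
static void model_request_msi(struct model_group *group)
{
	assert(group != NULL);
	if (group->domain)
		group->domain->its_page_mapped = true;
}

/* Switching the group's domain carries no mappings over to the new domain. */
static void model_attach(struct model_group *group, struct model_domain *domain)
{
	assert(group != NULL);
	group->domain = domain;
}

/* An MSI write only reaches the ITS if the attached domain maps its page. */
static bool model_msi_delivered(const struct model_group *group)
{
	assert(group != NULL);
	return group->domain && group->domain->its_page_mapped;
}
```

Starting from a DMA domain, calling model_request_msi() for the port driver and then model_attach() twice (BLOCKED, then PAGING) for VFIO leaves model_msi_delivered() false, and nothing in the flow reports the loss.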

Jason

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-20 13:51                                             ` Jason Gunthorpe
@ 2026-02-20 14:45                                               ` Robin Murphy
  2026-02-26 15:10                                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 60+ messages in thread
From: Robin Murphy @ 2026-02-20 14:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Nicolin Chen, dan.j.williams, Tian, Kevin, Jonathan Cameron,
	will@kernel.org, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On 2026-02-20 1:51 pm, Jason Gunthorpe wrote:
> On Fri, Feb 20, 2026 at 01:22:49PM +0000, Robin Murphy wrote:
> 
>> But is that an issue? Until the device has a driver, surely it shouldn't be
>> expected to send interrupts at all, much less depend on them being received
>> and understood by Linux? The MSI cookie is only populated once a driver
>> actually requests some MSI vectors (since it doesn't know what ITS
>> address(es) may or may not need mapping), so an empty DMA domain is still no
>> better than a true blocking domain in this regard anyway.
> 
> Oh, the issue is the driver_managed_dma flag.
> 
> In this mode we do bind a driver but the iommu callbacks at driver
> bind are not called anymore because that flag says the driver itself
> will call them later.
> 
> Things like PCI port driver that never issue DMA at all will set the
> flag and never make any calls, while still expecting interrupts to
> work.
> 
> This is why the other option is to rework this somewhat so these
> drivers still make a call into the iommu and can get an interrupt
> set up.

Or perhaps we handle BUS_NOTIFY_BIND_DRIVER to manage the switch from 
BLOCKED to (empty) DMA independently from whether the driver 
subsequently claims the DMA domain or not? That said, I wouldn't have 
any particular objection to generalising
iommu_device_use_default_domain() into something like
iommu_prepare_default_domain(bool managed) either.

> All of this is only for multi-device groups where we want to ignore
> some bad grouping with VFIO on old HW without sufficient ACS. Thinking
> about it some more I suspect this entire concept has been broken from
> day 1 in VFIO on ARM. If the iommu_group has two members, port driver
> and a VFIO device then:
> 
>   The port driver will start first, install the ITS page in the DMA
>   domain, VFIO will start second and switch the domain to BLOCKED, then
>   to PAGING, and the ITS mapping used by the port driver will be lost.
> 
> And nobody will notice this has happened because the interrupts in the
> port driver are only used for RAS IIRC so the net effect is your
> system doesn't print AERs anymore.

Indeed VFIO's MSI cookie doesn't inherit any existing mappings from the 
DMA domain, and that wouldn't work anyway since the IOVAs would almost 
certainly be different. So we'd have to somehow free any existing AER 
interrupts before the domain switch, then fully re-request and reprogram 
them afterwards, in both DMA->UNMANAGED and UNMANAGED->DMA directions. 
Oof...

Thanks,
Robin.

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-20 12:50                                         ` Jason Gunthorpe
  2026-02-20 13:22                                           ` Robin Murphy
@ 2026-02-20 18:49                                           ` Nicolin Chen
  2026-02-24 14:38                                             ` Jason Gunthorpe
  1 sibling, 1 reply; 60+ messages in thread
From: Nicolin Chen @ 2026-02-20 18:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Fri, Feb 20, 2026 at 08:50:44AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 19, 2026 at 08:52:56PM -0800, Nicolin Chen wrote:
> The next suggestion is to park the device in a real DMA domain with an
> actual page table and DMA API hooked up. Now interrupts will work and
> the domain is empty so there is no translation. The issue here is the
> domain doesn't block ATS. We could fix this with some "disable ATS"
> domain flag.
> 
> In either case when the driver is bound and requests that the DMA API
> start working if the user requested IDENTITY then it has to be
> switched away from the parked domain to IDENTITY.

Thanks for elaborating. This seems very orthogonal to the issue
that driver_managed_dma skips iommu_device_use_default_domain().
(And I see your discussion with Robin.)

Regarding the empty-DMA domain, I have an idea of accommodating
ARM cases with an IOMMU_DOMAIN_MSI_ONLY, which is essentially a
paging domain that only allows IOMMU_COOKIE_DMA_MSI but blocks
everything else.

> A final thought would be to change around the driver managed DMA
> mechanism a bit to allow drivers to indicate they use IRQs but not
> DMA, then the bind step could switch from a BLOCKED domain to an empty
> DMA API domain to allow MSI to work.

Yes, "driver_managed_dma" is so unclear in the pcieport case, since
its driver doesn't really manage DMA...

A separate flag could be clearer. And the IOMMU layer might do an
AND operation between driver_uses_msi(?) flag and another IOMMU
device-level flag msi_behind_iommu?

Thank you
Nicolin

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-20 18:49                                           ` Nicolin Chen
@ 2026-02-24 14:38                                             ` Jason Gunthorpe
  0 siblings, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-24 14:38 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: dan.j.williams, Tian, Kevin, Jonathan Cameron, will@kernel.org,
	robin.murphy@arm.com, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Fri, Feb 20, 2026 at 10:49:09AM -0800, Nicolin Chen wrote:
> On Fri, Feb 20, 2026 at 08:50:44AM -0400, Jason Gunthorpe wrote:
> > On Thu, Feb 19, 2026 at 08:52:56PM -0800, Nicolin Chen wrote:
> > The next suggestion is to park the device in a real DMA domain with an
> > actual page table and DMA API hooked up. Now interrupts will work and
> > the domain is empty so there is no translation. The issue here is the
> > domain doesn't block ATS. We could fix this with some "disable ATS"
> > domain flag.
> > 
> > In either case when the driver is bound and requests that the DMA API
> > start working if the user requested IDENTITY then it has to be
> > switched away from the parked domain to IDENTITY.
> 
> Thanks for elaborating. This seems very orthogonal to the issue
> that driver_managed_dma skips iommu_device_use_default_domain().
> (And I see your discussion with Robin.)
> 
> Regarding the empty-DMA domain, I have an idea of accommodating
> ARM cases with an IOMMU_DOMAIN_MSI_ONLY, which is essentially a
> paging domain that only allows IOMMU_COOKIE_DMA_MSI but blocks
> everything else.

Yeah, maybe, but also we probably don't need such stringent checks
since no driver will be bound while this domain is set up, so the basic
DMA domain is fine too.

> > A final thought would be to change around the driver managed DMA
> > mechanism a bit to allow drivers to indicate they use IRQs but not
> > DMA, then the bind step could switch from a BLOCKED domain to an empty
> > DMA API domain to allow MSI to work.
> 
> Yes, "driver_managed_dma" is so unclear in pcieport case, since
> its driver doesn't really manage DMA...

It should be taken to mean "this device never does DMA when attached
to this driver"
 
> A separate flag could be clear. And the IOMMU layer might do an
> AND operation between driver_uses_msi(?) flag and another IOMMU
> device-level flag msi_behind_iommu?

Perhaps the bool can become an enum:
  NORMAL DMA API
  MANUAL DOMAIN MANAGEMENT
  NO DMA MSI ONLY

Then the latter can switch to a DMA domain out of blocked on driver
binding, and we'd teach VFIO to refuse multi-device groups with MSI
ONLY members on ARM because it doesn't work :\
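
As a rough illustration of the enum idea, the three modes and the two decisions they would drive could look like the userspace sketch below. Every name is hypothetical and chosen for this example; the mapping for the manual mode (stay blocked until the driver attaches its own domain) and the msi_needs_iommu_mapping parameter are assumptions, not existing kernel behaviour or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical replacement for the driver_managed_dma bool. */
enum model_dma_mode {
	MODEL_NORMAL_DMA_API,
	MODEL_MANUAL_DOMAIN_MANAGEMENT,
	MODEL_NO_DMA_MSI_ONLY,
};

enum model_domain_kind {
	MODEL_DOMAIN_BLOCKED,
	MODEL_DOMAIN_EMPTY_DMA,	/* DMA API domain with nothing mapped but MSI */
	MODEL_DOMAIN_DEFAULT,	/* DMA or IDENTITY, per the user's policy */
};

/* Which domain a parked (BLOCKED) device moves to when a driver binds. */
static enum model_domain_kind model_domain_at_bind(enum model_dma_mode mode)
{
	switch (mode) {
	case MODEL_NORMAL_DMA_API:
		return MODEL_DOMAIN_DEFAULT;
	case MODEL_NO_DMA_MSI_ONLY:
		return MODEL_DOMAIN_EMPTY_DMA;	/* lets MSI work, no other DMA */
	case MODEL_MANUAL_DOMAIN_MANAGEMENT:
	default:
		/* Assumption: stays blocked until the driver attaches a domain. */
		return MODEL_DOMAIN_BLOCKED;
	}
}

/*
 * Whether VFIO would accept a group: refuse multi-device groups that
 * contain an MSI-only member when MSIs need an IOMMU mapping (the ARM case).
 */
static bool model_vfio_group_ok(const enum model_dma_mode *members, size_t n,
				bool msi_needs_iommu_mapping)
{
	size_t i;

	assert(members != NULL || n == 0);

	if (n < 2 || !msi_needs_iommu_mapping)
		return true;
	for (i = 0; i < n; i++)
		if (members[i] == MODEL_NO_DMA_MSI_ONLY)
			return false;
	return true;
}
```

With this shape, a port driver would declare the MSI-only mode and get an empty DMA domain at bind, while a group pairing it with a VFIO-owned device would be rejected on systems where the MSI doorbell sits behind the IOMMU.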

Jason

* Re: [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices
  2026-02-20 14:45                                               ` Robin Murphy
@ 2026-02-26 15:10                                                 ` Jason Gunthorpe
  0 siblings, 0 replies; 60+ messages in thread
From: Jason Gunthorpe @ 2026-02-26 15:10 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Nicolin Chen, dan.j.williams, Tian, Kevin, Jonathan Cameron,
	will@kernel.org, bhelgaas@google.com, joro@8bytes.org,
	praan@google.com, baolu.lu@linux.intel.com,
	miko.lenczewski@arm.com, linux-arm-kernel@lists.infradead.org,
	iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-cxl@vger.kernel.org

On Fri, Feb 20, 2026 at 02:45:49PM +0000, Robin Murphy wrote:
> >   The port driver will start first, install the ITS page in the DMA
> >   domain, VFIO will start second and switch the domain to BLOCKED, then
> >   to PAGING, and the ITS mapping used by the port driver will be lost.
> > 
> > And nobody will notice this has happened because the interrupts in the
> > port driver are only used for RAS IIRC so the net effect is your
> > system doesn't print AERs anymore.
> 
> Indeed VFIO's MSI cookie doesn't inherit any existing mappings from the DMA
> domain, and that wouldn't work anyway since the IOVAs would almost certainly
> be different. So we'd have to somehow free any existing AER interrupts
> before the domain switch, then fully re-request and reprogram them
> afterwards, in both DMA->UNMANAGED and UNMANAGED->DMA directions. Oof...

I'm inclined to say we should disable this VFIO feature if CONFIG_IRQ_MSI_IOMMU
is enabled... Better to hard fail than silently lose interrupts.

If that causes problems for people then we should investigate how to
fix the MSI.

It is really hard, but perhaps the most elegant solution would be to
allow the group members to have unique iommu domains by
disambiguating the 'grouping for aliasing' and 'grouping for ACS'
cases. Then the portdrv can stay on its DMA domain and everything
keeps working for it. That sounds like a monster project...

Jason

end of thread, other threads:[~2026-02-26 15:10 UTC | newest]

Thread overview: 60+ messages
2026-01-17  4:56 [PATCH RFCv1 0/3] Allow ATS to be always on for certain ATS-capable devices Nicolin Chen
2026-01-17  4:56 ` [PATCH RFCv1 1/3] PCI: Allow ATS to be always on for CXL.cache capable devices Nicolin Chen
2026-01-19 17:58   ` Jason Gunthorpe
2026-01-21  8:01   ` Tian, Kevin
2026-01-21 10:03     ` Jonathan Cameron
2026-01-21 13:03       ` Jason Gunthorpe
2026-01-22  1:17         ` Baolu Lu
2026-01-22 13:15           ` Jason Gunthorpe
2026-01-22  5:44         ` dan.j.williams
2026-01-22 13:14           ` Jason Gunthorpe
2026-01-22 16:29             ` Nicolin Chen
2026-01-22 16:58               ` Jason Gunthorpe
2026-01-22 19:46             ` dan.j.williams
2026-01-27  8:10               ` Tian, Kevin
2026-01-27 15:04                 ` Jason Gunthorpe
2026-01-28  0:49                   ` dan.j.williams
2026-01-28 13:05                     ` Jason Gunthorpe
2026-02-03  5:13                       ` Nicolin Chen
2026-02-03 14:33                         ` Jason Gunthorpe
2026-02-03 17:45                           ` Nicolin Chen
2026-02-03 17:55                             ` Jason Gunthorpe
2026-02-03 18:50                               ` Nicolin Chen
2026-02-04 13:21                                 ` Jason Gunthorpe
2026-02-03 18:59                               ` Robin Murphy
2026-02-03 19:24                                 ` Nicolin Chen
2026-02-03 23:16                                 ` Jason Gunthorpe
2026-02-04 12:18                                   ` Robin Murphy
2026-02-04 13:20                                     ` Jason Gunthorpe
2026-02-18 22:56                               ` Nicolin Chen
2026-02-19 14:37                                 ` Jason Gunthorpe
2026-02-19 16:53                                   ` Nicolin Chen
2026-02-19 17:41                                     ` Jason Gunthorpe
2026-02-20  4:52                                       ` Nicolin Chen
2026-02-20 12:50                                         ` Jason Gunthorpe
2026-02-20 13:22                                           ` Robin Murphy
2026-02-20 13:51                                             ` Jason Gunthorpe
2026-02-20 14:45                                               ` Robin Murphy
2026-02-26 15:10                                                 ` Jason Gunthorpe
2026-02-20 18:49                                           ` Nicolin Chen
2026-02-24 14:38                                             ` Jason Gunthorpe
2026-01-28  0:57                   ` Tian, Kevin
2026-01-28 13:11                     ` Jason Gunthorpe
2026-01-29  3:28                       ` Tian, Kevin
2026-01-22 10:24         ` Alejandro Lucero Palau
2026-01-17  4:56 ` [PATCH RFCv1 2/3] PCI: Allow ATS to be always on for non-CXL NVIDIA GPUs Nicolin Chen
2026-01-19 18:00   ` Jason Gunthorpe
2026-01-19 18:09     ` Nicolin Chen
2026-01-17  4:56 ` [PATCH RFCv1 3/3] iommu/arm-smmu-v3: Allow ATS to be always on Nicolin Chen
2026-01-19 20:06   ` Jason Gunthorpe
2026-01-26 12:39   ` Will Deacon
2026-01-26 17:20     ` Jason Gunthorpe
2026-01-26 18:40       ` Nicolin Chen
2026-01-26 19:16         ` Jason Gunthorpe
2026-01-26 18:49       ` Robin Murphy
2026-01-26 19:09         ` Jason Gunthorpe
2026-01-27 13:10           ` Will Deacon
2026-01-27 13:26             ` Robin Murphy
2026-01-27 13:50               ` Will Deacon
2026-01-27 14:49                 ` Jason Gunthorpe
2026-01-26 18:21     ` Nicolin Chen
