linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH 0/8] Initial support for SMMUv3 nested translation
@ 2024-08-06 23:41 Jason Gunthorpe
  2024-08-06 23:41 ` [PATCH 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
                   ` (8 more replies)
  0 siblings, 9 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 23:41 UTC (permalink / raw)
  To: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

This brings support for the IOMMUFD ioctls:

 - IOMMU_GET_HW_INFO
 - IOMMU_HWPT_ALLOC_NEST_PARENT
 - IOMMU_DOMAIN_NESTED
 - ops->enforce_cache_coherency()

This is quite straightforward as the nested STE can just be built in the
special NESTED domain op and fed through the generic update machinery.

The design allows the user-provided STE fragment to control several
aspects of the translation, including putting the STE into a "virtual
bypass" or an aborting state. This duplicates functionality available
through other means, but it allows trivially preserving the VMID in the STE
as we eventually move towards the VIOMMU owning the VMID.

Nesting support requires the system to support either S2FWB or the
stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
cache and view incoherent data; currently VFIO lacks any cache flushing
that would make this safe.

Yan has a series to add some of the needed infrastructure for VFIO cache
flushing here:

 https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/

which may someday allow relaxing this requirement further.

Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and is superseded
by this.

This is the first series in what will be several to complete nesting
support. At least:
 - IOMMU_RESV_SW_MSI related fixups
 - VIOMMU object support to allow ATS invalidations
 - vCMDQ hypervisor support for direct invalidation queue assignment
 - KVM pinned VMID using VIOMMU for vBTM
 - Cross instance S2 sharing
 - Virtual Machine Structure using VIOMMU (for vMPAM?)
 - Fault forwarding support through IOMMUFD's fault fd for vSVA

It is enough to allow significant amounts of qemu work to progress.

This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting

Jason Gunthorpe (5):
  vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  iommu/arm-smmu-v3: Use S2FWB when available
  iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED

Nicolin Chen (3):
  ACPI/IORT: Support CANWBS memory access flag
  iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct
    arm_smmu_hw_info
  iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

 drivers/acpi/arm64/iort.c                   |  13 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 398 ++++++++++++++++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  27 ++
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |  16 -
 drivers/iommu/io-pgtable-arm.c              |  24 +-
 drivers/iommu/iommu.c                       |  10 -
 drivers/iommu/iommufd/vfio_compat.c         |   7 +-
 drivers/vfio/vfio_iommu_type1.c             |  12 +-
 include/acpi/actbl2.h                       |   1 +
 include/linux/io-pgtable.h                  |   2 +
 include/linux/iommu.h                       |  54 ++-
 include/uapi/linux/iommufd.h                |  79 ++++
 include/uapi/linux/vfio.h                   |   2 +-
 13 files changed, 572 insertions(+), 73 deletions(-)


base-commit: e5e288d94186b266b062b3e44c82c285dfe68712
-- 
2.46.0



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
@ 2024-08-06 23:41 ` Jason Gunthorpe
  2024-08-07 17:52   ` Alex Williamson
  2024-08-20  8:23   ` Mostafa Saleh
  2024-08-06 23:41 ` [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 23:41 UTC (permalink / raw)
  To: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

This control causes the ARM SMMU drivers to choose a stage 2
implementation for the IO pagetable (vs the usual stage 1 default);
however, this choice has no significant visible impact on the VFIO
user. Further, qemu never implemented this and no other userspace user is
known.

The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add
new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
SMMU translation services to the guest operating system"; however, the
rest of the API to set the guest table pointer for the stage 1 and manage
invalidation was never completed, or at least never upstreamed, rendering
this part dead code.

Upstream has now settled on iommufd as the uAPI for controlling nested
translation. Choosing the stage 2 implementation should be done through
the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation.

Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
enable_nesting iommu_domain_op.

Just in case there is some userspace using this, continue to treat
requesting it as a NOP, but do not advertise support any more.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ----------------
 drivers/iommu/arm/arm-smmu/arm-smmu.c       | 16 ----------------
 drivers/iommu/iommu.c                       | 10 ----------
 drivers/iommu/iommufd/vfio_compat.c         |  7 +------
 drivers/vfio/vfio_iommu_type1.c             | 12 +-----------
 include/linux/iommu.h                       |  3 ---
 include/uapi/linux/vfio.h                   |  2 +-
 7 files changed, 3 insertions(+), 63 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index e5db5325f7eaed..531125f231b662 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3331,21 +3331,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
 	return group;
 }
 
-static int arm_smmu_enable_nesting(struct iommu_domain *domain)
-{
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	int ret = 0;
-
-	mutex_lock(&smmu_domain->init_mutex);
-	if (smmu_domain->smmu)
-		ret = -EPERM;
-	else
-		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
-	mutex_unlock(&smmu_domain->init_mutex);
-
-	return ret;
-}
-
 static int arm_smmu_of_xlate(struct device *dev,
 			     const struct of_phandle_args *args)
 {
@@ -3467,7 +3452,6 @@ static struct iommu_ops arm_smmu_ops = {
 		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
 		.iotlb_sync		= arm_smmu_iotlb_sync,
 		.iova_to_phys		= arm_smmu_iova_to_phys,
-		.enable_nesting		= arm_smmu_enable_nesting,
 		.free			= arm_smmu_domain_free_paging,
 	}
 };
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 723273440c2118..38dad1fd53b80a 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1558,21 +1558,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
 	return group;
 }
 
-static int arm_smmu_enable_nesting(struct iommu_domain *domain)
-{
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	int ret = 0;
-
-	mutex_lock(&smmu_domain->init_mutex);
-	if (smmu_domain->smmu)
-		ret = -EPERM;
-	else
-		smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
-	mutex_unlock(&smmu_domain->init_mutex);
-
-	return ret;
-}
-
 static int arm_smmu_set_pgtable_quirks(struct iommu_domain *domain,
 		unsigned long quirks)
 {
@@ -1656,7 +1641,6 @@ static struct iommu_ops arm_smmu_ops = {
 		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
 		.iotlb_sync		= arm_smmu_iotlb_sync,
 		.iova_to_phys		= arm_smmu_iova_to_phys,
-		.enable_nesting		= arm_smmu_enable_nesting,
 		.set_pgtable_quirks	= arm_smmu_set_pgtable_quirks,
 		.free			= arm_smmu_domain_free,
 	}
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ed6c5cb60c5aee..9da63d57a53cd7 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2723,16 +2723,6 @@ static int __init iommu_init(void)
 }
 core_initcall(iommu_init);
 
-int iommu_enable_nesting(struct iommu_domain *domain)
-{
-	if (domain->type != IOMMU_DOMAIN_UNMANAGED)
-		return -EINVAL;
-	if (!domain->ops->enable_nesting)
-		return -EINVAL;
-	return domain->ops->enable_nesting(domain);
-}
-EXPORT_SYMBOL_GPL(iommu_enable_nesting);
-
 int iommu_set_pgtable_quirks(struct iommu_domain *domain,
 		unsigned long quirk)
 {
diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
index a3ad5f0b6c59dd..514aacd6400949 100644
--- a/drivers/iommu/iommufd/vfio_compat.c
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -291,12 +291,7 @@ static int iommufd_vfio_check_extension(struct iommufd_ctx *ictx,
 	case VFIO_DMA_CC_IOMMU:
 		return iommufd_vfio_cc_iommu(ictx);
 
-	/*
-	 * This is obsolete, and to be removed from VFIO. It was an incomplete
-	 * idea that got merged.
-	 * https://lore.kernel.org/kvm/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/
-	 */
-	case VFIO_TYPE1_NESTING_IOMMU:
+	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
 		return 0;
 
 	/*
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0960699e75543e..13cf6851cc2718 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -72,7 +72,6 @@ struct vfio_iommu {
 	uint64_t		pgsize_bitmap;
 	uint64_t		num_non_pinned_groups;
 	bool			v2;
-	bool			nesting;
 	bool			dirty_page_tracking;
 	struct list_head	emulated_iommu_groups;
 };
@@ -2199,12 +2198,6 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 		goto out_free_domain;
 	}
 
-	if (iommu->nesting) {
-		ret = iommu_enable_nesting(domain->domain);
-		if (ret)
-			goto out_domain;
-	}
-
 	ret = iommu_attach_group(domain->domain, group->iommu_group);
 	if (ret)
 		goto out_domain;
@@ -2545,9 +2538,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	switch (arg) {
 	case VFIO_TYPE1_IOMMU:
 		break;
-	case VFIO_TYPE1_NESTING_IOMMU:
-		iommu->nesting = true;
-		fallthrough;
+	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
 	case VFIO_TYPE1v2_IOMMU:
 		iommu->v2 = true;
 		break;
@@ -2642,7 +2633,6 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	switch (arg) {
 	case VFIO_TYPE1_IOMMU:
 	case VFIO_TYPE1v2_IOMMU:
-	case VFIO_TYPE1_NESTING_IOMMU:
 	case VFIO_UNMAP_ALL:
 		return 1;
 	case VFIO_UPDATE_VADDR:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4d47f2c3331185..15d7657509f662 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -635,7 +635,6 @@ struct iommu_ops {
  * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
  *                           including no-snoop TLPs on PCIe or other platform
  *                           specific mechanisms.
- * @enable_nesting: Enable nesting
  * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
  * @free: Release the domain after use.
  */
@@ -663,7 +662,6 @@ struct iommu_domain_ops {
 				    dma_addr_t iova);
 
 	bool (*enforce_cache_coherency)(struct iommu_domain *domain);
-	int (*enable_nesting)(struct iommu_domain *domain);
 	int (*set_pgtable_quirks)(struct iommu_domain *domain,
 				  unsigned long quirks);
 
@@ -846,7 +844,6 @@ extern void iommu_group_put(struct iommu_group *group);
 extern int iommu_group_id(struct iommu_group *group);
 extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
 
-int iommu_enable_nesting(struct iommu_domain *domain);
 int iommu_set_pgtable_quirks(struct iommu_domain *domain,
 		unsigned long quirks);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf1902f..c8dbf8219c4fcb 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -35,7 +35,7 @@
 #define VFIO_EEH			5
 
 /* Two-stage IOMMU */
-#define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
+#define __VFIO_RESERVED_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
 
 #define VFIO_SPAPR_TCE_v2_IOMMU		7
 
-- 
2.46.0




* [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
  2024-08-06 23:41 ` [PATCH 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
@ 2024-08-06 23:41 ` Jason Gunthorpe
  2024-08-09 14:26   ` Shameerali Kolothum Thodi
  2024-08-20  8:30   ` Mostafa Saleh
  2024-08-06 23:41 ` [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag Jason Gunthorpe
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 23:41 UTC (permalink / raw)
  To: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
works. When S2FWB is supported and enabled, the IOPTE will force cacheable
access to IOMMU_CACHE memory and deny cacheable access otherwise.

This is not especially meaningful for simple S2 domains; it apparently
doesn't even force PCI no-snoop access to be coherent.

However, when used with a nested S1, FWB has the effect of preventing the
guest from choosing a MemAttr that would cause ordinary DMA to bypass the
cache. Consistent with KVM, we wish to deny the guest the ability to become
incoherent with cached memory that the hypervisor believes is cacheable, so
we don't have to flush it.

Turn on S2FWB whenever the SMMU supports it and use it for all S2
mappings.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  6 ++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
 drivers/iommu/io-pgtable-arm.c              | 24 +++++++++++++++++----
 include/linux/io-pgtable.h                  |  2 ++
 4 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 531125f231b662..7fe1e27d11586c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1612,6 +1612,8 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 		FIELD_PREP(STRTAB_STE_1_EATS,
 			   ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
 
+	if (smmu->features & ARM_SMMU_FEAT_S2FWB)
+		target->data[1] |= cpu_to_le64(STRTAB_STE_1_S2FWB);
 	if (smmu->features & ARM_SMMU_FEAT_ATTR_TYPES_OVR)
 		target->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
 							  STRTAB_STE_1_SHCFG_INCOMING));
@@ -2400,6 +2402,8 @@ static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
 		pgtbl_cfg.oas = smmu->oas;
 		fmt = ARM_64_LPAE_S2;
 		finalise_stage_fn = arm_smmu_domain_finalise_s2;
+		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
+			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
 		break;
 	default:
 		return -EINVAL;
@@ -4189,6 +4193,8 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 
 	/* IDR3 */
 	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+	if (FIELD_GET(IDR3_FWB, reg))
+		smmu->features |= ARM_SMMU_FEAT_S2FWB;
 	if (FIELD_GET(IDR3_RIL, reg))
 		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 8851a7abb5f0f3..7e8d2f36faebf3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -55,6 +55,7 @@
 #define IDR1_SIDSIZE			GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3			0xc
+#define IDR3_FWB			(1 << 8)
 #define IDR3_RIL			(1 << 10)
 
 #define ARM_SMMU_IDR5			0x14
@@ -258,6 +259,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
 #define STRTAB_STE_1_S1CSH		GENMASK_ULL(7, 6)
 
 #define STRTAB_STE_1_S1STALLD		(1UL << 27)
+#define STRTAB_STE_1_S2FWB		(1UL << 25)
 
 #define STRTAB_STE_1_EATS		GENMASK_ULL(29, 28)
 #define STRTAB_STE_1_EATS_ABT		0UL
@@ -700,6 +702,7 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_ATTR_TYPES_OVR	(1 << 20)
 #define ARM_SMMU_FEAT_HA		(1 << 21)
 #define ARM_SMMU_FEAT_HD		(1 << 22)
+#define ARM_SMMU_FEAT_S2FWB		(1 << 23)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index f5d9fd1f45bf49..62bbb6037e1686 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -106,6 +106,18 @@
 #define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
 #define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
 #define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
+/*
+ * For !FWB these encode to:
+ *  1111 = Normal outer write back cachable / Inner Write Back Cachable
+ *         Permit S1 to override
+ *  0101 = Normal Non-cachable / Inner Non-cachable
+ *  0001 = Device / Device-nGnRE
+ * For S2FWB these encode to:
+ *  0110 = Force Normal Write Back
+ *  0101 = Normal* is forced Normal-NC, Device unchanged
+ *  0001 = Force Device-nGnRE
+ */
+#define ARM_LPAE_PTE_MEMATTR_FWB_WB	(((arm_lpae_iopte)0x6) << 2)
 #define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
 #define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
 #define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
@@ -458,12 +470,16 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
 	 */
 	if (data->iop.fmt == ARM_64_LPAE_S2 ||
 	    data->iop.fmt == ARM_32_LPAE_S2) {
-		if (prot & IOMMU_MMIO)
+		if (prot & IOMMU_MMIO) {
 			pte |= ARM_LPAE_PTE_MEMATTR_DEV;
-		else if (prot & IOMMU_CACHE)
-			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
-		else
+		} else if (prot & IOMMU_CACHE) {
+			if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_S2FWB)
+				pte |= ARM_LPAE_PTE_MEMATTR_FWB_WB;
+			else
+				pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
+		} else {
 			pte |= ARM_LPAE_PTE_MEMATTR_NC;
+		}
 	} else {
 		if (prot & IOMMU_MMIO)
 			pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index f9a81761bfceda..aff9b020b6dcc7 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -87,6 +87,7 @@ struct io_pgtable_cfg {
 	 *	attributes set in the TCR for a non-coherent page-table walker.
 	 *
 	 * IO_PGTABLE_QUIRK_ARM_HD: Enables dirty tracking in stage 1 pagetable.
+	 * IO_PGTABLE_QUIRK_ARM_S2FWB: Use the FWB format for the MemAttrs bits
 	 */
 	#define IO_PGTABLE_QUIRK_ARM_NS			BIT(0)
 	#define IO_PGTABLE_QUIRK_NO_PERMS		BIT(1)
@@ -95,6 +96,7 @@ struct io_pgtable_cfg {
 	#define IO_PGTABLE_QUIRK_ARM_TTBR1		BIT(5)
 	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA		BIT(6)
 	#define IO_PGTABLE_QUIRK_ARM_HD			BIT(7)
+	#define IO_PGTABLE_QUIRK_ARM_S2FWB		BIT(8)
 	unsigned long			quirks;
 	unsigned long			pgsize_bitmap;
 	unsigned int			ias;
-- 
2.46.0




* [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag
  2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
  2024-08-06 23:41 ` [PATCH 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
  2024-08-06 23:41 ` [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
@ 2024-08-06 23:41 ` Jason Gunthorpe
  2024-08-09 14:36   ` Shameerali Kolothum Thodi
  2024-08-06 23:41 ` [PATCH 4/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS Jason Gunthorpe
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 23:41 UTC (permalink / raw)
  To: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

From: Nicolin Chen <nicolinc@nvidia.com>

The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
Access Flag field in the Memory Access Properties table, mainly for a PCI
Root Complex.

CANWBS describes the coherency of memory accesses that are not marked as
IOWB cacheable/shareable. Its value further implies the coherency impact
from a pair of mismatched memory attributes (e.g. in a nested translation
case):
  0x0: Use of mismatched memory attributes for accesses made by this
       device may lead to a loss of coherency.
  0x1: Coherency of accesses made by this device to locations in
       Conventional memory are ensured as follows, even if the memory
       attributes for the accesses presented by the device or provided by
       the SMMU are different from Inner and Outer Write-back cacheable,
       Shareable.

Note that on HW lacking CANWBS, a loss of coherency typically could occur
with an SMMU that doesn't implement the S2FWB feature; there, additional
cache flush operations would be required to prevent it from happening.

Add a new ACPI_IORT_MF_CANWBS flag and set IOMMU_FWSPEC_PCI_RC_CANWBS upon
the presence of this new flag.

CANWBS and S2FWB are similar features, in that they both guarantee the VM
cannot violate coherency; however, S2FWB can be bypassed by PCI No Snoop
TLPs, while CANWBS cannot. Thus CANWBS meets the requirements to set
IOMMU_CAP_ENFORCE_CACHE_COHERENCY.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/acpi/arm64/iort.c | 13 +++++++++++++
 include/acpi/actbl2.h     |  1 +
 include/linux/iommu.h     |  2 ++
 3 files changed, 16 insertions(+)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 1b39e9ae7ac178..52f5836fa888db 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1218,6 +1218,17 @@ static bool iort_pci_rc_supports_ats(struct acpi_iort_node *node)
 	return pci_rc->ats_attribute & ACPI_IORT_ATS_SUPPORTED;
 }
 
+static bool iort_pci_rc_supports_canwbs(struct acpi_iort_node *node)
+{
+	struct acpi_iort_memory_access *memory_access;
+	struct acpi_iort_root_complex *pci_rc;
+
+	pci_rc = (struct acpi_iort_root_complex *)node->node_data;
+	memory_access =
+		(struct acpi_iort_memory_access *)&pci_rc->memory_properties;
+	return memory_access->memory_flags & ACPI_IORT_MF_CANWBS;
+}
+
 static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node,
 			    u32 streamid)
 {
@@ -1335,6 +1346,8 @@ int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
 		fwspec = dev_iommu_fwspec_get(dev);
 		if (fwspec && iort_pci_rc_supports_ats(node))
 			fwspec->flags |= IOMMU_FWSPEC_PCI_RC_ATS;
+		if (fwspec && iort_pci_rc_supports_canwbs(node))
+			fwspec->flags |= IOMMU_FWSPEC_PCI_RC_CANWBS;
 	} else {
 		node = iort_scan_node(ACPI_IORT_NODE_NAMED_COMPONENT,
 				      iort_match_node_callback, dev);
diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index e27958ef82642f..56ce7fc35312c8 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -524,6 +524,7 @@ struct acpi_iort_memory_access {
 
 #define ACPI_IORT_MF_COHERENCY          (1)
 #define ACPI_IORT_MF_ATTRIBUTES         (1<<1)
+#define ACPI_IORT_MF_CANWBS             (1<<2)
 
 /*
  * IORT node specific subtables
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 15d7657509f662..d1660ec23f263b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -993,6 +993,8 @@ struct iommu_fwspec {
 
 /* ATS is supported */
 #define IOMMU_FWSPEC_PCI_RC_ATS			(1 << 0)
+/* CANWBS is supported */
+#define IOMMU_FWSPEC_PCI_RC_CANWBS		(1 << 1)
 
 /*
  * An iommu attach handle represents a relationship between an iommu domain
-- 
2.46.0




* [PATCH 4/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2024-08-06 23:41 ` [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag Jason Gunthorpe
@ 2024-08-06 23:41 ` Jason Gunthorpe
  2024-08-06 23:41 ` [PATCH 5/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info Jason Gunthorpe
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 23:41 UTC (permalink / raw)
  To: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

HW with CANWBS is always cache coherent and ignores PCI No Snoop requests
as well. This meets the requirement for IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
so let's return it.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 35 +++++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 2 files changed, 36 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7fe1e27d11586c..998c01f4b3d2ee 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2253,6 +2253,9 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
 	case IOMMU_CAP_CACHE_COHERENCY:
 		/* Assume that a coherent TCU implies coherent TBUs */
 		return master->smmu->features & ARM_SMMU_FEAT_COHERENCY;
+	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
+		return dev_iommu_fwspec_get(dev)->flags &
+		       IOMMU_FWSPEC_PCI_RC_CANWBS;
 	case IOMMU_CAP_NOEXEC:
 	case IOMMU_CAP_DEFERRED_FLUSH:
 		return true;
@@ -2263,6 +2266,28 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
 	}
 }
 
+static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_master_domain *master_domain;
+	unsigned long flags;
+	bool ret = false;
+
+	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+	list_for_each_entry(master_domain, &smmu_domain->devices,
+			    devices_elm) {
+		if (!(dev_iommu_fwspec_get(master_domain->master->dev)->flags &
+		      IOMMU_FWSPEC_PCI_RC_CANWBS))
+			goto out;
+	}
+
+	smmu_domain->enforce_cache_coherency = true;
+	ret = true;
+out:
+	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+	return ret;
+}
+
 struct arm_smmu_domain *arm_smmu_domain_alloc(void)
 {
 	struct arm_smmu_domain *smmu_domain;
@@ -2693,6 +2718,15 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
 		 * one of them.
 		 */
 		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+		if (smmu_domain->enforce_cache_coherency &&
+		    !(dev_iommu_fwspec_get(master->dev)->flags &
+		      IOMMU_FWSPEC_PCI_RC_CANWBS)) {
+			kfree(master_domain);
+			spin_unlock_irqrestore(&smmu_domain->devices_lock,
+					       flags);
+			return -EINVAL;
+		}
+
 		if (state->ats_enabled)
 			atomic_inc(&smmu_domain->nr_ats_masters);
 		list_add(&master_domain->devices_elm, &smmu_domain->devices);
@@ -3450,6 +3484,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.owner			= THIS_MODULE,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
 		.attach_dev		= arm_smmu_attach_dev,
+		.enforce_cache_coherency = arm_smmu_enforce_cache_coherency,
 		.set_dev_pasid		= arm_smmu_s1_set_dev_pasid,
 		.map_pages		= arm_smmu_map_pages,
 		.unmap_pages		= arm_smmu_unmap_pages,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 7e8d2f36faebf3..79e1c7a9a218f9 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -787,6 +787,7 @@ struct arm_smmu_domain {
 	/* List of struct arm_smmu_master_domain */
 	struct list_head		devices;
 	spinlock_t			devices_lock;
+	u8				enforce_cache_coherency;
 
 	struct mmu_notifier		mmu_notifier;
 };
-- 
2.46.0




* [PATCH 5/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2024-08-06 23:41 ` [PATCH 4/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS Jason Gunthorpe
@ 2024-08-06 23:41 ` Jason Gunthorpe
  2024-08-06 23:41 ` [PATCH 6/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT Jason Gunthorpe
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 23:41 UTC (permalink / raw)
  To: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

From: Nicolin Chen <nicolinc@nvidia.com>

For virtualization cases the IDR/IIDR/AIDR values of the actual SMMU
instance need to be available to the VMM so it can construct an
appropriate vSMMUv3 that reflects the correct HW capabilities.

For userspace page tables these values are required to constrain the valid
values within the CD table and the IOPTEs.

The kernel does not sanitize these values. If building a VMM, userspace is
required to forward into the VM only the bits that it knows it can
implement. Some bits will also require the VMM to detect if appropriate
kernel support is available, such as for ATS and BTM.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 ++
 include/uapi/linux/iommufd.h                | 35 +++++++++++++++++++++
 3 files changed, 61 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 998c01f4b3d2ee..6bbe4aa7b9511c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2288,6 +2288,29 @@ static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
 	return ret;
 }
 
+static void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct iommu_hw_info_arm_smmuv3 *info;
+	u32 __iomem *base_idr;
+	unsigned int i;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return ERR_PTR(-ENOMEM);
+
+	base_idr = master->smmu->base + ARM_SMMU_IDR0;
+	for (i = 0; i <= 5; i++)
+		info->idr[i] = readl_relaxed(base_idr + i);
+	info->iidr = readl_relaxed(master->smmu->base + ARM_SMMU_IIDR);
+	info->aidr = readl_relaxed(master->smmu->base + ARM_SMMU_AIDR);
+
+	*length = sizeof(*info);
+	*type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
+
+	return info;
+}
+
 struct arm_smmu_domain *arm_smmu_domain_alloc(void)
 {
 	struct arm_smmu_domain *smmu_domain;
@@ -3467,6 +3490,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.identity_domain	= &arm_smmu_identity_domain,
 	.blocked_domain		= &arm_smmu_blocked_domain,
 	.capable		= arm_smmu_capable,
+	.hw_info		= arm_smmu_hw_info,
 	.domain_alloc_paging    = arm_smmu_domain_alloc_paging,
 	.domain_alloc_sva       = arm_smmu_sva_domain_alloc,
 	.domain_alloc_user	= arm_smmu_domain_alloc_user,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 79e1c7a9a218f9..58cd405652e06a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -80,6 +80,8 @@
 #define IIDR_REVISION			GENMASK(15, 12)
 #define IIDR_IMPLEMENTER		GENMASK(11, 0)
 
+#define ARM_SMMU_AIDR			0x1C
+
 #define ARM_SMMU_CR0			0x20
 #define CR0_ATSCHK			(1 << 4)
 #define CR0_CMDQEN			(1 << 3)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 4dde745cfb7e29..83b6e1cd338d8f 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -484,15 +484,50 @@ struct iommu_hw_info_vtd {
 	__aligned_u64 ecap_reg;
 };
 
+/**
+ * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
+ *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
+ *
+ * @flags: Must be set to 0
+ * @__reserved: Must be 0
+ * @idr: Implemented features for ARM SMMU Non-secure programming interface
+ * @iidr: Information about the implementation and implementer of ARM SMMU,
+ *        and architecture version supported
+ * @aidr: ARM SMMU architecture version
+ *
+ * For the details of @idr, @iidr and @aidr, please refer to the chapters
+ * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
+ *
+ * User space should read the underlying ARM SMMUv3 hardware information for
+ * the list of supported features.
+ *
+ * Note that these values reflect the raw HW capability, with no indication of
+ * whether the required kernel driver support is present. Bits may be set that
+ * indicate HW functionality lacking kernel software support, such as BTM. If
+ * a VMM is using this information to construct emulated copies of these
+ * registers it should only forward bits that it knows it can support.
+ *
+ * In the future, the presence of required kernel support will be indicated in @flags.
+ */
+struct iommu_hw_info_arm_smmuv3 {
+	__u32 flags;
+	__u32 __reserved;
+	__u32 idr[6];
+	__u32 iidr;
+	__u32 aidr;
+};
+
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
  *                           info
  * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_NONE = 0,
 	IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
+	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
 };
 
 /**
-- 
2.46.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2024-08-06 23:41 ` [PATCH 5/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info Jason Gunthorpe
@ 2024-08-06 23:41 ` Jason Gunthorpe
  2024-08-09 15:06   ` Robin Murphy
  2024-08-06 23:41 ` [PATCH 7/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED Jason Gunthorpe
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 23:41 UTC (permalink / raw)
  To: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

For SMMUv3 the parent must be an S2 domain, which can be composed
into an IOMMU_DOMAIN_NESTED.

In the future the S2 parent will also need a VMID linked to the VIOMMU,
and even to KVM.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6bbe4aa7b9511c..5faaccef707ef1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 			   const struct iommu_user_data *user_data)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
-	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
+	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
+				 IOMMU_HWPT_ALLOC_NEST_PARENT;
 	struct arm_smmu_domain *smmu_domain;
 	int ret;
 
@@ -3116,6 +3117,14 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 	if (!smmu_domain)
 		return ERR_PTR(-ENOMEM);
 
+	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT) {
+		if (!(master->smmu->features & ARM_SMMU_FEAT_TRANS_S2)) {
+			ret = -EOPNOTSUPP;
+			goto err_free;
+		}
+		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
+	}
+
 	smmu_domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
 	smmu_domain->domain.ops = arm_smmu_ops.default_domain_ops;
 	ret = arm_smmu_domain_finalise(smmu_domain, master->smmu, flags);
-- 
2.46.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 7/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2024-08-06 23:41 ` [PATCH 6/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT Jason Gunthorpe
@ 2024-08-06 23:41 ` Jason Gunthorpe
  2024-08-09 16:05   ` Robin Murphy
  2024-08-06 23:41 ` [PATCH 8/8] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user Jason Gunthorpe
  2024-08-20  8:20 ` [PATCH 0/8] Initial support for SMMUv3 nested translation Mostafa Saleh
  8 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 23:41 UTC (permalink / raw)
  To: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

For SMMUv3 an IOMMU_DOMAIN_NESTED is composed of an S2 iommu_domain acting
as the parent and a user provided STE fragment that defines the CD table
and related data with addresses translated by the S2 iommu_domain.

The kernel only permits userspace to control those bits of the STE that
are safe for user/guest control.

IOTLB maintenance is a bit subtle here: the S1 implicitly includes the
S2 translation, but there is no way of knowing which S1 entries refer to
a given range of the S2.

For the IOTLB we follow ARM's guidance: on any change to the S2, flush
the S2 and then issue a CMDQ_OP_TLBI_NH_ALL to flush all ASIDs under the
VMID.

Similarly we have to flush the entire ATC if the S2 is changed.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 211 +++++++++++++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  20 ++
 include/uapi/linux/iommufd.h                |  20 ++
 3 files changed, 247 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5faaccef707ef1..5dbaffd7937747 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -295,6 +295,7 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
 	case CMDQ_OP_TLBI_NH_ASID:
 		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
 		fallthrough;
+	case CMDQ_OP_TLBI_NH_ALL:
 	case CMDQ_OP_TLBI_S12_VMALL:
 		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
 		break;
@@ -1640,6 +1641,59 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 }
 EXPORT_SYMBOL_IF_KUNIT(arm_smmu_make_s2_domain_ste);
 
+static void arm_smmu_make_nested_cd_table_ste(
+	struct arm_smmu_ste *target, struct arm_smmu_master *master,
+	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
+{
+	arm_smmu_make_s2_domain_ste(target, master, nested_domain->s2_parent,
+				    ats_enabled);
+
+	target->data[0] = cpu_to_le64(STRTAB_STE_0_V |
+				      FIELD_PREP(STRTAB_STE_0_CFG,
+						 STRTAB_STE_0_CFG_NESTED)) |
+			  (nested_domain->ste[0] & ~STRTAB_STE_0_CFG);
+	target->data[1] |= nested_domain->ste[1];
+}
+
+/*
+ * Create a physical STE from the virtual STE that userspace provided when it
+ * created the nested domain. Using the vSTE userspace can request:
+ * - Non-valid STE
+ * - Abort STE
+ * - Bypass STE (install the S2, no CD table)
+ * - CD table STE (install the S2 and the userspace CD table)
+ */
+static void arm_smmu_make_nested_domain_ste(
+	struct arm_smmu_ste *target, struct arm_smmu_master *master,
+	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
+{
+	/*
+	 * Userspace can request a non-valid STE through the nesting interface.
+	 * We relay that into a non-valid physical STE with the intention that
+	 * C_BAD_STE for this SID can be delivered to userspace.
+	 */
+	if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V))) {
+		memset(target, 0, sizeof(*target));
+		return;
+	}
+
+	switch (FIELD_GET(STRTAB_STE_0_CFG,
+			  le64_to_cpu(nested_domain->ste[0]))) {
+	case STRTAB_STE_0_CFG_S1_TRANS:
+		arm_smmu_make_nested_cd_table_ste(target, master, nested_domain,
+						  ats_enabled);
+		break;
+	case STRTAB_STE_0_CFG_BYPASS:
+		arm_smmu_make_s2_domain_ste(
+			target, master, nested_domain->s2_parent, ats_enabled);
+		break;
+	case STRTAB_STE_0_CFG_ABORT:
+	default:
+		arm_smmu_make_abort_ste(target);
+		break;
+	}
+}
+
 /*
  * This can safely directly manipulate the STE memory without a sync sequence
  * because the STE table has not been installed in the SMMU yet.
@@ -2065,7 +2119,16 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
 		if (!master->ats_enabled)
 			continue;
 
-		arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size, &cmd);
+		if (master_domain->nested_parent) {
+			/*
+			 * If a S2 used as a nesting parent is changed we have
+			 * no option but to completely flush the ATC.
+			 */
+			arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
+		} else {
+			arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size,
+						&cmd);
+		}
 
 		for (i = 0; i < master->num_streams; i++) {
 			cmd.atc.sid = master->streams[i].id;
@@ -2192,6 +2255,16 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 	}
 	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
 
+	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S2 &&
+	    smmu_domain->nesting_parent) {
+		/*
+		 * When the S2 domain changes all the nested S1 ASIDs have to be
+		 * flushed too.
+		 */
+		cmd.opcode = CMDQ_OP_TLBI_NH_ALL;
+		arm_smmu_cmdq_issue_cmd_with_sync(smmu_domain->smmu, &cmd);
+	}
+
 	/*
 	 * Unfortunately, this can't be leaf-only since we may have
 	 * zapped an entire table.
@@ -2608,13 +2681,15 @@ arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
 			    ioasid_t ssid)
 {
 	struct arm_smmu_master_domain *master_domain;
+	bool nested_parent = smmu_domain->domain.type == IOMMU_DOMAIN_NESTED;
 
 	lockdep_assert_held(&smmu_domain->devices_lock);
 
 	list_for_each_entry(master_domain, &smmu_domain->devices,
 			    devices_elm) {
 		if (master_domain->master == master &&
-		    master_domain->ssid == ssid)
+		    master_domain->ssid == ssid &&
+		    master_domain->nested_parent == nested_parent)
 			return master_domain;
 	}
 	return NULL;
@@ -2634,6 +2709,9 @@ to_smmu_domain_devices(struct iommu_domain *domain)
 	if ((domain->type & __IOMMU_DOMAIN_PAGING) ||
 	    domain->type == IOMMU_DOMAIN_SVA)
 		return to_smmu_domain(domain);
+	if (domain->type == IOMMU_DOMAIN_NESTED)
+		return container_of(domain, struct arm_smmu_nested_domain,
+				    domain)->s2_parent;
 	return NULL;
 }
 
@@ -2664,6 +2742,7 @@ struct arm_smmu_attach_state {
 	struct iommu_domain *old_domain;
 	struct arm_smmu_master *master;
 	bool cd_needs_ats;
+	bool disable_ats;
 	ioasid_t ssid;
 	/* Resulting state */
 	bool ats_enabled;
@@ -2716,7 +2795,8 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
 		 * enabled if we have arm_smmu_domain, those always have page
 		 * tables.
 		 */
-		state->ats_enabled = arm_smmu_ats_supported(master);
+		state->ats_enabled = !state->disable_ats &&
+				     arm_smmu_ats_supported(master);
 	}
 
 	if (smmu_domain) {
@@ -2725,6 +2805,8 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
 			return -ENOMEM;
 		master_domain->master = master;
 		master_domain->ssid = state->ssid;
+		master_domain->nested_parent = new_domain->type ==
+					       IOMMU_DOMAIN_NESTED;
 
 		/*
 		 * During prepare we want the current smmu_domain and new
@@ -3097,6 +3179,122 @@ static struct iommu_domain arm_smmu_blocked_domain = {
 	.ops = &arm_smmu_blocked_ops,
 };
 
+static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
+				      struct device *dev)
+{
+	struct arm_smmu_nested_domain *nested_domain =
+		container_of(domain, struct arm_smmu_nested_domain, domain);
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct arm_smmu_attach_state state = {
+		.master = master,
+		.old_domain = iommu_get_domain_for_dev(dev),
+		.ssid = IOMMU_NO_PASID,
+		/* Currently invalidation of ATC is not supported */
+		.disable_ats = true,
+	};
+	struct arm_smmu_ste ste;
+	int ret;
+
+	if (arm_smmu_ssids_in_use(&master->cd_table) ||
+	    nested_domain->s2_parent->smmu != master->smmu)
+		return -EINVAL;
+
+	mutex_lock(&arm_smmu_asid_lock);
+	ret = arm_smmu_attach_prepare(&state, domain);
+	if (ret) {
+		mutex_unlock(&arm_smmu_asid_lock);
+		return ret;
+	}
+
+	arm_smmu_make_nested_domain_ste(&ste, master, nested_domain,
+					state.ats_enabled);
+	arm_smmu_install_ste_for_dev(master, &ste);
+	arm_smmu_attach_commit(&state);
+	mutex_unlock(&arm_smmu_asid_lock);
+	return 0;
+}
+
+static void arm_smmu_domain_nested_free(struct iommu_domain *domain)
+{
+	kfree(container_of(domain, struct arm_smmu_nested_domain, domain));
+}
+
+static const struct iommu_domain_ops arm_smmu_nested_ops = {
+	.attach_dev = arm_smmu_attach_dev_nested,
+	.free = arm_smmu_domain_nested_free,
+};
+
+static struct iommu_domain *
+arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
+			      struct iommu_domain *parent,
+			      const struct iommu_user_data *user_data)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	struct arm_smmu_nested_domain *nested_domain;
+	struct arm_smmu_domain *smmu_parent;
+	struct iommu_hwpt_arm_smmuv3 arg;
+	unsigned int eats;
+	unsigned int cfg;
+	int ret;
+
+	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	/*
+	 * Must support some way to prevent the VM from bypassing the cache
+	 * because VFIO currently does not do any cache maintenance.
+	 */
+	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
+	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	ret = iommu_copy_struct_from_user(&arg, user_data,
+					  IOMMU_HWPT_DATA_ARM_SMMUV3, ste);
+	if (ret)
+		return ERR_PTR(ret);
+
+	if (flags || !(master->smmu->features & ARM_SMMU_FEAT_TRANS_S1))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (!(parent->type & __IOMMU_DOMAIN_PAGING))
+		return ERR_PTR(-EINVAL);
+
+	smmu_parent = to_smmu_domain(parent);
+	if (smmu_parent->stage != ARM_SMMU_DOMAIN_S2 ||
+	    smmu_parent->smmu != master->smmu)
+		return ERR_PTR(-EINVAL);
+
+	/* EIO is reserved for invalid STE data. */
+	if ((arg.ste[0] & ~STRTAB_STE_0_NESTING_ALLOWED) ||
+	    (arg.ste[1] & ~STRTAB_STE_1_NESTING_ALLOWED))
+		return ERR_PTR(-EIO);
+
+	cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(arg.ste[0]));
+	if (cfg != STRTAB_STE_0_CFG_ABORT && cfg != STRTAB_STE_0_CFG_BYPASS &&
+	    cfg != STRTAB_STE_0_CFG_S1_TRANS)
+		return ERR_PTR(-EIO);
+
+	eats = FIELD_GET(STRTAB_STE_1_EATS, le64_to_cpu(arg.ste[1]));
+	if (eats != STRTAB_STE_1_EATS_ABT)
+		return ERR_PTR(-EIO);
+
+	if (cfg != STRTAB_STE_0_CFG_S1_TRANS)
+		eats = STRTAB_STE_1_EATS_ABT;
+
+	nested_domain = kzalloc(sizeof(*nested_domain), GFP_KERNEL_ACCOUNT);
+	if (!nested_domain)
+		return ERR_PTR(-ENOMEM);
+
+	nested_domain->domain.type = IOMMU_DOMAIN_NESTED;
+	nested_domain->domain.ops = &arm_smmu_nested_ops;
+	nested_domain->s2_parent = smmu_parent;
+	nested_domain->ste[0] = arg.ste[0];
+	nested_domain->ste[1] = arg.ste[1] & ~cpu_to_le64(STRTAB_STE_1_EATS);
+
+	return &nested_domain->domain;
+}
+
 static struct iommu_domain *
 arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 			   struct iommu_domain *parent,
@@ -3108,9 +3306,13 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 	struct arm_smmu_domain *smmu_domain;
 	int ret;
 
+	if (parent)
+		return arm_smmu_domain_alloc_nesting(dev, flags, parent,
+						     user_data);
+
 	if (flags & ~PAGING_FLAGS)
 		return ERR_PTR(-EOPNOTSUPP);
-	if (parent || user_data)
+	if (user_data)
 		return ERR_PTR(-EOPNOTSUPP);
 
 	smmu_domain = arm_smmu_domain_alloc();
@@ -3123,6 +3325,7 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 			goto err_free;
 		}
 		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
+		smmu_domain->nesting_parent = true;
 	}
 
 	smmu_domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 58cd405652e06a..e149eddb568e7e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -240,6 +240,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
 #define STRTAB_STE_0_CFG_BYPASS		4
 #define STRTAB_STE_0_CFG_S1_TRANS	5
 #define STRTAB_STE_0_CFG_S2_TRANS	6
+#define STRTAB_STE_0_CFG_NESTED		7
 
 #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)
 #define STRTAB_STE_0_S1FMT_LINEAR	0
@@ -291,6 +292,15 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
 
 #define STRTAB_STE_3_S2TTB_MASK		GENMASK_ULL(51, 4)
 
+/* These bits can be controlled by userspace for STRTAB_STE_0_CFG_NESTED */
+#define STRTAB_STE_0_NESTING_ALLOWED                                         \
+	cpu_to_le64(STRTAB_STE_0_V | STRTAB_STE_0_CFG | STRTAB_STE_0_S1FMT | \
+		    STRTAB_STE_0_S1CTXPTR_MASK | STRTAB_STE_0_S1CDMAX)
+#define STRTAB_STE_1_NESTING_ALLOWED                            \
+	cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |   \
+		    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |   \
+		    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_EATS)
+
 /*
  * Context descriptors.
  *
@@ -508,6 +518,7 @@ struct arm_smmu_cmdq_ent {
 			};
 		} cfgi;
 
+		#define CMDQ_OP_TLBI_NH_ALL     0x10
 		#define CMDQ_OP_TLBI_NH_ASID	0x11
 		#define CMDQ_OP_TLBI_NH_VA	0x12
 		#define CMDQ_OP_TLBI_EL2_ALL	0x20
@@ -792,6 +803,14 @@ struct arm_smmu_domain {
 	u8				enforce_cache_coherency;
 
 	struct mmu_notifier		mmu_notifier;
+	bool				nesting_parent : 1;
+};
+
+struct arm_smmu_nested_domain {
+	struct iommu_domain domain;
+	struct arm_smmu_domain *s2_parent;
+
+	__le64 ste[2];
 };
 
 /* The following are exposed for testing purposes. */
@@ -830,6 +849,7 @@ struct arm_smmu_master_domain {
 	struct list_head devices_elm;
 	struct arm_smmu_master *master;
 	ioasid_t ssid;
+	u8 nested_parent;
 };
 
 static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 83b6e1cd338d8f..76e9ad6c9403af 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -394,14 +394,34 @@ struct iommu_hwpt_vtd_s1 {
 	__u32 __reserved;
 };
 
+/**
+ * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 Context Descriptor Table info
+ *                                (IOMMU_HWPT_DATA_ARM_SMMUV3)
+ *
+ * @ste: The first two double words of the user space Stream Table Entry for
+ *       a user stage-1 Context Descriptor Table. Must be little-endian.
+ *       Allowed fields: (Refer to "5.2 Stream Table Entry" in SMMUv3 HW Spec)
+ *       - word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
+ *       - word-1: S1DSS, S1CIR, S1COR, S1CSH, S1STALLD
+ *
+ * -EIO will be returned if @ste is not legal or contains any non-allowed field.
+ * Cfg can be used to select a S1, Bypass or Abort configuration. A Bypass
+ * nested domain will translate the same as the nesting parent.
+ */
+struct iommu_hwpt_arm_smmuv3 {
+	__aligned_le64 ste[2];
+};
+
 /**
  * enum iommu_hwpt_data_type - IOMMU HWPT Data Type
  * @IOMMU_HWPT_DATA_NONE: no data
  * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
+ * @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
  */
 enum iommu_hwpt_data_type {
 	IOMMU_HWPT_DATA_NONE = 0,
 	IOMMU_HWPT_DATA_VTD_S1 = 1,
+	IOMMU_HWPT_DATA_ARM_SMMUV3 = 2,
 };
 
 /**
-- 
2.46.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 8/8] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
  2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2024-08-06 23:41 ` [PATCH 7/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED Jason Gunthorpe
@ 2024-08-06 23:41 ` Jason Gunthorpe
  2024-08-20  8:20 ` [PATCH 0/8] Initial support for SMMUv3 nested translation Mostafa Saleh
  8 siblings, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-06 23:41 UTC (permalink / raw)
  To: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

From: Nicolin Chen <nicolinc@nvidia.com>

Add arm_smmu_cache_invalidate_user() function for user space to invalidate
IOTLB entries that are still cached by the hardware.

Add struct iommu_hwpt_arm_smmuv3_invalidate defining an invalidation entry
that is simply the native format of a 128-bit TLBI command. Scan commands
against the permitted command list and fix their VMID fields.

Co-developed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 95 +++++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 include/linux/iommu.h                       | 49 ++++++++++-
 include/uapi/linux/iommufd.h                | 24 ++++++
 4 files changed, 168 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5dbaffd7937747..24836f3269b3f4 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3219,9 +3219,96 @@ static void arm_smmu_domain_nested_free(struct iommu_domain *domain)
 	kfree(container_of(domain, struct arm_smmu_nested_domain, domain));
 }
 
+/*
+ * Convert, in place, the raw invalidation command into an internal format that
+ * can be passed to arm_smmu_cmdq_issue_cmdlist(). Internally commands are
+ * stored in CPU endian.
+ *
+ * Enforce the VMID on the command.
+ */
+static int
+arm_smmu_convert_user_cmd(struct arm_smmu_nested_domain *nested_domain,
+			  struct iommu_hwpt_arm_smmuv3_invalidate *cmd)
+{
+	u16 vmid = nested_domain->s2_parent->s2_cfg.vmid;
+
+	cmd->cmd[0] = le64_to_cpu(cmd->cmd[0]);
+	cmd->cmd[1] = le64_to_cpu(cmd->cmd[1]);
+
+	switch (cmd->cmd[0] & CMDQ_0_OP) {
+	case CMDQ_OP_TLBI_NSNH_ALL:
+		/* Convert to NH_ALL */
+		cmd->cmd[0] = CMDQ_OP_TLBI_NH_ALL |
+			      FIELD_PREP(CMDQ_TLBI_0_VMID, vmid);
+		cmd->cmd[1] = 0;
+		break;
+	case CMDQ_OP_TLBI_NH_VA:
+	case CMDQ_OP_TLBI_NH_VAA:
+	case CMDQ_OP_TLBI_NH_ALL:
+	case CMDQ_OP_TLBI_NH_ASID:
+		cmd->cmd[0] &= ~CMDQ_TLBI_0_VMID;
+		cmd->cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, vmid);
+		break;
+	default:
+		return -EIO;
+	}
+	return 0;
+}
+
+static int arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
+					  struct iommu_user_data_array *array)
+{
+	struct arm_smmu_nested_domain *nested_domain =
+		container_of(domain, struct arm_smmu_nested_domain, domain);
+	struct arm_smmu_device *smmu = nested_domain->s2_parent->smmu;
+	struct iommu_hwpt_arm_smmuv3_invalidate *last_batch;
+	struct iommu_hwpt_arm_smmuv3_invalidate *cmds;
+	struct iommu_hwpt_arm_smmuv3_invalidate *cur;
+	struct iommu_hwpt_arm_smmuv3_invalidate *end;
+	int ret;
+
+	cmds = kcalloc(array->entry_num, sizeof(*cmds), GFP_KERNEL);
+	if (!cmds)
+		return -ENOMEM;
+	cur = cmds;
+	end = cmds + array->entry_num;
+
+	static_assert(sizeof(*cmds) == 2 * sizeof(u64));
+	ret = iommu_copy_struct_from_full_user_array(
+		cmds, sizeof(*cmds), array,
+		IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3);
+	if (ret)
+		goto out;
+
+	last_batch = cmds;
+	while (cur != end) {
+		ret = arm_smmu_convert_user_cmd(nested_domain, cur);
+		if (ret)
+			goto out;
+
+		/* FIXME work in blocks of CMDQ_BATCH_ENTRIES and copy each block? */
+		cur++;
+		if (cur != end && (cur - last_batch) != CMDQ_BATCH_ENTRIES - 1)
+			continue;
+
+		ret = arm_smmu_cmdq_issue_cmdlist(smmu, last_batch->cmd,
+						  cur - last_batch, true);
+		if (ret) {
+			cur--;
+			goto out;
+		}
+		last_batch = cur;
+	}
+out:
+	array->entry_num = cur - cmds;
+	kfree(cmds);
+	return ret;
+}
+
 static const struct iommu_domain_ops arm_smmu_nested_ops = {
 	.attach_dev = arm_smmu_attach_dev_nested,
 	.free = arm_smmu_domain_nested_free,
+	.cache_invalidate_user	= arm_smmu_cache_invalidate_user,
 };
 
 static struct iommu_domain *
@@ -3249,6 +3336,14 @@ arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
 	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
 		return ERR_PTR(-EOPNOTSUPP);
 
+	/*
+	 * FORCE_SYNC is not set with FEAT_NESTING. Some study of the exact HW
+	 * defect is needed to determine if arm_smmu_cache_invalidate_user()
+	 * needs any change to remove this.
+	 */
+	if (WARN_ON(master->smmu->options & ARM_SMMU_OPT_CMDQ_FORCE_SYNC))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	ret = iommu_copy_struct_from_user(&arg, user_data,
 					  IOMMU_HWPT_DATA_ARM_SMMUV3, ste);
 	if (ret)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index e149eddb568e7e..3f7442f0167efb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -521,6 +521,7 @@ struct arm_smmu_cmdq_ent {
 		#define CMDQ_OP_TLBI_NH_ALL     0x10
 		#define CMDQ_OP_TLBI_NH_ASID	0x11
 		#define CMDQ_OP_TLBI_NH_VA	0x12
+		#define CMDQ_OP_TLBI_NH_VAA	0x13
 		#define CMDQ_OP_TLBI_EL2_ALL	0x20
 		#define CMDQ_OP_TLBI_EL2_ASID	0x21
 		#define CMDQ_OP_TLBI_EL2_VA	0x22
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d1660ec23f263b..b0323290cb6c72 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -491,7 +491,9 @@ static inline int __iommu_copy_struct_from_user_array(
  * @index: Index to the location in the array to copy user data from
  * @min_last: The last member of the data structure @kdst points in the
  *            initial version.
- * Return 0 for success, otherwise -error.
+ *
+ * Copy a single entry from a user array. Return 0 for success, otherwise
+ * -error.
  */
 #define iommu_copy_struct_from_user_array(kdst, user_array, data_type, index, \
 					  min_last)                           \
@@ -499,6 +501,51 @@ static inline int __iommu_copy_struct_from_user_array(
 		kdst, user_array, data_type, index, sizeof(*(kdst)),          \
 		offsetofend(typeof(*(kdst)), min_last))
 
+
+/**
+ * iommu_copy_struct_from_full_user_array - Copy iommu driver specific user
+ *         space data from an iommu_user_data_array
+ * @kdst: Pointer to an iommu driver specific user data that is defined in
+ *        include/uapi/linux/iommufd.h
+ * @kdst_entry_size: sizeof(*kdst)
+ * @user_array: Pointer to a struct iommu_user_data_array for a user space
+ *              array
+ * @data_type: The data type of the @kdst. Must match with @user_array->type
+ *
+ * Copy the entire user array. kdst must have room for kdst_entry_size *
+ * user_array->entry_num bytes. Return 0 for success, otherwise -error.
+ */
+static inline int
+iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size,
+				       struct iommu_user_data_array *user_array,
+				       unsigned int data_type)
+{
+	unsigned int i;
+	int ret;
+
+	if (user_array->type != data_type)
+		return -EINVAL;
+	if (!user_array->entry_num)
+		return -EINVAL;
+	if (likely(user_array->entry_len == kdst_entry_size)) {
+		if (copy_from_user(kdst, user_array->uptr,
+				   user_array->entry_num *
+					   user_array->entry_len))
+			return -EFAULT;
+		return 0;
+	}
+
+	/* Copy item by item */
+	for (i = 0; i != user_array->entry_num; i++) {
+		ret = copy_struct_from_user(
+			kdst + kdst_entry_size * i, kdst_entry_size,
+			user_array->uptr + user_array->entry_len * i,
+			user_array->entry_len);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
 /**
  * struct iommu_ops - iommu ops and capabilities
  * @capable: check capability
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 76e9ad6c9403af..f2d1677ddec445 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -682,9 +682,11 @@ struct iommu_hwpt_get_dirty_bitmap {
  * enum iommu_hwpt_invalidate_data_type - IOMMU HWPT Cache Invalidation
  *                                        Data Type
  * @IOMMU_HWPT_INVALIDATE_DATA_VTD_S1: Invalidation data for VTD_S1
+ * @IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3: Invalidation data for ARM SMMUv3
  */
 enum iommu_hwpt_invalidate_data_type {
 	IOMMU_HWPT_INVALIDATE_DATA_VTD_S1 = 0,
+	IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3 = 1,
 };
 
 /**
@@ -723,6 +725,28 @@ struct iommu_hwpt_vtd_s1_invalidate {
 	__u32 __reserved;
 };
 
+/**
+ * struct iommu_hwpt_arm_smmuv3_invalidate - ARM SMMUv3 cache invalidation
+ *         (IOMMU_HWPT_INVALIDATE_DATA_ARM_SMMUV3)
+ * @cmd: 128-bit cache invalidation command that runs in SMMU CMDQ.
+ *       Must be little-endian.
+ *
+ * Supported command list:
+ *     CMDQ_OP_TLBI_NSNH_ALL
+ *     CMDQ_OP_TLBI_NH_VA
+ *     CMDQ_OP_TLBI_NH_VAA
+ *     CMDQ_OP_TLBI_NH_ALL
+ *     CMDQ_OP_TLBI_NH_ASID
+ *
+ * This API does not support ATS invalidation. Userspace must not request EATS,
+ * or enable ATS in the IDR.
+ *
+ * -EIO will be returned if the command is not supported.
+ */
+struct iommu_hwpt_arm_smmuv3_invalidate {
+	__aligned_u64 cmd[2];
+};
+
 /**
  * struct iommu_hwpt_invalidate - ioctl(IOMMU_HWPT_INVALIDATE)
  * @size: sizeof(struct iommu_hwpt_invalidate)
-- 
2.46.0



^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  2024-08-06 23:41 ` [PATCH 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
@ 2024-08-07 17:52   ` Alex Williamson
  2024-08-20  8:23   ` Mostafa Saleh
  1 sibling, 0 replies; 34+ messages in thread
From: Alex Williamson @ 2024-08-07 17:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Eric Auger, Jean-Philippe Brucker, Moritz Fischer,
	Michael Shavit, Nicolin Chen, patches, Shameerali Kolothum Thodi

On Tue,  6 Aug 2024 20:41:14 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> This control causes the ARM SMMU drivers to choose a stage 2
> implementation for the IO pagetable (vs the usual stage 1 default);
> however, this choice has no significant visible impact on the VFIO
> user. Further, QEMU never implemented this, and no other userspace user
> is known.
> The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add
> new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
> SMMU translation services to the guest operating system" however the rest
> of the API to set the guest table pointer for the stage 1 and manage
> invalidation was never completed, or at least never upstreamed,
> rendering this part dead code.
> 
> Upstream has now settled on iommufd as the uAPI for controlling nested
> translation. Choosing the stage 2 implementation should be done through
> the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation.
> 
> Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
> enable_nesting iommu_domain_op.
> 
> Just in case there is some userspace using this, continue to treat
> requesting it as a NOP, but do not advertise support any more.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ----------------
>  drivers/iommu/arm/arm-smmu/arm-smmu.c       | 16 ----------------
>  drivers/iommu/iommu.c                       | 10 ----------
>  drivers/iommu/iommufd/vfio_compat.c         |  7 +------
>  drivers/vfio/vfio_iommu_type1.c             | 12 +-----------
>  include/linux/iommu.h                       |  3 ---
>  include/uapi/linux/vfio.h                   |  2 +-
>  7 files changed, 3 insertions(+), 63 deletions(-)

I think we should also wait for Eric's ack on this when he's back in
the office, but

Acked-by: Alex Williamson <alex.williamson@redhat.com>

> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index e5db5325f7eaed..531125f231b662 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3331,21 +3331,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
>  	return group;
>  }
>  
> -static int arm_smmu_enable_nesting(struct iommu_domain *domain)
> -{
> -	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> -	int ret = 0;
> -
> -	mutex_lock(&smmu_domain->init_mutex);
> -	if (smmu_domain->smmu)
> -		ret = -EPERM;
> -	else
> -		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> -	mutex_unlock(&smmu_domain->init_mutex);
> -
> -	return ret;
> -}
> -
>  static int arm_smmu_of_xlate(struct device *dev,
>  			     const struct of_phandle_args *args)
>  {
> @@ -3467,7 +3452,6 @@ static struct iommu_ops arm_smmu_ops = {
>  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
>  		.iotlb_sync		= arm_smmu_iotlb_sync,
>  		.iova_to_phys		= arm_smmu_iova_to_phys,
> -		.enable_nesting		= arm_smmu_enable_nesting,
>  		.free			= arm_smmu_domain_free_paging,
>  	}
>  };
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> index 723273440c2118..38dad1fd53b80a 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> @@ -1558,21 +1558,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
>  	return group;
>  }
>  
> -static int arm_smmu_enable_nesting(struct iommu_domain *domain)
> -{
> -	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> -	int ret = 0;
> -
> -	mutex_lock(&smmu_domain->init_mutex);
> -	if (smmu_domain->smmu)
> -		ret = -EPERM;
> -	else
> -		smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
> -	mutex_unlock(&smmu_domain->init_mutex);
> -
> -	return ret;
> -}
> -
>  static int arm_smmu_set_pgtable_quirks(struct iommu_domain *domain,
>  		unsigned long quirks)
>  {
> @@ -1656,7 +1641,6 @@ static struct iommu_ops arm_smmu_ops = {
>  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
>  		.iotlb_sync		= arm_smmu_iotlb_sync,
>  		.iova_to_phys		= arm_smmu_iova_to_phys,
> -		.enable_nesting		= arm_smmu_enable_nesting,
>  		.set_pgtable_quirks	= arm_smmu_set_pgtable_quirks,
>  		.free			= arm_smmu_domain_free,
>  	}
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index ed6c5cb60c5aee..9da63d57a53cd7 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2723,16 +2723,6 @@ static int __init iommu_init(void)
>  }
>  core_initcall(iommu_init);
>  
> -int iommu_enable_nesting(struct iommu_domain *domain)
> -{
> -	if (domain->type != IOMMU_DOMAIN_UNMANAGED)
> -		return -EINVAL;
> -	if (!domain->ops->enable_nesting)
> -		return -EINVAL;
> -	return domain->ops->enable_nesting(domain);
> -}
> -EXPORT_SYMBOL_GPL(iommu_enable_nesting);
> -
>  int iommu_set_pgtable_quirks(struct iommu_domain *domain,
>  		unsigned long quirk)
>  {
> diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
> index a3ad5f0b6c59dd..514aacd6400949 100644
> --- a/drivers/iommu/iommufd/vfio_compat.c
> +++ b/drivers/iommu/iommufd/vfio_compat.c
> @@ -291,12 +291,7 @@ static int iommufd_vfio_check_extension(struct iommufd_ctx *ictx,
>  	case VFIO_DMA_CC_IOMMU:
>  		return iommufd_vfio_cc_iommu(ictx);
>  
> -	/*
> -	 * This is obsolete, and to be removed from VFIO. It was an incomplete
> -	 * idea that got merged.
> -	 * https://lore.kernel.org/kvm/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/
> -	 */
> -	case VFIO_TYPE1_NESTING_IOMMU:
> +	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
>  		return 0;
>  
>  	/*
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0960699e75543e..13cf6851cc2718 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -72,7 +72,6 @@ struct vfio_iommu {
>  	uint64_t		pgsize_bitmap;
>  	uint64_t		num_non_pinned_groups;
>  	bool			v2;
> -	bool			nesting;
>  	bool			dirty_page_tracking;
>  	struct list_head	emulated_iommu_groups;
>  };
> @@ -2199,12 +2198,6 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  		goto out_free_domain;
>  	}
>  
> -	if (iommu->nesting) {
> -		ret = iommu_enable_nesting(domain->domain);
> -		if (ret)
> -			goto out_domain;
> -	}
> -
>  	ret = iommu_attach_group(domain->domain, group->iommu_group);
>  	if (ret)
>  		goto out_domain;
> @@ -2545,9 +2538,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	switch (arg) {
>  	case VFIO_TYPE1_IOMMU:
>  		break;
> -	case VFIO_TYPE1_NESTING_IOMMU:
> -		iommu->nesting = true;
> -		fallthrough;
> +	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
>  	case VFIO_TYPE1v2_IOMMU:
>  		iommu->v2 = true;
>  		break;
> @@ -2642,7 +2633,6 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>  	switch (arg) {
>  	case VFIO_TYPE1_IOMMU:
>  	case VFIO_TYPE1v2_IOMMU:
> -	case VFIO_TYPE1_NESTING_IOMMU:
>  	case VFIO_UNMAP_ALL:
>  		return 1;
>  	case VFIO_UPDATE_VADDR:
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 4d47f2c3331185..15d7657509f662 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -635,7 +635,6 @@ struct iommu_ops {
>   * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
>   *                           including no-snoop TLPs on PCIe or other platform
>   *                           specific mechanisms.
> - * @enable_nesting: Enable nesting
>   * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>   * @free: Release the domain after use.
>   */
> @@ -663,7 +662,6 @@ struct iommu_domain_ops {
>  				    dma_addr_t iova);
>  
>  	bool (*enforce_cache_coherency)(struct iommu_domain *domain);
> -	int (*enable_nesting)(struct iommu_domain *domain);
>  	int (*set_pgtable_quirks)(struct iommu_domain *domain,
>  				  unsigned long quirks);
>  
> @@ -846,7 +844,6 @@ extern void iommu_group_put(struct iommu_group *group);
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
>  
> -int iommu_enable_nesting(struct iommu_domain *domain);
>  int iommu_set_pgtable_quirks(struct iommu_domain *domain,
>  		unsigned long quirks);
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2b68e6cdf1902f..c8dbf8219c4fcb 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -35,7 +35,7 @@
>  #define VFIO_EEH			5
>  
>  /* Two-stage IOMMU */
> -#define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
> +#define __VFIO_RESERVED_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
>  
>  #define VFIO_SPAPR_TCE_v2_IOMMU		7
>  




* RE: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-06 23:41 ` [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
@ 2024-08-09 14:26   ` Shameerali Kolothum Thodi
  2024-08-09 15:12     ` Jason Gunthorpe
  2024-08-20  8:30   ` Mostafa Saleh
  1 sibling, 1 reply; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-08-09 14:26 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel@lists.linux.dev, Alex Williamson,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches@lists.linux.dev



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 7, 2024 12:41 AM
> To: acpica-devel@lists.linux.dev; Alex Williamson
> <alex.williamson@redhat.com>; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>; kvm@vger.kernel.org;
> Len Brown <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin
> Murphy <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>;
> Will Deacon <will@kernel.org>
> Cc: Eric Auger <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-
> philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; Michael Shavit
> <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> Subject: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
> 
> Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> works. When S2FWB is supported and enabled, the IOPTE will force cacheable
> access to IOMMU_CACHE memory and deny cacheable access otherwise.
> 
> This is not especially meaningful for simple S2 domains; it apparently
> doesn't even force PCI no-snoop access to be coherent.
> 
> However, when used with a nested S1, FWB has the effect of preventing the
> guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> cache. Consistent with KVM we wish to deny the guest the ability to become
> incoherent with cached memory the hypervisor believes is cachable so we
> don't have to flush it.
> 
> Turn on S2FWB whenever the SMMU supports it and use it for all S2
> mappings.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  6 ++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>  drivers/iommu/io-pgtable-arm.c              | 24 +++++++++++++++++----
>  include/linux/io-pgtable.h                  |  2 ++
>  4 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 531125f231b662..7fe1e27d11586c 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1612,6 +1612,8 @@ void arm_smmu_make_s2_domain_ste(struct
> arm_smmu_ste *target,
>  		FIELD_PREP(STRTAB_STE_1_EATS,
>  			   ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
> 
> +	if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +		target->data[1] |= cpu_to_le64(STRTAB_STE_1_S2FWB);
>  	if (smmu->features & ARM_SMMU_FEAT_ATTR_TYPES_OVR)
>  		target->data[1] |=
> cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
> 
> STRTAB_STE_1_SHCFG_INCOMING));
> @@ -2400,6 +2402,8 @@ static int arm_smmu_domain_finalise(struct
> arm_smmu_domain *smmu_domain,
>  		pgtbl_cfg.oas = smmu->oas;
>  		fmt = ARM_64_LPAE_S2;
>  		finalise_stage_fn = arm_smmu_domain_finalise_s2;
> +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;

This probably requires an update to the arm_64_lpae_alloc_pgtable_s2() quirks check.

Thanks,
Shameer



* RE: [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag
  2024-08-06 23:41 ` [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag Jason Gunthorpe
@ 2024-08-09 14:36   ` Shameerali Kolothum Thodi
  2024-08-09 14:59     ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-08-09 14:36 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel@lists.linux.dev, Alex Williamson,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches@lists.linux.dev



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, August 7, 2024 12:41 AM
> To: acpica-devel@lists.linux.dev; Alex Williamson
> <alex.williamson@redhat.com>; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>; kvm@vger.kernel.org;
> Len Brown <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin
> Murphy <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>;
> Will Deacon <will@kernel.org>
> Cc: Eric Auger <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-
> philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; Michael Shavit
> <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> Subject: [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag
> 
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
> Access Flag field in the Memory Access Properties table, mainly for a PCI
> Root Complex.
...
> diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
> index e27958ef82642f..56ce7fc35312c8 100644
> --- a/include/acpi/actbl2.h
> +++ b/include/acpi/actbl2.h
> @@ -524,6 +524,7 @@ struct acpi_iort_memory_access {
> 
>  #define ACPI_IORT_MF_COHERENCY          (1)
>  #define ACPI_IORT_MF_ATTRIBUTES         (1<<1)
> +#define ACPI_IORT_MF_CANWBS             (1<<2)

I think we need to update the document number to E.f in the IORT section of
this file. Also, doesn't this file normally get updated through an ACPICA pull?

Thanks,
Shameer



* Re: [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag
  2024-08-09 14:36   ` Shameerali Kolothum Thodi
@ 2024-08-09 14:59     ` Jason Gunthorpe
  2024-08-09 15:15       ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-09 14:59 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: acpica-devel@lists.linux.dev, Alex Williamson,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev

On Fri, Aug 09, 2024 at 02:36:31PM +0000, Shameerali Kolothum Thodi wrote:
> > @@ -524,6 +524,7 @@ struct acpi_iort_memory_access {
> > 
> >  #define ACPI_IORT_MF_COHERENCY          (1)
> >  #define ACPI_IORT_MF_ATTRIBUTES         (1<<1)
> > +#define ACPI_IORT_MF_CANWBS             (1<<2)
> 
> I think we need to update the document number to E.f in the IORT section of
> this file. Also, doesn't this file normally get updated through an ACPICA pull?

I don't know anything about the ACPI process..

Can someone say for sure what to do here?

Jason



* Re: [PATCH 6/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-06 23:41 ` [PATCH 6/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT Jason Gunthorpe
@ 2024-08-09 15:06   ` Robin Murphy
  2024-08-09 16:09     ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2024-08-09 15:06 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel, Alex Williamson, Hanjun Guo, iommu,
	Joerg Roedel, Kevin Tian, kvm, Len Brown, linux-acpi,
	linux-arm-kernel, Lorenzo Pieralisi, Rafael J. Wysocki,
	Robert Moore, Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

On 2024-08-07 12:41 am, Jason Gunthorpe wrote:
> For SMMUv3 the parent must be an S2 domain, which can be composed
> into an IOMMU_DOMAIN_NESTED.
> 
> In future the S2 parent will also need a VMID linked to the VIOMMU and
> even to KVM.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 6bbe4aa7b9511c..5faaccef707ef1 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>   			   const struct iommu_user_data *user_data)
>   {
>   	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> -	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> +	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
> +				 IOMMU_HWPT_ALLOC_NEST_PARENT;
>   	struct arm_smmu_domain *smmu_domain;
>   	int ret;
>   
> @@ -3116,6 +3117,14 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>   	if (!smmu_domain)
>   		return ERR_PTR(-ENOMEM);
>   
> +	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT) {
> +		if (!(master->smmu->features & ARM_SMMU_FEAT_TRANS_S2)) {

Nope, nesting needs to rely on FEAT_NESTING, that's why it exists. S2 
alone isn't sufficient - without S1 there's nothing to expose to 
userspace, so zero point in having a "nested" domain with nothing to 
nest into it - but furthermore we need S2 *without* unsafe broken TLBs.

Thanks,
Robin.

> +			ret = -EOPNOTSUPP;
> +			goto err_free;
> +		}
> +		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> +	}
> +
>   	smmu_domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
>   	smmu_domain->domain.ops = arm_smmu_ops.default_domain_ops;
>   	ret = arm_smmu_domain_finalise(smmu_domain, master->smmu, flags);



* Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-09 14:26   ` Shameerali Kolothum Thodi
@ 2024-08-09 15:12     ` Jason Gunthorpe
  2024-08-15 16:14       ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-09 15:12 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: acpica-devel@lists.linux.dev, Alex Williamson,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev

On Fri, Aug 09, 2024 at 02:26:13PM +0000, Shameerali Kolothum Thodi wrote:
> > +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> > +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
> 
> This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks check.

Yep, fixed. I was hoping you had HW to test this..

Thanks,
Jason



* RE: [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag
  2024-08-09 14:59     ` Jason Gunthorpe
@ 2024-08-09 15:15       ` Shameerali Kolothum Thodi
  2024-08-09 20:14         ` Nicolin Chen
  0 siblings, 1 reply; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-08-09 15:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel@lists.linux.dev, Alex Williamson,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, August 9, 2024 3:59 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: acpica-devel@lists.linux.dev; Alex Williamson
> <alex.williamson@redhat.com>; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>; kvm@vger.kernel.org;
> Len Brown <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin
> Murphy <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>;
> Will Deacon <will@kernel.org>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>;
> Michael Shavit <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev
> Subject: Re: [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag
> 
> On Fri, Aug 09, 2024 at 02:36:31PM +0000, Shameerali Kolothum Thodi wrote:
> > > @@ -524,6 +524,7 @@ struct acpi_iort_memory_access {
> > >
> > >  #define ACPI_IORT_MF_COHERENCY          (1)
> > >  #define ACPI_IORT_MF_ATTRIBUTES         (1<<1)
> > > +#define ACPI_IORT_MF_CANWBS             (1<<2)
> >
> > I think we need to update the document number to E.f in the IORT section of
> > this file. Also, doesn't this file normally get updated through an ACPICA pull?
> 
> I don't know anything about the ACPI process..
> 
> Can someone say for sure what to do here?

From past experience, it normally involves sending a pull request here:
https://github.com/acpica/acpica/pulls

And I think it then gets merged by Robert Moore and sent on to LKML.

Thanks,
Shameer




* Re: [PATCH 7/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-06 23:41 ` [PATCH 7/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED Jason Gunthorpe
@ 2024-08-09 16:05   ` Robin Murphy
  2024-08-09 18:03     ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2024-08-09 16:05 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel, Alex Williamson, Hanjun Guo, iommu,
	Joerg Roedel, Kevin Tian, kvm, Len Brown, linux-acpi,
	linux-arm-kernel, Lorenzo Pieralisi, Rafael J. Wysocki,
	Robert Moore, Sudeep Holla, Will Deacon
  Cc: Eric Auger, Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches, Shameerali Kolothum Thodi

On 2024-08-07 12:41 am, Jason Gunthorpe wrote:
> For SMMUv3 an IOMMU_DOMAIN_NESTED is composed of an S2 iommu_domain acting
> as the parent and a user provided STE fragment that defines the CD table
> and related data with addresses translated by the S2 iommu_domain.
> 
> The kernel only permits userspace to control certain allowed bits of the
> STE that are safe for user/guest control.
> 
> IOTLB maintenance is a bit subtle here: the S1 implicitly includes the S2
> translation, but there is no way of knowing which S1 entries refer to a
> range of S2.
> 
> For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
> flush all ASIDs from the VMID after flushing the S2 on any change to the
> S2.
> 
> Similarly we have to flush the entire ATC if the S2 is changed.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 211 +++++++++++++++++++-
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  20 ++
>   include/uapi/linux/iommufd.h                |  20 ++
>   3 files changed, 247 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 5faaccef707ef1..5dbaffd7937747 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -295,6 +295,7 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>   	case CMDQ_OP_TLBI_NH_ASID:
>   		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
>   		fallthrough;
> +	case CMDQ_OP_TLBI_NH_ALL:
>   	case CMDQ_OP_TLBI_S12_VMALL:
>   		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
>   		break;
> @@ -1640,6 +1641,59 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
>   }
>   EXPORT_SYMBOL_IF_KUNIT(arm_smmu_make_s2_domain_ste);
>   
> +static void arm_smmu_make_nested_cd_table_ste(
> +	struct arm_smmu_ste *target, struct arm_smmu_master *master,
> +	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
> +{
> +	arm_smmu_make_s2_domain_ste(target, master, nested_domain->s2_parent,
> +				    ats_enabled);
> +
> +	target->data[0] = cpu_to_le64(STRTAB_STE_0_V |
> +				      FIELD_PREP(STRTAB_STE_0_CFG,
> +						 STRTAB_STE_0_CFG_NESTED)) |
> +			  (nested_domain->ste[0] & ~STRTAB_STE_0_CFG);
> +	target->data[1] |= nested_domain->ste[1];
> +}
> +
> +/*
> + * Create a physical STE from the virtual STE that userspace provided when it
> + * created the nested domain. Using the vSTE userspace can request:
> + * - Non-valid STE
> + * - Abort STE
> + * - Bypass STE (install the S2, no CD table)
> + * - CD table STE (install the S2 and the userspace CD table)
> + */
> +static void arm_smmu_make_nested_domain_ste(
> +	struct arm_smmu_ste *target, struct arm_smmu_master *master,
> +	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
> +{
> +	/*
> +	 * Userspace can request a non-valid STE through the nesting interface.
> +	 * We relay that into a non-valid physical STE with the intention that
> +	 * C_BAD_STE for this SID can be delivered to userspace.

NAK, that is a horrible idea. If userspace really wants to emulate that 
it can install a disabled S1 context or move the device to an empty s2 
domain, get translation faults signalled through the normal path, and 
synthesise C_BAD_STE for itself because it knows what it's done. 
Otherwise, how do you propose we would actually tell whether a real 
C_BAD_STE is due to a driver bug, an unknown device, or intentional 
(especially in cases like a surprise removal where we might have to 
transition directly from fake-invalid to real-invalid)?

Yes, userspace can spam up the event queue with translation/permission 
etc. faults, but those are at least clearly attributable and an expected 
part of normal operation; giving it free rein to spam up the event 
queue with what are currently considered *host kernel errors*, with no 
handling or mitigation, is another thing entirely.

> +	 */
> +	if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V))) {
> +		memset(target, 0, sizeof(*target));
> +		return;
> +	}
> +
> +	switch (FIELD_GET(STRTAB_STE_0_CFG,
> +			  le64_to_cpu(nested_domain->ste[0]))) {
> +	case STRTAB_STE_0_CFG_S1_TRANS:
> +		arm_smmu_make_nested_cd_table_ste(target, master, nested_domain,
> +						  ats_enabled);
> +		break;
> +	case STRTAB_STE_0_CFG_BYPASS:
> +		arm_smmu_make_s2_domain_ste(
> +			target, master, nested_domain->s2_parent, ats_enabled);
> +		break;
> +	case STRTAB_STE_0_CFG_ABORT:
> +	default:
> +		arm_smmu_make_abort_ste(target);
> +		break;
> +	}
> +}
> +
>   /*
>    * This can safely directly manipulate the STE memory without a sync sequence
>    * because the STE table has not been installed in the SMMU yet.
> @@ -2065,7 +2119,16 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
>   		if (!master->ats_enabled)
>   			continue;
>   
> -		arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size, &cmd);
> +		if (master_domain->nested_parent) {
> +			/*
> +			 * If an S2 used as a nesting parent is changed we have
> +			 * no option but to completely flush the ATC.
> +			 */
> +			arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
> +		} else {
> +			arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size,
> +						&cmd);
> +		}
>   
>   		for (i = 0; i < master->num_streams; i++) {
>   			cmd.atc.sid = master->streams[i].id;
> @@ -2192,6 +2255,16 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
>   	}
>   	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
>   
> +	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S2 &&
> +	    smmu_domain->nesting_parent) {

Surely nesting_parent must never be set on anything other than S2 
domains in the first place?

> +		/*
> +		 * When the S2 domain changes all the nested S1 ASIDs have to be
> +		 * flushed too.
> +		 */
> +		cmd.opcode = CMDQ_OP_TLBI_NH_ALL;
> +		arm_smmu_cmdq_issue_cmd_with_sync(smmu_domain->smmu, &cmd);
> +	}
> +
>   	/*
>   	 * Unfortunately, this can't be leaf-only since we may have
>   	 * zapped an entire table.
> @@ -2608,13 +2681,15 @@ arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
>   			    ioasid_t ssid)
>   {
>   	struct arm_smmu_master_domain *master_domain;
> +	bool nested_parent = smmu_domain->domain.type == IOMMU_DOMAIN_NESTED;
>   
>   	lockdep_assert_held(&smmu_domain->devices_lock);
>   
>   	list_for_each_entry(master_domain, &smmu_domain->devices,
>   			    devices_elm) {
>   		if (master_domain->master == master &&
> -		    master_domain->ssid == ssid)
> +		    master_domain->ssid == ssid &&
> +		    master_domain->nested_parent == nested_parent)

As if nested_parent vs. nesting parent wasn't bad enough, why would we 
need additional disambiguation here? How could more than one attachment 
to the same SID:SSID exist at the same time? How could we have a 
non-nested S1 attachment in a nested domain, or vice-versa, at all?

>   			return master_domain;
>   	}
>   	return NULL;
> @@ -2634,6 +2709,9 @@ to_smmu_domain_devices(struct iommu_domain *domain)
>   	if ((domain->type & __IOMMU_DOMAIN_PAGING) ||
>   	    domain->type == IOMMU_DOMAIN_SVA)
>   		return to_smmu_domain(domain);
> +	if (domain->type == IOMMU_DOMAIN_NESTED)
> +		return container_of(domain, struct arm_smmu_nested_domain,
> +				    domain)->s2_parent;
>   	return NULL;
>   }
>   
> @@ -2664,6 +2742,7 @@ struct arm_smmu_attach_state {
>   	struct iommu_domain *old_domain;
>   	struct arm_smmu_master *master;
>   	bool cd_needs_ats;
> +	bool disable_ats;
>   	ioasid_t ssid;
>   	/* Resulting state */
>   	bool ats_enabled;
> @@ -2716,7 +2795,8 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
>   		 * enabled if we have arm_smmu_domain, those always have page
>   		 * tables.
>   		 */
> -		state->ats_enabled = arm_smmu_ats_supported(master);
> +		state->ats_enabled = !state->disable_ats &&
> +				     arm_smmu_ats_supported(master);
>   	}
>   
>   	if (smmu_domain) {
> @@ -2725,6 +2805,8 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
>   			return -ENOMEM;
>   		master_domain->master = master;
>   		master_domain->ssid = state->ssid;
> +		master_domain->nested_parent = new_domain->type ==
> +					       IOMMU_DOMAIN_NESTED;
>   
>   		/*
>   		 * During prepare we want the current smmu_domain and new
> @@ -3097,6 +3179,122 @@ static struct iommu_domain arm_smmu_blocked_domain = {
>   	.ops = &arm_smmu_blocked_ops,
>   };
>   
> +static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
> +				      struct device *dev)
> +{
> +	struct arm_smmu_nested_domain *nested_domain =
> +		container_of(domain, struct arm_smmu_nested_domain, domain);
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct arm_smmu_attach_state state = {
> +		.master = master,
> +		.old_domain = iommu_get_domain_for_dev(dev),
> +		.ssid = IOMMU_NO_PASID,
> +		/* Currently invalidation of ATC is not supported */
> +		.disable_ats = true,
> +	};
> +	struct arm_smmu_ste ste;
> +	int ret;
> +
> +	if (arm_smmu_ssids_in_use(&master->cd_table) ||
> +	    nested_domain->s2_parent->smmu != master->smmu)
> +		return -EINVAL;
> +
> +	mutex_lock(&arm_smmu_asid_lock);
> +	ret = arm_smmu_attach_prepare(&state, domain);
> +	if (ret) {
> +		mutex_unlock(&arm_smmu_asid_lock);
> +		return ret;
> +	}
> +
> +	arm_smmu_make_nested_domain_ste(&ste, master, nested_domain,
> +					state.ats_enabled);
> +	arm_smmu_install_ste_for_dev(master, &ste);
> +	arm_smmu_attach_commit(&state);
> +	mutex_unlock(&arm_smmu_asid_lock);
> +	return 0;
> +}
> +
> +static void arm_smmu_domain_nested_free(struct iommu_domain *domain)
> +{
> +	kfree(container_of(domain, struct arm_smmu_nested_domain, domain));
> +}
> +
> +static const struct iommu_domain_ops arm_smmu_nested_ops = {
> +	.attach_dev = arm_smmu_attach_dev_nested,
> +	.free = arm_smmu_domain_nested_free,
> +};
> +
> +static struct iommu_domain *
> +arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
> +			      struct iommu_domain *parent,
> +			      const struct iommu_user_data *user_data)
> +{
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> +	struct arm_smmu_nested_domain *nested_domain;
> +	struct arm_smmu_domain *smmu_parent;
> +	struct iommu_hwpt_arm_smmuv3 arg;
> +	unsigned int eats;
> +	unsigned int cfg;
> +	int ret;
> +
> +	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	/*
> +	 * Must support some way to prevent the VM from bypassing the cache
> +	 * because VFIO currently does not do any cache maintenance.
> +	 */
> +	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
> +	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	ret = iommu_copy_struct_from_user(&arg, user_data,
> +					  IOMMU_HWPT_DATA_ARM_SMMUV3, ste);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	if (flags || !(master->smmu->features & ARM_SMMU_FEAT_TRANS_S1))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	if (!(parent->type & __IOMMU_DOMAIN_PAGING))
> +		return ERR_PTR(-EINVAL);
> +
> +	smmu_parent = to_smmu_domain(parent);
> +	if (smmu_parent->stage != ARM_SMMU_DOMAIN_S2 ||
> +	    smmu_parent->smmu != master->smmu)
> +		return ERR_PTR(-EINVAL);
> +
> +	/* EIO is reserved for invalid STE data. */
> +	if ((arg.ste[0] & ~STRTAB_STE_0_NESTING_ALLOWED) ||
> +	    (arg.ste[1] & ~STRTAB_STE_1_NESTING_ALLOWED))
> +		return ERR_PTR(-EIO);
> +
> +	cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(arg.ste[0]));
> +	if (cfg != STRTAB_STE_0_CFG_ABORT && cfg != STRTAB_STE_0_CFG_BYPASS &&
> +	    cfg != STRTAB_STE_0_CFG_S1_TRANS)
> +		return ERR_PTR(-EIO);
> +
> +	eats = FIELD_GET(STRTAB_STE_1_EATS, le64_to_cpu(arg.ste[1]));
> +	if (eats != STRTAB_STE_1_EATS_ABT)
> +		return ERR_PTR(-EIO);
> +
> +	if (cfg != STRTAB_STE_0_CFG_S1_TRANS)
> +		eats = STRTAB_STE_1_EATS_ABT;
> +
> +	nested_domain = kzalloc(sizeof(*nested_domain), GFP_KERNEL_ACCOUNT);
> +	if (!nested_domain)
> +		return ERR_PTR(-ENOMEM);
> +
> +	nested_domain->domain.type = IOMMU_DOMAIN_NESTED;
> +	nested_domain->domain.ops = &arm_smmu_nested_ops;
> +	nested_domain->s2_parent = smmu_parent;
> +	nested_domain->ste[0] = arg.ste[0];
> +	nested_domain->ste[1] = arg.ste[1] & ~cpu_to_le64(STRTAB_STE_1_EATS);
> +
> +	return &nested_domain->domain;
> +}
> +
>   static struct iommu_domain *
>   arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>   			   struct iommu_domain *parent,
> @@ -3108,9 +3306,13 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>   	struct arm_smmu_domain *smmu_domain;
>   	int ret;
>   
> +	if (parent)
> +		return arm_smmu_domain_alloc_nesting(dev, flags, parent,
> +						     user_data);
> +
>   	if (flags & ~PAGING_FLAGS)
>   		return ERR_PTR(-EOPNOTSUPP);
> -	if (parent || user_data)
> +	if (user_data)
>   		return ERR_PTR(-EOPNOTSUPP);
>   
>   	smmu_domain = arm_smmu_domain_alloc();
> @@ -3123,6 +3325,7 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>   			goto err_free;
>   		}
>   		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> +		smmu_domain->nesting_parent = true;
>   	}
>   
>   	smmu_domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 58cd405652e06a..e149eddb568e7e 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -240,6 +240,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
>   #define STRTAB_STE_0_CFG_BYPASS		4
>   #define STRTAB_STE_0_CFG_S1_TRANS	5
>   #define STRTAB_STE_0_CFG_S2_TRANS	6
> +#define STRTAB_STE_0_CFG_NESTED		7
>   
>   #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)
>   #define STRTAB_STE_0_S1FMT_LINEAR	0
> @@ -291,6 +292,15 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
>   
>   #define STRTAB_STE_3_S2TTB_MASK		GENMASK_ULL(51, 4)
>   
> +/* These bits can be controlled by userspace for STRTAB_STE_0_CFG_NESTED */
> +#define STRTAB_STE_0_NESTING_ALLOWED                                         \
> +	cpu_to_le64(STRTAB_STE_0_V | STRTAB_STE_0_CFG | STRTAB_STE_0_S1FMT | \
> +		    STRTAB_STE_0_S1CTXPTR_MASK | STRTAB_STE_0_S1CDMAX)
> +#define STRTAB_STE_1_NESTING_ALLOWED                            \
> +	cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |   \
> +		    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |   \
> +		    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_EATS)
> +
>   /*
>    * Context descriptors.
>    *
> @@ -508,6 +518,7 @@ struct arm_smmu_cmdq_ent {
>   			};
>   		} cfgi;
>   
> +		#define CMDQ_OP_TLBI_NH_ALL     0x10
>   		#define CMDQ_OP_TLBI_NH_ASID	0x11
>   		#define CMDQ_OP_TLBI_NH_VA	0x12
>   		#define CMDQ_OP_TLBI_EL2_ALL	0x20
> @@ -792,6 +803,14 @@ struct arm_smmu_domain {
>   	u8				enforce_cache_coherency;
>   
>   	struct mmu_notifier		mmu_notifier;
> +	bool				nesting_parent : 1;

Erm, please use bool consistently, or use integer bitfields 
consistently, but not a deranged mess of bool bitfields while also 
assigning true/false to full u8s... :/

Thanks,
Robin.

> +};
> +
> +struct arm_smmu_nested_domain {
> +	struct iommu_domain domain;
> +	struct arm_smmu_domain *s2_parent;
> +
> +	__le64 ste[2];
>   };
>   
>   /* The following are exposed for testing purposes. */
> @@ -830,6 +849,7 @@ struct arm_smmu_master_domain {
>   	struct list_head devices_elm;
>   	struct arm_smmu_master *master;
>   	ioasid_t ssid;
> +	u8 nested_parent;
>   };
>   
>   static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 83b6e1cd338d8f..76e9ad6c9403af 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -394,14 +394,34 @@ struct iommu_hwpt_vtd_s1 {
>   	__u32 __reserved;
>   };
>   
> +/**
> + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 Context Descriptor Table info
> + *                                (IOMMU_HWPT_DATA_ARM_SMMUV3)
> + *
> + * @ste: The first two double words of the user space Stream Table Entry for
> + *       a user stage-1 Context Descriptor Table. Must be little-endian.
> + *       Allowed fields: (Refer to "5.2 Stream Table Entry" in SMMUv3 HW Spec)
> + *       - word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
> + *       - word-1: S1DSS, S1CIR, S1COR, S1CSH, S1STALLD
> + *
> + * -EIO will be returned if @ste is not legal or contains any non-allowed field.
> + * Cfg can be used to select a S1, Bypass or Abort configuration. A Bypass
> + * nested domain will translate the same as the nesting parent.
> + */
> +struct iommu_hwpt_arm_smmuv3 {
> +	__aligned_le64 ste[2];
> +};
> +
>   /**
>    * enum iommu_hwpt_data_type - IOMMU HWPT Data Type
>    * @IOMMU_HWPT_DATA_NONE: no data
>    * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
> + * @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
>    */
>   enum iommu_hwpt_data_type {
>   	IOMMU_HWPT_DATA_NONE = 0,
>   	IOMMU_HWPT_DATA_VTD_S1 = 1,
> +	IOMMU_HWPT_DATA_ARM_SMMUV3 = 2,
>   };
>   
>   /**


^ permalink raw reply	[flat|nested] 34+ messages in thread
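The -EIO sanitizing of the user-provided vSTE discussed in the hunk above can be sketched as a small userspace model. The bit positions, the allowed-field mask, and the helper name below are illustrative placeholders, not the real STE layout from arm-smmu-v3.h:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the vSTE checks in arm_smmu_domain_alloc_nesting():
 * reject any field outside the nesting-allowed mask, and only accept an
 * abort, bypass or S1-translate Config. Bit layout is illustrative.
 */
#define STE_0_V             (1ULL << 0)
#define STE_0_CFG_SHIFT     1
#define STE_0_CFG_MASK      (0x7ULL << STE_0_CFG_SHIFT)
#define STE_0_CFG_ABORT     0ULL
#define STE_0_CFG_BYPASS    4ULL
#define STE_0_CFG_S1_TRANS  5ULL

/* Fields userspace is allowed to control (placeholder mask) */
#define STE_0_NESTING_ALLOWED (STE_0_V | STE_0_CFG_MASK)

/* Returns 0 if vSTE word 0 is acceptable, -5 (-EIO) otherwise */
static int check_vste0(uint64_t ste0)
{
    uint64_t cfg;

    if (ste0 & ~STE_0_NESTING_ALLOWED)
        return -5; /* non-allowed field set */

    cfg = (ste0 & STE_0_CFG_MASK) >> STE_0_CFG_SHIFT;
    if (cfg != STE_0_CFG_ABORT && cfg != STE_0_CFG_BYPASS &&
        cfg != STE_0_CFG_S1_TRANS)
        return -5; /* reserved or S2 Config not permitted */
    return 0;
}
```

The same shape applies to word 1, where the EATS field is additionally forced to abort unless the Config selects S1 translation.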

* Re: [PATCH 6/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-09 15:06   ` Robin Murphy
@ 2024-08-09 16:09     ` Jason Gunthorpe
  2024-08-09 18:34       ` Robin Murphy
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-09 16:09 UTC (permalink / raw)
  To: Robin Murphy
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Sudeep Holla,
	Will Deacon, Eric Auger, Jean-Philippe Brucker, Moritz Fischer,
	Michael Shavit, Nicolin Chen, patches, Shameerali Kolothum Thodi

On Fri, Aug 09, 2024 at 04:06:22PM +0100, Robin Murphy wrote:
> On 2024-08-07 12:41 am, Jason Gunthorpe wrote:
> > For SMMUv3 the parent must be a S2 domain, which can be composed
> > into a IOMMU_DOMAIN_NESTED.
> > 
> > In future the S2 parent will also need a VMID linked to the VIOMMU and
> > even to KVM.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > ---
> >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++++++++++-
> >   1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 6bbe4aa7b9511c..5faaccef707ef1 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
> >   			   const struct iommu_user_data *user_data)
> >   {
> >   	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > -	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> > +	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
> > +				 IOMMU_HWPT_ALLOC_NEST_PARENT;
> >   	struct arm_smmu_domain *smmu_domain;
> >   	int ret;
> > @@ -3116,6 +3117,14 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
> >   	if (!smmu_domain)
> >   		return ERR_PTR(-ENOMEM);
> > +	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT) {
> > +		if (!(master->smmu->features & ARM_SMMU_FEAT_TRANS_S2)) {
> 
> Nope, nesting needs to rely on FEAT_NESTING, that's why it exists. S2 alone
> isn't sufficient - without S1 there's nothing to expose to userspace, so
> zero point in having a "nested" domain with nothing to nest into it - but
> furthermore we need S2 *without* unsafe broken TLBs.

I do tend to agree we should fail earlier if IOMMU_DOMAIN_NESTED is
not possible so let's narrow it.

However, the above was matching how the driver already worked (ie the
old arm_smmu_enable_nesting()) where just asking for a normal S2 was
gated only by FEAT_S2.

This does add a CMDQ_OP_TLBI_NH_ALL, but I didn't think that hit an
errata?

The nesting specific stuff that touches things that FEAT_NESTING
covers in the driver is checked here:

static struct iommu_domain *
arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
			      struct iommu_domain *parent,
			      const struct iommu_user_data *user_data)
{
	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
		return ERR_PTR(-EOPNOTSUPP);

Which prevents creating a IOMMU_DOMAIN_NESTED, meaning you can't get a
CD table on top of the S2 or issue any S1 invalidations.

Thanks,
Jason



* Re: [PATCH 7/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-09 16:05   ` Robin Murphy
@ 2024-08-09 18:03     ` Jason Gunthorpe
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-09 18:03 UTC (permalink / raw)
  To: Robin Murphy
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Sudeep Holla,
	Will Deacon, Eric Auger, Jean-Philippe Brucker, Moritz Fischer,
	Michael Shavit, Nicolin Chen, patches, Shameerali Kolothum Thodi

On Fri, Aug 09, 2024 at 05:05:36PM +0100, Robin Murphy wrote:
> > +static void arm_smmu_make_nested_domain_ste(
> > +	struct arm_smmu_ste *target, struct arm_smmu_master *master,
> > +	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
> > +{
> > +	/*
> > +	 * Userspace can request a non-valid STE through the nesting interface.
> > +	 * We relay that into a non-valid physical STE with the intention that
> > +	 * C_BAD_STE for this SID can be delivered to userspace.
> 
> NAK, that is a horrible idea. If userspace really wants to emulate that it
> can install a disabled S1 context or move the device to an empty s2 domain,
> get translation faults signalled through the normal path, and synthesise
> C_BAD_STE for itself because it knows what it's done. 

The main point is that we need the VIOMMU to become linked to the SID
through an IOMMU_DOMAIN_NESTED attachment so we know how to route events
to userspace. Some of these options won't allow that.

> Otherwise, how do you propose we would actually tell whether a real
> C_BAD_STE is due to a driver

It is the same as every other SID based event, you lookup the SID, see
there is an IOMMU_DOMAIN_NESTED attached, extract the VIOMMU and route
the whole event to the VIOMMU's event queue.

For C_BAD_STE you'd want to also check that the STE is all zeros
before doing this to detect hypervisor driver bugs. It is not perfect,
but it is not wildly unworkable either.

> Yes, userspace can spam up the event queue with translation/permission etc.
> faults, but those are at least clearly attributable and an expected part of
> normal operation; giving it free reign to spam up the event queue with what
> are currently considered *host kernel errors*, with no handling or
> mitigation, is another thing entirely.

Let's use arm_smmu_make_abort_ste():

	if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V))) {
		arm_smmu_make_abort_ste(target);
		return;
	}

We can look into how to transform that into a virtual C_BAD_STE as
part of the event infrastructure patches?
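That fallback can be modeled in userspace as follows. The bit positions and struct shape are illustrative, not the real STE encoding, and the helper names only mirror the driver's naming:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the proposed behaviour: a non-valid user vSTE is turned into
 * an abort configuration rather than installing a non-valid physical
 * STE. Bit layout is illustrative.
 */
#define STE_0_V          (1ULL << 0)
#define STE_0_CFG_SHIFT  1
#define STE_0_CFG_ABORT  0ULL

struct ste { uint64_t w[2]; };

static void make_abort_ste(struct ste *target)
{
    /* A valid STE whose Config field selects abort */
    target->w[0] = STE_0_V | (STE_0_CFG_ABORT << STE_0_CFG_SHIFT);
    target->w[1] = 0;
}

static void make_nested_ste(struct ste *target, const uint64_t vste[2])
{
    if (!(vste[0] & STE_0_V)) {
        make_abort_ste(target);
        return;
    }
    /* Otherwise relay the (already sanitized) user fragment */
    target->w[0] = vste[0];
    target->w[1] = vste[1];
}
```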

> > @@ -2192,6 +2255,16 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
> >   	}
> >   	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
> > +	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S2 &&
> > +	    smmu_domain->nesting_parent) {
> 
> Surely nesting_parent must never be set on anything other than S2 domains in
> the first place?

Done

> > @@ -2608,13 +2681,15 @@ arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
> >   			    ioasid_t ssid)
> >   {
> >   	struct arm_smmu_master_domain *master_domain;
> > +	bool nested_parent = smmu_domain->domain.type == IOMMU_DOMAIN_NESTED;
> >   	lockdep_assert_held(&smmu_domain->devices_lock);
> >   	list_for_each_entry(master_domain, &smmu_domain->devices,
> >   			    devices_elm) {
> >   		if (master_domain->master == master &&
> > -		    master_domain->ssid == ssid)
> > +		    master_domain->ssid == ssid &&
> > +		    master_domain->nested_parent == nested_parent)
> 
> As if nested_parent vs. nesting parent wasn't bad enough, 

Done - we used IOMMU_HWPT_ALLOC_NEST_PARENT so let's call them all nest_parent

> why would we need additional disambiguation here?

Oh, there is a mistake here, which is why it looks so weird: the
smmu_domain here is always the S2; we are supposed to be testing the
attaching domain:

--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2677,11 +2677,10 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 
 static struct arm_smmu_master_domain *
 arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
-                           struct arm_smmu_master *master,
-                           ioasid_t ssid)
+                           struct arm_smmu_master *master, ioasid_t ssid,
+                           bool nest_parent)
 {
        struct arm_smmu_master_domain *master_domain;
-       bool nested_parent = smmu_domain->domain.type == IOMMU_DOMAIN_NESTED;
 
        lockdep_assert_held(&smmu_domain->devices_lock);
 
@@ -2689,7 +2688,7 @@ arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
                            devices_elm) {
                if (master_domain->master == master &&
                    master_domain->ssid == ssid &&
-                   master_domain->nest_parent == nested_parent)
+                   master_domain->nest_parent == nest_parent)
                        return master_domain;
        }
        return NULL;
@@ -2727,7 +2726,8 @@ static void arm_smmu_remove_master_domain(struct arm_smmu_master *master,
                return;
 
        spin_lock_irqsave(&smmu_domain->devices_lock, flags);
-       master_domain = arm_smmu_find_master_domain(smmu_domain, master, ssid);
+       master_domain = arm_smmu_find_master_domain(
+               smmu_domain, master, ssid, domain->type == IOMMU_DOMAIN_NESTED);
        if (master_domain) {
                list_del(&master_domain->devices_elm);
                kfree(master_domain);

> How could more than one attachment to the same SID:SSID exist at the
> same time?

The attachment logic puts both the new and old domain in this list
while it works on invalidating caches. This ensures we don't lose any
invalidation. We also directly put the S2 into the list when attaching
an IOMMU_DOMAIN_NESTED.

Thus, it is possible for the same S2 to be in the list twice for a
short time, as switching between the naked S2 and an IOMMU_DOMAIN_NESTED
will cause it. They are not the same, as one will have nest_parent set to do
heavier ATC invalidation.

It is an optimization to allow the naked S2 to be used as an identity
translation with less expensive ATC invalidation.
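The lookup that distinguishes those two transient entries can be sketched as a small userspace model. The types and names below are simplified stand-ins for the driver's structures, not the real code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified model of arm_smmu_find_master_domain(): the same S2 domain
 * may transiently appear twice in the devices list for one SID:SSID,
 * distinguished only by whether the attachment was made on behalf of a
 * nested domain (nest_parent).
 */
struct master_domain {
    struct master_domain *next;
    int master;            /* stands in for struct arm_smmu_master * */
    unsigned int ssid;
    bool nest_parent;
};

static struct master_domain *
find_master_domain(struct master_domain *head, int master,
                   unsigned int ssid, bool nest_parent)
{
    struct master_domain *md;

    /* All three keys must match, so the two entries stay distinct */
    for (md = head; md; md = md->next)
        if (md->master == master && md->ssid == ssid &&
            md->nest_parent == nest_parent)
            return md;
    return NULL;
}
```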

> > @@ -792,6 +803,14 @@ struct arm_smmu_domain {
> >   	u8				enforce_cache_coherency;
> >   	struct mmu_notifier		mmu_notifier;
> > +	bool				nesting_parent : 1;
> 
> Erm, please use bool consistently, or use integer bitfields consistently,
> but not a deranged mess of bool bitfields while also assigning true/false to
> full u8s... :/

I made it like this:

	struct list_head		devices;
	spinlock_t			devices_lock;
	bool				enforce_cache_coherency : 1;
	bool				nest_parent : 1;

Thanks,
Jason



* Re: [PATCH 6/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-09 16:09     ` Jason Gunthorpe
@ 2024-08-09 18:34       ` Robin Murphy
  2024-08-13 14:35         ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Robin Murphy @ 2024-08-09 18:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Sudeep Holla,
	Will Deacon, Eric Auger, Jean-Philippe Brucker, Moritz Fischer,
	Michael Shavit, Nicolin Chen, patches, Shameerali Kolothum Thodi

On 2024-08-09 5:09 pm, Jason Gunthorpe wrote:
> On Fri, Aug 09, 2024 at 04:06:22PM +0100, Robin Murphy wrote:
>> On 2024-08-07 12:41 am, Jason Gunthorpe wrote:
>>> For SMMUv3 the parent must be a S2 domain, which can be composed
>>> into a IOMMU_DOMAIN_NESTED.
>>>
>>> In future the S2 parent will also need a VMID linked to the VIOMMU and
>>> even to KVM.
>>>
>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>> ---
>>>    drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++++++++++-
>>>    1 file changed, 10 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> index 6bbe4aa7b9511c..5faaccef707ef1 100644
>>> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>>> @@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>>>    			   const struct iommu_user_data *user_data)
>>>    {
>>>    	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
>>> -	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
>>> +	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
>>> +				 IOMMU_HWPT_ALLOC_NEST_PARENT;
>>>    	struct arm_smmu_domain *smmu_domain;
>>>    	int ret;
>>> @@ -3116,6 +3117,14 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>>>    	if (!smmu_domain)
>>>    		return ERR_PTR(-ENOMEM);
>>> +	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT) {
>>> +		if (!(master->smmu->features & ARM_SMMU_FEAT_TRANS_S2)) {
>>
>> Nope, nesting needs to rely on FEAT_NESTING, that's why it exists. S2 alone
>> isn't sufficient - without S1 there's nothing to expose to userspace, so
>> zero point in having a "nested" domain with nothing to nest into it - but
>> furthermore we need S2 *without* unsafe broken TLBs.
> 
> I do tend to agree we should fail earlier if IOMMU_DOMAIN_NESTED is
> not possible so let's narrow it.
> 
> However, the above was matching how the driver already worked (ie the
> old arm_smmu_enable_nesting()) where just asking for a normal S2 was
> gated only by FEAT_S2.

Ohhhh, I see, so actually the same old subtlety is still there - 
ALLOC_NEST_PARENT isn't a definite "allocate the parent domain for my 
nested setup", it's "allocate a domain which will be capable of being 
upgraded to nesting later *if* I choose to do so". Is the intent that 
someone could still use this if they had no intention of nesting but 
just wanted to ensure S2 format for their single stage of translation 
for some reason? It remains somewhat confusing since S2 domains on 
S2-only SMMUs are still fundamentally incapable of ever becoming a 
nested parent, but admittedly I'm struggling to think of a name which 
would be more accurate while still generic, so maybe it's OK...

> This does add a CMDQ_OP_TLBI_NH_ALL, but I didn't think that hit an
> errata?

Indeed, all the really nasty errata depend on both stages being active 
(such that S1 translation requests interact with concurrent S2 
invalidations).

Thanks,
Robin.

> The nesting specific stuff that touches things that FEAT_NESTING
> covers in the driver is checked here:
> 
> static struct iommu_domain *
> arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
> 			      struct iommu_domain *parent,
> 			      const struct iommu_user_data *user_data)
> {
> 	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
> 		return ERR_PTR(-EOPNOTSUPP);
> 
> Which prevents creating a IOMMU_DOMAIN_NESTED, meaning you can't get a
> CD table on top of the S2 or issue any S1 invalidations.
> 
> Thanks,
> Jason



* Re: [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag
  2024-08-09 15:15       ` Shameerali Kolothum Thodi
@ 2024-08-09 20:14         ` Nicolin Chen
  0 siblings, 0 replies; 34+ messages in thread
From: Nicolin Chen @ 2024-08-09 20:14 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev, Alex Williamson,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches@lists.linux.dev

On Fri, Aug 09, 2024 at 03:15:20PM +0000, Shameerali Kolothum Thodi wrote:
> > -----Original Message-----
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > On Fri, Aug 09, 2024 at 02:36:31PM +0000, Shameerali Kolothum Thodi wrote:
> > > > @@ -524,6 +524,7 @@ struct acpi_iort_memory_access {
> > > >
> > > >  #define ACPI_IORT_MF_COHERENCY          (1)
> > > >  #define ACPI_IORT_MF_ATTRIBUTES         (1<<1)
> > > > +#define ACPI_IORT_MF_CANWBS             (1<<2)
> > >
> > > I think we need to update the Document number to E.f in the IORT section
> > > in this file. Also, doesn't this file normally get updated through an ACPICA pull?
> >
> > I don't know anything about the ACPI process..
> >
> > Can someone say for sure what to do here?
> 
> From past experience, it is normally sending a PULL request to here,
> https://github.com/acpica/acpica/pulls
> 
> And I think it then gets merged by Robert Moore and then sent to LKML.

Shameer, thanks for the info!

I created a pull request: https://github.com/acpica/acpica/pull/962

By looking at one of the merged pulls, it seems this is going to
take a while. So, I think we might want to split all the CANWBS
pieces out of this series, into a followup series.

Thanks!
Nicolin



* Re: [PATCH 6/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-09 18:34       ` Robin Murphy
@ 2024-08-13 14:35         ` Jason Gunthorpe
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-13 14:35 UTC (permalink / raw)
  To: Robin Murphy
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Sudeep Holla,
	Will Deacon, Eric Auger, Jean-Philippe Brucker, Moritz Fischer,
	Michael Shavit, Nicolin Chen, patches, Shameerali Kolothum Thodi

On Fri, Aug 09, 2024 at 07:34:20PM +0100, Robin Murphy wrote:

> > However, the above was matching how the driver already worked (ie the
> > old arm_smmu_enable_nesting()) where just asking for a normal S2 was
> > gated only by FEAT_S2.
> 
> Ohhhh, I see, so actually the same old subtlety is still there -
> ALLOC_NEST_PARENT isn't a definite "allocate the parent domain for my nested
> setup", it's "allocate a domain which will be capable of being upgraded to
> nesting later *if* I choose to do so". 

Yes. All PAGING type domains are expected to be able to be attached
nakedly without creating the DOMAIN_NESTED.

> Is the intent that someone could still use this if they had no
> intention of nesting but just wanted to ensure S2 format for their
> single stage of translation for some reason?

Sort of, yes..

When booting a VM with DMA defaulting to bypass, there are two flows
for the time before the vIOMMU is enabled.

The first flow is to allocate the S2 NESTING_PARENT and attach it
directly to the RID. This is a normal S2 paging domain. The VMM would
later switch to a DOMAIN_NESTED (maybe with bypass) when the vIOMMU is
enabled by the VM and the vSTEs are parsed.

The second flow, which is probably going to be the better way, is the
VMM will create a DOMAIN_NESTED with a bypass vSTE and attach that
instead of directly attaching the S2.

When we worked through VIOMMU it turned out we always want the
DOMAIN_NESTED to be the attached domain and the bypass/abort cases are
handled through vSTE settings instead of naked domains. This allows
the VIOMMU object to always be associated with the SID which is how we
will link the event queues, the VMID and so on.

> It remains somewhat confusing since S2 domains on S2-only SMMUs are
> still fundamentally incapable of ever becoming a nested parent, but
> admittedly I'm struggling to think of a name which would be more
> accurate while still generic, so maybe it's OK...

I think for now we can block it, as we know of no use case for requesting
NESTING_PARENT without intending to go on and create a DOMAIN_NESTED.

Someday we may want to allow userspace to specify the page table
parameters more exactly and that might be a good API to also force a
S2 if someone has a (performance?) use case for S2 outside of nesting.

Jason



* RE: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-09 15:12     ` Jason Gunthorpe
@ 2024-08-15 16:14       ` Shameerali Kolothum Thodi
  2024-08-15 16:18         ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-08-15 16:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel@lists.linux.dev, Alex Williamson,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Linuxarm



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, August 9, 2024 4:12 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: acpica-devel@lists.linux.dev; Alex Williamson
> <alex.williamson@redhat.com>; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>; kvm@vger.kernel.org;
> Len Brown <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin
> Murphy <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>;
> Will Deacon <will@kernel.org>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>;
> Michael Shavit <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev
> Subject: Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
> 
> On Fri, Aug 09, 2024 at 02:26:13PM +0000, Shameerali Kolothum Thodi wrote:
> > > +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> > > +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
> >
> > This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks
> check.
> 
> Yep, fixed. I was hoping you had HW to test this...

Let me see if I can get hold of a test setup that supports S2FWB.

I do have another concern with respect to the hardware we have which doesn't
support S2FWB, but which can claim CANWBS. The problem is, a BIOS update is not
a very liked/feasible solution for already deployed ones. But we can probably add 
an option/quirk in the SMMUv3 driver for those platforms (based on 
ACPI_IORT_SMMU_V3_HISILICON_HI161X).  I hope this is fine.

Thanks,
Shameer
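
As an aside, the arm_64_lpae_alloc_pgtable_s2() update mentioned in the quoted exchange is just the usual quirks-whitelist pattern: a newly introduced quirk bit must be added to the accepted mask or allocation fails. A toy model (bit values, the function name here, and the -EINVAL convention are illustrative):

```c
#include <assert.h>

/*
 * Toy model of the io-pgtable quirks whitelist: any quirk bit outside
 * the accepted mask makes allocation fail, so IO_PGTABLE_QUIRK_ARM_S2FWB
 * has to be whitelisted before the S2 allocator will take it.
 * Bit values are illustrative.
 */
#define QUIRK_NON_STRICT  (1u << 0)
#define QUIRK_ARM_S2FWB   (1u << 1)

static int alloc_pgtable_s2(unsigned int quirks, unsigned int accepted)
{
    if (quirks & ~accepted)
        return -22;   /* -EINVAL: unknown/unsupported quirk */
    return 0;
}
```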





* Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-15 16:14       ` Shameerali Kolothum Thodi
@ 2024-08-15 16:18         ` Jason Gunthorpe
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-15 16:18 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: acpica-devel@lists.linux.dev, Alex Williamson,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Linuxarm

On Thu, Aug 15, 2024 at 04:14:22PM +0000, Shameerali Kolothum Thodi wrote:
> > On Fri, Aug 09, 2024 at 02:26:13PM +0000, Shameerali Kolothum Thodi wrote:
> > > > +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> > > > +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
> > >
> > > This probably requires an update in arm_64_lpae_alloc_pgtable_s2() quirks
> > check.
> > 
> > Yep, fixed. I was hoping you had HW to test this..
> 
> Let me see if I can get hold of a test setup that supports S2FWB.

Thanks!
 
> I do have another concern with respect to hardware we have that doesn't
> support S2FWB but can claim CANWBS. The problem is that a BIOS update is
> not a well-liked/feasible solution for already-deployed systems. But we
> can probably add an option/quirk in the SMMUv3 driver for those platforms
> (based on ACPI_IORT_SMMU_V3_HISILICON_HI161X). I hope this is fine.

I don't have an issue with doing that, if you can reliably identify
the platform in some way a kernel quirk seems reasonable.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Initial support for SMMUv3 nested translation
  2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2024-08-06 23:41 ` [PATCH 8/8] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user Jason Gunthorpe
@ 2024-08-20  8:20 ` Mostafa Saleh
  2024-08-20 15:24   ` Nicolin Chen
  8 siblings, 1 reply; 34+ messages in thread
From: Mostafa Saleh @ 2024-08-20  8:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

Hi Jason,

On Tue, Aug 06, 2024 at 08:41:13PM -0300, Jason Gunthorpe wrote:
> This brings support for the IOMMFD ioctls:
> 
>  - IOMMU_GET_HW_INFO
>  - IOMMU_HWPT_ALLOC_NEST_PARENT
>  - IOMMU_DOMAIN_NESTED
>  - ops->enforce_cache_coherency()
> 
> This is quite straightforward as the nested STE can just be built in the
> special NESTED domain op and fed through the generic update machinery.
> 
> The design allows the user provided STE fragment to control several
> aspects of the translation, including putting the STE into a "virtual
> bypass" or an aborting state. This duplicates functionality available by
> other means, but it allows trivially preserving the VMID in the STE as we
> eventually move towards the VIOMMU owning the VMID.
> 
> Nesting support requires the system to either support S2FWB or the
> stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> cache and view incoherent data, currently VFIO lacks any cache flushing
> that would make this safe.
> 
> Yan has a series to add some of the needed infrastructure for VFIO cache
> flushing here:
> 
>  https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/
> 
> Which may someday allow relaxing this further.
> 
> Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and is superseded by
> this.
> 
> This is the first series in what will be several to complete nesting
> support. At least:
>  - IOMMU_RESV_SW_MSI related fixups
>  - VIOMMU object support to allow ATS invalidations
>  - vCMDQ hypervisor support for direct invalidation queue assignment
>  - KVM pinned VMID using VIOMMU for vBTM
>  - Cross instance S2 sharing
>  - Virtual Machine Structure using VIOMMU (for vMPAM?)
>  - Fault forwarding support through IOMMUFD's fault fd for vSVA
> 
> It is enough to allow significant amounts of qemu work to progress.
> 
Are there any qemu patches to test this?
I am confused by some of the user space bits, and that would help.

Thanks,
Mostafa

> This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
> 
> Jason Gunthorpe (5):
>   vfio: Remove VFIO_TYPE1_NESTING_IOMMU
>   iommu/arm-smmu-v3: Use S2FWB when available
>   iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
>   iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
>   iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
> 
> Nicolin Chen (3):
>   ACPI/IORT: Support CANWBS memory access flag
>   iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct
>     arm_smmu_hw_info
>   iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user
> 
>  drivers/acpi/arm64/iort.c                   |  13 +
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 398 ++++++++++++++++++--
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  27 ++
>  drivers/iommu/arm/arm-smmu/arm-smmu.c       |  16 -
>  drivers/iommu/io-pgtable-arm.c              |  24 +-
>  drivers/iommu/iommu.c                       |  10 -
>  drivers/iommu/iommufd/vfio_compat.c         |   7 +-
>  drivers/vfio/vfio_iommu_type1.c             |  12 +-
>  include/acpi/actbl2.h                       |   1 +
>  include/linux/io-pgtable.h                  |   2 +
>  include/linux/iommu.h                       |  54 ++-
>  include/uapi/linux/iommufd.h                |  79 ++++
>  include/uapi/linux/vfio.h                   |   2 +-
>  13 files changed, 572 insertions(+), 73 deletions(-)
> 
> 
> base-commit: e5e288d94186b266b062b3e44c82c285dfe68712
> -- 
> 2.46.0
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  2024-08-06 23:41 ` [PATCH 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
  2024-08-07 17:52   ` Alex Williamson
@ 2024-08-20  8:23   ` Mostafa Saleh
  1 sibling, 0 replies; 34+ messages in thread
From: Mostafa Saleh @ 2024-08-20  8:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

Hi Jason,

On Tue, Aug 06, 2024 at 08:41:14PM -0300, Jason Gunthorpe wrote:
> This control causes the ARM SMMU drivers to choose a stage 2
> implementation for the IO pagetable (vs the stage 1 usual default),
> however this choice has no significant visible impact to the VFIO
> user. Further qemu never implemented this and no other userspace user is
> known.
> 
> The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add
> new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
> SMMU translation services to the guest operating system" however the rest
> of the API to set the guest table pointer for the stage 1 and manage
> invalidation was never completed, or at least never upstreamed, rendering
> this part useless dead code.
> 
> Upstream has now settled on iommufd as the uAPI for controlling nested
> translation. Choosing the stage 2 implementation should be done through
> the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation.
> 
> Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
> enable_nesting iommu_domain_op.
> 
> Just in case there is some userspace using this, continue to treat
> requesting it as a NOP, but do not advertise support any more.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>

> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ----------------
>  drivers/iommu/arm/arm-smmu/arm-smmu.c       | 16 ----------------
>  drivers/iommu/iommu.c                       | 10 ----------
>  drivers/iommu/iommufd/vfio_compat.c         |  7 +------
>  drivers/vfio/vfio_iommu_type1.c             | 12 +-----------
>  include/linux/iommu.h                       |  3 ---
>  include/uapi/linux/vfio.h                   |  2 +-
>  7 files changed, 3 insertions(+), 63 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index e5db5325f7eaed..531125f231b662 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3331,21 +3331,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
>  	return group;
>  }
>  
> -static int arm_smmu_enable_nesting(struct iommu_domain *domain)
> -{
> -	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> -	int ret = 0;
> -
> -	mutex_lock(&smmu_domain->init_mutex);
> -	if (smmu_domain->smmu)
> -		ret = -EPERM;
> -	else
> -		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> -	mutex_unlock(&smmu_domain->init_mutex);
> -
> -	return ret;
> -}
> -
>  static int arm_smmu_of_xlate(struct device *dev,
>  			     const struct of_phandle_args *args)
>  {
> @@ -3467,7 +3452,6 @@ static struct iommu_ops arm_smmu_ops = {
>  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
>  		.iotlb_sync		= arm_smmu_iotlb_sync,
>  		.iova_to_phys		= arm_smmu_iova_to_phys,
> -		.enable_nesting		= arm_smmu_enable_nesting,
>  		.free			= arm_smmu_domain_free_paging,
>  	}
>  };
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> index 723273440c2118..38dad1fd53b80a 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> @@ -1558,21 +1558,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
>  	return group;
>  }
>  
> -static int arm_smmu_enable_nesting(struct iommu_domain *domain)
> -{
> -	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> -	int ret = 0;
> -
> -	mutex_lock(&smmu_domain->init_mutex);
> -	if (smmu_domain->smmu)
> -		ret = -EPERM;
> -	else
> -		smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
> -	mutex_unlock(&smmu_domain->init_mutex);
> -
> -	return ret;
> -}
> -
>  static int arm_smmu_set_pgtable_quirks(struct iommu_domain *domain,
>  		unsigned long quirks)
>  {
> @@ -1656,7 +1641,6 @@ static struct iommu_ops arm_smmu_ops = {
>  		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
>  		.iotlb_sync		= arm_smmu_iotlb_sync,
>  		.iova_to_phys		= arm_smmu_iova_to_phys,
> -		.enable_nesting		= arm_smmu_enable_nesting,
>  		.set_pgtable_quirks	= arm_smmu_set_pgtable_quirks,
>  		.free			= arm_smmu_domain_free,
>  	}
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index ed6c5cb60c5aee..9da63d57a53cd7 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -2723,16 +2723,6 @@ static int __init iommu_init(void)
>  }
>  core_initcall(iommu_init);
>  
> -int iommu_enable_nesting(struct iommu_domain *domain)
> -{
> -	if (domain->type != IOMMU_DOMAIN_UNMANAGED)
> -		return -EINVAL;
> -	if (!domain->ops->enable_nesting)
> -		return -EINVAL;
> -	return domain->ops->enable_nesting(domain);
> -}
> -EXPORT_SYMBOL_GPL(iommu_enable_nesting);
> -
>  int iommu_set_pgtable_quirks(struct iommu_domain *domain,
>  		unsigned long quirk)
>  {
> diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
> index a3ad5f0b6c59dd..514aacd6400949 100644
> --- a/drivers/iommu/iommufd/vfio_compat.c
> +++ b/drivers/iommu/iommufd/vfio_compat.c
> @@ -291,12 +291,7 @@ static int iommufd_vfio_check_extension(struct iommufd_ctx *ictx,
>  	case VFIO_DMA_CC_IOMMU:
>  		return iommufd_vfio_cc_iommu(ictx);
>  
> -	/*
> -	 * This is obsolete, and to be removed from VFIO. It was an incomplete
> -	 * idea that got merged.
> -	 * https://lore.kernel.org/kvm/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/
> -	 */
> -	case VFIO_TYPE1_NESTING_IOMMU:
> +	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
>  		return 0;
>  
>  	/*
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0960699e75543e..13cf6851cc2718 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -72,7 +72,6 @@ struct vfio_iommu {
>  	uint64_t		pgsize_bitmap;
>  	uint64_t		num_non_pinned_groups;
>  	bool			v2;
> -	bool			nesting;
>  	bool			dirty_page_tracking;
>  	struct list_head	emulated_iommu_groups;
>  };
> @@ -2199,12 +2198,6 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  		goto out_free_domain;
>  	}
>  
> -	if (iommu->nesting) {
> -		ret = iommu_enable_nesting(domain->domain);
> -		if (ret)
> -			goto out_domain;
> -	}
> -
>  	ret = iommu_attach_group(domain->domain, group->iommu_group);
>  	if (ret)
>  		goto out_domain;
> @@ -2545,9 +2538,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	switch (arg) {
>  	case VFIO_TYPE1_IOMMU:
>  		break;
> -	case VFIO_TYPE1_NESTING_IOMMU:
> -		iommu->nesting = true;
> -		fallthrough;
> +	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
>  	case VFIO_TYPE1v2_IOMMU:
>  		iommu->v2 = true;
>  		break;
> @@ -2642,7 +2633,6 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>  	switch (arg) {
>  	case VFIO_TYPE1_IOMMU:
>  	case VFIO_TYPE1v2_IOMMU:
> -	case VFIO_TYPE1_NESTING_IOMMU:
>  	case VFIO_UNMAP_ALL:
>  		return 1;
>  	case VFIO_UPDATE_VADDR:
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 4d47f2c3331185..15d7657509f662 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -635,7 +635,6 @@ struct iommu_ops {
>   * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
>   *                           including no-snoop TLPs on PCIe or other platform
>   *                           specific mechanisms.
> - * @enable_nesting: Enable nesting
>   * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
>   * @free: Release the domain after use.
>   */
> @@ -663,7 +662,6 @@ struct iommu_domain_ops {
>  				    dma_addr_t iova);
>  
>  	bool (*enforce_cache_coherency)(struct iommu_domain *domain);
> -	int (*enable_nesting)(struct iommu_domain *domain);
>  	int (*set_pgtable_quirks)(struct iommu_domain *domain,
>  				  unsigned long quirks);
>  
> @@ -846,7 +844,6 @@ extern void iommu_group_put(struct iommu_group *group);
>  extern int iommu_group_id(struct iommu_group *group);
>  extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
>  
> -int iommu_enable_nesting(struct iommu_domain *domain);
>  int iommu_set_pgtable_quirks(struct iommu_domain *domain,
>  		unsigned long quirks);
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 2b68e6cdf1902f..c8dbf8219c4fcb 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -35,7 +35,7 @@
>  #define VFIO_EEH			5
>  
>  /* Two-stage IOMMU */
> -#define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
> +#define __VFIO_RESERVED_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
>  
>  #define VFIO_SPAPR_TCE_v2_IOMMU		7
>  
> -- 
> 2.46.0
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-06 23:41 ` [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
  2024-08-09 14:26   ` Shameerali Kolothum Thodi
@ 2024-08-20  8:30   ` Mostafa Saleh
  2024-08-20 12:01     ` Jason Gunthorpe
  1 sibling, 1 reply; 34+ messages in thread
From: Mostafa Saleh @ 2024-08-20  8:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

Hi Jason,

On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> works. When S2FWB is supported and enabled the IOPTE will force cachable
> access to IOMMU_CACHE memory and deny cachable access otherwise.
> 
> This is not especially meaningful for simple S2 domains, it apparently
> doesn't even force PCI no-snoop access to be coherent.
> 
> However, when used with a nested S1, FWB has the effect of preventing the
> guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> cache. Consistent with KVM we wish to deny the guest the ability to become
> incoherent with cached memory the hypervisor believes is cachable so we
> don't have to flush it.
> 
> Turn on S2FWB whenever the SMMU supports it and use it for all S2
> mappings.

I have been looking into this recently from the KVM side as it will
use FWB for the CPU stage-2 unconditionally for guests (if supported),
however that breaks for non-coherent devices when assigned, and
limiting assigned devices to be coherent seems too restrictive.
I have been looking into ways to notify KVM from VFIO as early as
possible so it can configure the page table properly.

But for SMMUv3, S2FWB is per stream, can’t we just use it if the master
is DMA coherent?

Thanks,
Mostafa

> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  6 ++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>  drivers/iommu/io-pgtable-arm.c              | 24 +++++++++++++++++----
>  include/linux/io-pgtable.h                  |  2 ++
>  4 files changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 531125f231b662..7fe1e27d11586c 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1612,6 +1612,8 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
>  		FIELD_PREP(STRTAB_STE_1_EATS,
>  			   ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
>  
> +	if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +		target->data[1] |= cpu_to_le64(STRTAB_STE_1_S2FWB);
>  	if (smmu->features & ARM_SMMU_FEAT_ATTR_TYPES_OVR)
>  		target->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
>  							  STRTAB_STE_1_SHCFG_INCOMING));
> @@ -2400,6 +2402,8 @@ static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
>  		pgtbl_cfg.oas = smmu->oas;
>  		fmt = ARM_64_LPAE_S2;
>  		finalise_stage_fn = arm_smmu_domain_finalise_s2;
> +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
>  		break;
>  	default:
>  		return -EINVAL;
> @@ -4189,6 +4193,8 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  
>  	/* IDR3 */
>  	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> +	if (FIELD_GET(IDR3_FWB, reg))
> +		smmu->features |= ARM_SMMU_FEAT_S2FWB;
>  	if (FIELD_GET(IDR3_RIL, reg))
>  		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
>  
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 8851a7abb5f0f3..7e8d2f36faebf3 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -55,6 +55,7 @@
>  #define IDR1_SIDSIZE			GENMASK(5, 0)
>  
>  #define ARM_SMMU_IDR3			0xc
> +#define IDR3_FWB			(1 << 8)
>  #define IDR3_RIL			(1 << 10)
>  
>  #define ARM_SMMU_IDR5			0x14
> @@ -258,6 +259,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
>  #define STRTAB_STE_1_S1CSH		GENMASK_ULL(7, 6)
>  
>  #define STRTAB_STE_1_S1STALLD		(1UL << 27)
> +#define STRTAB_STE_1_S2FWB		(1UL << 25)
>  
>  #define STRTAB_STE_1_EATS		GENMASK_ULL(29, 28)
>  #define STRTAB_STE_1_EATS_ABT		0UL
> @@ -700,6 +702,7 @@ struct arm_smmu_device {
>  #define ARM_SMMU_FEAT_ATTR_TYPES_OVR	(1 << 20)
>  #define ARM_SMMU_FEAT_HA		(1 << 21)
>  #define ARM_SMMU_FEAT_HD		(1 << 22)
> +#define ARM_SMMU_FEAT_S2FWB		(1 << 23)
>  	u32				features;
>  
>  #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index f5d9fd1f45bf49..62bbb6037e1686 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -106,6 +106,18 @@
>  #define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
>  #define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
>  #define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
> +/*
> + * For !FWB these code to:
> + *  1111 = Normal outer write back cachable / Inner Write Back Cachable
> + *         Permit S1 to override
> + *  0101 = Normal Non-cachable / Inner Non-cachable
> + *  0001 = Device / Device-nGnRE
> + * For S2FWB these code:
> + *  0110 Force Normal Write Back
> + *  0101 Normal* is forced Normal-NC, Device unchanged
> + *  0001 Force Device-nGnRE
> + */
> +#define ARM_LPAE_PTE_MEMATTR_FWB_WB	(((arm_lpae_iopte)0x6) << 2)
>  #define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
>  #define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
>  #define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
> @@ -458,12 +470,16 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
>  	 */
>  	if (data->iop.fmt == ARM_64_LPAE_S2 ||
>  	    data->iop.fmt == ARM_32_LPAE_S2) {
> -		if (prot & IOMMU_MMIO)
> +		if (prot & IOMMU_MMIO) {
>  			pte |= ARM_LPAE_PTE_MEMATTR_DEV;
> -		else if (prot & IOMMU_CACHE)
> -			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> -		else
> +		} else if (prot & IOMMU_CACHE) {
> +			if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_S2FWB)
> +				pte |= ARM_LPAE_PTE_MEMATTR_FWB_WB;
> +			else
> +				pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> +		} else {
>  			pte |= ARM_LPAE_PTE_MEMATTR_NC;
> +		}
>  	} else {
>  		if (prot & IOMMU_MMIO)
>  			pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index f9a81761bfceda..aff9b020b6dcc7 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>  	 *	attributes set in the TCR for a non-coherent page-table walker.
>  	 *
>  	 * IO_PGTABLE_QUIRK_ARM_HD: Enables dirty tracking in stage 1 pagetable.
> +	 * IO_PGTABLE_QUIRK_ARM_S2FWB: Use the FWB format for the MemAttrs bits
>  	 */
>  	#define IO_PGTABLE_QUIRK_ARM_NS			BIT(0)
>  	#define IO_PGTABLE_QUIRK_NO_PERMS		BIT(1)
> @@ -95,6 +96,7 @@ struct io_pgtable_cfg {
>  	#define IO_PGTABLE_QUIRK_ARM_TTBR1		BIT(5)
>  	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA		BIT(6)
>  	#define IO_PGTABLE_QUIRK_ARM_HD			BIT(7)
> +	#define IO_PGTABLE_QUIRK_ARM_S2FWB		BIT(8)
>  	unsigned long			quirks;
>  	unsigned long			pgsize_bitmap;
>  	unsigned int			ias;
> -- 
> 2.46.0
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-20  8:30   ` Mostafa Saleh
@ 2024-08-20 12:01     ` Jason Gunthorpe
  2024-08-20 13:01       ` Jason Gunthorpe
  2024-08-20 19:52       ` Mostafa Saleh
  0 siblings, 2 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-20 12:01 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote:
> Hi Jason,
> 
> On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> > works. When S2FWB is supported and enabled the IOPTE will force cachable
> > access to IOMMU_CACHE memory and deny cachable access otherwise.
> > 
> > This is not especially meaningful for simple S2 domains, it apparently
> > doesn't even force PCI no-snoop access to be coherent.
> > 
> > However, when used with a nested S1, FWB has the effect of preventing the
> > guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> > cache. Consistent with KVM we wish to deny the guest the ability to become
> > incoherent with cached memory the hypervisor believes is cachable so we
> > don't have to flush it.
> > 
> > Turn on S2FWB whenever the SMMU supports it and use it for all S2
> > mappings.
> 
> I have been looking into this recently from the KVM side as it will
> use FWB for the CPU stage-2 unconditionally for guests (if supported),
> however that breaks for non-coherent devices when assigned, and
> limiting assigned devices to be coherent seems too restrictive.

kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That
concept is only relevant to the SMMU.

The issue on the KVM side is you can't put device MMIO into the CPU S2
using S2FWB and Normal Cachable, it will break the MMIO programming
model. That isn't "coherency" though.

It has to be Normal-NC, which this patch does:

https://lore.kernel.org/r/20240224150546.368-4-ankita@nvidia.com

> But for SMMUv3, S2FWB is per stream, can’t we just use it if the master
> is DMA coherent?

Sure, that seems to be a weird corner. Let's add this:

@@ -4575,7 +4575,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 
        /* IDR3 */
        reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
-       if (FIELD_GET(IDR3_FWB, reg))
+       /*
+        * If for some reason the HW does not support DMA coherency then using
+        * S2FWB won't work. This will also disable nesting support.
+        */
+       if (FIELD_GET(IDR3_FWB, reg) &&
+           (smmu->features & ARM_SMMU_FEAT_COHERENCY))
                smmu->features |= ARM_SMMU_FEAT_S2FWB;
        if (FIELD_GET(IDR3_RIL, reg))
                smmu->features |= ARM_SMMU_FEAT_RANGE_INV;

IMHO it would be weird to make HW that has S2FWB but not coherency,
but sure let's check it.

Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
so we won't even get a chance to ask for a S2 domain.

Jason


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-20 12:01     ` Jason Gunthorpe
@ 2024-08-20 13:01       ` Jason Gunthorpe
  2024-08-20 19:52       ` Mostafa Saleh
  1 sibling, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-20 13:01 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote:

> Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
> so we won't even get a chance to ask for a S2 domain.

And I should also say that without iommufd something like the DMA API
could select a S2 with S2FWB enabled, but all that does is change the
encoding of the memattr bits. Requests for !IOMMU_CACHE will still map
to non-cacheable IO PTEs like before - just with a different encoding.

The only thing at issue is nesting which will end up in the guest as
forced cachable - however since VFIO doesn't support non-DMA-coherent
devices at all this is not a problem.

Jason


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/8] Initial support for SMMUv3 nested translation
  2024-08-20  8:20 ` [PATCH 0/8] Initial support for SMMUv3 nested translation Mostafa Saleh
@ 2024-08-20 15:24   ` Nicolin Chen
  0 siblings, 0 replies; 34+ messages in thread
From: Nicolin Chen @ 2024-08-20 15:24 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Jason Gunthorpe, acpica-devel, Alex Williamson, Hanjun Guo, iommu,
	Joerg Roedel, Kevin Tian, kvm, Len Brown, linux-acpi,
	linux-arm-kernel, Lorenzo Pieralisi, Rafael J. Wysocki,
	Robert Moore, Robin Murphy, Sudeep Holla, Will Deacon, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi

On Tue, Aug 20, 2024 at 08:20:32AM +0000, Mostafa Saleh wrote:
> > This is the first series in what will be several to complete nesting
> > support. At least:
> >  - IOMMU_RESV_SW_MSI related fixups
> >  - VIOMMU object support to allow ATS invalidations
> >  - vCMDQ hypervisor support for direct invalidation queue assignment
> >  - KVM pinned VMID using VIOMMU for vBTM
> >  - Cross instance S2 sharing
> >  - Virtual Machine Structure using VIOMMU (for vMPAM?)
> >  - Fault forwarding support through IOMMUFD's fault fd for vSVA
> >
> > It is enough to allow significant amounts of qemu work to progress.
> >

> Are there any qemu patches to test this?
> I am confused by some of the user space bits, and that would help.

I have the qemu patches, but I am working through some backlog to keep
them updated, and I don't have a branch that exactly fits this series
for testing.

I collected a few remarks from Jason regarding the VIOMMU series and am
reworking it. I plan to post a testable QEMU branch with the next VIOMMU
version. Will CC you and more folks.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-20 12:01     ` Jason Gunthorpe
  2024-08-20 13:01       ` Jason Gunthorpe
@ 2024-08-20 19:52       ` Mostafa Saleh
  2024-08-20 20:21         ` Jason Gunthorpe
  1 sibling, 1 reply; 34+ messages in thread
From: Mostafa Saleh @ 2024-08-20 19:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote:
> > Hi Jason,
> > 
> > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> > > works. When S2FWB is supported and enabled the IOPTE will force cachable
> > > access to IOMMU_CACHE memory and deny cachable access otherwise.
> > > 
> > > This is not especially meaningful for simple S2 domains, it apparently
> > > doesn't even force PCI no-snoop access to be coherent.
> > > 
> > > However, when used with a nested S1, FWB has the effect of preventing the
> > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> > > cache. Consistent with KVM we wish to deny the guest the ability to become
> > > incoherent with cached memory the hypervisor believes is cachable so we
> > > don't have to flush it.
> > > 
> > > Turn on S2FWB whenever the SMMU supports it and use it for all S2
> > > mappings.
> > 
> > I have been looking into this recently from the KVM side as it will
> > use FWB for the CPU stage-2 unconditionally for guests (if supported),
> > however that breaks for non-coherent devices when assigned, and
> > limiting assigned devices to be coherent seems too restrictive.
> 
> kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That
> concept is only relevant to the SMMU.
> 

Why not? That would be a problem if a device is not DMA coherent
and the VM knows that, so it maps its DMA memory as non-cacheable.
That mapping would be overridden by FWB in stage-2 to be cacheable,
which would lead to coherency issues.

> The issue on the KVM side is you can't put device MMIO into the CPU S2
> using S2FWB and Normal Cachable, it will break the MMIO programming
> model. That isn't "coherency" though.
> It has to be Normal-NC, which this patch does:
> 
> https://lore.kernel.org/r/20240224150546.368-4-ankita@nvidia.com

Yes, that also breaks (although I think this is an easier problem to
solve)

> 
> > But for SMMUv3, S2FWB is per stream, can’t we just use it if the master
> > is DMA coherent?
> 
> Sure, that seems to be a weird corner. Lets add this:
> 
> @@ -4575,7 +4575,12 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  
>         /* IDR3 */
>         reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> -       if (FIELD_GET(IDR3_FWB, reg))
> +       /*
> +        * If for some reason the HW does not support DMA coherency then using
> +        * S2FWB won't work. This will also disable nesting support.
> +        */
> +       if (FIELD_GET(IDR3_FWB, reg) &&
> +           (smmu->features & ARM_SMMU_FEAT_COHERENCY))
>                 smmu->features |= ARM_SMMU_FEAT_S2FWB;
>         if (FIELD_GET(IDR3_RIL, reg))
>                 smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
> 
> IMHO it would be weird to make HW that has S2FWB but not coherency,
> but sure let's check it.
> 
What I mean is the master itself, not the SMMU (the SID, basically),
so in that case the STE shouldn’t have FWB enabled.

> Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
> so we won't even get a chance to ask for a S2 domain.

Oh, I think that is only for the SMMU, not for the master; the
SMMU can be coherent (for PTEs, STEs, …) but the master can still be
non-coherent. Looking at how VFIO uses it, that seems to be a bug?

Thanks,
Mostafa

> 
> Jason


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-20 19:52       ` Mostafa Saleh
@ 2024-08-20 20:21         ` Jason Gunthorpe
  2024-08-21  9:53           ` Mostafa Saleh
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-20 20:21 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Aug 20, 2024 at 07:52:53PM +0000, Mostafa Saleh wrote:
> On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote:
> > > Hi Jason,
> > > 
> > > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> > > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> > > > works. When S2FWB is supported and enabled the IOPTE will force cachable
> > > > access to IOMMU_CACHE memory and deny cachable access otherwise.
> > > > 
> > > > This is not especially meaningful for simple S2 domains, it apparently
> > > > doesn't even force PCI no-snoop access to be coherent.
> > > > 
> > > > However, when used with a nested S1, FWB has the effect of preventing the
> > > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> > > > cache. Consistent with KVM we wish to deny the guest the ability to become
> > > > incoherent with cached memory the hypervisor believes is cachable so we
> > > > don't have to flush it.
> > > > 
> > > > Turn on S2FWB whenever the SMMU supports it and use it for all S2
> > > > mappings.
> > > 
> > > I have been looking into this recently from the KVM side as it will
> > > use FWB for the CPU stage-2 unconditionally for guests(if supported),
> > > however that breaks for non-coherent devices when assigned, and
> > > limiting assigned devices to be coherent seems too restrictive.
> > 
> > kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That
> > concept is only relevant to the SMMU.
>
> Why not? That would be a problem if a device is not DMA coherent
> and the VM knows that, so it maps its DMA memory as non-cacheable.
> If that mapping were overridden by FWB in stage-2 to be cacheable,
> it would lead to coherency issues.

Oh, from that perspective yes, but the entire point of S2FWB is that
VMs cannot create non-coherent access, so it is a bit nonsensical to
ask for S2FWB while also trying to assign a non-DMA-coherent device.

> Yes, that also breaks (although I think this is an easier problem to
> solve)

Well, it is easy to solve, just don't use S2FWB and manually flush the
caches before the hypervisor touches any memory. :)

> What I mean is the master itself, not the SMMU (the SID, basically),
> so in that case the STE shouldn’t have FWB enabled.

That doesn't matter, those cases will not pass in IOMMU_CACHE and they
will work fine with S2FWB turned on.
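As a sketch of that interaction (the flag value and MemAttr encodings
below are illustrative placeholders, not the real io-pgtable ones):

```c
#include <assert.h>
#include <stdint.h>

#define IOMMU_CACHE        (1u << 2)   /* illustrative flag value */

/* Illustrative stand-ins for the two "forced" S2FWB MemAttr
 * encodings; the real encodings live in the SMMUv3/io-pgtable code. */
#define MEMATTR_FORCE_WB   0x6u        /* forced write-back cacheable */
#define MEMATTR_FORCE_NC   0x5u        /* forced non-cacheable */

/* With S2FWB the stage-2 MemAttr is chosen purely from whether the
 * hypervisor created the mapping with IOMMU_CACHE; a master whose
 * mappings never pass IOMMU_CACHE just ends up forced non-cacheable
 * and keeps working. */
static uint32_t s2fwb_memattr(unsigned int prot)
{
    return (prot & IOMMU_CACHE) ? MEMATTR_FORCE_WB : MEMATTR_FORCE_NC;
}
```

So the non-IOMMU_CACHE path degrades to forced non-cacheable rather
than breaking outright.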

> > Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
> > so we won't even get a chance to ask for a S2 domain.
> 
> Oh, I think that is only for the SMMU, not for the master; the
> SMMU can be coherent (for PTEs, STEs, …) but the master can still be
> non-coherent. Looking at how VFIO uses it, that seems to be a bug?

If there are mixes of the SMMU feature and dev_is_dma_coherent() then
it would be a bug, yes.

I recall we started out trying to use dev_is_dma_coherent() but
Christoph explained it doesn't work that generally:

https://lore.kernel.org/kvm/20220406135150.GA21532@lst.de/

Seems we sort of gave up on it, too complicated. Robin had a nice
observation of the complexity:

    Disregarding the complete disaster of PCIe No Snoop on Arm-Based 
    systems, there's the more interesting effectively-opposite scenario 
    where an SMMU bridges non-coherent devices to a coherent interconnect. 
    It's not something we take advantage of yet in Linux, and it can only be 
    properly described in ACPI, but there do exist situations where 
    IOMMU_CACHE is capable of making the device's traffic snoop, but 
    dev_is_dma_coherent() - and device_get_dma_attr() for external users - 
    would still say non-coherent because they can't assume that the SMMU is 
    enabled and programmed in just the right way.

Anyhow, for the purposes of KVM and VFIO, devices that don't work with
IOMMU_CACHE are not allowed. From an API perspective
IOMMU_CAP_CACHE_COHERENCY is supposed to return if the struct device
can use IOMMU_CACHE.
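In sketch form (stub types so it stands alone; only
device_iommu_capable() and the capability name are real kernel
identifiers, everything else is an assumption):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum iommu_cap { IOMMU_CAP_CACHE_COHERENCY };

struct device {
    /* Stand-in for whatever the IOMMU driver consults in .capable */
    bool smmu_coherent;
};

static bool device_iommu_capable(struct device *dev, enum iommu_cap cap)
{
    /* Today's SMMUv3 answer is a per-SMMU feature bit, which is
     * exactly why a non-coherent master behind a coherent SMMU can
     * slip through. */
    return cap == IOMMU_CAP_CACHE_COHERENCY && dev->smmu_coherent;
}

static int vfio_admit_device(struct device *dev)
{
    if (!device_iommu_capable(dev, IOMMU_CAP_CACHE_COHERENCY))
        return -EINVAL;        /* can't safely force IOMMU_CACHE */
    return 0;
}
```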

The corner case where we have a ARM_SMMU_FEAT_COHERENCY SMMU but
somehow specific devices don't support IOMMU_CACHE is not properly
reflected in IOMMU_CAP_CACHE_COHERENCY. I don't know how to fix that,
and we've been ignoring it for a long time now :)

Jason


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-20 20:21         ` Jason Gunthorpe
@ 2024-08-21  9:53           ` Mostafa Saleh
  2024-08-21 12:06             ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: Mostafa Saleh @ 2024-08-21  9:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Marc Zyngier

On Tue, Aug 20, 2024 at 05:21:38PM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2024 at 07:52:53PM +0000, Mostafa Saleh wrote:
> > On Tue, Aug 20, 2024 at 09:01:02AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2024 at 08:30:05AM +0000, Mostafa Saleh wrote:
> > > > Hi Jason,
> > > > 
> > > > On Tue, Aug 06, 2024 at 08:41:15PM -0300, Jason Gunthorpe wrote:
> > > > > Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> > > > > works. When S2FWB is supported and enabled the IOPTE will force cachable
> > > > > access to IOMMU_CACHE memory and deny cachable access otherwise.
> > > > > 
> > > > > This is not especially meaningful for simple S2 domains, it apparently
> > > > > doesn't even force PCI no-snoop access to be coherent.
> > > > > 
> > > > > However, when used with a nested S1, FWB has the effect of preventing the
> > > > > guest from choosing a MemAttr that would cause ordinary DMA to bypass the
> > > > > cache. Consistent with KVM we wish to deny the guest the ability to become
> > > > > incoherent with cached memory the hypervisor believes is cachable so we
> > > > > don't have to flush it.
> > > > > 
> > > > > Turn on S2FWB whenever the SMMU supports it and use it for all S2
> > > > > mappings.
> > > > 
> > > > I have been looking into this recently from the KVM side as it will
> > > > use FWB for the CPU stage-2 unconditionally for guests(if supported),
> > > > however that breaks for non-coherent devices when assigned, and
> > > > limiting assigned devices to be coherent seems too restrictive.
> > > 
> > > kvm's CPU S2 doesn't care about non-DMA-coherent devices though? That
> > > concept is only relevant to the SMMU.
> >
> > Why not? That would be a problem if a device is not DMA coherent
> > and the VM knows that, so it maps its DMA memory as non-cacheable.
> > If that mapping were overridden by FWB in stage-2 to be cacheable,
> > it would lead to coherency issues.
> 
> Oh, from that perspective yes, but the entire point of S2FWB is that
> VMs cannot create non-coherent access, so it is a bit nonsensical to
> ask for S2FWB while also trying to assign a non-DMA-coherent device.

Yes, but KVM sets FWB unconditionally and would use cacheable mapping
for stage-2, and I expect the same for the nested SMMU.

> 
> > Yes, that also breaks (although I think this is an easier problem to
> > solve)
> 
> Well, it is easy to solve, just don't use S2FWB and manually flush the
> caches before the hypervisor touches any memory. :)

Yes, although that means virtualized devices would have worse
performance :/ but I guess there is nothing more to do here.

I have some ideas about that, I can send patches to the kvm list
as an RFC.

> 
> > What I mean is the master itself, not the SMMU (the SID, basically),
> > so in that case the STE shouldn’t have FWB enabled.
> 
> That doesn't matter, those cases will not pass in IOMMU_CACHE and they
> will work fine with S2FWB turned on.
> 

But that won’t be the case with nesting? Otherwise why would we use
FWB in the first place?

> > > Also bear in mind VFIO won't run unless ARM_SMMU_FEAT_COHERENCY is set
> > > so we won't even get a chance to ask for a S2 domain.
> > 
> > Oh, I think that is only for the SMMU, not for the master; the
> > SMMU can be coherent (for PTEs, STEs, …) but the master can still be
> > non-coherent. Looking at how VFIO uses it, that seems to be a bug?
> 
> If there are mixes of the SMMU feature and dev_is_dma_coherent() then
> it would be a bug, yes.
> 

I think there is a bug: I was able to assign a “non-coherent” device
with VFIO with no issues; it is allowed as long as the SMMU is coherent.

> I recall we started out trying to use dev_is_dma_coherent() but
> Christoph explained it doesn't work that generally:
> 
> https://lore.kernel.org/kvm/20220406135150.GA21532@lst.de/
> 
> Seems we sort of gave up on it, too complicated. Robin had a nice
> observation of the complexity:
> 
>     Disregarding the complete disaster of PCIe No Snoop on Arm-Based 
>     systems, there's the more interesting effectively-opposite scenario 
>     where an SMMU bridges non-coherent devices to a coherent interconnect. 
>     It's not something we take advantage of yet in Linux, and it can only be 
>     properly described in ACPI, but there do exist situations where 
>     IOMMU_CACHE is capable of making the device's traffic snoop, but 
>     dev_is_dma_coherent() - and device_get_dma_attr() for external users - 
>     would still say non-coherent because they can't assume that the SMMU is 
>     enabled and programmed in just the right way.
> 
> Anyhow, for the purposes of KVM and VFIO, devices that don't work with
> IOMMU_CACHE are not allowed. From an API perspective
> IOMMU_CAP_CACHE_COHERENCY is supposed to return if the struct device
> can use IOMMU_CACHE.
> 
> The corner case where we have a ARM_SMMU_FEAT_COHERENCY SMMU but
> somehow specific devices don't support IOMMU_CACHE is not properly
> reflected in IOMMU_CAP_CACHE_COHERENCY. I don't know how to fix that,
> and we've been ignoring it for a long time now :)

Thanks a lot for the extra context!

Maybe the SMMUv3 .capable should be changed to check whether the
device is coherent (instead of using dev_is_dma_coherent(), it can
use lower-level functions from the supported buses).

Also, I think supporting IOMMU_CACHE is not enough, as the SMMU can
support it while the device is still not coherent.

Thanks,
Mostafa

> 
> Jason


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-21  9:53           ` Mostafa Saleh
@ 2024-08-21 12:06             ` Jason Gunthorpe
  0 siblings, 0 replies; 34+ messages in thread
From: Jason Gunthorpe @ 2024-08-21 12:06 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Alex Williamson, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Marc Zyngier

On Wed, Aug 21, 2024 at 09:53:33AM +0000, Mostafa Saleh wrote:
> > Oh, from that perspective yes, but the entire point of S2FWB is that
> > VMs cannot create non-coherent access, so it is a bit nonsensical to
> > ask for S2FWB while also trying to assign a non-DMA-coherent device.
> 
> Yes, but KVM sets FWB unconditionally and would use cacheable mapping
> for stage-2, and I expect the same for the nested SMMU.

Yes, you'd need some kind of handshake like Intel built for their GPU
cache incoherence to turn that off.

> > > What I mean is the master itself not the SMMU (the SID basically),
> > > so in that case the STE shouldn’t have FWB enabled.
> > 
> > That doesn't matter, those cases will not pass in IOMMU_CACHE and they
> > will work fine with S2FWB turned on.
> 
> But that won’t be the case with nesting? Otherwise why would we use
> FWB in the first place?

Right, without KVM support for guest cacheability selection and cache
flushing in VFIO, it is infeasible to allow non-coherent devices. It
is a medium-sized problem if someone wants to tackle it.

> Maybe the SMMUv3 .capable should be changed to check whether the
> device is coherent (instead of using dev_is_dma_coherent(), it can
> use lower-level functions from the supported buses).

That would be the fix I expect. Either SMMUv3 does it, or the core
code adds it on top in the .capable wrapper. It makes sense to me that
the iommu driver should be aware of per-master coherence capability.
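One hypothetical shape for that (every name and the per-master flag
below are assumptions for illustration, not current driver code):

```c
#include <assert.h>
#include <stdbool.h>

struct smmu {
    bool feat_coherency;       /* ARM_SMMU_FEAT_COHERENCY-style bit */
};

struct master {
    struct smmu *smmu;
    bool dma_coherent;         /* per-master, e.g. from firmware tables */
};

/* Both the SMMU's own walks and the master's traffic must snoop
 * before IOMMU_CACHE can be honoured without cache maintenance. */
static bool smmu_capable_cache_coherency(struct master *m)
{
    return m->smmu->feat_coherency && m->dma_coherent;
}
```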

> Also, I think supporting IOMMU_CACHE is not enough, as the SMMU can
> support it while the device is still not coherent.

IOMMU_CACHE is defined as requiring no cache maintenance on that
memory.

If specific devices can't guarantee that, then IOMMU_CACHE should not
be used on those devices and IOMMU_CAP_CACHE_COHERENCY for those
devices should be false.

That is what I mean by support.

Anyhow, I'm going to continue to leave this problem alone for
nesting. Nothing gets worse by adding nesting on top of this. Even if
we wrongly permit VFIO to open non-coherent devices, they won't
actually work correctly (VFIO forces IOMMU_CACHE and S2FWB). Most
likely anything trying to use them will just crash or malfunction due
to missing cache flushing.

Jason


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2024-08-21 12:07 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-06 23:41 [PATCH 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
2024-08-06 23:41 ` [PATCH 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
2024-08-07 17:52   ` Alex Williamson
2024-08-20  8:23   ` Mostafa Saleh
2024-08-06 23:41 ` [PATCH 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
2024-08-09 14:26   ` Shameerali Kolothum Thodi
2024-08-09 15:12     ` Jason Gunthorpe
2024-08-15 16:14       ` Shameerali Kolothum Thodi
2024-08-15 16:18         ` Jason Gunthorpe
2024-08-20  8:30   ` Mostafa Saleh
2024-08-20 12:01     ` Jason Gunthorpe
2024-08-20 13:01       ` Jason Gunthorpe
2024-08-20 19:52       ` Mostafa Saleh
2024-08-20 20:21         ` Jason Gunthorpe
2024-08-21  9:53           ` Mostafa Saleh
2024-08-21 12:06             ` Jason Gunthorpe
2024-08-06 23:41 ` [PATCH 3/8] ACPI/IORT: Support CANWBS memory access flag Jason Gunthorpe
2024-08-09 14:36   ` Shameerali Kolothum Thodi
2024-08-09 14:59     ` Jason Gunthorpe
2024-08-09 15:15       ` Shameerali Kolothum Thodi
2024-08-09 20:14         ` Nicolin Chen
2024-08-06 23:41 ` [PATCH 4/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS Jason Gunthorpe
2024-08-06 23:41 ` [PATCH 5/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info Jason Gunthorpe
2024-08-06 23:41 ` [PATCH 6/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT Jason Gunthorpe
2024-08-09 15:06   ` Robin Murphy
2024-08-09 16:09     ` Jason Gunthorpe
2024-08-09 18:34       ` Robin Murphy
2024-08-13 14:35         ` Jason Gunthorpe
2024-08-06 23:41 ` [PATCH 7/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED Jason Gunthorpe
2024-08-09 16:05   ` Robin Murphy
2024-08-09 18:03     ` Jason Gunthorpe
2024-08-06 23:41 ` [PATCH 8/8] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user Jason Gunthorpe
2024-08-20  8:20 ` [PATCH 0/8] Initial support for SMMUv3 nested translation Mostafa Saleh
2024-08-20 15:24   ` Nicolin Chen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).