linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v2 0/8] Initial support for SMMUv3 nested translation
@ 2024-08-27 15:51 Jason Gunthorpe
  2024-08-27 15:51 ` [PATCH v2 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
                   ` (9 more replies)
  0 siblings, 10 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-27 15:51 UTC
  To: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

This brings support for the following IOMMUFD functionality:

 - IOMMU_GET_HW_INFO
 - IOMMU_HWPT_ALLOC_NEST_PARENT
 - IOMMU_DOMAIN_NESTED
 - ops->enforce_cache_coherency()

This is quite straightforward as the nested STE can just be built in the
special NESTED domain op and fed through the generic update machinery.

The design allows the user provided STE fragment to control several
aspects of the translation, including putting the STE into a "virtual
bypass" or an aborting state. This duplicates functionality available by
other means, but it allows trivially preserving the VMID in the STE as we
eventually move towards the VIOMMU owning the VMID.
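
For illustration only, the vSTE handling described above can be sketched
as a mapping from the configuration the guest requests to the
configuration installed in the physical STE. This is a sketch under
assumptions, not the driver code: nested_ste_cfg() is a hypothetical
helper, the STE.Config encodings are those of the SMMUv3 architecture,
and both the "bypass becomes S2-only" mapping and the rejection of all
other configurations are assumptions about how the series behaves.

```c
#include <stdbool.h>

/* STE.Config encodings as defined by the SMMUv3 architecture. */
enum ste_cfg {
	STE_CFG_ABORT    = 0x0,	/* terminate all traffic */
	STE_CFG_BYPASS   = 0x4,	/* no translation at all */
	STE_CFG_S1_TRANS = 0x5,	/* stage 1 only */
	STE_CFG_S2_TRANS = 0x6,	/* stage 2 only */
	STE_CFG_NESTED   = 0x7,	/* stage 1 + stage 2 */
};

/*
 * Hypothetical helper: choose the physical STE configuration for a
 * user-provided vSTE. Returns -1 if the request is not supported.
 */
static int nested_ste_cfg(bool vste_valid, unsigned int vste_cfg)
{
	/* v2 changelog: a non-valid vSTE yields an abort STE. */
	if (!vste_valid)
		return STE_CFG_ABORT;

	switch (vste_cfg) {
	case STE_CFG_ABORT:
		return STE_CFG_ABORT;
	case STE_CFG_BYPASS:
		/* Assumed "virtual bypass": guest S1 is skipped, S2 still applies. */
		return STE_CFG_S2_TRANS;
	case STE_CFG_S1_TRANS:
		/* Guest S1 nested on top of the hypervisor's S2. */
		return STE_CFG_NESTED;
	default:
		/* Assumption: the guest may not pick its own S2 settings. */
		return -1;
	}
}
```

In every accepted case the physical STE stays under hypervisor control,
which is what lets the VMID be preserved across these transitions.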

Nesting support requires the system to either support S2FWB or the
stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
cache and view incoherent data; currently VFIO lacks any cache flushing
that would make this safe.
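
A minimal sketch of that requirement, assuming the feature and flag bits
introduced later in this series (ARM_SMMU_FEAT_S2FWB from patch 2 and
IOMMU_FWSPEC_PCI_RC_CANWBS from patch 4). nesting_is_safe() is a
hypothetical name used only for illustration, not a function in the
driver:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from patches 2 and 4 of this series. */
#define ARM_SMMU_FEAT_S2FWB		(1u << 23)
#define IOMMU_FWSPEC_PCI_RC_CANWBS	(1u << 1)

/*
 * Hypothetical predicate: nesting is only safe when either the SMMU
 * forces write-back at stage 2 (S2FWB) or the root complex guarantees
 * coherency regardless of memory attributes (CANWBS).
 */
static bool nesting_is_safe(uint32_t smmu_features, uint32_t fwspec_flags)
{
	return (smmu_features & ARM_SMMU_FEAT_S2FWB) ||
	       (fwspec_flags & IOMMU_FWSPEC_PCI_RC_CANWBS);
}
```

Either condition alone is sufficient; a system with neither would need
the cache flushing that VFIO does not yet provide.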

Yan has a series to add some of the needed infrastructure for VFIO cache
flushing here:

 https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/

That series may someday allow relaxing this requirement further.

Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and is superseded
by this.

This is the first series in what will be several to complete nesting
support. At least:
 - IOMMU_RESV_SW_MSI related fixups
    https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
 - VIOMMU object support to allow ATS and CD invalidations
    https://lore.kernel.org/linux-iommu/cover.1723061377.git.nicolinc@nvidia.com/
 - vCMDQ hypervisor support for direct invalidation queue assignment
    https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
 - KVM pinned VMID using VIOMMU for vBTM
    https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
 - Cross instance S2 sharing
 - Virtual Machine Structure using VIOMMU (for vMPAM?)
 - Fault forwarding support through IOMMUFD's fault fd for vSVA

The VIOMMU series is essential to allow the invalidations to be processed
for the CD as well.

This series is enough to allow qemu work to progress.

This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting

v2:
 - Revise commit messages
 - Guard S2FWB support with ARM_SMMU_FEAT_COHERENCY, since it doesn't make
   sense to use S2FWB to enforce coherency on inherently non-coherent hardware.
 - Add missing IO_PGTABLE_QUIRK_ARM_S2FWB validation
 - Include formal ACPICA commit for IORT built using
   generate/linux/gen-patch.sh
 - Use FEAT_NESTING to block creating a NESTING_PARENT
 - Use an abort STE instead of non-valid if the user requests a non-valid
   vSTE
 - Consistently use 'nest_parent' for naming variables
 - Use the right domain for arm_smmu_remove_master_domain() when it
   removes the master
 - Join bitfields together
 - Drop arm_smmu_cache_invalidate_user patch, invalidation will
   exclusively go via viommu
v1: https://patch.msgid.link/r/0-v1-54e734311a7f+14f72-smmuv3_nesting_jgg@nvidia.com

Jason Gunthorpe (5):
  vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  iommu/arm-smmu-v3: Use S2FWB when available
  iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED

Nicolin Chen (3):
  ACPICA: IORT: Update for revision E.f
  ACPI/IORT: Support CANWBS memory access flag
  iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct
    arm_smmu_hw_info

 drivers/acpi/arm64/iort.c                   |  13 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 314 ++++++++++++++++++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  26 ++
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |  16 -
 drivers/iommu/io-pgtable-arm.c              |  27 +-
 drivers/iommu/iommu.c                       |  10 -
 drivers/iommu/iommufd/vfio_compat.c         |   7 +-
 drivers/vfio/vfio_iommu_type1.c             |  12 +-
 include/acpi/actbl2.h                       |   3 +-
 include/linux/io-pgtable.h                  |   2 +
 include/linux/iommu.h                       |   5 +-
 include/uapi/linux/iommufd.h                |  55 ++++
 include/uapi/linux/vfio.h                   |   2 +-
 13 files changed, 415 insertions(+), 77 deletions(-)


base-commit: e5e288d94186b266b062b3e44c82c285dfe68712
-- 
2.46.0




* [PATCH v2 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
@ 2024-08-27 15:51 ` Jason Gunthorpe
  2024-08-30  7:40   ` Tian, Kevin
  2024-08-27 15:51 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-27 15:51 UTC
  To: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

This control causes the ARM SMMU drivers to choose a stage 2
implementation for the IO pagetable (vs the usual stage 1 default),
however this choice has no significant visible impact on the VFIO
user. Further, qemu never implemented this and no other userspace user is
known.

The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add
new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide
SMMU translation services to the guest operating system" however the rest
of the API to set the guest table pointer for the stage 1 and manage
invalidation was never completed, or at least never upstreamed, rendering
this part useless dead code.

Upstream has now settled on iommufd as the uAPI for controlling nested
translation. Choosing the stage 2 implementation should be done through
the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation.

Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
enable_nesting iommu_domain_op.

Just in case some userspace is using this, continue to treat
requesting it as a NOP, but do not advertise support any more.
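
The resulting behaviour can be modelled in a few lines. This is a
simplified sketch, not the kernel code: model_check_extension() and
model_open() are hypothetical stand-ins for the check-extension and open
paths changed in the diff below, and only the three type values relevant
here are handled.

```c
#include <stdbool.h>

/* Values as defined in include/uapi/linux/vfio.h after this patch. */
#define VFIO_TYPE1_IOMMU			1
#define VFIO_TYPE1v2_IOMMU			3
#define __VFIO_RESERVED_TYPE1_NESTING_IOMMU	6

/* Model of the extension check: the reserved value is not advertised. */
static int model_check_extension(unsigned long arg)
{
	switch (arg) {
	case VFIO_TYPE1_IOMMU:
	case VFIO_TYPE1v2_IOMMU:
		return 1;
	default:
		return 0;
	}
}

/*
 * Model of opening the container: returns 0 if the type is accepted and
 * -1 otherwise. The reserved value is still accepted and behaves
 * exactly like type1 v2, i.e. requesting nesting is a NOP.
 */
static int model_open(unsigned long arg, bool *v2)
{
	*v2 = false;
	switch (arg) {
	case VFIO_TYPE1_IOMMU:
		return 0;
	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
	case VFIO_TYPE1v2_IOMMU:
		*v2 = true;
		return 0;
	default:
		return -1;
	}
}
```

A userspace that probes for the extension first will see it as absent,
while one that blindly selects value 6 keeps working as plain type1 v2.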

Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 ----------------
 drivers/iommu/arm/arm-smmu/arm-smmu.c       | 16 ----------------
 drivers/iommu/iommu.c                       | 10 ----------
 drivers/iommu/iommufd/vfio_compat.c         |  7 +------
 drivers/vfio/vfio_iommu_type1.c             | 12 +-----------
 include/linux/iommu.h                       |  3 ---
 include/uapi/linux/vfio.h                   |  2 +-
 7 files changed, 3 insertions(+), 63 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index e5db5325f7eaed..531125f231b662 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3331,21 +3331,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
 	return group;
 }
 
-static int arm_smmu_enable_nesting(struct iommu_domain *domain)
-{
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	int ret = 0;
-
-	mutex_lock(&smmu_domain->init_mutex);
-	if (smmu_domain->smmu)
-		ret = -EPERM;
-	else
-		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
-	mutex_unlock(&smmu_domain->init_mutex);
-
-	return ret;
-}
-
 static int arm_smmu_of_xlate(struct device *dev,
 			     const struct of_phandle_args *args)
 {
@@ -3467,7 +3452,6 @@ static struct iommu_ops arm_smmu_ops = {
 		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
 		.iotlb_sync		= arm_smmu_iotlb_sync,
 		.iova_to_phys		= arm_smmu_iova_to_phys,
-		.enable_nesting		= arm_smmu_enable_nesting,
 		.free			= arm_smmu_domain_free_paging,
 	}
 };
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 723273440c2118..38dad1fd53b80a 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1558,21 +1558,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
 	return group;
 }
 
-static int arm_smmu_enable_nesting(struct iommu_domain *domain)
-{
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	int ret = 0;
-
-	mutex_lock(&smmu_domain->init_mutex);
-	if (smmu_domain->smmu)
-		ret = -EPERM;
-	else
-		smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
-	mutex_unlock(&smmu_domain->init_mutex);
-
-	return ret;
-}
-
 static int arm_smmu_set_pgtable_quirks(struct iommu_domain *domain,
 		unsigned long quirks)
 {
@@ -1656,7 +1641,6 @@ static struct iommu_ops arm_smmu_ops = {
 		.flush_iotlb_all	= arm_smmu_flush_iotlb_all,
 		.iotlb_sync		= arm_smmu_iotlb_sync,
 		.iova_to_phys		= arm_smmu_iova_to_phys,
-		.enable_nesting		= arm_smmu_enable_nesting,
 		.set_pgtable_quirks	= arm_smmu_set_pgtable_quirks,
 		.free			= arm_smmu_domain_free,
 	}
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ed6c5cb60c5aee..9da63d57a53cd7 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2723,16 +2723,6 @@ static int __init iommu_init(void)
 }
 core_initcall(iommu_init);
 
-int iommu_enable_nesting(struct iommu_domain *domain)
-{
-	if (domain->type != IOMMU_DOMAIN_UNMANAGED)
-		return -EINVAL;
-	if (!domain->ops->enable_nesting)
-		return -EINVAL;
-	return domain->ops->enable_nesting(domain);
-}
-EXPORT_SYMBOL_GPL(iommu_enable_nesting);
-
 int iommu_set_pgtable_quirks(struct iommu_domain *domain,
 		unsigned long quirk)
 {
diff --git a/drivers/iommu/iommufd/vfio_compat.c b/drivers/iommu/iommufd/vfio_compat.c
index a3ad5f0b6c59dd..514aacd6400949 100644
--- a/drivers/iommu/iommufd/vfio_compat.c
+++ b/drivers/iommu/iommufd/vfio_compat.c
@@ -291,12 +291,7 @@ static int iommufd_vfio_check_extension(struct iommufd_ctx *ictx,
 	case VFIO_DMA_CC_IOMMU:
 		return iommufd_vfio_cc_iommu(ictx);
 
-	/*
-	 * This is obsolete, and to be removed from VFIO. It was an incomplete
-	 * idea that got merged.
-	 * https://lore.kernel.org/kvm/0-v1-0093c9b0e345+19-vfio_no_nesting_jgg@nvidia.com/
-	 */
-	case VFIO_TYPE1_NESTING_IOMMU:
+	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
 		return 0;
 
 	/*
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0960699e75543e..13cf6851cc2718 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -72,7 +72,6 @@ struct vfio_iommu {
 	uint64_t		pgsize_bitmap;
 	uint64_t		num_non_pinned_groups;
 	bool			v2;
-	bool			nesting;
 	bool			dirty_page_tracking;
 	struct list_head	emulated_iommu_groups;
 };
@@ -2199,12 +2198,6 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 		goto out_free_domain;
 	}
 
-	if (iommu->nesting) {
-		ret = iommu_enable_nesting(domain->domain);
-		if (ret)
-			goto out_domain;
-	}
-
 	ret = iommu_attach_group(domain->domain, group->iommu_group);
 	if (ret)
 		goto out_domain;
@@ -2545,9 +2538,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	switch (arg) {
 	case VFIO_TYPE1_IOMMU:
 		break;
-	case VFIO_TYPE1_NESTING_IOMMU:
-		iommu->nesting = true;
-		fallthrough;
+	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
 	case VFIO_TYPE1v2_IOMMU:
 		iommu->v2 = true;
 		break;
@@ -2642,7 +2633,6 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	switch (arg) {
 	case VFIO_TYPE1_IOMMU:
 	case VFIO_TYPE1v2_IOMMU:
-	case VFIO_TYPE1_NESTING_IOMMU:
 	case VFIO_UNMAP_ALL:
 		return 1;
 	case VFIO_UPDATE_VADDR:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 4d47f2c3331185..15d7657509f662 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -635,7 +635,6 @@ struct iommu_ops {
  * @enforce_cache_coherency: Prevent any kind of DMA from bypassing IOMMU_CACHE,
  *                           including no-snoop TLPs on PCIe or other platform
  *                           specific mechanisms.
- * @enable_nesting: Enable nesting
  * @set_pgtable_quirks: Set io page table quirks (IO_PGTABLE_QUIRK_*)
  * @free: Release the domain after use.
  */
@@ -663,7 +662,6 @@ struct iommu_domain_ops {
 				    dma_addr_t iova);
 
 	bool (*enforce_cache_coherency)(struct iommu_domain *domain);
-	int (*enable_nesting)(struct iommu_domain *domain);
 	int (*set_pgtable_quirks)(struct iommu_domain *domain,
 				  unsigned long quirks);
 
@@ -846,7 +844,6 @@ extern void iommu_group_put(struct iommu_group *group);
 extern int iommu_group_id(struct iommu_group *group);
 extern struct iommu_domain *iommu_group_default_domain(struct iommu_group *);
 
-int iommu_enable_nesting(struct iommu_domain *domain);
 int iommu_set_pgtable_quirks(struct iommu_domain *domain,
 		unsigned long quirks);
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 2b68e6cdf1902f..c8dbf8219c4fcb 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -35,7 +35,7 @@
 #define VFIO_EEH			5
 
 /* Two-stage IOMMU */
-#define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
+#define __VFIO_RESERVED_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
 
 #define VFIO_SPAPR_TCE_v2_IOMMU		7
 
-- 
2.46.0




* [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
  2024-08-27 15:51 ` [PATCH v2 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
@ 2024-08-27 15:51 ` Jason Gunthorpe
  2024-08-27 19:48   ` Nicolin Chen
                     ` (4 more replies)
  2024-08-27 15:51 ` [PATCH v2 3/8] ACPICA: IORT: Update for revision E.f Jason Gunthorpe
                   ` (7 subsequent siblings)
  9 siblings, 5 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-27 15:51 UTC
  To: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
works. When S2FWB is supported and enabled, the IOPTE will force cacheable
access to IOMMU_CACHE memory when nesting with an S1 and deny cacheable
access otherwise.

When using a single stage of translation, a simple S2 domain, it doesn't
change anything, as it is just a different encoding for the existing
mapping of the IOMMU protection flags to cacheability attributes.

However, when used with a nested S1, FWB has the effect of preventing the
guest from choosing a MemAttr in its S1 that would cause ordinary DMA to
bypass the cache. Consistent with KVM, we wish to deny the guest the
ability to become incoherent with cached memory the hypervisor believes is
cacheable, so we don't have to flush it.

Turn on S2FWB whenever the SMMU supports it and use it for all S2
mappings.
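
As a standalone illustration of the encoding change, the stage 2 MemAttr
selection can be written as a small function. This is a sketch that
mirrors the logic added to arm_lpae_prot_to_pte() in the diff below;
s2_memattr() is a hypothetical name, and the IOMMU_CACHE and IOMMU_MMIO
values are defined locally for the example.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t arm_lpae_iopte;

/* Protection flags, defined locally for this example. */
#define IOMMU_CACHE	(1 << 2)
#define IOMMU_MMIO	(1 << 4)

/* MemAttr encodings (PTE bits [5:2]) as used in this patch. */
#define ARM_LPAE_PTE_MEMATTR_FWB_WB	(((arm_lpae_iopte)0x6) << 2)
#define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
#define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
#define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)

/*
 * Hypothetical helper: pick the S2 MemAttr bits for a mapping. Only
 * the cacheable case changes when S2FWB is in use; the Device and
 * Non-cacheable encodings are shared by both formats.
 */
static arm_lpae_iopte s2_memattr(int prot, bool s2fwb)
{
	if (prot & IOMMU_MMIO)
		return ARM_LPAE_PTE_MEMATTR_DEV;
	if (prot & IOMMU_CACHE)
		return s2fwb ? ARM_LPAE_PTE_MEMATTR_FWB_WB :
			       ARM_LPAE_PTE_MEMATTR_OIWB;
	return ARM_LPAE_PTE_MEMATTR_NC;
}
```

With S2FWB, 0110 forces Normal Write Back no matter what the S1 asks for,
whereas the non-FWB 1111 encoding leaves the S1 free to override it.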

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
 drivers/iommu/io-pgtable-arm.c              | 27 +++++++++++++++++----
 include/linux/io-pgtable.h                  |  2 ++
 4 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 531125f231b662..e2b97ad6d74b03 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1612,6 +1612,8 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 		FIELD_PREP(STRTAB_STE_1_EATS,
 			   ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
 
+	if (smmu->features & ARM_SMMU_FEAT_S2FWB)
+		target->data[1] |= cpu_to_le64(STRTAB_STE_1_S2FWB);
 	if (smmu->features & ARM_SMMU_FEAT_ATTR_TYPES_OVR)
 		target->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
 							  STRTAB_STE_1_SHCFG_INCOMING));
@@ -2400,6 +2402,8 @@ static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
 		pgtbl_cfg.oas = smmu->oas;
 		fmt = ARM_64_LPAE_S2;
 		finalise_stage_fn = arm_smmu_domain_finalise_s2;
+		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
+			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
 		break;
 	default:
 		return -EINVAL;
@@ -4189,6 +4193,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
 
 	/* IDR3 */
 	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
+	/*
+	 * If for some reason the HW does not support DMA coherency then using
+	 * S2FWB won't work. This will also disable nesting support.
+	 */
+	if (FIELD_GET(IDR3_FWB, reg) &&
+	    (smmu->features & ARM_SMMU_FEAT_COHERENCY))
+		smmu->features |= ARM_SMMU_FEAT_S2FWB;
 	if (FIELD_GET(IDR3_RIL, reg))
 		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 8851a7abb5f0f3..7e8d2f36faebf3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -55,6 +55,7 @@
 #define IDR1_SIDSIZE			GENMASK(5, 0)
 
 #define ARM_SMMU_IDR3			0xc
+#define IDR3_FWB			(1 << 8)
 #define IDR3_RIL			(1 << 10)
 
 #define ARM_SMMU_IDR5			0x14
@@ -258,6 +259,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
 #define STRTAB_STE_1_S1CSH		GENMASK_ULL(7, 6)
 
 #define STRTAB_STE_1_S1STALLD		(1UL << 27)
+#define STRTAB_STE_1_S2FWB		(1UL << 25)
 
 #define STRTAB_STE_1_EATS		GENMASK_ULL(29, 28)
 #define STRTAB_STE_1_EATS_ABT		0UL
@@ -700,6 +702,7 @@ struct arm_smmu_device {
 #define ARM_SMMU_FEAT_ATTR_TYPES_OVR	(1 << 20)
 #define ARM_SMMU_FEAT_HA		(1 << 21)
 #define ARM_SMMU_FEAT_HD		(1 << 22)
+#define ARM_SMMU_FEAT_S2FWB		(1 << 23)
 	u32				features;
 
 #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index f5d9fd1f45bf49..9b3658aae21005 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -106,6 +106,18 @@
 #define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
 #define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
 #define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
+/*
+ * For !FWB these code to:
+ *  1111 = Normal outer write back cachable / Inner Write Back Cachable
+ *         Permit S1 to override
+ *  0101 = Normal Non-cachable / Inner Non-cachable
+ *  0001 = Device / Device-nGnRE
+ * For S2FWB these code:
+ *  0110 Force Normal Write Back
+ *  0101 Normal* is forced Normal-NC, Device unchanged
+ *  0001 Force Device-nGnRE
+ */
+#define ARM_LPAE_PTE_MEMATTR_FWB_WB	(((arm_lpae_iopte)0x6) << 2)
 #define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
 #define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
 #define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
@@ -458,12 +470,16 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
 	 */
 	if (data->iop.fmt == ARM_64_LPAE_S2 ||
 	    data->iop.fmt == ARM_32_LPAE_S2) {
-		if (prot & IOMMU_MMIO)
+		if (prot & IOMMU_MMIO) {
 			pte |= ARM_LPAE_PTE_MEMATTR_DEV;
-		else if (prot & IOMMU_CACHE)
-			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
-		else
+		} else if (prot & IOMMU_CACHE) {
+			if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_S2FWB)
+				pte |= ARM_LPAE_PTE_MEMATTR_FWB_WB;
+			else
+				pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
+		} else {
 			pte |= ARM_LPAE_PTE_MEMATTR_NC;
+		}
 	} else {
 		if (prot & IOMMU_MMIO)
 			pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
@@ -932,7 +948,8 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
 	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
 			    IO_PGTABLE_QUIRK_ARM_TTBR1 |
 			    IO_PGTABLE_QUIRK_ARM_OUTER_WBWA |
-			    IO_PGTABLE_QUIRK_ARM_HD))
+			    IO_PGTABLE_QUIRK_ARM_HD |
+			    IO_PGTABLE_QUIRK_ARM_S2FWB))
 		return NULL;
 
 	data = arm_lpae_alloc_pgtable(cfg);
diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index f9a81761bfceda..aff9b020b6dcc7 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -87,6 +87,7 @@ struct io_pgtable_cfg {
 	 *	attributes set in the TCR for a non-coherent page-table walker.
 	 *
 	 * IO_PGTABLE_QUIRK_ARM_HD: Enables dirty tracking in stage 1 pagetable.
+	 * IO_PGTABLE_QUIRK_ARM_S2FWB: Use the FWB format for the MemAttrs bits
 	 */
 	#define IO_PGTABLE_QUIRK_ARM_NS			BIT(0)
 	#define IO_PGTABLE_QUIRK_NO_PERMS		BIT(1)
@@ -95,6 +96,7 @@ struct io_pgtable_cfg {
 	#define IO_PGTABLE_QUIRK_ARM_TTBR1		BIT(5)
 	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA		BIT(6)
 	#define IO_PGTABLE_QUIRK_ARM_HD			BIT(7)
+	#define IO_PGTABLE_QUIRK_ARM_S2FWB		BIT(8)
 	unsigned long			quirks;
 	unsigned long			pgsize_bitmap;
 	unsigned int			ias;
-- 
2.46.0




* [PATCH v2 3/8] ACPICA: IORT: Update for revision E.f
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
  2024-08-27 15:51 ` [PATCH v2 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
  2024-08-27 15:51 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
@ 2024-08-27 15:51 ` Jason Gunthorpe
  2024-08-29 10:14   ` Rafael J. Wysocki
  2024-08-27 15:51 ` [PATCH v2 4/8] ACPI/IORT: Support CANWBS memory access flag Jason Gunthorpe
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-27 15:51 UTC
  To: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

From: Nicolin Chen <nicolinc@nvidia.com>

ACPICA commit c4f5c083d24df9ddd71d5782c0988408cf0fc1ab

The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
Access Flag field in the Memory Access Properties table, mainly for a PCI
Root Complex.

The CANWBS bit describes the coherency of memory accesses that are not
marked IOWB cacheable/shareable. Its value further implies the coherency
impact of a pair of mismatched memory attributes (e.g. in a nested
translation case):
  0x0: Use of mismatched memory attributes for accesses made by this
       device may lead to a loss of coherency.
  0x1: Coherency of accesses made by this device to locations in
       Conventional memory are ensured as follows, even if the memory
       attributes for the accesses presented by the device or provided by
       the SMMU are different from Inner and Outer Write-back cacheable,
       Shareable.

Link: https://github.com/acpica/acpica/commit/c4f5c083
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 include/acpi/actbl2.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
index e27958ef82642f..9a7acf403ed3c8 100644
--- a/include/acpi/actbl2.h
+++ b/include/acpi/actbl2.h
@@ -453,7 +453,7 @@ struct acpi_table_ccel {
  * IORT - IO Remapping Table
  *
  * Conforms to "IO Remapping Table System Software on ARM Platforms",
- * Document number: ARM DEN 0049E.e, Sep 2022
+ * Document number: ARM DEN 0049E.f, Apr 2024
  *
  ******************************************************************************/
 
@@ -524,6 +524,7 @@ struct acpi_iort_memory_access {
 
 #define ACPI_IORT_MF_COHERENCY          (1)
 #define ACPI_IORT_MF_ATTRIBUTES         (1<<1)
+#define ACPI_IORT_MF_CANWBS             (1<<2)
 
 /*
  * IORT node specific subtables
-- 
2.46.0




* [PATCH v2 4/8] ACPI/IORT: Support CANWBS memory access flag
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2024-08-27 15:51 ` [PATCH v2 3/8] ACPICA: IORT: Update for revision E.f Jason Gunthorpe
@ 2024-08-27 15:51 ` Jason Gunthorpe
  2024-08-30  7:52   ` Tian, Kevin
  2024-08-27 15:51 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS Jason Gunthorpe
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-27 15:51 UTC
  To: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

From: Nicolin Chen <nicolinc@nvidia.com>

The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
Access Flag field in the Memory Access Properties table, mainly for a PCI
Root Complex.

The CANWBS bit describes the coherency of memory accesses that are not
marked IOWB cacheable/shareable. Its value further implies the coherency
impact of a pair of mismatched memory attributes (e.g. in a nested
translation case):
  0x0: Use of mismatched memory attributes for accesses made by this
       device may lead to a loss of coherency.
  0x1: Coherency of accesses made by this device to locations in
       Conventional memory are ensured as follows, even if the memory
       attributes for the accesses presented by the device or provided by
       the SMMU are different from Inner and Outer Write-back cacheable,
       Shareable.

Note that on HW without CANWBS, the loss of coherency would typically
occur with an SMMU that doesn't implement the S2FWB feature, where
additional cache flush operations would be required to prevent it.

Add a new ACPI_IORT_MF_CANWBS flag and set IOMMU_FWSPEC_PCI_RC_CANWBS upon
the presence of this new flag.

CANWBS and S2FWB are similar features, in that they both guarantee the VM
cannot violate coherency. However, S2FWB can be bypassed by PCI No Snoop
TLPs, while CANWBS cannot. Thus CANWBS meets the requirements to set
IOMMU_CAP_ENFORCE_CACHE_COHERENCY.
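
A minimal sketch of the flag translation, using the bit values from this
patch and the previous one. apply_canwbs() is a hypothetical helper that
condenses what iort_pci_rc_supports_canwbs() and
iort_iommu_configure_id() do in the diff below; it is not a function
added by this series.

```c
#include <stdint.h>

/* IORT Memory Access Flags bit, from patch 3. */
#define ACPI_IORT_MF_CANWBS		(1 << 2)

/* fwspec flags, from include/linux/iommu.h after this patch. */
#define IOMMU_FWSPEC_PCI_RC_ATS		(1 << 0)
#define IOMMU_FWSPEC_PCI_RC_CANWBS	(1 << 1)

/*
 * Hypothetical helper: propagate the firmware-reported CANWBS bit of a
 * PCI root complex into the per-device fwspec flags, leaving any flags
 * that are already set (such as ATS) untouched.
 */
static uint32_t apply_canwbs(uint32_t fwspec_flags, uint8_t memory_flags)
{
	if (memory_flags & ACPI_IORT_MF_CANWBS)
		fwspec_flags |= IOMMU_FWSPEC_PCI_RC_CANWBS;
	return fwspec_flags;
}
```

The SMMU driver then only needs to look at the fwspec flag and never has
to parse the IORT itself.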

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/acpi/arm64/iort.c | 13 +++++++++++++
 include/linux/iommu.h     |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 1b39e9ae7ac178..52f5836fa888db 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1218,6 +1218,17 @@ static bool iort_pci_rc_supports_ats(struct acpi_iort_node *node)
 	return pci_rc->ats_attribute & ACPI_IORT_ATS_SUPPORTED;
 }
 
+static bool iort_pci_rc_supports_canwbs(struct acpi_iort_node *node)
+{
+	struct acpi_iort_memory_access *memory_access;
+	struct acpi_iort_root_complex *pci_rc;
+
+	pci_rc = (struct acpi_iort_root_complex *)node->node_data;
+	memory_access =
+		(struct acpi_iort_memory_access *)&pci_rc->memory_properties;
+	return memory_access->memory_flags & ACPI_IORT_MF_CANWBS;
+}
+
 static int iort_iommu_xlate(struct device *dev, struct acpi_iort_node *node,
 			    u32 streamid)
 {
@@ -1335,6 +1346,8 @@ int iort_iommu_configure_id(struct device *dev, const u32 *id_in)
 		fwspec = dev_iommu_fwspec_get(dev);
 		if (fwspec && iort_pci_rc_supports_ats(node))
 			fwspec->flags |= IOMMU_FWSPEC_PCI_RC_ATS;
+		if (fwspec && iort_pci_rc_supports_canwbs(node))
+			fwspec->flags |= IOMMU_FWSPEC_PCI_RC_CANWBS;
 	} else {
 		node = iort_scan_node(ACPI_IORT_NODE_NAMED_COMPONENT,
 				      iort_match_node_callback, dev);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 15d7657509f662..d1660ec23f263b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -993,6 +993,8 @@ struct iommu_fwspec {
 
 /* ATS is supported */
 #define IOMMU_FWSPEC_PCI_RC_ATS			(1 << 0)
+/* CANWBS is supported */
+#define IOMMU_FWSPEC_PCI_RC_CANWBS		(1 << 1)
 
 /*
  * An iommu attach handle represents a relationship between an iommu domain
-- 
2.46.0




* [PATCH v2 5/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2024-08-27 15:51 ` [PATCH v2 4/8] ACPI/IORT: Support CANWBS memory access flag Jason Gunthorpe
@ 2024-08-27 15:51 ` Jason Gunthorpe
  2024-08-27 20:12   ` Nicolin Chen
  2024-08-30 15:19   ` Mostafa Saleh
  2024-08-27 15:51 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info Jason Gunthorpe
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-27 15:51 UTC
  To: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

HW with CANWBS is always cache coherent and ignores PCI No Snoop requests
as well. This meets the requirement for IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
so let's return it.
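
The domain-level rule implemented below can be modelled without any
kernel infrastructure. This is a simplified sketch, not the driver code:
struct model_domain and the model_*() functions are hypothetical, locking
is omitted, and the fixed-size array stands in for the driver's device
list.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 8

/* Hypothetical stand-in for the per-domain state used by this patch. */
struct model_domain {
	bool dev_canwbs[MODEL_MAX_DEVS];	/* CANWBS flag of each attached device */
	size_t ndevs;
	bool enforce_cache_coherency;
};

/* Enforcement succeeds only if every attached device has CANWBS. */
static bool model_enforce_cache_coherency(struct model_domain *d)
{
	for (size_t i = 0; i < d->ndevs; i++)
		if (!d->dev_canwbs[i])
			return false;

	d->enforce_cache_coherency = true;
	return true;
}

/* Once enforcement is on, devices without CANWBS can no longer attach. */
static int model_attach(struct model_domain *d, bool canwbs)
{
	if (d->enforce_cache_coherency && !canwbs)
		return -EINVAL;
	if (d->ndevs == MODEL_MAX_DEVS)
		return -ENOSPC;	/* limit of this model only */

	d->dev_canwbs[d->ndevs++] = canwbs;
	return 0;
}
```

The two checks together keep the invariant that a domain marked as
enforcing coherency never contains a device whose root complex lacks
CANWBS, regardless of the order of attach and enforce calls.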

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 35 +++++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
 2 files changed, 36 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index e2b97ad6d74b03..c2021e821e5cb6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2253,6 +2253,9 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
 	case IOMMU_CAP_CACHE_COHERENCY:
 		/* Assume that a coherent TCU implies coherent TBUs */
 		return master->smmu->features & ARM_SMMU_FEAT_COHERENCY;
+	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
+		return dev_iommu_fwspec_get(dev)->flags &
+		       IOMMU_FWSPEC_PCI_RC_CANWBS;
 	case IOMMU_CAP_NOEXEC:
 	case IOMMU_CAP_DEFERRED_FLUSH:
 		return true;
@@ -2263,6 +2266,28 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
 	}
 }
 
+static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
+{
+	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
+	struct arm_smmu_master_domain *master_domain;
+	unsigned long flags;
+	bool ret = false;
+
+	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+	list_for_each_entry(master_domain, &smmu_domain->devices,
+			    devices_elm) {
+		if (!(dev_iommu_fwspec_get(master_domain->master->dev)->flags &
+		      IOMMU_FWSPEC_PCI_RC_CANWBS))
+			goto out;
+	}
+
+	smmu_domain->enforce_cache_coherency = true;
+	ret = true;
+out:
+	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
+	return ret;
+}
+
 struct arm_smmu_domain *arm_smmu_domain_alloc(void)
 {
 	struct arm_smmu_domain *smmu_domain;
@@ -2693,6 +2718,15 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
 		 * one of them.
 		 */
 		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
+		if (smmu_domain->enforce_cache_coherency &&
+		    !(dev_iommu_fwspec_get(master->dev)->flags &
+		      IOMMU_FWSPEC_PCI_RC_CANWBS)) {
+			kfree(master_domain);
+			spin_unlock_irqrestore(&smmu_domain->devices_lock,
+					       flags);
+			return -EINVAL;
+		}
+
 		if (state->ats_enabled)
 			atomic_inc(&smmu_domain->nr_ats_masters);
 		list_add(&master_domain->devices_elm, &smmu_domain->devices);
@@ -3450,6 +3484,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.owner			= THIS_MODULE,
 	.default_domain_ops = &(const struct iommu_domain_ops) {
 		.attach_dev		= arm_smmu_attach_dev,
+		.enforce_cache_coherency = arm_smmu_enforce_cache_coherency,
 		.set_dev_pasid		= arm_smmu_s1_set_dev_pasid,
 		.map_pages		= arm_smmu_map_pages,
 		.unmap_pages		= arm_smmu_unmap_pages,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 7e8d2f36faebf3..45882f65bfcad0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -787,6 +787,7 @@ struct arm_smmu_domain {
 	/* List of struct arm_smmu_master_domain */
 	struct list_head		devices;
 	spinlock_t			devices_lock;
+	bool				enforce_cache_coherency : 1;
 
 	struct mmu_notifier		mmu_notifier;
 };
-- 
2.46.0
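
The rule this patch enforces can be sketched outside the kernel as follows. This is a minimal illustration under stated assumptions, not the driver code: the sketch_* names are hypothetical, SKETCH_CANWBS is a stand-in for IOMMU_FWSPEC_PCI_RC_CANWBS with an illustrative bit position, and the devices_lock spinlock is omitted. A domain can only be switched to enforced coherency if every attached device sits behind a CANWBS root complex, and once switched it refuses devices that do not.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for IOMMU_FWSPEC_PCI_RC_CANWBS; the bit position is illustrative. */
#define SKETCH_CANWBS (1u << 1)

struct sketch_domain {
	const unsigned int *dev_flags;	/* one fwspec flags word per attached device */
	size_t nr_devs;
	bool enforce_cache_coherency;
};

/* All-or-nothing: one device without CANWBS leaves the domain unchanged. */
static bool sketch_enforce(struct sketch_domain *d)
{
	for (size_t i = 0; i < d->nr_devs; i++)
		if (!(d->dev_flags[i] & SKETCH_CANWBS))
			return false;
	d->enforce_cache_coherency = true;
	return true;
}

/* Attach-time check: 0 if the device may join the domain, -1 otherwise. */
static int sketch_attach_check(const struct sketch_domain *d, unsigned int flags)
{
	if (d->enforce_cache_coherency && !(flags & SKETCH_CANWBS))
		return -1;
	return 0;
}
```

In the real driver both checks run under smmu_domain->devices_lock so that an attach cannot race with the enforce operation.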




* [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2024-08-27 15:51 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS Jason Gunthorpe
@ 2024-08-27 15:51 ` Jason Gunthorpe
  2024-08-30  7:55   ` Tian, Kevin
  2024-08-30 15:23   ` Mostafa Saleh
  2024-08-27 15:51 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT Jason Gunthorpe
                   ` (3 subsequent siblings)
  9 siblings, 2 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-27 15:51 UTC (permalink / raw)
  To: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

From: Nicolin Chen <nicolinc@nvidia.com>

For virtualization cases the IDR/IIDR/AIDR values of the actual SMMU
instance need to be available to the VMM so it can construct an
appropriate vSMMUv3 that reflects the correct HW capabilities.

For userspace page tables these values are required to constrain the valid
values within the CD table and the IOPTEs.

The kernel does not sanitize these values. A VMM must forward into a VM
only the bits that it knows it can implement. Some bits, such as those for
ATS and BTM, also require the VMM to detect whether appropriate kernel
support is available.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 ++
 include/uapi/linux/iommufd.h                | 35 +++++++++++++++++++++
 3 files changed, 61 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index c2021e821e5cb6..ec2fcdd4523a26 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2288,6 +2288,29 @@ static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
 	return ret;
 }
 
+static void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct iommu_hw_info_arm_smmuv3 *info;
+	u32 __iomem *base_idr;
+	unsigned int i;
+
+	info = kzalloc(sizeof(*info), GFP_KERNEL);
+	if (!info)
+		return ERR_PTR(-ENOMEM);
+
+	base_idr = master->smmu->base + ARM_SMMU_IDR0;
+	for (i = 0; i <= 5; i++)
+		info->idr[i] = readl_relaxed(base_idr + i);
+	info->iidr = readl_relaxed(master->smmu->base + ARM_SMMU_IIDR);
+	info->aidr = readl_relaxed(master->smmu->base + ARM_SMMU_AIDR);
+
+	*length = sizeof(*info);
+	*type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
+
+	return info;
+}
+
 struct arm_smmu_domain *arm_smmu_domain_alloc(void)
 {
 	struct arm_smmu_domain *smmu_domain;
@@ -3467,6 +3490,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.identity_domain	= &arm_smmu_identity_domain,
 	.blocked_domain		= &arm_smmu_blocked_domain,
 	.capable		= arm_smmu_capable,
+	.hw_info		= arm_smmu_hw_info,
 	.domain_alloc_paging    = arm_smmu_domain_alloc_paging,
 	.domain_alloc_sva       = arm_smmu_sva_domain_alloc,
 	.domain_alloc_user	= arm_smmu_domain_alloc_user,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 45882f65bfcad0..4b05c81b181a82 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -80,6 +80,8 @@
 #define IIDR_REVISION			GENMASK(15, 12)
 #define IIDR_IMPLEMENTER		GENMASK(11, 0)
 
+#define ARM_SMMU_AIDR			0x1C
+
 #define ARM_SMMU_CR0			0x20
 #define CR0_ATSCHK			(1 << 4)
 #define CR0_CMDQEN			(1 << 3)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 4dde745cfb7e29..83b6e1cd338d8f 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -484,15 +484,50 @@ struct iommu_hw_info_vtd {
 	__aligned_u64 ecap_reg;
 };
 
+/**
+ * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
+ *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
+ *
+ * @flags: Must be set to 0
+ * @__reserved: Must be 0
+ * @idr: Implemented features for ARM SMMU Non-secure programming interface
+ * @iidr: Information about the implementation and implementer of ARM SMMU,
+ *        and architecture version supported
+ * @aidr: ARM SMMU architecture version
+ *
+ * For the details of @idr, @iidr and @aidr, please refer to the chapters
+ * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
+ *
+ * User space should read the underlying ARM SMMUv3 hardware information for
+ * the list of supported features.
+ *
+ * Note that these values reflect the raw HW capability, without any insight if
+ * any required kernel driver support is present. Bits may be set indicating the
+ * HW has functionality that is lacking kernel software support, such as BTM. If
+ * a VMM is using this information to construct emulated copies of these
+ * registers it should only forward bits that it knows it can support.
+ *
+ * In future, presence of required kernel support will be indicated in flags.
+ */
+struct iommu_hw_info_arm_smmuv3 {
+	__u32 flags;
+	__u32 __reserved;
+	__u32 idr[6];
+	__u32 iidr;
+	__u32 aidr;
+};
+
 /**
  * enum iommu_hw_info_type - IOMMU Hardware Info Types
  * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
  *                           info
  * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
+ * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
  */
 enum iommu_hw_info_type {
 	IOMMU_HW_INFO_TYPE_NONE = 0,
 	IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
+	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
 };
 
 /**
-- 
2.46.0
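
Since the kernel reports the raw register values, the filtering described in the commit message falls to the VMM. The following is a minimal sketch of that step, assuming a hypothetical per-register mask chosen by the VMM. The struct is a local mirror of the uAPI layout added above, and the sketch_* names are illustrative. Plain masking only suits single-bit feature flags; multi-bit fields in the real IDR registers would need field-aware handling that is not shown here.

```c
#include <stdint.h>

/* Local mirror of struct iommu_hw_info_arm_smmuv3 from this patch. */
struct sketch_hw_info_arm_smmuv3 {
	uint32_t flags;
	uint32_t reserved;
	uint32_t idr[6];
	uint32_t iidr;
	uint32_t aidr;
};

/*
 * Keep only the bits the VMM has declared support for. The masks are a
 * hypothetical VMM policy, not something the kernel supplies.
 */
static void sketch_filter_idr(const struct sketch_hw_info_arm_smmuv3 *hw,
			      const uint32_t vmm_mask[6], uint32_t out[6])
{
	for (int i = 0; i < 6; i++)
		out[i] = hw->idr[i] & vmm_mask[i];
}
```

A VMM would feed the filtered values into its emulated vSMMUv3 ID registers rather than exposing hw->idr[] directly.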




* [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2024-08-27 15:51 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info Jason Gunthorpe
@ 2024-08-27 15:51 ` Jason Gunthorpe
  2024-08-27 20:16   ` Nicolin Chen
                     ` (2 more replies)
  2024-08-27 15:51 ` [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED Jason Gunthorpe
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-27 15:51 UTC (permalink / raw)
  To: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

For SMMUv3 the parent must be an S2 domain, which can be composed
into an IOMMU_DOMAIN_NESTED.

In future the S2 parent will also need a VMID linked to the VIOMMU and
even to KVM.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index ec2fcdd4523a26..8db3db6328f8b7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 			   const struct iommu_user_data *user_data)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
-	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
+	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
+				 IOMMU_HWPT_ALLOC_NEST_PARENT;
 	struct arm_smmu_domain *smmu_domain;
 	int ret;
 
@@ -3116,6 +3117,14 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 	if (!smmu_domain)
 		return ERR_PTR(-ENOMEM);
 
+	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT) {
+		if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING)) {
+			ret = -EOPNOTSUPP;
+			goto err_free;
+		}
+		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
+	}
+
 	smmu_domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
 	smmu_domain->domain.ops = arm_smmu_ops.default_domain_ops;
 	ret = arm_smmu_domain_finalise(smmu_domain, master->smmu, flags);
-- 
2.46.0
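
The flag handling in this patch reduces to two checks, sketched below as standalone C. This is an illustration only: the SKETCH_* flag values are local stand-ins for the iommufd allocation flags, hw_nesting stands in for the ARM_SMMU_FEAT_NESTING feature bit, and sketch_pick_stage() is a hypothetical name.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the iommufd allocation flags; values are illustrative. */
#define SKETCH_ALLOC_NEST_PARENT	(1u << 0)
#define SKETCH_ALLOC_DIRTY_TRACKING	(1u << 1)

enum sketch_stage { SKETCH_STAGE_DEFAULT, SKETCH_STAGE_S2 };

/* Returns 0 and sets *stage on success, or a negative errno. */
static int sketch_pick_stage(uint32_t flags, bool hw_nesting,
			     enum sketch_stage *stage)
{
	const uint32_t paging_flags = SKETCH_ALLOC_DIRTY_TRACKING |
				      SKETCH_ALLOC_NEST_PARENT;

	/* Unknown flags are rejected before anything else. */
	if (flags & ~paging_flags)
		return -EOPNOTSUPP;

	if (flags & SKETCH_ALLOC_NEST_PARENT) {
		/* A nesting parent must be an S2, which needs HW nesting. */
		if (!hw_nesting)
			return -EOPNOTSUPP;
		*stage = SKETCH_STAGE_S2;
	} else {
		*stage = SKETCH_STAGE_DEFAULT;
	}
	return 0;
}
```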




* [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2024-08-27 15:51 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT Jason Gunthorpe
@ 2024-08-27 15:51 ` Jason Gunthorpe
  2024-08-27 21:23   ` Nicolin Chen
                     ` (2 more replies)
  2024-08-27 21:31 ` [PATCH v2 0/8] Initial support for SMMUv3 nested translation Nicolin Chen
  2024-10-16  2:23 ` Zhangfei Gao
  9 siblings, 3 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-27 15:51 UTC (permalink / raw)
  To: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

For SMMUv3 an IOMMU_DOMAIN_NESTED is composed of an S2 iommu_domain acting
as the parent and a user provided STE fragment that defines the CD table
and related data with addresses translated by the S2 iommu_domain.

The kernel only permits userspace to control certain allowed bits of the
STE that are safe for user/guest control.

IOTLB maintenance is a bit subtle here: the S1 implicitly includes the S2
translation, but there is no way of knowing which S1 entries refer to a
range of S2.

For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
flush all ASIDs from the VMID after flushing the S2 on any change to the
S2.

Similarly we have to flush the entire ATC if the S2 is changed.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 217 +++++++++++++++++++-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  20 ++
 include/uapi/linux/iommufd.h                |  20 ++
 3 files changed, 250 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8db3db6328f8b7..a21dce1f25cb95 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -295,6 +295,7 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
 	case CMDQ_OP_TLBI_NH_ASID:
 		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
 		fallthrough;
+	case CMDQ_OP_TLBI_NH_ALL:
 	case CMDQ_OP_TLBI_S12_VMALL:
 		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
 		break;
@@ -1640,6 +1641,59 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 }
 EXPORT_SYMBOL_IF_KUNIT(arm_smmu_make_s2_domain_ste);
 
+static void arm_smmu_make_nested_cd_table_ste(
+	struct arm_smmu_ste *target, struct arm_smmu_master *master,
+	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
+{
+	arm_smmu_make_s2_domain_ste(target, master, nested_domain->s2_parent,
+				    ats_enabled);
+
+	target->data[0] = cpu_to_le64(STRTAB_STE_0_V |
+				      FIELD_PREP(STRTAB_STE_0_CFG,
+						 STRTAB_STE_0_CFG_NESTED)) |
+			  (nested_domain->ste[0] & ~STRTAB_STE_0_CFG);
+	target->data[1] |= nested_domain->ste[1];
+}
+
+/*
+ * Create a physical STE from the virtual STE that userspace provided when it
+ * created the nested domain. Using the vSTE userspace can request:
+ * - Non-valid STE
+ * - Abort STE
+ * - Bypass STE (install the S2, no CD table)
+ * - CD table STE (install the S2 and the userspace CD table)
+ */
+static void arm_smmu_make_nested_domain_ste(
+	struct arm_smmu_ste *target, struct arm_smmu_master *master,
+	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
+{
+	/*
+	 * Userspace can request a non-valid STE through the nesting interface.
+	 * We relay that into an abort physical STE with the intention that
+	 * C_BAD_STE for this SID can be generated to userspace.
+	 */
+	if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V))) {
+		arm_smmu_make_abort_ste(target);
+		return;
+	}
+
+	switch (FIELD_GET(STRTAB_STE_0_CFG,
+			  le64_to_cpu(nested_domain->ste[0]))) {
+	case STRTAB_STE_0_CFG_S1_TRANS:
+		arm_smmu_make_nested_cd_table_ste(target, master, nested_domain,
+						  ats_enabled);
+		break;
+	case STRTAB_STE_0_CFG_BYPASS:
+		arm_smmu_make_s2_domain_ste(
+			target, master, nested_domain->s2_parent, ats_enabled);
+		break;
+	case STRTAB_STE_0_CFG_ABORT:
+	default:
+		arm_smmu_make_abort_ste(target);
+		break;
+	}
+}
+
 /*
  * This can safely directly manipulate the STE memory without a sync sequence
  * because the STE table has not been installed in the SMMU yet.
@@ -2065,7 +2119,16 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
 		if (!master->ats_enabled)
 			continue;
 
-		arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size, &cmd);
+		if (master_domain->nest_parent) {
+			/*
+			 * If a S2 used as a nesting parent is changed we have
+			 * no option but to completely flush the ATC.
+			 */
+			arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
+		} else {
+			arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size,
+						&cmd);
+		}
 
 		for (i = 0; i < master->num_streams; i++) {
 			cmd.atc.sid = master->streams[i].id;
@@ -2192,6 +2255,16 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
 	}
 	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
 
+	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S2 &&
+	    smmu_domain->nest_parent) {
+		/*
+		 * When the S2 domain changes all the nested S1 ASIDs have to be
+		 * flushed too.
+		 */
+		cmd.opcode = CMDQ_OP_TLBI_NH_ALL;
+		arm_smmu_cmdq_issue_cmd_with_sync(smmu_domain->smmu, &cmd);
+	}
+
 	/*
 	 * Unfortunately, this can't be leaf-only since we may have
 	 * zapped an entire table.
@@ -2604,8 +2677,8 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 
 static struct arm_smmu_master_domain *
 arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
-			    struct arm_smmu_master *master,
-			    ioasid_t ssid)
+			    struct arm_smmu_master *master, ioasid_t ssid,
+			    bool nest_parent)
 {
 	struct arm_smmu_master_domain *master_domain;
 
@@ -2614,7 +2687,8 @@ arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
 	list_for_each_entry(master_domain, &smmu_domain->devices,
 			    devices_elm) {
 		if (master_domain->master == master &&
-		    master_domain->ssid == ssid)
+		    master_domain->ssid == ssid &&
+		    master_domain->nest_parent == nest_parent)
 			return master_domain;
 	}
 	return NULL;
@@ -2634,6 +2708,9 @@ to_smmu_domain_devices(struct iommu_domain *domain)
 	if ((domain->type & __IOMMU_DOMAIN_PAGING) ||
 	    domain->type == IOMMU_DOMAIN_SVA)
 		return to_smmu_domain(domain);
+	if (domain->type == IOMMU_DOMAIN_NESTED)
+		return container_of(domain, struct arm_smmu_nested_domain,
+				    domain)->s2_parent;
 	return NULL;
 }
 
@@ -2649,7 +2726,8 @@ static void arm_smmu_remove_master_domain(struct arm_smmu_master *master,
 		return;
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
-	master_domain = arm_smmu_find_master_domain(smmu_domain, master, ssid);
+	master_domain = arm_smmu_find_master_domain(
+		smmu_domain, master, ssid, domain->type == IOMMU_DOMAIN_NESTED);
 	if (master_domain) {
 		list_del(&master_domain->devices_elm);
 		kfree(master_domain);
@@ -2664,6 +2742,7 @@ struct arm_smmu_attach_state {
 	struct iommu_domain *old_domain;
 	struct arm_smmu_master *master;
 	bool cd_needs_ats;
+	bool disable_ats;
 	ioasid_t ssid;
 	/* Resulting state */
 	bool ats_enabled;
@@ -2716,7 +2795,8 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
 		 * enabled if we have arm_smmu_domain, those always have page
 		 * tables.
 		 */
-		state->ats_enabled = arm_smmu_ats_supported(master);
+		state->ats_enabled = !state->disable_ats &&
+				     arm_smmu_ats_supported(master);
 	}
 
 	if (smmu_domain) {
@@ -2725,6 +2805,8 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
 			return -ENOMEM;
 		master_domain->master = master;
 		master_domain->ssid = state->ssid;
+		master_domain->nest_parent = new_domain->type ==
+					       IOMMU_DOMAIN_NESTED;
 
 		/*
 		 * During prepare we want the current smmu_domain and new
@@ -3097,6 +3179,122 @@ static struct iommu_domain arm_smmu_blocked_domain = {
 	.ops = &arm_smmu_blocked_ops,
 };
 
+static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
+				      struct device *dev)
+{
+	struct arm_smmu_nested_domain *nested_domain =
+		container_of(domain, struct arm_smmu_nested_domain, domain);
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct arm_smmu_attach_state state = {
+		.master = master,
+		.old_domain = iommu_get_domain_for_dev(dev),
+		.ssid = IOMMU_NO_PASID,
+		/* Currently invalidation of ATC is not supported */
+		.disable_ats = true,
+	};
+	struct arm_smmu_ste ste;
+	int ret;
+
+	if (arm_smmu_ssids_in_use(&master->cd_table) ||
+	    nested_domain->s2_parent->smmu != master->smmu)
+		return -EINVAL;
+
+	mutex_lock(&arm_smmu_asid_lock);
+	ret = arm_smmu_attach_prepare(&state, domain);
+	if (ret) {
+		mutex_unlock(&arm_smmu_asid_lock);
+		return ret;
+	}
+
+	arm_smmu_make_nested_domain_ste(&ste, master, nested_domain,
+					state.ats_enabled);
+	arm_smmu_install_ste_for_dev(master, &ste);
+	arm_smmu_attach_commit(&state);
+	mutex_unlock(&arm_smmu_asid_lock);
+	return 0;
+}
+
+static void arm_smmu_domain_nested_free(struct iommu_domain *domain)
+{
+	kfree(container_of(domain, struct arm_smmu_nested_domain, domain));
+}
+
+static const struct iommu_domain_ops arm_smmu_nested_ops = {
+	.attach_dev = arm_smmu_attach_dev_nested,
+	.free = arm_smmu_domain_nested_free,
+};
+
+static struct iommu_domain *
+arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
+			      struct iommu_domain *parent,
+			      const struct iommu_user_data *user_data)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	struct arm_smmu_nested_domain *nested_domain;
+	struct arm_smmu_domain *smmu_parent;
+	struct iommu_hwpt_arm_smmuv3 arg;
+	unsigned int eats;
+	unsigned int cfg;
+	int ret;
+
+	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	/*
+	 * Must support some way to prevent the VM from bypassing the cache
+	 * because VFIO currently does not do any cache maintenance.
+	 */
+	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
+	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	ret = iommu_copy_struct_from_user(&arg, user_data,
+					  IOMMU_HWPT_DATA_ARM_SMMUV3, ste);
+	if (ret)
+		return ERR_PTR(ret);
+
+	if (flags || !(master->smmu->features & ARM_SMMU_FEAT_TRANS_S1))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (!(parent->type & __IOMMU_DOMAIN_PAGING))
+		return ERR_PTR(-EINVAL);
+
+	smmu_parent = to_smmu_domain(parent);
+	if (smmu_parent->stage != ARM_SMMU_DOMAIN_S2 ||
+	    smmu_parent->smmu != master->smmu)
+		return ERR_PTR(-EINVAL);
+
+	/* EIO is reserved for invalid STE data. */
+	if ((arg.ste[0] & ~STRTAB_STE_0_NESTING_ALLOWED) ||
+	    (arg.ste[1] & ~STRTAB_STE_1_NESTING_ALLOWED))
+		return ERR_PTR(-EIO);
+
+	cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(arg.ste[0]));
+	if (cfg != STRTAB_STE_0_CFG_ABORT && cfg != STRTAB_STE_0_CFG_BYPASS &&
+	    cfg != STRTAB_STE_0_CFG_S1_TRANS)
+		return ERR_PTR(-EIO);
+
+	eats = FIELD_GET(STRTAB_STE_1_EATS, le64_to_cpu(arg.ste[1]));
+	if (eats != STRTAB_STE_1_EATS_ABT)
+		return ERR_PTR(-EIO);
+
+	if (cfg != STRTAB_STE_0_CFG_S1_TRANS)
+		eats = STRTAB_STE_1_EATS_ABT;
+
+	nested_domain = kzalloc(sizeof(*nested_domain), GFP_KERNEL_ACCOUNT);
+	if (!nested_domain)
+		return ERR_PTR(-ENOMEM);
+
+	nested_domain->domain.type = IOMMU_DOMAIN_NESTED;
+	nested_domain->domain.ops = &arm_smmu_nested_ops;
+	nested_domain->s2_parent = smmu_parent;
+	nested_domain->ste[0] = arg.ste[0];
+	nested_domain->ste[1] = arg.ste[1] & ~cpu_to_le64(STRTAB_STE_1_EATS);
+
+	return &nested_domain->domain;
+}
+
 static struct iommu_domain *
 arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 			   struct iommu_domain *parent,
@@ -3108,9 +3306,13 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 	struct arm_smmu_domain *smmu_domain;
 	int ret;
 
+	if (parent)
+		return arm_smmu_domain_alloc_nesting(dev, flags, parent,
+						     user_data);
+
 	if (flags & ~PAGING_FLAGS)
 		return ERR_PTR(-EOPNOTSUPP);
-	if (parent || user_data)
+	if (user_data)
 		return ERR_PTR(-EOPNOTSUPP);
 
 	smmu_domain = arm_smmu_domain_alloc();
@@ -3123,6 +3325,7 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
 			goto err_free;
 		}
 		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
+		smmu_domain->nest_parent = true;
 	}
 
 	smmu_domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 4b05c81b181a82..b563cfedf22e91 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -240,6 +240,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
 #define STRTAB_STE_0_CFG_BYPASS		4
 #define STRTAB_STE_0_CFG_S1_TRANS	5
 #define STRTAB_STE_0_CFG_S2_TRANS	6
+#define STRTAB_STE_0_CFG_NESTED		7
 
 #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)
 #define STRTAB_STE_0_S1FMT_LINEAR	0
@@ -291,6 +292,15 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
 
 #define STRTAB_STE_3_S2TTB_MASK		GENMASK_ULL(51, 4)
 
+/* These bits can be controlled by userspace for STRTAB_STE_0_CFG_NESTED */
+#define STRTAB_STE_0_NESTING_ALLOWED                                         \
+	cpu_to_le64(STRTAB_STE_0_V | STRTAB_STE_0_CFG | STRTAB_STE_0_S1FMT | \
+		    STRTAB_STE_0_S1CTXPTR_MASK | STRTAB_STE_0_S1CDMAX)
+#define STRTAB_STE_1_NESTING_ALLOWED                            \
+	cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |   \
+		    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |   \
+		    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_EATS)
+
 /*
  * Context descriptors.
  *
@@ -508,6 +518,7 @@ struct arm_smmu_cmdq_ent {
 			};
 		} cfgi;
 
+		#define CMDQ_OP_TLBI_NH_ALL     0x10
 		#define CMDQ_OP_TLBI_NH_ASID	0x11
 		#define CMDQ_OP_TLBI_NH_VA	0x12
 		#define CMDQ_OP_TLBI_EL2_ALL	0x20
@@ -790,10 +801,18 @@ struct arm_smmu_domain {
 	struct list_head		devices;
 	spinlock_t			devices_lock;
 	bool				enforce_cache_coherency : 1;
+	bool				nest_parent : 1;
 
 	struct mmu_notifier		mmu_notifier;
 };
 
+struct arm_smmu_nested_domain {
+	struct iommu_domain domain;
+	struct arm_smmu_domain *s2_parent;
+
+	__le64 ste[2];
+};
+
 /* The following are exposed for testing purposes. */
 struct arm_smmu_entry_writer_ops;
 struct arm_smmu_entry_writer {
@@ -830,6 +849,7 @@ struct arm_smmu_master_domain {
 	struct list_head devices_elm;
 	struct arm_smmu_master *master;
 	ioasid_t ssid;
+	u8 nest_parent;
 };
 
 static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 83b6e1cd338d8f..76e9ad6c9403af 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -394,14 +394,34 @@ struct iommu_hwpt_vtd_s1 {
 	__u32 __reserved;
 };
 
+/**
+ * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 Context Descriptor Table info
+ *                                (IOMMU_HWPT_DATA_ARM_SMMUV3)
+ *
+ * @ste: The first two double words of the user space Stream Table Entry for
+ *       a user stage-1 Context Descriptor Table. Must be little-endian.
+ *       Allowed fields: (Refer to "5.2 Stream Table Entry" in SMMUv3 HW Spec)
+ *       - word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
+ *       - word-1: S1DSS, S1CIR, S1COR, S1CSH, S1STALLD
+ *
+ * -EIO will be returned if @ste is not legal or contains any non-allowed field.
+ * Cfg can be used to select a S1, Bypass or Abort configuration. A Bypass
+ * nested domain will translate the same as the nesting parent.
+ */
+struct iommu_hwpt_arm_smmuv3 {
+	__aligned_le64 ste[2];
+};
+
 /**
  * enum iommu_hwpt_data_type - IOMMU HWPT Data Type
  * @IOMMU_HWPT_DATA_NONE: no data
  * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
+ * @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
  */
 enum iommu_hwpt_data_type {
 	IOMMU_HWPT_DATA_NONE = 0,
 	IOMMU_HWPT_DATA_VTD_S1 = 1,
+	IOMMU_HWPT_DATA_ARM_SMMUV3 = 2,
 };
 
 /**
-- 
2.46.0
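
The vSTE validation in arm_smmu_domain_alloc_nesting() can be sketched for word 0 alone as follows. This is a simplified illustration, not the driver code: the field positions are assumed to match arm-smmu-v3.h and the SMMUv3 spec, SKETCH_CFG_ABORT is assumed to be 0, the value is treated as already being in CPU byte order (the patch keeps it as __le64), and all sketch_* and SKETCH_* names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_GENMASK64(h, l) \
	((~0ULL >> (63 - (h))) & (~0ULL << (l)))

/* Word-0 field layout; assumed to match arm-smmu-v3.h. */
#define SKETCH_STE_0_V		(1ULL << 0)
#define SKETCH_STE_0_CFG	SKETCH_GENMASK64(3, 1)
#define SKETCH_STE_0_S1FMT	SKETCH_GENMASK64(5, 4)
#define SKETCH_STE_0_S1CTXPTR	SKETCH_GENMASK64(51, 6)
#define SKETCH_STE_0_S1CDMAX	SKETCH_GENMASK64(63, 59)

#define SKETCH_CFG_ABORT	0	/* assumed value */
#define SKETCH_CFG_BYPASS	4
#define SKETCH_CFG_S1_TRANS	5

#define SKETCH_STE_0_ALLOWED \
	(SKETCH_STE_0_V | SKETCH_STE_0_CFG | SKETCH_STE_0_S1FMT | \
	 SKETCH_STE_0_S1CTXPTR | SKETCH_STE_0_S1CDMAX)

/* Returns 0 if word 0 of the vSTE is acceptable, -EIO otherwise. */
static int sketch_check_vste0(uint64_t ste0)
{
	unsigned int cfg;

	/* Any bit outside the user-controllable fields is rejected. */
	if (ste0 & ~SKETCH_STE_0_ALLOWED)
		return -EIO;

	/* Only Abort, Bypass and S1 translation may be requested. */
	cfg = (unsigned int)((ste0 & SKETCH_STE_0_CFG) >> 1);
	if (cfg != SKETCH_CFG_ABORT && cfg != SKETCH_CFG_BYPASS &&
	    cfg != SKETCH_CFG_S1_TRANS)
		return -EIO;
	return 0;
}
```

The patch applies the same kind of mask to word 1 and additionally requires the EATS field to equal STRTAB_STE_1_EATS_ABT, since ATC invalidation is not yet supported for nested domains.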




* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-27 15:51 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
@ 2024-08-27 19:48   ` Nicolin Chen
  2024-08-28 18:30     ` Jason Gunthorpe
  2024-08-28 19:50   ` Nicolin Chen
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 95+ messages in thread
From: Nicolin Chen @ 2024-08-27 19:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

Hi Jason,

On Tue, Aug 27, 2024 at 12:51:32PM -0300, Jason Gunthorpe wrote:
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index f5d9fd1f45bf49..9b3658aae21005 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -106,6 +106,18 @@
>  #define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
>  #define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
>  #define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
> +/*
> + * For !FWB these code to:
> + *  1111 = Normal outer write back cachable / Inner Write Back Cachable
> + *         Permit S1 to override
> + *  0101 = Normal Non-cachable / Inner Non-cachable
> + *  0001 = Device / Device-nGnRE
> + * For S2FWB these code:
> + *  0110 Force Normal Write Back
> + *  0101 Normal* is forced Normal-NC, Device unchanged
> + *  0001 Force Device-nGnRE
> + */
> +#define ARM_LPAE_PTE_MEMATTR_FWB_WB	(((arm_lpae_iopte)0x6) << 2)

The rest looks good. Would you mind pointing to where this 0x6 value is
defined explicitly?

I am looking at DDI0487K, directed from 13.1.6 in SMMU RM and its
Reference:
[2] Arm Architecture Reference Manual for A-profile architecture.
    (ARM DDI 0487) Arm Ltd.

It has the following in D8.6.6:
 "For stage 2 translations, if FEAT_MTE_PERM is not implemented, then
  FEAT_S2FWB has all of the following effects on the MemAttr[3:2] bits:
   - MemAttr[3] is RES0.
   - The value of MemAttr[2] determines the interpretation of the
     MemAttr[1:0] bits.
  For stage 2 translations, if FEAT_MTE_PERM is implemented, then
  MemAttr[3] is not RES0 and all bits of MemAttr[3:0] determine the
  memory region type and Cacheability attributes.
  For stage 2 translations, if FEAT_MTE_PERM is implemented, then all
  of the following values of MemAttr[3:2] apply:
   - 0b10 is Reserved.
   - All other values determine the interpretation of the MemAttr[1:0]
     bits.
  For stage 2 translations, if MemAttr[2] is 0, or if FEAT_MTE_PERM is
  implemented and MemAttr[3:2] is 0b00, then the MemAttr[1:0] bits
  define Device memory attributes as shown in the following table:"

So, MemAttr[3:2] seems to be 00b or 10b depending on FEAT_MTE_PERM,
either of which would never result in 0x6?

Thanks
Nicolin
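
As plain arithmetic on the quoted macro, and not as an answer on the author's behalf, the value can be split into the sub-fields the manual describes. The sketch below assumes that 0x6 is the whole four-bit MemAttr[3:0] field and that the "<< 2" places it at descriptor bits [5:2]. The SKETCH_* and sketch_* names are hypothetical.

```c
#include <stdint.h>

/* MemAttr[3:0] is assumed to occupy descriptor bits [5:2], hence "<< 2". */
#define SKETCH_MEMATTR_SHIFT	2
#define SKETCH_MEMATTR_FWB_WB	((uint64_t)0x6 << SKETCH_MEMATTR_SHIFT)

/* Extract the full four-bit MemAttr field from a descriptor. */
static unsigned int sketch_memattr(uint64_t pte)
{
	return (unsigned int)((pte >> SKETCH_MEMATTR_SHIFT) & 0xf);
}

/* Upper two bits, MemAttr[3:2]. */
static unsigned int sketch_memattr_3_2(uint64_t pte)
{
	return sketch_memattr(pte) >> 2;
}

/* Lower two bits, MemAttr[1:0]. */
static unsigned int sketch_memattr_1_0(uint64_t pte)
{
	return sketch_memattr(pte) & 0x3;
}
```

Under those assumptions 0x6 is 0b0110, so MemAttr[3] is 0, MemAttr[2] is 1 and MemAttr[1:0] is 0b10, which is the "0110 Force Normal Write Back" row in the quoted comment.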



* Re: [PATCH v2 5/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  2024-08-27 15:51 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS Jason Gunthorpe
@ 2024-08-27 20:12   ` Nicolin Chen
  2024-08-28 19:12     ` Jason Gunthorpe
  2024-08-30 15:19   ` Mostafa Saleh
  1 sibling, 1 reply; 95+ messages in thread
From: Nicolin Chen @ 2024-08-27 20:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Aug 27, 2024 at 12:51:35PM -0300, Jason Gunthorpe wrote:
> HW with CANWBS is always cache coherent and ignores PCI No Snoop requests
> as well. This meets the requirement for IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
> so let's return it.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

With two very ignorable nits:

> @@ -2693,6 +2718,15 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
>  		 * one of them.
>  		 */
>  		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> +		if (smmu_domain->enforce_cache_coherency &&
> +		    !(dev_iommu_fwspec_get(master->dev)->flags &
> +		      IOMMU_FWSPEC_PCI_RC_CANWBS)) {

How about a small dev_enforce_cache_coherency() helper?

> +			kfree(master_domain);
> +			spin_unlock_irqrestore(&smmu_domain->devices_lock,
> +					       flags);
> +			return -EINVAL;

kfree() doesn't need to be locked.

Thanks
Nicolin
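
Both nits can be sketched together in standalone C. This is only an illustration of the shape being suggested: the helper name comes from the review, but its body, the struct sketch_dev and struct sketch_lock types, the SKETCH_CANWBS value and sketch_attach() are all hypothetical stand-ins, and the "lock" is a plain flag rather than a spinlock.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for IOMMU_FWSPEC_PCI_RC_CANWBS; the bit position is illustrative. */
#define SKETCH_CANWBS (1u << 1)

struct sketch_dev { unsigned int fwspec_flags; };
struct sketch_lock { int held; };

/* Hypothetical body for the helper name suggested in the review. */
static bool dev_enforce_cache_coherency(const struct sketch_dev *dev)
{
	return dev->fwspec_flags & SKETCH_CANWBS;
}

static int sketch_attach(struct sketch_lock *lock, bool domain_enforces,
			 const struct sketch_dev *dev)
{
	void *master_domain = malloc(16);
	int ret = 0;

	if (!master_domain)
		return -ENOMEM;

	lock->held = 1;
	if (domain_enforces && !dev_enforce_cache_coherency(dev))
		ret = -EINVAL;
	lock->held = 0;

	/*
	 * The free happens after the lock is dropped. A real attach would
	 * keep the entry on the device list on success; this sketch frees
	 * it on both paths only to avoid leaking.
	 */
	free(master_domain);
	return ret;
}
```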



* Re: [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-27 15:51 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT Jason Gunthorpe
@ 2024-08-27 20:16   ` Nicolin Chen
  2024-08-30  7:58   ` Tian, Kevin
  2024-08-30 15:27   ` Mostafa Saleh
  2 siblings, 0 replies; 95+ messages in thread
From: Nicolin Chen @ 2024-08-27 20:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Aug 27, 2024 at 12:51:37PM -0300, Jason Gunthorpe wrote:
> For SMMUv3 the parent must be a S2 domain, which can be composed
> into a IOMMU_DOMAIN_NESTED.
> 
> In future the S2 parent will also need a VMID linked to the VIOMMU and
> even to KVM.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-27 15:51 ` [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED Jason Gunthorpe
@ 2024-08-27 21:23   ` Nicolin Chen
  2024-08-28 19:01     ` Jason Gunthorpe
  2024-08-30  8:16   ` Tian, Kevin
  2024-08-30 16:09   ` Mostafa Saleh
  2 siblings, 1 reply; 95+ messages in thread
From: Nicolin Chen @ 2024-08-27 21:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Aug 27, 2024 at 12:51:38PM -0300, Jason Gunthorpe wrote:
> For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting
> as the parent and a user provided STE fragment that defines the CD table
> and related data with addresses translated by the S2 iommu_domain.
> 
> The kernel only permits userspace to control certain allowed bits of the
> STE that are safe for user/guest control.
> 
> IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
> translation, but there is no way of knowing which S1 entries refer to a
> range of S2.
> 
> For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
> flush all ASIDs from the VMID after flushing the S2 on any change to the
> S2.
> 
> Similarly we have to flush the entire ATC if the S2 is changed.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

With some small nits:

> @@ -2192,6 +2255,16 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
>  	}
>  	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
>  
> +	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S2 &&
> +	    smmu_domain->nest_parent) {

smmu_domain->nest_parent alone is enough?
  
[---]
> +static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
> +				      struct device *dev)
> +{
[..]
> +	if (arm_smmu_ssids_in_use(&master->cd_table) ||

This feels more like an -EBUSY, as the device would be unlikely to be
able to attach to a different nested domain?

> +	    nested_domain->s2_parent->smmu != master->smmu)
> +		return -EINVAL;
[---]

> +static struct iommu_domain *
> +arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
> +			      struct iommu_domain *parent,
> +			      const struct iommu_user_data *user_data)
> +{
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> +	struct arm_smmu_nested_domain *nested_domain;
> +	struct arm_smmu_domain *smmu_parent;
> +	struct iommu_hwpt_arm_smmuv3 arg;
> +	unsigned int eats;
> +	unsigned int cfg;
> +	int ret;
> +
> +	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	/*
> +	 * Must support some way to prevent the VM from bypassing the cache
> +	 * because VFIO currently does not do any cache maintenance.
> +	 */
> +	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
> +	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	ret = iommu_copy_struct_from_user(&arg, user_data,
> +					  IOMMU_HWPT_DATA_ARM_SMMUV3, ste);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	if (flags || !(master->smmu->features & ARM_SMMU_FEAT_TRANS_S1))
> +		return ERR_PTR(-EOPNOTSUPP);

A bit redundant with the first sanity check against ARM_SMMU_FEAT_NESTING,
since ARM_SMMU_FEAT_NESTING includes ARM_SMMU_FEAT_TRANS_S1.

> +
> +	if (!(parent->type & __IOMMU_DOMAIN_PAGING))
> +		return ERR_PTR(-EINVAL);
> +
> +	smmu_parent = to_smmu_domain(parent);
> +	if (smmu_parent->stage != ARM_SMMU_DOMAIN_S2 ||

Maybe "!smmu_parent->nest_parent" instead.

[---]
> +	    smmu_parent->smmu != master->smmu)
> +		return ERR_PTR(-EINVAL);

It'd be slightly nicer if we do all the non-arg validations prior
to calling iommu_copy_struct_from_user(). Then, the following arg
validations would be closer to the copy().

> +
> +	/* EIO is reserved for invalid STE data. */
> +	if ((arg.ste[0] & ~STRTAB_STE_0_NESTING_ALLOWED) ||
> +	    (arg.ste[1] & ~STRTAB_STE_1_NESTING_ALLOWED))
> +		return ERR_PTR(-EIO);
[---]


>  /* The following are exposed for testing purposes. */
>  struct arm_smmu_entry_writer_ops;
>  struct arm_smmu_entry_writer {
> @@ -830,6 +849,7 @@ struct arm_smmu_master_domain {
>  	struct list_head devices_elm;
>  	struct arm_smmu_master *master;
>  	ioasid_t ssid;
> +	u8 nest_parent;

Would it be nicer to match with the one in struct arm_smmu_domain:
+	bool				nest_parent : 1;
?

> + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 Context Descriptor Table info
> + *                                (IOMMU_HWPT_DATA_ARM_SMMUV3)
> + *
> + * @ste: The first two double words of the user space Stream Table Entry for
> + *       a user stage-1 Context Descriptor Table. Must be little-endian.
> + *       Allowed fields: (Refer to "5.2 Stream Table Entry" in SMMUv3 HW Spec)
> + *       - word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
> + *       - word-1: S1DSS, S1CIR, S1COR, S1CSH, S1STALLD

It seems that word-1 is missing EATS.

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2024-08-27 15:51 ` [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED Jason Gunthorpe
@ 2024-08-27 21:31 ` Nicolin Chen
  2024-08-28 16:31   ` Shameerali Kolothum Thodi
  2024-09-12  3:42   ` Zhangfei Gao
  2024-10-16  2:23 ` Zhangfei Gao
  9 siblings, 2 replies; 95+ messages in thread
From: Nicolin Chen @ 2024-08-27 21:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Aug 27, 2024 at 12:51:30PM -0300, Jason Gunthorpe wrote:
> This brings support for the IOMMFD ioctls:
> 
>  - IOMMU_GET_HW_INFO
>  - IOMMU_HWPT_ALLOC_NEST_PARENT
>  - IOMMU_DOMAIN_NESTED
>  - ops->enforce_cache_coherency()
> 
> This is quite straightforward as the nested STE can just be built in the
> special NESTED domain op and fed through the generic update machinery.
> 
> The design allows the user provided STE fragment to control several
> aspects of the translation, including putting the STE into a "virtual
> bypass" or a aborting state. This duplicates functionality available by
> other means, but it allows trivially preserving the VMID in the STE as we
> eventually move towards the VIOMMU owning the VMID.
> 
> Nesting support requires the system to either support S2FWB or the
> stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> cache and view incoherent data, currently VFIO lacks any cache flushing
> that would make this safe.
> 
> Yan has a series to add some of the needed infrastructure for VFIO cache
> flushing here:
> 
>  https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/
> 
> Which may someday allow relaxing this further.
> 
> Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and superseded by
> this.
> 
> This is the first series in what will be several to complete nesting
> support. At least:
>  - IOMMU_RESV_SW_MSI related fixups
>     https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
>  - VIOMMU object support to allow ATS and CD invalidations
>     https://lore.kernel.org/linux-iommu/cover.1723061377.git.nicolinc@nvidia.com/
>  - vCMDQ hypervisor support for direct invalidation queue assignment
>     https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
>  - KVM pinned VMID using VIOMMU for vBTM
>     https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
>  - Cross instance S2 sharing
>  - Virtual Machine Structure using VIOMMU (for vMPAM?)
>  - Fault forwarding support through IOMMUFD's fault fd for vSVA
> 
> The VIOMMU series is essential to allow the invalidations to be processed
> for the CD as well.
> 
> It is enough to allow qemu work to progress.
> 
> This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
> 
> v2:

As mentioned above, the VIOMMU series is required to test the
entire nesting feature; it now has a v2 rebased on this series.
I tested it with a pairing QEMU branch. Please refer to:
https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
Also, there is another new VIRQ series on top of the VIOMMU one
and this nesting series, which I tested too. Please refer to:
https://lore.kernel.org/linux-iommu/cover.1724777091.git.nicolinc@nvidia.com/

With that,

Tested-by: Nicolin Chen <nicolinc@nvidia.com>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-27 21:31 ` [PATCH v2 0/8] Initial support for SMMUv3 nested translation Nicolin Chen
@ 2024-08-28 16:31   ` Shameerali Kolothum Thodi
  2024-08-28 17:14     ` Nicolin Chen
  2024-09-12  3:42   ` Zhangfei Gao
  1 sibling, 1 reply; 95+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-08-28 16:31 UTC (permalink / raw)
  To: Nicolin Chen, Jason Gunthorpe
  Cc: acpica-devel@lists.linux.dev, Guohanjun (Hanjun Guo),
	iommu@lists.linux.dev, Joerg Roedel, Kevin Tian,
	kvm@vger.kernel.org, Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches@lists.linux.dev,
	Mostafa Saleh

Hi Nicolin,

> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, August 27, 2024 10:31 PM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: acpica-devel@lists.linux.dev; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
> kvm@vger.kernel.org; Len Brown <lenb@kernel.org>; linux-
> acpi@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Lorenzo Pieralisi
> <lpieralisi@kernel.org>; Rafael J. Wysocki <rafael@kernel.org>; Robert
> Moore <robert.moore@intel.com>; Robin Murphy
> <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>; Will
> Deacon <will@kernel.org>; Alex Williamson <alex.williamson@redhat.com>;
> Eric Auger <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-
> philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; Michael Shavit
> <mshavit@google.com>; patches@lists.linux.dev; Shameerali Kolothum
> Thodi <shameerali.kolothum.thodi@huawei.com>; Mostafa Saleh
> <smostafa@google.com>
> Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
 
> As mentioned above, the VIOMMU series would be required to test the
> entire nesting feature, which now has a v2 rebasing on this series. I tested it
> with a pairing QEMU branch. Please refer to:
> https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/

Thanks for this. I haven't gone through the viommu series and its QEMU
branch yet. The way we present nested-smmuv3/iommufd to QEMU seems to
have changed with the above QEMU branch (multiple nested SMMUs).
The old QEMU command line for the nested setup doesn't work anymore.

Could you please share an example QEMU command line to verify this
series? (Sorry if I missed it in the links/git.)

Thanks,
Shameer 






^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-28 16:31   ` Shameerali Kolothum Thodi
@ 2024-08-28 17:14     ` Nicolin Chen
  2024-08-28 18:06       ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 95+ messages in thread
From: Nicolin Chen @ 2024-08-28 17:14 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	patches@lists.linux.dev, Mostafa Saleh

Hi Shameer,

On Wed, Aug 28, 2024 at 04:31:36PM +0000, Shameerali Kolothum Thodi wrote:
> Hi Nicolin,
> 
> 
> > As mentioned above, the VIOMMU series would be required to test the
> > entire nesting feature, which now has a v2 rebasing on this series. I tested it
> > with a pairing QEMU branch. Please refer to:
> > https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
> 
> Thanks for this. I haven't gone through the viommu and its Qemu branch
> yet.  The way we present nested-smmuv3/iommufd to the Qemu seems to
> have changed  with the above Qemu branch(multiple nested SMMUs).
> The old Qemu command line for nested setup doesn't work anymore.
> 
> Could you please share an example Qemu command line  to verify this
> series(Sorry, if I missed it in the links/git).

My bad. I updated those two "for_iommufd_" QEMU branches with a
README commit on top of each for the reference command.

By the way, I wonder how many SMMUv3 instances there are on the
platforms that SMMUv3 developers here are running on. Is anyone
else also working on a chip that has multiple instances?

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-28 17:14     ` Nicolin Chen
@ 2024-08-28 18:06       ` Shameerali Kolothum Thodi
  2024-08-28 18:12         ` Nicolin Chen
  0 siblings, 1 reply; 95+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-08-28 18:06 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	patches@lists.linux.dev, Mostafa Saleh



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, August 28, 2024 6:15 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; acpica-devel@lists.linux.dev;
> Guohanjun (Hanjun Guo) <guohanjun@huawei.com>;
> iommu@lists.linux.dev; Joerg Roedel <joro@8bytes.org>; Kevin Tian
> <kevin.tian@intel.com>; kvm@vger.kernel.org; Len Brown
> <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>;
> Robin Murphy <robin.murphy@arm.com>; Sudeep Holla
> <sudeep.holla@arm.com>; Will Deacon <will@kernel.org>; Alex Williamson
> <alex.williamson@redhat.com>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer
> <mdf@kernel.org>; Michael Shavit <mshavit@google.com>;
> patches@lists.linux.dev; Mostafa Saleh <smostafa@google.com>
> Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
> Hi Shameer,
> 
> On Wed, Aug 28, 2024 at 04:31:36PM +0000, Shameerali Kolothum Thodi
> wrote:
> > Hi Nicolin,
> >
> >
> > > As mentioned above, the VIOMMU series would be required to test the
> > > entire nesting feature, which now has a v2 rebasing on this series.
> > > I tested it with a pairing QEMU branch. Please refer to:
> > > https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
> >
> > Thanks for this. I haven't gone through the viommu and its Qemu branch
> > yet.  The way we present nested-smmuv3/iommufd to the Qemu seems to
> > have changed  with the above Qemu branch(multiple nested SMMUs).
> > The old Qemu command line for nested setup doesn't work anymore.
> >
> > Could you please share an example Qemu command line  to verify this
> > series(Sorry, if I missed it in the links/git).
> 
> My bad. I updated those two "for_iommufd_" QEMU branches with a
> README commit on top of each for the reference command.

Thanks. I gave it a go, and this is my command line based on the above:

./qemu-system-aarch64-nicolin-viommu -object iommufd,id=iommufd0 \
-machine hmat=on \
-machine virt,accel=kvm,gic-version=3,iommu=nested-smmuv3,ras=on \
-cpu host -smp cpus=61 -m size=16G,slots=4,maxmem=256G -nographic \
-object memory-backend-ram,size=8G,id=m0 \
-object memory-backend-ram,size=8G,id=m1 \
-numa node,memdev=m0,cpus=0-60,nodeid=0  -numa node,memdev=m1,nodeid=1 \
-device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0 \
-bios QEMU_EFI.fd \
-drive if=none,file=ubuntu-18.04-old.img,id=fs \
-device virtio-blk-device,drive=fs \
-kernel Image \
-append "rdinit=init console=ttyAMA0 root=/dev/vda rw earlycon=pl011,0x9000000 kpti=off" \
-nographic

But it fails to boot very early:

root@ubuntu:/home/shameer/qemu-test# ./qemu_run-simple-iommufd-nicolin-2
qemu-system-aarch64-nicolin-viommu: Illegal numa node 2
 
Any idea what I am missing? Do you have any special config enabled while building QEMU?

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-28 18:06       ` Shameerali Kolothum Thodi
@ 2024-08-28 18:12         ` Nicolin Chen
  2024-08-29 13:14           ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 95+ messages in thread
From: Nicolin Chen @ 2024-08-28 18:12 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	patches@lists.linux.dev, Mostafa Saleh

On Wed, Aug 28, 2024 at 06:06:36PM +0000, Shameerali Kolothum Thodi wrote:
> > > > As mentioned above, the VIOMMU series would be required to test the
> > > > entire nesting feature, which now has a v2 rebasing on this series.
> > > > I tested it with a pairing QEMU branch. Please refer to:
> > > > https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
> > >
> > > Thanks for this. I haven't gone through the viommu and its Qemu branch
> > > yet.  The way we present nested-smmuv3/iommufd to the Qemu seems to
> > > have changed  with the above Qemu branch(multiple nested SMMUs).
> > > The old Qemu command line for nested setup doesn't work anymore.
> > >
> > > Could you please share an example Qemu command line  to verify this
> > > series(Sorry, if I missed it in the links/git).
> >
> > My bad. I updated those two "for_iommufd_" QEMU branches with a
> > README commit on top of each for the reference command.
> 
> Thanks. I did give it a go and this is my command line based on above,

> But it fails to boot very early:
> 
> root@ubuntu:/home/shameer/qemu-test# ./qemu_run-simple-iommufd-nicolin-2
> qemu-system-aarch64-nicolin-viommu: Illegal numa node 2
> 
> Any idea what am I missing? Do you any special config enabled while building Qemu?

Looks like you are running on a multi-SMMU platform :)

Would you please try syncing your local branch? That should work,
as the update also had a small change to the virt code:

diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 161a28a311..a782909016 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -1640,7 +1640,7 @@ static PCIBus *create_pcie_expander_bridge(VirtMachineState *vms, uint8_t idx)
     }

     qdev_prop_set_uint8(dev, "bus_nr", bus_nr);
-    qdev_prop_set_uint16(dev, "numa_node", idx);
+    qdev_prop_set_uint16(dev, "numa_node", 0);
     qdev_realize_and_unref(dev, BUS(bus), &error_fatal);

     /* Get the pxb bus */


Thanks
Nicolin


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-27 19:48   ` Nicolin Chen
@ 2024-08-28 18:30     ` Jason Gunthorpe
  2024-08-28 19:47       ` Nicolin Chen
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-28 18:30 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Aug 27, 2024 at 12:48:13PM -0700, Nicolin Chen wrote:
> Hi Jason,
> 
> On Tue, Aug 27, 2024 at 12:51:32PM -0300, Jason Gunthorpe wrote:
> > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > index f5d9fd1f45bf49..9b3658aae21005 100644
> > --- a/drivers/iommu/io-pgtable-arm.c
> > +++ b/drivers/iommu/io-pgtable-arm.c
> > @@ -106,6 +106,18 @@
> >  #define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
> >  #define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
> >  #define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
> > +/*
> > + * For !FWB these code to:
> > + *  1111 = Normal outer write back cachable / Inner Write Back Cachable
> > + *         Permit S1 to override
> > + *  0101 = Normal Non-cachable / Inner Non-cachable
> > + *  0001 = Device / Device-nGnRE
> > + * For S2FWB these code:
> > + *  0110 Force Normal Write Back
> > + *  0101 Normal* is forced Normal-NC, Device unchanged
> > + *  0001 Force Device-nGnRE
> > + */
> > +#define ARM_LPAE_PTE_MEMATTR_FWB_WB	(((arm_lpae_iopte)0x6) << 2)
> 
> The other part looks good. Yet, would you mind sharing the location
> that defines this 0x6 explicitly?

I'm looking at an older one, ARM DDI 0487F.c:

D5.5.5 Stage 2 memory region type and Cacheability attributes when FEAT_S2FWB is implemented

The text talks about the bits in the PTE, not relative to the MEMATTR
field, so 6 << 2 encodes to:

 543210
 011000

Then see Table D5-40 (Effect of bit[4] == 1 on Cacheability and Memory Type):

 Bit[5] = 0 = RES0.
 Bit[4] = 1 = determines the interpretation of bits [3:2].
 Bits[3:2] = 10 = Normal Write-Back

Here Bit means 'bit of the PTE' because the MemAttr does not have 5
bits.

> Where it has the followings in D8.6.6:
>  "For stage 2 translations, if FEAT_MTE_PERM is not implemented, then
>   FEAT_S2FWB has all of the following effects on the MemAttr[3:2] bits:
>    - MemAttr[3] is RES0.
>    - The value of MemAttr[2] determines the interpretation of the
>      MemAttr[1:0] bits.

And here the text switches from talking about the PTE bits to the
Field bits. MemAttr[3] == PTE[5], and the above text matches D5.5

The use of numbering schemes relative to the start of the field and
also relative to the PTE is tricky.
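
To make the two numbering schemes concrete, here is a small standalone
sketch (plain userspace C, not kernel code). The macro and helper names
are made up for illustration; only the 0x6 << 2 encoding comes from the
patch.

```c
#include <stdint.h>

/* Same encoding as the patch: MemAttr value 0x6 placed at PTE bits [5:2]. */
#define MEMATTR_SHIFT	2
#define MEMATTR_FWB_WB	((uint64_t)0x6 << MEMATTR_SHIFT)

/* Hypothetical helper: bit n counted from the start of the PTE. */
static unsigned int pte_bit(uint64_t pte, unsigned int n)
{
	return (pte >> n) & 1;
}

/* Hypothetical helper: bit n counted from the start of the MemAttr field. */
static unsigned int memattr_bit(uint64_t pte, unsigned int n)
{
	return pte_bit(pte, n + MEMATTR_SHIFT);
}
```

With these, pte_bit(MEMATTR_FWB_WB, 4) and memattr_bit(MEMATTR_FWB_WB, 2)
read the same bit, as do pte_bit(..., 5) and memattr_bit(..., 3). That is
the sense in which the D5.5.5 wording and the D8.6.6 wording describe
the same encoding.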

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-27 21:23   ` Nicolin Chen
@ 2024-08-28 19:01     ` Jason Gunthorpe
  2024-08-28 19:27       ` Nicolin Chen
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-28 19:01 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Aug 27, 2024 at 02:23:07PM -0700, Nicolin Chen wrote:
> On Tue, Aug 27, 2024 at 12:51:38PM -0300, Jason Gunthorpe wrote:
> > For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting
> > as the parent and a user provided STE fragment that defines the CD table
> > and related data with addresses translated by the S2 iommu_domain.
> > 
> > The kernel only permits userspace to control certain allowed bits of the
> > STE that are safe for user/guest control.
> > 
> > IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
> > translation, but there is no way of knowing which S1 entries refer to a
> > range of S2.
> > 
> > For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
> > flush all ASIDs from the VMID after flushing the S2 on any change to the
> > S2.
> > 
> > Similarly we have to flush the entire ATC if the S2 is changed.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> 
> With some small nits:
> 
> > @@ -2192,6 +2255,16 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
> >  	}
> >  	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
> >  
> > +	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S2 &&
> > +	    smmu_domain->nest_parent) {
> 
> smmu_domain->nest_parent alone is enough?

Yes, I thought I did that when Robin noted it.. 

> [---]
> > +static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
> > +				      struct device *dev)
> > +{
> [..]
> > +	if (arm_smmu_ssids_in_use(&master->cd_table) ||
> 
> This feels more like a -EBUSY as it would be unlikely able to
> attach to a different nested domain?

Yeah, we did that in arm_smmu_attach_dev()

> > +static struct iommu_domain *
> > +arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
> > +			      struct iommu_domain *parent,
> > +			      const struct iommu_user_data *user_data)
> > +{
> > +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > +	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > +	struct arm_smmu_nested_domain *nested_domain;
> > +	struct arm_smmu_domain *smmu_parent;
> > +	struct iommu_hwpt_arm_smmuv3 arg;
> > +	unsigned int eats;
> > +	unsigned int cfg;
> > +	int ret;
> > +
> > +	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
> > +		return ERR_PTR(-EOPNOTSUPP);
> > +
> > +	/*
> > +	 * Must support some way to prevent the VM from bypassing the cache
> > +	 * because VFIO currently does not do any cache maintenance.
> > +	 */
> > +	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
> > +	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
> > +		return ERR_PTR(-EOPNOTSUPP);
> > +
> > +	ret = iommu_copy_struct_from_user(&arg, user_data,
> > +					  IOMMU_HWPT_DATA_ARM_SMMUV3, ste);
> > +	if (ret)
> > +		return ERR_PTR(ret);
> > +
> > +	if (flags || !(master->smmu->features & ARM_SMMU_FEAT_TRANS_S1))
> > +		return ERR_PTR(-EOPNOTSUPP);
> 
> A bit redundant to the first sanity against ARM_SMMU_FEAT_NESTING,
> since ARM_SMMU_FEAT_NESTING includes ARM_SMMU_FEAT_TRANS_S1.

Yeah, I think this was meant to be up at the top

	if (flags || !(master->smmu->features & ARM_SMMU_FEAT_NESTING))
		return ERR_PTR(-EOPNOTSUPP);

> > +
> > +	if (!(parent->type & __IOMMU_DOMAIN_PAGING))
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	smmu_parent = to_smmu_domain(parent);
> > +	if (smmu_parent->stage != ARM_SMMU_DOMAIN_S2 ||
> 
> Maybe "!smmu_parent->nest_parent" instead.

Hmm, yes.. Actually we can delete it, and the paging test above.

The core code checks it.

Though I think we missed owner validation there??

@@ -225,7 +225,8 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
        if ((flags & ~IOMMU_HWPT_FAULT_ID_VALID) ||
            !user_data->len || !ops->domain_alloc_user)
                return ERR_PTR(-EOPNOTSUPP);
-       if (parent->auto_domain || !parent->nest_parent)
+       if (parent->auto_domain || !parent->nest_parent ||
+           parent->common.domain->owner != ops)
                return ERR_PTR(-EINVAL);

Right??

> [---]
> > +	    smmu_parent->smmu != master->smmu)
> > +		return ERR_PTR(-EINVAL);
> 
> It'd be slightly nicer if we do all the non-arg validations prior
> to calling iommu_copy_struct_from_user(). Then, the following arg
> validations would be closer to the copy().

Sure

> >  struct arm_smmu_entry_writer {
> > @@ -830,6 +849,7 @@ struct arm_smmu_master_domain {
> >  	struct list_head devices_elm;
> >  	struct arm_smmu_master *master;
> >  	ioasid_t ssid;
> > +	u8 nest_parent;
> 
> Would it be nicer to match with the one in struct arm_smmu_domain:
> +	bool				nest_parent : 1;
> ?

Ah, let's just use bool

> > + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 Context Descriptor Table info
> > + *                                (IOMMU_HWPT_DATA_ARM_SMMUV3)
> > + *
> > + * @ste: The first two double words of the user space Stream Table Entry for
> > + *       a user stage-1 Context Descriptor Table. Must be little-endian.
> > + *       Allowed fields: (Refer to "5.2 Stream Table Entry" in SMMUv3 HW Spec)
> > + *       - word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
> > + *       - word-1: S1DSS, S1CIR, S1COR, S1CSH, S1STALLD
> 
> It seems that word-1 is missing EATS.

Yes, this was missed

Jason



* Re: [PATCH v2 5/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  2024-08-27 20:12   ` Nicolin Chen
@ 2024-08-28 19:12     ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-28 19:12 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Aug 27, 2024 at 01:12:27PM -0700, Nicolin Chen wrote:
> > @@ -2693,6 +2718,15 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
> >  		 * one of them.
> >  		 */
> >  		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> > +		if (smmu_domain->enforce_cache_coherency &&
> > +		    !(dev_iommu_fwspec_get(master->dev)->flags &
> > +		      IOMMU_FWSPEC_PCI_RC_CANWBS)) {
> 
> How about a small dev_enforce_cache_coherency() helper?

I added a

+static bool arm_smmu_master_canwbs(struct arm_smmu_master *master)
+{
+       return dev_iommu_fwspec_get(master->dev)->flags &
+              IOMMU_FWSPEC_PCI_RC_CANWBS;
+}
+
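
For anyone following along without the tree at hand, a minimal
userspace sketch of how such a helper feeds the attach-time check. The
flag values and struct names below are illustrative assumptions, not the
kernel definitions; only the logic (enforce_cache_coherency requires
CANWBS on the master) follows the hunk under discussion.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values, assumed for this sketch only. */
#define FWSPEC_PCI_RC_ATS	(1u << 0)
#define FWSPEC_PCI_RC_CANWBS	(1u << 1)

struct fwspec_model {
	uint32_t flags;
};

struct master_model {
	struct fwspec_model fwspec;
};

/* Conversion to bool collapses any non-zero masked value to true. */
static bool master_canwbs(const struct master_model *master)
{
	return master->fwspec.flags & FWSPEC_PCI_RC_CANWBS;
}

/*
 * A domain that promised enforced cache coherency may only take
 * masters whose root complex advertises CANWBS.
 */
static int attach_check(bool enforce_cache_coherency,
			const struct master_model *master)
{
	if (enforce_cache_coherency && !master_canwbs(master))
		return -EINVAL;
	return 0;
}
```

Domains that never called enforce_cache_coherency() accept any master
in this model, which is the behaviour the patch keeps.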

> > +			kfree(master_domain);
> > +			spin_unlock_irqrestore(&smmu_domain->devices_lock,
> > +					       flags);
> > +			return -EINVAL;
> 
> kfree() doesn't need to be locked.

Yep

Thanks,
Jason



* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-28 19:01     ` Jason Gunthorpe
@ 2024-08-28 19:27       ` Nicolin Chen
  0 siblings, 0 replies; 95+ messages in thread
From: Nicolin Chen @ 2024-08-28 19:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Wed, Aug 28, 2024 at 04:01:00PM -0300, Jason Gunthorpe wrote:

> > > +
> > > +	if (!(parent->type & __IOMMU_DOMAIN_PAGING))
> > > +		return ERR_PTR(-EINVAL);
> > > +
> > > +	smmu_parent = to_smmu_domain(parent);
> > > +	if (smmu_parent->stage != ARM_SMMU_DOMAIN_S2 ||
> > 
> > Maybe "!smmu_parent->nest_parent" instead.
> 
> Hmm, yes.. Actually we can delete it, and the paging test above.
> 
> The core code checks it.

Yea, we can rely on the core.

> Though I think we missed owner validation there??
> 
> @@ -225,7 +225,8 @@ iommufd_hwpt_nested_alloc(struct iommufd_ctx *ictx,
>         if ((flags & ~IOMMU_HWPT_FAULT_ID_VALID) ||
>             !user_data->len || !ops->domain_alloc_user)
>                 return ERR_PTR(-EOPNOTSUPP);
> -       if (parent->auto_domain || !parent->nest_parent)
> +       if (parent->auto_domain || !parent->nest_parent ||
> +           parent->common.domain->owner != ops)
>                 return ERR_PTR(-EINVAL);
> 
> Right??

Yea, this ensures the same driver.

> > [---]
> > > +	    smmu_parent->smmu != master->smmu)
> > > +		return ERR_PTR(-EINVAL);

Then, we only need this one.

Thanks
Nicolin



* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-28 18:30     ` Jason Gunthorpe
@ 2024-08-28 19:47       ` Nicolin Chen
  0 siblings, 0 replies; 95+ messages in thread
From: Nicolin Chen @ 2024-08-28 19:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Wed, Aug 28, 2024 at 03:30:37PM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 27, 2024 at 12:48:13PM -0700, Nicolin Chen wrote:
> > Hi Jason,
> > 
> > On Tue, Aug 27, 2024 at 12:51:32PM -0300, Jason Gunthorpe wrote:
> > > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > > index f5d9fd1f45bf49..9b3658aae21005 100644
> > > --- a/drivers/iommu/io-pgtable-arm.c
> > > +++ b/drivers/iommu/io-pgtable-arm.c
> > > @@ -106,6 +106,18 @@
> > >  #define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
> > >  #define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
> > >  #define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
> > > +/*
> > > + * For !FWB these code to:
> > > + *  1111 = Normal outer write back cachable / Inner Write Back Cachable
> > > + *         Permit S1 to override
> > > + *  0101 = Normal Non-cachable / Inner Non-cachable
> > > + *  0001 = Device / Device-nGnRE
> > > + * For S2FWB these code:
> > > + *  0110 Force Normal Write Back
> > > + *  0101 Normal* is forced Normal-NC, Device unchanged
> > > + *  0001 Force Device-nGnRE
> > > + */
> > > +#define ARM_LPAE_PTE_MEMATTR_FWB_WB	(((arm_lpae_iopte)0x6) << 2)
> > 
> > The other part looks good. Yet, would you mind sharing the location
> > that defines this 0x6 explicitly?
> 
> I'm looking at an older one ARM DDI 0487F.c
> 
> D5.5.5 Stage 2 memory region type and Cacheability attributes when FEAT_S2FWB is implemented
> 
> The text talks about the bits in the PTE, not relative to the MEMATTR
> field, so 6 << 2 encodes to:
> 
>  543210
>  011000
> 
> Then see table D5-40 Effect of bit[4] == 1 on Cacheability and Memory Type)
> 
>  Bit[5] = 0 = is RES0.
>  Bit[4] = 1 = determines the interpretation of bits [3:2].
>  Bits[3:2] == 10 == Normal Write-Back 
> 
> Here Bit means 'bit of the PTE' because the MemAttr does not have 5
> bits.
> 
> > Where it has the followings in D8.6.6:
> >  "For stage 2 translations, if FEAT_MTE_PERM is not implemented, then
> >   FEAT_S2FWB has all of the following effects on the MemAttr[3:2] bits:
> >    - MemAttr[3] is RES0.
> >    - The value of MemAttr[2] determines the interpretation of the
> >      MemAttr[1:0] bits.
> 
> And here the text switches from talking about the PTE bits to the
> Field bits. MemAttr[3] == PTE[5], and the above text matches D5.5
> 
> The use of numbering schemes relative to the start of the field and
> also relative to the PTE is tricky.

I downloaded version F.c, and the chapter looks cleaner than
the one in K.a. I guess FEAT_MTE_PERM complicates that...

I double checked the K.a doc, and found a piece of pseudocode
that seems to confirm 0b0110 (0x6) is the correct value:

MemoryAttributes AArch64.S2ApplyFWBMemAttrs(MemoryAttributes s1_memattrs,
					    S2TTWParams walkparams,
					    bits(N) descriptor)
	MemoryAttributes memattrs;
	s2_attr = descriptor<5:2>;
	s2_sh = if walkparams.ds == '1' then walkparams.sh else descriptor<9:8>;
	s2_fnxs = descriptor<11>;
	if s2_attr<2> == '0' then // S2 Device, S1 any
		s2_device = DecodeDevice(s2_attr<1:0>);
		memattrs.memtype = MemType_Device;
		if s1_memattrs.memtype == MemType_Device then
			memattrs.device = S2CombineS1Device(s1_memattrs.device, s2_device);
		else
			memattrs.device = s2_device;
		memattrs.xs = s1_memattrs.xs;
	elsif s2_attr<1:0> == '11' then // S2 attr = S1 attr
		memattrs = s1_memattrs;
	elsif s2_attr<1:0> == '10' then // Force writeback
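
As a rough cross-check of the arithmetic, here is a small standalone C
sketch (not kernel code) that follows only the branches quoted above.
All names are made up for illustration; the one fact taken from the
patch is that MemAttr sits at PTE bits [5:2], so 0x6 << 2 is 0x18.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names. MemAttr occupies stage-2 PTE bits [5:2]. */
#define S2_MEMATTR_SHIFT	2
#define S2_MEMATTR_MASK		0xfu

enum fwb_kind {
	FWB_DEVICE,	/* s2_attr<2> == '0' */
	FWB_S1_ATTR,	/* s2_attr<1:0> == '11' */
	FWB_FORCE_WB,	/* s2_attr<1:0> == '10' */
	FWB_OTHER,	/* '01', not covered by the quoted excerpt */
};

/* Extract the field value from a PTE, i.e. PTE-relative -> field-relative. */
static unsigned int pte_to_memattr(uint64_t pte)
{
	return (unsigned int)((pte >> S2_MEMATTR_SHIFT) & S2_MEMATTR_MASK);
}

/* Same branch order as the pseudocode excerpt; MemAttr[3] is ignored. */
static enum fwb_kind fwb_decode(unsigned int memattr)
{
	assert(memattr <= S2_MEMATTR_MASK);

	if (!(memattr & 0x4))
		return FWB_DEVICE;
	if ((memattr & 0x3) == 0x3)
		return FWB_S1_ATTR;
	if ((memattr & 0x3) == 0x2)
		return FWB_FORCE_WB;
	return FWB_OTHER;
}
```

Feeding it 0x6 takes the "Force writeback" branch, and 0x1 takes the
Device branch, which lines up with the 0110 and 0001 rows in the patch
comment.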

Thanks
Nicolin



* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-27 15:51 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
  2024-08-27 19:48   ` Nicolin Chen
@ 2024-08-28 19:50   ` Nicolin Chen
  2024-08-30  7:44   ` Tian, Kevin
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 95+ messages in thread
From: Nicolin Chen @ 2024-08-28 19:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Aug 27, 2024 at 12:51:32PM -0300, Jason Gunthorpe wrote:
> Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> works. When S2FWB is supported and enabled the IOPTE will force cachable
> access to IOMMU_CACHE memory when nesting with a S1 and deny cachable
> access otherwise.
> 
> When using a single stage of translation, a simple S2 domain, it doesn't
> change anything as it is just a different encoding for the exsting mapping
> of the IOMMU protection flags to cachability attributes.
> 
> However, when used with a nested S1, FWB has the effect of preventing the
> guest from choosing a MemAttr in it's S1 that would cause ordinary DMA to
> bypass the cache. Consistent with KVM we wish to deny the guest the
> ability to become incoherent with cached memory the hypervisor believes is
> cachable so we don't have to flush it.
> 
> Turn on S2FWB whenever the SMMU supports it and use it for all S2
> mappings.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>



* Re: [PATCH v2 3/8] ACPICA: IORT: Update for revision E.f
  2024-08-27 15:51 ` [PATCH v2 3/8] ACPICA: IORT: Update for revision E.f Jason Gunthorpe
@ 2024-08-29 10:14   ` Rafael J. Wysocki
  0 siblings, 0 replies; 95+ messages in thread
From: Rafael J. Wysocki @ 2024-08-29 10:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Aug 27, 2024 at 5:51 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> From: Nicolin Chen <nicolinc@nvidia.com>
>
> ACPICA commit c4f5c083d24df9ddd71d5782c0988408cf0fc1ab
>
> The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
> Access Flag field in the Memory Access Properties table, mainly for a PCI
> Root Complex.
>
> This CANWBS defines the coherency of memory accesses to be not marked IOWB
> cacheable/shareable. Its value further implies the coherency impact from a
> pair of mismatched memory attributes (e.g. in a nested translation case):
>   0x0: Use of mismatched memory attributes for accesses made by this
>        device may lead to a loss of coherency.
>   0x1: Coherency of accesses made by this device to locations in
>        Conventional memory are ensured as follows, even if the memory
>        attributes for the accesses presented by the device or provided by
>        the SMMU are different from Inner and Outer Write-back cacheable,
>        Shareable.
>
> Link: https://github.com/acpica/acpica/commit/c4f5c083
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  include/acpi/actbl2.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h
> index e27958ef82642f..9a7acf403ed3c8 100644
> --- a/include/acpi/actbl2.h
> +++ b/include/acpi/actbl2.h
> @@ -453,7 +453,7 @@ struct acpi_table_ccel {
>   * IORT - IO Remapping Table
>   *
>   * Conforms to "IO Remapping Table System Software on ARM Platforms",
> - * Document number: ARM DEN 0049E.e, Sep 2022
> + * Document number: ARM DEN 0049E.f, Apr 2024
>   *
>   ******************************************************************************/
>
> @@ -524,6 +524,7 @@ struct acpi_iort_memory_access {
>
>  #define ACPI_IORT_MF_COHERENCY          (1)
>  #define ACPI_IORT_MF_ATTRIBUTES         (1<<1)
> +#define ACPI_IORT_MF_CANWBS             (1<<2)
>
>  /*
>   * IORT node specific subtables
> --
> 2.46.0
>
>



* RE: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-28 18:12         ` Nicolin Chen
@ 2024-08-29 13:14           ` Shameerali Kolothum Thodi
  2024-08-29 14:52             ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 95+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-08-29 13:14 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	patches@lists.linux.dev, Mostafa Saleh



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, August 28, 2024 7:13 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; acpica-devel@lists.linux.dev;
> Guohanjun (Hanjun Guo) <guohanjun@huawei.com>;
> iommu@lists.linux.dev; Joerg Roedel <joro@8bytes.org>; Kevin Tian
> <kevin.tian@intel.com>; kvm@vger.kernel.org; Len Brown
> <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>;
> Robin Murphy <robin.murphy@arm.com>; Sudeep Holla
> <sudeep.holla@arm.com>; Will Deacon <will@kernel.org>; Alex Williamson
> <alex.williamson@redhat.com>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer
> <mdf@kernel.org>; Michael Shavit <mshavit@google.com>;
> patches@lists.linux.dev; Mostafa Saleh <smostafa@google.com>
> Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
> On Wed, Aug 28, 2024 at 06:06:36PM +0000, Shameerali Kolothum Thodi
> wrote:
> > > > > As mentioned above, the VIOMMU series would be required to test
> the
> > > > > entire nesting feature, which now has a v2 rebasing on this series.
> > > > > I tested it with a paring QEMU branch. Please refer to:
> > > > > https://lore.kernel.org/linux-
> > > > > iommu/cover.1724776335.git.nicolinc@nvidia.com/
> > > >
> > > > Thanks for this. I haven't gone through the viommu and its Qemu
> branch
> > > > yet.  The way we present nested-smmuv3/iommufd to the Qemu seems
> to
> > > > have changed  with the above Qemu branch(multiple nested SMMUs).
> > > > The old Qemu command line for nested setup doesn't work anymore.
> > > >
> > > > Could you please share an example Qemu command line  to verify this
> > > > series(Sorry, if I missed it in the links/git).
> > >
> > > My bad. I updated those two "for_iommufd_" QEMU branches with a
> > > README commit on top of each for the reference command.
> >
> > Thanks. I did give it a go and this is my command line based on above,
> 
> > But it fails to boot very early:
> >
> > root@ubuntu:/home/shameer/qemu-test# ./qemu_run-simple-iommufd-
> nicolin-2
> > qemu-system-aarch64-nicolin-viommu: Illegal numa node 2
> >
> > Any idea what am I missing? Do you any special config enabled while
> building Qemu?
> 
> Looks like you are running on a multi-SMMU platform :)
> 
> Would you please try syncing your local branch? That should work,
> as the update also had a small change to the virt code:
> 
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 161a28a311..a782909016 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -1640,7 +1640,7 @@ static PCIBus
> *create_pcie_expander_bridge(VirtMachineState *vms, uint8_t idx)
>      }
> 
>      qdev_prop_set_uint8(dev, "bus_nr", bus_nr);
> -    qdev_prop_set_uint16(dev, "numa_node", idx);
> +    qdev_prop_set_uint16(dev, "numa_node", 0);
>      qdev_realize_and_unref(dev, BUS(bus), &error_fatal);

That makes some progress, but I am still not seeing the assigned
dev in the guest.

-device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0

root@ubuntu:/# lspci -tv
-+-[0000:ca]---00.0-[cb]--
 \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
             +-01.0  Red Hat, Inc Virtio network device
             +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
             \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge

The new root port is created, but no device attached.

But without iommufd,
-device vfio-pci-nohotplug,host=0000:75:00.1

root@ubuntu:/# lspci -tv
-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
           +-01.0  Red Hat, Inc Virtio network device
           +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
           +-09.0  Red Hat, Inc. QEMU PCIe Expander bridge
           \-0a.0  Huawei Technologies Co., Ltd. Device a251

We can see dev a251.

And yes, the setup has multiple SMMUs (8).

Thanks,
Shameer




* RE: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-29 13:14           ` Shameerali Kolothum Thodi
@ 2024-08-29 14:52             ` Shameerali Kolothum Thodi
  2024-08-29 16:10               ` Nicolin Chen
  0 siblings, 1 reply; 95+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-08-29 14:52 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	patches@lists.linux.dev, Mostafa Saleh



> -----Original Message-----
> From: Shameerali Kolothum Thodi
> Sent: Thursday, August 29, 2024 2:15 PM
> To: 'Nicolin Chen' <nicolinc@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; acpica-devel@lists.linux.dev;
> Guohanjun (Hanjun Guo) <guohanjun@huawei.com>;
> iommu@lists.linux.dev; Joerg Roedel <joro@8bytes.org>; Kevin Tian
> <kevin.tian@intel.com>; kvm@vger.kernel.org; Len Brown
> <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>;
> Robin Murphy <robin.murphy@arm.com>; Sudeep Holla
> <sudeep.holla@arm.com>; Will Deacon <will@kernel.org>; Alex Williamson
> <alex.williamson@redhat.com>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer
> <mdf@kernel.org>; Michael Shavit <mshavit@google.com>;
> patches@lists.linux.dev; Mostafa Saleh <smostafa@google.com>
> Subject: RE: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
> That makes some progress. But still I am not seeing the assigned dev  in
> Guest.
> 
> -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0
> 
> root@ubuntu:/# lspci -tv#
> 
> root@ubuntu:/# lspci -tv
> -+-[0000:ca]---00.0-[cb]--
>  \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
>              +-01.0  Red Hat, Inc Virtio network device
>              +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge
> 
> The new root port is created, but no device attached.

It looks like the guest finds the config invalid:

[    0.283618] PCI host bridge to bus 0000:ca
[    0.284064] ACPI BIOS Error (bug): \_SB.PCF7.PCEE.PCE5.PCDC.PCD3.PCCA._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20240322/nsarguments-162)
[    0.285533] pci_bus 0000:ca: root bus resource [bus ca]
[    0.286214] pci 0000:ca:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root Port
[    0.287717] pci 0000:ca:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[    0.288431] pci 0000:ca:00.0: PCI bridge to [bus 00]
[    0.290649] pci 0000:ca:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    0.292476] pci_bus 0000:cb: busn_res: can not insert [bus cb-ca] under [bus ca] (conflicts with (null) [bus ca])
[    0.293597] pci_bus 0000:cb: busn_res: [bus cb-ca] end is updated to cb
[    0.294300] pci_bus 0000:cb: busn_res: can not insert [bus cb] under [bus ca] (conflicts with (null) [bus ca])

Let me know if you have any clue. 

Thanks,
Shameer




* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-29 14:52             ` Shameerali Kolothum Thodi
@ 2024-08-29 16:10               ` Nicolin Chen
  2024-08-30  9:07                 ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 95+ messages in thread
From: Nicolin Chen @ 2024-08-29 16:10 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	patches@lists.linux.dev, Mostafa Saleh

On Thu, Aug 29, 2024 at 02:52:23PM +0000, Shameerali Kolothum Thodi wrote:
> > That makes some progress. But still I am not seeing the assigned dev  in
> > Guest.
> >
> > -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0
> >
> > root@ubuntu:/# lspci -tv#
> >
> > root@ubuntu:/# lspci -tv
> > -+-[0000:ca]---00.0-[cb]--
> >  \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
> >              +-01.0  Red Hat, Inc Virtio network device
> >              +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
> >              \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge

Hmm, the tree looks correct..

> > The new root port is created, but no device attached.
> It looks like Guest finds the config invalid:
> 
> [    0.283618] PCI host bridge to bus 0000:ca
> [    0.284064] ACPI BIOS Error (bug): \_SB.PCF7.PCEE.PCE5.PCDC.PCD3.PCCA._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20240322/nsarguments-162)

Looks like the DSM change wasn't clean. Yet, this might not be the
root cause, as mine could boot with it.

Here is mine (I added a print to that conflict part, for success):

[    0.340733] ACPI BIOS Error (bug): \_SB.PCF7.PCEE.PCE5.PCDC._DSM: Excess arguments - ASL declared 5, ACPI requires 4 (20230628/nsarguments-162)
[    0.341776] pci 0000:dc:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root Port
[    0.344895] pci 0000:dc:00.0: BAR 0 [mem 0x10400000-0x10400fff]
[    0.347935] pci 0000:dc:00.0: PCI bridge to [bus dd]
[    0.348410] pci 0000:dc:00.0:   bridge window [mem 0x10200000-0x103fffff]
[    0.349483] pci 0000:dc:00.0:   bridge window [mem 0x42000000000-0x44080ffffff 64bit pref]
[    0.351459] pci_bus 0000:dd: busn_res: insert [bus dd] under [bus dc-dd]

In my case:
[root bus (00)] <---[pxb (dc)] <--- [root-port (dd)] <--- dev

In your case:
[root bus (00)] <---[pxb (ca)] <--- [root-port (cb)] <--- dev

> [    0.285533] pci_bus 0000:ca: root bus resource [bus ca]
> [    0.286214] pci 0000:ca:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root Port
> [    0.287717] pci 0000:ca:00.0: BAR 0 [mem 0x00000000-0x00000fff]
> [    0.288431] pci 0000:ca:00.0: PCI bridge to [bus 00]

This is where it starts to differ. Somehow the link is reversed? It should be:
 [    0.288431] pci 0000:ca:00.0: PCI bridge to [bus cb]

> [    0.290649] pci 0000:ca:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
> [    0.292476] pci_bus 0000:cb: busn_res: can not insert [bus cb-ca] under [bus ca] (conflicts with (null) [bus ca])
> [    0.293597] pci_bus 0000:cb: busn_res: [bus cb-ca] end is updated to cb
> [    0.294300] pci_bus 0000:cb: busn_res: can not insert [bus cb] under [bus ca] (conflicts with (null) [bus ca])

And then everything went south...

Would you please try adding some prints?
----------------------------------------------------------------------
@@ -1556,6 +1556,7 @@ static char *create_new_pcie_port(VirtNestedSmmu *nested_smmu, Error **errp)
     uint32_t bus_nr = pci_bus_num(nested_smmu->pci_bus);
     DeviceState *dev;
     char *name_port;
+    bool ret;
 
     /* Create a root port */
     dev = qdev_new("pcie-root-port");
@@ -1571,7 +1572,9 @@ static char *create_new_pcie_port(VirtNestedSmmu *nested_smmu, Error **errp)
     qdev_prop_set_uint32(dev, "chassis", chassis_nr);
     qdev_prop_set_uint32(dev, "slot", port_nr);
     qdev_prop_set_uint64(dev, "io-reserve", 0);
-    qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus), &error_fatal);
+    ret = qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus), &error_fatal);
+    fprintf(stderr, "ret=%d, pcie-root-port ID: %s, added to pxb_bus num: %x, chassis: %d\n",
+            ret, name_port, pci_bus_num(nested_smmu->pci_bus), chassis_nr);
     return name_port;
 }
 
----------------------------------------------------------------------

We should make sure that 'bus_nr' and 'bus' are set correctly
and that every realize() call returned true.

Thanks
Nicolin



* RE: [PATCH v2 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
  2024-08-27 15:51 ` [PATCH v2 1/8] vfio: Remove VFIO_TYPE1_NESTING_IOMMU Jason Gunthorpe
@ 2024-08-30  7:40   ` Tian, Kevin
  0 siblings, 0 replies; 95+ messages in thread
From: Tian, Kevin @ 2024-08-30  7:40 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel@lists.linux.dev, Hanjun Guo,
	iommu@lists.linux.dev, Joerg Roedel, kvm@vger.kernel.org,
	Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Moore, Robert, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Shameerali Kolothum Thodi, Mostafa Saleh

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 27, 2024 11:52 PM
> 
> This control causes the ARM SMMU drivers to choose a stage 2
> implementation for the IO pagetable (vs the stage 1 usual default),
> however this choice has no significant visible impact to the VFIO
> user. Further qemu never implemented this and no other userspace user is
> known.
> 
> The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add
> new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to
> "provide
> SMMU translation services to the guest operating system" however the rest
> of the API to set the guest table pointer for the stage 1 and manage
> invalidation was never completed, or at least never upstreamed, rendering
> this part useless dead code.
> 
> Upstream has now settled on iommufd as the uAPI for controlling nested
> translation. Choosing the stage 2 implementation should be done by through
> the IOMMU_HWPT_ALLOC_NEST_PARENT flag during domain allocation.
> 
> Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the
> enable_nesting iommu_domain_op.
> 
> Just in-case there is some userspace using this continue to treat
> requesting it as a NOP, but do not advertise support any more.

It took me a while to understand why we still allow the user setting the
IOMMU type to nesting below...

> @@ -2545,9 +2538,7 @@ static void *vfio_iommu_type1_open(unsigned
> long arg)
>  	switch (arg) {
>  	case VFIO_TYPE1_IOMMU:
>  		break;
> -	case VFIO_TYPE1_NESTING_IOMMU:
> -		iommu->nesting = true;
> -		fallthrough;
> +	case __VFIO_RESERVED_TYPE1_NESTING_IOMMU:
>  	case VFIO_TYPE1v2_IOMMU:
>  		iommu->v2 = true;
>  		break;

I guess the reason was that NESTING_IOMMU implies V2, so a user can
legitimately use it as V2 without relying on any of the removed nesting logic.
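
A tiny standalone model of the resulting switch, to make that reading
concrete. This is a sketch, not the vfio code: the constants are
illustrative stand-ins for the uAPI values and the default branch is an
assumption about how unknown types are rejected.

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-ins for the VFIO IOMMU type constants. */
enum {
	TYPE1 = 1,
	TYPE1V2 = 3,
	RESERVED_TYPE1_NESTING = 6,
};

struct iommu_model {
	bool v2;
};

/*
 * After the patch the reserved nesting type shares the v2 case label,
 * so it selects v2 semantics and nothing else.
 */
static int type1_open_model(unsigned long arg, struct iommu_model *iommu)
{
	iommu->v2 = false;

	switch (arg) {
	case TYPE1:
		break;
	case RESERVED_TYPE1_NESTING:
	case TYPE1V2:
		iommu->v2 = true;
		break;
	default:
		return -EINVAL;	/* assumed handling of unknown types */
	}
	return 0;
}
```

In this model an old user passing the nesting type gets exactly what a
V2 user gets, which is the NOP behaviour the commit message describes.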

So,

Reviewed-by: Kevin Tian <kevin.tian@intel.com>



* RE: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-27 15:51 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
  2024-08-27 19:48   ` Nicolin Chen
  2024-08-28 19:50   ` Nicolin Chen
@ 2024-08-30  7:44   ` Tian, Kevin
  2024-08-30  7:56     ` Nicolin Chen
  2024-08-30 15:12   ` Mostafa Saleh
  2024-09-04 14:20   ` Shameerali Kolothum Thodi
  4 siblings, 1 reply; 95+ messages in thread
From: Tian, Kevin @ 2024-08-30  7:44 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel@lists.linux.dev, Hanjun Guo,
	iommu@lists.linux.dev, Joerg Roedel, kvm@vger.kernel.org,
	Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Moore, Robert, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Shameerali Kolothum Thodi, Mostafa Saleh

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 27, 2024 11:52 PM
> 
> @@ -4189,6 +4193,13 @@ static int arm_smmu_device_hw_probe(struct
> arm_smmu_device *smmu)
> 
>  	/* IDR3 */
>  	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> +	/*
> +	 * If for some reason the HW does not support DMA coherency then using
> +	 * S2FWB won't work. This will also disable nesting support.
> +	 */
> +	if (FIELD_GET(IDR3_FWB, reg) &&
> +	    (smmu->features & ARM_SMMU_FEAT_COHERENCY))
> +		smmu->features |= ARM_SMMU_FEAT_S2FWB;
>  	if (FIELD_GET(IDR3_RIL, reg))
>  		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;

then also clear ARM_SMMU_FEAT_NESTING?


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 4/8] ACPI/IORT: Support CANWBS memory access flag
  2024-08-27 15:51 ` [PATCH v2 4/8] ACPI/IORT: Support CANWBS memory access flag Jason Gunthorpe
@ 2024-08-30  7:52   ` Tian, Kevin
  2024-08-30 13:54     ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Tian, Kevin @ 2024-08-30  7:52 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel@lists.linux.dev, Hanjun Guo,
	iommu@lists.linux.dev, Joerg Roedel, kvm@vger.kernel.org,
	Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Moore, Robert, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Shameerali Kolothum Thodi, Mostafa Saleh

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 27, 2024 11:52 PM
> 
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> The IORT spec, Issue E.f (April 2024), adds a new CANWBS bit to the Memory
> Access Flag field in the Memory Access Properties table, mainly for a PCI
> Root Complex.
> 
> This CANWBS defines the coherency of memory accesses to be not marked IOWB
> cacheable/shareable. Its value further implies the coherency impact from a
> pair of mismatched memory attributes (e.g. in a nested translation case):
>   0x0: Use of mismatched memory attributes for accesses made by this
>        device may lead to a loss of coherency.
>   0x1: Coherency of accesses made by this device to locations in
>        Conventional memory are ensured as follows, even if the memory
>        attributes for the accesses presented by the device or provided by
>        the SMMU are different from Inner and Outer Write-back cacheable,
>        Shareable.
> 
> Note that the loss of coherency on a CANWBS-unsupported HW typically could
> occur to an SMMU that doesn't implement the S2FWB feature where additional
> cache flush operations would be required to prevent that from happening.
> 
> Add a new ACPI_IORT_MF_CANWBS flag and set IOMMU_FWSPEC_PCI_RC_CANWBS upon
> the presence of this new flag.
> 
> CANWBS and S2FWB are similar features, in that they both guarantee the VM
> can not violate coherency, however S2FWB can be bypassed by PCI No Snoop
> TLPs, while CANWBS cannot. Thus CANWBS meets the requirements to set
> IOMMU_CAP_ENFORCE_CACHE_COHERENCY.
> 

I'm confused here. It is clear that we need a mechanism that prevents the
VM from bypassing the cache, until Yan's series comes along to relax this.

But according to the above description, S2FWB cannot 100% guarantee it
due to PCI No Snoop. Does this suggest that we should allow nesting
only for CANWBS, or disable/hide the PCI No Snoop cap from the guest
in the S2FWB case?


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-08-27 15:51 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info Jason Gunthorpe
@ 2024-08-30  7:55   ` Tian, Kevin
  2024-08-30 15:23   ` Mostafa Saleh
  1 sibling, 0 replies; 95+ messages in thread
From: Tian, Kevin @ 2024-08-30  7:55 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel@lists.linux.dev, Hanjun Guo,
	iommu@lists.linux.dev, Joerg Roedel, kvm@vger.kernel.org,
	Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Moore, Robert, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Shameerali Kolothum Thodi, Mostafa Saleh

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 27, 2024 11:52 PM
> 
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> For virtualization cases the IDR/IIDR/AIDR values of the actual SMMU
> instance need to be available to the VMM so it can construct an
> appropriate vSMMUv3 that reflects the correct HW capabilities.
> 
> For userspace page tables these values are required to constrain the valid
> values within the CD table and the IOPTEs.
> 
> The kernel does not sanitize these values. If building a VMM then
> userspace is required to only forward bits into a VM that it knows it can
> implement. Some bits will also require a VMM to detect if appropriate
> kernel support is available such as for ATS and BTM.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-30  7:44   ` Tian, Kevin
@ 2024-08-30  7:56     ` Nicolin Chen
  2024-08-30  8:01       ` Tian, Kevin
  0 siblings, 1 reply; 95+ messages in thread
From: Nicolin Chen @ 2024-08-30  7:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev, Hanjun Guo,
	iommu@lists.linux.dev, Joerg Roedel, kvm@vger.kernel.org,
	Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Moore, Robert, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches@lists.linux.dev,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Fri, Aug 30, 2024 at 07:44:35AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, August 27, 2024 11:52 PM
> >
> > @@ -4189,6 +4193,13 @@ static int arm_smmu_device_hw_probe(struct
> > arm_smmu_device *smmu)
> >
> >       /* IDR3 */
> >       reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> > +     /*
> > +      * If for some reason the HW does not support DMA coherency then
> > using
> > +      * S2FWB won't work. This will also disable nesting support.
> > +      */
> > +     if (FIELD_GET(IDR3_FWB, reg) &&
> > +         (smmu->features & ARM_SMMU_FEAT_COHERENCY))
> > +             smmu->features |= ARM_SMMU_FEAT_S2FWB;
> >       if (FIELD_GET(IDR3_RIL, reg))
> >               smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
> 
> then also clear ARM_SMMU_FEAT_NESTING?

S2FWB isn't the only HW option for nesting. Please refer to patch 8:
https://lore.kernel.org/linux-iommu/8-v2-621370057090+91fec-smmuv3_nesting_jgg@nvidia.com/

+static struct iommu_domain *
+arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
[...]
+	/*
+	 * Must support some way to prevent the VM from bypassing the cache
+	 * because VFIO currently does not do any cache maintenance.
+	 */
+	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
+	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
+		return ERR_PTR(-EOPNOTSUPP);

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-27 15:51 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT Jason Gunthorpe
  2024-08-27 20:16   ` Nicolin Chen
@ 2024-08-30  7:58   ` Tian, Kevin
  2024-08-30 13:55     ` Jason Gunthorpe
  2024-08-30 15:27   ` Mostafa Saleh
  2 siblings, 1 reply; 95+ messages in thread
From: Tian, Kevin @ 2024-08-30  7:58 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel@lists.linux.dev, Hanjun Guo,
	iommu@lists.linux.dev, Joerg Roedel, kvm@vger.kernel.org,
	Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Moore, Robert, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Shameerali Kolothum Thodi, Mostafa Saleh

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 27, 2024 11:52 PM
> 
> For SMMUv3 the parent must be a S2 domain, which can be composed
> into a IOMMU_DOMAIN_NESTED.
> 
> In future the S2 parent will also need a VMID linked to the VIOMMU and
> even to KVM.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>, with a nit:

> @@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>  			   const struct iommu_user_data *user_data)
>  {
>  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> -	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> +	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
> +				 IOMMU_HWPT_ALLOC_NEST_PARENT;

lowercase for variable name.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-30  7:56     ` Nicolin Chen
@ 2024-08-30  8:01       ` Tian, Kevin
  0 siblings, 0 replies; 95+ messages in thread
From: Tian, Kevin @ 2024-08-30  8:01 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev, Hanjun Guo,
	iommu@lists.linux.dev, Joerg Roedel, kvm@vger.kernel.org,
	Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Moore, Robert, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches@lists.linux.dev,
	Shameerali Kolothum Thodi, Mostafa Saleh

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, August 30, 2024 3:56 PM
> 
> On Fri, Aug 30, 2024 at 07:44:35AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, August 27, 2024 11:52 PM
> > >
> > > @@ -4189,6 +4193,13 @@ static int arm_smmu_device_hw_probe(struct
> > > arm_smmu_device *smmu)
> > >
> > >       /* IDR3 */
> > >       reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> > > +     /*
> > > +      * If for some reason the HW does not support DMA coherency then
> > > using
> > > +      * S2FWB won't work. This will also disable nesting support.
> > > +      */
> > > +     if (FIELD_GET(IDR3_FWB, reg) &&
> > > +         (smmu->features & ARM_SMMU_FEAT_COHERENCY))
> > > +             smmu->features |= ARM_SMMU_FEAT_S2FWB;
> > >       if (FIELD_GET(IDR3_RIL, reg))
> > >               smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
> >
> > then also clear ARM_SMMU_FEAT_NESTING?
> 
> S2FWB isn't the only HW option for nesting. Pls refer to PATCH-8:
> https://lore.kernel.org/linux-iommu/8-v2-621370057090+91fec-smmuv3_nesting_jgg@nvidia.com/
> 
> +static struct iommu_domain *
> +arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
> [...]
> +	/*
> +	 * Must support some way to prevent the VM from bypassing the cache
> +	 * because VFIO currently does not do any cache maintenance.
> +	 */
> +	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
> +	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
> +		return ERR_PTR(-EOPNOTSUPP);
> 

Yes, but if we guard the setting of the nesting bit on those
conditions, then the code in other paths becomes simpler because it
only needs to look at one bit.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-27 15:51 ` [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED Jason Gunthorpe
  2024-08-27 21:23   ` Nicolin Chen
@ 2024-08-30  8:16   ` Tian, Kevin
  2024-08-30 14:13     ` Jason Gunthorpe
  2024-08-30 14:39     ` Jason Gunthorpe
  2024-08-30 16:09   ` Mostafa Saleh
  2 siblings, 2 replies; 95+ messages in thread
From: Tian, Kevin @ 2024-08-30  8:16 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel@lists.linux.dev, Hanjun Guo,
	iommu@lists.linux.dev, Joerg Roedel, kvm@vger.kernel.org,
	Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Moore, Robert, Robin Murphy, Sudeep Holla,
	Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Shameerali Kolothum Thodi, Mostafa Saleh

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 27, 2024 11:52 PM
> 
> For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting
> as the parent and a user provided STE fragment that defines the CD table
> and related data with addresses translated by the S2 iommu_domain.
> 
> The kernel only permits userspace to control certain allowed bits of the
> STE that are safe for user/guest control.
> 
> IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
> translation, but there is no way of knowing which S1 entries refer to a
> range of S2.
> 
> For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
> flush all ASIDs from the VMID after flushing the S2 on any change to the
> S2.
> 
> Similarly we have to flush the entire ATC if the S2 is changed.

it's clearer to mention that ATS is not supported at this point. 

> @@ -2614,7 +2687,8 @@ arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
>  	list_for_each_entry(master_domain, &smmu_domain->devices,
>  			    devices_elm) {
>  		if (master_domain->master == master &&
> -		    master_domain->ssid == ssid)
> +		    master_domain->ssid == ssid &&
> +		    master_domain->nest_parent == nest_parent)
>  			return master_domain;
>  	}

There are two nest_parent flags, one in master_domain and one in
smmu_domain. Are they duplicates?

> +static struct iommu_domain *
> +arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
> +			      struct iommu_domain *parent,
> +			      const struct iommu_user_data *user_data)
> +{
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> +	struct arm_smmu_nested_domain *nested_domain;
> +	struct arm_smmu_domain *smmu_parent;
> +	struct iommu_hwpt_arm_smmuv3 arg;
> +	unsigned int eats;
> +	unsigned int cfg;
> +	int ret;
> +
> +	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	/*
> +	 * Must support some way to prevent the VM from bypassing the cache
> +	 * because VFIO currently does not do any cache maintenance.
> +	 */
> +	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
> +	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
> +		return ERR_PTR(-EOPNOTSUPP);

this can be saved if we guard the setting of NESTING upon them.

> +
> +	ret = iommu_copy_struct_from_user(&arg, user_data,
> +					  IOMMU_HWPT_DATA_ARM_SMMUV3, ste);
> +	if (ret)
> +		return ERR_PTR(ret);

Prefer allocating resources after the static condition checks below.

> +
> +	if (flags || !(master->smmu->features & ARM_SMMU_FEAT_TRANS_S1))
> +		return ERR_PTR(-EOPNOTSUPP);

Is it possible when NESTING is supported?

> +
> +	if (!(parent->type & __IOMMU_DOMAIN_PAGING))
> +		return ERR_PTR(-EINVAL);

Just check parent->nest_parent

> +
> +	smmu_parent = to_smmu_domain(parent);
> +	if (smmu_parent->stage != ARM_SMMU_DOMAIN_S2 ||
> +	    smmu_parent->smmu != master->smmu)
> +		return ERR_PTR(-EINVAL);

again S2 should be implied when parent->nest_parent is true.

> +
> +	/* EIO is reserved for invalid STE data. */
> +	if ((arg.ste[0] & ~STRTAB_STE_0_NESTING_ALLOWED) ||
> +	    (arg.ste[1] & ~STRTAB_STE_1_NESTING_ALLOWED))
> +		return ERR_PTR(-EIO);
> +
> +	cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(arg.ste[0]));
> +	if (cfg != STRTAB_STE_0_CFG_ABORT && cfg != STRTAB_STE_0_CFG_BYPASS &&
> +	    cfg != STRTAB_STE_0_CFG_S1_TRANS)
> +		return ERR_PTR(-EIO);

If vSTE is invalid those bits can be ignored?

> +
> +	eats = FIELD_GET(STRTAB_STE_1_EATS, le64_to_cpu(arg.ste[1]));
> +	if (eats != STRTAB_STE_1_EATS_ABT)
> +		return ERR_PTR(-EIO);
> +
> +	if (cfg != STRTAB_STE_0_CFG_S1_TRANS)
> +		eats = STRTAB_STE_1_EATS_ABT;

this check sounds redundant. If the last check passes then eats is
already set to _ABT. 

> 
> +/**
> + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 Context Descriptor Table info
> + *                                (IOMMU_HWPT_DATA_ARM_SMMUV3)
> + *
> + * @ste: The first two double words of the user space Stream Table Entry for
> + *       a user stage-1 Context Descriptor Table. Must be little-endian.
> + *       Allowed fields: (Refer to "5.2 Stream Table Entry" in SMMUv3 HW Spec)
> + *       - word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
> + *       - word-1: S1DSS, S1CIR, S1COR, S1CSH, S1STALLD

Not sure whether EATS should be documented here or not. It's handled 
but must be ZERO at this point.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-29 16:10               ` Nicolin Chen
@ 2024-08-30  9:07                 ` Shameerali Kolothum Thodi
  2024-08-30 17:01                   ` Nicolin Chen
  0 siblings, 1 reply; 95+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-08-30  9:07 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	patches@lists.linux.dev, Mostafa Saleh



> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, August 29, 2024 5:10 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; acpica-devel@lists.linux.dev;
> Guohanjun (Hanjun Guo) <guohanjun@huawei.com>; iommu@lists.linux.dev;
> Joerg Roedel <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
> kvm@vger.kernel.org; Len Brown <lenb@kernel.org>; linux-
> acpi@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Lorenzo Pieralisi
> <lpieralisi@kernel.org>; Rafael J. Wysocki <rafael@kernel.org>; Robert Moore
> <robert.moore@intel.com>; Robin Murphy <robin.murphy@arm.com>; Sudeep
> Holla <sudeep.holla@arm.com>; Will Deacon <will@kernel.org>; Alex
> Williamson <alex.williamson@redhat.com>; Eric Auger
> <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-philippe@linaro.org>;
> Moritz Fischer <mdf@kernel.org>; Michael Shavit <mshavit@google.com>;
> patches@lists.linux.dev; Mostafa Saleh <smostafa@google.com>
> Subject: Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
> 
> On Thu, Aug 29, 2024 at 02:52:23PM +0000, Shameerali Kolothum Thodi wrote:
> > > That makes some progress. But still I am not seeing the assigned dev  in
> > > Guest.
> > >
> > > -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0
> > >
> > > root@ubuntu:/# lspci -tv#
> > >
> > > root@ubuntu:/# lspci -tv
> > > -+-[0000:ca]---00.0-[cb]--
> > >  \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
> > >              +-01.0  Red Hat, Inc Virtio network device
> > >              +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
> > >              \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge
> 
> Hmm, the tree looks correct..
> 
> > > The new root port is created, but no device attached.
> > It looks like Guest finds the config invalid:
> >
> > [    0.283618] PCI host bridge to bus 0000:ca
> > [    0.284064] ACPI BIOS Error (bug):
> \_SB.PCF7.PCEE.PCE5.PCDC.PCD3.PCCA._DSM: Excess arguments - ASL declared
> 5, ACPI requires 4 (20240322/nsarguments-162)
> 
> Looks like the DSM change wasn't clean. Yet, this might not be the
> root cause, as mine could boot with it.

Yes. This is not the culprit in this case and was reported earlier as well,

https://patchew.org/QEMU/20211005085313.493858-1-eric.auger@redhat.com/20211005085313.493858-2-eric.auger@redhat.com/

> Here is mine (I added a print to that conflict part, for success):
> 
> [    0.340733] ACPI BIOS Error (bug): \_SB.PCF7.PCEE.PCE5.PCDC._DSM: Excess
> arguments - ASL declared 5, ACPI requires 4 (20230628/nsarguments-162)
> [    0.341776] pci 0000:dc:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root
> Port
> [    0.344895] pci 0000:dc:00.0: BAR 0 [mem 0x10400000-0x10400fff]
> [    0.347935] pci 0000:dc:00.0: PCI bridge to [bus dd]
> [    0.348410] pci 0000:dc:00.0:   bridge window [mem 0x10200000-0x103fffff]
> [    0.349483] pci 0000:dc:00.0:   bridge window [mem 0x42000000000-
> 0x44080ffffff 64bit pref]
> [    0.351459] pci_bus 0000:dd: busn_res: insert [bus dd] under [bus dc-dd]
> 
> In my case:
> [root bus (00)] <---[pxb (dc)] <--- [root-port (dd)] <--- dev
> 
> In your case:
> [root bus (00)] <---[pxb (ca)] <--- [root-port (cb)] <--- dev
> 
> > [    0.285533] pci_bus 0000:ca: root bus resource [bus ca]
> > [    0.286214] pci 0000:ca:00.0: [1b36:000c] type 01 class 0x060400 PCIe Root
> Port
> > [    0.287717] pci 0000:ca:00.0: BAR 0 [mem 0x00000000-0x00000fff]
> > [    0.288431] pci 0000:ca:00.0: PCI bridge to [bus 00]
> 
> This starts to diff. Somehow the link is reversed? It should be:
>  [    0.288431] pci 0000:ca:00.0: PCI bridge to [bus cb]
> 
> > [    0.290649] pci 0000:ca:00.0: bridge configuration invalid ([bus 00-00]),
> reconfiguring
> > [    0.292476] pci_bus 0000:cb: busn_res: can not insert [bus cb-ca] under [bus
> ca] (conflicts with (null) [bus ca])
> > [    0.293597] pci_bus 0000:cb: busn_res: [bus cb-ca] end is updated to cb
> > [    0.294300] pci_bus 0000:cb: busn_res: can not insert [bus cb] under [bus ca]
> (conflicts with (null) [bus ca])
> 
> And then everything went south...
> 
> Would you please try adding some prints?
> ----------------------------------------------------------------------
> @@ -1556,6 +1556,7 @@ static char *create_new_pcie_port(VirtNestedSmmu
> *nested_smmu, Error **errp)
>      uint32_t bus_nr = pci_bus_num(nested_smmu->pci_bus);
>      DeviceState *dev;
>      char *name_port;
> +    bool ret;
> 
>      /* Create a root port */
>      dev = qdev_new("pcie-root-port");
> @@ -1571,7 +1572,9 @@ static char *create_new_pcie_port(VirtNestedSmmu
> *nested_smmu, Error **errp)
>      qdev_prop_set_uint32(dev, "chassis", chassis_nr);
>      qdev_prop_set_uint32(dev, "slot", port_nr);
>      qdev_prop_set_uint64(dev, "io-reserve", 0);
> -    qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus), &error_fatal);
> +    ret = qdev_realize_and_unref(dev, BUS(nested_smmu->pci_bus),
> &error_fatal);
> +    fprintf(stderr, "ret=%d, pcie-root-port ID: %s, added to pxb_bus num: %x,
> chassis: %d\n",
> +            ret, name_port, pci_bus_num(nested_smmu->pci_bus), chassis_nr);
>      return name_port;
>  }

Print shows everything fine:
create_new_pcie_port: name_port smmu_bus0xca_port0, bus_nr 0xca chassis_nr 0xfd, nested_smmu->index 0x2, pci_bus_num 0xca, ret 1

It looks like a problem with old QEMU_EFI.fd builds (2022 and earlier).
I tried a 2023 QEMU_EFI.fd and with that it looks fine.

root@ubuntu:/# lspci -tv
-+-[0000:ca]---00.0-[cb]----00.0  Huawei Technologies Co., Ltd. Device a251
 \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
             +-01.0  Red Hat, Inc Virtio network device
             +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
             +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
             \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge

So for now, I can proceed.

Thanks,
Shameer




^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 4/8] ACPI/IORT: Support CANWBS memory access flag
  2024-08-30  7:52   ` Tian, Kevin
@ 2024-08-30 13:54     ` Jason Gunthorpe
  2024-09-03  7:14       ` Tian, Kevin
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 13:54 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: acpica-devel@lists.linux.dev, Hanjun Guo, iommu@lists.linux.dev,
	Joerg Roedel, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Moore, Robert, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches@lists.linux.dev, Shameerali Kolothum Thodi,
	Mostafa Saleh

On Fri, Aug 30, 2024 at 07:52:41AM +0000, Tian, Kevin wrote:

> But according to above description S2FWB cannot 100% guarantee it
> due to PCI No Snoop. Does it suggest that we should allow nesting
> only for CANWBS, or disable/hide PCI No Snoop cap from the guest
> in case of S2FWB?

ARM has always had an issue with no-snoop and VFIO. The ARM
expectation is that VFIO/VMM would block no-snoop in the PCI config
space.

From a VM perspective, any VMM on ARM has to take care to do this
today already.

For instance a VMM could choose to only assign devices which never use
no-snoop, which describes almost all of what people actually do :)

The purpose of S2FWB is to keep that approach working. If the VMM has
blocked no-snoop then S2FWB ensures that the VM can't use IOPTE bits
to break cacheability and it remains safe.

From a VFIO perspective ARM has always had a security hole similar to
what Yan is trying to fix on Intel; that is a separate pre-existing
topic. Ideally the VFIO kernel would block PCI config space no-snoop
for a lot of cases.
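
As a rough illustration of what "blocking no-snoop in the PCI config space"
can mean, the sketch below filters a guest write to the PCIe Device Control
register so that the Enable No Snoop bit (bit 11) never reaches the device.
The function and macro names are hypothetical and this is not taken from any
particular VMM or from VFIO; it only shows the masking step, under the
assumption that the VMM traps writes to that register.

```c
#include <stdint.h>

/* PCIe Device Control register, Enable No Snoop bit (bit 11). */
#define SKETCH_DEVCTL_NOSNOOP_EN 0x0800u

/*
 * Hypothetical filter applied to a trapped guest write of Device Control.
 * Every other bit is passed through unchanged; Enable No Snoop is forced
 * to 0 so the function is not permitted to set No Snoop on its requests.
 */
uint16_t sketch_filter_devctl_write(uint16_t guest_value)
{
	return (uint16_t)(guest_value & ~SKETCH_DEVCTL_NOSNOOP_EN);
}
```

A guest write of 0x0810 would be forwarded as 0x0010 in this model. Whether
a given device honours the cleared bit is a separate question that this
sketch does not address.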

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-30  7:58   ` Tian, Kevin
@ 2024-08-30 13:55     ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 13:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: acpica-devel@lists.linux.dev, Hanjun Guo, iommu@lists.linux.dev,
	Joerg Roedel, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Moore, Robert, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches@lists.linux.dev, Shameerali Kolothum Thodi,
	Mostafa Saleh

On Fri, Aug 30, 2024 at 07:58:09AM +0000, Tian, Kevin wrote:
> > @@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev,
> > u32 flags,
> >  			   const struct iommu_user_data *user_data)
> >  {
> >  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > -	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> > +	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
> > +				 IOMMU_HWPT_ALLOC_NEST_PARENT;
> 
> lowercase for variable name.

Ah, but it is constant :) I have no idea if there is a style consensus
here

Thanks,
Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-30  8:16   ` Tian, Kevin
@ 2024-08-30 14:13     ` Jason Gunthorpe
  2024-08-30 14:39     ` Jason Gunthorpe
  1 sibling, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 14:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: acpica-devel@lists.linux.dev, Hanjun Guo, iommu@lists.linux.dev,
	Joerg Roedel, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Moore, Robert, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches@lists.linux.dev, Shameerali Kolothum Thodi,
	Mostafa Saleh

On Fri, Aug 30, 2024 at 08:16:27AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, August 27, 2024 11:52 PM
> > 
> > For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2
> > iommu_domain acting
> > as the parent and a user provided STE fragment that defines the CD table
> > and related data with addresses translated by the S2 iommu_domain.
> > 
> > The kernel only permits userspace to control certain allowed bits of the
> > STE that are safe for user/guest control.
> > 
> > IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
> > translation, but there is no way of knowing which S1 entries refer to a
> > range of S2.
> > 
> > For the IOTLB we follow ARM's guidance and issue a
> > CMDQ_OP_TLBI_NH_ALL to
> > flush all ASIDs from the VMID after flushing the S2 on any change to the
> > S2.
> > 
> > Similarly we have to flush the entire ATC if the S2 is changed.
> 
> it's clearer to mention that ATS is not supported at this point. 

As things have ended up we need the viommu series to come along with
this to enable the full feature, and viommu supports ATS invalidation.

Ideally I'd like to merge them both together.

> > @@ -2614,7 +2687,8 @@ arm_smmu_find_master_domain(struct
> > arm_smmu_domain *smmu_domain,
> >  	list_for_each_entry(master_domain, &smmu_domain->devices,
> >  			    devices_elm) {
> >  		if (master_domain->master == master &&
> > -		    master_domain->ssid == ssid)
> > +		    master_domain->ssid == ssid &&
> > +		    master_domain->nest_parent == nest_parent)
> >  			return master_domain;
> >  	}
> 
> there are two nest_parent flags in master_domain and smmu_domain.
> Probably duplicating?

Sort of, sort of not..

This is a bit awkward right now because the arm_smmu_domain is still
per-instance, so the domain->nest_parent exists to control flushing of
the VMID.

But we also have the per-attachment 'master_domain' struct, and there
the nest_parent controls flushing of the ATC.

In the end arm_smmu_domain will stop being per-instance and per-attach
'master_domain' would have the vmid and the nest_parent only. I'm
aiming for something like how VTD and RISCV are doing their flushing,
with a list of flush instructions attached to the domain.

So for now we have the in-between state where a S2 marked as parent
will avoid the ATC flush when directly attached to a RID but not the
ASID flush. Eventually we will be able to avoid both.

> > +static struct iommu_domain *
> > +arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
> > +			      struct iommu_domain *parent,
> > +			      const struct iommu_user_data *user_data)
> > +{
> > +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > +	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > +	struct arm_smmu_nested_domain *nested_domain;
> > +	struct arm_smmu_domain *smmu_parent;
> > +	struct iommu_hwpt_arm_smmuv3 arg;
> > +	unsigned int eats;
> > +	unsigned int cfg;
> > +	int ret;
> > +
> > +	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
> > +		return ERR_PTR(-EOPNOTSUPP);
> > +
> > +	/*
> > +	 * Must support some way to prevent the VM from bypassing the
> > cache
> > +	 * because VFIO currently does not do any cache maintenance.
> > +	 */
> > +	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
> > +	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
> > +		return ERR_PTR(-EOPNOTSUPP);
> 
> this can be saved if we guard the setting of NESTING upon them.

IOMMU_FWSPEC_PCI_RC_CANWBS is per-device, FEAT_NESTING is SMMU global,
they can't really be combined.
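
A small stand-alone model of that point, with hypothetical names and flag
values (not the driver's definitions): two devices behind the same SMMU can
get different answers, because one flag is read from the per-device fwspec
and the other from the SMMU-wide feature word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
#define SKETCH_FEAT_NESTING  (1u << 0)	/* SMMU-wide feature */
#define SKETCH_FEAT_S2FWB    (1u << 1)	/* SMMU-wide feature */
#define SKETCH_FWSPEC_CANWBS (1u << 0)	/* per-device firmware flag */

struct sketch_smmu {
	uint32_t features;
};

struct sketch_master {
	const struct sketch_smmu *smmu;
	uint32_t fwspec_flags;
};

/* Mirrors the shape of the two checks quoted above. */
bool sketch_can_alloc_nested(const struct sketch_master *master)
{
	if (!(master->smmu->features & SKETCH_FEAT_NESTING))
		return false;

	/* Either the device's root complex has CANWBS or the SMMU has S2FWB. */
	if (!(master->fwspec_flags & SKETCH_FWSPEC_CANWBS) &&
	    !(master->smmu->features & SKETCH_FEAT_S2FWB))
		return false;

	return true;
}
```

With an SMMU that has nesting but no S2FWB, a device flagged CANWBS passes
and a device without it does not, so a single SMMU-global bit could not
encode the result.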

> > +
> > +	ret = iommu_copy_struct_from_user(&arg, user_data,
> > +					  IOMMU_HWPT_DATA_ARM_SMMUV3, ste);
> > +	if (ret)
> > +		return ERR_PTR(ret);
> 
> prefer to allocating resource after static condition checks below.
> 
> > +
> > +	if (flags || !(master->smmu->features & ARM_SMMU_FEAT_TRANS_S1))
> > +		return ERR_PTR(-EOPNOTSUPP);
> 
> Is it possible when NESTING is supported?
> 
> > +
> > +	if (!(parent->type & __IOMMU_DOMAIN_PAGING))
> > +		return ERR_PTR(-EINVAL);
> 
> Just check parent->nest_parent
> 
> > +
> > +	smmu_parent = to_smmu_domain(parent);
> > +	if (smmu_parent->stage != ARM_SMMU_DOMAIN_S2 ||
> > +	    smmu_parent->smmu != master->smmu)
> > +		return ERR_PTR(-EINVAL);
> 
> again S2 should be implied when parent->nest_parent is true.

I think I did all of these for Nicolin

> > +
> > +	/* EIO is reserved for invalid STE data. */
> > +	if ((arg.ste[0] & ~STRTAB_STE_0_NESTING_ALLOWED) ||
> > +	    (arg.ste[1] & ~STRTAB_STE_1_NESTING_ALLOWED))
> > +		return ERR_PTR(-EIO);
> > +
> > +	cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(arg.ste[0]));
> > +	cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(arg.ste[0]));
> > +	if (cfg != STRTAB_STE_0_CFG_ABORT && cfg != STRTAB_STE_0_CFG_BYPASS &&
> > +	    cfg != STRTAB_STE_0_CFG_S1_TRANS)
> > +		return ERR_PTR(-EIO);
> 
> If vSTE is invalid those bits can be ignored?

Yes, but also I was expecting the VMM to sanitize that.. Let's have
the kernel do it. Move the validation to a function and then:

static int arm_smmu_validate_vste(struct iommu_hwpt_arm_smmuv3 *arg,
				  unsigned int *eats)
{
	unsigned int cfg;

	if (!(arg->ste[0] & STRTAB_STE_0_V)) {
		memset(arg->ste, 0, sizeof(arg->ste));
		return 0;
	}
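
As a rough userspace model of the overall shape (not the actual patch), the whole check could look something like the sketch below. Everything in it is an assumption for illustration: the SK_* names are hypothetical, the Cfg encodings and the V/Cfg bit positions are my reading of the STE layout, the words are taken in host byte order rather than little-endian, and the allowed-bits masks are left out.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumed STE field layout; hypothetical names, host byte order. */
#define SK_STE_0_V		(1ULL << 0)
#define SK_STE_0_CFG_SHIFT	1
#define SK_STE_0_CFG_MASK	(0x7ULL << SK_STE_0_CFG_SHIFT)
#define SK_STE_0_CFG_ABORT	0
#define SK_STE_0_CFG_BYPASS	4
#define SK_STE_0_CFG_S1_TRANS	5
#define SK_STE_1_EATS_SHIFT	28
#define SK_STE_1_EATS_MASK	(0x3ULL << SK_STE_1_EATS_SHIFT)
#define SK_STE_1_EATS_ABT	0

/* Returns 0 if the two vSTE words are acceptable, -EIO otherwise. */
static int sketch_validate_vste(uint64_t ste[2])
{
	unsigned int cfg, eats;

	/* A non-valid vSTE carries no meaningful fields: normalize it. */
	if (!(ste[0] & SK_STE_0_V)) {
		memset(ste, 0, 2 * sizeof(ste[0]));
		return 0;
	}

	/* Only abort, bypass and stage-1 translation may be requested. */
	cfg = (ste[0] & SK_STE_0_CFG_MASK) >> SK_STE_0_CFG_SHIFT;
	if (cfg != SK_STE_0_CFG_ABORT && cfg != SK_STE_0_CFG_BYPASS &&
	    cfg != SK_STE_0_CFG_S1_TRANS)
		return -EIO;

	/* ATS is not handled in this model, so EATS must stay at abort. */
	eats = (ste[1] & SK_STE_1_EATS_MASK) >> SK_STE_1_EATS_SHIFT;
	if (eats != SK_STE_1_EATS_ABT)
		return -EIO;

	return 0;
}
```

The point of the early return is that garbage in a V=0 vSTE is zeroed instead of rejected, so the VMM does not have to sanitize it first.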


> > +
> > +	eats = FIELD_GET(STRTAB_STE_1_EATS, le64_to_cpu(arg.ste[1]));
> > +	if (eats != STRTAB_STE_1_EATS_ABT)
> > +		return ERR_PTR(-EIO);
> > +
> > +	if (cfg != STRTAB_STE_0_CFG_S1_TRANS)
> > +		eats = STRTAB_STE_1_EATS_ABT;
> 
> this check sounds redundant. If the last check passes then eats is
> already set to _ABT. 

Yes.. This hunk needs to go into this patch:

https://lore.kernel.org/linux-iommu/3962bef2ca6ab9bd06a52910f114345ecfe48ba6.1724776335.git.nicolinc@nvidia.com/T/#u

> > +/**
> > + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 Context Descriptor Table info
> > + *                                (IOMMU_HWPT_DATA_ARM_SMMUV3)
> > + *
> > + * @ste: The first two double words of the user space Stream Table Entry for
> > + *       a user stage-1 Context Descriptor Table. Must be little-endian.
> > + *       Allowed fields: (Refer to "5.2 Stream Table Entry" in SMMUv3 HW Spec)
> > + *       - word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
> > + *       - word-1: S1DSS, S1CIR, S1COR, S1CSH, S1STALLD
> 
> Not sure whether EATS should be documented here or not. It's handled 
> but must be ZERO at this point.

Let's put it in the above patch

Thanks,
Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-30  8:16   ` Tian, Kevin
  2024-08-30 14:13     ` Jason Gunthorpe
@ 2024-08-30 14:39     ` Jason Gunthorpe
  1 sibling, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 14:39 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: acpica-devel@lists.linux.dev, Hanjun Guo, iommu@lists.linux.dev,
	Joerg Roedel, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Moore, Robert, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches@lists.linux.dev, Shameerali Kolothum Thodi,
	Mostafa Saleh

On Fri, Aug 30, 2024 at 08:16:27AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, August 27, 2024 11:52 PM
> > 
> > For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting
> > as the parent and a user provided STE fragment that defines the CD table
> > and related data with addresses translated by the S2 iommu_domain.
> > 
> > The kernel only permits userspace to control certain allowed bits of the
> > STE that are safe for user/guest control.
> > 
> > IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
> > translation, but there is no way of knowing which S1 entries refer to a
> > range of S2.
> > 
> > For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
> > flush all ASIDs from the VMID after flushing the S2 on any change to the
> > S2.
> > 
> > Similarly we have to flush the entire ATC if the S2 is changed.
> 
> it's clearer to mention that ATS is not supported at this point.

I will also move all of this stuff to the ATS enablement patch

> > @@ -2614,7 +2687,8 @@ arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
> >  	list_for_each_entry(master_domain, &smmu_domain->devices,
> >  			    devices_elm) {
> >  		if (master_domain->master == master &&
> > -		    master_domain->ssid == ssid)
> > +		    master_domain->ssid == ssid &&
> > +		    master_domain->nest_parent == nest_parent)
> >  			return master_domain;
> >  	}
> 
> there are two nest_parent flags in master_domain and smmu_domain.
> Probably duplicating?

Including this

And I will rename master_domain->nest_parent to master_domain->nested_ats_flush
and it will derive from nest_domain->enable_ats.

Which I think will be much clearer..

Thanks,
Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-27 15:51 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
                     ` (2 preceding siblings ...)
  2024-08-30  7:44   ` Tian, Kevin
@ 2024-08-30 15:12   ` Mostafa Saleh
  2024-08-30 16:40     ` Jason Gunthorpe
  2024-09-04 14:20   ` Shameerali Kolothum Thodi
  4 siblings, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-08-30 15:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

Hi Jason,

Sorry, I haven’t followed up on that, I was out for a while.

On Tue, Aug 27, 2024 at 12:51:32PM -0300, Jason Gunthorpe wrote:
> Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> works. When S2FWB is supported and enabled the IOPTE will force cachable
> access to IOMMU_CACHE memory when nesting with a S1 and deny cachable
> access otherwise.
> 
> When using a single stage of translation, a simple S2 domain, it doesn't
> change anything as it is just a different encoding for the existing mapping
> of the IOMMU protection flags to cachability attributes.
> 
> However, when used with a nested S1, FWB has the effect of preventing the
> guest from choosing a MemAttr in its S1 that would cause ordinary DMA to
> bypass the cache. Consistent with KVM we wish to deny the guest the
> ability to become incoherent with cached memory the hypervisor believes is
> cachable so we don't have to flush it.
> 
> Turn on S2FWB whenever the SMMU supports it and use it for all S2
> mappings.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 +++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  3 +++
>  drivers/iommu/io-pgtable-arm.c              | 27 +++++++++++++++++----
>  include/linux/io-pgtable.h                  |  2 ++
>  4 files changed, 38 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 531125f231b662..e2b97ad6d74b03 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1612,6 +1612,8 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
>  		FIELD_PREP(STRTAB_STE_1_EATS,
>  			   ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
>  
> +	if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +		target->data[1] |= cpu_to_le64(STRTAB_STE_1_S2FWB);
>  	if (smmu->features & ARM_SMMU_FEAT_ATTR_TYPES_OVR)
>  		target->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
>  							  STRTAB_STE_1_SHCFG_INCOMING));
> @@ -2400,6 +2402,8 @@ static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
>  		pgtbl_cfg.oas = smmu->oas;
>  		fmt = ARM_64_LPAE_S2;
>  		finalise_stage_fn = arm_smmu_domain_finalise_s2;
> +		if (smmu->features & ARM_SMMU_FEAT_S2FWB)
> +			pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_ARM_S2FWB;
>  		break;
>  	default:
>  		return -EINVAL;
> @@ -4189,6 +4193,13 @@ static int arm_smmu_device_hw_probe(struct arm_smmu_device *smmu)
>  
>  	/* IDR3 */
>  	reg = readl_relaxed(smmu->base + ARM_SMMU_IDR3);
> +	/*
> +	 * If for some reason the HW does not support DMA coherency then using
> +	 * S2FWB won't work. This will also disable nesting support.
> +	 */
> +	if (FIELD_GET(IDR3_FWB, reg) &&
> +	    (smmu->features & ARM_SMMU_FEAT_COHERENCY))
> +		smmu->features |= ARM_SMMU_FEAT_S2FWB;
I think that’s checking the coherency of the SMMU itself, which in theory is
unrelated to the coherency of the master whose attributes FWB overrides, so
this check is not correct.

What I meant in the previous thread was that we should set FWB only for
coherent masters, for example (in attach s2):
	if ((smmu->features & ARM_SMMU_FEAT_S2FWB) &&
	    dev_is_dma_coherent(master->dev))
		// set S2FWB in STE
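
To spell that out a bit more, a self-contained sketch of the per-master decision is below. The struct and the sk_ names are hypothetical stand-ins; only the two bit values are copied from this patch, and the dma_coherent field stands in for whatever dev_is_dma_coherent() would report:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in this patch; the SK_ names are stand-ins. */
#define SK_FEAT_S2FWB	(1u << 23)	/* ARM_SMMU_FEAT_S2FWB */
#define SK_STE_1_S2FWB	(1ULL << 25)	/* STRTAB_STE_1_S2FWB */

/* Hypothetical per-master state. */
struct sk_master {
	bool dma_coherent;
};

/*
 * Compute the S2FWB contribution to STE word 1 for one master: the bit is
 * set only when the SMMU implements FWB and this master is DMA coherent.
 */
static uint64_t sk_s2_ste1_fwb(uint32_t smmu_features,
			       const struct sk_master *master)
{
	if ((smmu_features & SK_FEAT_S2FWB) && master->dma_coherent)
		return SK_STE_1_S2FWB;
	return 0;
}
```

This only covers the STE side; how the S2 page table MemAttr encoding would follow a per-master choice is not addressed by the sketch.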

Thanks,
Mostafa
>  	if (FIELD_GET(IDR3_RIL, reg))
>  		smmu->features |= ARM_SMMU_FEAT_RANGE_INV;
>  
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 8851a7abb5f0f3..7e8d2f36faebf3 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -55,6 +55,7 @@
>  #define IDR1_SIDSIZE			GENMASK(5, 0)
>  
>  #define ARM_SMMU_IDR3			0xc
> +#define IDR3_FWB			(1 << 8)
>  #define IDR3_RIL			(1 << 10)
>  
>  #define ARM_SMMU_IDR5			0x14
> @@ -258,6 +259,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
>  #define STRTAB_STE_1_S1CSH		GENMASK_ULL(7, 6)
>  
>  #define STRTAB_STE_1_S1STALLD		(1UL << 27)
> +#define STRTAB_STE_1_S2FWB		(1UL << 25)
>  
>  #define STRTAB_STE_1_EATS		GENMASK_ULL(29, 28)
>  #define STRTAB_STE_1_EATS_ABT		0UL
> @@ -700,6 +702,7 @@ struct arm_smmu_device {
>  #define ARM_SMMU_FEAT_ATTR_TYPES_OVR	(1 << 20)
>  #define ARM_SMMU_FEAT_HA		(1 << 21)
>  #define ARM_SMMU_FEAT_HD		(1 << 22)
> +#define ARM_SMMU_FEAT_S2FWB		(1 << 23)
>  	u32				features;
>  
>  #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index f5d9fd1f45bf49..9b3658aae21005 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -106,6 +106,18 @@
>  #define ARM_LPAE_PTE_HAP_FAULT		(((arm_lpae_iopte)0) << 6)
>  #define ARM_LPAE_PTE_HAP_READ		(((arm_lpae_iopte)1) << 6)
>  #define ARM_LPAE_PTE_HAP_WRITE		(((arm_lpae_iopte)2) << 6)
> +/*
> + * For !FWB these code to:
> + *  1111 = Normal outer write back cachable / Inner Write Back Cachable
> + *         Permit S1 to override
> + *  0101 = Normal Non-cachable / Inner Non-cachable
> + *  0001 = Device / Device-nGnRE
> + * For S2FWB these code:
> + *  0110 Force Normal Write Back
> + *  0101 Normal* is forced Normal-NC, Device unchanged
> + *  0001 Force Device-nGnRE
> + */
> +#define ARM_LPAE_PTE_MEMATTR_FWB_WB	(((arm_lpae_iopte)0x6) << 2)
>  #define ARM_LPAE_PTE_MEMATTR_OIWB	(((arm_lpae_iopte)0xf) << 2)
>  #define ARM_LPAE_PTE_MEMATTR_NC		(((arm_lpae_iopte)0x5) << 2)
>  #define ARM_LPAE_PTE_MEMATTR_DEV	(((arm_lpae_iopte)0x1) << 2)
> @@ -458,12 +470,16 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
>  	 */
>  	if (data->iop.fmt == ARM_64_LPAE_S2 ||
>  	    data->iop.fmt == ARM_32_LPAE_S2) {
> -		if (prot & IOMMU_MMIO)
> +		if (prot & IOMMU_MMIO) {
>  			pte |= ARM_LPAE_PTE_MEMATTR_DEV;
> -		else if (prot & IOMMU_CACHE)
> -			pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> -		else
> +		} else if (prot & IOMMU_CACHE) {
> +			if (data->iop.cfg.quirks & IO_PGTABLE_QUIRK_ARM_S2FWB)
> +				pte |= ARM_LPAE_PTE_MEMATTR_FWB_WB;
> +			else
> +				pte |= ARM_LPAE_PTE_MEMATTR_OIWB;
> +		} else {
>  			pte |= ARM_LPAE_PTE_MEMATTR_NC;
> +		}
>  	} else {
>  		if (prot & IOMMU_MMIO)
>  			pte |= (ARM_LPAE_MAIR_ATTR_IDX_DEV
> @@ -932,7 +948,8 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
>  	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
>  			    IO_PGTABLE_QUIRK_ARM_TTBR1 |
>  			    IO_PGTABLE_QUIRK_ARM_OUTER_WBWA |
> -			    IO_PGTABLE_QUIRK_ARM_HD))
> +			    IO_PGTABLE_QUIRK_ARM_HD |
> +			    IO_PGTABLE_QUIRK_ARM_S2FWB))
>  		return NULL;
>  
>  	data = arm_lpae_alloc_pgtable(cfg);
> diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
> index f9a81761bfceda..aff9b020b6dcc7 100644
> --- a/include/linux/io-pgtable.h
> +++ b/include/linux/io-pgtable.h
> @@ -87,6 +87,7 @@ struct io_pgtable_cfg {
>  	 *	attributes set in the TCR for a non-coherent page-table walker.
>  	 *
>  	 * IO_PGTABLE_QUIRK_ARM_HD: Enables dirty tracking in stage 1 pagetable.
> +	 * IO_PGTABLE_QUIRK_ARM_S2FWB: Use the FWB format for the MemAttrs bits
>  	 */
>  	#define IO_PGTABLE_QUIRK_ARM_NS			BIT(0)
>  	#define IO_PGTABLE_QUIRK_NO_PERMS		BIT(1)
> @@ -95,6 +96,7 @@ struct io_pgtable_cfg {
>  	#define IO_PGTABLE_QUIRK_ARM_TTBR1		BIT(5)
>  	#define IO_PGTABLE_QUIRK_ARM_OUTER_WBWA		BIT(6)
>  	#define IO_PGTABLE_QUIRK_ARM_HD			BIT(7)
> +	#define IO_PGTABLE_QUIRK_ARM_S2FWB		BIT(8)
>  	unsigned long			quirks;
>  	unsigned long			pgsize_bitmap;
>  	unsigned int			ias;
> -- 
> 2.46.0
> 


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 5/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  2024-08-27 15:51 ` [PATCH v2 5/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS Jason Gunthorpe
  2024-08-27 20:12   ` Nicolin Chen
@ 2024-08-30 15:19   ` Mostafa Saleh
  2024-08-30 17:10     ` Jason Gunthorpe
  1 sibling, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-08-30 15:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

Hi Jason,

On Tue, Aug 27, 2024 at 12:51:35PM -0300, Jason Gunthorpe wrote:
> HW with CANWBS is always cache coherent and ignores PCI No Snoop requests
> as well. This meets the requirement for IOMMU_CAP_ENFORCE_CACHE_COHERENCY,
> so let's return it.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Reviewed-by: Mostafa Saleh <smostafa@google.com>

> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 35 +++++++++++++++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
>  2 files changed, 36 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index e2b97ad6d74b03..c2021e821e5cb6 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2253,6 +2253,9 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
>  	case IOMMU_CAP_CACHE_COHERENCY:
>  		/* Assume that a coherent TCU implies coherent TBUs */
>  		return master->smmu->features & ARM_SMMU_FEAT_COHERENCY;
> +	case IOMMU_CAP_ENFORCE_CACHE_COHERENCY:
> +		return dev_iommu_fwspec_get(dev)->flags &
> +		       IOMMU_FWSPEC_PCI_RC_CANWBS;
>  	case IOMMU_CAP_NOEXEC:
>  	case IOMMU_CAP_DEFERRED_FLUSH:
>  		return true;
> @@ -2263,6 +2266,28 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
>  	}
>  }
>  
> +static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
> +{
> +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> +	struct arm_smmu_master_domain *master_domain;
> +	unsigned long flags;
> +	bool ret = false;
nit: we can avoid the goto if we invert the logic of ret (initialise it to
true and set it to false if a device doesn't support CANWBS)
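
Roughly what I have in mind, as a userspace model only: the list walk is replaced by an array, the spinlock is reduced to comments, and all sk_ names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_FWSPEC_CANWBS	(1u << 0)	/* hypothetical per-device flag */

struct sk_domain {
	const uint32_t *master_flags;	/* fwspec flags of attached masters */
	size_t nr_masters;
	bool enforce_cache_coherency;
};

static bool sk_enforce_cache_coherency(struct sk_domain *domain)
{
	bool ret = true;
	size_t i;

	/* The driver would take devices_lock here. */
	for (i = 0; i < domain->nr_masters; i++) {
		if (!(domain->master_flags[i] & SK_FWSPEC_CANWBS)) {
			ret = false;
			break;
		}
	}
	if (ret)
		domain->enforce_cache_coherency = true;
	/* ... and release devices_lock here. */

	return ret;
}
```

As in the original, a domain with no attached devices still succeeds and latches the flag, so later attaches are the ones that get rejected.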

> +	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> +	list_for_each_entry(master_domain, &smmu_domain->devices,
> +			    devices_elm) {
> +		if (!(dev_iommu_fwspec_get(master_domain->master->dev)->flags &
> +		      IOMMU_FWSPEC_PCI_RC_CANWBS))
> +			goto out;
> +	}
> +
> +	smmu_domain->enforce_cache_coherency = true;
> +	ret = true;
> +out:
> +	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
> +	return ret;
> +}
> +
>  struct arm_smmu_domain *arm_smmu_domain_alloc(void)
>  {
>  	struct arm_smmu_domain *smmu_domain;
> @@ -2693,6 +2718,15 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
>  		 * one of them.
>  		 */
>  		spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> +		if (smmu_domain->enforce_cache_coherency &&
> +		    !(dev_iommu_fwspec_get(master->dev)->flags &
> +		      IOMMU_FWSPEC_PCI_RC_CANWBS)) {
> +			kfree(master_domain);
> +			spin_unlock_irqrestore(&smmu_domain->devices_lock,
> +					       flags);
> +			return -EINVAL;
> +		}
> +
>  		if (state->ats_enabled)
>  			atomic_inc(&smmu_domain->nr_ats_masters);
>  		list_add(&master_domain->devices_elm, &smmu_domain->devices);
> @@ -3450,6 +3484,7 @@ static struct iommu_ops arm_smmu_ops = {
>  	.owner			= THIS_MODULE,
>  	.default_domain_ops = &(const struct iommu_domain_ops) {
>  		.attach_dev		= arm_smmu_attach_dev,
> +		.enforce_cache_coherency = arm_smmu_enforce_cache_coherency,
>  		.set_dev_pasid		= arm_smmu_s1_set_dev_pasid,
>  		.map_pages		= arm_smmu_map_pages,
>  		.unmap_pages		= arm_smmu_unmap_pages,
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 7e8d2f36faebf3..45882f65bfcad0 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -787,6 +787,7 @@ struct arm_smmu_domain {
>  	/* List of struct arm_smmu_master_domain */
>  	struct list_head		devices;
>  	spinlock_t			devices_lock;
> +	bool				enforce_cache_coherency : 1;
>  
>  	struct mmu_notifier		mmu_notifier;
>  };
> -- 
> 2.46.0
> 


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-08-27 15:51 ` [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info Jason Gunthorpe
  2024-08-30  7:55   ` Tian, Kevin
@ 2024-08-30 15:23   ` Mostafa Saleh
  2024-08-30 17:16     ` Jason Gunthorpe
  1 sibling, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-08-30 15:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

Hi Jason,

On Tue, Aug 27, 2024 at 12:51:36PM -0300, Jason Gunthorpe wrote:
> From: Nicolin Chen <nicolinc@nvidia.com>
> 
> For virtualization cases the IDR/IIDR/AIDR values of the actual SMMU
> instance need to be available to the VMM so it can construct an
> appropriate vSMMUv3 that reflects the correct HW capabilities.
> 
> For userspace page tables these values are required to constrain the valid
> values within the CD table and the IOPTEs.
> 
> The kernel does not sanitize these values. If building a VMM then
> userspace is required to only forward bits into a VM that it knows it can
> implement. Some bits will also require a VMM to detect if appropriate
> kernel support is available such as for ATS and BTM.
> 
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 ++
>  include/uapi/linux/iommufd.h                | 35 +++++++++++++++++++++
>  3 files changed, 61 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index c2021e821e5cb6..ec2fcdd4523a26 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2288,6 +2288,29 @@ static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
>  	return ret;
>  }
>  
> +static void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
> +{
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct iommu_hw_info_arm_smmuv3 *info;
> +	u32 __iomem *base_idr;
> +	unsigned int i;
> +
> +	info = kzalloc(sizeof(*info), GFP_KERNEL);
> +	if (!info)
> +		return ERR_PTR(-ENOMEM);
> +
> +	base_idr = master->smmu->base + ARM_SMMU_IDR0;
> +	for (i = 0; i <= 5; i++)
> +		info->idr[i] = readl_relaxed(base_idr + i);
> +	info->iidr = readl_relaxed(master->smmu->base + ARM_SMMU_IIDR);
> +	info->aidr = readl_relaxed(master->smmu->base + ARM_SMMU_AIDR);
> +
> +	*length = sizeof(*info);
> +	*type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3;
> +
> +	return info;
> +}
> +
>  struct arm_smmu_domain *arm_smmu_domain_alloc(void)
>  {
>  	struct arm_smmu_domain *smmu_domain;
> @@ -3467,6 +3490,7 @@ static struct iommu_ops arm_smmu_ops = {
>  	.identity_domain	= &arm_smmu_identity_domain,
>  	.blocked_domain		= &arm_smmu_blocked_domain,
>  	.capable		= arm_smmu_capable,
> +	.hw_info		= arm_smmu_hw_info,
>  	.domain_alloc_paging    = arm_smmu_domain_alloc_paging,
>  	.domain_alloc_sva       = arm_smmu_sva_domain_alloc,
>  	.domain_alloc_user	= arm_smmu_domain_alloc_user,
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 45882f65bfcad0..4b05c81b181a82 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -80,6 +80,8 @@
>  #define IIDR_REVISION			GENMASK(15, 12)
>  #define IIDR_IMPLEMENTER		GENMASK(11, 0)
>  
> +#define ARM_SMMU_AIDR			0x1C
> +
>  #define ARM_SMMU_CR0			0x20
>  #define CR0_ATSCHK			(1 << 4)
>  #define CR0_CMDQEN			(1 << 3)
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 4dde745cfb7e29..83b6e1cd338d8f 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -484,15 +484,50 @@ struct iommu_hw_info_vtd {
>  	__aligned_u64 ecap_reg;
>  };
>  
> +/**
> + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
> + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> + *
> + * @flags: Must be set to 0
> + * @__reserved: Must be 0
> + * @idr: Implemented features for ARM SMMU Non-secure programming interface
> + * @iidr: Information about the implementation and implementer of ARM SMMU,
> + *        and architecture version supported
> + * @aidr: ARM SMMU architecture version
> + *
> + * For the details of @idr, @iidr and @aidr, please refer to the chapters
> + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> + *
> + * User space should read the underlying ARM SMMUv3 hardware information for
> + * the list of supported features.
> + *
> + * Note that these values reflect the raw HW capability, without any insight if
> + * any required kernel driver support is present. Bits may be set indicating the
> + * HW has functionality that is lacking kernel software support, such as BTM. If
> + * a VMM is using this information to construct emulated copies of these
> + * registers it should only forward bits that it knows it can support.
> + *
> + * In future, presence of required kernel support will be indicated in flags.
> + */
> +struct iommu_hw_info_arm_smmuv3 {
> +	__u32 flags;
> +	__u32 __reserved;
> +	__u32 idr[6];
> +	__u32 iidr;
> +	__u32 aidr;
> +};
There is a ton of information here. I think we might need to sanitize the
values down to what user space needs to know (that's why I was asking about
QEMU). Also, SMMU_IDR4 is implementation defined, so I am not sure we can
unconditionally expose it to userspace.
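
Wherever the filtering ends up living (kernel or VMM), I am thinking of something along these lines. This is only a sketch: the struct mirrors the register fields of the proposed uAPI struct, but the sk_ names are hypothetical and the allow-masks are left to the caller, since I am not proposing specific values here.

```c
#include <stdint.h>

/* Mirrors the register fields of the proposed iommu_hw_info_arm_smmuv3. */
struct sk_hw_info {
	uint32_t idr[6];
	uint32_t iidr;
	uint32_t aidr;
};

/*
 * Forward only the IDR bits named in the caller-supplied allow-masks.
 * Passing a zero mask for an index (for example IDR4) hides that register
 * from the guest entirely.
 */
static void sk_sanitize_idr(const struct sk_hw_info *hw,
			    const uint32_t allow[6], uint32_t out[6])
{
	unsigned int i;

	for (i = 0; i < 6; i++)
		out[i] = hw->idr[i] & allow[i];
}
```

An allow-list has the advantage that bits added by future HW revisions stay hidden until someone decides they are safe to forward.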

Thanks,
Mostafa
> +
>  /**
>   * enum iommu_hw_info_type - IOMMU Hardware Info Types
>   * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
>   *                           info
>   * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> + * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
>   */
>  enum iommu_hw_info_type {
>  	IOMMU_HW_INFO_TYPE_NONE = 0,
>  	IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
> +	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
>  };
>  
>  /**
> -- 
> 2.46.0
> 


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-27 15:51 ` [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT Jason Gunthorpe
  2024-08-27 20:16   ` Nicolin Chen
  2024-08-30  7:58   ` Tian, Kevin
@ 2024-08-30 15:27   ` Mostafa Saleh
  2024-08-30 17:18     ` Jason Gunthorpe
  2 siblings, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-08-30 15:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

Hi Jason,

On Tue, Aug 27, 2024 at 12:51:37PM -0300, Jason Gunthorpe wrote:
> For SMMUv3 the parent must be a S2 domain, which can be composed
> into a IOMMU_DOMAIN_NESTED.
> 
> In future the S2 parent will also need a VMID linked to the VIOMMU and
> even to KVM.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index ec2fcdd4523a26..8db3db6328f8b7 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>  			   const struct iommu_user_data *user_data)
>  {
>  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> -	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> +	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
> +				 IOMMU_HWPT_ALLOC_NEST_PARENT;
>  	struct arm_smmu_domain *smmu_domain;
>  	int ret;
>  
> @@ -3116,6 +3117,14 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>  	if (!smmu_domain)
>  		return ERR_PTR(-ENOMEM);
>  
> +	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT) {
> +		if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING)) {
> +			ret = -EOPNOTSUPP;
I think that should be:
	ret = ERR_PTR(-EOPNOTSUPP);

Thanks,
Mostafa
> +			goto err_free;
> +		}
> +		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> +	}
> +
>  	smmu_domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
>  	smmu_domain->domain.ops = arm_smmu_ops.default_domain_ops;
>  	ret = arm_smmu_domain_finalise(smmu_domain, master->smmu, flags);
> -- 
> 2.46.0
> 


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-27 15:51 ` [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED Jason Gunthorpe
  2024-08-27 21:23   ` Nicolin Chen
  2024-08-30  8:16   ` Tian, Kevin
@ 2024-08-30 16:09   ` Mostafa Saleh
  2024-08-30 16:59     ` Nicolin Chen
  2024-08-30 17:04     ` Jason Gunthorpe
  2 siblings, 2 replies; 95+ messages in thread
From: Mostafa Saleh @ 2024-08-30 16:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

Hi Jason,

On Tue, Aug 27, 2024 at 12:51:38PM -0300, Jason Gunthorpe wrote:
> For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting
> as the parent and a user provided STE fragment that defines the CD table
> and related data with addresses translated by the S2 iommu_domain.
> 
> The kernel only permits userspace to control certain allowed bits of the
> STE that are safe for user/guest control.
> 
> IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
> translation, but there is no way of knowing which S1 entries refer to a
> range of S2.
> 
> For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
> flush all ASIDs from the VMID after flushing the S2 on any change to the
> S2.
> 
> Similarly we have to flush the entire ATC if the S2 is changed.
> 

I am still reviewing this patch, but just some quick questions.

1) How does userspace do IOTLB maintenance for S1 in that case?

2) Is there a reason the UAPI is designed this way?
The way I imagined this is that userspace would pass a pointer to the CD
(+ format), not the STE (or part of it).
Having user space mess with the shareability and cacheability of S1 CD access
feels odd. (Although the CD configures page table access, which is similar.)
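
For reference, this is how I understand a VMM would have to build the fragment under the proposed UAPI. It is a sketch with several assumptions: the sk_ names are hypothetical, the word-0 bit positions (V at bit 0, Cfg at [3:1], S1Fmt at [5:4], S1ContextPtr at [51:6], S1CDMax at [63:59]) are my reading of the STE layout, word 1 is left at zero, and the values are in host byte order, so a real caller would still need the little-endian conversion.

```c
#include <stdint.h>

/* Assumed STE word-0 layout; hypothetical names, host byte order. */
#define SK_STE_0_V		(1ULL << 0)
#define SK_STE_0_CFG_S1_TRANS	(5ULL << 1)
#define SK_STE_0_S1FMT_SHIFT	4
#define SK_STE_0_S1CTXPTR_MASK	(((1ULL << 52) - 1) & ~0x3FULL)
#define SK_STE_0_S1CDMAX_SHIFT	59

/*
 * Build a stage-1 translating vSTE fragment from the guest's CD table
 * address (an IPA, translated by the S2), its format and log2(nr of CDs).
 */
static void sk_make_vste(uint64_t ste[2], uint64_t cd_table_ipa,
			 unsigned int s1fmt, unsigned int s1cdmax)
{
	ste[0] = SK_STE_0_V | SK_STE_0_CFG_S1_TRANS |
		 ((uint64_t)(s1fmt & 0x3) << SK_STE_0_S1FMT_SHIFT) |
		 (cd_table_ipa & SK_STE_0_S1CTXPTR_MASK) |
		 ((uint64_t)(s1cdmax & 0x1f) << SK_STE_0_S1CDMAX_SHIFT);
	/* S1DSS, S1CIR, S1COR, S1CSH and S1STALLD would go in word 1. */
	ste[1] = 0;
}
```

So the CD pointer and format are all in there, plus the word-1 attributes that prompted my question.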

Thanks,
Mostafa

> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 217 +++++++++++++++++++-
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  20 ++
>  include/uapi/linux/iommufd.h                |  20 ++
>  3 files changed, 250 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 8db3db6328f8b7..a21dce1f25cb95 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -295,6 +295,7 @@ static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
>  	case CMDQ_OP_TLBI_NH_ASID:
>  		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_ASID, ent->tlbi.asid);
>  		fallthrough;
> +	case CMDQ_OP_TLBI_NH_ALL:
>  	case CMDQ_OP_TLBI_S12_VMALL:
>  		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID, ent->tlbi.vmid);
>  		break;
> @@ -1640,6 +1641,59 @@ void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
>  }
>  EXPORT_SYMBOL_IF_KUNIT(arm_smmu_make_s2_domain_ste);
>  
> +static void arm_smmu_make_nested_cd_table_ste(
> +	struct arm_smmu_ste *target, struct arm_smmu_master *master,
> +	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
> +{
> +	arm_smmu_make_s2_domain_ste(target, master, nested_domain->s2_parent,
> +				    ats_enabled);
> +
> +	target->data[0] = cpu_to_le64(STRTAB_STE_0_V |
> +				      FIELD_PREP(STRTAB_STE_0_CFG,
> +						 STRTAB_STE_0_CFG_NESTED)) |
> +			  (nested_domain->ste[0] & ~STRTAB_STE_0_CFG);
> +	target->data[1] |= nested_domain->ste[1];
> +}
> +
> +/*
> + * Create a physical STE from the virtual STE that userspace provided when it
> + * created the nested domain. Using the vSTE userspace can request:
> + * - Non-valid STE
> + * - Abort STE
> + * - Bypass STE (install the S2, no CD table)
> + * - CD table STE (install the S2 and the userspace CD table)
> + */
> +static void arm_smmu_make_nested_domain_ste(
> +	struct arm_smmu_ste *target, struct arm_smmu_master *master,
> +	struct arm_smmu_nested_domain *nested_domain, bool ats_enabled)
> +{
> +	/*
> +	 * Userspace can request a non-valid STE through the nesting interface.
> +	 * We relay that into an abort physical STE with the intention that
> +	 * C_BAD_STE for this SID can be generated to userspace.
> +	 */
> +	if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V))) {
> +		arm_smmu_make_abort_ste(target);
> +		return;
> +	}
> +
> +	switch (FIELD_GET(STRTAB_STE_0_CFG,
> +			  le64_to_cpu(nested_domain->ste[0]))) {
> +	case STRTAB_STE_0_CFG_S1_TRANS:
> +		arm_smmu_make_nested_cd_table_ste(target, master, nested_domain,
> +						  ats_enabled);
> +		break;
> +	case STRTAB_STE_0_CFG_BYPASS:
> +		arm_smmu_make_s2_domain_ste(
> +			target, master, nested_domain->s2_parent, ats_enabled);
> +		break;
> +	case STRTAB_STE_0_CFG_ABORT:
> +	default:
> +		arm_smmu_make_abort_ste(target);
> +		break;
> +	}
> +}
> +
>  /*
>   * This can safely directly manipulate the STE memory without a sync sequence
>   * because the STE table has not been installed in the SMMU yet.
> @@ -2065,7 +2119,16 @@ int arm_smmu_atc_inv_domain(struct arm_smmu_domain *smmu_domain,
>  		if (!master->ats_enabled)
>  			continue;
>  
> -		arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size, &cmd);
> +		if (master_domain->nest_parent) {
> +			/*
> +			 * If a S2 used as a nesting parent is changed we have
> +			 * no option but to completely flush the ATC.
> +			 */
> +			arm_smmu_atc_inv_to_cmd(IOMMU_NO_PASID, 0, 0, &cmd);
> +		} else {
> +			arm_smmu_atc_inv_to_cmd(master_domain->ssid, iova, size,
> +						&cmd);
> +		}
>  
>  		for (i = 0; i < master->num_streams; i++) {
>  			cmd.atc.sid = master->streams[i].id;
> @@ -2192,6 +2255,16 @@ static void arm_smmu_tlb_inv_range_domain(unsigned long iova, size_t size,
>  	}
>  	__arm_smmu_tlb_inv_range(&cmd, iova, size, granule, smmu_domain);
>  
> +	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S2 &&
> +	    smmu_domain->nest_parent) {
> +		/*
> +		 * When the S2 domain changes all the nested S1 ASIDs have to be
> +		 * flushed too.
> +		 */
> +		cmd.opcode = CMDQ_OP_TLBI_NH_ALL;
> +		arm_smmu_cmdq_issue_cmd_with_sync(smmu_domain->smmu, &cmd);
> +	}
> +
>  	/*
>  	 * Unfortunately, this can't be leaf-only since we may have
>  	 * zapped an entire table.
> @@ -2604,8 +2677,8 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
>  
>  static struct arm_smmu_master_domain *
>  arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
> -			    struct arm_smmu_master *master,
> -			    ioasid_t ssid)
> +			    struct arm_smmu_master *master, ioasid_t ssid,
> +			    bool nest_parent)
>  {
>  	struct arm_smmu_master_domain *master_domain;
>  
> @@ -2614,7 +2687,8 @@ arm_smmu_find_master_domain(struct arm_smmu_domain *smmu_domain,
>  	list_for_each_entry(master_domain, &smmu_domain->devices,
>  			    devices_elm) {
>  		if (master_domain->master == master &&
> -		    master_domain->ssid == ssid)
> +		    master_domain->ssid == ssid &&
> +		    master_domain->nest_parent == nest_parent)
>  			return master_domain;
>  	}
>  	return NULL;
> @@ -2634,6 +2708,9 @@ to_smmu_domain_devices(struct iommu_domain *domain)
>  	if ((domain->type & __IOMMU_DOMAIN_PAGING) ||
>  	    domain->type == IOMMU_DOMAIN_SVA)
>  		return to_smmu_domain(domain);
> +	if (domain->type == IOMMU_DOMAIN_NESTED)
> +		return container_of(domain, struct arm_smmu_nested_domain,
> +				    domain)->s2_parent;
>  	return NULL;
>  }
>  
> @@ -2649,7 +2726,8 @@ static void arm_smmu_remove_master_domain(struct arm_smmu_master *master,
>  		return;
>  
>  	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
> -	master_domain = arm_smmu_find_master_domain(smmu_domain, master, ssid);
> +	master_domain = arm_smmu_find_master_domain(
> +		smmu_domain, master, ssid, domain->type == IOMMU_DOMAIN_NESTED);
>  	if (master_domain) {
>  		list_del(&master_domain->devices_elm);
>  		kfree(master_domain);
> @@ -2664,6 +2742,7 @@ struct arm_smmu_attach_state {
>  	struct iommu_domain *old_domain;
>  	struct arm_smmu_master *master;
>  	bool cd_needs_ats;
> +	bool disable_ats;
>  	ioasid_t ssid;
>  	/* Resulting state */
>  	bool ats_enabled;
> @@ -2716,7 +2795,8 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
>  		 * enabled if we have arm_smmu_domain, those always have page
>  		 * tables.
>  		 */
> -		state->ats_enabled = arm_smmu_ats_supported(master);
> +		state->ats_enabled = !state->disable_ats &&
> +				     arm_smmu_ats_supported(master);
>  	}
>  
>  	if (smmu_domain) {
> @@ -2725,6 +2805,8 @@ static int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
>  			return -ENOMEM;
>  		master_domain->master = master;
>  		master_domain->ssid = state->ssid;
> +		master_domain->nest_parent = new_domain->type ==
> +					       IOMMU_DOMAIN_NESTED;
>  
>  		/*
>  		 * During prepare we want the current smmu_domain and new
> @@ -3097,6 +3179,122 @@ static struct iommu_domain arm_smmu_blocked_domain = {
>  	.ops = &arm_smmu_blocked_ops,
>  };
>  
> +static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
> +				      struct device *dev)
> +{
> +	struct arm_smmu_nested_domain *nested_domain =
> +		container_of(domain, struct arm_smmu_nested_domain, domain);
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct arm_smmu_attach_state state = {
> +		.master = master,
> +		.old_domain = iommu_get_domain_for_dev(dev),
> +		.ssid = IOMMU_NO_PASID,
> +		/* Currently invalidation of ATC is not supported */
> +		.disable_ats = true,
> +	};
> +	struct arm_smmu_ste ste;
> +	int ret;
> +
> +	if (arm_smmu_ssids_in_use(&master->cd_table) ||
> +	    nested_domain->s2_parent->smmu != master->smmu)
> +		return -EINVAL;
> +
> +	mutex_lock(&arm_smmu_asid_lock);
> +	ret = arm_smmu_attach_prepare(&state, domain);
> +	if (ret) {
> +		mutex_unlock(&arm_smmu_asid_lock);
> +		return ret;
> +	}
> +
> +	arm_smmu_make_nested_domain_ste(&ste, master, nested_domain,
> +					state.ats_enabled);
> +	arm_smmu_install_ste_for_dev(master, &ste);
> +	arm_smmu_attach_commit(&state);
> +	mutex_unlock(&arm_smmu_asid_lock);
> +	return 0;
> +}
> +
> +static void arm_smmu_domain_nested_free(struct iommu_domain *domain)
> +{
> +	kfree(container_of(domain, struct arm_smmu_nested_domain, domain));
> +}
> +
> +static const struct iommu_domain_ops arm_smmu_nested_ops = {
> +	.attach_dev = arm_smmu_attach_dev_nested,
> +	.free = arm_smmu_domain_nested_free,
> +};
> +
> +static struct iommu_domain *
> +arm_smmu_domain_alloc_nesting(struct device *dev, u32 flags,
> +			      struct iommu_domain *parent,
> +			      const struct iommu_user_data *user_data)
> +{
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> +	struct arm_smmu_nested_domain *nested_domain;
> +	struct arm_smmu_domain *smmu_parent;
> +	struct iommu_hwpt_arm_smmuv3 arg;
> +	unsigned int eats;
> +	unsigned int cfg;
> +	int ret;
> +
> +	if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	/*
> +	 * Must support some way to prevent the VM from bypassing the cache
> +	 * because VFIO currently does not do any cache maintenance.
> +	 */
> +	if (!(fwspec->flags & IOMMU_FWSPEC_PCI_RC_CANWBS) &&
> +	    !(master->smmu->features & ARM_SMMU_FEAT_S2FWB))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	ret = iommu_copy_struct_from_user(&arg, user_data,
> +					  IOMMU_HWPT_DATA_ARM_SMMUV3, ste);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	if (flags || !(master->smmu->features & ARM_SMMU_FEAT_TRANS_S1))
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	if (!(parent->type & __IOMMU_DOMAIN_PAGING))
> +		return ERR_PTR(-EINVAL);
> +
> +	smmu_parent = to_smmu_domain(parent);
> +	if (smmu_parent->stage != ARM_SMMU_DOMAIN_S2 ||
> +	    smmu_parent->smmu != master->smmu)
> +		return ERR_PTR(-EINVAL);
> +
> +	/* EIO is reserved for invalid STE data. */
> +	if ((arg.ste[0] & ~STRTAB_STE_0_NESTING_ALLOWED) ||
> +	    (arg.ste[1] & ~STRTAB_STE_1_NESTING_ALLOWED))
> +		return ERR_PTR(-EIO);
> +
> +	cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(arg.ste[0]));
> +	if (cfg != STRTAB_STE_0_CFG_ABORT && cfg != STRTAB_STE_0_CFG_BYPASS &&
> +	    cfg != STRTAB_STE_0_CFG_S1_TRANS)
> +		return ERR_PTR(-EIO);
> +
> +	eats = FIELD_GET(STRTAB_STE_1_EATS, le64_to_cpu(arg.ste[1]));
> +	if (eats != STRTAB_STE_1_EATS_ABT)
> +		return ERR_PTR(-EIO);
> +
> +	if (cfg != STRTAB_STE_0_CFG_S1_TRANS)
> +		eats = STRTAB_STE_1_EATS_ABT;
> +
> +	nested_domain = kzalloc(sizeof(*nested_domain), GFP_KERNEL_ACCOUNT);
> +	if (!nested_domain)
> +		return ERR_PTR(-ENOMEM);
> +
> +	nested_domain->domain.type = IOMMU_DOMAIN_NESTED;
> +	nested_domain->domain.ops = &arm_smmu_nested_ops;
> +	nested_domain->s2_parent = smmu_parent;
> +	nested_domain->ste[0] = arg.ste[0];
> +	nested_domain->ste[1] = arg.ste[1] & ~cpu_to_le64(STRTAB_STE_1_EATS);
> +
> +	return &nested_domain->domain;
> +}
> +
>  static struct iommu_domain *
>  arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>  			   struct iommu_domain *parent,
> @@ -3108,9 +3306,13 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>  	struct arm_smmu_domain *smmu_domain;
>  	int ret;
>  
> +	if (parent)
> +		return arm_smmu_domain_alloc_nesting(dev, flags, parent,
> +						     user_data);
> +
>  	if (flags & ~PAGING_FLAGS)
>  		return ERR_PTR(-EOPNOTSUPP);
> -	if (parent || user_data)
> +	if (user_data)
>  		return ERR_PTR(-EOPNOTSUPP);
>  
>  	smmu_domain = arm_smmu_domain_alloc();
> @@ -3123,6 +3325,7 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
>  			goto err_free;
>  		}
>  		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> +		smmu_domain->nest_parent = true;
>  	}
>  
>  	smmu_domain->domain.type = IOMMU_DOMAIN_UNMANAGED;
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 4b05c81b181a82..b563cfedf22e91 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -240,6 +240,7 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
>  #define STRTAB_STE_0_CFG_BYPASS		4
>  #define STRTAB_STE_0_CFG_S1_TRANS	5
>  #define STRTAB_STE_0_CFG_S2_TRANS	6
> +#define STRTAB_STE_0_CFG_NESTED		7
>  
>  #define STRTAB_STE_0_S1FMT		GENMASK_ULL(5, 4)
>  #define STRTAB_STE_0_S1FMT_LINEAR	0
> @@ -291,6 +292,15 @@ static inline u32 arm_smmu_strtab_l2_idx(u32 sid)
>  
>  #define STRTAB_STE_3_S2TTB_MASK		GENMASK_ULL(51, 4)
>  
> +/* These bits can be controlled by userspace for STRTAB_STE_0_CFG_NESTED */
> +#define STRTAB_STE_0_NESTING_ALLOWED                                         \
> +	cpu_to_le64(STRTAB_STE_0_V | STRTAB_STE_0_CFG | STRTAB_STE_0_S1FMT | \
> +		    STRTAB_STE_0_S1CTXPTR_MASK | STRTAB_STE_0_S1CDMAX)
> +#define STRTAB_STE_1_NESTING_ALLOWED                            \
> +	cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |   \
> +		    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |   \
> +		    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_EATS)
> +
>  /*
>   * Context descriptors.
>   *
> @@ -508,6 +518,7 @@ struct arm_smmu_cmdq_ent {
>  			};
>  		} cfgi;
>  
> +		#define CMDQ_OP_TLBI_NH_ALL     0x10
>  		#define CMDQ_OP_TLBI_NH_ASID	0x11
>  		#define CMDQ_OP_TLBI_NH_VA	0x12
>  		#define CMDQ_OP_TLBI_EL2_ALL	0x20
> @@ -790,10 +801,18 @@ struct arm_smmu_domain {
>  	struct list_head		devices;
>  	spinlock_t			devices_lock;
>  	bool				enforce_cache_coherency : 1;
> +	bool				nest_parent : 1;
>  
>  	struct mmu_notifier		mmu_notifier;
>  };
>  
> +struct arm_smmu_nested_domain {
> +	struct iommu_domain domain;
> +	struct arm_smmu_domain *s2_parent;
> +
> +	__le64 ste[2];
> +};
> +
>  /* The following are exposed for testing purposes. */
>  struct arm_smmu_entry_writer_ops;
>  struct arm_smmu_entry_writer {
> @@ -830,6 +849,7 @@ struct arm_smmu_master_domain {
>  	struct list_head devices_elm;
>  	struct arm_smmu_master *master;
>  	ioasid_t ssid;
> +	u8 nest_parent;
>  };
>  
>  static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 83b6e1cd338d8f..76e9ad6c9403af 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -394,14 +394,34 @@ struct iommu_hwpt_vtd_s1 {
>  	__u32 __reserved;
>  };
>  
> +/**
> + * struct iommu_hwpt_arm_smmuv3 - ARM SMMUv3 Context Descriptor Table info
> + *                                (IOMMU_HWPT_DATA_ARM_SMMUV3)
> + *
> + * @ste: The first two double words of the user space Stream Table Entry for
> + *       a user stage-1 Context Descriptor Table. Must be little-endian.
> + *       Allowed fields: (Refer to "5.2 Stream Table Entry" in SMMUv3 HW Spec)
> + *       - word-0: V, Cfg, S1Fmt, S1ContextPtr, S1CDMax
> + *       - word-1: S1DSS, S1CIR, S1COR, S1CSH, S1STALLD
> + *
> + * -EIO will be returned if @ste is not legal or contains any non-allowed field.
> + * Cfg can be used to select a S1, Bypass or Abort configuration. A Bypass
> + * nested domain will translate the same as the nesting parent.
> + */
> +struct iommu_hwpt_arm_smmuv3 {
> +	__aligned_le64 ste[2];
> +};
> +
>  /**
>   * enum iommu_hwpt_data_type - IOMMU HWPT Data Type
>   * @IOMMU_HWPT_DATA_NONE: no data
>   * @IOMMU_HWPT_DATA_VTD_S1: Intel VT-d stage-1 page table
> + * @IOMMU_HWPT_DATA_ARM_SMMUV3: ARM SMMUv3 Context Descriptor Table
>   */
>  enum iommu_hwpt_data_type {
>  	IOMMU_HWPT_DATA_NONE = 0,
>  	IOMMU_HWPT_DATA_VTD_S1 = 1,
> +	IOMMU_HWPT_DATA_ARM_SMMUV3 = 2,
>  };
>  
>  /**
> -- 
> 2.46.0
> 
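
The vSTE handling in the quoted arm_smmu_make_nested_domain_ste() can be summarised as a mapping from the guest's requested Cfg to the Cfg installed in the physical STE. The following is a minimal userspace sketch of that mapping only, assuming the Cfg encodings shown in the quoted header hunk; the helper name vste_to_physical_cfg() is hypothetical and the real function builds a full STE rather than returning a value.

```c
#include <assert.h>
#include <stdint.h>

/* STE word-0 fields, positions taken from the quoted arm-smmu-v3.h hunk. */
#define STE_0_V		(1ULL << 0)
#define STE_0_CFG_SHIFT	1
#define STE_0_CFG_MASK	(0x7ULL << STE_0_CFG_SHIFT)

enum ste_cfg {
	CFG_ABORT    = 0,
	CFG_BYPASS   = 4,
	CFG_S1_TRANS = 5,
	CFG_S2_TRANS = 6,
	CFG_NESTED   = 7,
};

/*
 * Hypothetical helper: given word 0 of the vSTE (already converted to CPU
 * byte order), return the Cfg value the physical STE would carry.
 */
static enum ste_cfg vste_to_physical_cfg(uint64_t vste0)
{
	/* A non-valid vSTE is relayed as an abort STE. */
	if (!(vste0 & STE_0_V))
		return CFG_ABORT;

	switch ((vste0 & STE_0_CFG_MASK) >> STE_0_CFG_SHIFT) {
	case CFG_S1_TRANS:
		return CFG_NESTED;	/* user CD table on top of the S2 */
	case CFG_BYPASS:
		return CFG_S2_TRANS;	/* S2 parent only, no CD table */
	case CFG_ABORT:
	default:
		return CFG_ABORT;
	}
}
```

Anything the guest asks for that is not S1 translation or bypass ends up as an abort STE, which is why the default case is grouped with CFG_ABORT.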



* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-30 15:12   ` Mostafa Saleh
@ 2024-08-30 16:40     ` Jason Gunthorpe
  2024-09-02  9:29       ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 16:40 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 03:12:54PM +0000, Mostafa Saleh wrote:
> > +	/*
> > +	 * If for some reason the HW does not support DMA coherency then using
> > +	 * S2FWB won't work. This will also disable nesting support.
> > +	 */
> > +	if (FIELD_GET(IDR3_FWB, reg) &&
> > +	    (smmu->features & ARM_SMMU_FEAT_COHERENCY))
> > +		smmu->features |= ARM_SMMU_FEAT_S2FWB;
> I think that’s for the SMMU coherency which in theory is not related to the
> master which FWB overrides, so this check is not correct.

Yes, I agree, in theory.

However the driver today already links them together:

	case IOMMU_CAP_CACHE_COHERENCY:
		/* Assume that a coherent TCU implies coherent TBUs */
		return master->smmu->features & ARM_SMMU_FEAT_COHERENCY;

So this hunk was a continuation of that design.

> What I meant in the previous thread was that we should set FWB only for
> coherent masters, as in (attach s2):
> 	if ((smmu->features & ARM_SMMU_FEAT_S2FWB) && dev_is_dma_coherent(master->dev))
> 		// set S2FWB in STE

I think as I explained in that thread, it is not really correct
either. There is no reason to block using S2FWB for non-coherent
masters that are not used with VFIO. The page table will still place
the correct memattr according to the IOMMU_CACHE flag, S2FWB just
slightly changes the encoding.

For VFIO, non-coherent masters need to be blocked from VFIO entirely
and should never even be allowed to get here.

If anything should be changed then it would be the above
IOMMU_CAP_CACHE_COHERENCY test, and I don't know if
dev_is_dma_coherent() would be correct there, or if it should do some
ACPI inspection or what.

So let's drop the above hunk, it already happens implicitly because
VFIO checks it via IOMMU_CAP_CACHE_COHERENCY and it makes more sense
to put the assumption in one place.

Thanks,
Jason



* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-30 16:09   ` Mostafa Saleh
@ 2024-08-30 16:59     ` Nicolin Chen
  2024-08-30 17:04     ` Jason Gunthorpe
  1 sibling, 0 replies; 95+ messages in thread
From: Nicolin Chen @ 2024-08-30 16:59 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: Jason Gunthorpe, acpica-devel, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 04:09:04PM +0000, Mostafa Saleh wrote:
 
> On Tue, Aug 27, 2024 at 12:51:38PM -0300, Jason Gunthorpe wrote:
> > For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting
> > as the parent and a user provided STE fragment that defines the CD table
> > and related data with addresses translated by the S2 iommu_domain.
> >
> > The kernel only permits userspace to control certain allowed bits of the
> > STE that are safe for user/guest control.
> >
> > IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
> > translation, but there is no way of knowing which S1 entries refer to a
> > range of S2.
> >
> > For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
> > flush all ASIDs from the VMID after flushing the S2 on any change to the
> > S2.
> >
> > Similarly we have to flush the entire ATC if the S2 is changed.
> >
> 
> I am still reviewing this patch, but just some quick questions.
> 
> 1) How does userspace do IOTLB maintenance for S1 in that case?

We do all the TLBI/ATC/CD invalidations via the VIOMMU uapi:
https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/

In other words, nesting support requires at least the VIOMMU p1
series.

> 2) Is there a reason the UAPI is designed this way?
> The way I imagined this, is that userspace will pass the pointer to the CD
> (+ format) not the STE (or part of it).
> Making user space mess with the shareability and cacheability of S1 CD access
> feels odd (although the CD configures page table access, which is similar).

Given that STE.S1ContextPtr itself is an IPA/GPA, it seems to me
that the HW is designed for user space to manage the CD table and
its entries.

CD cache will be flushed if CFGI_CD{_ALL} is trapped. This would
be the same if we pass CD info via the uAPI.

What's the concern for shareability?

Thanks
Nicolin



* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-30  9:07                 ` Shameerali Kolothum Thodi
@ 2024-08-30 17:01                   ` Nicolin Chen
  0 siblings, 0 replies; 95+ messages in thread
From: Nicolin Chen @ 2024-08-30 17:01 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Jason Gunthorpe, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	patches@lists.linux.dev, Mostafa Saleh

On Fri, Aug 30, 2024 at 09:07:16AM +0000, Shameerali Kolothum Thodi wrote:

> Print shows everything fine:
> create_new_pcie_port: name_port smmu_bus0xca_port0, bus_nr 0xca chassis_nr 0xfd, nested_smmu->index 0x2, pci_bus_num 0xca, ret 1
> 
> It looks like a problem with old QEMU_EFI.fd(2022 build and before).
> I tried with 2023 QEMU_EFI.fd and with that it looks fine.
> 
> root@ubuntu:/# lspci -tv
> -+-[0000:ca]---00.0-[cb]----00.0  Huawei Technologies Co., Ltd. Device a251
>  \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
>              +-01.0  Red Hat, Inc Virtio network device
>              +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-04.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-05.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-06.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-07.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              +-08.0  Red Hat, Inc. QEMU PCIe Expander bridge
>              \-09.0  Red Hat, Inc. QEMU PCIe Expander bridge
> 
> So for now, I can proceed.

Nice! That's a relief, for now :)

Thanks
Nicolin



* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-30 16:09   ` Mostafa Saleh
  2024-08-30 16:59     ` Nicolin Chen
@ 2024-08-30 17:04     ` Jason Gunthorpe
  2024-09-02  9:57       ` Mostafa Saleh
  2024-09-06 18:28       ` Jason Gunthorpe
  1 sibling, 2 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 17:04 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 04:09:04PM +0000, Mostafa Saleh wrote:
> Hi Jason,
> 
> On Tue, Aug 27, 2024 at 12:51:38PM -0300, Jason Gunthorpe wrote:
> > For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting
> > as the parent and a user provided STE fragment that defines the CD table
> > and related data with addresses translated by the S2 iommu_domain.
> > 
> > The kernel only permits userspace to control certain allowed bits of the
> > STE that are safe for user/guest control.
> > 
> > IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
> > translation, but there is no way of knowing which S1 entries refer to a
> > range of S2.
> > 
> > For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
> > flush all ASIDs from the VMID after flushing the S2 on any change to the
> > S2.
> > 
> > Similarly we have to flush the entire ATC if the S2 is changed.
> > 
> 
> I am still reviewing this patch, but just some quick questions.
> 
> 1) How does userspace do IOTLB maintenance for S1 in that case?

See

https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com

Patch 17

Really, this series and that series must be together. We have a patch
planning issue to sort out here as well, all 27 should go together
into the same merge window.

> 2) Is there a reason the UAPI is designed this way?
> The way I imagined this, is that userspace will pass the pointer to the CD
> (+ format) not the STE (or part of it).

Yes, we need more information from the STE than just that. EATS and
STALL for instance. And the cacheability below. Who knows what else in
the future.

We also want to support the V=0, Bypass and Abort STE configurations
under the nesting domain (V, CFG required) so that the VIOMMU can
remain affiliated with the STE in all cases. This is necessary to
allow VIOMMU event reporting to always work.

Looking at the masks:

STRTAB_STE_0_NESTING_ALLOWED = 0xf80fffffffffffff
STRTAB_STE_1_NESTING_ALLOWED = 0x380000ff

So we do use a lot of the bits. Reformatting from the native HW format
into something else doesn't seem better for the VMM or the kernel.
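
The two mask values above can be reproduced outside the kernel. This is a small sketch, assuming the field positions from arm-smmu-v3.h (the word-1 positions are not shown in this thread and are taken from that header as an assumption) and ignoring the cpu_to_le64() wrapper, so the results are in CPU byte order. GENMASK_ULL here is a userspace stand-in for the kernel macro, and the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's GENMASK_ULL(h, l): bits l..h set. */
#define GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & ((~0ULL) >> (63 - (h))))

/* STE word 0 fields permitted for userspace. */
#define STE_0_V			(1ULL << 0)
#define STE_0_CFG		GENMASK_ULL(3, 1)
#define STE_0_S1FMT		GENMASK_ULL(5, 4)
#define STE_0_S1CTXPTR_MASK	GENMASK_ULL(51, 6)
#define STE_0_S1CDMAX		GENMASK_ULL(63, 59)

/* STE word 1 fields permitted for userspace (positions assumed). */
#define STE_1_S1DSS		GENMASK_ULL(1, 0)
#define STE_1_S1CIR		GENMASK_ULL(3, 2)
#define STE_1_S1COR		GENMASK_ULL(5, 4)
#define STE_1_S1CSH		GENMASK_ULL(7, 6)
#define STE_1_S1STALLD		(1ULL << 27)
#define STE_1_EATS		GENMASK_ULL(29, 28)

static uint64_t ste0_nesting_allowed(void)
{
	return STE_0_V | STE_0_CFG | STE_0_S1FMT | STE_0_S1CTXPTR_MASK |
	       STE_0_S1CDMAX;
}

static uint64_t ste1_nesting_allowed(void)
{
	return STE_1_S1DSS | STE_1_S1CIR | STE_1_S1COR | STE_1_S1CSH |
	       STE_1_S1STALLD | STE_1_EATS;
}

/* Returns 1 when the vSTE sets no bit outside the allowed masks. */
static int vste_only_allowed_bits(uint64_t ste0, uint64_t ste1)
{
	return !(ste0 & ~ste0_nesting_allowed()) &&
	       !(ste1 & ~ste1_nesting_allowed());
}
```

With these definitions ste0_nesting_allowed() evaluates to 0xf80fffffffffffff and ste1_nesting_allowed() to 0x380000ff, matching the figures quoted above.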

This is similar to the invalidation design, where we also just forward
the invalidation command as is in native HW format, and IDR is
handled the same way.

Overall this sort of direct transparency is how I prefer to see these
kinds of iommufd HW specific interfaces designed. From a lot of
experience here, arbitrary marshal/unmarshal is often an
antipattern :)

> Making user space mess with the shareability and cacheability of S1 CD access
> feels odd (although the CD configures page table access, which is similar).

As I understand it, the walk of the CD table will be constrained by
the S2FWB, just like all the other accesses by the guest.

So we just take a consistent approach of allowing the guest to provide
memattrs in the vSTE, CD, and S1 page table and rely on the HW's S2FWB
to police it.

As you say, there are lots of memattr type bits under direct guest
control, so it doesn't necessarily make a lot of sense to permit
everything in those contexts and then add extra code to do something
different here.

Though I agree it looks odd, it is self-consistent.

Jason



* Re: [PATCH v2 5/8] iommu/arm-smmu-v3: Report IOMMU_CAP_ENFORCE_CACHE_COHERENCY for CANWBS
  2024-08-30 15:19   ` Mostafa Saleh
@ 2024-08-30 17:10     ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 17:10 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 03:19:16PM +0000, Mostafa Saleh wrote:
> > @@ -2263,6 +2266,28 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
> >  	}
> >  }
> >  
> > +static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
> > +{
> > +	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> > +	struct arm_smmu_master_domain *master_domain;
> > +	unsigned long flags;
> > +	bool ret = false;
> nit: we can avoid the goto if we invert the logic of ret (and set it
> to false if the device doesn't support CANWBS)

Yeah, that is tidier:

static bool arm_smmu_enforce_cache_coherency(struct iommu_domain *domain)
{
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
	struct arm_smmu_master_domain *master_domain;
	unsigned long flags;
	bool ret = true;

	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
	list_for_each_entry(master_domain, &smmu_domain->devices,
			    devices_elm) {
		if (!arm_smmu_master_canwbs(master_domain->master)) {
			ret = false;
			break;
		}
	}
	smmu_domain->enforce_cache_coherency = ret;
	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
	return ret;
}

Thanks,
Jason



* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-08-30 15:23   ` Mostafa Saleh
@ 2024-08-30 17:16     ` Jason Gunthorpe
  2024-09-02 10:11       ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 17:16 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 03:23:41PM +0000, Mostafa Saleh wrote:
> > +/**
> > + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
> > + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> > + *
> > + * @flags: Must be set to 0
> > + * @__reserved: Must be 0
> > + * @idr: Implemented features for ARM SMMU Non-secure programming interface
> > + * @iidr: Information about the implementation and implementer of ARM SMMU,
> > + *        and architecture version supported
> > + * @aidr: ARM SMMU architecture version
> > + *
> > + * For the details of @idr, @iidr and @aidr, please refer to the chapters
> > + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> > + *
> > + * User space should read the underlying ARM SMMUv3 hardware information for
> > + * the list of supported features.
> > + *
> > + * Note that these values reflect the raw HW capability, without any insight if
> > + * any required kernel driver support is present. Bits may be set indicating the
> > + * HW has functionality that is lacking kernel software support, such as BTM. If
> > + * a VMM is using this information to construct emulated copies of these
> > + * registers it should only forward bits that it knows it can support.
> > + *
> > + * In future, presence of required kernel support will be indicated in flags.
> > + */
> > +struct iommu_hw_info_arm_smmuv3 {
> > +	__u32 flags;
> > +	__u32 __reserved;
> > +	__u32 idr[6];
> > +	__u32 iidr;
> > +	__u32 aidr;
> > +};
> There is a ton of information here, I think we might need to sanitize the
> values for what user space needs to know (that's why I was asking about qemu).
> Also SMMU_IDR4 is implementation defined, not sure if we can unconditionally
> expose it to userspace.

What is the harm? Does exposing IDR data to userspace in any way
compromise the security or integrity of the system?

I think no - how could it?

As the comment says, the VMM should not just blindly forward this to
a guest!

The VMM needs to make its own IDR to reflect its own vSMMU
capabilities. It can refer to the kernel IDR if it needs to.
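
One way a VMM might do that is sketched below, under stated assumptions: the struct mirrors the layout quoted above, the SMMU_IDR0 bit positions are assumed from the SMMUv3 spec, and VMM_IDR0_SUPPORTED and vmm_build_vidr0() are hypothetical names for a policy that only passes through single-bit features the VMM's vSMMU model implements. Multi-bit fields (TTF, for example) would need per-field handling rather than a plain AND.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the layout of struct iommu_hw_info_arm_smmuv3 quoted above. */
struct hw_info_smmuv3 {
	uint32_t flags;
	uint32_t reserved;
	uint32_t idr[6];
	uint32_t iidr;
	uint32_t aidr;
};

/* A few single-bit SMMU_IDR0 features (positions are an assumption). */
#define IDR0_S2P	(1u << 0)
#define IDR0_S1P	(1u << 1)
#define IDR0_BTM	(1u << 5)
#define IDR0_ASID16	(1u << 12)

/* Hypothetical policy: IDR0 bits this VMM's vSMMU model implements. */
#define VMM_IDR0_SUPPORTED	(IDR0_S1P | IDR0_ASID16)

/*
 * Build the guest-visible IDR0: a feature is advertised only when the
 * hardware reports it and the VMM knows how to support it.
 */
static uint32_t vmm_build_vidr0(const struct hw_info_smmuv3 *hw)
{
	return hw->idr[0] & VMM_IDR0_SUPPORTED;
}
```

In this sketch BTM is dropped even when the hardware sets it, which is the case the uAPI comment calls out.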

So, if the kernel is going to limit it, what criteria would you
propose the kernel use?

Jason



* Re: [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-30 15:27   ` Mostafa Saleh
@ 2024-08-30 17:18     ` Jason Gunthorpe
  2024-09-02  8:57       ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-08-30 17:18 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 03:27:00PM +0000, Mostafa Saleh wrote:
> Hi Jason,
> 
> On Tue, Aug 27, 2024 at 12:51:37PM -0300, Jason Gunthorpe wrote:
> > For SMMUv3 the parent must be a S2 domain, which can be composed
> > into a IOMMU_DOMAIN_NESTED.
> > 
> > In future the S2 parent will also need a VMID linked to the VIOMMU and
> > even to KVM.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index ec2fcdd4523a26..8db3db6328f8b7 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
> >  			   const struct iommu_user_data *user_data)
> >  {
> >  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > -	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> > +	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
> > +				 IOMMU_HWPT_ALLOC_NEST_PARENT;
> >  	struct arm_smmu_domain *smmu_domain;
> >  	int ret;
> >  
> > @@ -3116,6 +3117,14 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
> >  	if (!smmu_domain)
> >  		return ERR_PTR(-ENOMEM);
> >  
> > +	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT) {
> > +		if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING)) {
> > +			ret = -EOPNOTSUPP;
> I think that should be:
> 	ret = ERR_PTR(-EOPNOTSUPP);

Read again :)

static struct iommu_domain *
arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
			   struct iommu_domain *parent,
			   const struct iommu_user_data *user_data)
{
	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
				 IOMMU_HWPT_ALLOC_NEST_PARENT;
	struct arm_smmu_domain *smmu_domain;
	int ret;
     ^^^^^^^^^^^^^^

err_free:
	kfree(smmu_domain);
	return ERR_PTR(ret);
           ^^^^^^^^^^^^^^^^^^
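
The pattern in play, an int error code carried to a single exit label and wrapped with ERR_PTR() only there, can be tried in userspace. The following is a sketch, not the driver code: ERR_PTR(), PTR_ERR() and IS_ERR() are simplified re-implementations of the kernel helpers, and struct domain and domain_alloc() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Simplified userspace versions of the kernel's err.h helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct domain {
	int stage;
};

/* Hypothetical allocator with the same shape as the quoted function. */
static struct domain *domain_alloc(int want_nest_parent, int hw_has_nesting)
{
	struct domain *d = calloc(1, sizeof(*d));
	int ret;

	if (!d)
		return ERR_PTR(-ENOMEM);

	if (want_nest_parent && !hw_has_nesting) {
		ret = -EOPNOTSUPP;	/* plain int at the failure site */
		goto err_free;
	}

	d->stage = want_nest_parent ? 2 : 1;
	return d;

err_free:
	free(d);
	return ERR_PTR(ret);		/* wrapped once, on the way out */
}
```

A caller checks the result with IS_ERR() and recovers the code with PTR_ERR(), for example after domain_alloc(1, 0).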

Jason



* Re: [PATCH v2 7/8] iommu/arm-smmu-v3: Implement IOMMU_HWPT_ALLOC_NEST_PARENT
  2024-08-30 17:18     ` Jason Gunthorpe
@ 2024-09-02  8:57       ` Mostafa Saleh
  0 siblings, 0 replies; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-02  8:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 02:18:17PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 30, 2024 at 03:27:00PM +0000, Mostafa Saleh wrote:
> > Hi Jason,
> > 
> > On Tue, Aug 27, 2024 at 12:51:37PM -0300, Jason Gunthorpe wrote:
> > > For SMMUv3 the parent must be a S2 domain, which can be composed
> > > into a IOMMU_DOMAIN_NESTED.
> > > 
> > > In future the S2 parent will also need a VMID linked to the VIOMMU and
> > > even to KVM.
> > > 
> > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > > ---
> > >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ++++++++++-
> > >  1 file changed, 10 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > index ec2fcdd4523a26..8db3db6328f8b7 100644
> > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -3103,7 +3103,8 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
> > >  			   const struct iommu_user_data *user_data)
> > >  {
> > >  	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> > > -	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING;
> > > +	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
> > > +				 IOMMU_HWPT_ALLOC_NEST_PARENT;
> > >  	struct arm_smmu_domain *smmu_domain;
> > >  	int ret;
> > >  
> > > @@ -3116,6 +3117,14 @@ arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
> > >  	if (!smmu_domain)
> > >  		return ERR_PTR(-ENOMEM);
> > >  
> > > +	if (flags & IOMMU_HWPT_ALLOC_NEST_PARENT) {
> > > +		if (!(master->smmu->features & ARM_SMMU_FEAT_NESTING)) {
> > > +			ret = -EOPNOTSUPP;
> > I think that should be:
> > 	ret = ERR_PTR(-EOPNOTSUPP);
> 
> Read again :)

Oops, sorry about the noise.

Thanks,
Mostafa
> 
> static struct iommu_domain *
> arm_smmu_domain_alloc_user(struct device *dev, u32 flags,
> 			   struct iommu_domain *parent,
> 			   const struct iommu_user_data *user_data)
> {
> 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> 	const u32 PAGING_FLAGS = IOMMU_HWPT_ALLOC_DIRTY_TRACKING |
> 				 IOMMU_HWPT_ALLOC_NEST_PARENT;
> 	struct arm_smmu_domain *smmu_domain;
> 	int ret;
>      ^^^^^^^^^^^^^^
> 
> err_free:
> 	kfree(smmu_domain);
> 	return ERR_PTR(ret);
>            ^^^^^^^^^^^^^^^^^^
> 
> Jason



* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-30 16:40     ` Jason Gunthorpe
@ 2024-09-02  9:29       ` Mostafa Saleh
  2024-09-03  0:05         ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-02  9:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 01:40:19PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 30, 2024 at 03:12:54PM +0000, Mostafa Saleh wrote:
> > > +	/*
> > > +	 * If for some reason the HW does not support DMA coherency then using
> > > +	 * S2FWB won't work. This will also disable nesting support.
> > > +	 */
> > > +	if (FIELD_GET(IDR3_FWB, reg) &&
> > > +	    (smmu->features & ARM_SMMU_FEAT_COHERENCY))
> > > +		smmu->features |= ARM_SMMU_FEAT_S2FWB;
> > I think that’s for the SMMU coherency which in theory is not related to the
> > master which FWB overrides, so this check is not correct.
> 
> Yes, I agree, in theory.
> 
> However the driver today already links them together:
> 
> 	case IOMMU_CAP_CACHE_COHERENCY:
> 		/* Assume that a coherent TCU implies coherent TBUs */
> 		return master->smmu->features & ARM_SMMU_FEAT_COHERENCY;
> 
> So this hunk was a continuation of that design.
> 
> > What I meant in the previous thread that we should set FWB only for coherent
> > masters as (in attach s2):
> > 	if (smmu->features & ARM_SMMU_FEAT_S2FWB && dev_is_dma_coherent(master->dev)
> > 		// set S2FWB in STE
> 
> I think as I explained in that thread, it is not really correct
> either. There is no reason to block using S2FWB for non-coherent
> masters that are not used with VFIO. The page table will still place
> the correct memattr according to the IOMMU_CACHE flag, S2FWB just
> slightly changes the encoding.

It’s not just the encoding that changes:
- Without FWB, stage-2 combines attributes.
- With FWB, stage-2 overrides them.

So a cacheable mapping in stage-2 can lead to a non-cacheable
transaction (or one with different cacheability attributes) based on
the input. I am not sure though if there is such a case in the kernel.

Also, that logic doesn't only apply to VFIO, but also for stage-2
only SMMUs that use stage-2 for kernel DMA.
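
The combine-versus-override distinction can be pictured with a toy
model. This is a deliberately simplified sketch: the three memory
types, the "least cacheable wins" rule and all demo_* names are
assumptions made for illustration, and the real architectural rules
and encodings have more cases than this.

```c
#include <assert.h>

/* Toy memory types, ordered from least to most cacheable (assumption). */
enum demo_memtype {
	DEMO_DEVICE = 0,
	DEMO_NORMAL_NC = 1,
	DEMO_NORMAL_WB = 2,
};

/* Without FWB (toy rule): the less cacheable of the two attributes wins. */
static enum demo_memtype demo_s2_combine(enum demo_memtype incoming,
					 enum demo_memtype s2)
{
	return incoming < s2 ? incoming : s2;
}

/* With FWB (toy rule): the stage-2 attribute replaces the incoming one. */
static enum demo_memtype demo_s2_fwb(enum demo_memtype incoming,
				     enum demo_memtype s2)
{
	(void)incoming;
	return s2;
}
```

In this model a non-cacheable input through a write-back stage-2
mapping stays non-cacheable when combining, and becomes write-back
when overriding.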

> 
> For VFIO, non-coherent masters need to be blocked from VFIO entirely
> and should never get even be allowed to get here.
> 
> If anything should be changed then it would be the above
> IOMMU_CAP_CACHE_COHERENCY test, and I don't know if
> dev_is_dma_coherent() would be correct there, or if it should do some
> ACPI inspection or what.

I agree, and I believe that this assumption is not accurate. I am not
sure what the right approach is here, but in concept I think we
shouldn’t enable FWB for non-coherent devices (using
dev_is_dma_coherent() or another check).

Thanks,
Mostafa
> 
> So let's drop the above hunk, it already happens implicitly because
> VFIO checks it via IOMMU_CAP_CACHE_COHERENCY and it makes more sense
> to put the assumption in one place.
> 
> Thanks,
> Jason



* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-30 17:04     ` Jason Gunthorpe
@ 2024-09-02  9:57       ` Mostafa Saleh
  2024-09-03  0:30         ` Jason Gunthorpe
  2024-09-06 18:28       ` Jason Gunthorpe
  1 sibling, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-02  9:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 02:04:26PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 30, 2024 at 04:09:04PM +0000, Mostafa Saleh wrote:
> > Hi Jason,
> > 
> > On Tue, Aug 27, 2024 at 12:51:38PM -0300, Jason Gunthorpe wrote:
> > > For SMMUv3 a IOMMU_DOMAIN_NESTED is composed of a S2 iommu_domain acting
> > > as the parent and a user provided STE fragment that defines the CD table
> > > and related data with addresses translated by the S2 iommu_domain.
> > > 
> > > The kernel only permits userspace to control certain allowed bits of the
> > > STE that are safe for user/guest control.
> > > 
> > > IOTLB maintenance is a bit subtle here, the S1 implicitly includes the S2
> > > translation, but there is no way of knowing which S1 entries refer to a
> > > range of S2.
> > > 
> > > For the IOTLB we follow ARM's guidance and issue a CMDQ_OP_TLBI_NH_ALL to
> > > flush all ASIDs from the VMID after flushing the S2 on any change to the
> > > S2.
> > > 
> > > Similarly we have to flush the entire ATC if the S2 is changed.
> > > 
> > 
> > I am still reviewing this patch, but just some quick questions.
> > 
> > 1) How does userspace do IOTLB maintenance for S1 in that case?
> 
> See
> 
> https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com
> 
> Patch 17

Thanks, I had this series on my radar; I will check it this week.

>
> Really, this series and that series must be together. We have a patch
> planning issue to sort out here as well, all 27 should go together
> into the same merge window.
> 
> > 2) Is there a reason the UAPI is designed this way?
> > The way I imagined this, is that userspace will pass the pointer to the CD
> > (+ format) not the STE (or part of it).
> 
> Yes, we need more information from the STE than just that. EATS and
> STALL for instance. And the cachability below. Who knows what else in
> the future.

But for example if that was extended later, how can user space know
which fields are allowed and which are not?

> 
> We also want to support the V=0, Bypass and Abort STE configurations
> under the nesting domain (V, CFG required) so that the VIOMMU can
> remain affiliated with the STE in all cases. This is necessary to
> allow VIOMMU event reporting to always work.
> 
> Looking at the masks:
> 
> STRTAB_STE_0_NESTING_ALLOWED = 0xf80fffffffffffff
> STRTAB_STE_1_NESTING_ALLOWED = 0x380000ff
> 
> So we do use alot of the bits. Reformatting from the native HW format
> into something else doesn't seem better for VMM or kernel..
> 
> This is similar to the invalidation design where we also just forward
> the invalidation command as is in native HW format, and how IDR is
> done the same.
> 
> Overall this sort of direct transparency is how I prefer to see these
> kinds of iommufd HW specific interfaces designed. From a lot of
> experience here, arbitary marshall/unmarshall is often an
> antipattern :)

Is there any documentation for the (proposed) SMMUv3 UAPI for IOMMUFD?
I can understand reading IDRs from userspace (with some sanitization),
but adding more logic to map a vSTE to an STE needs more care about
what kind of semantics are provided.

Also, I am working on a similar interface for pKVM where we “paravirtualize”
the SMMU access for guests. It has different semantics, but I hope we can
align that with IOMMUFD (but it’s nowhere near upstream now).

I see you are talking at LPC about IOMMUFD:
https://lore.kernel.org/linux-iommu/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/T/#m2dbb08f3bf8506a492bc7dda2de662e42371e683

Do you have any plans to talk about this also?

Thanks,
Mostafa
> 
> > Making user space messing with shareability and cacheability of S1 CD access
> > feels odd. (Although CD configure page table access which is similar).
> 
> As I understand it, the walk of the CD table will be constrained by
> the S2FWB, just like all the other accesses by the guest.
> 
> So we just take a consistent approach of allowing the guest to provide
> memattrs in the vSTE, CD, and S1 page table and rely on the HW's S2FWB
> to police it.
> 
> As you say there are lots of memattr type bits under direct guest
> control, it doesn't necessarily make alot of sense to permit
> everything in those contexts and then add extra code to do something
> different here.
> 
> Though I agree it looks odd, it is self-consistent.
> 
> Jason



* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-08-30 17:16     ` Jason Gunthorpe
@ 2024-09-02 10:11       ` Mostafa Saleh
  2024-09-03  0:16         ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-02 10:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 02:16:02PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 30, 2024 at 03:23:41PM +0000, Mostafa Saleh wrote:
> > > +/**
> > > + * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
> > > + *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> > > + *
> > > + * @flags: Must be set to 0
> > > + * @__reserved: Must be 0
> > > + * @idr: Implemented features for ARM SMMU Non-secure programming interface
> > > + * @iidr: Information about the implementation and implementer of ARM SMMU,
> > > + *        and architecture version supported
> > > + * @aidr: ARM SMMU architecture version
> > > + *
> > > + * For the details of @idr, @iidr and @aidr, please refer to the chapters
> > > + * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> > > + *
> > > + * User space should read the underlying ARM SMMUv3 hardware information for
> > > + * the list of supported features.
> > > + *
> > > + * Note that these values reflect the raw HW capability, without any insight if
> > > + * any required kernel driver support is present. Bits may be set indicating the
> > > + * HW has functionality that is lacking kernel software support, such as BTM. If
> > > + * a VMM is using this information to construct emulated copies of these
> > > + * registers it should only forward bits that it knows it can support.
> > > + *
> > > + * In future, presence of required kernel support will be indicated in flags.
> > > + */
> > > +struct iommu_hw_info_arm_smmuv3 {
> > > +	__u32 flags;
> > > +	__u32 __reserved;
> > > +	__u32 idr[6];
> > > +	__u32 iidr;
> > > +	__u32 aidr;
> > > +};
> > There is a ton of information here, I think we might need to santitze the
> > values for what user space needs to know (that's why I was asking about qemu)
> > also SMMU_IDR4 is implementation define, not sure if we can unconditionally
> > expose it to userspace.
> 
> What is the harm? Does exposing IDR data to userspace in any way
> compromise the security or integrity of the system?
> 
> I think no - how could it?

I don’t see a clear harm or exploit with exposing IDRs, but IMHO we
should treat userspace according to the principle of least privilege
and only expose what userspace cares about (with sanitised IDRs or
through another mechanism).

For example, KVM doesn’t allow reading the CPU system registers to
know if SVE (or another feature) is supported, but hides that behind
a CAP in KVM_CHECK_EXTENSION.

> 
> As the comments says, the VMM should not just blindly forward this to
> a guest!

I don't think the kernel should trust userspace.

> 
> The VMM needs to make its own IDR to reflect its own vSMMU
> capabilities. It can refer to the kernel IDR if it needs to.
> 
> So, if the kernel is going to limit it, what criteria would you
> propose the kernel use?

I agree that the VMM would create a virtual IDR for guest, but that
doesn't have to be directly based on the physical one (same as CPU).

Thanks,
Mostafa

> 
> Jason



* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-09-02  9:29       ` Mostafa Saleh
@ 2024-09-03  0:05         ` Jason Gunthorpe
  2024-09-03  7:57           ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-03  0:05 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Mon, Sep 02, 2024 at 09:29:53AM +0000, Mostafa Saleh wrote:
> On Fri, Aug 30, 2024 at 01:40:19PM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 30, 2024 at 03:12:54PM +0000, Mostafa Saleh wrote:
> > > > +	/*
> > > > +	 * If for some reason the HW does not support DMA coherency then using
> > > > +	 * S2FWB won't work. This will also disable nesting support.
> > > > +	 */
> > > > +	if (FIELD_GET(IDR3_FWB, reg) &&
> > > > +	    (smmu->features & ARM_SMMU_FEAT_COHERENCY))
> > > > +		smmu->features |= ARM_SMMU_FEAT_S2FWB;
> > > I think that’s for the SMMU coherency which in theory is not related to the
> > > master which FWB overrides, so this check is not correct.
> > 
> > Yes, I agree, in theory.
> > 
> > However the driver today already links them together:
> > 
> > 	case IOMMU_CAP_CACHE_COHERENCY:
> > 		/* Assume that a coherent TCU implies coherent TBUs */
> > 		return master->smmu->features & ARM_SMMU_FEAT_COHERENCY;
> > 
> > So this hunk was a continuation of that design.
> > 
> > > What I meant in the previous thread that we should set FWB only for coherent
> > > masters as (in attach s2):
> > > 	if (smmu->features & ARM_SMMU_FEAT_S2FWB && dev_is_dma_coherent(master->dev)
> > > 		// set S2FWB in STE
> > 
> > I think as I explained in that thread, it is not really correct
> > either. There is no reason to block using S2FWB for non-coherent
> > masters that are not used with VFIO. The page table will still place
> > the correct memattr according to the IOMMU_CACHE flag, S2FWB just
> > slightly changes the encoding.
> 
> It’s not just the encoding that changes, as
> - Without FWB, stage-2 combine attributes
> - While with FWB, it overrides them.

You mean there is some incoming attribute in the transaction
(obviously not talking PCI here) and S2FWB combines with that?

> So a cacheable mapping in stage-2 can lead to a non-cacheable
> (or with different cachableitiy attributes) transaction based on the
> input. I am not sure though if there is such case in the kernel.

If the kernel supplies IOMMU_CACHE then the kernel also skips all the
cache flushing. So it would already be a functional problem if
combining caused a non-cacheable access through an IOMMU_CACHE S2. The
DMA API would fail if that was the case.

> > If anything should be changed then it would be the above
> > IOMMU_CAP_CACHE_COHERENCY test, and I don't know if
> > dev_is_dma_coherent() would be correct there, or if it should do some
> > ACPI inspection or what.
> 
> I agree, I believe that this assumption is not accurate, I am not sure
> what is the right approach here, but in concept I think we shouldn’t
> enable FWB for non-coherent devices (using dev_is_dma_coherent() or
> other check)

The DMA API requires that the cacheability rules it sets via
IOMMU_CACHE are followed. In this way the stricter behavior of S2FWB
is a benefit, not a drawback.

I'm still not seeing a problem here?

Jason



* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-09-02 10:11       ` Mostafa Saleh
@ 2024-09-03  0:16         ` Jason Gunthorpe
  2024-09-03  8:34           ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-03  0:16 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Mon, Sep 02, 2024 at 10:11:16AM +0000, Mostafa Saleh wrote:

> > What is the harm? Does exposing IDR data to userspace in any way
> > compromise the security or integrity of the system?
> > 
> > I think no - how could it?
> 
> I don’t see a clear harm or exploit with exposing IDRs, but IMHO we
> should deal with userspace with the least privilege principle and
> only expose what user space cares about (with sanitised IDRs or
> through another mechanism)

If the information is harmless then why hide it? We expose all kinds
of stuff to userspace, like most of the PCI config space for
instance. I think we need a reason. 

Any sanitization in the kernel will complicate everything because we
will get it wrong.

Let's not make things complicated without reasons. Intel and AMD are
exposing their IDR equivalents in this manner as well.

> For example, KVM doesn’t allow reading reading the CPU system
> registers to know if SVE(or other features) is supported but hides
> that by a CAP in KVM_CHECK_EXTENSION

Do you know why?

> > As the comments says, the VMM should not just blindly forward this to
> > a guest!
> 
> I don't think the kernel should trust userspace.

There is no trust. If the VMM blindly forwards the IDRs then the VMM
will find its VMs have issues. It is a functional bug, just as if the
VMM puts random garbage in its vIDRs.

The only purpose of this interface is to provide information about the
physical hardware to the VMM.

> > The VMM needs to make its own IDR to reflect its own vSMMU
> > capabilities. It can refer to the kernel IDR if it needs to.
> > 
> > So, if the kernel is going to limit it, what criteria would you
> > propose the kernel use?
> 
> I agree that the VMM would create a virtual IDR for guest, but that
> doesn't have to be directly based on the physical one (same as CPU).

No one said it should be. In fact the comment explicitly says not to
do that.

The VMM is expected to read out of the physical IDR any information
that affects data structures that are under direct guest control.

For instance, anything that affects the CD on downwards: page sizes,
IAS limits, and so on. Anything that affects assigned invalidation
queues. Anything that impacts errata the VM needs to be aware of.
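
As a rough sketch of what "only forward bits that it knows it can
support" could look like on the VMM side: the struct below mirrors the
layout of struct iommu_hw_info_arm_smmuv3 quoted earlier in the thread,
but uses stdint types so it builds standalone. vmm_build_vidr() and the
per-register "known" masks are hypothetical. The sketch treats every
IDR as a set of independent feature bits, which is an assumption; real
IDRs also contain multi-bit fields (sizes and the like) that would
need per-field handling rather than a plain AND.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_IDR 6

/* Standalone mirror of the UAPI struct layout quoted in this thread. */
struct demo_hw_info_smmuv3 {
	uint32_t flags;
	uint32_t reserved;
	uint32_t idr[DEMO_NUM_IDR];
	uint32_t iidr;
	uint32_t aidr;
};

/*
 * Build the virtual IDRs: keep a bit only if the physical HW reports it
 * AND the VMM knows how to emulate or pass through that feature.
 */
static void vmm_build_vidr(const struct demo_hw_info_smmuv3 *hw,
			   const uint32_t vmm_known[DEMO_NUM_IDR],
			   uint32_t vidr[DEMO_NUM_IDR])
{
	for (size_t i = 0; i < DEMO_NUM_IDR; i++)
		vidr[i] = hw->idr[i] & vmm_known[i];
}
```

A bit the hardware reports but the VMM does not list in its mask is
therefore never shown to the guest.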

If you sanitize it then you will hide information that someone will
need at some point, then we have to go and unsanitize it, then add
feature flags. It is a pain.

Jason



* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-02  9:57       ` Mostafa Saleh
@ 2024-09-03  0:30         ` Jason Gunthorpe
  2024-09-03  1:13           ` Nicolin Chen
  2024-09-03  9:00           ` Mostafa Saleh
  0 siblings, 2 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-03  0:30 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Mon, Sep 02, 2024 at 09:57:45AM +0000, Mostafa Saleh wrote:
> > > 2) Is there a reason the UAPI is designed this way?
> > > The way I imagined this, is that userspace will pass the pointer to the CD
> > > (+ format) not the STE (or part of it).
> > 
> > Yes, we need more information from the STE than just that. EATS and
> > STALL for instance. And the cachability below. Who knows what else in
> > the future.
> 
> But for example if that was extended later, how can user space know
> which fields are allowed and which are not?

Changes to the vSTE rules that require userspace to be aware would
have to be signaled in the GET_INFO answer. This is the same process
no matter how you encode the STE bits in the structure.

This confirmation of kernel support would then be reflected in the
vIDRs to the VM and the VM could know to set the extended bits.

Otherwise setting an invalid vSTE will fail the ioctl; the VMM can
log it, generate an event and install an abort vSTE.
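
A minimal sketch of the kind of check that makes such an ioctl fail,
using the two allowed-bit masks quoted earlier in this thread. The
function name demo_validate_vste() and the -EINVAL return value are
assumptions for illustration, the sketch ignores the little-endian
layout of a real STE, and real validation would look at more than
these masks (for example the V and CFG fields).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mask values as quoted in this thread. */
#define STRTAB_STE_0_NESTING_ALLOWED 0xf80fffffffffffffULL
#define STRTAB_STE_1_NESTING_ALLOWED 0x380000ffULL

/*
 * Reject a user-provided vSTE fragment if it sets any bit outside the
 * allowed masks. Only the first two 64-bit words are user controlled
 * in this sketch.
 */
static int demo_validate_vste(const uint64_t vste[2])
{
	if ((vste[0] & ~STRTAB_STE_0_NESTING_ALLOWED) ||
	    (vste[1] & ~STRTAB_STE_1_NESTING_ALLOWED))
		return -EINVAL;
	return 0;
}
```

A kernel that later accepts more bits would widen the masks, and per
the discussion above would signal that through GET_INFO.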

> > Overall this sort of direct transparency is how I prefer to see these
> > kinds of iommufd HW specific interfaces designed. From a lot of
> > experience here, arbitary marshall/unmarshall is often an
> > antipattern :)
> 
> Is there any documentation for the (proposed) SMMUv3 UAPI for IOMMUFD?

Just the comments in this series?

> I can understand reading IDRs from userspace (with some sanitation),
> but adding some more logic to map vSTE to STE needs more care of what
> kind of semantics are provided.

We can enhance the comment if you think it is not clear enough. It
lists the fields the userspace should pass through.

> Also, I am working on similar interface for pKVM where we “paravirtualize”
> the SMMU access for guests, it’s different semantics, but I hope we can
> align that with IOMMUFD (but it’s nowhere near upstream now)

Well, if you do paravirt where you just do map/unmap calls to the
hypervisor (ie classic virtio-iommu) then you don't need to do very
much.

If you want to do nesting, then IMHO, just present a real vSMMU. It is
already intended to be paravirtualized and this is what the
confidential compute people are going to be doing as well.

Otherwise I'd expect you'd get more value from aligning with the
virtio-iommu nesting stuff, where they have laid out what information
the VM needs. iommufd is not intended to be just jammed directly into
a VM. There is an expectation that a VMM will sit there on top and
massage things.

> I see you are talking in LPC about IOMMUFD:
> https://lore.kernel.org/linux-iommu/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/T/#m2dbb08f3bf8506a492bc7dda2de662e42371e683
> 
> Do you have any plans to talk about this also?

Nothing specific, this is LPC so if people in the room would like to
use the session for that then we can talk about it. Last year the room
wanted to talk about PASID mostly.

I haven't heard if someone is going to KVM forum to talk about
vSMMUv3? Eric? Nicolin do you know?

Jason



* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-03  0:30         ` Jason Gunthorpe
@ 2024-09-03  1:13           ` Nicolin Chen
  2024-09-03  9:00           ` Mostafa Saleh
  1 sibling, 0 replies; 95+ messages in thread
From: Nicolin Chen @ 2024-09-03  1:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mostafa Saleh, acpica-devel, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi

On Mon, Sep 02, 2024 at 09:30:22PM -0300, Jason Gunthorpe wrote:

> > I see you are talking in LPC about IOMMUFD:
> > https://lore.kernel.org/linux-iommu/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/T/#m2dbb08f3bf8506a492bc7dda2de662e42371e683
> > 
> > Do you have any plans to talk about this also?
> 
> Nothing specific, this is LPC so if people in the room would like to
> use the session for that then we can talk about it. Last year the room
> wanted to talk about PASID mostly.
> 
> I haven't heard if someone is going to KVM forum to talk about
> vSMMUv3? Eric? Nicolin do you know?

I am not attending, since it seems to be in-person only, and I am not
sure there is a topic: it doesn't seem to have an official schedule
yet.

That being said, I think we do need a thread at least for vSMMU in
QEMU. The multi-vSMMU design requires changing the vSMMU module to a
pluggable one via the QEMU command line, while the PCI-vSMMU topology
should be implemented in libvirt. Both need some effort, for which I
don't have bandwidth to spare. Given that the kernel patches are in
solid shape, I think I could start to borrow some help. I'll draft an
email to the qemu-devel list. Folks joining the forum might be able
to carry out a discussion.

Thanks
Nicolin



* RE: [PATCH v2 4/8] ACPI/IORT: Support CANWBS memory access flag
  2024-08-30 13:54     ` Jason Gunthorpe
@ 2024-09-03  7:14       ` Tian, Kevin
  0 siblings, 0 replies; 95+ messages in thread
From: Tian, Kevin @ 2024-09-03  7:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel@lists.linux.dev, Hanjun Guo, iommu@lists.linux.dev,
	Joerg Roedel, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Moore, Robert, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches@lists.linux.dev, Shameerali Kolothum Thodi,
	Mostafa Saleh

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, August 30, 2024 9:55 PM
> 
> On Fri, Aug 30, 2024 at 07:52:41AM +0000, Tian, Kevin wrote:
> 
> > But according to above description S2FWB cannot 100% guarantee it
> > due to PCI No Snoop. Does it suggest that we should only allow nesting
> > only for CANWBS, or disable/hide PCI No Snoop cap from the guest
> > in case of S2FWB?
> 
> ARM has always had an issue with no-snoop and VFIO. The ARM
> expectation is that VFIO/VMM would block no-snoop in the PCI config
> space.
> 
> From a VM perspective, any VMM on ARM has to take care to do this
> today already.
> 
> For instance a VMM could choose to only assign devices which never use
> no-snoop, which describes almost all of what people actually do :)
> 
> The purpose of S2FWB is to keep that approach working. If the VMM has
> blocked no-snoop then S2FWB ensures that the VM can't use IOPTE bits
> to break cachability and it remains safe.
> 
> From a VFIO perspective ARM has always had a security hole similer to
> what Yan is trying to fix on Intel, that is a separate pre-existing
> topic. Ideally the VFIO kernel would block PCI config space no-snoop
> for alot of cases.
> 

Makes sense. It would be helpful to put some words in the commit msg too.



* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-09-03  0:05         ` Jason Gunthorpe
@ 2024-09-03  7:57           ` Mostafa Saleh
  2024-09-03 23:33             ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-03  7:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Mon, Sep 02, 2024 at 09:05:46PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 02, 2024 at 09:29:53AM +0000, Mostafa Saleh wrote:
> > On Fri, Aug 30, 2024 at 01:40:19PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 30, 2024 at 03:12:54PM +0000, Mostafa Saleh wrote:
> > > > > +	/*
> > > > > +	 * If for some reason the HW does not support DMA coherency then using
> > > > > +	 * S2FWB won't work. This will also disable nesting support.
> > > > > +	 */
> > > > > +	if (FIELD_GET(IDR3_FWB, reg) &&
> > > > > +	    (smmu->features & ARM_SMMU_FEAT_COHERENCY))
> > > > > +		smmu->features |= ARM_SMMU_FEAT_S2FWB;
> > > > I think that’s for the SMMU coherency which in theory is not related to the
> > > > master which FWB overrides, so this check is not correct.
> > > 
> > > Yes, I agree, in theory.
> > > 
> > > However the driver today already links them together:
> > > 
> > > 	case IOMMU_CAP_CACHE_COHERENCY:
> > > 		/* Assume that a coherent TCU implies coherent TBUs */
> > > 		return master->smmu->features & ARM_SMMU_FEAT_COHERENCY;
> > > 
> > > So this hunk was a continuation of that design.
> > > 
> > > > What I meant in the previous thread that we should set FWB only for coherent
> > > > masters as (in attach s2):
> > > > 	if (smmu->features & ARM_SMMU_FEAT_S2FWB && dev_is_dma_coherent(master->dev)
> > > > 		// set S2FWB in STE
> > > 
> > > I think as I explained in that thread, it is not really correct
> > > either. There is no reason to block using S2FWB for non-coherent
> > > masters that are not used with VFIO. The page table will still place
> > > the correct memattr according to the IOMMU_CACHE flag, S2FWB just
> > > slightly changes the encoding.
> > 
> > It’s not just the encoding that changes, as
> > - Without FWB, stage-2 combine attributes
> > - While with FWB, it overrides them.
> 
> You mean there is some incoming attribute in the transaction
> (obviously not talking PCI here) and S2FWB combines with that?

Yes, things such as cacheability (as defined by the Arm spec).
I am not sure about PCI, but according to the spec:
	“PCIe does not contain memory type attributes, and each transaction
	takes a system-defined memory type when it progresses into the system”

> 
> > So a cacheable mapping in stage-2 can lead to a non-cacheable
> > (or with different cacheability attributes) transaction based on the
> > input. I am not sure though if there is such a case in the kernel.
> 
> If the kernel supplies IOMMU_CACHE then the kernel also skips all the
> cache flushing. So it would be a functional problem if combining was
> causing a non-cachable access through an IOMMU_CACHE S2 already. The
> DMA API would fail if that was the case.

Correct, but it’s not just about cacheable/non-cacheable; as I mentioned,
it’s about other attributes as well. This is a very niche case, and again I
am not sure if there are devices affected in the kernel, but I just
wanted to highlight that it’s not just a different encoding for stage-2.
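
To make the combine-versus-override distinction concrete, here is a
deliberately simplified model. The enum ranking and both helper names are
made up for this sketch, it does not use the architectural MemAttr
encoding, and it only treats a write-back stage-2 mapping as forcing:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified ranking, least cacheable first. This is NOT the MemAttr encoding. */
enum mem_type {
	MT_DEVICE,
	MT_NORMAL_NC,
	MT_NORMAL_WT,
	MT_NORMAL_WB,
};

static_assert(MT_DEVICE < MT_NORMAL_WB,
	      "ranking must run from least to most cacheable");

/* Without FWB: stage-2 combines, so the less cacheable of the two types wins. */
static inline enum mem_type s2_combine(enum mem_type incoming, enum mem_type s2)
{
	return incoming < s2 ? incoming : s2;
}

/*
 * With FWB and a write-back stage-2 mapping, stage-2 overrides the incoming
 * type. Other stage-2 types fall back to combining in this simplified model.
 */
static inline enum mem_type s2_resolve(enum mem_type incoming,
				       enum mem_type s2, bool fwb)
{
	if (fwb && s2 == MT_NORMAL_WB)
		return MT_NORMAL_WB;
	return s2_combine(incoming, s2);
}
```

In this model a non-cacheable incoming transaction through a write-back
stage-2 mapping stays non-cacheable without FWB and becomes write-back
with it, which is the behavioural difference I mean.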

> 
> > > If anything should be changed then it would be the above
> > > IOMMU_CAP_CACHE_COHERENCY test, and I don't know if
> > > dev_is_dma_coherent() would be correct there, or if it should do some
> > > ACPI inspection or what.
> > 
> > I agree, I believe that this assumption is not accurate, I am not sure
> > what is the right approach here, but in concept I think we shouldn’t
> > enable FWB for non-coherent devices (using dev_is_dma_coherent() or
> > other check)
> 
> The DMA API requires that the cachability rules it sets via
> IOMMU_CACHE are followed. In this way the stricter behavior of S2FWB
> is a benefit, not a draw back.
> 
> I'm still not seeing a problem here??

Basically, I believe we shouldn’t set FWB blindly just because it’s supported;
I don’t see how it’s useful for stage-2 only domains.

And I believe making assumptions about VFIO (which actually is not correctly
enforced at the moment) is fragile, and we should only set FWB for coherent
devices in a nested setup, where the VMM (or hypervisor) knows better than
the VM.

Thanks,
Mostafa

> 
> Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-09-03  0:16         ` Jason Gunthorpe
@ 2024-09-03  8:34           ` Mostafa Saleh
  2024-09-03 23:40             ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-03  8:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Mon, Sep 02, 2024 at 09:16:54PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 02, 2024 at 10:11:16AM +0000, Mostafa Saleh wrote:
> 
> > > What is the harm? Does exposing IDR data to userspace in any way
> > > compromise the security or integrity of the system?
> > > 
> > > I think no - how could it?
> > 
> > I don’t see a clear harm or exploit with exposing IDRs, but IMHO we
> > should deal with userspace with the least privilege principle and
> > only expose what user space cares about (with sanitised IDRs or
> > through another mechanism)
> 
> If the information is harmless then why hide it? We expose all kinds
> of stuff to userspace, like most of the PCI config space for
> instance. I think we need a reason. 
> 
> Any sanitization in the kernel will complicate everything because we
> will get it wrong.
> 
> Let's not make things complicated without reasons. Intel and AMD are
> exposing their IDR equivalents in this manner as well.
> 
> > For example, KVM doesn’t allow reading the CPU system
> > registers to know if SVE (or other features) is supported but hides
> > that by a CAP in KVM_CHECK_EXTENSION
> 
> Do you know why?
> 

I am not really sure, but I believe it’s a useful abstraction

> > > As the comments says, the VMM should not just blindly forward this to
> > > a guest!
> > 
> > I don't think the kernel should trust userspace.
> 
> There is no trust. If the VMM blindly forwards the IDRS then the VMM
> will find its VM's have issues. It is a functional bug, just as if the
> VMM puts random garbage in its vIDRS.
> 
> The only purpose of this interface is to provide information about the
> physical hardware to the VMM.
> 
> > > The VMM needs to make its own IDR to reflect its own vSMMU
> > > capabilities. It can refer to the kernel IDR if it needs to.
> > > 
> > > So, if the kernel is going to limit it, what criteria would you
> > > propose the kernel use?
> > 
> > I agree that the VMM would create a virtual IDR for guest, but that
> > doesn't have to be directly based on the physical one (same as CPU).
> 
> No one said it should be. In fact the comment explicitly says not to
> do that.
> 
> The VMM is expected to read out of the physical IDR any information
> that affects data structures that are under direct guest control.
> 
> For instance anything that affects the CD on downwards. So page sizes,
> IAS limits, etc etc etc. Anything that affects assigned invalidation
> queues. Anything that impacts errata the VM needs to be aware of.
> 
> If you sanitize it then you will hide information that someone will
> need at some point, then we have to go and unsanitize it, then add feature
> flags.. It is a pain.

I don’t have a very strong opinion on sanitising the IDRs (especially as
many of those are documented anyway per IP), but we should at least have
some clear requirement for what userspace needs. I am just concerned
that userspace can misuse some of the features, leading to a strange UAPI.

Thanks,
Mostafa

> 
> Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-03  0:30         ` Jason Gunthorpe
  2024-09-03  1:13           ` Nicolin Chen
@ 2024-09-03  9:00           ` Mostafa Saleh
  2024-09-03 23:55             ` Jason Gunthorpe
  1 sibling, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-03  9:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Mon, Sep 02, 2024 at 09:30:22PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 02, 2024 at 09:57:45AM +0000, Mostafa Saleh wrote:
> > > > 2) Is there a reason the UAPI is designed this way?
> > > > The way I imagined this, is that userspace will pass the pointer to the CD
> > > > (+ format) not the STE (or part of it).
> > > 
> > > Yes, we need more information from the STE than just that. EATS and
> > > STALL for instance. And the cachability below. Who knows what else in
> > > the future.
> > 
> > But for example if that was extended later, how can user space know
> > which fields are allowed and which are not?
> 
> Changes to the vSTE rules that require userspace to be aware would have
> to be signaled in the GET_INFO answer. This is the same process no
> matter how you encode the STE bits in the structure.
> 
How? And why would changing that in the future not be a problem in the way
that sanitising IDRs is?

> This confirmation of kernel support would then be reflected in the
> vIDRs to the VM and the VM could know to set the extended bits.
> 
> Otherwise setting an invalid vSTE will fail the ioctl, the VMM can
> log the event, generate an event and install an abort vSTE.
> 
> > > Overall this sort of direct transparency is how I prefer to see these
> > > kinds of iommufd HW specific interfaces designed. From a lot of
> > > experience here, arbitrary marshall/unmarshall is often an
> > > antipattern :)
> > 
> > Is there any documentation for the (proposed) SMMUv3 UAPI for IOMMUFD?
> 
> Just the comments in this series?

But this is a UAPI. How can userspace implement that if it has no
documentation, and how can it be maintained if there is no clear
interface with userspace with what is expected/returned...

> 
> > I can understand reading IDRs from userspace (with some sanitation),
> > but adding some more logic to map vSTE to STE needs more care of what
> > kind of semantics are provided.
> 
> We can enhance the comment if you think it is not clear enough. It
> lists the fields the userspace should pass through.
> 
> > Also, I am working on similar interface for pKVM where we “paravirtualize”
> > the SMMU access for guests, it’s different semantics, but I hope we can
> > align that with IOMMUFD (but it’s nowhere near upstream now)
> 
> Well, if you do paravirt where you just do map/unmap calls to the
> hypervisor (ie classic virtio-iommu) then you don't need to do very
> much.

But we have a different model. virtio-iommu typically presents
the device to the VM, and on the backend it calls VFIO MAP/UNMAP.
Although technically we can have virtio-iommu in the hypervisor (EL2),
that adds a lot of complexity and increases the TCB of pKVM.

For pKVM, the VMM is not trusted and the hypervisor would do the map/unmap...,
but the VMM will have to configure the virtual view of the device (Mapping of
endpoints to virtual endpoints, vIRQs…), this requires a userspace interface
to query some HW info (similar to VFIO VFIO_DEVICE_GET_IRQ_INFO and then mapping
it to a GSI through KVM, but for IOMMUs)
Though, this design is very early and in progress.

> 
> If you want to do nesting, then IMHO, just present a real vSMMU. It is
> already intended to be paravirtualized and this is what the
> confidential compute people are going to be doing as well.
> 
> Otherwise I'd expect you'd get more value to align with the
> virtio-iommu nesting stuff, where they have laid out what information
> the VM needs. iommufd is not intended to be just jammed directly into
> a VM. There is an expectation that a VMM will sit there on top and
> massage things.

I haven’t been keeping up with iommufd lately, I will try to spend more
time on that in the future.
But my idea is that we would create an IOMMUFD, attach it to a device and then
through some extra IOCTLs, we can configure some “virtual” topology for it which
then relies on KVM, again this is very early, and we need to support pKVM IOMMUs
in the host first (I plan to send v2 RFC soon for that)

> 
> > I see you are talking in LPC about IOMMUFD:
> > https://lore.kernel.org/linux-iommu/0-v1-01fa10580981+1d-iommu_pt_jgg@nvidia.com/T/#m2dbb08f3bf8506a492bc7dda2de662e42371e683
> > 
> > Do you have any plans to talk about this also?
> 
> Nothing specific, this is LPC so if people in the room would like to
> use the session for that then we can talk about it. Last year the room
> wanted to talk about PASID mostly.
> 
> I haven't heard if someone is going to KVM forum to talk about
> vSMMUv3? Eric? Nicolin do you know?

I see, I won’t be in KVM forum, but I plan to attend LPC, we can discuss
further there if people are interested.

Thanks,
Mostafa

> 
> Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-09-03  7:57           ` Mostafa Saleh
@ 2024-09-03 23:33             ` Jason Gunthorpe
  2024-09-10 10:55               ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-03 23:33 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Sep 03, 2024 at 07:57:01AM +0000, Mostafa Saleh wrote:

> Basically, I believe we shouldn’t set FWB blindly just because it’s supported,
> I don’t see how it’s useful for stage-2 only domains.

And the only problem we can see is some niche scenario where incoming
memory attributes that are already requesting cachable combine to a
different kind of cachable?

> And I believe making assumptions about VFIO (which actually is not correctly
> enforced at the moment) is fragile.

VFIO requiring cachable is definitely not fragile, and it also sets
the IOMMU_CACHE flag to indicate this. Revising VFIO to allow
non-cachable would be a significant change and would also change what
IOMMU_CACHE flag it sets.

> and we should only set FWB for coherent
> devices in nested setup only where the VMM(or hypervisor) knows better than
> the VM.

I don't want to touch the 'only coherent devices' question. Last time
I tried to do that I got told every option was wrong.

I would be fine to only enable for nesting parent domains. It is
mandatory here and we definitely don't support non-cachable nesting
today.  Can we agree on that?

Keep in mind SMMU S2FWB is really new and probably very little HW
supports it right now. So we are not breaking anything existing
here. IMHO it is better to always enable the stricter features going
forward, and then evaluate an in-kernel opt-out if someone comes with
a concrete use case.
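
As a rough sketch of the "only for nesting parent domains" option, the
decision could look like the following. The flag names and values are
hypothetical stand-ins made up for this example, not the driver's real
feature bits, and the CANWBS alternative is taken from the cover letter:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define SKETCH_FEAT_S2FWB	(1u << 0)	/* SMMU advertises S2FWB */
#define SKETCH_FEAT_CANWBS	(1u << 1)	/* firmware sets the CANWBS flag */
#define SKETCH_DOM_NEST_PARENT	(1u << 0)	/* S2 domain will sit under a nested S1 */

static_assert((SKETCH_FEAT_S2FWB & SKETCH_FEAT_CANWBS) == 0,
	      "feature bits must not overlap");

/* Nesting needs S2FWB or CANWBS, otherwise the guest could bypass the cache. */
static inline bool nest_parent_allowed(uint32_t features)
{
	return (features & (SKETCH_FEAT_S2FWB | SKETCH_FEAT_CANWBS)) != 0;
}

/* Use S2FWB only when the HW has it and the domain is a nesting parent. */
static inline bool s2_domain_uses_fwb(uint32_t features, uint32_t domain_flags)
{
	return (features & SKETCH_FEAT_S2FWB) &&
	       (domain_flags & SKETCH_DOM_NEST_PARENT);
}
```

With this shape a plain stage-2 only domain keeps today's encoding even on
HW that advertises S2FWB, which I think addresses the concern above.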

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-09-03  8:34           ` Mostafa Saleh
@ 2024-09-03 23:40             ` Jason Gunthorpe
  2024-09-04  7:11               ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-03 23:40 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Sep 03, 2024 at 08:34:17AM +0000, Mostafa Saleh wrote:

> > > For example, KVM doesn’t allow reading the CPU system
> > > registers to know if SVE (or other features) is supported but hides
> > > that by a CAP in KVM_CHECK_EXTENSION
> > 
> > Do you know why?
> 
> I am not really sure, but I believe it’s a useful abstraction

It seems odd to me, unpriv userspace can look in /proc/cpuinfo and see
SVE, why would kvm hide the same information behind a
CAP_SYS_ADMIN/whatever check?

> I don’t have a very strong opinion to sanitise the IDRs (specifically
> many of those are documented anyway per IP), but at least we should have
> some clear requirement for what userspace needs, I am just concerned
> that userspace can misuse some of the features leading to a strange UAPI.

We should probably have a file in Documentation/ that does more
explaining of this.

The design has the kernel be very general, the kernel's scope is
bigger than just providing a vSMMU. It is the VMM's job to take the
kernel tools and build the vSMMU para virtualization.

It is like this because the configuration of the vSMMU is ultimately a
policy choice that should be configured by the operator. When we
consider live migration the vSMMU needs to be fully standardized and
consistent regardless of what the pSMMU is. We don't want policy in
the kernel.
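
As a minimal sketch of what that split could look like on the VMM side:
the operator fixes the vIDR values as policy, and the VMM only checks that
the physical IDRs reported by the kernel can back them. The helper name is
hypothetical, and treating every IDR as a plain bitmask is a
simplification, since the real registers also carry multi-bit fields that
need a per-field comparison:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_IDR 6	/* same count as idr[6] in the proposed hw_info struct */

/*
 * Returns true when every bit the operator asked for in the vIDR is also
 * set in the physical IDR, i.e. this host can back the requested vSMMU.
 * Simplification: each register is treated as a bitmask of feature bits.
 */
static inline bool vidr_backed_by_pidr(const uint32_t vidr[SKETCH_NUM_IDR],
				       const uint32_t pidr[SKETCH_NUM_IDR])
{
	assert(vidr && pidr);
	for (size_t i = 0; i < SKETCH_NUM_IDR; i++) {
		if (vidr[i] & ~pidr[i])
			return false;
	}
	return true;
}
```

A migration target would run the same check against its own pSMMU with
the same vIDR, so the guest-visible vSMMU never changes.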

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-03  9:00           ` Mostafa Saleh
@ 2024-09-03 23:55             ` Jason Gunthorpe
  2024-09-06 11:07               ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-03 23:55 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Sep 03, 2024 at 09:00:32AM +0000, Mostafa Saleh wrote:
> On Mon, Sep 02, 2024 at 09:30:22PM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 02, 2024 at 09:57:45AM +0000, Mostafa Saleh wrote:
> > > > > 2) Is there a reason the UAPI is designed this way?
> > > > > The way I imagined this, is that userspace will pass the pointer to the CD
> > > > > (+ format) not the STE (or part of it).
> > > > 
> > > > Yes, we need more information from the STE than just that. EATS and
> > > > STALL for instance. And the cachability below. Who knows what else in
> > > > the future.
> > > 
> > > But for example if that was extended later, how can user space know
> > > which fields are allowed and which are not?
> > 
> > Changes to the vSTE rules that require userspace to be aware would have
> > to be signaled in the GET_INFO answer. This is the same process no
> > matter how you encode the STE bits in the structure.
>
> How? 

--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -504,6 +504,11 @@ struct iommu_hw_info_vtd {
        __aligned_u64 ecap_reg;
 };
 
+enum {
+       /* The kernel understands field NEW in the STE */
+       IOMMU_HW_INFO_ARM_SMMUV3_VSTE_NEW = 1 << 0,
+};
+
 /**
  * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
  *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
@@ -514,6 +519,7 @@ struct iommu_hw_info_vtd {
  * @iidr: Information about the implementation and implementer of ARM SMMU,
  *        and architecture version supported
  * @aidr: ARM SMMU architecture version
+ * @kernel_capabilities: Bitmask of IOMMU_HW_INFO_ARM_SMMUV3_*
  *
  * For the details of @idr, @iidr and @aidr, please refer to the chapters
  * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
@@ -535,6 +541,7 @@ struct iommu_hw_info_arm_smmuv3 {
        __u32 idr[6];
        __u32 iidr;
        __u32 aidr;
+       __u32 kernel_capabilities;
 };
 
 /**

For example. There are all sorts of rules about 0 filling and things
that make this work trivially for the userspace.
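
Roughly, the userspace side could then look like the sketch below. The
struct is a local stand-in holding only the fields visible in the diff
above, fake_get_hw_info() is a made-up replacement for the real ioctl, and
the sketch assumes the part of the buffer an older kernel does not know
about reads back as zero:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the fields visible in the diff; other fields omitted. */
struct hw_info_sketch {
	uint32_t idr[6];
	uint32_t iidr;
	uint32_t aidr;
	uint32_t kernel_capabilities;	/* new field, unknown to older kernels */
};

#define SKETCH_VSTE_NEW (1u << 0)	/* mirrors IOMMU_HW_INFO_ARM_SMMUV3_VSTE_NEW */

/*
 * Stand-in for the ioctl: the kernel fills only the kernel_len bytes it
 * knows about, and the rest of the caller's buffer reads as zero.
 */
static inline void fake_get_hw_info(struct hw_info_sketch *out,
				    const struct hw_info_sketch *kernel_view,
				    size_t kernel_len)
{
	assert(kernel_len <= sizeof(*out));
	memset(out, 0, sizeof(*out));
	memcpy(out, kernel_view, kernel_len);
}

/* On an older kernel the field stays zero, so the bit reads as unsupported. */
static inline bool vste_new_supported(const struct hw_info_sketch *info)
{
	return (info->kernel_capabilities & SKETCH_VSTE_NEW) != 0;
}
```

The VMM needs no version check: an older kernel simply never sets the bit,
and the VMM keeps the matching feature out of the vIDRs.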

> And why changing that in the future is not a problem as
> sanitising IDRs?

Reporting a static kernel capability through GET_INFO output is
easier/saner than providing some kind of policy flags in the GET_INFO
input to specify how the sanitization should work.

> > This confirmation of kernel support would then be reflected in the
> > vIDRs to the VM and the VM could know to set the extended bits.
> > 
> > Otherwise setting an invalid vSTE will fail the ioctl, the VMM can
> > log the event, generate an event and install an abort vSTE.
> > 
> > > > Overall this sort of direct transparency is how I prefer to see these
> > > > kinds of iommufd HW specific interfaces designed. From a lot of
> > > > experience here, arbitrary marshall/unmarshall is often an
> > > > antipattern :)
> > > 
> > > Is there any documentation for the (proposed) SMMUv3 UAPI for IOMMUFD?
> > 
> > Just the comments in this series?
> 
> But this is a UAPI. How can userspace implement that if it has no
> documentation, and how can it be maintained if there is no clear
> interface with userspace with what is expected/returned...

I'm not sure what you are looking for here? I don't think an entire
tutorial on how to build a paravirtualized vSMMU is appropriate to
put in comments?

The behavior of the vSTE processing as a single feature should be
understandable, and I think it is from the comments and code. If it
isn't, let's improve that.

There is definitely a jump from knowing how these point items work to
knowing how to build a para virtualized vSMMU in your VMM. This is
likely a gap of thousands of lines of code in userspace :\

> But we have a different model, with virtio-iommu, it typically presents
> the device to the VM and on the backend it calls VFIO MAP/UNMAP.

I thought pkvm's model was also map/unmap - so it could support HW
without nesting?

> Although technically we can have virtio-iommu in the hypervisor (EL2),
> that is a lot of complexity and increase in the TCB of pKVM.

That is too bad, it would be nice to not have to do everything new
from scratch to just get to the same outcome. :(

> I haven’t been keeping up with iommufd lately, I will try to spend more
> time on that in the future.
> But my idea is that we would create an IOMMUFD, attach it to a device and then
> through some extra IOCTLs, we can configure some “virtual” topology for it which
> then relies on KVM, again this is very early, and we need to support pKVM IOMMUs
> in the host first (I plan to send v2 RFC soon for that)

Most likely your needs here will match the needs of the confidential
compute people, who are basically doing that same stuff. The way pKVM
wants to operate looks really similar to me to how the confidential
compute stuff wants to work where the VMM is untrusted and operations
are delegated to some kind of secure world.

So, for instance, AMD recently posted patches about how they would
create vPCI devices in their secure world, and there are various
things in the works for secure IOMMUs and so forth all with the
intention of not trusting the VMM, or permitting the VMM to compromise
the VM.

I would *really* like everyone to sit down and figure out how to
manage virtual device lifecycle in a single language!

> > I haven't heard if someone is going to KVM forum to talk about
> > vSMMUv3? Eric? Nicolin do you know?
> 
> I see, I won’t be in KVM forum, but I plan to attend LPC, we can discuss
> further there if people are interested.

Sure, definitely, I look forward to meeting you!

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-09-03 23:40             ` Jason Gunthorpe
@ 2024-09-04  7:11               ` Shameerali Kolothum Thodi
  2024-09-04 12:01                 ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-09-04  7:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Mostafa Saleh
  Cc: acpica-devel@lists.linux.dev, Guohanjun (Hanjun Guo),
	iommu@lists.linux.dev, Joerg Roedel, Kevin Tian,
	kvm@vger.kernel.org, Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 4, 2024 12:40 AM
> To: Mostafa Saleh <smostafa@google.com>
> Cc: acpica-devel@lists.linux.dev; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
> kvm@vger.kernel.org; Len Brown <lenb@kernel.org>; linux-
> acpi@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Lorenzo Pieralisi
> <lpieralisi@kernel.org>; Rafael J. Wysocki <rafael@kernel.org>; Robert
> Moore <robert.moore@intel.com>; Robin Murphy
> <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>; Will
> Deacon <will@kernel.org>; Alex Williamson <alex.williamson@redhat.com>;
> Eric Auger <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-
> philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; Michael Shavit
> <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> Subject: Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Support
> IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
> 
> On Tue, Sep 03, 2024 at 08:34:17AM +0000, Mostafa Saleh wrote:
> 
> > > > For example, KVM doesn’t allow reading the CPU system
> > > > registers to know if SVE (or other features) is supported but hides
> > > > that by a CAP in KVM_CHECK_EXTENSION
> > >
> > > Do you know why?
> >
> > I am not really sure, but I believe it’s a useful abstraction
> 
> It seems odd to me, unpriv userspace can look in /proc/cpuinfo and see
> SVE, why would kvm hide the same information behind a
> CAP_SYS_ADMIN/whatever check?

I don’t think KVM always hides SVE. It also depends on whether the
VMM has requested SVE for a specific Guest or not (QEMU has an option
to turn SVE on/off, and similarly for the PMU). Based on that, KVM
populates the Guest-specific ID registers, and the Guest's /proc/cpuinfo
reflects that.

And for some features, if KVM is not handling the feature properly or
it does not make sense to expose it to the Guest, those features are
masked in the ID registers.

Recently the ARM64 ID registers have been made writable from userspace to
allow the VMM to turn features on/off, so that VMs can be migrated
between hosts that differ in feature support.

https://lore.kernel.org/all/ZR2YfAixZgbCFnb8@linux.dev/T/#m7c2493fd2d43c13a3336d19f2dc06a89803c6fdb

Thanks,
Shameer

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-09-04  7:11               ` Shameerali Kolothum Thodi
@ 2024-09-04 12:01                 ` Jason Gunthorpe
  2024-09-06 11:19                   ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-04 12:01 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: Mostafa Saleh, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches@lists.linux.dev

On Wed, Sep 04, 2024 at 07:11:19AM +0000, Shameerali Kolothum Thodi wrote:

> > On Tue, Sep 03, 2024 at 08:34:17AM +0000, Mostafa Saleh wrote:
> > 
> > > > > For example, KVM doesn’t allow reading the CPU system
> > > > > registers to know if SVE (or other features) is supported but hides
> > > > > that by a CAP in KVM_CHECK_EXTENSION
> > > >
> > > > Do you know why?
> > >
> > > I am not really sure, but I believe it’s a useful abstraction
> > 
> > It seems odd to me, unpriv userspace can look in /proc/cpuinfo and see
> > SVE, why would kvm hide the same information behind a
> > CAP_SYS_ADMIN/whatever check?
> 
> I don’t think KVM hides SVE always. It also depends on whether the
> VMM has requested sve for a specific Guest or not(Qemu has option to
> turn sve on/off, similarly pmu as well).  Based on that KVM
> populates the Guest specific ID registers.  And Guest /proc/cpuinfo
> reflects that.
> 
> And for some features if KVM is not handling the feature properly or
> not making any sense to be exposed to Guest, those features are
> masked in ID registers.
> 
> Recently ARM64 ID registers have been made writable from userspace to
> allow VMM to turn on/off features, so that VMs can be migrated
> between hosts that differ in feature support.
> 
> https://lore.kernel.org/all/ZR2YfAixZgbCFnb8@linux.dev/T/#m7c2493fd2d43c13a3336d19f2dc06a89803c6fdb

I see, so there is a significant difference - in KVM the kernel
controls what ID values the VM observes and in vSMMU the VMM controls
it.

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-08-27 15:51 ` [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available Jason Gunthorpe
                     ` (3 preceding siblings ...)
  2024-08-30 15:12   ` Mostafa Saleh
@ 2024-09-04 14:20   ` Shameerali Kolothum Thodi
  2024-09-04 15:00     ` Jason Gunthorpe
  4 siblings, 1 reply; 95+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-09-04 14:20 UTC (permalink / raw)
  To: Jason Gunthorpe, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon
  Cc: Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Mostafa Saleh



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 27, 2024 4:52 PM
> To: acpica-devel@lists.linux.dev; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>;
> kvm@vger.kernel.org; Len Brown <lenb@kernel.org>; linux-
> acpi@vger.kernel.org; linux-arm-kernel@lists.infradead.org; Lorenzo Pieralisi
> <lpieralisi@kernel.org>; Rafael J. Wysocki <rafael@kernel.org>; Robert
> Moore <robert.moore@intel.com>; Robin Murphy
> <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>; Will
> Deacon <will@kernel.org>
> Cc: Alex Williamson <alex.williamson@redhat.com>; Eric Auger
> <eric.auger@redhat.com>; Jean-Philippe Brucker <jean-
> philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>; Michael Shavit
> <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; Mostafa Saleh
> <smostafa@google.com>
> Subject: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
> 
> Force Write Back (FWB) changes how the S2 IOPTE's MemAttr field
> works. When S2FWB is supported and enabled the IOPTE will force cachable
> access to IOMMU_CACHE memory when nesting with a S1 and deny cachable
> access otherwise.
> 
> When using a single stage of translation, a simple S2 domain, it doesn't
> change anything as it is just a different encoding for the existing mapping
> of the IOMMU protection flags to cachability attributes.
> 
> However, when used with a nested S1, FWB has the effect of preventing the
> guest from choosing a MemAttr in its S1 that would cause ordinary DMA to
> bypass the cache. Consistent with KVM we wish to deny the guest the
> ability to become incoherent with cached memory the hypervisor believes is
> cachable so we don't have to flush it.
> 
> Turn on S2FWB whenever the SMMU supports it and use it for all S2
> mappings.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---

(...)

> @@ -932,7 +948,8 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg
> *cfg, void *cookie)
>  	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS |
>  			    IO_PGTABLE_QUIRK_ARM_TTBR1 |
>  			    IO_PGTABLE_QUIRK_ARM_OUTER_WBWA |
> -			    IO_PGTABLE_QUIRK_ARM_HD))
> +			    IO_PGTABLE_QUIRK_ARM_HD |
> +			    IO_PGTABLE_QUIRK_ARM_S2FWB))
>  		return NULL;

This should be added to arm_64_lpae_alloc_pgtable_s2(), not here.
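
To spell out what I mean, here is a standalone sketch of the two quirk
checks. The SKETCH_* bit values are made up (the real definitions are in
include/linux/io-pgtable.h), and it assumes the stage-2 allocator accepts
no quirk other than S2FWB:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the real ones live in include/linux/io-pgtable.h. */
#define SKETCH_QUIRK_ARM_NS		(1ul << 0)
#define SKETCH_QUIRK_ARM_TTBR1		(1ul << 1)
#define SKETCH_QUIRK_ARM_OUTER_WBWA	(1ul << 2)
#define SKETCH_QUIRK_ARM_HD		(1ul << 3)
#define SKETCH_QUIRK_ARM_S2FWB		(1ul << 4)

/* Stage-1 keeps its existing mask; S2FWB does not belong here. */
#define SKETCH_S1_QUIRKS (SKETCH_QUIRK_ARM_NS | SKETCH_QUIRK_ARM_TTBR1 | \
			  SKETCH_QUIRK_ARM_OUTER_WBWA | SKETCH_QUIRK_ARM_HD)
/* Assumption: S2FWB is the only quirk the stage-2 allocator accepts. */
#define SKETCH_S2_QUIRKS (SKETCH_QUIRK_ARM_S2FWB)

static_assert((SKETCH_S1_QUIRKS & SKETCH_S2_QUIRKS) == 0,
	      "S2FWB is a stage-2 only quirk");

/* Mirrors the "return NULL on unknown quirks" test in the allocators. */
static inline bool s1_quirks_ok(unsigned long quirks)
{
	return !(quirks & ~SKETCH_S1_QUIRKS);
}

static inline bool s2_quirks_ok(unsigned long quirks)
{
	return !(quirks & ~SKETCH_S2_QUIRKS);
}
```

With the hunk as posted, a stage-2 config carrying S2FWB would be rejected
by the stage-2 allocator while the stage-1 allocator would accept a quirk
that has no meaning there.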

With the above fixed, I was able to assign a n/w VF dev to a Guest on a
test hardware that supports S2FWB.

However host kernel has this WARN message:
[ 1546.165105] WARNING: CPU: 5 PID: 7047 at drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:1086 arm_smmu_entry_qword_diff+0x124/0x138
....
[ 1546.330312]  arm_smmu_entry_qword_diff+0x124/0x138
[ 1546.335090]  arm_smmu_write_entry+0x38/0x22c
[ 1546.339346]  arm_smmu_install_ste_for_dev+0x158/0x1ac
[ 1546.344383]  arm_smmu_attach_dev+0x138/0x240
[ 1546.348639]  __iommu_device_set_domain+0x7c/0x11c
[ 1546.353330]  __iommu_group_set_domain_internal+0x60/0x134
[ 1546.358714]  iommu_group_replace_domain+0x3c/0x68
[ 1546.363404]  iommufd_device_do_replace+0x334/0x398
[ 1546.368181]  iommufd_device_change_pt+0x26c/0x650
[ 1546.372871]  iommufd_device_replace+0x18/0x24
[ 1546.377214]  vfio_iommufd_physical_attach_ioas+0x28/0x68
[ 1546.382514]  vfio_df_ioctl_attach_pt+0x98/0x170


And when I tried to use the assigned n/w dev, it seems to do a reset
continuously.

root@localhost:/# ping 150.0.124.42
PING 150.0.124.42 (150.0.124.42): 56 data bytes
64 bytes from 150.0.124.42: seq=0 ttl=64 time=47.648 ms
[ 1395.958630] hns3 0000:c2:00.0 eth1: NETDEV WATCHDOG: CPU: 1: transmit queue 10 timed out 5260 ms
[ 1395.960187] hns3 0000:c2:00.0 eth1: DQL info last_cnt: 42, queued: 42, adj_limit: 0, completed: 0
[ 1395.961758] hns3 0000:c2:00.0 eth1: queue state: 0x6, delta msecs: 5260
[ 1395.962925] hns3 0000:c2:00.0 eth1: tx_timeout count: 1, queue id: 10, SW_NTU: 0x1, SW_NTC: 0x0, napi state: 16
[ 1395.964677] hns3 0000:c2:00.0 eth1: tx_pkts: 0, tx_bytes: 0, sw_err_cnt: 0, tx_pending: 0
[ 1395.966114] hns3 0000:c2:00.0 eth1: seg_pkt_cnt: 0, tx_more: 0, restart_queue: 0, tx_busy: 0
[ 1395.967598] hns3 0000:c2:00.0 eth1: tx_push: 1, tx_mem_doorbell: 0
[ 1395.968687] hns3 0000:c2:00.0 eth1: BD_NUM: 0x7f HW_HEAD: 0x0, HW_TAIL: 0x0, BD_ERR: 0x0, INT: 0x1
[ 1395.970291] hns3 0000:c2:00.0 eth1: RING_EN: 0x1, TC: 0x0, FBD_NUM: 0x0 FBD_OFT: 0x0, EBD_NUM: 0x400, EBD_OFT: 0x0
[ 1395.972134] hns3 0000:c2:00.0: received reset request from VF enet

All this works fine on a hardware without S2FWB though.

Also on this test hardware, it works fine with legacy VFIO assignment.

Not debugged further. Please let me know if you have any hunch.

Thanks,
Shameer





^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-09-04 14:20   ` Shameerali Kolothum Thodi
@ 2024-09-04 15:00     ` Jason Gunthorpe
  2024-09-10 11:25       ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-04 15:00 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: acpica-devel@lists.linux.dev, Guohanjun (Hanjun Guo),
	iommu@lists.linux.dev, Joerg Roedel, Kevin Tian,
	kvm@vger.kernel.org, Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Mostafa Saleh

On Wed, Sep 04, 2024 at 02:20:36PM +0000, Shameerali Kolothum Thodi wrote:

> This should be added to arm_64_lpae_alloc_pgtable_s2(), not here.

Woops! Yes:

-       /* The NS quirk doesn't apply at stage 2 */
-       if (cfg->quirks)
+       if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_S2FWB))
                return NULL;
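
The stage-2 check above is an allow-list: any quirk bit other than S2FWB makes the allocation fail. A minimal stand-alone sketch of that pattern, assuming made-up bit positions and a stripped-down config struct (the names below are hypothetical stand-ins, not the io-pgtable definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit positions, for illustration only. */
#define QUIRK_ARM_NS     (1UL << 0)
#define QUIRK_ARM_S2FWB  (1UL << 8)

/* Stand-in for the page table config; only the quirk mask matters here. */
struct pgtable_cfg {
	unsigned long quirks;
};

/* Stage 2 tolerates S2FWB and nothing else; any other quirk is rejected. */
static bool s2_quirks_supported(const struct pgtable_cfg *cfg)
{
	assert(cfg != NULL);
	return (cfg->quirks & ~QUIRK_ARM_S2FWB) == 0;
}
```

With this shape, a caller requesting only S2FWB (or no quirk) passes, while one requesting the NS quirk at stage 2 is refused, which is what the removed comment used to describe.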

> With the above fixed, I was able to assign a n/w VF dev to a Guest on a
> test hardware that supports S2FWB.

Okay great
 
> However host kernel has this WARN message:
> [ 1546.165105] WARNING: CPU: 5 PID: 7047 at drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:1086 arm_smmu_entry_qword_diff+0x124/0x138
> ....

Yes, my dumb mistake again, thanks for testing

@@ -1009,7 +1009,8 @@ void arm_smmu_get_ste_used(const __le64 *ent, __le64 *used_bits)
        /* S2 translates */
        if (cfg & BIT(1)) {
                used_bits[1] |=
-                       cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
+                       cpu_to_le64(STRTAB_STE_1_S2FWB | STRTAB_STE_1_EATS |
+                                   STRTAB_STE_1_SHCFG);
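
For context, a rough stand-alone sketch of why the missing bit trips the WARN, as I understand the update machinery: the entry writer refuses to see a target bit set that the "used" function does not report. The field positions and helper names below are assumptions for illustration, not the driver's definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative field masks in STE qword 1; treat the positions as assumptions. */
#define STE1_S2FWB (UINT64_C(1) << 25)
#define STE1_EATS  (UINT64_C(3) << 28)
#define STE1_SHCFG (UINT64_C(3) << 44)

static_assert((STE1_S2FWB & (STE1_EATS | STE1_SHCFG)) == 0,
	      "fields must not overlap");

/* Report which bits the hardware is assumed to read when S2 translates. */
static void ste_get_used(bool track_s2fwb, uint64_t used[2])
{
	used[0] = 0;
	used[1] = STE1_EATS | STE1_SHCFG;
	if (track_s2fwb)
		used[1] |= STE1_S2FWB;
}

/* True when the target sets a bit that the used mask does not cover. */
static bool ste_has_untracked_bits(const uint64_t target[2],
				   const uint64_t used[2])
{
	return ((target[0] & ~used[0]) | (target[1] & ~used[1])) != 0;
}
```

A target with S2FWB set is flagged while the used mask omits the bit, and passes once the mask includes it.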

> root@localhost:/# ping 150.0.124.42
> PING 150.0.124.42 (150.0.124.42): 56 data bytes
> 64 bytes from 150.0.124.42: seq=0 ttl=64 time=47.648 ms

So DMA is not totally broken if a packet flowed.

> [ 1395.958630] hns3 0000:c2:00.0 eth1: NETDEV WATCHDOG: CPU: 1: transmit queue 10 timed out 5260 ms

Timeout? Maybe interrupts are not working? Does /proc/interrupts
suggest that? That would point at the ITS mapping

Do you have all of Nicolin's extra patches in this kernel to make the
ITS work with nesting?

From a page table POV, iommu_dma_get_msi_page() has:

	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;

So the ITS page should be:

		if (prot & IOMMU_MMIO) {
			pte |= ARM_LPAE_PTE_MEMATTR_DEV;

Which still looks right under S2FWB unless I've misread the manual?
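
To make the expectation concrete, here is a small sketch of the stage-2 attribute choice as described in the commit message: MMIO stays device memory either way, and only the cachable case changes encoding under S2FWB. The enum, flag values and function name are hypothetical placeholders, not the io-pgtable-arm code:

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder protection flags; the values are assumptions. */
#define PROT_WRITE (1u << 1)
#define PROT_CACHE (1u << 2)
#define PROT_MMIO  (1u << 4)

static_assert((PROT_CACHE & PROT_MMIO) == 0, "flags must be distinct");

/* Symbolic stage-2 memory attributes, standing in for the MemAttr encodings. */
enum s2_memattr {
	S2_ATTR_DEV,    /* device memory */
	S2_ATTR_NC,     /* normal, non-cachable */
	S2_ATTR_OIWB,   /* normal, write-back (non-FWB encoding) */
	S2_ATTR_FWB_WB, /* forced write-back (FWB encoding) */
};

/* Pick the stage-2 attribute for a mapping. */
static enum s2_memattr s2_memattr_for(unsigned int prot, bool s2fwb)
{
	if (prot & PROT_MMIO)
		return S2_ATTR_DEV;
	if (prot & PROT_CACHE)
		return s2fwb ? S2_ATTR_FWB_WB : S2_ATTR_OIWB;
	return S2_ATTR_NC;
}
```

Under this reading the MSI doorbell mapping (write + MMIO) gets the device attribute regardless of whether S2FWB is on.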

> [ 1395.960187] hns3 0000:c2:00.0 eth1: DQL info last_cnt: 42, queued: 42, adj_limit: 0, completed: 0
> [ 1395.961758] hns3 0000:c2:00.0 eth1: queue state: 0x6, delta msecs: 5260
> [ 1395.962925] hns3 0000:c2:00.0 eth1: tx_timeout count: 1, queue id: 10, SW_NTU: 0x1, SW_NTC: 0x0, napi state: 16
> [ 1395.964677] hns3 0000:c2:00.0 eth1: tx_pkts: 0, tx_bytes: 0, sw_err_cnt: 0, tx_pending: 0
> [ 1395.966114] hns3 0000:c2:00.0 eth1: seg_pkt_cnt: 0, tx_more: 0, restart_queue: 0, tx_busy: 0
> [ 1395.967598] hns3 0000:c2:00.0 eth1: tx_push: 1, tx_mem_doorbell: 0
> [ 1395.968687] hns3 0000:c2:00.0 eth1: BD_NUM: 0x7f HW_HEAD: 0x0, HW_TAIL: 0x0, BD_ERR: 0x0, INT: 0x1
> [ 1395.970291] hns3 0000:c2:00.0 eth1: RING_EN: 0x1, TC: 0x0, FBD_NUM: 0x0 FBD_OFT: 0x0, EBD_NUM: 0x400, EBD_OFT: 0x0
> [ 1395.972134] hns3 0000:c2:00.0: received reset request from VF enet
> 
> All this works fine on a hardware without S2FWB though.
> 
> Also on this test hardware, it works fine with legacy VFIO assignment.

So.. Legacy VFIO assignment will use the S1, no nesting and not enable
S2FWB?

Try to isolate if S2FWB is the exact cause by disabling it in the
kernel on this system vs something else wrong?

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-03 23:55             ` Jason Gunthorpe
@ 2024-09-06 11:07               ` Mostafa Saleh
  2024-09-06 13:34                 ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-06 11:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Sep 03, 2024 at 08:55:32PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 03, 2024 at 09:00:32AM +0000, Mostafa Saleh wrote:
> > On Mon, Sep 02, 2024 at 09:30:22PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 02, 2024 at 09:57:45AM +0000, Mostafa Saleh wrote:
> > > > > > 2) Is there a reason the UAPI is designed this way?
> > > > > > The way I imagined this, is that userspace will pass the pointer to the CD
> > > > > > (+ format) not the STE (or part of it).
> > > > > 
> > > > > Yes, we need more information from the STE than just that. EATS and
> > > > > STALL for instance. And the cachability below. Who knows what else in
> > > > > the future.
> > > > 
> > > > But for example if that was extended later, how can user space know
> > > > which fields are allowed and which are not?
> > > 
> > > Changes to the vSTE rules that require userspace being aware would have
> > > to be signaled in the GET_INFO answer. This is the same process no
> > > matter how you encode the STE bits in the structure.
> >
> > How? 
> 
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -504,6 +504,11 @@ struct iommu_hw_info_vtd {
>         __aligned_u64 ecap_reg;
>  };
>  
> +enum {
> +       /* The kernel understand field NEW in the STE */
> +       IOMMU_HW_INFO_ARM_SMMUV3_VSTE_NEW = 1 << 0,
> +};
> +
>  /**
>   * struct iommu_hw_info_arm_smmuv3 - ARM SMMUv3 hardware information
>   *                                   (IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> @@ -514,6 +519,7 @@ struct iommu_hw_info_vtd {
>   * @iidr: Information about the implementation and implementer of ARM SMMU,
>   *        and architecture version supported
>   * @aidr: ARM SMMU architecture version
> + * @kernel_capabilities: Bitmask of IOMMU_HW_INFO_ARM_SMMUV3_*
>   *
>   * For the details of @idr, @iidr and @aidr, please refer to the chapters
>   * from 6.3.1 to 6.3.6 in the SMMUv3 Spec.
> @@ -535,6 +541,7 @@ struct iommu_hw_info_arm_smmuv3 {
>         __u32 idr[6];
>         __u32 iidr;
>         __u32 aidr;
> +       __u32 kernel_capabilities;
>  };
>  
>  /**
> 
> For example. There are all sorts of rules about 0 filling and things
> that make this work trivially for the userspace.
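
On the userspace side, consuming such a bit could look like the sketch below. It assumes the hypothetical kernel_capabilities field and VSTE_NEW flag from the example diff above, and that the VMM zero-fills the buffer before the call so that a kernel unaware of the field leaves it at 0; the struct is a trimmed stand-in, not the real UAPI layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag, mirroring the example diff. */
#define HW_INFO_SMMUV3_VSTE_NEW (UINT32_C(1) << 0)

/* Trimmed stand-in for the GET_INFO answer. */
struct hw_info_smmuv3 {
	uint32_t idr[6];
	uint32_t iidr;
	uint32_t aidr;
	uint32_t kernel_capabilities; /* stays 0 on kernels without the field */
};

/* The VMM may only use field NEW in a vSTE if the kernel advertised it. */
static bool vmm_can_use_vste_new(const struct hw_info_smmuv3 *info)
{
	assert(info != NULL);
	return (info->kernel_capabilities & HW_INFO_SMMUV3_VSTE_NEW) != 0;
}
```

The VMM would then reflect the result in the vIDRs it shows the guest, so the guest never sets a bit the kernel would reject.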

I see, that makes sense to have.
However, I believe the UAPI can be more clear and solid in terms of
what is supported (maybe a typical struct with the CD, and some
extra configs?) I will give it a think.

> 
> > And why changing that in the future is not a problem as
> > sanitising IDRs?
> 
> Reporting a static kernel capability through GET_INFO output is
> easier/saner than providing some kind of policy flags in the GET_INFO
> input to specify how the sanitization should work.

I don’t think it’s “policy”, it’s just giving userspace the minimum
knowledge it needs to create the vSMMU, but again no really strong
opinion about that.

> 
> > > This confirmation of kernel support would then be reflected in the
> > > vIDRs to the VM and the VM could know to set the extended bits.
> > > 
> > > Otherwise setting an invalid vSTE will fail the ioctl, the VMM can
> > > log the event, generate an event and install an abort vSTE.
> > > 
> > > > > Overall this sort of direct transparency is how I prefer to see these
> > > > > kinds of iommufd HW specific interfaces designed. From a lot of
> > > > > experience here, arbitrary marshall/unmarshall is often an
> > > > > antipattern :)
> > > > 
> > > > Is there any documentation for the (proposed) SMMUv3 UAPI for IOMMUFD?
> > > 
> > > Just the comments in this series?
> > 
> > But this is a UAPI. How can userspace implement that if it has no
> > documentation, and how can it be maintained if there is no clear
> > interface with userspace with what is expected/returned...
> 
> I'm not sure what you are looking for here? I don't think an entire
> tutorial on how to build a paravirtualized vSMMU is appropriate to
> put in comments?

Sorry, I don’t think I was clear, I meant actual documentation for
the UAPI, as in RST files for example. If I want to support that
in kvmtool how can I implement it? I think we should have clear
docs for the UAPI with what is exposed from the driver, what are the
possible returns, expected behaviour of abort, bypass in the vSTE...,
it also makes it easier to reason about some of the choices.

> 
> The behavior of the vSTE processing as a single feature should be
> understandable, and I think it is from the comments and code. If it
> isn't, let's improve that.
> 
> There is definitely a jump from knowing how these point items work to
> knowing how to build a para virtualized vSMMU in your VMM. This is
> likely a gap of thousands of lines of code in userspace :\
> 
> > But we have a different model, with virtio-iommu, it typically presents
> > the device to the VM and on the backend it calls VFIO MAP/UNMAP.
> 
> I thought pkvm's model was also map/unmap - so it could support HW
> without nesting?
> 

Yes, it’s map/unmap based, but it has to be implemented in the
hypervisor, it doesn’t rely on VFIO.

Also, I have been looking at nesting recently (but for the host).

>
> > Although technically we can have virtio-iommu in the hypervisor (EL2),
> > that is a lot of complexity and increase in the TCB of pKVM.
> 
> That is too bad, it would be nice to not have to do everything new
> from scratch to just get to the same outcome. :(
> 

Yeah, I agree, yet another new pv interface :/
Although, it’s quite simple as it follows Linux IOMMU semantics with
HVC as transport and no in-memory data structures, queues...

Not implementing virtio in the hypervisor was an initial design choice
as it would be very challenging in terms of reasoning about the TCB.

> > I haven’t been keeping up with iommufd lately, I will try to spend more
> > time on that in the future.
> > But my idea is that we would create an IOMMUFD, attach it to a device and then
> > through some extra IOCTLs, we can configure some “virtual” topology for it which
> > then relies on KVM, again this is very early, and we need to support pKVM IOMMUs
> > in the host first (I plan to send v2 RFC soon for that)
> 
> Most likely your needs here will match the needs of the confidential
> compute people which are basically doing that same stuff. The way pKVM
> wants to operate looks really similar to me to how the confidential
> compute stuff wants to work where the VMM is untrusted and operations
> are delegated to some kind of secure world.
> 

Exactly.

> So, for instance, AMD recently posted patches about how they would
> create vPCI devices in their secure world, and there are various
> things in the works for secure IOMMUs and so forth all with the
> intention of not trusting the VMM, or permitting the VMM to compromise
> the VM.
> 

I have seen those, but didn't get the time to read them

> I would *really* like everyone to sit down and figure out how to
> manage virtual device lifecycle in a single language!

Yes, just like the guest_memfd work. There has been also
some work to unify some of the guest HVC bits:
https://lore.kernel.org/all/20240830130150.8568-1-will@kernel.org/

We should do the same for IOMMUs and IO.

> 
> > > I haven't heard if someone is going to KVM forum to talk about
> > > vSMMUv3? Eric? Nicolin do you know?
> > 
> > I see, I won’t be in KVM forum, but I plan to attend LPC, we can discuss
> > further there if people are interested.
> 
> Sure, definitely, look forward to meeting you!

Great, me too!

Thanks,
Mostafa

> 
> Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 6/8] iommu/arm-smmu-v3: Support IOMMU_GET_HW_INFO via struct arm_smmu_hw_info
  2024-09-04 12:01                 ` Jason Gunthorpe
@ 2024-09-06 11:19                   ` Mostafa Saleh
  0 siblings, 0 replies; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-06 11:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Shameerali Kolothum Thodi, acpica-devel@lists.linux.dev,
	Guohanjun (Hanjun Guo), iommu@lists.linux.dev, Joerg Roedel,
	Kevin Tian, kvm@vger.kernel.org, Len Brown,
	linux-acpi@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit,
	Nicolin Chen, patches@lists.linux.dev

On Wed, Sep 04, 2024 at 09:01:03AM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 04, 2024 at 07:11:19AM +0000, Shameerali Kolothum Thodi wrote:
> 
> > > On Tue, Sep 03, 2024 at 08:34:17AM +0000, Mostafa Saleh wrote:
> > > 
> > > > > > For example, KVM doesn’t allow reading the CPU system
> > > > > > registers to know if SVE(or other features) is supported but hides
> > > > > > that by a CAP in KVM_CHECK_EXTENSION
> > > > >
> > > > > Do you know why?
> > > >
> > > > I am not really sure, but I believe it’s a useful abstraction
> > > 
> > > It seems odd to me, unpriv userspace can look in /proc/cpuinfo and see
> > > SVE, why would kvm hide the same information behind a
> > > CAP_SYS_ADMIN/whatever check?
> > 
> > I don’t think KVM hides SVE always. It also depends on whether the
> > VMM has requested sve for a specific Guest or not(Qemu has option to
> > turn sve on/off, similarly pmu as well).  Based on that KVM
> > populates the Guest specific ID registers.  And Guest /proc/cpuinfo
> > reflects that.
> > 
> > And for some features if KVM is not handling the feature properly or
> > not making any sense to be exposed to Guest, those features are
> > masked in ID registers.
> > 
> > Recently ARM64 ID registers has been made writable from userspace to
> > allow VMM to turn on/off features, so that VMs can be migrated
> > between hosts that differ in feature support.
> > 
> > https://lore.kernel.org/all/ZR2YfAixZgbCFnb8@linux.dev/T/#m7c2493fd2d43c13a3336d19f2dc06a89803c6fdb
> 
> I see, so there is a significant difference - in KVM the kernel
> controls what ID values the VM observes and in vSMMU the VMM controls
> it.

Yes, that’s for guests.

What I meant is that the host sysregs are not read from userspace, which
would be the analogue of reading SMMUv3 IDRs from userspace. Instead the
kernel controls which features are visible to userspace (the VMM), which it
can then enable for guests if it wants, such as SVE, MTE...

Thanks,
Mostafa

> 
> Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-06 11:07               ` Mostafa Saleh
@ 2024-09-06 13:34                 ` Jason Gunthorpe
  2024-09-10 11:12                   ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-06 13:34 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Sep 06, 2024 at 11:07:47AM +0000, Mostafa Saleh wrote:

> However, I believe the UAPI can be more clear and solid in terms of
> what is supported (maybe a typical struct with the CD, and some
> extra configs?) I will give it a think.

I don't think breaking up the STE into fields in another struct is
going to be a big improvement, it adds more code and corner cases to
break up and reassemble it.

#define STRTAB_STE_0_NESTING_ALLOWED                                         \
	cpu_to_le64(STRTAB_STE_0_V | STRTAB_STE_0_CFG | STRTAB_STE_0_S1FMT | \
		    STRTAB_STE_0_S1CTXPTR_MASK | STRTAB_STE_0_S1CDMAX)
#define STRTAB_STE_1_NESTING_ALLOWED                            \
	cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |   \
		    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |   \
		    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_EATS)

It is 11 fields that would need to be recoded, that's a lot. Even if
you say the 3 cache ones are not needed it is still a lot.
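
For comparison, checking a user-provided vSTE against the two allow masks stays small. The sketch below assumes CPU byte order (the driver works on little-endian qwords) and uses illustrative bit positions with made-up names; consult the driver header for the real field definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mask covering bits hi..lo of a 64-bit word. */
#define FIELD(hi, lo) \
	((~UINT64_C(0) >> (63 - (hi))) & ~((UINT64_C(1) << (lo)) - 1))

/* Illustrative positions for the fields named in the masks above. */
#define VSTE0_ALLOWED \
	(FIELD(0, 0) /* V */ | FIELD(3, 1) /* CFG */ | FIELD(5, 4) /* S1FMT */ | \
	 FIELD(51, 6) /* S1CTXPTR */ | FIELD(63, 59) /* S1CDMAX */)
#define VSTE1_ALLOWED \
	(FIELD(1, 0) /* S1DSS */ | FIELD(3, 2) /* S1CIR */ | \
	 FIELD(5, 4) /* S1COR */ | FIELD(7, 6) /* S1CSH */ | \
	 FIELD(27, 27) /* S1STALLD */ | FIELD(29, 28) /* EATS */)

/* Reject any vSTE that sets a bit outside the allowed fields. */
static bool vste_only_allowed_bits(const uint64_t vste[2])
{
	assert(vste != NULL);
	return (vste[0] & ~VSTE0_ALLOWED) == 0 &&
	       (vste[1] & ~VSTE1_ALLOWED) == 0;
}
```

Extending the interface later then amounts to widening a mask and advertising the change, rather than adding and re-marshalling another struct member.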

> > Reporting a static kernel capability through GET_INFO output is
> > easier/saner than providing some kind of policy flags in the GET_INFO
> > input to specify how the sanitization should work.
> 
> I don’t think it’s “policy”, it’s just giving userspace the minimum
> knowledge it needs to create the vSMMU, but again no really strong
> opinion about that.

There is no single "minimum knowledge" though, it depends on what the
VMM is able to support. IMHO once you go over to the "VMM has to
ignore bits it doesn't understand" you may as well just show
everything. Then the kernel side can't be wrong.

If the kernel side can be wrong, then you are back to handshaking
policy because the kernel can't assume that all existing VMMs will not
rely on the kernel to do the masking.

> > > But this is a UAPI. How can userspace implement that if it has no
> > > documentation, and how can it be maintained if there is no clear
> > > interface with userspace with what is expected/returned...
> > 
> > I'm not sure what you are looking for here? I don't think an entire
> > tutorial on how to build a paravirtualized vSMMU is appropriate to
> > put in comments?
> 
> Sorry, I don’t think I was clear, I meant actual documentation for
> the UAPI, as in RST files for example. If I want to support that
> in kvmtool how can I implement it? 

Well, you need thousands of lines of code in kvmtool to build a vIOMMU :)

Nicolin is looking at writing something, let's see.

I think for here we should focus on the comments being succinct but
sufficient to understand what the uAPI does itself.

> > I would *really* like everyone to sit down and figure out how to
> > manage virtual device lifecycle in a single language!
> 
> Yes, just like the guest_memfd work. There has been also
> some work to unify some of the guest HVC bits:
> https://lore.kernel.org/all/20240830130150.8568-1-will@kernel.org/

I think Dan Williams is being ringleader for the PCI side effort on CC

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-08-30 17:04     ` Jason Gunthorpe
  2024-09-02  9:57       ` Mostafa Saleh
@ 2024-09-06 18:28       ` Jason Gunthorpe
  2024-09-06 18:49         ` Nicolin Chen
  1 sibling, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-06 18:28 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Aug 30, 2024 at 02:04:26PM -0300, Jason Gunthorpe wrote:

> Really, this series and that series must be together. We have a patch
> planning issue to sort out here as well, all 27 should go together
> into the same merge window.

I'm thinking strongly about moving the nesting code into
arm-smmuv3-nesting.c and wrapping it all in a kconfig. Similar to
SVA. Now that we see the viommu related code and more it will be a few
hundred lines there.

We'd leave the kconfig off until all of the parts are merged. There
are enough dependent series here that this seems to be the best
compromise.. Embedded cases can turn it off so it is longterm useful.

WDYT?

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-06 18:28       ` Jason Gunthorpe
@ 2024-09-06 18:49         ` Nicolin Chen
  2024-09-06 23:15           ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Nicolin Chen @ 2024-09-06 18:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mostafa Saleh, acpica-devel, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi

On Fri, Sep 06, 2024 at 03:28:31PM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 30, 2024 at 02:04:26PM -0300, Jason Gunthorpe wrote:
> 
> > Really, this series and that series must be together. We have a patch
> > planning issue to sort out here as well, all 27 should go together
> > into the same merge window.
> 
> I'm thinking strongly about moving the nesting code into
> arm-smmuv3-nesting.c and wrapping it all in a kconfig. Similar to
> SVA. Now that we see the viommu related code and more it will be a few
> hundred lines there.

+1 for this. I was thinking of doing that when I started drafting
patches, yet at that time we only had a couple of functions. Now,
with viommu_ops, it has grown.

> We'd leave the kconfig off until all of the parts are merged. There
> are enough dependent series here that this seems to be the best
> compromise.. Embedded cases can turn it off so it is longterm useful.

You mean doing that so as to merge two series separately? I wonder
if somebody might turn it on while the 2nd series isn't merged...

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-06 18:49         ` Nicolin Chen
@ 2024-09-06 23:15           ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-06 23:15 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Mostafa Saleh, acpica-devel, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi

On Fri, Sep 06, 2024 at 11:49:08AM -0700, Nicolin Chen wrote:
> > We'd leave the kconfig off until all of the parts are merged. There
> > are enough dependent series here that this seems to be the best
> > compromise.. Embedded cases can turn it off so it is longterm useful.
> 
> You mean doing that so as to merge two series separately? 

Not just the two series, but the ITS fix too.

> I wonder if somebody might turn it on while the 2nd series isn't
> merged...

As long as distros don't, and I trust them to do a good job.

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-09-03 23:33             ` Jason Gunthorpe
@ 2024-09-10 10:55               ` Mostafa Saleh
  2024-09-10 20:22                 ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-10 10:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Sep 03, 2024 at 08:33:40PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 03, 2024 at 07:57:01AM +0000, Mostafa Saleh wrote:
> 
> > Basically, I believe we shouldn’t set FWB blindly just because it’s supported,
> > I don’t see how it’s useful for stage-2 only domains.
> 
> And the only problem we can see is some niche scenario where incoming
> memory attributes that are already requesting cachable combine to a
> different kind of cachable?

No, it’s not about the niche scenario, as I mentioned I don’t think
we should enable FWB because it just exists. One can argue the opposite,
if S2FWB is no different why enable it?

AFAIU, FWB would be useful in cases where the hypervisor(or VMM) knows
better than the VM, for example some devices MMIO space are emulated so
they are normal memory and it’s more efficient to use memory attributes.

Taking into consideration all the hassle that can happen if non-coherent
devices use the wrong attribute, I’d suggest either set FWB only for
coherent devices (I know it’s not easy to define, but maybe we should?)
or we have a new CAP where the caller is aware of that. But I don’t think
the driver should decide that on behalf of the caller.

> 
> > And I believe making assumptions about VFIO (which actually is not correctly
> > enforced at the moment) is fragile.
> 
> VFIO requiring cachable is definitely not fragile, and it also sets
> the IOMMU_CACHE flag to indicate this. Revising VFIO to allow
> non-cachable would be a significant change and would also change what
> IOMMU_CACHE flag it sets.
> 

I meant the driver shouldn't assume the caller behaviour, if it's VFIO
or something new.

> > and we should only set FWB for coherent
> > devices in nested setup only where the VMM(or hypervisor) knows better than
> > the VM.
> 
> I don't want to touch the 'only coherent devices' question. Last time
> I tried to do that I got told every option was wrong.
> 
> I would be fine to only enable for nesting parent domains. It is
> mandatory here and we definitely don't support non-cachable nesting
> today.  Can we agree on that?
> 
Why is it mandatory?

I think a supporting point for this, is that KVM does the same for
the CPU, where it enables FWB for VMs if supported. I have this on
my list to study if that can be improved. But maybe if we are out
of options that would be a start.

> Keep in mind SMMU S2FWB is really new and probably very little HW
> supports it right now. So we are not breaking anything existing
> here. IMHO it is better to always enable the stricter features going
> forward, and then evaluate an in-kernel opt-out if someone comes with
> a concrete use case.
> 

I agree, it’s unlikely that this breaks existing hardware, but I’d
be concerned if FWB is enabled unconditionally it breaks devices in
the future and we end up restricting it more.

Thanks,
Mostafa
> Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-06 13:34                 ` Jason Gunthorpe
@ 2024-09-10 11:12                   ` Mostafa Saleh
  2024-09-15 21:39                     ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-10 11:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Fri, Sep 06, 2024 at 10:34:44AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 06, 2024 at 11:07:47AM +0000, Mostafa Saleh wrote:
> 
> > However, I believe the UAPI can be more clear and solid in terms of
> > what is supported (maybe a typical struct with the CD, and some
> > extra configs?) I will give it a think.
> 
> I don't think breaking up the STE into fields in another struct is
> going to be a big improvement, it adds more code and corner cases to
> break up and reassemble it.
> 
> #define STRTAB_STE_0_NESTING_ALLOWED                                         \
> 	cpu_to_le64(STRTAB_STE_0_V | STRTAB_STE_0_CFG | STRTAB_STE_0_S1FMT | \
> 		    STRTAB_STE_0_S1CTXPTR_MASK | STRTAB_STE_0_S1CDMAX)
> #define STRTAB_STE_1_NESTING_ALLOWED                            \
> 	cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |   \
> 		    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |   \
> 		    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_EATS)
> 
> It is 11 fields that would need to be recoded, that's a lot. Even if
> you say the 3 cache ones are not needed it is still a lot.

I was thinking of providing higher-level semantics
(no need for caching, valid...), something like:

struct smmu_user_table {
	u64 cd_table;
	u32 smmu_cd_cfg;  /* linear or 2lvl,.... */
	u32 smmu_trans_cfg; /* Translate, bypass, abort */
	u32 dev_feat; /*ATS, STALL, …*/
};

I feel that is a bit more clear for user space? Instead of
partially setting the STE, and it should be easier to extend than
masking the STE.

I’m not opposed to the vSTE, I just feel it's loosely defined,
that's why I was asking for the docs.

> 
> > > Reporting a static kernel capability through GET_INFO output is
> > > easier/saner than providing some kind of policy flags in the GET_INFO
> > > input to specify how the sanitization should work.
> > 
> > I don’t think it’s “policy”, it’s just giving userspace the minimum
> > knowledge it needs to create the vSMMU, but again no really strong
> > opinion about that.
> 
> There is no single "minimum knowledge" though, it depends on what the
> VMM is able to support. IMHO once you go over to the "VMM has to
> ignore bits it doesn't understand" you may as well just show
> everything. Then the kernel side can't be wrong.
> 
> If the kernel side can be wrong, then you are back to handshaking
> policy because the kernel can't assume that all existing VMMs will not
> rely on the kernel to do the masking.
>

I agree it’s tricky, again no strong opinion on that, although I doubt
that a VMM would care about all the SMMU features.

> > > > But this is a UAPI. How can userspace implement that if it has no
> > > > documentation, and how can it be maintained if there is no clear
> > > > interface with userspace with what is expected/returned...
> > > 
> > > I'm not sure what you are looking for here? I don't think an entire
> > > tutorial on how to build a paravirtualized vSMMU is appropriate to
> > > put in comments?
> > 
> > Sorry, I don’t think I was clear, I meant actual documentation for
> > the UAPI, as in RST files for example. If I want to support that
> > in kvmtool how can I implement it? 
> 
> Well, you need thousands of lines of code in kvmtool to build a vIOMMU :)
> 
> Nicolin is looking at writing something, lets see.
> 
> I think for here we should focus on the comments being succinct but
> sufficient to understand what the uAPI does itself.
> 
Actually I think the opposite: UAPI docs are more important here,
especially for the vSTE, as that's how we can compare the code to
what is expected from user-space.

> > > I would *really* like everyone to sit down and figure out how to
> > > manage virtual device lifecycle in a single language!
> > 
> > Yes, just like the guest_memfd work. There has been also
> > some work to unify some of the guest HVC bits:
> > https://lore.kernel.org/all/20240830130150.8568-1-will@kernel.org/
> 
> I think Dan Williams is being ringleader for the PCI side effort on CC

Thanks, I will try to spend some time on the secure VFIO work.

Thanks,
Mostafa

> 
> Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-09-04 15:00     ` Jason Gunthorpe
@ 2024-09-10 11:25       ` Shameerali Kolothum Thodi
  2024-09-11 22:52         ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-09-10 11:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel@lists.linux.dev, Guohanjun (Hanjun Guo),
	iommu@lists.linux.dev, Joerg Roedel, Kevin Tian,
	kvm@vger.kernel.org, Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Mostafa Saleh



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 4, 2024 4:00 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: acpica-devel@lists.linux.dev; Guohanjun (Hanjun Guo)
> <guohanjun@huawei.com>; iommu@lists.linux.dev; Joerg Roedel
> <joro@8bytes.org>; Kevin Tian <kevin.tian@intel.com>; kvm@vger.kernel.org;
> Len Brown <lenb@kernel.org>; linux-acpi@vger.kernel.org; linux-arm-
> kernel@lists.infradead.org; Lorenzo Pieralisi <lpieralisi@kernel.org>; Rafael J.
> Wysocki <rafael@kernel.org>; Robert Moore <robert.moore@intel.com>; Robin
> Murphy <robin.murphy@arm.com>; Sudeep Holla <sudeep.holla@arm.com>;
> Will Deacon <will@kernel.org>; Alex Williamson
> <alex.williamson@redhat.com>; Eric Auger <eric.auger@redhat.com>; Jean-
> Philippe Brucker <jean-philippe@linaro.org>; Moritz Fischer <mdf@kernel.org>;
> Michael Shavit <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev; Mostafa Saleh <smostafa@google.com>
> Subject: Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
> 
> On Wed, Sep 04, 2024 at 02:20:36PM +0000, Shameerali Kolothum Thodi wrote:
> 
> > This should be added to arm_64_lpae_alloc_pgtable_s2(), not here.
> 
> Woops! Yes:
> 
> -       /* The NS quirk doesn't apply at stage 2 */
> -       if (cfg->quirks)
> +       if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_S2FWB))
>                 return NULL;
> 
> > With the above fixed, I was able to assign a n/w VF dev to a Guest on
> > a test hardware that supports S2FWB.
> 
> Okay great
> 
> > However host kernel has this WARN message:
> > [ 1546.165105] WARNING: CPU: 5 PID: 7047 at
> > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:1086
> > arm_smmu_entry_qword_diff+0x124/0x138
> > ....
> 
> Yes, my dumb mistake again, thanks for testing
> 
> @@ -1009,7 +1009,8 @@ void arm_smmu_get_ste_used(const __le64 *ent,
> __le64 *used_bits)
>         /* S2 translates */
>         if (cfg & BIT(1)) {
>                 used_bits[1] |=
> -                       cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
> +                       cpu_to_le64(STRTAB_STE_1_S2FWB | STRTAB_STE_1_EATS |
> +                                   STRTAB_STE_1_SHCFG);
> 
> > root@localhost:/# ping 150.0.124.42
> > PING 150.0.124.42 (150.0.124.42): 56 data bytes
> > 64 bytes from 150.0.124.42: seq=0 ttl=64 time=47.648 ms
> 
> So DMA is not totally broken if a packet flowed.
> 
> > [ 1395.958630] hns3 0000:c2:00.0 eth1: NETDEV WATCHDOG: CPU: 1:
> > transmit queue 10 timed out 5260 ms
> 
> Timeout? Maybe interrupts are not working? Does /proc/interrupts suggest
> that? That would point at the ITS mapping

Interrupts seem to be OK in this case, as I can see the counts in /proc/interrupts increasing.

> Do you have all of Nicolin's extra patches in this kernel to make the ITS work
> with nesting?

Yes. I am using his
 https://github.com/nicolinc/iommufd/commits/iommufd_viommu_p1-v2/

> From a page table POV, iommu_dma_get_msi_page() has:
> 
> 	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
> 
> So the ITS page should be:
> 
> 		if (prot & IOMMU_MMIO) {
> 			pte |= ARM_LPAE_PTE_MEMATTR_DEV;
> 
> Which still looks right under S2FWB unless I've misread the manual?
> 
> > [ 1395.960187] hns3 0000:c2:00.0 eth1: DQL info last_cnt: 42, queued:
> > 42, adj_limit: 0, completed: 0 [ 1395.961758] hns3 0000:c2:00.0 eth1:
> > queue state: 0x6, delta msecs: 5260 [ 1395.962925] hns3 0000:c2:00.0
> > eth1: tx_timeout count: 1, queue id: 10, SW_NTU: 0x1, SW_NTC: 0x0,
> > napi state: 16 [ 1395.964677] hns3 0000:c2:00.0 eth1: tx_pkts: 0,
> > tx_bytes: 0, sw_err_cnt: 0, tx_pending: 0 [ 1395.966114] hns3
> > 0000:c2:00.0 eth1: seg_pkt_cnt: 0, tx_more: 0, restart_queue: 0,
> > tx_busy: 0 [ 1395.967598] hns3 0000:c2:00.0 eth1: tx_push: 1,
> > tx_mem_doorbell: 0 [ 1395.968687] hns3 0000:c2:00.0 eth1: BD_NUM: 0x7f
> > HW_HEAD: 0x0, HW_TAIL: 0x0, BD_ERR: 0x0, INT: 0x1 [ 1395.970291] hns3
> > 0000:c2:00.0 eth1: RING_EN: 0x1, TC: 0x0, FBD_NUM: 0x0 FBD_OFT: 0x0,
> > EBD_NUM: 0x400, EBD_OFT: 0x0 [ 1395.972134] hns3 0000:c2:00.0:
> > received reset request from VF enet
> >
> > All this works fine on a hardware without S2FWB though.
> >
> > Also on this test hardware, it works fine with legacy VFIO assignment.
> 
> So.. Legacy VFIO assignment will use the S1, no nesting and not enable S2FWB?

Yes S1
 
> Try to isolate if S2FWB is the exact cause by disabling it in the kernel on this
> system vs something else wrong?

It does not look related to S2FWB. I tried commenting out S2FWB and the issue is
still there. It is probably something related to this test setup.

Thanks,
Shameer


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-09-10 10:55               ` Mostafa Saleh
@ 2024-09-10 20:22                 ` Jason Gunthorpe
  2024-09-17  9:48                   ` Mostafa Saleh
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-10 20:22 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Sep 10, 2024 at 10:55:51AM +0000, Mostafa Saleh wrote:
> On Tue, Sep 03, 2024 at 08:33:40PM -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 03, 2024 at 07:57:01AM +0000, Mostafa Saleh wrote:
> > 
> > > Basically, I believe we shouldn’t set FWB blindly just because it’s supported,
> > > I don’t see how it’s useful for stage-2 only domains.
> > 
> > And the only problem we can see is some niche scenario where incoming
> > memory attributes that are already requesting cachable combine to a
> > different kind of cachable?
> 
> No, it’s not about the niche scenario, as I mentioned I don’t think
> we should enable FWB because it just exists. One can argue the opposite,
> if S2FWB is no different why enable it?

Well, I'd argue that it provides more certainty for the kernel that
the DMA API behavior is matched by HW behavior. But I don't feel strongly.

I adjusted the patch to only enable it for nesting parents.

> AFAIU, FWB would be useful in cases where the hypervisor(or VMM) knows
> better than the VM, for example some devices MMIO space are emulated so
> they are normal memory and it’s more efficient to use memory attributes.

Not quite, the purpose of FWB is to allow the hypervisor to avoid
costly cache flushing. It is specifically to protect the hypervisor
against a VM causing the caches to go incoherent.

Caches that are unexpectedly incoherent are a security problem for the
hypervisor.

> > > and we should only set FWB for coherent
> > > devices in nested setup only where the VMM(or hypervisor) knows better than
> > > the VM.
> > 
> > I don't want to touch the 'only coherent devices' question. Last time
> > I tried to do that I got told every option was wrong.
> > 
> > I would be fine to only enable for nesting parent domains. It is
> > mandatory here and we definitely don't support non-cachable nesting
> > today.  Can we agree on that?
> 
> Why is it mandatory?

Because iommufd/vfio doesn't have cache flushing.
 
> I think a supporting point for this, is that KVM does the same for
> the CPU, where it enables FWB for VMs if supported. I have this on
> my list to study if that can be improved. But may be if we are out
> of options that would be a start.

When KVM turns on S2FWB it stops doing cache flushing. As I understand
it, S2FWB is largely a performance optimization.

On the VFIO side we don't have cache flushing at all. So enforcing
cache consistency is mandatory for security.

For native VFIO we set IOMMU_CACHE and expect that the contract with
the IOMMU is that no cache flushing is required.

For nested we set S2FWB/CANWBS to prevent the VM from disabling VFIO's
IOMMU_CACHE and again the contract with the HW is that no cache
flushing is required.

Thus VFIO is security correct even though it doesn't cache flush.
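
As a rough illustration of the rule above (not the driver's code), the
nesting-parent policy can be sketched in plain C. The struct and function
names here are hypothetical; only the S2FWB/CANWBS requirement and the
"S2FWB only for nesting parents" choice come from this thread:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability summary for one SMMU instance. */
struct smmu_caps {
	bool s2fwb;   /* hardware supports stage-2 forced write-back */
	bool canwbs;  /* firmware (ACPI) reports the CANWBS flag */
};

/*
 * Decide whether a nesting parent domain may be created.
 * Nesting is refused unless S2FWB or CANWBS is available, because
 * there is no cache flushing to fall back on. *use_s2fwb reports
 * whether the stage-2 table should be built with S2FWB enabled.
 */
static bool nest_parent_allowed(const struct smmu_caps *caps, bool *use_s2fwb)
{
	assert(caps && use_s2fwb);

	*use_s2fwb = caps->s2fwb;
	return caps->s2fwb || caps->canwbs;
}
```

A system with neither capability gets no nesting parent at all, which is
the "mandatory" part of the argument.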

None of this has anything to do with device coherence capability. It
is why I keep saying incoherent devices must be blocked from VFIO
because it cannot operate them securely/correctly.

Fixing that is a whole other topic, Yi has a series for it on x86 at
least..

Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-09-10 11:25       ` Shameerali Kolothum Thodi
@ 2024-09-11 22:52         ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-11 22:52 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: acpica-devel@lists.linux.dev, Guohanjun (Hanjun Guo),
	iommu@lists.linux.dev, Joerg Roedel, Kevin Tian,
	kvm@vger.kernel.org, Len Brown, linux-acpi@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches@lists.linux.dev, Mostafa Saleh

On Tue, Sep 10, 2024 at 11:25:55AM +0000, Shameerali Kolothum Thodi wrote:
> > Try to isolate if S2FWB is the exact cause by disabling it in the kernel on this
> > system vs something else wrong?
> 
> It does not look related to S2FWB. I tried commenting out S2FWB and the issue is
> still there. It is probably something related to this test setup.

Okay, so not these patches. You managed to get some DMA through the
S2FWB so that is encouraging.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-27 21:31 ` [PATCH v2 0/8] Initial support for SMMUv3 nested translation Nicolin Chen
  2024-08-28 16:31   ` Shameerali Kolothum Thodi
@ 2024-09-12  3:42   ` Zhangfei Gao
  2024-09-12  4:05     ` Nicolin Chen
  2024-09-12  4:25     ` Baolu Lu
  1 sibling, 2 replies; 95+ messages in thread
From: Zhangfei Gao @ 2024-09-12  3:42 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Jason Gunthorpe, acpica-devel, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

Hi, Nico

On Wed, 28 Aug 2024 at 05:32, Nicolin Chen <nicolinc@nvidia.com> wrote:
>
> On Tue, Aug 27, 2024 at 12:51:30PM -0300, Jason Gunthorpe wrote:
> > This brings support for the IOMMFD ioctls:
> >
> >  - IOMMU_GET_HW_INFO
> >  - IOMMU_HWPT_ALLOC_NEST_PARENT
> >  - IOMMU_DOMAIN_NESTED
> >  - ops->enforce_cache_coherency()
> >
> > This is quite straightforward as the nested STE can just be built in the
> > special NESTED domain op and fed through the generic update machinery.
> >
> > The design allows the user provided STE fragment to control several
> > aspects of the translation, including putting the STE into a "virtual
> > bypass" or a aborting state. This duplicates functionality available by
> > other means, but it allows trivially preserving the VMID in the STE as we
> > eventually move towards the VIOMMU owning the VMID.
> >
> > Nesting support requires the system to either support S2FWB or the
> > stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> > cache and view incoherent data, currently VFIO lacks any cache flushing
> > that would make this safe.
> >
> > Yan has a series to add some of the needed infrastructure for VFIO cache
> > flushing here:
> >
> >  https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/
> >
> > Which may someday allow relaxing this further.
> >
> > Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and superseded by
> > this.
> >
> > This is the first series in what will be several to complete nesting
> > support. At least:
> >  - IOMMU_RESV_SW_MSI related fixups
> >     https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
> >  - VIOMMU object support to allow ATS and CD invalidations
> >     https://lore.kernel.org/linux-iommu/cover.1723061377.git.nicolinc@nvidia.com/
> >  - vCMDQ hypervisor support for direct invalidation queue assignment
> >     https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
> >  - KVM pinned VMID using VIOMMU for vBTM
> >     https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
> >  - Cross instance S2 sharing
> >  - Virtual Machine Structure using VIOMMU (for vMPAM?)
> >  - Fault forwarding support through IOMMUFD's fault fd for vSVA
> >
> > The VIOMMU series is essential to allow the invalidations to be processed
> > for the CD as well.
> >
> > It is enough to allow qemu work to progress.
> >
> > This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
> >
> > v2:
>
> As mentioned above, the VIOMMU series would be required to test
> the entire nesting feature, which now has a v2 rebasing on this
> series. I tested it with a pairing QEMU branch. Please refer to:
> https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
> Also, there is another new VIRQ series on top of the VIOMMU one
> and this nesting series. And I tested it too. Please refer to:
> https://lore.kernel.org/linux-iommu/cover.1724777091.git.nicolinc@nvidia.com/
>
> With that,
>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
>
Have you tested user page faults?

I hit an issue: when a user page fault happens,
 group->attach_handle = iommu_attach_handle_get(pasid)
returns NULL.

I am a bit confused here, as I only find IOMMU_NO_PASID being used when
attaching:

 __fault_domain_replace_dev
ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
&handle->handle);
curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);

I cannot find where the code attaches the user PASID with the attach_handle.

Thanks


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-09-12  3:42   ` Zhangfei Gao
@ 2024-09-12  4:05     ` Nicolin Chen
  2024-09-12  4:25     ` Baolu Lu
  1 sibling, 0 replies; 95+ messages in thread
From: Nicolin Chen @ 2024-09-12  4:05 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: Jason Gunthorpe, acpica-devel, Hanjun Guo, iommu, Joerg Roedel,
	Kevin Tian, kvm, Len Brown, linux-acpi, linux-arm-kernel,
	Lorenzo Pieralisi, Rafael J. Wysocki, Robert Moore, Robin Murphy,
	Sudeep Holla, Will Deacon, Alex Williamson, Eric Auger,
	Jean-Philippe Brucker, Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Thu, Sep 12, 2024 at 11:42:43AM +0800, Zhangfei Gao wrote:
> > > The VIOMMU series is essential to allow the invalidations to be processed
> > > for the CD as well.
> > >
> > > It is enough to allow qemu work to progress.
> > >
> > > This is on github: https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
> > >
> > > v2:
> >
> > As mentioned above, the VIOMMU series would be required to test
> > the entire nesting feature, which now has a v2 rebasing on this
> > series. I tested it with a pairing QEMU branch. Please refer to:
> > https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
> > Also, there is another new VIRQ series on top of the VIOMMU one
> > and this nesting series. And I tested it too. Please refer to:
> > https://lore.kernel.org/linux-iommu/cover.1724777091.git.nicolinc@nvidia.com/
> >
> > With that,
> >
> > Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> >
> Have you tested the user page fault?

No, I don't have HW to test PRI, so I have little experience with
IOPF and its counterpart in QEMU. I recall that Shameer has a
series of changes in QEMU.

> I got an issue, when a user page fault happens,
>  group->attach_handle = iommu_attach_handle_get(pasid)
> return NULL.
> 
> A bit confused here, only find IOMMU_NO_PASID is used when attaching
> 
>  __fault_domain_replace_dev
> ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> &handle->handle);
> curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> 
> not find where the code attach user pasid with the attach_handle.

In the SMMUv3 case (the latest design), a DOMAIN_NESTED is a CD/PASID
table. So one single attach_handle (domain/idev) should be enough
to cover the entire thing. Please refer to:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/io-pgfault.c?h=v6.11-rc7#n127

Thanks
Nicolin


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-09-12  3:42   ` Zhangfei Gao
  2024-09-12  4:05     ` Nicolin Chen
@ 2024-09-12  4:25     ` Baolu Lu
  2024-09-12  7:32       ` Zhangfei Gao
  2024-10-15  3:21       ` Zhangfei Gao
  1 sibling, 2 replies; 95+ messages in thread
From: Baolu Lu @ 2024-09-12  4:25 UTC (permalink / raw)
  To: Zhangfei Gao, Nicolin Chen
  Cc: baolu.lu, Jason Gunthorpe, acpica-devel, Hanjun Guo, iommu,
	Joerg Roedel, Kevin Tian, kvm, Len Brown, linux-acpi,
	linux-arm-kernel, Lorenzo Pieralisi, Rafael J. Wysocki,
	Robert Moore, Robin Murphy, Sudeep Holla, Will Deacon,
	Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On 9/12/24 11:42 AM, Zhangfei Gao wrote:
> On Wed, 28 Aug 2024 at 05:32, Nicolin Chen<nicolinc@nvidia.com>  wrote:
>> On Tue, Aug 27, 2024 at 12:51:30PM -0300, Jason Gunthorpe wrote:
>>> This brings support for the IOMMFD ioctls:
>>>
>>>   - IOMMU_GET_HW_INFO
>>>   - IOMMU_HWPT_ALLOC_NEST_PARENT
>>>   - IOMMU_DOMAIN_NESTED
>>>   - ops->enforce_cache_coherency()
>>>
>>> This is quite straightforward as the nested STE can just be built in the
>>> special NESTED domain op and fed through the generic update machinery.
>>>
>>> The design allows the user provided STE fragment to control several
>>> aspects of the translation, including putting the STE into a "virtual
>>> bypass" or a aborting state. This duplicates functionality available by
>>> other means, but it allows trivially preserving the VMID in the STE as we
>>> eventually move towards the VIOMMU owning the VMID.
>>>
>>> Nesting support requires the system to either support S2FWB or the
>>> stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
>>> cache and view incoherent data, currently VFIO lacks any cache flushing
>>> that would make this safe.
>>>
>>> Yan has a series to add some of the needed infrastructure for VFIO cache
>>> flushing here:
>>>
>>>   https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/
>>>
>>> Which may someday allow relaxing this further.
>>>
>>> Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and superseded by
>>> this.
>>>
>>> This is the first series in what will be several to complete nesting
>>> support. At least:
>>>   - IOMMU_RESV_SW_MSI related fixups
>>>      https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
>>>   - VIOMMU object support to allow ATS and CD invalidations
>>>      https://lore.kernel.org/linux-iommu/cover.1723061377.git.nicolinc@nvidia.com/
>>>   - vCMDQ hypervisor support for direct invalidation queue assignment
>>>      https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
>>>   - KVM pinned VMID using VIOMMU for vBTM
>>>      https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
>>>   - Cross instance S2 sharing
>>>   - Virtual Machine Structure using VIOMMU (for vMPAM?)
>>>   - Fault forwarding support through IOMMUFD's fault fd for vSVA
>>>
>>> The VIOMMU series is essential to allow the invalidations to be processed
>>> for the CD as well.
>>>
>>> It is enough to allow qemu work to progress.
>>>
>>> This is on github:https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
>>>
>>> v2:
>> As mentioned above, the VIOMMU series would be required to test
>> the entire nesting feature, which now has a v2 rebasing on this
>> series. I tested it with a pairing QEMU branch. Please refer to:
>> https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
>> Also, there is another new VIRQ series on top of the VIOMMU one
>> and this nesting series. And I tested it too. Please refer to:
>> https://lore.kernel.org/linux-iommu/cover.1724777091.git.nicolinc@nvidia.com/
>>
>> With that,
>>
>> Tested-by: Nicolin Chen<nicolinc@nvidia.com>
>>
> Have you tested the user page fault?
> 
> I got an issue, when a user page fault happens,
>   group->attach_handle = iommu_attach_handle_get(pasid)
> return NULL.
> 
> A bit confused here, only find IOMMU_NO_PASID is used when attaching
> 
>   __fault_domain_replace_dev
> ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> &handle->handle);
> curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> 
> not find where the code attach user pasid with the attach_handle.

Have you set iommu_ops::user_pasid_table for the SMMUv3 driver?

Thanks,
baolu


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-09-12  4:25     ` Baolu Lu
@ 2024-09-12  7:32       ` Zhangfei Gao
  2024-10-15  3:21       ` Zhangfei Gao
  1 sibling, 0 replies; 95+ messages in thread
From: Zhangfei Gao @ 2024-09-12  7:32 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Nicolin Chen, Jason Gunthorpe, acpica-devel, Hanjun Guo, iommu,
	Joerg Roedel, Kevin Tian, kvm, Len Brown, linux-acpi,
	linux-arm-kernel, Lorenzo Pieralisi, Rafael J. Wysocki,
	Robert Moore, Robin Murphy, Sudeep Holla, Will Deacon,
	Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 9/12/24 11:42 AM, Zhangfei Gao wrote:
> > On Wed, 28 Aug 2024 at 05:32, Nicolin Chen<nicolinc@nvidia.com>  wrote:
> >> On Tue, Aug 27, 2024 at 12:51:30PM -0300, Jason Gunthorpe wrote:
> >>> This brings support for the IOMMFD ioctls:
> >>>
> >>>   - IOMMU_GET_HW_INFO
> >>>   - IOMMU_HWPT_ALLOC_NEST_PARENT
> >>>   - IOMMU_DOMAIN_NESTED
> >>>   - ops->enforce_cache_coherency()
> >>>
> >>> This is quite straightforward as the nested STE can just be built in the
> >>> special NESTED domain op and fed through the generic update machinery.
> >>>
> >>> The design allows the user provided STE fragment to control several
> >>> aspects of the translation, including putting the STE into a "virtual
> >>> bypass" or a aborting state. This duplicates functionality available by
> >>> other means, but it allows trivially preserving the VMID in the STE as we
> >>> eventually move towards the VIOMMU owning the VMID.
> >>>
> >>> Nesting support requires the system to either support S2FWB or the
> >>> stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> >>> cache and view incoherent data, currently VFIO lacks any cache flushing
> >>> that would make this safe.
> >>>
> >>> Yan has a series to add some of the needed infrastructure for VFIO cache
> >>> flushing here:
> >>>
> >>>   https://lore.kernel.org/linux-iommu/20240507061802.20184-1-yan.y.zhao@intel.com/
> >>>
> >>> Which may someday allow relaxing this further.
> >>>
> >>> Remove VFIO_TYPE1_NESTING_IOMMU since it was never used and superseded by
> >>> this.
> >>>
> >>> This is the first series in what will be several to complete nesting
> >>> support. At least:
> >>>   - IOMMU_RESV_SW_MSI related fixups
> >>>      https://lore.kernel.org/linux-iommu/cover.1722644866.git.nicolinc@nvidia.com/
> >>>   - VIOMMU object support to allow ATS and CD invalidations
> >>>      https://lore.kernel.org/linux-iommu/cover.1723061377.git.nicolinc@nvidia.com/
> >>>   - vCMDQ hypervisor support for direct invalidation queue assignment
> >>>      https://lore.kernel.org/linux-iommu/cover.1712978212.git.nicolinc@nvidia.com/
> >>>   - KVM pinned VMID using VIOMMU for vBTM
> >>>      https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/
> >>>   - Cross instance S2 sharing
> >>>   - Virtual Machine Structure using VIOMMU (for vMPAM?)
> >>>   - Fault forwarding support through IOMMUFD's fault fd for vSVA
> >>>
> >>> The VIOMMU series is essential to allow the invalidations to be processed
> >>> for the CD as well.
> >>>
> >>> It is enough to allow qemu work to progress.
> >>>
> >>> This is on github:https://github.com/jgunthorpe/linux/commits/smmuv3_nesting
> >>>
> >>> v2:
> >> As mentioned above, the VIOMMU series would be required to test
> >> the entire nesting feature, which now has a v2 rebasing on this
> >> series. I tested it with a pairing QEMU branch. Please refer to:
> >> https://lore.kernel.org/linux-iommu/cover.1724776335.git.nicolinc@nvidia.com/
> >> Also, there is another new VIRQ series on top of the VIOMMU one
> >> and this nesting series. And I tested it too. Please refer to:
> >> https://lore.kernel.org/linux-iommu/cover.1724777091.git.nicolinc@nvidia.com/
> >>
> >> With that,
> >>
> >> Tested-by: Nicolin Chen<nicolinc@nvidia.com>
> >>
> > Have you tested the user page fault?
> >
> > I got an issue, when a user page fault happens,
> >   group->attach_handle = iommu_attach_handle_get(pasid)
> > return NULL.
> >
> > A bit confused here, only find IOMMU_NO_PASID is used when attaching
> >
> >   __fault_domain_replace_dev
> > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > &handle->handle);
> > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> >
> > not find where the code attach user pasid with the attach_handle.
>
> Have you set iommu_ops::user_pasid_table for SMMUv3 driver?

Thanks Baolu, Nico

Yes, after adding this to arm_smmu_ops = {
+       .user_pasid_table       = 1,

find_fault_handler can reach attach_handle =
iommu_attach_handle_get(IOMMU_NO_PASID);
qemu handler also gets called.
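
For reference, the lookup behaviour described here can be modelled in a
few lines of user-space C. This is only a sketch of the fallback as I
understand it, not the code in io-pgfault.c: the table, the struct names
and handle_get() are made up for illustration, and the flag lives on the
toy group only to keep the example small:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NO_PASID  0u   /* stands in for IOMMU_NO_PASID */
#define MAX_PASID 8u   /* arbitrary size for this toy table */

struct attach_handle { int id; };

/* Hypothetical stand-in for the group's pasid_array plus the ops flag. */
struct group {
	struct attach_handle *pasid_array[MAX_PASID];
	bool user_pasid_table;  /* mirrors iommu_ops::user_pasid_table */
};

static struct attach_handle *handle_get(const struct group *g, uint32_t pasid)
{
	return pasid < MAX_PASID ? g->pasid_array[pasid] : NULL;
}

/*
 * Look up the handle for a faulting PASID. If nothing is attached at
 * that PASID and the driver says the PASID table is owned by user
 * space, fall back to the single handle stored at NO_PASID.
 */
static struct attach_handle *find_fault_handle(const struct group *g,
					       uint32_t pasid)
{
	struct attach_handle *h;

	assert(g);
	h = handle_get(g, pasid);
	if (!h && g->user_pasid_table)
		h = handle_get(g, NO_PASID);
	return h;
}
```

Without the flag the lookup for a guest PASID finds nothing, which matches
the NULL return seen before the change.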

But the hardware then reports errors and needs a reset; I am still checking this.
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0

Thanks


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH v2 8/8] iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
  2024-09-10 11:12                   ` Mostafa Saleh
@ 2024-09-15 21:39                     ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-09-15 21:39 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Sep 10, 2024 at 11:12:20AM +0000, Mostafa Saleh wrote:
> On Fri, Sep 06, 2024 at 10:34:44AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 06, 2024 at 11:07:47AM +0000, Mostafa Saleh wrote:
> > 
> > > However, I believe the UAPI can be more clear and solid in terms of
> > > what is supported (maybe a typical struct with the CD, and some
> > > extra configs?) I will give it a think.
> > 
> > I don't think breaking up the STE into fields in another struct is
> > going to be a big improvement, it adds more code and corner cases to
> > break up and reassemble it.
> > 
> > #define STRTAB_STE_0_NESTING_ALLOWED                                         \
> > 	cpu_to_le64(STRTAB_STE_0_V | STRTAB_STE_0_CFG | STRTAB_STE_0_S1FMT | \
> > 		    STRTAB_STE_0_S1CTXPTR_MASK | STRTAB_STE_0_S1CDMAX)
> > #define STRTAB_STE_1_NESTING_ALLOWED                            \
> > 	cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |   \
> > 		    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |   \
> > 		    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_EATS)
> > 
> > It is 11 fields that would need to be recoded, that's a lot.. Even if
> > you say the 3 cache ones are not needed it is still a lot.
> 
> I was thinking of providing a higher level semantics
> (no need for caching, valid...), something like:

Well, that isn't higher level semantics, really, it is just splitting
up the existing fields.

We do need to do something with valid as well as the VM can create a
non-valid STE and we still have to wrap a nesting domain around it to
ensure that event routing can work.

> struct smmu_user_table {
> 	u64 cd_table;
> 	u32 smmu_cd_cfg;  /* linear or 2lvl,.... */
> 	u32 smmu_trans_cfg; /* Translate, bypass, abort */
> 	u32 dev_feat; /*ATS, STALL, …*/
> };
> 
> I feel that is a bit more clear for user space?

Having done these sorts of interfaces over a long time, I believe it is
not. Deviating from the native HW format and re-marshalling into
something else is error prone and can become a problem when the
transformation from the well known HW format to the intermediate
format becomes a source of confusion too.

> Instead of partially setting the STE, and it should be easier to
> extend than masking the STE.

It is not going to partially set, it is going to validate a mask from
the original vSTE and if the mask fails then it will create a
non-valid STE instead.
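
A minimal sketch of that validate-or-invalidate step, in user-space C.
The mask values and names below are placeholders chosen for the example,
not the real STRTAB_STE_*_NESTING_ALLOWED definitions, and endianness
handling is left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical allowed-bit masks; the driver derives the real ones
 * from the STRTAB_STE_* field definitions. */
#define VSTE0_V        (UINT64_C(1) << 0)
#define VSTE0_ALLOWED  UINT64_C(0x00000000ffffffff)
#define VSTE1_ALLOWED  UINT64_C(0x000000000000ffff)

/*
 * Accept the guest's two-qword vSTE fragment only if it uses allowed
 * bits. Otherwise produce an all-zero fragment, i.e. V=0 (non-valid).
 * Returns true when the guest fragment was taken as-is.
 */
static bool sanitize_vste(const uint64_t vste[2], uint64_t out[2])
{
	assert(vste && out);

	if ((vste[0] & ~VSTE0_ALLOWED) || (vste[1] & ~VSTE1_ALLOWED)) {
		out[0] = 0;
		out[1] = 0;
		return false;
	}
	out[0] = vste[0];
	out[1] = vste[1];
	return true;
}
```

The point is that nothing is "partially set": the fragment is either
passed through unchanged or replaced wholesale.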

We can't eliminate the mask because the VMM needs to mask and check
always no matter what the kernel interface is.

One option for the VMM is to just pass the vSTE entirely to the kernel
and let it validate it. If validation fails then use a V=0 STE
instead.

> I'm not opposed to the vSTE, I just feel it's loosely defined,
> that's why I was asking for the docs.

The kdoc lists all the fields and it is reflected directly to HW, and
there is a bitmask above being very explicit about what bits are
allowed. Where is the loose definition you see?

If we broaden the mask down the road then we'd need some feature bits
to inform the VMM that the kernel supports a wider vSTE mask.

Thanks,
Jason



* Re: [PATCH v2 2/8] iommu/arm-smmu-v3: Use S2FWB when available
  2024-09-10 20:22                 ` Jason Gunthorpe
@ 2024-09-17  9:48                   ` Mostafa Saleh
  0 siblings, 0 replies; 95+ messages in thread
From: Mostafa Saleh @ 2024-09-17  9:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi

On Tue, Sep 10, 2024 at 05:22:51PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 10, 2024 at 10:55:51AM +0000, Mostafa Saleh wrote:
> > On Tue, Sep 03, 2024 at 08:33:40PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 03, 2024 at 07:57:01AM +0000, Mostafa Saleh wrote:
> > > 
> > > > Basically, I believe we shouldn’t set FWB blindly just because it’s supported,
> > > > I don’t see how it’s useful for stage-2 only domains.
> > > 
> > > And the only problem we can see is some niche scenario where incoming
> > > memory attributes that are already requesting cachable combine to a
> > > different kind of cachable?
> > 
> > No, it’s not about the niche scenario, as I mentioned I don’t think
> > we should enable FWB because it just exists. One can argue the opposite,
> > if S2FWB is no different why enable it?
> 
> Well, I'd argue that it provides more certainty for the kernel that
> the DMA API behavior is matched by HW behavior. But I don't feel strongly.
> 
> I adjusted the patch to only enable it for nesting parents.
> 
> > AFAIU, FWB would be useful in cases where the hypervisor (or VMM) knows
> > better than the VM, for example some devices’ MMIO space is emulated, so
> > it is normal memory and it’s more efficient to use memory attributes.
> 
> Not quite, the purpose of FWB is to allow the hypervisor to avoid
> costly cache flushing. It is specifically to protect the hypervisor
> against a VM causing the caches to go incoherent.
> 
> Caches that are unexpectedly incoherent are a security problem for the
> hypervisor.

I see, thanks for explaining, I got confused about the device emulation case,
it’s also about corruption because of a mismatch of memory attributes,
something like:
https://bugzilla.redhat.com/show_bug.cgi?id=1679680

At the moment, I see KVM doesn’t really touch guest memory, but it does CMOs
on guest map (in case the memslot already had some data) and on unmap, so I
believe avoiding them with FWB is a significant performance improvement.

> 
> > > > and we should only set FWB for coherent
> > > > devices in nested setup only where the VMM(or hypervisor) knows better than
> > > > the VM.
> > > 
> > > I don't want to touch the 'only coherent devices' question. Last time
> > > I tried to do that I got told every option was wrong.
> > > 
> > > I would be fine to only enable for nesting parent domains. It is
> > > mandatory here and we definitely don't support non-cachable nesting
> > > today.  Can we agree on that?
> > 
> > Why is it mandatory?
> 
> Because iommufd/vfio doesn't have cache flushing.
>  

I see.

> > I think a supporting point for this, is that KVM does the same for
> > the CPU, where it enables FWB for VMs if supported. I have this on
> > my list to study if that can be improved. But maybe, if we are out
> > of options, that would be a start.
> 
> When KVM turns on S2FWB it stops doing cache flushing. As I understand
> it, S2FWB is largely a performance optimization.
> 
> On the VFIO side we don't have cache flushing at all. So enforcing
> cache consistency is mandatory for security.
> 
> For native VFIO we set IOMMU_CACHE and expect that the contract with
> the IOMMU is that no cache flushing is required.
> 
> For nested we set S2FWB/CANWBS to prevent the VM from disabling VFIO's
> IOMMU_CACHE and again the contract with the HW is that no cache
> flushing is required.
> 
> Thus VFIO is security correct even though it doesn't cache flush.
> 
> None of this has anything to do with device coherence capability. It
> is why I keep saying incoherent devices must be blocked from VFIO
> because it cannot operate them securely/correctly.
> 
> Fixing that is a whole other topic, Yi has a series for it on x86 at
> least..

I see, it makes sense to only support it for nested domains, on
the assumption that they are only used for VFIO/IOMMUFD, until we figure
out non-coherent devices. I guess you are referring to:
https://lore.kernel.org/all/ZltQ3PyHKiQmN9SU@nvidia.com/t/#me702dd242782393eb7769000c96702a0fed7f6ca

Thanks,
Mostafa
> 
> Jason



* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-09-12  4:25     ` Baolu Lu
  2024-09-12  7:32       ` Zhangfei Gao
@ 2024-10-15  3:21       ` Zhangfei Gao
  2024-10-15 13:09         ` Jason Gunthorpe
  1 sibling, 1 reply; 95+ messages in thread
From: Zhangfei Gao @ 2024-10-15  3:21 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Nicolin Chen, Jason Gunthorpe, acpica-devel, Hanjun Guo, iommu,
	Joerg Roedel, Kevin Tian, kvm, Len Brown, linux-acpi,
	linux-arm-kernel, Lorenzo Pieralisi, Rafael J. Wysocki,
	Robert Moore, Robin Murphy, Sudeep Holla, Will Deacon,
	Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:

> > Have you tested the user page fault?
> >
> > I got an issue: when a user page fault happens,
> >   group->attach_handle = iommu_attach_handle_get(pasid)
> > returns NULL.
> >
> > I am a bit confused here, I only find IOMMU_NO_PASID being used when attaching
> >
> >   __fault_domain_replace_dev
> > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > &handle->handle);
> > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> >
> > I cannot find where the code attaches a user PASID with the attach_handle.
>
> Have you set iommu_ops::user_pasid_table for SMMUv3 driver?

Thanks Baolu

Can we send a patch to make it the default?

+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3570,6 +3570,7 @@ static struct iommu_ops arm_smmu_ops = {
        .viommu_alloc           = arm_vsmmu_alloc,
        .pgsize_bitmap          = -1UL, /* Restricted during device attach */
        .owner                  = THIS_MODULE,
+       .user_pasid_table       = 1,


Thanks



* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-10-15  3:21       ` Zhangfei Gao
@ 2024-10-15 13:09         ` Jason Gunthorpe
  2024-10-17  1:53           ` Zhangfei Gao
  0 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2024-10-15 13:09 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: Baolu Lu, Nicolin Chen, acpica-devel, Hanjun Guo, iommu,
	Joerg Roedel, Kevin Tian, kvm, Len Brown, linux-acpi,
	linux-arm-kernel, Lorenzo Pieralisi, Rafael J. Wysocki,
	Robert Moore, Robin Murphy, Sudeep Holla, Will Deacon,
	Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, Oct 15, 2024 at 11:21:54AM +0800, Zhangfei Gao wrote:
> On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:
> 
> > > Have you tested the user page fault?
> > >
> > > I got an issue: when a user page fault happens,
> > >   group->attach_handle = iommu_attach_handle_get(pasid)
> > > returns NULL.
> > >
> > > I am a bit confused here, I only find IOMMU_NO_PASID being used when attaching
> > >
> > >   __fault_domain_replace_dev
> > > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > > &handle->handle);
> > > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> > >
> > > I cannot find where the code attaches a user PASID with the attach_handle.
> >
> > Have you set iommu_ops::user_pasid_table for SMMUv3 driver?
> 
> Thanks Baolu
> 
> Can we send a patch to make it the default?
> 
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -3570,6 +3570,7 @@ static struct iommu_ops arm_smmu_ops = {
>         .viommu_alloc           = arm_vsmmu_alloc,
>         .pgsize_bitmap          = -1UL, /* Restricted during device attach */
>         .owner                  = THIS_MODULE,
> +       .user_pasid_table       = 1,

You shouldn't need this right now as smmu3 doesn't support nesting
domains yet.

			if (!ops->user_pasid_table)
				return NULL;
			/*
			 * The iommu driver for this device supports user-
			 * managed PASID table. Therefore page faults for
			 * any PASID should go through the NESTING domain
			 * attached to the device RID.
			 */
			attach_handle = iommu_attach_handle_get(
					dev->iommu_group, IOMMU_NO_PASID,
					IOMMU_DOMAIN_NESTED);
			if (IS_ERR(attach_handle))
                        ^^^^^^^^^^^^^^^^^^^^^ Will always fail


But I will add it to the patch that adds IOMMU_DOMAIN_NESTED

Jason



* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-08-27 15:51 [PATCH v2 0/8] Initial support for SMMUv3 nested translation Jason Gunthorpe
                   ` (8 preceding siblings ...)
  2024-08-27 21:31 ` [PATCH v2 0/8] Initial support for SMMUv3 nested translation Nicolin Chen
@ 2024-10-16  2:23 ` Zhangfei Gao
  2024-10-16 11:53   ` Jason Gunthorpe
  9 siblings, 1 reply; 95+ messages in thread
From: Zhangfei Gao @ 2024-10-16  2:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

Hi, Jason

On Tue, 27 Aug 2024 at 23:51, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> This brings support for the IOMMFD ioctls:
>
>  - IOMMU_GET_HW_INFO
>  - IOMMU_HWPT_ALLOC_NEST_PARENT
>  - IOMMU_DOMAIN_NESTED
>  - ops->enforce_cache_coherency()
>
> This is quite straightforward as the nested STE can just be built in the
> special NESTED domain op and fed through the generic update machinery.
>
> The design allows the user provided STE fragment to control several
> aspects of the translation, including putting the STE into a "virtual
> bypass" or a aborting state. This duplicates functionality available by
> other means, but it allows trivially preserving the VMID in the STE as we
> eventually move towards the VIOMMU owning the VMID.
>
> Nesting support requires the system to either support S2FWB or the
> stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> cache and view incoherent data, currently VFIO lacks any cache flushing
> that would make this safe.

What if the system does not support S2FWB or CANWBS, is there any
workaround for passthrough?
Currently I am testing nesting by ignoring this check.

Thanks



* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-10-16  2:23 ` Zhangfei Gao
@ 2024-10-16 11:53   ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 11:53 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: acpica-devel, Hanjun Guo, iommu, Joerg Roedel, Kevin Tian, kvm,
	Len Brown, linux-acpi, linux-arm-kernel, Lorenzo Pieralisi,
	Rafael J. Wysocki, Robert Moore, Robin Murphy, Sudeep Holla,
	Will Deacon, Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Wed, Oct 16, 2024 at 10:23:49AM +0800, Zhangfei Gao wrote:

> > Nesting support requires the system to either support S2FWB or the
> > stronger CANWBS ACPI flag. This is to ensure the VM cannot bypass the
> > cache and view incoherent data, currently VFIO lacks any cache flushing
> > that would make this safe.
> 
> What if the system does not support S2FWB or CANWBS, is there any
> workaround for passthrough?

Eventually we can add the required cache flushing to VFIO, but that
would have to be a followup.

> Currently I am testing nesting by ignoring this check.

This is probably OK, but I wouldn't run it in a production environment
with a hostile VM.
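
For reference, the check being skipped can be pictured roughly as below.
This is only a sketch under assumptions: the struct and function names
are hypothetical and do not correspond to the driver's actual code, which
derives these properties from the SMMU feature bits and the IORT table.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of the two properties discussed in the cover letter. */
struct demo_coherency_caps {
	bool s2fwb;	/* SMMU supports stage-2 forced write-back */
	bool canwbs;	/* firmware reports the stronger CANWBS flag */
};

/*
 * A nesting parent is only allowed when the VM cannot bypass the cache,
 * because VFIO has no cache flushing that would make the other case safe.
 * Returns 0 when nesting may proceed, -EOPNOTSUPP otherwise.
 */
int demo_nest_parent_check(const struct demo_coherency_caps *caps)
{
	if (caps->s2fwb || caps->canwbs)
		return 0;
	return -EOPNOTSUPP;
}
```

Forcing this to return 0 is what "ignoring this check" amounts to: nesting
then works functionally, but nothing stops the guest from observing
incoherent data.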

Jason



* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-10-15 13:09         ` Jason Gunthorpe
@ 2024-10-17  1:53           ` Zhangfei Gao
  2024-10-17 11:57             ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Zhangfei Gao @ 2024-10-17  1:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Baolu Lu, Nicolin Chen, acpica-devel, Hanjun Guo, iommu,
	Joerg Roedel, Kevin Tian, kvm, Len Brown, linux-acpi,
	linux-arm-kernel, Lorenzo Pieralisi, Rafael J. Wysocki,
	Robert Moore, Robin Murphy, Sudeep Holla, Will Deacon,
	Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Tue, 15 Oct 2024 at 21:09, Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Oct 15, 2024 at 11:21:54AM +0800, Zhangfei Gao wrote:
> > On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:
> >
> > > > Have you tested the user page fault?
> > > >
> > > > I got an issue: when a user page fault happens,
> > > >   group->attach_handle = iommu_attach_handle_get(pasid)
> > > > returns NULL.
> > > >
> > > > I am a bit confused here, I only find IOMMU_NO_PASID being used when attaching
> > > >
> > > >   __fault_domain_replace_dev
> > > > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > > > &handle->handle);
> > > > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> > > >
> > > > I cannot find where the code attaches a user PASID with the attach_handle.
> > >
> > > Have you set iommu_ops::user_pasid_table for SMMUv3 driver?
> >
> > Thanks Baolu
> >
> > Can we send a patch to make it the default?
> >
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -3570,6 +3570,7 @@ static struct iommu_ops arm_smmu_ops = {
> >         .viommu_alloc           = arm_vsmmu_alloc,
> >         .pgsize_bitmap          = -1UL, /* Restricted during device attach */
> >         .owner                  = THIS_MODULE,
> > +       .user_pasid_table       = 1,
>
> You shouldn't need this right now as smmu3 doesn't support nesting
> domains yet.

I am testing with .user_pasid_table = 1 and IOMMU_NO_PASID.
It works for user page faults.

>
>                         if (!ops->user_pasid_table)
>                                 return NULL;
>                         /*
>                          * The iommu driver for this device supports user-
>                          * managed PASID table. Therefore page faults for
>                          * any PASID should go through the NESTING domain
>                          * attached to the device RID.
>                          */
>                         attach_handle = iommu_attach_handle_get(
>                                         dev->iommu_group, IOMMU_NO_PASID,
>                                         IOMMU_DOMAIN_NESTED);
>                         if (IS_ERR(attach_handle))
>                         ^^^^^^^^^^^^^^^^^^^^^ Will always fail
>
>
> But I will add it to the patch that adds IOMMU_DOMAIN_NESTED

OK, cool.

Thanks



* Re: [PATCH v2 0/8] Initial support for SMMUv3 nested translation
  2024-10-17  1:53           ` Zhangfei Gao
@ 2024-10-17 11:57             ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 11:57 UTC (permalink / raw)
  To: Zhangfei Gao
  Cc: Baolu Lu, Nicolin Chen, acpica-devel, Hanjun Guo, iommu,
	Joerg Roedel, Kevin Tian, kvm, Len Brown, linux-acpi,
	linux-arm-kernel, Lorenzo Pieralisi, Rafael J. Wysocki,
	Robert Moore, Robin Murphy, Sudeep Holla, Will Deacon,
	Alex Williamson, Eric Auger, Jean-Philippe Brucker,
	Moritz Fischer, Michael Shavit, patches,
	Shameerali Kolothum Thodi, Mostafa Saleh

On Thu, Oct 17, 2024 at 09:53:22AM +0800, Zhangfei Gao wrote:
> On Tue, 15 Oct 2024 at 21:09, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Oct 15, 2024 at 11:21:54AM +0800, Zhangfei Gao wrote:
> > > On Thu, 12 Sept 2024 at 12:29, Baolu Lu <baolu.lu@linux.intel.com> wrote:
> > >
> > > > > Have you tested the user page fault?
> > > > >
> > > > > I got an issue: when a user page fault happens,
> > > > >   group->attach_handle = iommu_attach_handle_get(pasid)
> > > > > returns NULL.
> > > > >
> > > > > I am a bit confused here, I only find IOMMU_NO_PASID being used when attaching
> > > > >
> > > > >   __fault_domain_replace_dev
> > > > > ret = iommu_replace_group_handle(idev->igroup->group, hwpt->domain,
> > > > > &handle->handle);
> > > > > curr = xa_store(&group->pasid_array, IOMMU_NO_PASID, handle, GFP_KERNEL);
> > > > >
> > > > > I cannot find where the code attaches a user PASID with the attach_handle.
> > > >
> > > > Have you set iommu_ops::user_pasid_table for SMMUv3 driver?
> > >
> > > Thanks Baolu
> > >
> > > Can we send a patch to make it the default?
> > >
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -3570,6 +3570,7 @@ static struct iommu_ops arm_smmu_ops = {
> > >         .viommu_alloc           = arm_vsmmu_alloc,
> > >         .pgsize_bitmap          = -1UL, /* Restricted during device attach */
> > >         .owner                  = THIS_MODULE,
> > > +       .user_pasid_table       = 1,
> >
> > You shouldn't need this right now as smmu3 doesn't support nesting
> > domains yet.
> 
> I am testing with .user_pasid_table = 1 and IOMMU_NO_PASID.
> It works for user page faults.

You shouldn't need user_pasid_table for that case, it is only
necessary if you are doing nesting.
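
To make the lookup order concrete, here is a small userspace model of the
fault-handle search quoted earlier in the thread. It is a sketch only:
the demo_* types and the fixed-size array are hypothetical stand-ins for
the group's pasid_array and the attach handles, not the kernel's data
structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NO_PASID	0u
#define DEMO_MAX_PASID	8u

enum demo_domain_type { DEMO_DOMAIN_PAGING, DEMO_DOMAIN_NESTED };

struct demo_handle {
	enum demo_domain_type type;
};

struct demo_group {
	bool user_pasid_table;	/* models iommu_ops::user_pasid_table */
	struct demo_handle *pasid_array[DEMO_MAX_PASID];
};

/*
 * Find the handle that should receive a fault for @pasid. A handle
 * attached directly to the PASID wins. Otherwise, only a driver with a
 * user-managed PASID table may fall back to the handle attached to the
 * RID (DEMO_NO_PASID), and only if that handle is a nested domain.
 */
struct demo_handle *demo_find_fault_handle(const struct demo_group *g,
					   uint32_t pasid)
{
	struct demo_handle *h;

	if (pasid >= DEMO_MAX_PASID)
		return NULL;

	h = g->pasid_array[pasid];
	if (h)
		return h;

	if (!g->user_pasid_table)
		return NULL;

	h = g->pasid_array[DEMO_NO_PASID];
	if (!h || h->type != DEMO_DOMAIN_NESTED)
		return NULL;	/* without a nested domain this always fails */

	return h;
}
```

In this model a fault on IOMMU_NO_PASID itself never reaches the fallback
branch, and the fallback can only succeed once a nested domain is attached
to the RID.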

Jason



