* [PATCH v5 00/14] iommufd: Add vIOMMU infrastructure (Part-3: vEVENTQ)
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
As part 3 of the vIOMMU infrastructure series, this introduces a new
vEVENTQ object. The existing FAULT object already provides a queue-based
notification pathway to user space, so let vEVENTQ reuse that.
Mimicking the HWPT structure, add a common EVENTQ structure to support its
derivatives: IOMMUFD_OBJ_FAULT (existing) and IOMMUFD_OBJ_VEVENTQ (new).
An IOMMUFD_CMD_VEVENTQ_ALLOC is introduced to allocate vEVENTQ objects for
vIOMMUs. One vIOMMU can have multiple vEVENTQs of different types, but it
cannot have multiple vEVENTQs of the same type.
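For illustration, a minimal user-space sketch of the allocation and read
path (assuming an existing iommufd "fd" and an allocated "viommu_id"; the
selftest type and handle_event() are stand-ins for a real driver-defined
type and the VMM's event handling):

	struct iommu_veventq_alloc cmd = {
		.size = sizeof(cmd),
		.viommu_id = viommu_id,
		.type = IOMMU_VEVENTQ_TYPE_SELFTEST, /* a driver-defined type */
	};
	__u64 record[8]; /* large enough for one driver-defined event */

	if (ioctl(fd, IOMMU_VEVENTQ_ALLOC, &cmd))
		return -1;
	/* The new FD is pollable; each read() copies whole event records */
	while (read(cmd.out_veventq_fd, record, sizeof(record)) > 0)
		handle_event(record);
	close(cmd.out_veventq_fd);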
The forwarding part is fairly simple, but it might need to replace a
physical device ID with a virtual device ID in a driver-level event data
structure. So, this also adds some helpers for drivers to use.
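On the driver side, the forwarding boils down to the two new helpers (a
sketch only, borrowing the selftest's event structure and type in place
of a real driver's uAPI definitions):

	/* In a threaded IRQ handler, given a physical dev and its viommu */
	unsigned long vdev_id = iommufd_viommu_get_vdev_id(viommu, dev);

	if (vdev_id) {
		struct iommu_viommu_event_selftest ev = { .virt_id = vdev_id };

		iommufd_viommu_report_event(viommu, IOMMU_VEVENTQ_TYPE_SELFTEST,
					    &ev, sizeof(ev));
	}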
As usual, this series comes with selftest coverage for the new ioctl and
a real-world use case in the ARM SMMUv3 driver.
This is on Github:
https://github.com/nicolinc/iommufd/commits/iommufd_veventq-v5
Testing with RMR patches for MSI:
https://github.com/nicolinc/iommufd/commits/iommufd_veventq-v5-with-rmr
Pairing QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_veventq-v5
Changelog
v5
* Add Reviewed-by from Baolu
* Reorder the OBJ list as well
* Fix alphabetical order after renaming in v4
* Add supports_veventq viommu op for vEVENTQ type validation
v4
https://lore.kernel.org/all/cover.1735933254.git.nicolinc@nvidia.com/
* Rename "vIRQ" to "vEVENTQ"
* Use flexible array in struct iommufd_vevent
* Add the new ioctl command to union ucmd_buffer
* Fix the alphabetical order in union ucmd_buffer too
* Rename _TYPE_NONE to _TYPE_DEFAULT aligning with vIOMMU naming
v3
https://lore.kernel.org/all/cover.1734477608.git.nicolinc@nvidia.com/
* Rebase on Will's for-joerg/arm-smmu/updates for arm_smmu_event series
* Add "Reviewed-by" lines from Kevin
* Fix typos in comments, kdocs, and jump tags
* Add a patch to sort struct iommufd_ioctl_op
* Update iommufd's userspace-api documentation
* Update uAPI kdoc to quote SMMUv3 official spec
* Drop the unused workqueue in struct iommufd_virq
* Drop might_sleep() in iommufd_viommu_report_irq() helper
* Add missing "break" in iommufd_viommu_get_vdev_id() helper
* Shrink the scope of the vmaster's read lock in SMMUv3 driver
* Pass in two arguments to iommufd_eventq_virq_handler() helper
* Move "!ops || !ops->read" validation into iommufd_eventq_init()
* Move "fault->ictx = ictx" closer to iommufd_ctx_get(fault->ictx)
* Update commit message for arm_smmu_attach_prepare/commit_vmaster()
* Keep "iommufd_fault" as-is and rename "iommufd_eventq_virq" to just
"iommufd_virq"
v2
https://lore.kernel.org/all/cover.1733263737.git.nicolinc@nvidia.com/
* Rebase on v6.13-rc1
* Add IOPF and vIRQ in iommufd.rst (userspace-api)
* Add proper locking in iommufd_event_virq_destroy
* Add iommufd_event_virq_abort with a lockdep_assert_held
* Rename "EVENT_*" to "EVENTQ_*" to describe the objects better
* Reorganize flows in iommufd_eventq_virq_alloc for abort() to work
* Add struct arm_smmu_vmaster to store vSID upon attaching to a nested
domain, calling a newly added iommufd_viommu_get_vdev_id helper
* Add an arm_vmaster_report_event helper in arm-smmu-v3-iommufd file
to simplify the routine in arm_smmu_handle_evt() of the main driver
v1
https://lore.kernel.org/all/cover.1724777091.git.nicolinc@nvidia.com/
Thanks!
Nicolin
Nicolin Chen (14):
iommufd: Keep OBJ/IOCTL lists in an alphabetical order
iommufd/fault: Add an iommufd_fault_init() helper
iommufd/fault: Move iommufd_fault_iopf_handler() to header
iommufd: Abstract an iommufd_eventq from iommufd_fault
iommufd: Rename fault.c to eventq.c
iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
iommufd/viommu: Add iommufd_viommu_get_vdev_id helper
iommufd/viommu: Add iommufd_viommu_report_event helper
iommufd/selftest: Require vdev_id when attaching to a nested domain
iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ
coverage
iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage
Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ
iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster
iommu/arm-smmu-v3: Report events that belong to devices attached to
vIOMMU
drivers/iommu/iommufd/Makefile | 2 +-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 30 ++
drivers/iommu/iommufd/iommufd_private.h | 116 ++++++-
drivers/iommu/iommufd/iommufd_test.h | 10 +
include/linux/iommufd.h | 24 ++
include/uapi/linux/iommufd.h | 46 +++
tools/testing/selftests/iommu/iommufd_utils.h | 65 ++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 71 ++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 90 ++++--
drivers/iommu/iommufd/driver.c | 63 ++++
drivers/iommu/iommufd/{fault.c => eventq.c} | 303 ++++++++++++++----
drivers/iommu/iommufd/hw_pagetable.c | 6 +-
drivers/iommu/iommufd/main.c | 37 ++-
drivers/iommu/iommufd/selftest.c | 53 +++
drivers/iommu/iommufd/viommu.c | 2 +
tools/testing/selftests/iommu/iommufd.c | 27 ++
.../selftests/iommu/iommufd_fail_nth.c | 7 +
Documentation/userspace-api/iommufd.rst | 16 +
18 files changed, 843 insertions(+), 125 deletions(-)
rename drivers/iommu/iommufd/{fault.c => eventq.c} (54%)
base-commit: e94dc6ddda8dd3770879a132d577accd2cce25f9
--
2.43.0
* [PATCH v5 01/14] iommufd: Keep OBJ/IOCTL lists in an alphabetical order
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
Reorder the existing OBJ/IOCTL lists.
Also run clang-format so the line wrappings follow a consistent coding style.
No functional change.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/main.c | 30 ++++++++++++++----------------
1 file changed, 14 insertions(+), 16 deletions(-)
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 97c5e3567d33..a11e9cfd790f 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -307,9 +307,9 @@ union ucmd_buffer {
struct iommu_ioas_map map;
struct iommu_ioas_unmap unmap;
struct iommu_option option;
+ struct iommu_vdevice_alloc vdev;
struct iommu_vfio_ioas vfio_ioas;
struct iommu_viommu_alloc viommu;
- struct iommu_vdevice_alloc vdev;
#ifdef CONFIG_IOMMUFD_TEST
struct iommu_test_cmd test;
#endif
@@ -333,8 +333,8 @@ struct iommufd_ioctl_op {
}
static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
IOCTL_OP(IOMMU_DESTROY, iommufd_destroy, struct iommu_destroy, id),
- IOCTL_OP(IOMMU_FAULT_QUEUE_ALLOC, iommufd_fault_alloc, struct iommu_fault_alloc,
- out_fault_fd),
+ IOCTL_OP(IOMMU_FAULT_QUEUE_ALLOC, iommufd_fault_alloc,
+ struct iommu_fault_alloc, out_fault_fd),
IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct iommu_hw_info,
__reserved),
IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
@@ -355,20 +355,18 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
src_iova),
IOCTL_OP(IOMMU_IOAS_IOVA_RANGES, iommufd_ioas_iova_ranges,
struct iommu_ioas_iova_ranges, out_iova_alignment),
- IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map,
- iova),
+ IOCTL_OP(IOMMU_IOAS_MAP, iommufd_ioas_map, struct iommu_ioas_map, iova),
IOCTL_OP(IOMMU_IOAS_MAP_FILE, iommufd_ioas_map_file,
struct iommu_ioas_map_file, iova),
IOCTL_OP(IOMMU_IOAS_UNMAP, iommufd_ioas_unmap, struct iommu_ioas_unmap,
length),
- IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option,
- val64),
+ IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option, val64),
+ IOCTL_OP(IOMMU_VDEVICE_ALLOC, iommufd_vdevice_alloc_ioctl,
+ struct iommu_vdevice_alloc, virt_id),
IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
__reserved),
IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
struct iommu_viommu_alloc, out_viommu_id),
- IOCTL_OP(IOMMU_VDEVICE_ALLOC, iommufd_vdevice_alloc_ioctl,
- struct iommu_vdevice_alloc, virt_id),
#ifdef CONFIG_IOMMUFD_TEST
IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last),
#endif
@@ -490,8 +488,8 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
[IOMMUFD_OBJ_DEVICE] = {
.destroy = iommufd_device_destroy,
},
- [IOMMUFD_OBJ_IOAS] = {
- .destroy = iommufd_ioas_destroy,
+ [IOMMUFD_OBJ_FAULT] = {
+ .destroy = iommufd_fault_destroy,
},
[IOMMUFD_OBJ_HWPT_PAGING] = {
.destroy = iommufd_hwpt_paging_destroy,
@@ -501,15 +499,15 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
.destroy = iommufd_hwpt_nested_destroy,
.abort = iommufd_hwpt_nested_abort,
},
- [IOMMUFD_OBJ_FAULT] = {
- .destroy = iommufd_fault_destroy,
- },
- [IOMMUFD_OBJ_VIOMMU] = {
- .destroy = iommufd_viommu_destroy,
+ [IOMMUFD_OBJ_IOAS] = {
+ .destroy = iommufd_ioas_destroy,
},
[IOMMUFD_OBJ_VDEVICE] = {
.destroy = iommufd_vdevice_destroy,
},
+ [IOMMUFD_OBJ_VIOMMU] = {
+ .destroy = iommufd_viommu_destroy,
+ },
#ifdef CONFIG_IOMMUFD_TEST
[IOMMUFD_OBJ_SELFTEST] = {
.destroy = iommufd_selftest_destroy,
--
2.43.0
* [PATCH v5 02/14] iommufd/fault: Add an iommufd_fault_init() helper
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
The infrastructure of a fault object will be shared with a new vEVENTQ
object in a following change. Add a helper so that the vEVENTQ allocator
can call it too.
Reorder the iommufd_ctx_get and refcount_inc calls, to keep them
symmetrical with iommufd_fault_fops_release().
Since the new vEVENTQ doesn't need "response", leave the xa_init_flags
in its original location.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/fault.c | 48 ++++++++++++++++++++---------------
1 file changed, 28 insertions(+), 20 deletions(-)
diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
index 1fe804e28a86..1d1095fc8224 100644
--- a/drivers/iommu/iommufd/fault.c
+++ b/drivers/iommu/iommufd/fault.c
@@ -367,11 +367,35 @@ static const struct file_operations iommufd_fault_fops = {
.release = iommufd_fault_fops_release,
};
+static int iommufd_fault_init(struct iommufd_fault *fault, char *name,
+ struct iommufd_ctx *ictx)
+{
+ struct file *filep;
+ int fdno;
+
+ mutex_init(&fault->mutex);
+ INIT_LIST_HEAD(&fault->deliver);
+ init_waitqueue_head(&fault->wait_queue);
+
+ filep = anon_inode_getfile(name, &iommufd_fault_fops, fault, O_RDWR);
+ if (IS_ERR(filep))
+ return PTR_ERR(filep);
+
+ fault->ictx = ictx;
+ iommufd_ctx_get(fault->ictx);
+ fault->filep = filep;
+ refcount_inc(&fault->obj.users);
+
+ fdno = get_unused_fd_flags(O_CLOEXEC);
+ if (fdno < 0)
+ fput(filep);
+ return fdno;
+}
+
int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
{
struct iommu_fault_alloc *cmd = ucmd->cmd;
struct iommufd_fault *fault;
- struct file *filep;
int fdno;
int rc;
@@ -382,27 +406,12 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
if (IS_ERR(fault))
return PTR_ERR(fault);
- fault->ictx = ucmd->ictx;
- INIT_LIST_HEAD(&fault->deliver);
xa_init_flags(&fault->response, XA_FLAGS_ALLOC1);
- mutex_init(&fault->mutex);
- init_waitqueue_head(&fault->wait_queue);
-
- filep = anon_inode_getfile("[iommufd-pgfault]", &iommufd_fault_fops,
- fault, O_RDWR);
- if (IS_ERR(filep)) {
- rc = PTR_ERR(filep);
- goto out_abort;
- }
- refcount_inc(&fault->obj.users);
- iommufd_ctx_get(fault->ictx);
- fault->filep = filep;
-
- fdno = get_unused_fd_flags(O_CLOEXEC);
+ fdno = iommufd_fault_init(fault, "[iommufd-pgfault]", ucmd->ictx);
if (fdno < 0) {
rc = fdno;
- goto out_fput;
+ goto out_abort;
}
cmd->out_fault_id = fault->obj.id;
@@ -418,8 +427,7 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
return 0;
out_put_fdno:
put_unused_fd(fdno);
-out_fput:
- fput(filep);
+ fput(fault->filep);
out_abort:
iommufd_object_abort_and_destroy(ucmd->ictx, &fault->obj);
--
2.43.0
* [PATCH v5 03/14] iommufd/fault: Move iommufd_fault_iopf_handler() to header
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
The new vEVENTQ object will need a similar function for drivers to report
the vIOMMU-related events. Split the common part out to a smaller helper,
and place it in the header so that CONFIG_IOMMUFD_DRIVER_CORE can include
it in the driver.c file for drivers to use.
Then keep iommufd_fault_iopf_handler() in the header too, since it is
quite simple after the split.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 20 +++++++++++++++++++-
drivers/iommu/iommufd/fault.c | 17 -----------------
2 files changed, 19 insertions(+), 18 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index b6d706cf2c66..8b378705ee71 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -451,6 +451,17 @@ struct iommufd_fault {
struct wait_queue_head wait_queue;
};
+static inline int iommufd_fault_notify(struct iommufd_fault *fault,
+ struct list_head *new_fault)
+{
+ mutex_lock(&fault->mutex);
+ list_add_tail(new_fault, &fault->deliver);
+ mutex_unlock(&fault->mutex);
+
+ wake_up_interruptible(&fault->wait_queue);
+ return 0;
+}
+
struct iommufd_attach_handle {
struct iommu_attach_handle handle;
struct iommufd_device *idev;
@@ -469,7 +480,14 @@ iommufd_get_fault(struct iommufd_ucmd *ucmd, u32 id)
int iommufd_fault_alloc(struct iommufd_ucmd *ucmd);
void iommufd_fault_destroy(struct iommufd_object *obj);
-int iommufd_fault_iopf_handler(struct iopf_group *group);
+
+static inline int iommufd_fault_iopf_handler(struct iopf_group *group)
+{
+ struct iommufd_hw_pagetable *hwpt =
+ group->attach_handle->domain->fault_data;
+
+ return iommufd_fault_notify(hwpt->fault, &group->node);
+}
int iommufd_fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
struct iommufd_device *idev);
diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
index 1d1095fc8224..d188994e4e84 100644
--- a/drivers/iommu/iommufd/fault.c
+++ b/drivers/iommu/iommufd/fault.c
@@ -433,20 +433,3 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
return rc;
}
-
-int iommufd_fault_iopf_handler(struct iopf_group *group)
-{
- struct iommufd_hw_pagetable *hwpt;
- struct iommufd_fault *fault;
-
- hwpt = group->attach_handle->domain->fault_data;
- fault = hwpt->fault;
-
- mutex_lock(&fault->mutex);
- list_add_tail(&group->node, &fault->deliver);
- mutex_unlock(&fault->mutex);
-
- wake_up_interruptible(&fault->wait_queue);
-
- return 0;
-}
--
2.43.0
* [PATCH v5 04/14] iommufd: Abstract an iommufd_eventq from iommufd_fault
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
The fault object was designed exclusively for hwpt's IO page faults (PRI),
but its queue implementation can be reused for other purposes too, such as
hardware IRQ and event injection to user space.
Meanwhile, a fault object holds a list of faults, so it is more accurately
a "fault queue". Combining these two observations, abstract a new
iommufd_eventq as a common structure embedded into struct iommufd_fault,
similar to hwpt_paging holding a common hwpt.
Add a common iommufd_eventq_ops and iommufd_eventq_init to prepare for an
IOMMUFD_OBJ_VEVENTQ (vIOMMU Event Queue).
Also, add missing xa_destroy and mutex_destroy in iommufd_fault_destroy().
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 52 ++++++---
drivers/iommu/iommufd/fault.c | 142 +++++++++++++++---------
drivers/iommu/iommufd/hw_pagetable.c | 6 +-
3 files changed, 130 insertions(+), 70 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 8b378705ee71..dfbc5cfbd164 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -18,6 +18,7 @@ struct iommu_domain;
struct iommu_group;
struct iommu_option;
struct iommufd_device;
+struct iommufd_eventq;
struct iommufd_ctx {
struct file *file;
@@ -433,32 +434,35 @@ void iopt_remove_access(struct io_pagetable *iopt,
u32 iopt_access_list_id);
void iommufd_access_destroy_object(struct iommufd_object *obj);
-/*
- * An iommufd_fault object represents an interface to deliver I/O page faults
- * to the user space. These objects are created/destroyed by the user space and
- * associated with hardware page table objects during page-table allocation.
- */
-struct iommufd_fault {
+struct iommufd_eventq_ops {
+ ssize_t (*read)(struct iommufd_eventq *eventq, char __user *buf,
+ size_t count, loff_t *ppos); /* Mandatory op */
+ ssize_t (*write)(struct iommufd_eventq *eventq, const char __user *buf,
+ size_t count, loff_t *ppos); /* Optional op */
+};
+
+struct iommufd_eventq {
struct iommufd_object obj;
struct iommufd_ctx *ictx;
struct file *filep;
- /* The lists of outstanding faults protected by below mutex. */
+ const struct iommufd_eventq_ops *ops;
+
+ /* The lists of outstanding events protected by below mutex. */
struct mutex mutex;
struct list_head deliver;
- struct xarray response;
struct wait_queue_head wait_queue;
};
-static inline int iommufd_fault_notify(struct iommufd_fault *fault,
- struct list_head *new_fault)
+static inline int iommufd_eventq_notify(struct iommufd_eventq *eventq,
+ struct list_head *new_event)
{
- mutex_lock(&fault->mutex);
- list_add_tail(new_fault, &fault->deliver);
- mutex_unlock(&fault->mutex);
+ mutex_lock(&eventq->mutex);
+ list_add_tail(new_event, &eventq->deliver);
+ mutex_unlock(&eventq->mutex);
- wake_up_interruptible(&fault->wait_queue);
+ wake_up_interruptible(&eventq->wait_queue);
return 0;
}
@@ -470,12 +474,28 @@ struct iommufd_attach_handle {
/* Convert an iommu attach handle to iommufd handle. */
#define to_iommufd_handle(hdl) container_of(hdl, struct iommufd_attach_handle, handle)
+/*
+ * An iommufd_fault object represents an interface to deliver I/O page faults
+ * to the user space. These objects are created/destroyed by the user space and
+ * associated with hardware page table objects during page-table allocation.
+ */
+struct iommufd_fault {
+ struct iommufd_eventq common;
+ struct xarray response;
+};
+
+static inline struct iommufd_fault *
+eventq_to_fault(struct iommufd_eventq *eventq)
+{
+ return container_of(eventq, struct iommufd_fault, common);
+}
+
static inline struct iommufd_fault *
iommufd_get_fault(struct iommufd_ucmd *ucmd, u32 id)
{
return container_of(iommufd_get_object(ucmd->ictx, id,
IOMMUFD_OBJ_FAULT),
- struct iommufd_fault, obj);
+ struct iommufd_fault, common.obj);
}
int iommufd_fault_alloc(struct iommufd_ucmd *ucmd);
@@ -486,7 +506,7 @@ static inline int iommufd_fault_iopf_handler(struct iopf_group *group)
struct iommufd_hw_pagetable *hwpt =
group->attach_handle->domain->fault_data;
- return iommufd_fault_notify(hwpt->fault, &group->node);
+ return iommufd_eventq_notify(&hwpt->fault->common, &group->node);
}
int iommufd_fault_domain_attach_dev(struct iommufd_hw_pagetable *hwpt,
diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/fault.c
index d188994e4e84..e386b6c3e6ab 100644
--- a/drivers/iommu/iommufd/fault.c
+++ b/drivers/iommu/iommufd/fault.c
@@ -17,6 +17,8 @@
#include "../iommu-priv.h"
#include "iommufd_private.h"
+/* IOMMUFD_OBJ_FAULT Functions */
+
static int iommufd_fault_iopf_enable(struct iommufd_device *idev)
{
struct device *dev = idev->dev;
@@ -108,8 +110,8 @@ static void iommufd_auto_response_faults(struct iommufd_hw_pagetable *hwpt,
if (!fault)
return;
- mutex_lock(&fault->mutex);
- list_for_each_entry_safe(group, next, &fault->deliver, node) {
+ mutex_lock(&fault->common.mutex);
+ list_for_each_entry_safe(group, next, &fault->common.deliver, node) {
if (group->attach_handle != &handle->handle)
continue;
list_del(&group->node);
@@ -124,7 +126,7 @@ static void iommufd_auto_response_faults(struct iommufd_hw_pagetable *hwpt,
iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
iopf_free_group(group);
}
- mutex_unlock(&fault->mutex);
+ mutex_unlock(&fault->common.mutex);
}
static struct iommufd_attach_handle *
@@ -211,7 +213,8 @@ int iommufd_fault_domain_replace_dev(struct iommufd_device *idev,
void iommufd_fault_destroy(struct iommufd_object *obj)
{
- struct iommufd_fault *fault = container_of(obj, struct iommufd_fault, obj);
+ struct iommufd_eventq *eventq =
+ container_of(obj, struct iommufd_eventq, obj);
struct iopf_group *group, *next;
/*
@@ -220,11 +223,13 @@ void iommufd_fault_destroy(struct iommufd_object *obj)
* accessing this pointer. Therefore, acquiring the mutex here
* is unnecessary.
*/
- list_for_each_entry_safe(group, next, &fault->deliver, node) {
+ list_for_each_entry_safe(group, next, &eventq->deliver, node) {
list_del(&group->node);
iopf_group_response(group, IOMMU_PAGE_RESP_INVALID);
iopf_free_group(group);
}
+ xa_destroy(&eventq_to_fault(eventq)->response);
+ mutex_destroy(&eventq->mutex);
}
static void iommufd_compose_fault_message(struct iommu_fault *fault,
@@ -242,11 +247,12 @@ static void iommufd_compose_fault_message(struct iommu_fault *fault,
hwpt_fault->cookie = cookie;
}
-static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
- size_t count, loff_t *ppos)
+static ssize_t iommufd_fault_fops_read(struct iommufd_eventq *eventq,
+ char __user *buf, size_t count,
+ loff_t *ppos)
{
size_t fault_size = sizeof(struct iommu_hwpt_pgfault);
- struct iommufd_fault *fault = filep->private_data;
+ struct iommufd_fault *fault = eventq_to_fault(eventq);
struct iommu_hwpt_pgfault data;
struct iommufd_device *idev;
struct iopf_group *group;
@@ -257,10 +263,10 @@ static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
if (*ppos || count % fault_size)
return -ESPIPE;
- mutex_lock(&fault->mutex);
- while (!list_empty(&fault->deliver) && count > done) {
- group = list_first_entry(&fault->deliver,
- struct iopf_group, node);
+ mutex_lock(&eventq->mutex);
+ while (!list_empty(&eventq->deliver) && count > done) {
+ group = list_first_entry(&eventq->deliver, struct iopf_group,
+ node);
if (group->fault_count * fault_size > count - done)
break;
@@ -285,16 +291,17 @@ static ssize_t iommufd_fault_fops_read(struct file *filep, char __user *buf,
list_del(&group->node);
}
- mutex_unlock(&fault->mutex);
+ mutex_unlock(&eventq->mutex);
return done == 0 ? rc : done;
}
-static ssize_t iommufd_fault_fops_write(struct file *filep, const char __user *buf,
- size_t count, loff_t *ppos)
+static ssize_t iommufd_fault_fops_write(struct iommufd_eventq *eventq,
+ const char __user *buf, size_t count,
+ loff_t *ppos)
{
size_t response_size = sizeof(struct iommu_hwpt_page_response);
- struct iommufd_fault *fault = filep->private_data;
+ struct iommufd_fault *fault = eventq_to_fault(eventq);
struct iommu_hwpt_page_response response;
struct iopf_group *group;
size_t done = 0;
@@ -303,7 +310,7 @@ static ssize_t iommufd_fault_fops_write(struct file *filep, const char __user *b
if (*ppos || count % response_size)
return -ESPIPE;
- mutex_lock(&fault->mutex);
+ mutex_lock(&eventq->mutex);
while (count > done) {
rc = copy_from_user(&response, buf + done, response_size);
if (rc)
@@ -329,62 +336,93 @@ static ssize_t iommufd_fault_fops_write(struct file *filep, const char __user *b
iopf_free_group(group);
done += response_size;
}
- mutex_unlock(&fault->mutex);
+ mutex_unlock(&eventq->mutex);
return done == 0 ? rc : done;
}
-static __poll_t iommufd_fault_fops_poll(struct file *filep,
- struct poll_table_struct *wait)
+static const struct iommufd_eventq_ops iommufd_fault_ops = {
+ .read = &iommufd_fault_fops_read,
+ .write = &iommufd_fault_fops_write,
+};
+
+/* Common Event Queue Functions */
+
+static ssize_t iommufd_eventq_fops_read(struct file *filep, char __user *buf,
+ size_t count, loff_t *ppos)
{
- struct iommufd_fault *fault = filep->private_data;
+ struct iommufd_eventq *eventq = filep->private_data;
+
+ return eventq->ops->read(eventq, buf, count, ppos);
+}
+
+static ssize_t iommufd_eventq_fops_write(struct file *filep,
+ const char __user *buf, size_t count,
+ loff_t *ppos)
+{
+ struct iommufd_eventq *eventq = filep->private_data;
+
+ if (!eventq->ops->write)
+ return -EOPNOTSUPP;
+ return eventq->ops->write(eventq, buf, count, ppos);
+}
+
+static __poll_t iommufd_eventq_fops_poll(struct file *filep,
+ struct poll_table_struct *wait)
+{
+ struct iommufd_eventq *eventq = filep->private_data;
__poll_t pollflags = EPOLLOUT;
- poll_wait(filep, &fault->wait_queue, wait);
- mutex_lock(&fault->mutex);
- if (!list_empty(&fault->deliver))
+ poll_wait(filep, &eventq->wait_queue, wait);
+ mutex_lock(&eventq->mutex);
+ if (!list_empty(&eventq->deliver))
pollflags |= EPOLLIN | EPOLLRDNORM;
- mutex_unlock(&fault->mutex);
+ mutex_unlock(&eventq->mutex);
return pollflags;
}
-static int iommufd_fault_fops_release(struct inode *inode, struct file *filep)
+static int iommufd_eventq_fops_release(struct inode *inode, struct file *filep)
{
- struct iommufd_fault *fault = filep->private_data;
+ struct iommufd_eventq *eventq = filep->private_data;
- refcount_dec(&fault->obj.users);
- iommufd_ctx_put(fault->ictx);
+ refcount_dec(&eventq->obj.users);
+ iommufd_ctx_put(eventq->ictx);
return 0;
}
-static const struct file_operations iommufd_fault_fops = {
+static const struct file_operations iommufd_eventq_fops = {
.owner = THIS_MODULE,
.open = nonseekable_open,
- .read = iommufd_fault_fops_read,
- .write = iommufd_fault_fops_write,
- .poll = iommufd_fault_fops_poll,
- .release = iommufd_fault_fops_release,
+ .read = iommufd_eventq_fops_read,
+ .write = iommufd_eventq_fops_write,
+ .poll = iommufd_eventq_fops_poll,
+ .release = iommufd_eventq_fops_release,
};
-static int iommufd_fault_init(struct iommufd_fault *fault, char *name,
- struct iommufd_ctx *ictx)
+static int iommufd_eventq_init(struct iommufd_eventq *eventq, char *name,
+ struct iommufd_ctx *ictx,
+ const struct iommufd_eventq_ops *ops)
{
struct file *filep;
int fdno;
- mutex_init(&fault->mutex);
- INIT_LIST_HEAD(&fault->deliver);
- init_waitqueue_head(&fault->wait_queue);
+ if (WARN_ON_ONCE(!ops || !ops->read))
+ return -EINVAL;
+
+ mutex_init(&eventq->mutex);
+ INIT_LIST_HEAD(&eventq->deliver);
+ init_waitqueue_head(&eventq->wait_queue);
- filep = anon_inode_getfile(name, &iommufd_fault_fops, fault, O_RDWR);
+ filep = anon_inode_getfile(name, &iommufd_eventq_fops, eventq, O_RDWR);
if (IS_ERR(filep))
return PTR_ERR(filep);
- fault->ictx = ictx;
- iommufd_ctx_get(fault->ictx);
- fault->filep = filep;
- refcount_inc(&fault->obj.users);
+ eventq->ops = ops;
+ eventq->ictx = ictx;
+ iommufd_ctx_get(eventq->ictx);
+ refcount_inc(&eventq->obj.users);
+ eventq->filep = filep;
fdno = get_unused_fd_flags(O_CLOEXEC);
if (fdno < 0)
@@ -402,34 +440,36 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
if (cmd->flags)
return -EOPNOTSUPP;
- fault = iommufd_object_alloc(ucmd->ictx, fault, IOMMUFD_OBJ_FAULT);
+ fault = __iommufd_object_alloc(ucmd->ictx, fault, IOMMUFD_OBJ_FAULT,
+ common.obj);
if (IS_ERR(fault))
return PTR_ERR(fault);
xa_init_flags(&fault->response, XA_FLAGS_ALLOC1);
- fdno = iommufd_fault_init(fault, "[iommufd-pgfault]", ucmd->ictx);
+ fdno = iommufd_eventq_init(&fault->common, "[iommufd-pgfault]",
+ ucmd->ictx, &iommufd_fault_ops);
if (fdno < 0) {
rc = fdno;
goto out_abort;
}
- cmd->out_fault_id = fault->obj.id;
+ cmd->out_fault_id = fault->common.obj.id;
cmd->out_fault_fd = fdno;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
if (rc)
goto out_put_fdno;
- iommufd_object_finalize(ucmd->ictx, &fault->obj);
+ iommufd_object_finalize(ucmd->ictx, &fault->common.obj);
- fd_install(fdno, fault->filep);
+ fd_install(fdno, fault->common.filep);
return 0;
out_put_fdno:
put_unused_fd(fdno);
- fput(fault->filep);
+ fput(fault->common.filep);
out_abort:
- iommufd_object_abort_and_destroy(ucmd->ictx, &fault->obj);
+ iommufd_object_abort_and_destroy(ucmd->ictx, &fault->common.obj);
return rc;
}
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index ce03c3804651..12a576f1f13d 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -14,7 +14,7 @@ static void __iommufd_hwpt_destroy(struct iommufd_hw_pagetable *hwpt)
iommu_domain_free(hwpt->domain);
if (hwpt->fault)
- refcount_dec(&hwpt->fault->obj.users);
+ refcount_dec(&hwpt->fault->common.obj.users);
}
void iommufd_hwpt_paging_destroy(struct iommufd_object *obj)
@@ -403,8 +403,8 @@ int iommufd_hwpt_alloc(struct iommufd_ucmd *ucmd)
hwpt->fault = fault;
hwpt->domain->iopf_handler = iommufd_fault_iopf_handler;
hwpt->domain->fault_data = hwpt;
- refcount_inc(&fault->obj.users);
- iommufd_put_object(ucmd->ictx, &fault->obj);
+ refcount_inc(&fault->common.obj.users);
+ iommufd_put_object(ucmd->ictx, &fault->common.obj);
}
cmd->out_hwpt_id = hwpt->obj.id;
--
2.43.0
* [PATCH v5 05/14] iommufd: Rename fault.c to eventq.c
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
Rename the file, aligning with the new eventq object.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/Makefile | 2 +-
drivers/iommu/iommufd/{fault.c => eventq.c} | 0
2 files changed, 1 insertion(+), 1 deletion(-)
rename drivers/iommu/iommufd/{fault.c => eventq.c} (100%)
diff --git a/drivers/iommu/iommufd/Makefile b/drivers/iommu/iommufd/Makefile
index cb784da6cddc..71d692c9a8f4 100644
--- a/drivers/iommu/iommufd/Makefile
+++ b/drivers/iommu/iommufd/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
iommufd-y := \
device.o \
- fault.o \
+ eventq.o \
hw_pagetable.o \
io_pagetable.o \
ioas.o \
diff --git a/drivers/iommu/iommufd/fault.c b/drivers/iommu/iommufd/eventq.c
similarity index 100%
rename from drivers/iommu/iommufd/fault.c
rename to drivers/iommu/iommufd/eventq.c
--
2.43.0
* [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
Introduce a new IOMMUFD_OBJ_VEVENTQ object for the vIOMMU Event Queue,
which provides user space (VMM) with another FD to read vIOMMU events.
Allow a vIOMMU object to allocate vEVENTQs, with the restriction that each
vIOMMU can only have a single vEVENTQ per type.
Add iommufd_veventq_alloc() with iommufd_veventq_ops for the new ioctl.
And add a supports_veventq viommu op for drivers to help the core code
validate the input vEVENTQ type.
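A driver implementation of this op can be as small as a type check, e.g.
this sketch modeled on the selftest:

	static bool mock_viommu_supports_veventq(unsigned int type)
	{
		return type == IOMMU_VEVENTQ_TYPE_SELFTEST;
	}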
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 58 ++++++++++
include/linux/iommufd.h | 5 +
include/uapi/linux/iommufd.h | 31 ++++++
drivers/iommu/iommufd/eventq.c | 134 ++++++++++++++++++++++++
drivers/iommu/iommufd/main.c | 7 ++
drivers/iommu/iommufd/viommu.c | 2 +
6 files changed, 237 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index dfbc5cfbd164..3c0374154a94 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -547,6 +547,50 @@ static inline int iommufd_hwpt_replace_device(struct iommufd_device *idev,
return iommu_group_replace_domain(idev->igroup->group, hwpt->domain);
}
+/*
+ * An iommufd_veventq object represents an interface to deliver vIOMMU events to
+ * the user space. It is created/destroyed by the user space and associated with
+ * vIOMMU object(s) during the allocations.
+ */
+struct iommufd_veventq {
+ struct iommufd_eventq common;
+ struct iommufd_viommu *viommu;
+ struct list_head node; /* for iommufd_viommu::veventqs */
+
+ unsigned int type;
+};
+
+static inline struct iommufd_veventq *
+eventq_to_veventq(struct iommufd_eventq *eventq)
+{
+ return container_of(eventq, struct iommufd_veventq, common);
+}
+
+static inline struct iommufd_veventq *
+iommufd_get_veventq(struct iommufd_ucmd *ucmd, u32 id)
+{
+ return container_of(iommufd_get_object(ucmd->ictx, id,
+ IOMMUFD_OBJ_VEVENTQ),
+ struct iommufd_veventq, common.obj);
+}
+
+int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd);
+void iommufd_veventq_destroy(struct iommufd_object *obj);
+void iommufd_veventq_abort(struct iommufd_object *obj);
+
+/* An iommufd_vevent represents a vIOMMU event in an iommufd_veventq */
+struct iommufd_vevent {
+ struct list_head node; /* for iommufd_eventq::deliver */
+ ssize_t data_len;
+ u64 event_data[] __counted_by(data_len);
+};
+
+static inline int iommufd_vevent_handler(struct iommufd_veventq *veventq,
+ struct iommufd_vevent *vevent)
+{
+ return iommufd_eventq_notify(&veventq->common, &vevent->node);
+}
+
static inline struct iommufd_viommu *
iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
{
@@ -555,6 +599,20 @@ iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id)
struct iommufd_viommu, obj);
}
+static inline struct iommufd_veventq *
+iommufd_viommu_find_veventq(struct iommufd_viommu *viommu, u32 type)
+{
+ struct iommufd_veventq *veventq, *next;
+
+ lockdep_assert_held(&viommu->veventqs_rwsem);
+
+ list_for_each_entry_safe(veventq, next, &viommu->veventqs, node) {
+ if (veventq->type == type)
+ return veventq;
+ }
+ return NULL;
+}
+
int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
void iommufd_viommu_destroy(struct iommufd_object *obj);
int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 11110c749200..941f2ed29914 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -34,6 +34,7 @@ enum iommufd_object_type {
IOMMUFD_OBJ_FAULT,
IOMMUFD_OBJ_VIOMMU,
IOMMUFD_OBJ_VDEVICE,
+ IOMMUFD_OBJ_VEVENTQ,
#ifdef CONFIG_IOMMUFD_TEST
IOMMUFD_OBJ_SELFTEST,
#endif
@@ -93,6 +94,8 @@ struct iommufd_viommu {
const struct iommufd_viommu_ops *ops;
struct xarray vdevs;
+ struct list_head veventqs;
+ struct rw_semaphore veventqs_rwsem;
unsigned int type;
};
@@ -113,6 +116,7 @@ struct iommufd_viommu {
* array->entry_num to report the number of handled requests.
* The data structure of the array entry must be defined in
* include/uapi/linux/iommufd.h
+ * @supports_veventq: Whether the vIOMMU supports a given vEVENTQ type
*/
struct iommufd_viommu_ops {
void (*destroy)(struct iommufd_viommu *viommu);
@@ -121,6 +125,7 @@ struct iommufd_viommu_ops {
const struct iommu_user_data *user_data);
int (*cache_invalidate)(struct iommufd_viommu *viommu,
struct iommu_user_data_array *array);
+ bool (*supports_veventq)(unsigned int type);
};
#if IS_ENABLED(CONFIG_IOMMUFD)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 34810f6ae2b5..0a08aa82e7cc 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -55,6 +55,7 @@ enum {
IOMMUFD_CMD_VIOMMU_ALLOC = 0x90,
IOMMUFD_CMD_VDEVICE_ALLOC = 0x91,
IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
+ IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
};
/**
@@ -1012,4 +1013,34 @@ struct iommu_ioas_change_process {
#define IOMMU_IOAS_CHANGE_PROCESS \
_IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_CHANGE_PROCESS)
+/**
+ * enum iommu_veventq_type - Virtual Event Queue Type
+ * @IOMMU_VEVENTQ_TYPE_DEFAULT: Reserved for future use
+ */
+enum iommu_veventq_type {
+ IOMMU_VEVENTQ_TYPE_DEFAULT = 0,
+};
+
+/**
+ * struct iommu_veventq_alloc - ioctl(IOMMU_VEVENTQ_ALLOC)
+ * @size: sizeof(struct iommu_veventq_alloc)
+ * @flags: Must be 0
+ * @viommu: virtual IOMMU ID to associate the vEVENTQ with
+ * @type: Type of the vEVENTQ. Must be defined in enum iommu_veventq_type
+ * @out_veventq_id: The ID of the new vEVENTQ
+ * @out_veventq_fd: The fd of the new vEVENTQ. User space must close the
+ * successfully returned fd after using it
+ *
+ * Explicitly allocate a virtual event queue interface for a vIOMMU. A vIOMMU
+ * can have multiple FDs for different types, but is confined to one per @type.
+ */
+struct iommu_veventq_alloc {
+ __u32 size;
+ __u32 flags;
+ __u32 viommu_id;
+ __u32 type;
+ __u32 out_veventq_id;
+ __u32 out_veventq_fd;
+};
+#define IOMMU_VEVENTQ_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VEVENTQ_ALLOC)
#endif
diff --git a/drivers/iommu/iommufd/eventq.c b/drivers/iommu/iommufd/eventq.c
index e386b6c3e6ab..b5be629f38ed 100644
--- a/drivers/iommu/iommufd/eventq.c
+++ b/drivers/iommu/iommufd/eventq.c
@@ -346,6 +346,73 @@ static const struct iommufd_eventq_ops iommufd_fault_ops = {
.write = &iommufd_fault_fops_write,
};
+/* IOMMUFD_OBJ_VEVENTQ Functions */
+
+void iommufd_veventq_abort(struct iommufd_object *obj)
+{
+ struct iommufd_eventq *eventq =
+ container_of(obj, struct iommufd_eventq, obj);
+ struct iommufd_veventq *veventq = eventq_to_veventq(eventq);
+ struct iommufd_viommu *viommu = veventq->viommu;
+ struct iommufd_vevent *cur, *next;
+
+ lockdep_assert_held_write(&viommu->veventqs_rwsem);
+
+ list_for_each_entry_safe(cur, next, &eventq->deliver, node) {
+ list_del(&cur->node);
+ kfree(cur);
+ }
+
+ refcount_dec(&viommu->obj.users);
+ mutex_destroy(&eventq->mutex);
+ list_del(&veventq->node);
+}
+
+void iommufd_veventq_destroy(struct iommufd_object *obj)
+{
+ struct iommufd_veventq *veventq = eventq_to_veventq(
+ container_of(obj, struct iommufd_eventq, obj));
+
+ down_write(&veventq->viommu->veventqs_rwsem);
+ iommufd_veventq_abort(obj);
+ up_write(&veventq->viommu->veventqs_rwsem);
+}
+
+static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
+ char __user *buf, size_t count,
+ loff_t *ppos)
+{
+ size_t done = 0;
+ int rc = 0;
+
+ if (*ppos)
+ return -ESPIPE;
+
+ mutex_lock(&eventq->mutex);
+ while (!list_empty(&eventq->deliver) && count > done) {
+ struct iommufd_vevent *cur = list_first_entry(
+ &eventq->deliver, struct iommufd_vevent, node);
+
+ if (cur->data_len > count - done)
+ break;
+
+ if (copy_to_user(buf + done, cur->event_data, cur->data_len)) {
+ rc = -EFAULT;
+ break;
+ }
+ done += cur->data_len;
+ list_del(&cur->node);
+ kfree(cur);
+ }
+ mutex_unlock(&eventq->mutex);
+
+ return done == 0 ? rc : done;
+}
+
+static const struct iommufd_eventq_ops iommufd_veventq_ops = {
+ .read = &iommufd_veventq_fops_read,
+};
+
/* Common Event Queue Functions */
static ssize_t iommufd_eventq_fops_read(struct file *filep, char __user *buf,
@@ -473,3 +540,70 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
return rc;
}
+
+int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd)
+{
+ struct iommu_veventq_alloc *cmd = ucmd->cmd;
+ struct iommufd_veventq *veventq;
+ struct iommufd_viommu *viommu;
+ int fdno;
+ int rc;
+
+ if (cmd->flags || cmd->type == IOMMU_VEVENTQ_TYPE_DEFAULT)
+ return -EOPNOTSUPP;
+
+ viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+ if (IS_ERR(viommu))
+ return PTR_ERR(viommu);
+
+ if (!viommu->ops || !viommu->ops->supports_veventq ||
+ !viommu->ops->supports_veventq(cmd->type))
+ return -EOPNOTSUPP;
+
+ down_write(&viommu->veventqs_rwsem);
+
+ if (iommufd_viommu_find_veventq(viommu, cmd->type)) {
+ rc = -EEXIST;
+ goto out_unlock_veventqs;
+ }
+
+ veventq = __iommufd_object_alloc(ucmd->ictx, veventq,
+ IOMMUFD_OBJ_VEVENTQ, common.obj);
+ if (IS_ERR(veventq)) {
+ rc = PTR_ERR(veventq);
+ goto out_unlock_veventqs;
+ }
+
+ veventq->type = cmd->type;
+ veventq->viommu = viommu;
+ refcount_inc(&viommu->obj.users);
+ list_add_tail(&veventq->node, &viommu->veventqs);
+
+ fdno = iommufd_eventq_init(&veventq->common, "[iommufd-viommu-event]",
+ ucmd->ictx, &iommufd_veventq_ops);
+ if (fdno < 0) {
+ rc = fdno;
+ goto out_abort;
+ }
+
+ cmd->out_veventq_id = veventq->common.obj.id;
+ cmd->out_veventq_fd = fdno;
+
+ rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+ if (rc)
+ goto out_put_fdno;
+
+ iommufd_object_finalize(ucmd->ictx, &veventq->common.obj);
+ fd_install(fdno, veventq->common.filep);
+ goto out_unlock_veventqs;
+
+out_put_fdno:
+ put_unused_fd(fdno);
+ fput(veventq->common.filep);
+out_abort:
+ iommufd_object_abort_and_destroy(ucmd->ictx, &veventq->common.obj);
+out_unlock_veventqs:
+ up_write(&viommu->veventqs_rwsem);
+ iommufd_put_object(ucmd->ictx, &viommu->obj);
+ return rc;
+}
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index a11e9cfd790f..0d451601fb9a 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -308,6 +308,7 @@ union ucmd_buffer {
struct iommu_ioas_unmap unmap;
struct iommu_option option;
struct iommu_vdevice_alloc vdev;
+ struct iommu_veventq_alloc veventq;
struct iommu_vfio_ioas vfio_ioas;
struct iommu_viommu_alloc viommu;
#ifdef CONFIG_IOMMUFD_TEST
@@ -363,6 +364,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
IOCTL_OP(IOMMU_OPTION, iommufd_option, struct iommu_option, val64),
IOCTL_OP(IOMMU_VDEVICE_ALLOC, iommufd_vdevice_alloc_ioctl,
struct iommu_vdevice_alloc, virt_id),
+ IOCTL_OP(IOMMU_VEVENTQ_ALLOC, iommufd_veventq_alloc,
+ struct iommu_veventq_alloc, out_veventq_fd),
IOCTL_OP(IOMMU_VFIO_IOAS, iommufd_vfio_ioas, struct iommu_vfio_ioas,
__reserved),
IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl,
@@ -505,6 +508,10 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
[IOMMUFD_OBJ_VDEVICE] = {
.destroy = iommufd_vdevice_destroy,
},
+ [IOMMUFD_OBJ_VEVENTQ] = {
+ .destroy = iommufd_veventq_destroy,
+ .abort = iommufd_veventq_abort,
+ },
[IOMMUFD_OBJ_VIOMMU] = {
.destroy = iommufd_viommu_destroy,
},
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 69b88e8c7c26..01df2b985f02 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -59,6 +59,8 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
viommu->ictx = ucmd->ictx;
viommu->hwpt = hwpt_paging;
refcount_inc(&viommu->hwpt->common.obj.users);
+ INIT_LIST_HEAD(&viommu->veventqs);
+ init_rwsem(&viommu->veventqs_rwsem);
/*
* It is the most likely case that a physical IOMMU is unpluggable. A
* pluggable IOMMU instance (if exists) is responsible for refcounting
--
2.43.0
* [PATCH v5 07/14] iommufd/viommu: Add iommufd_viommu_get_vdev_id helper
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
This is a reverse search vs. iommufd_viommu_find_dev, as drivers may want
to convert a (physical) struct device pointer to its virtual device ID for
an event injection to the user space VM.
Again, this avoids exposing more core structures to the drivers than the
iommufd_viommu alone.
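A zero return means the device is not associated with the vIOMMU, so a
typical caller looks like this sketch (mirroring the selftest's attach
path in a later patch):

	unsigned long vdev_id = iommufd_viommu_get_vdev_id(viommu, dev);

	if (!vdev_id)
		return -ENOENT; /* no vDEVICE for this dev on the vIOMMU */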
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommufd.h | 8 ++++++++
drivers/iommu/iommufd/driver.c | 20 ++++++++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 941f2ed29914..a6dd9f8edcf3 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -192,6 +192,8 @@ struct iommufd_object *_iommufd_object_alloc(struct iommufd_ctx *ictx,
enum iommufd_object_type type);
struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu,
unsigned long vdev_id);
+unsigned long iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu,
+ struct device *dev);
#else /* !CONFIG_IOMMUFD_DRIVER_CORE */
static inline struct iommufd_object *
_iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size,
@@ -205,6 +207,12 @@ iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
{
return NULL;
}
+
+static inline unsigned long
+iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, struct device *dev)
+{
+ return 0;
+}
#endif /* CONFIG_IOMMUFD_DRIVER_CORE */
/*
diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c
index 2d98b04ff1cb..e5d7397c0a6c 100644
--- a/drivers/iommu/iommufd/driver.c
+++ b/drivers/iommu/iommufd/driver.c
@@ -49,5 +49,25 @@ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu,
}
EXPORT_SYMBOL_NS_GPL(iommufd_viommu_find_dev, "IOMMUFD");
+/* Return 0 if device is not associated to the vIOMMU */
+unsigned long iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu,
+ struct device *dev)
+{
+ struct iommufd_vdevice *vdev;
+ unsigned long vdev_id = 0;
+ unsigned long index;
+
+ xa_lock(&viommu->vdevs);
+ xa_for_each(&viommu->vdevs, index, vdev) {
+ if (vdev && vdev->dev == dev) {
+ vdev_id = (unsigned long)vdev->id;
+ break;
+ }
+ }
+ xa_unlock(&viommu->vdevs);
+ return vdev_id;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_viommu_get_vdev_id, "IOMMUFD");
+
MODULE_DESCRIPTION("iommufd code shared with builtin modules");
MODULE_LICENSE("GPL");
--
2.43.0
* [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
Similar to iommu_report_device_fault, this allows IOMMU drivers to report
vIOMMU events from threaded IRQ handlers to user space hypervisors.
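For instance, a sketch of the call from a threaded IRQ handler (the event
layout and type borrow from the selftest; a real driver uses its own
uAPI-defined ones):

	struct iommu_viommu_event_selftest ev = { .virt_id = vdev_id };
	int rc;

	rc = iommufd_viommu_report_event(viommu, IOMMU_VEVENTQ_TYPE_SELFTEST,
					 &ev, sizeof(ev));
	/* rc is -ENODEV without a viommu, or -EOPNOTSUPP if no vEVENTQ of
	 * the given type was allocated; the driver may log and drop it. */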
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommufd.h | 11 +++++++++
drivers/iommu/iommufd/driver.c | 43 ++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index a6dd9f8edcf3..55e71dca3664 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -11,6 +11,7 @@
#include <linux/refcount.h>
#include <linux/types.h>
#include <linux/xarray.h>
+#include <uapi/linux/iommufd.h>
struct device;
struct file;
@@ -194,6 +195,9 @@ struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu,
unsigned long vdev_id);
unsigned long iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu,
struct device *dev);
+int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
+ enum iommu_veventq_type type, void *event_data,
+ size_t data_len);
#else /* !CONFIG_IOMMUFD_DRIVER_CORE */
static inline struct iommufd_object *
_iommufd_object_alloc(struct iommufd_ctx *ictx, size_t size,
@@ -213,6 +217,13 @@ iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu, struct device *dev)
{
return 0;
}
+
+static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
+ enum iommu_veventq_type type,
+ void *event_data, size_t data_len)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_IOMMUFD_DRIVER_CORE */
/*
diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c
index e5d7397c0a6c..b03ceb016bee 100644
--- a/drivers/iommu/iommufd/driver.c
+++ b/drivers/iommu/iommufd/driver.c
@@ -69,5 +69,48 @@ unsigned long iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu,
}
EXPORT_SYMBOL_NS_GPL(iommufd_viommu_get_vdev_id, "IOMMUFD");
+/*
+ * Typically called in driver's threaded IRQ handler.
+ * The @type and @event_data must be defined in include/uapi/linux/iommufd.h
+ */
+int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
+ enum iommu_veventq_type type, void *event_data,
+ size_t data_len)
+{
+ struct iommufd_veventq *veventq;
+ struct iommufd_vevent *vevent;
+ int rc = 0;
+
+ if (!viommu)
+ return -ENODEV;
+ if (WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
+ !viommu->ops->supports_veventq(type)))
+ return -EOPNOTSUPP;
+ if (WARN_ON_ONCE(!data_len || !event_data))
+ return -EINVAL;
+
+ down_read(&viommu->veventqs_rwsem);
+
+ veventq = iommufd_viommu_find_veventq(viommu, type);
+ if (!veventq) {
+ rc = -EOPNOTSUPP;
+ goto out_unlock_veventqs;
+ }
+
+ vevent = kmalloc(struct_size(vevent, event_data, data_len), GFP_KERNEL);
+ if (!vevent) {
+ rc = -ENOMEM;
+ goto out_unlock_veventqs;
+ }
+ memcpy(vevent->event_data, event_data, data_len);
+ vevent->data_len = data_len;
+
+ iommufd_vevent_handler(veventq, vevent);
+out_unlock_veventqs:
+ up_read(&viommu->veventqs_rwsem);
+ return rc;
+}
+EXPORT_SYMBOL_NS_GPL(iommufd_viommu_report_event, "IOMMUFD");
+
MODULE_DESCRIPTION("iommufd code shared with builtin modules");
MODULE_LICENSE("GPL");
--
2.43.0
* [PATCH v5 09/14] iommufd/selftest: Require vdev_id when attaching to a nested domain
From: Nicolin Chen @ 2025-01-07 17:10 UTC
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
When attaching a device to a vIOMMU-based nested domain, a vdev_id must
be present. Add a check that hard-requires it, preparing for the vEVENTQ
support in the following patch. Then, update the TEST_F accordingly.
A HWPT-based nested domain will return a NULL new_viommu, so it has no
such vDEVICE requirement.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/selftest.c | 23 +++++++++++++++++++++++
tools/testing/selftests/iommu/iommufd.c | 5 +++++
2 files changed, 28 insertions(+)
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index a0de6d6d4e68..d1438d81e664 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -161,7 +161,10 @@ enum selftest_obj_type {
struct mock_dev {
struct device dev;
+ struct mock_viommu *viommu;
+ struct rw_semaphore viommu_rwsem;
unsigned long flags;
+ unsigned long vdev_id;
int id;
u32 cache[MOCK_DEV_CACHE_NUM];
};
@@ -193,10 +196,29 @@ static int mock_domain_nop_attach(struct iommu_domain *domain,
struct device *dev)
{
struct mock_dev *mdev = to_mock_dev(dev);
+ struct mock_viommu *new_viommu = NULL;
+ unsigned long vdev_id = 0;
if (domain->dirty_ops && (mdev->flags & MOCK_FLAGS_DEVICE_NO_DIRTY))
return -EINVAL;
+ iommu_group_mutex_assert(dev);
+ if (domain->type == IOMMU_DOMAIN_NESTED) {
+ new_viommu = to_mock_nested(domain)->mock_viommu;
+ if (new_viommu) {
+ vdev_id = iommufd_viommu_get_vdev_id(&new_viommu->core,
+ dev);
+ if (!vdev_id)
+ return -ENOENT;
+ }
+ }
+ if (new_viommu != mdev->viommu) {
+ down_write(&mdev->viommu_rwsem);
+ mdev->viommu = new_viommu;
+ mdev->vdev_id = vdev_id;
+ up_write(&mdev->viommu_rwsem);
+ }
+
return 0;
}
@@ -861,6 +883,7 @@ static struct mock_dev *mock_dev_create(unsigned long dev_flags)
if (!mdev)
return ERR_PTR(-ENOMEM);
+ init_rwsem(&mdev->viommu_rwsem);
device_initialize(&mdev->dev);
mdev->flags = dev_flags;
mdev->dev.release = mock_dev_release;
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index a1b2b657999d..212e5d62e13d 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -2736,6 +2736,7 @@ TEST_F(iommufd_viommu, viommu_alloc_nested_iopf)
uint32_t iopf_hwpt_id;
uint32_t fault_id;
uint32_t fault_fd;
+ uint32_t vdev_id;
if (self->device_id) {
test_ioctl_fault_alloc(&fault_id, &fault_fd);
@@ -2752,6 +2753,10 @@ TEST_F(iommufd_viommu, viommu_alloc_nested_iopf)
&iopf_hwpt_id, IOMMU_HWPT_DATA_SELFTEST, &data,
sizeof(data));
+ /* Must allocate vdevice before attaching to a nested hwpt */
+ test_err_mock_domain_replace(ENOENT, self->stdev_id,
+ iopf_hwpt_id);
+ test_cmd_vdevice_alloc(viommu_id, dev_id, 0x99, &vdev_id);
test_cmd_mock_domain_replace(self->stdev_id, iopf_hwpt_id);
EXPECT_ERRNO(EBUSY,
_test_ioctl_destroy(self->fd, iopf_hwpt_id));
--
2.43.0
^ permalink raw reply related [flat|nested] 77+ messages in thread
* [PATCH v5 10/14] iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ coverage
2025-01-07 17:10 [PATCH v5 00/14] iommufd: Add vIOMMU infrastructure (Part-3: vEVENTQ) Nicolin Chen
` (8 preceding siblings ...)
2025-01-07 17:10 ` [PATCH v5 09/14] iommufd/selftest: Require vdev_id when attaching to a nested domain Nicolin Chen
@ 2025-01-07 17:10 ` Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 11/14] iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage Nicolin Chen
` (3 subsequent siblings)
13 siblings, 0 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-07 17:10 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
The handler will get the vDEVICE object from the given mdev and convert it to
its per-vIOMMU virtual ID, mimicking a real IOMMU driver.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_test.h | 10 ++++++++++
drivers/iommu/iommufd/selftest.c | 30 ++++++++++++++++++++++++++++
2 files changed, 40 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index a6b7a163f636..87e9165cea27 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -24,6 +24,7 @@ enum {
IOMMU_TEST_OP_MD_CHECK_IOTLB,
IOMMU_TEST_OP_TRIGGER_IOPF,
IOMMU_TEST_OP_DEV_CHECK_CACHE,
+ IOMMU_TEST_OP_TRIGGER_VEVENT,
};
enum {
@@ -145,6 +146,9 @@ struct iommu_test_cmd {
__u32 id;
__u32 cache;
} check_dev_cache;
+ struct {
+ __u32 dev_id;
+ } trigger_vevent;
};
__u32 last;
};
@@ -212,4 +216,10 @@ struct iommu_viommu_invalidate_selftest {
__u32 cache_id;
};
+#define IOMMU_VEVENTQ_TYPE_SELFTEST 0xbeefbeef
+
+struct iommu_viommu_event_selftest {
+ __u32 virt_id;
+};
+
#endif
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index d1438d81e664..7c20a85ed142 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -1631,6 +1631,34 @@ static int iommufd_test_trigger_iopf(struct iommufd_ucmd *ucmd,
return 0;
}
+static int iommufd_test_trigger_vevent(struct iommufd_ucmd *ucmd,
+ struct iommu_test_cmd *cmd)
+{
+ struct iommu_viommu_event_selftest test = {};
+ struct iommufd_device *idev;
+ struct mock_dev *mdev;
+ int rc = -ENOENT;
+
+ idev = iommufd_get_device(ucmd, cmd->trigger_vevent.dev_id);
+ if (IS_ERR(idev))
+ return PTR_ERR(idev);
+ mdev = to_mock_dev(idev->dev);
+
+ down_read(&mdev->viommu_rwsem);
+ if (!mdev->viommu || !mdev->vdev_id)
+ goto out_unlock;
+
+ test.virt_id = mdev->vdev_id;
+ rc = iommufd_viommu_report_event(&mdev->viommu->core,
+ IOMMU_VEVENTQ_TYPE_SELFTEST, &test,
+ sizeof(test));
+out_unlock:
+ up_read(&mdev->viommu_rwsem);
+ iommufd_put_object(ucmd->ictx, &idev->obj);
+
+ return rc;
+}
+
void iommufd_selftest_destroy(struct iommufd_object *obj)
{
struct selftest_obj *sobj = to_selftest_obj(obj);
@@ -1712,6 +1740,8 @@ int iommufd_test(struct iommufd_ucmd *ucmd)
cmd->dirty.flags);
case IOMMU_TEST_OP_TRIGGER_IOPF:
return iommufd_test_trigger_iopf(ucmd, cmd);
+ case IOMMU_TEST_OP_TRIGGER_VEVENT:
+ return iommufd_test_trigger_vevent(ucmd, cmd);
default:
return -EOPNOTSUPP;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 77+ messages in thread
* [PATCH v5 11/14] iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage
2025-01-07 17:10 [PATCH v5 00/14] iommufd: Add vIOMMU infrastructure (Part-3: vEVENTQ) Nicolin Chen
` (9 preceding siblings ...)
2025-01-07 17:10 ` [PATCH v5 10/14] iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ coverage Nicolin Chen
@ 2025-01-07 17:10 ` Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 12/14] Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ Nicolin Chen
` (2 subsequent siblings)
13 siblings, 0 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-07 17:10 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
Trigger a vEVENT by feeding an idev ID, and validate that the returned
virt_id equals the value that was set to the vDEVICE.
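From the test's point of view, the flow is roughly the following (a trimmed
sketch of the helpers added below; error handling omitted):

    struct iommu_veventq_alloc cmd = {
        .size = sizeof(cmd),
        .type = IOMMU_VEVENTQ_TYPE_SELFTEST,
        .viommu_id = viommu_id,
    };
    struct iommu_viommu_event_selftest event;
    struct pollfd pollfd;

    ioctl(fd, IOMMU_VEVENTQ_ALLOC, &cmd);
    /* ... trigger a vEVENT via IOMMU_TEST_OP_TRIGGER_VEVENT ... */
    pollfd = (struct pollfd){ .fd = cmd.out_veventq_fd, .events = POLLIN };
    poll(&pollfd, 1, 1000);
    read(cmd.out_veventq_fd, &event, sizeof(event));
    assert(event.virt_id == virt_id); /* must match the vDEVICE's virt_id */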
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
tools/testing/selftests/iommu/iommufd_utils.h | 65 +++++++++++++++++++
tools/testing/selftests/iommu/iommufd.c | 22 +++++++
.../selftests/iommu/iommufd_fail_nth.c | 7 ++
3 files changed, 94 insertions(+)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index d979f5b0efe8..15cf28b88332 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -9,6 +9,7 @@
#include <sys/ioctl.h>
#include <stdint.h>
#include <assert.h>
+#include <poll.h>
#include "../kselftest_harness.h"
#include "../../../../drivers/iommu/iommufd/iommufd_test.h"
@@ -936,3 +937,67 @@ static int _test_cmd_vdevice_alloc(int fd, __u32 viommu_id, __u32 idev_id,
EXPECT_ERRNO(_errno, \
_test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, \
virt_id, vdev_id))
+
+static int _test_cmd_veventq_alloc(int fd, __u32 viommu_id, __u32 type,
+ __u32 *veventq_id, __u32 *veventq_fd)
+{
+ struct iommu_veventq_alloc cmd = {
+ .size = sizeof(cmd),
+ .type = type,
+ .viommu_id = viommu_id,
+ };
+ int ret;
+
+ ret = ioctl(fd, IOMMU_VEVENTQ_ALLOC, &cmd);
+ if (ret)
+ return ret;
+ if (veventq_id)
+ *veventq_id = cmd.out_veventq_id;
+ if (veventq_fd)
+ *veventq_fd = cmd.out_veventq_fd;
+ return 0;
+}
+
+#define test_cmd_veventq_alloc(viommu_id, type, veventq_id, veventq_fd) \
+ ASSERT_EQ(0, _test_cmd_veventq_alloc(self->fd, viommu_id, type, \
+ veventq_id, veventq_fd))
+#define test_err_veventq_alloc(_errno, viommu_id, type, veventq_id, \
+ veventq_fd) \
+ EXPECT_ERRNO(_errno, \
+ _test_cmd_veventq_alloc(self->fd, viommu_id, type, \
+ veventq_id, veventq_fd))
+
+static int _test_cmd_trigger_vevent(int fd, __u32 dev_id, __u32 event_fd,
+ __u32 virt_id)
+{
+ struct iommu_test_cmd trigger_vevent_cmd = {
+ .size = sizeof(trigger_vevent_cmd),
+ .op = IOMMU_TEST_OP_TRIGGER_VEVENT,
+ .trigger_vevent = {
+ .dev_id = dev_id,
+ },
+ };
+ struct pollfd pollfd = { .fd = event_fd, .events = POLLIN };
+ struct iommu_viommu_event_selftest event;
+ ssize_t bytes;
+ int ret;
+
+ ret = ioctl(fd, _IOMMU_TEST_CMD(IOMMU_TEST_OP_TRIGGER_VEVENT),
+ &trigger_vevent_cmd);
+ if (ret)
+ return ret;
+
+ ret = poll(&pollfd, 1, 1000);
+ if (ret < 0)
+ return ret;
+
+ bytes = read(event_fd, &event, sizeof(event));
+ if (bytes <= 0)
+ return -EIO;
+
+ return event.virt_id == virt_id ? 0 : -EINVAL;
+}
+
+#define test_cmd_trigger_vevent(dev_id, event_fd, virt_id) \
+ ASSERT_EQ(0, _test_cmd_trigger_vevent(self->fd, dev_id, event_fd, \
+ virt_id))
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 212e5d62e13d..c225f6e5108d 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -2774,15 +2774,37 @@ TEST_F(iommufd_viommu, vdevice_alloc)
uint32_t viommu_id = self->viommu_id;
uint32_t dev_id = self->device_id;
uint32_t vdev_id = 0;
+ uint32_t veventq_id;
+ uint32_t veventq_fd;
if (dev_id) {
+ /* Must allocate vdevice before attaching to a nested hwpt */
+ test_err_mock_domain_replace(ENOENT, self->stdev_id,
+ self->nested_hwpt_id);
+
+ test_cmd_veventq_alloc(viommu_id, IOMMU_VEVENTQ_TYPE_SELFTEST,
+ &veventq_id, &veventq_fd);
+ test_err_veventq_alloc(EEXIST, viommu_id,
+ IOMMU_VEVENTQ_TYPE_SELFTEST, NULL, NULL);
/* Set vdev_id to 0x99, unset it, and set to 0x88 */
test_cmd_vdevice_alloc(viommu_id, dev_id, 0x99, &vdev_id);
+ test_cmd_mock_domain_replace(self->stdev_id,
+ self->nested_hwpt_id);
+ test_cmd_trigger_vevent(dev_id, veventq_fd, 0x99);
test_err_vdevice_alloc(EEXIST, viommu_id, dev_id, 0x99,
&vdev_id);
+ test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id);
test_ioctl_destroy(vdev_id);
+
+ /* Try again with 0x88 */
test_cmd_vdevice_alloc(viommu_id, dev_id, 0x88, &vdev_id);
+ test_cmd_mock_domain_replace(self->stdev_id,
+ self->nested_hwpt_id);
+ test_cmd_trigger_vevent(dev_id, veventq_fd, 0x88);
+ close(veventq_fd);
+ test_cmd_mock_domain_replace(self->stdev_id, self->ioas_id);
test_ioctl_destroy(vdev_id);
+ test_ioctl_destroy(veventq_id);
} else {
test_err_vdevice_alloc(ENOENT, viommu_id, dev_id, 0x99, NULL);
}
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index 64b1f8e1b0cf..99a7f7897bb2 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -620,6 +620,7 @@ TEST_FAIL_NTH(basic_fail_nth, device)
};
struct iommu_test_hw_info info;
uint32_t fault_id, fault_fd;
+ uint32_t veventq_id, veventq_fd;
uint32_t fault_hwpt_id;
uint32_t ioas_id;
uint32_t ioas_id2;
@@ -692,6 +693,12 @@ TEST_FAIL_NTH(basic_fail_nth, device)
IOMMU_HWPT_DATA_SELFTEST, &data, sizeof(data)))
return -1;
+ if (_test_cmd_veventq_alloc(self->fd, viommu_id,
+ IOMMU_VEVENTQ_TYPE_SELFTEST, &veventq_id,
+ &veventq_fd))
+ return -1;
+ close(veventq_fd);
+
return 0;
}
--
2.43.0
^ permalink raw reply related [flat|nested] 77+ messages in thread
* [PATCH v5 12/14] Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ
2025-01-07 17:10 [PATCH v5 00/14] iommufd: Add vIOMMU infrastructure (Part-3: vEVENTQ) Nicolin Chen
` (10 preceding siblings ...)
2025-01-07 17:10 ` [PATCH v5 11/14] iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage Nicolin Chen
@ 2025-01-07 17:10 ` Nicolin Chen
2025-01-10 7:13 ` Tian, Kevin
2025-01-07 17:10 ` [PATCH v5 13/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU Nicolin Chen
13 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-07 17:10 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
With the introduction of the new objects, update the doc to reflect them.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
Documentation/userspace-api/iommufd.rst | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
index 70289d6815d2..cb2edce6f87d 100644
--- a/Documentation/userspace-api/iommufd.rst
+++ b/Documentation/userspace-api/iommufd.rst
@@ -63,6 +63,13 @@ Following IOMMUFD objects are exposed to userspace:
space usually has mappings from guest-level I/O virtual addresses to guest-
level physical addresses.
+- IOMMUFD_OBJ_FAULT, representing a software queue for an HWPT reporting IO
+ page faults using the IOMMU HW's PRI (Page Request Interface). This queue
+ object provides user space an FD to poll the page fault events and also to
+ respond to those events. A FAULT object must be created first to get a
+ fault_id that can then be used to allocate a fault-enabled HWPT via the
+ IOMMU_HWPT_ALLOC command, by setting IOMMU_HWPT_FAULT_ID_VALID in its flags.
+
- IOMMUFD_OBJ_VIOMMU, representing a slice of the physical IOMMU instance,
passed to or shared with a VM. It may be some HW-accelerated virtualization
features and some SW resources used by the VM. For examples:
@@ -109,6 +116,13 @@ Following IOMMUFD objects are exposed to userspace:
vIOMMU, which is a separate ioctl call from attaching the same device to an
HWPT_PAGING that the vIOMMU holds.
+- IOMMUFD_OBJ_VEVENTQ, representing a software queue for a vIOMMU to report its
+ events, such as translation faults that occurred to a nested stage-1 and
+ HW-specific events. This queue object provides user space an FD to poll the
+ vIOMMU events. A vIOMMU object must be created first to get its viommu_id,
+ which can then be used to allocate a vEVENTQ. Each vIOMMU can support multiple
+ types of vEVENTs, but is confined to one vEVENTQ per vEVENTQ type.
+
All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
The diagrams below show relationships between user-visible objects and kernel
@@ -251,8 +265,10 @@ User visible objects are backed by following datastructures:
- iommufd_device for IOMMUFD_OBJ_DEVICE.
- iommufd_hwpt_paging for IOMMUFD_OBJ_HWPT_PAGING.
- iommufd_hwpt_nested for IOMMUFD_OBJ_HWPT_NESTED.
+- iommufd_fault for IOMMUFD_OBJ_FAULT.
- iommufd_viommu for IOMMUFD_OBJ_VIOMMU.
- iommufd_vdevice for IOMMUFD_OBJ_VDEVICE.
+- iommufd_veventq for IOMMUFD_OBJ_VEVENTQ.
Several terminologies when looking at these datastructures:
--
2.43.0
^ permalink raw reply related [flat|nested] 77+ messages in thread
* [PATCH v5 13/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster
2025-01-07 17:10 [PATCH v5 00/14] iommufd: Add vIOMMU infrastructure (Part-3: vEVENTQ) Nicolin Chen
` (11 preceding siblings ...)
2025-01-07 17:10 ` [PATCH v5 12/14] Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ Nicolin Chen
@ 2025-01-07 17:10 ` Nicolin Chen
2025-01-13 19:29 ` Jason Gunthorpe
2025-01-07 17:10 ` [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU Nicolin Chen
13 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-07 17:10 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
Use it to store all vSMMU-related data. The vsid (Virtual Stream ID) will
be the first use case. Then, add a rw_semaphore to protect it.
Also add a pair of arm_smmu_attach_prepare/commit_vmaster helpers to set
or unset the master->vmaster pointer. Put these helpers inside the existing
arm_smmu_attach_prepare/commit(). Note that the identity and blocked ops don't
call arm_smmu_attach_prepare/commit(), so simply call the new helpers at the
top of those ops, so that a device attaching to an identity/blocked domain
unsets master->vmaster when the device is moving away from a nested domain.
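For the identity/blocked ops, the pairing is simply (simplified from the
diff below):

    struct arm_smmu_attach_state state = { .master = master };

    ret = arm_smmu_attach_prepare_vmaster(&state, domain);
    if (ret)
        return ret;
    /* For a non-nested domain this swaps a NULL into master->vmaster */
    arm_smmu_attach_commit_vmaster(&state);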
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 23 +++++++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 49 +++++++++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 +++++++++++-
3 files changed, 103 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index bd9d7c85576a..4435ad7db776 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -799,6 +799,11 @@ struct arm_smmu_stream {
struct rb_node node;
};
+struct arm_smmu_vmaster {
+ struct arm_vsmmu *vsmmu;
+ unsigned long vsid;
+};
+
struct arm_smmu_event {
u8 stall : 1,
ssv : 1,
@@ -824,6 +829,8 @@ struct arm_smmu_master {
struct arm_smmu_device *smmu;
struct device *dev;
struct arm_smmu_stream *streams;
+ struct arm_smmu_vmaster *vmaster;
+ struct rw_semaphore vmaster_rwsem;
/* Locked by the iommu core using the group mutex */
struct arm_smmu_ctx_desc_cfg cd_table;
unsigned int num_streams;
@@ -972,6 +979,7 @@ struct arm_smmu_attach_state {
bool disable_ats;
ioasid_t ssid;
/* Resulting state */
+ struct arm_smmu_vmaster *vmaster;
bool ats_enabled;
};
@@ -1055,9 +1063,24 @@ struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev,
struct iommu_domain *parent,
struct iommufd_ctx *ictx,
unsigned int viommu_type);
+int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
+ struct iommu_domain *domain);
+void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state);
#else
#define arm_smmu_hw_info NULL
#define arm_vsmmu_alloc NULL
+
+static inline int
+arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
+ struct iommu_domain *domain)
+{
+ return 0; /* NOP */
+}
+
+static inline void
+arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state)
+{
+}
#endif /* CONFIG_ARM_SMMU_V3_IOMMUFD */
#endif /* _ARM_SMMU_V3_H */
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index c7cc613050d9..2b6253ef0e8f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -85,6 +85,55 @@ static void arm_smmu_make_nested_domain_ste(
}
}
+int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
+ struct iommu_domain *domain)
+{
+ struct arm_smmu_nested_domain *nested_domain;
+ struct arm_smmu_vmaster *vmaster;
+ unsigned long vsid;
+ unsigned int cfg;
+
+ iommu_group_mutex_assert(state->master->dev);
+
+ if (domain->type != IOMMU_DOMAIN_NESTED)
+ return 0;
+ nested_domain = to_smmu_nested_domain(domain);
+
+ /* Skip ABORT/BYPASS or invalid vSTE */
+ cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(nested_domain->ste[0]));
+ if (cfg == STRTAB_STE_0_CFG_ABORT || cfg == STRTAB_STE_0_CFG_BYPASS)
+ return 0;
+ if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V)))
+ return 0;
+
+ vsid = iommufd_viommu_get_vdev_id(&nested_domain->vsmmu->core,
+ state->master->dev);
+ /* Fail the attach if the vSID is not correctly set by the user space */
+ if (!vsid)
+ return -ENOENT;
+
+ vmaster = kzalloc(sizeof(*vmaster), GFP_KERNEL);
+ if (!vmaster)
+ return -ENOMEM;
+ vmaster->vsmmu = nested_domain->vsmmu;
+ vmaster->vsid = vsid;
+ state->vmaster = vmaster;
+
+ return 0;
+}
+
+void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state)
+{
+ struct arm_smmu_master *master = state->master;
+
+ down_write(&master->vmaster_rwsem);
+ if (state->vmaster != master->vmaster) {
+ kfree(master->vmaster);
+ master->vmaster = state->vmaster;
+ }
+ up_write(&master->vmaster_rwsem);
+}
+
static int arm_smmu_attach_dev_nested(struct iommu_domain *domain,
struct device *dev)
{
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index ea76f25c0661..686c171dd273 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2802,6 +2802,7 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
struct arm_smmu_domain *smmu_domain =
to_smmu_domain_devices(new_domain);
unsigned long flags;
+ int ret;
/*
* arm_smmu_share_asid() must not see two domains pointing to the same
@@ -2831,9 +2832,15 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
}
if (smmu_domain) {
+ ret = arm_smmu_attach_prepare_vmaster(state, new_domain);
+ if (ret)
+ return ret;
+
master_domain = kzalloc(sizeof(*master_domain), GFP_KERNEL);
- if (!master_domain)
+ if (!master_domain) {
+ kfree(state->vmaster);
return -ENOMEM;
+ }
master_domain->master = master;
master_domain->ssid = state->ssid;
if (new_domain->type == IOMMU_DOMAIN_NESTED)
@@ -2860,6 +2867,7 @@ int arm_smmu_attach_prepare(struct arm_smmu_attach_state *state,
spin_unlock_irqrestore(&smmu_domain->devices_lock,
flags);
kfree(master_domain);
+ kfree(state->vmaster);
return -EINVAL;
}
@@ -2892,6 +2900,8 @@ void arm_smmu_attach_commit(struct arm_smmu_attach_state *state)
lockdep_assert_held(&arm_smmu_asid_lock);
+ arm_smmu_attach_commit_vmaster(state);
+
if (state->ats_enabled && !master->ats_enabled) {
arm_smmu_enable_ats(master);
} else if (state->ats_enabled && master->ats_enabled) {
@@ -3158,8 +3168,17 @@ static void arm_smmu_attach_dev_ste(struct iommu_domain *domain,
static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
struct device *dev)
{
+ int ret;
struct arm_smmu_ste ste;
struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct arm_smmu_attach_state state = {
+ .master = master,
+ };
+
+ ret = arm_smmu_attach_prepare_vmaster(&state, domain);
+ if (ret)
+ return ret;
+ arm_smmu_attach_commit_vmaster(&state);
arm_smmu_make_bypass_ste(master->smmu, &ste);
arm_smmu_attach_dev_ste(domain, dev, &ste, STRTAB_STE_1_S1DSS_BYPASS);
@@ -3178,7 +3197,17 @@ static struct iommu_domain arm_smmu_identity_domain = {
static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
struct device *dev)
{
+ int ret;
struct arm_smmu_ste ste;
+ struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+ struct arm_smmu_attach_state state = {
+ .master = master,
+ };
+
+ ret = arm_smmu_attach_prepare_vmaster(&state, domain);
+ if (ret)
+ return ret;
+ arm_smmu_attach_commit_vmaster(&state);
arm_smmu_make_abort_ste(&ste);
arm_smmu_attach_dev_ste(domain, dev, &ste,
@@ -3428,6 +3457,7 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
master->dev = dev;
master->smmu = smmu;
+ init_rwsem(&master->vmaster_rwsem);
dev_iommu_priv_set(dev, master);
ret = arm_smmu_insert_master(smmu, master);
--
2.43.0
^ permalink raw reply related [flat|nested] 77+ messages in thread
* [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU
2025-01-07 17:10 [PATCH v5 00/14] iommufd: Add vIOMMU infrastructure (Part-3: vEVENTQ) Nicolin Chen
` (12 preceding siblings ...)
2025-01-07 17:10 ` [PATCH v5 13/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster Nicolin Chen
@ 2025-01-07 17:10 ` Nicolin Chen
2025-01-09 11:04 ` kernel test robot
13 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-07 17:10 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: joro, suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
Aside from the IOPF framework, iommufd provides an additional pathway to
report hardware events, via the vEVENTQ of the vIOMMU infrastructure.
Define an iommu_vevent_arm_smmuv3 uAPI structure, and report stage-1 events
in the threaded IRQ handler.
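The forwarding helper mainly rewrites the StreamID field of the event with
the vSID before reporting it (simplified from the diff below):

    vevt.evt[0] &= ~EVTQ_0_SID;
    vevt.evt[0] |= FIELD_PREP(EVTQ_0_SID, vmaster->vsid);
    iommufd_viommu_report_event(&vmaster->vsmmu->core,
                                IOMMU_VEVENTQ_TYPE_ARM_SMMUV3, &vevt,
                                sizeof(vevt));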
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 +++
include/uapi/linux/iommufd.h | 15 +++++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 22 +++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 58 +++++++++++--------
4 files changed, 77 insertions(+), 25 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 4435ad7db776..d24c3d8ee397 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1066,6 +1066,7 @@ struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev,
int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
struct iommu_domain *domain);
void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state);
+int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt);
#else
#define arm_smmu_hw_info NULL
#define arm_vsmmu_alloc NULL
@@ -1081,6 +1082,12 @@ static inline void
arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state)
{
}
+
+static inline int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster,
+ u64 *evt)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_ARM_SMMU_V3_IOMMUFD */
#endif /* _ARM_SMMU_V3_H */
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 0a08aa82e7cc..55e3d5a14cca 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -1016,9 +1016,24 @@ struct iommu_ioas_change_process {
/**
* enum iommu_veventq_type - Virtual Event Queue Type
* @IOMMU_VEVENTQ_TYPE_DEFAULT: Reserved for future use
+ * @IOMMU_VEVENTQ_TYPE_ARM_SMMUV3: ARM SMMUv3 Virtual Event Queue
*/
enum iommu_veventq_type {
IOMMU_VEVENTQ_TYPE_DEFAULT = 0,
+ IOMMU_VEVENTQ_TYPE_ARM_SMMUV3 = 1,
+};
+
+/**
+ * struct iommu_vevent_arm_smmuv3 - ARM SMMUv3 Virtual Event
+ * (IOMMU_VEVENTQ_TYPE_ARM_SMMUV3)
+ * @evt: 256-bit ARM SMMUv3 Event record, little-endian.
+ * (Refer to "7.3 Event records" in SMMUv3 HW Spec)
+ *
+ * StreamID field reports a virtual device ID. To receive a virtual event for a
+ * device, a vDEVICE must be allocated via IOMMU_VDEVICE_ALLOC.
+ */
+struct iommu_vevent_arm_smmuv3 {
+ __aligned_le64 evt[4];
};
/**
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 2b6253ef0e8f..82b4513e56f3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -389,9 +389,15 @@ static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
return ret;
}
+static bool arm_vsmmu_supports_veventq(unsigned int type)
+{
+ return type == IOMMU_VEVENTQ_TYPE_ARM_SMMUV3;
+}
+
static const struct iommufd_viommu_ops arm_vsmmu_ops = {
.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
.cache_invalidate = arm_vsmmu_cache_invalidate,
+ .supports_veventq = arm_vsmmu_supports_veventq,
};
struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev,
@@ -447,4 +453,20 @@ struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev,
return &vsmmu->core;
}
+int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt)
+{
+ struct iommu_vevent_arm_smmuv3 vevt =
+ *(struct iommu_vevent_arm_smmuv3 *)evt;
+
+ vevt.evt[0] &= ~EVTQ_0_SID;
+ vevt.evt[0] |= FIELD_PREP(EVTQ_0_SID, vmaster->vsid);
+
+ vevt.evt[0] = cpu_to_le64(vevt.evt[0]);
+ vevt.evt[1] = cpu_to_le64(vevt.evt[1]);
+
+ return iommufd_viommu_report_event(&vmaster->vsmmu->core,
+ IOMMU_VEVENTQ_TYPE_ARM_SMMUV3, &vevt,
+ sizeof(vevt));
+}
+
MODULE_IMPORT_NS("IOMMUFD");
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 686c171dd273..59fbc342a095 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1812,8 +1812,8 @@ static void arm_smmu_decode_event(struct arm_smmu_device *smmu, u64 *raw,
mutex_unlock(&smmu->streams_mutex);
}
-static int arm_smmu_handle_event(struct arm_smmu_device *smmu,
- struct arm_smmu_event *event)
+static int arm_smmu_handle_event(struct arm_smmu_device *smmu, u64 *evt,
+ struct arm_smmu_event *event)
{
int ret = 0;
u32 perm = 0;
@@ -1831,31 +1831,30 @@ static int arm_smmu_handle_event(struct arm_smmu_device *smmu,
return -EOPNOTSUPP;
}
- if (!event->stall)
- return -EOPNOTSUPP;
-
- if (event->read)
- perm |= IOMMU_FAULT_PERM_READ;
- else
- perm |= IOMMU_FAULT_PERM_WRITE;
+ if (event->stall) {
+ if (event->read)
+ perm |= IOMMU_FAULT_PERM_READ;
+ else
+ perm |= IOMMU_FAULT_PERM_WRITE;
- if (event->instruction)
- perm |= IOMMU_FAULT_PERM_EXEC;
+ if (event->instruction)
+ perm |= IOMMU_FAULT_PERM_EXEC;
- if (event->privileged)
- perm |= IOMMU_FAULT_PERM_PRIV;
+ if (event->privileged)
+ perm |= IOMMU_FAULT_PERM_PRIV;
- flt->type = IOMMU_FAULT_PAGE_REQ;
- flt->prm = (struct iommu_fault_page_request) {
- .flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
- .grpid = event->stag,
- .perm = perm,
- .addr = event->iova,
- };
+ flt->type = IOMMU_FAULT_PAGE_REQ;
+ flt->prm = (struct iommu_fault_page_request){
+ .flags = IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE,
+ .grpid = event->stag,
+ .perm = perm,
+ .addr = event->iova,
+ };
- if (event->ssv) {
- flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
- flt->prm.pasid = event->ssid;
+ if (event->ssv) {
+ flt->prm.flags |= IOMMU_FAULT_PAGE_REQUEST_PASID_VALID;
+ flt->prm.pasid = event->ssid;
+ }
}
mutex_lock(&smmu->streams_mutex);
@@ -1865,7 +1864,16 @@ static int arm_smmu_handle_event(struct arm_smmu_device *smmu,
goto out_unlock;
}
- ret = iommu_report_device_fault(master->dev, &fault_evt);
+ if (event->stall) {
+ ret = iommu_report_device_fault(master->dev, &fault_evt);
+ } else {
+ down_read(&master->vmaster_rwsem);
+ if (master->vmaster && !event->s2)
+ ret = arm_vmaster_report_event(master->vmaster, evt);
+ else
+ ret = -EFAULT; /* Unhandled events should be pinned */
+ up_read(&master->vmaster_rwsem);
+ }
out_unlock:
mutex_unlock(&smmu->streams_mutex);
return ret;
@@ -1943,7 +1951,7 @@ static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
do {
while (!queue_remove_raw(q, evt)) {
arm_smmu_decode_event(smmu, evt, &event);
- if (arm_smmu_handle_event(smmu, &event))
+ if (arm_smmu_handle_event(smmu, evt, &event))
arm_smmu_dump_event(smmu, evt, &event, &rs);
put_device(event.dev);
--
2.43.0
^ permalink raw reply related [flat|nested] 77+ messages in thread
* Re: [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU
2025-01-07 17:10 ` [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU Nicolin Chen
@ 2025-01-09 11:04 ` kernel test robot
2025-01-13 19:01 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: kernel test robot @ 2025-01-09 11:04 UTC (permalink / raw)
To: Nicolin Chen, jgg, kevin.tian, corbet, will
Cc: oe-kbuild-all, joro, suravee.suthikulpanit, robin.murphy, dwmw2,
baolu.lu, shuah, linux-kernel, iommu, linux-arm-kernel,
linux-kselftest, linux-doc, eric.auger, jean-philippe, mdf,
mshavit, shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu,
patches
Hi Nicolin,
kernel test robot noticed the following build warnings:
[auto build test WARNING on e94dc6ddda8dd3770879a132d577accd2cce25f9]
url: https://github.com/intel-lab-lkp/linux/commits/Nicolin-Chen/iommufd-Keep-OBJ-IOCTL-lists-in-an-alphabetical-order/20250108-011247
base: e94dc6ddda8dd3770879a132d577accd2cce25f9
patch link: https://lore.kernel.org/r/03c01be90e53f743a91b6c1376c408404b891867.1736237481.git.nicolinc%40nvidia.com
patch subject: [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU
config: arm64-randconfig-r131-20250109 (https://download.01.org/0day-ci/archive/20250109/202501091822.4ocbIobQ-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce: (https://download.01.org/0day-ci/archive/20250109/202501091822.4ocbIobQ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202501091822.4ocbIobQ-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:461:21: sparse: sparse: invalid assignment: &=
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:461:21: sparse: left side has type restricted __le64
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:461:21: sparse: right side has type unsigned long long
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:462:21: sparse: sparse: invalid assignment: |=
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:462:21: sparse: left side has type restricted __le64
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:462:21: sparse: right side has type unsigned long long
>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:464:23: sparse: sparse: cast from restricted __le64
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:465:23: sparse: sparse: cast from restricted __le64
vim +461 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
455
456 int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt)
457 {
458 struct iommu_vevent_arm_smmuv3 vevt =
459 *(struct iommu_vevent_arm_smmuv3 *)evt;
460
> 461 vevt.evt[0] &= ~EVTQ_0_SID;
462 vevt.evt[0] |= FIELD_PREP(EVTQ_0_SID, vmaster->vsid);
463
> 464 vevt.evt[0] = cpu_to_le64(vevt.evt[0]);
465 vevt.evt[1] = cpu_to_le64(vevt.evt[1]);
466
467 return iommufd_viommu_report_event(&vmaster->vsmmu->core,
468 IOMMU_VEVENTQ_TYPE_ARM_SMMUV3, &vevt,
469 sizeof(vevt));
470 }
471
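A possible fix (an untested sketch) is to do the masking on the
native-endian values and convert in one step:

    vevt.evt[0] = cpu_to_le64((evt[0] & ~EVTQ_0_SID) |
                              FIELD_PREP(EVTQ_0_SID, vmaster->vsid));
    vevt.evt[1] = cpu_to_le64(evt[1]);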
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 77+ messages in thread
* RE: [PATCH v5 01/14] iommufd: Keep OBJ/IOCTL lists in an alphabetical order
2025-01-07 17:10 ` [PATCH v5 01/14] iommufd: Keep OBJ/IOCTL lists in an alphabetical order Nicolin Chen
@ 2025-01-10 6:26 ` Tian, Kevin
2025-01-10 17:25 ` Jason Gunthorpe
2025-01-14 19:29 ` Jason Gunthorpe
2 siblings, 0 replies; 77+ messages in thread
From: Tian, Kevin @ 2025-01-10 6:26 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: joro@8bytes.org, suravee.suthikulpanit@amd.com,
robin.murphy@arm.com, dwmw2@infradead.org,
baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, January 8, 2025 1:10 AM
>
> Reorder the existing OBJ/IOCTL lists.
>
> Also run clang-format for the same coding style at line wrappings.
>
> No functional change.
>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 77+ messages in thread
* RE: [PATCH v5 04/14] iommufd: Abstract an iommufd_eventq from iommufd_fault
2025-01-07 17:10 ` [PATCH v5 04/14] iommufd: Abstract an iommufd_eventq from iommufd_fault Nicolin Chen
@ 2025-01-10 6:26 ` Tian, Kevin
2025-01-10 17:26 ` Jason Gunthorpe
1 sibling, 0 replies; 77+ messages in thread
From: Tian, Kevin @ 2025-01-10 6:26 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: joro@8bytes.org, suravee.suthikulpanit@amd.com,
robin.murphy@arm.com, dwmw2@infradead.org,
baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, January 8, 2025 1:10 AM
>
> The fault object was designed exclusively for hwpt's IO page faults (PRI).
> But its queue implementation can be reused for other purposes too, such as
> hardware IRQ and event injections to user space.
>
> Meanwhile, a fault object holds a list of faults. So it's more accurate to
> call it a "fault queue". Combining the reusing idea above, abstract a new
> iommufd_eventq as a common structure embedded into struct iommufd_fault,
> similar to hwpt_paging holding a common hwpt.
>
> Add a common iommufd_eventq_ops and iommufd_eventq_init to prepare for an
> IOMMUFD_OBJ_VEVENTQ (vIOMMU Event Queue).
>
> Also, add missing xa_destroy and mutex_destroy in iommufd_fault_destroy().
>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 77+ messages in thread
* RE: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-07 17:10 ` [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC Nicolin Chen
@ 2025-01-10 7:06 ` Tian, Kevin
2025-01-10 21:29 ` Nicolin Chen
2025-01-10 17:48 ` Jason Gunthorpe
1 sibling, 1 reply; 77+ messages in thread
From: Tian, Kevin @ 2025-01-10 7:06 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: joro@8bytes.org, suravee.suthikulpanit@amd.com,
robin.murphy@arm.com, dwmw2@infradead.org,
baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, January 8, 2025 1:10 AM
> +
> +int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd)
> +{
> + struct iommu_veventq_alloc *cmd = ucmd->cmd;
> + struct iommufd_veventq *veventq;
> + struct iommufd_viommu *viommu;
> + int fdno;
> + int rc;
> +
> + if (cmd->flags || cmd->type == IOMMU_VEVENTQ_TYPE_DEFAULT)
> + return -EOPNOTSUPP;
> +
> + viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> + if (IS_ERR(viommu))
> + return PTR_ERR(viommu);
> +
> + if (!viommu->ops || !viommu->ops->supports_veventq ||
> + !viommu->ops->supports_veventq(cmd->type))
> + return -EOPNOTSUPP;
> +
I'm not sure about the necessity of the above check. The event queue
is just a software struct with a user-specified format for the iommu
driver to report viommu events. The struct itself is not constrained
by the hardware capability, though I can't see a real usage in
which a smmu driver wants to report a vtd event. But legitimately
a user can create any type of event queue, which might just be
never used.
It sounds clearer to do the check when the IOPF cap is actually enabled
on a device contained in the viommu. At that point, check whether
an event queue of the required type has been created. If not, then
fail the iopf enabling.
Then this probably reveals another todo in this series. It seems you still
let the smmu driver statically enable iopf when probing the device.
Sounds like iommufd_viommu_alloc_hwpt_nested() may accept
IOMMU_HWPT_FAULT_ID_VALID to refer to an event queue, and
later dynamically enable/disable iopf when attaching a device to the
hwpt, checking the event queue type there. Just like how the fault
object is handled.
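Something like this at iopf-enable time, where the type variable is just
made up to illustrate the idea:

    /* when actually enabling iopf for a device behind the viommu */
    if (!iommufd_viommu_find_veventq(viommu, required_type))
        return -EOPNOTSUPP; /* fail the iopf enabling */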
^ permalink raw reply [flat|nested] 77+ messages in thread
* RE: [PATCH v5 07/14] iommufd/viommu: Add iommufd_viommu_get_vdev_id helper
2025-01-07 17:10 ` [PATCH v5 07/14] iommufd/viommu: Add iommufd_viommu_get_vdev_id helper Nicolin Chen
@ 2025-01-10 7:07 ` Tian, Kevin
2025-01-10 21:35 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Tian, Kevin @ 2025-01-10 7:07 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: joro@8bytes.org, suravee.suthikulpanit@amd.com,
robin.murphy@arm.com, dwmw2@infradead.org,
baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, January 8, 2025 1:10 AM
> +
> + xa_lock(&viommu->vdevs);
> + xa_for_each(&viommu->vdevs, index, vdev) {
> + if (vdev && vdev->dev == dev) {
> + vdev_id = (unsigned long)vdev->id;
> + break;
> + }
> + }
Nit. No need to check the vdev pointer, as commented in the last review.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 77+ messages in thread
* RE: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-07 17:10 ` [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper Nicolin Chen
@ 2025-01-10 7:12 ` Tian, Kevin
2025-01-10 14:51 ` Jason Gunthorpe
2025-01-10 17:41 ` Jason Gunthorpe
1 sibling, 1 reply; 77+ messages in thread
From: Tian, Kevin @ 2025-01-10 7:12 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: joro@8bytes.org, suravee.suthikulpanit@amd.com,
robin.murphy@arm.com, dwmw2@infradead.org,
baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, January 8, 2025 1:10 AM
>
> +/*
> + * Typically called in driver's threaded IRQ handler.
> > + * The @type and @event_data must be defined in include/uapi/linux/iommufd.h
> + */
> +int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
> > + enum iommu_veventq_type type, void *event_data,
> > + size_t data_len)
> +{
> + struct iommufd_veventq *veventq;
> + struct iommufd_vevent *vevent;
> + int rc = 0;
> +
> + if (!viommu)
> + return -ENODEV;
> > + if (WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
> > + !viommu->ops->supports_veventq(type)))
> + return -EOPNOTSUPP;
Hmm, the driver knows which type it supports before
calling this helper. Why bother having the helper call into
the driver again to verify?
^ permalink raw reply [flat|nested] 77+ messages in thread
* RE: [PATCH v5 12/14] Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ
2025-01-07 17:10 ` [PATCH v5 12/14] Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ Nicolin Chen
@ 2025-01-10 7:13 ` Tian, Kevin
0 siblings, 0 replies; 77+ messages in thread
From: Tian, Kevin @ 2025-01-10 7:13 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: joro@8bytes.org, suravee.suthikulpanit@amd.com,
robin.murphy@arm.com, dwmw2@infradead.org,
baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, January 8, 2025 1:10 AM
>
> With the introduction of the new objects, update the doc to reflect that.
>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-10 7:12 ` Tian, Kevin
@ 2025-01-10 14:51 ` Jason Gunthorpe
2025-01-10 18:40 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 14:51 UTC (permalink / raw)
To: Tian, Kevin
Cc: Nicolin Chen, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
On Fri, Jan 10, 2025 at 07:12:46AM +0000, Tian, Kevin wrote:
> > + if (!viommu)
> > + return -ENODEV;
> > + if (WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
> > + !viommu->ops->supports_veventq(type)))
> > + return -EOPNOTSUPP;
>
> Hmm the driver knows which type is supported by itself before
> calling this helper. Why bother having the helper calling into
> the driver again to verify?
Indeed, it might make sense to protect this with
	if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
as a compiled-out assertion, or drop it.
We shouldn't have unnecessary argument validation on fast paths;
the !viommu check should go too.
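i.e. roughly, keeping the existing check but compiling it out on
production configs:

    if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
        WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
                     !viommu->ops->supports_veventq(type)))
        return -EOPNOTSUPP;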
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 01/14] iommufd: Keep OBJ/IOCTL lists in an alphabetical order
2025-01-07 17:10 ` [PATCH v5 01/14] iommufd: Keep OBJ/IOCTL lists in an alphabetical order Nicolin Chen
2025-01-10 6:26 ` Tian, Kevin
@ 2025-01-10 17:25 ` Jason Gunthorpe
2025-01-14 19:29 ` Jason Gunthorpe
2 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 17:25 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 07, 2025 at 09:10:04AM -0800, Nicolin Chen wrote:
> Reorder the existing OBJ/IOCTL lists.
>
> Also run clang-format for the same coding style at line wrappings.
>
> No functional change.
>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/main.c | 30 ++++++++++++++----------------
> 1 file changed, 14 insertions(+), 16 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 02/14] iommufd/fault: Add an iommufd_fault_init() helper
2025-01-07 17:10 ` [PATCH v5 02/14] iommufd/fault: Add an iommufd_fault_init() helper Nicolin Chen
@ 2025-01-10 17:25 ` Jason Gunthorpe
0 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 17:25 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 07, 2025 at 09:10:05AM -0800, Nicolin Chen wrote:
> The infrastructure of a fault object will be shared with a new vEVENTQ
> object in a following change. Add a helper for a vEVENTQ allocator to
> call it too.
>
> Reorder the iommufd_ctx_get and refcount_inc, to keep them symmetrical
> with the iommufd_fault_fops_release().
>
> Since the new vEVENTQ doesn't need "response", leave the xa_init_flags
> in its original location.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/fault.c | 48 ++++++++++++++++++++---------------
> 1 file changed, 28 insertions(+), 20 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 03/14] iommufd/fault: Move iommufd_fault_iopf_handler() to header
2025-01-07 17:10 ` [PATCH v5 03/14] iommufd/fault: Move iommufd_fault_iopf_handler() to header Nicolin Chen
@ 2025-01-10 17:25 ` Jason Gunthorpe
0 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 17:25 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 07, 2025 at 09:10:06AM -0800, Nicolin Chen wrote:
> The new vEVENTQ object will need a similar function for drivers to report
> the vIOMMU related events. Split the common part out to a smaller helper,
> and place it in the header so that CONFIG_IOMMUFD_DRIVER_CORE can include
> that in the driver.c file for drivers to use.
>
> Then keep iommufd_fault_iopf_handler() in the header too, since it's quite
> simple after all.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 20 +++++++++++++++++++-
> drivers/iommu/iommufd/fault.c | 17 -----------------
> 2 files changed, 19 insertions(+), 18 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 04/14] iommufd: Abstract an iommufd_eventq from iommufd_fault
2025-01-07 17:10 ` [PATCH v5 04/14] iommufd: Abstract an iommufd_eventq from iommufd_fault Nicolin Chen
2025-01-10 6:26 ` Tian, Kevin
@ 2025-01-10 17:26 ` Jason Gunthorpe
2025-01-10 20:49 ` Nicolin Chen
1 sibling, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 17:26 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 07, 2025 at 09:10:07AM -0800, Nicolin Chen wrote:
> @@ -433,32 +434,35 @@ void iopt_remove_access(struct io_pagetable *iopt,
> u32 iopt_access_list_id);
> void iommufd_access_destroy_object(struct iommufd_object *obj);
>
> -/*
> - * An iommufd_fault object represents an interface to deliver I/O page faults
> - * to the user space. These objects are created/destroyed by the user space and
> - * associated with hardware page table objects during page-table allocation.
> - */
> -struct iommufd_fault {
> +struct iommufd_eventq_ops {
> + ssize_t (*read)(struct iommufd_eventq *eventq, char __user *buf,
> + size_t count, loff_t *ppos); /* Mandatory op */
> + ssize_t (*write)(struct iommufd_eventq *eventq, const char __user *buf,
> + size_t count, loff_t *ppos); /* Optional op */
> +};
I think I'd recommend avoiding this; more like this:
diff --git a/drivers/iommu/iommufd/eventq.c b/drivers/iommu/iommufd/eventq.c
index b5be629f38eda7..73f5e8a6b17f54 100644
--- a/drivers/iommu/iommufd/eventq.c
+++ b/drivers/iommu/iommufd/eventq.c
@@ -341,11 +341,6 @@ static ssize_t iommufd_fault_fops_write(struct iommufd_eventq *eventq,
return done == 0 ? rc : done;
}
-static const struct iommufd_eventq_ops iommufd_fault_ops = {
- .read = &iommufd_fault_fops_read,
- .write = &iommufd_fault_fops_write,
-};
-
/* IOMMUFD_OBJ_VEVENTQ Functions */
void iommufd_veventq_abort(struct iommufd_object *obj)
@@ -409,31 +404,8 @@ static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
return done == 0 ? rc : done;
}
-static const struct iommufd_eventq_ops iommufd_veventq_ops = {
- .read = &iommufd_veventq_fops_read,
-};
-
/* Common Event Queue Functions */
-static ssize_t iommufd_eventq_fops_read(struct file *filep, char __user *buf,
- size_t count, loff_t *ppos)
-{
- struct iommufd_eventq *eventq = filep->private_data;
-
- return eventq->ops->read(eventq, buf, count, ppos);
-}
-
-static ssize_t iommufd_eventq_fops_write(struct file *filep,
- const char __user *buf, size_t count,
- loff_t *ppos)
-{
- struct iommufd_eventq *eventq = filep->private_data;
-
- if (!eventq->ops->write)
- return -EOPNOTSUPP;
- return eventq->ops->write(eventq, buf, count, ppos);
-}
-
static __poll_t iommufd_eventq_fops_poll(struct file *filep,
struct poll_table_struct *wait)
{
@@ -458,34 +430,31 @@ static int iommufd_eventq_fops_release(struct inode *inode, struct file *filep)
return 0;
}
-static const struct file_operations iommufd_eventq_fops = {
- .owner = THIS_MODULE,
- .open = nonseekable_open,
- .read = iommufd_eventq_fops_read,
- .write = iommufd_eventq_fops_write,
- .poll = iommufd_eventq_fops_poll,
- .release = iommufd_eventq_fops_release,
-};
+#define INIT_EVENTQ_FOPS(read_op, write_op) \
+ (struct file_operations){ \
+ .owner = THIS_MODULE, \
+ .open = nonseekable_open, \
+ .read = read_op, \
+ .write = write_op, \
+ .poll = iommufd_eventq_fops_poll, \
+ .release = iommufd_eventq_fops_release, \
+ }
static int iommufd_eventq_init(struct iommufd_eventq *eventq, char *name,
struct iommufd_ctx *ictx,
- const struct iommufd_eventq_ops *ops)
+ const struct file_operations *fops)
{
struct file *filep;
int fdno;
- if (WARN_ON_ONCE(!ops || !ops->read))
- return -EINVAL;
-
mutex_init(&eventq->mutex);
INIT_LIST_HEAD(&eventq->deliver);
init_waitqueue_head(&eventq->wait_queue);
- filep = anon_inode_getfile(name, &iommufd_eventq_fops, eventq, O_RDWR);
+ filep = anon_inode_getfile(name, fops, eventq, O_RDWR);
if (IS_ERR(filep))
return PTR_ERR(filep);
- eventq->ops = ops;
eventq->ictx = ictx;
iommufd_ctx_get(eventq->ictx);
refcount_inc(&eventq->obj.users);
@@ -497,6 +466,9 @@ static int iommufd_eventq_init(struct iommufd_eventq *eventq, char *name,
return fdno;
}
+static const struct file_operations iommufd_pgfault_fops =
+ INIT_EVENTQ_FOPS(iommufd_fault_fops_read, iommufd_fault_fops_write);
+
int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
{
struct iommu_fault_alloc *cmd = ucmd->cmd;
@@ -515,7 +487,7 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
xa_init_flags(&fault->response, XA_FLAGS_ALLOC1);
fdno = iommufd_eventq_init(&fault->common, "[iommufd-pgfault]",
- ucmd->ictx, &iommufd_fault_ops);
+ ucmd->ictx, &iommufd_pgfault_fops);
if (fdno < 0) {
rc = fdno;
goto out_abort;
@@ -541,6 +513,9 @@ int iommufd_fault_alloc(struct iommufd_ucmd *ucmd)
return rc;
}
+static const struct file_operations iommufd_viommu_fops =
+ INIT_EVENTQ_FOPS(iommufd_veventq_fops_read, NULL);
+
int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd)
{
struct iommu_veventq_alloc *cmd = ucmd->cmd;
@@ -580,7 +555,7 @@ int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd)
list_add_tail(&veventq->node, &viommu->veventqs);
fdno = iommufd_eventq_init(&veventq->common, "[iommufd-viommu-event]",
- ucmd->ictx, &iommufd_veventq_ops);
+ ucmd->ictx, &iommufd_viommu_fops);
if (fdno < 0) {
rc = fdno;
goto out_abort;
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 3c0374154a94d3..6c23d5b58901af 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -434,20 +434,11 @@ void iopt_remove_access(struct io_pagetable *iopt,
u32 iopt_access_list_id);
void iommufd_access_destroy_object(struct iommufd_object *obj);
-struct iommufd_eventq_ops {
- ssize_t (*read)(struct iommufd_eventq *eventq, char __user *buf,
- size_t count, loff_t *ppos); /* Mandatory op */
- ssize_t (*write)(struct iommufd_eventq *eventq, const char __user *buf,
- size_t count, loff_t *ppos); /* Optional op */
-};
-
struct iommufd_eventq {
struct iommufd_object obj;
struct iommufd_ctx *ictx;
struct file *filep;
- const struct iommufd_eventq_ops *ops;
-
/* The lists of outstanding events protected by below mutex. */
struct mutex mutex;
struct list_head deliver;
^ permalink raw reply related [flat|nested] 77+ messages in thread
* Re: [PATCH v5 05/14] iommufd: Rename fault.c to eventq.c
2025-01-07 17:10 ` [PATCH v5 05/14] iommufd: Rename fault.c to eventq.c Nicolin Chen
@ 2025-01-10 17:27 ` Jason Gunthorpe
0 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 17:27 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 07, 2025 at 09:10:08AM -0800, Nicolin Chen wrote:
> Rename the file, aligning with the new eventq object.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/Makefile | 2 +-
> drivers/iommu/iommufd/{fault.c => eventq.c} | 0
> 2 files changed, 1 insertion(+), 1 deletion(-)
> rename drivers/iommu/iommufd/{fault.c => eventq.c} (100%)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-07 17:10 ` [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper Nicolin Chen
2025-01-10 7:12 ` Tian, Kevin
@ 2025-01-10 17:41 ` Jason Gunthorpe
2025-01-10 18:38 ` Nicolin Chen
1 sibling, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 17:41 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 07, 2025 at 09:10:11AM -0800, Nicolin Chen wrote:
> +/*
> + * Typically called in driver's threaded IRQ handler.
> + * The @type and @event_data must be defined in include/uapi/linux/iommufd.h
> + */
> +int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
> + enum iommu_veventq_type type, void *event_data,
> + size_t data_len)
> +{
> + struct iommufd_veventq *veventq;
> + struct iommufd_vevent *vevent;
> + int rc = 0;
> +
> + if (!viommu)
> + return -ENODEV;
> + if (WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
> + !viommu->ops->supports_veventq(type)))
> + return -EOPNOTSUPP;
> + if (WARN_ON_ONCE(!data_len || !event_data))
> + return -EINVAL;
> +
> + down_read(&viommu->veventqs_rwsem);
> +
> + veventq = iommufd_viommu_find_veventq(viommu, type);
> + if (!veventq) {
> + rc = -EOPNOTSUPP;
> + goto out_unlock_veventqs;
> + }
> +
> + vevent = kmalloc(struct_size(vevent, event_data, data_len), GFP_KERNEL);
> + if (!vevent) {
> + rc = -ENOMEM;
> + goto out_unlock_veventqs;
> + }
> + memcpy(vevent->event_data, event_data, data_len);
The page fault path is self-limited because endpoint devices are only
able to issue a certain number of PRIs before they have to stop.
But the async events generated by something like the SMMU are not
self-limiting and we can have a huge barrage of them. I think you need
to add some kind of limiting here, otherwise we will OOM the kernel
and crash, e.g. if the VM spams protection errors.
The virtual event queue should behave the same as if the physical
event queue overflows, and that logic should be in the smmu driver -
this should return some Exxx to indicate the queue is filled.
I suppose we will need a way to indicate lost events to userspace on
top of this?
Presumably userspace should specify the max queue size.
Jason
* Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-07 17:10 ` [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC Nicolin Chen
2025-01-10 7:06 ` Tian, Kevin
@ 2025-01-10 17:48 ` Jason Gunthorpe
2025-01-10 19:27 ` Nicolin Chen
1 sibling, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 17:48 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 07, 2025 at 09:10:09AM -0800, Nicolin Chen wrote:
> +static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
> + char __user *buf, size_t count,
> + loff_t *ppos)
> +{
> + size_t done = 0;
> + int rc = 0;
> +
> + if (*ppos)
> + return -ESPIPE;
> +
> + mutex_lock(&eventq->mutex);
> + while (!list_empty(&eventq->deliver) && count > done) {
> + struct iommufd_vevent *cur = list_first_entry(
> + &eventq->deliver, struct iommufd_vevent, node);
> +
> + if (cur->data_len > count - done)
> + break;
> +
> + if (copy_to_user(buf + done, cur->event_data, cur->data_len)) {
> + rc = -EFAULT;
> + break;
> + }
Now that I look at this more closely, the fault path this is copied
from is not great.
This copy_to_user() can block while waiting on a page fault, possibly
for a long time. While blocked the mutex is held and we can't add more
entries to the list.
That will cause the shared IRQ handler in the iommu driver to back up,
which would cause a global DOS.
This probably wants to be organized to look more like
while (itm = eventq_get_next_item(eventq)) {
if (..) {
eventq_restore_failed_item(eventq);
return -1;
}
}
Where the next_item would just be a simple spinlock across the linked
list manipulation.
Jason
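[A rough sketch of the suggested restructuring, assuming the deliver
list is protected by a new spinlock so that nothing is held across
copy_to_user(); the helper names come from the pseudocode above and
are hypothetical:]

	static struct iommufd_vevent *
	eventq_get_next_item(struct iommufd_eventq *eventq)
	{
		struct iommufd_vevent *cur = NULL;

		spin_lock(&eventq->lock);
		if (!list_empty(&eventq->deliver)) {
			cur = list_first_entry(&eventq->deliver,
					       struct iommufd_vevent, node);
			list_del(&cur->node);
		}
		spin_unlock(&eventq->lock);
		return cur;
	}

	static void eventq_restore_failed_item(struct iommufd_eventq *eventq,
					       struct iommufd_vevent *cur)
	{
		/* Put the item back at the head so it is not lost or reordered */
		spin_lock(&eventq->lock);
		list_add(&cur->node, &eventq->deliver);
		spin_unlock(&eventq->lock);
	}

	/* In fops_read(), no lock is then held across copy_to_user() */
	while ((cur = eventq_get_next_item(eventq))) {
		if (cur->data_len > count - done ||
		    copy_to_user(buf + done, cur->event_data, cur->data_len)) {
			eventq_restore_failed_item(eventq, cur);
			break;
		}
		done += cur->data_len;
		kfree(cur);
	}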
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-10 17:41 ` Jason Gunthorpe
@ 2025-01-10 18:38 ` Nicolin Chen
2025-01-10 19:51 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-10 18:38 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 10, 2025 at 01:41:32PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 07, 2025 at 09:10:11AM -0800, Nicolin Chen wrote:
> > +/*
> > + * Typically called in driver's threaded IRQ handler.
> > + * The @type and @event_data must be defined in include/uapi/linux/iommufd.h
> > + */
> > +int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
> > + enum iommu_veventq_type type, void *event_data,
> > + size_t data_len)
> > +{
> > + struct iommufd_veventq *veventq;
> > + struct iommufd_vevent *vevent;
> > + int rc = 0;
> > +
> > + if (!viommu)
> > + return -ENODEV;
> > + if (WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
> > + !viommu->ops->supports_veventq(type)))
> > + return -EOPNOTSUPP;
> > + if (WARN_ON_ONCE(!data_len || !event_data))
> > + return -EINVAL;
> > +
> > + down_read(&viommu->veventqs_rwsem);
> > +
> > + veventq = iommufd_viommu_find_veventq(viommu, type);
> > + if (!veventq) {
> > + rc = -EOPNOTSUPP;
> > + goto out_unlock_veventqs;
> > + }
> > +
> > + vevent = kmalloc(struct_size(vevent, event_data, data_len), GFP_KERNEL);
> > + if (!vevent) {
> > + rc = -ENOMEM;
> > + goto out_unlock_veventqs;
> > + }
> > + memcpy(vevent->event_data, event_data, data_len);
>
> The page fault path is self-limited because endpoint devices are only
> able to issue a certain number of PRIs before they have to stop.
>
> But the async events generated by something like the SMMU are not
> self-limiting and we can have a huge barrage of them. I think you need
> to add some kind of limiting here, otherwise we will OOM the kernel
> and crash, e.g. if the VM spams protection errors.
Ack. I think we can just use an atomic counter in the producer
and consumer functions.
> The virtual event queue should behave the same as if the physical
> event queue overflows, and that logic should be in the smmu driver -
> this should return some Exxx to indicate the queue is filled.
Hmm, the driver only screams...
static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
{
[...]
/*
* Not much we can do on overflow, so scream and pretend we're
* trying harder.
*/
if (queue_sync_prod_in(q) == -EOVERFLOW)
dev_err(smmu->dev, "EVTQ overflow detected -- events lost\n");
> > I suppose we will need a way to indicate lost events to userspace on
> top of this?
Perhaps another u32 flag in the arm_smmuv3_vevent struct to report
an overflow. That said, what will userspace/VMM need to do with it?
> Presumably userspace should specify the max queue size.
Yes. Similarly, vCMDQ has a vcmdq_log2size in the driver structure
for that. For veventq, this piece is core-managed, so we will need
a veventq_size or so in the common iommufd_veventq_alloc structure.
Thanks!
Nicolin
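[A sketch of the counter-based limiting mentioned above; num_events
and depth match the fields that show up later in this thread, though
their exact placement here is an assumption:]

	/* Producer side, in iommufd_viommu_report_event() */
	if (atomic_read(&veventq->num_events) == veventq->depth) {
		rc = -EOVERFLOW;	/* queue is full, drop this event */
		goto out_unlock_veventqs;
	}
	atomic_inc(&veventq->num_events);
	list_add_tail(&vevent->node, &veventq->common.deliver);

	/* Consumer side, in iommufd_veventq_fops_read(), per dequeued event */
	atomic_dec(&veventq->num_events);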
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-10 14:51 ` Jason Gunthorpe
@ 2025-01-10 18:40 ` Nicolin Chen
0 siblings, 0 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-10 18:40 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Tian, Kevin, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
On Fri, Jan 10, 2025 at 10:51:49AM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 07:12:46AM +0000, Tian, Kevin wrote:
>
> > > + if (!viommu)
> > > + return -ENODEV;
> > > + if (WARN_ON_ONCE(!viommu->ops || !viommu->ops-
> > > >supports_veventq ||
> > > + !viommu->ops->supports_veventq(type)))
> > > + return -EOPNOTSUPP;
> >
> > Hmm the driver knows which type is supported by itself before
> > calling this helper. Why bother having the helper calling into
> > the driver again to verify?
>
> Indeed, it might make sense to protect this with
>
> if (IS_ENABLED(CONFIG_IOMMUFD_TEST))
>
> As a compiled out assertion
>
> Or drop it
>
> We shouldn't have unnecessary argument validation on fast paths,
> !viommu should go too.
Ack. I will drop those two if-statements.
Nicolin
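[A sketch of the compiled-out assertion Jason suggests, keeping the
sanity check only when the selftest config is enabled:]

	if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
	    WARN_ON_ONCE(!viommu->ops || !viommu->ops->supports_veventq ||
			 !viommu->ops->supports_veventq(type)))
		return -EOPNOTSUPP;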
* Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-10 17:48 ` Jason Gunthorpe
@ 2025-01-10 19:27 ` Nicolin Chen
2025-01-10 19:49 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-10 19:27 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 10, 2025 at 01:48:42PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 07, 2025 at 09:10:09AM -0800, Nicolin Chen wrote:
>
> > +static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
> > + char __user *buf, size_t count,
> > + loff_t *ppos)
> > +{
> > + size_t done = 0;
> > + int rc = 0;
> > +
> > + if (*ppos)
> > + return -ESPIPE;
> > +
> > + mutex_lock(&eventq->mutex);
> > + while (!list_empty(&eventq->deliver) && count > done) {
> > + struct iommufd_vevent *cur = list_first_entry(
> > + &eventq->deliver, struct iommufd_vevent, node);
> > +
> > + if (cur->data_len > count - done)
> > + break;
> > +
> > + if (copy_to_user(buf + done, cur->event_data, cur->data_len)) {
> > + rc = -EFAULT;
> > + break;
> > + }
>
> Now that I look at this more closely, the fault path this is copied
> from is not great.
>
> This copy_to_user() can block while waiting on a page fault, possibly
> for a long time. While blocked the mutex is held and we can't add more
> entries to the list.
>
> That will cause the shared IRQ handler in the iommu driver to back up,
> which would cause a global DOS.
>
> This probably wants to be organized to look more like
>
> while (itm = eventq_get_next_item(eventq)) {
> if (..) {
> eventq_restore_failed_item(eventq);
> return -1;
> }
> }
>
> Where the next_item would just be a simple spinlock across the linked
> list manipulation.
Would it be simpler to just limit one node per read(), i.e.
no "while (!list_empty)" and no blocking?
The report() adds one node at a time, and wakes up the poll()
each time it adds a node. And user space could read one event
at a time too?
Thanks
Nicolin
* Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-10 19:27 ` Nicolin Chen
@ 2025-01-10 19:49 ` Jason Gunthorpe
2025-01-10 21:58 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 19:49 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 10, 2025 at 11:27:53AM -0800, Nicolin Chen wrote:
> On Fri, Jan 10, 2025 at 01:48:42PM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 07, 2025 at 09:10:09AM -0800, Nicolin Chen wrote:
> >
> > > +static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
> > > + char __user *buf, size_t count,
> > > + loff_t *ppos)
> > > +{
> > > + size_t done = 0;
> > > + int rc = 0;
> > > +
> > > + if (*ppos)
> > > + return -ESPIPE;
> > > +
> > > + mutex_lock(&eventq->mutex);
> > > + while (!list_empty(&eventq->deliver) && count > done) {
> > > + struct iommufd_vevent *cur = list_first_entry(
> > > + &eventq->deliver, struct iommufd_vevent, node);
> > > +
> > > + if (cur->data_len > count - done)
> > > + break;
> > > +
> > > + if (copy_to_user(buf + done, cur->event_data, cur->data_len)) {
> > > + rc = -EFAULT;
> > > + break;
> > > + }
> >
> > Now that I look at this more closely, the fault path this is copied
> > from is not great.
> >
> > This copy_to_user() can block while waiting on a page fault, possibly
> > for a long time. While blocked the mutex is held and we can't add more
> > entries to the list.
> >
> > That will cause the shared IRQ handler in the iommu driver to back up,
> > which would cause a global DOS.
> >
> > This probably wants to be organized to look more like
> >
> > while (itm = eventq_get_next_item(eventq)) {
> > if (..) {
> > eventq_restore_failed_item(eventq);
> > return -1;
> > }
> > }
> >
> > Where the next_item would just be a simple spinlock across the linked
> > list manipulation.
>
> Would it be simpler to just limit one node per read(), i.e.
> no "while (!list_empty)" and no blocking?
>
> The report() adds one node at a time, and wakes up the poll()
> each time it adds a node. And user space could read one event
> at a time too?
That doesn't really help; the issue is that it holds the lock over the
copy_to_user(), which it is doing because it doesn't want to pull the item off
the list and then try to handle the failure and put it back.
Jason
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-10 18:38 ` Nicolin Chen
@ 2025-01-10 19:51 ` Jason Gunthorpe
2025-01-10 19:56 ` Nicolin Chen
2025-01-13 5:37 ` Nicolin Chen
0 siblings, 2 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-10 19:51 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 10, 2025 at 10:38:42AM -0800, Nicolin Chen wrote:
> > The virtual event queue should behave the same as if the physical
> > event queue overflows, and that logic should be in the smmu driver -
> > this should return some Exxx to indicate the queue is filled.
>
> Hmm, the driver only screams...
>
> static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
> {
> [...]
> /*
> * Not much we can do on overflow, so scream and pretend we're
> * trying harder.
> */
> if (queue_sync_prod_in(q) == -EOVERFLOW)
> dev_err(smmu->dev, "EVTQ overflow detected -- events lost\n");
Well it must know from the HW somehow that the overflow has happened??
> > I suppose we will need a way to indicate lost events to userspace on
> > top of this?
>
> Perhaps another u32 flag in the arm_smmuv3_vevent struct to report
> > an overflow. That said, what will userspace/VMM need to do with it?
Trigger the above code in the VM somehow?
Jason
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-10 19:51 ` Jason Gunthorpe
@ 2025-01-10 19:56 ` Nicolin Chen
2025-01-13 5:37 ` Nicolin Chen
1 sibling, 0 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-10 19:56 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 10, 2025 at 03:51:14PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 10:38:42AM -0800, Nicolin Chen wrote:
> > > The virtual event queue should behave the same as if the physical
> > > event queue overflows, and that logic should be in the smmu driver -
> > > this should return some Exxx to indicate the queue is filled.
> >
> > Hmm, the driver only screams...
> >
> > static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
> > {
> > [...]
> > /*
> > * Not much we can do on overflow, so scream and pretend we're
> > * trying harder.
> > */
> > if (queue_sync_prod_in(q) == -EOVERFLOW)
> > dev_err(smmu->dev, "EVTQ overflow detected -- events lost\n");
>
> Well it must know from the HW somehow that the overflow has happened??
>
> > > I suppose we will need a way to indicate lost events to userspace on
> > > top of this?
> >
> > Perhaps another u32 flag in the arm_smmuv3_vevent struct to report
> > > an overflow. That said, what will userspace/VMM need to do with it?
>
> Trigger the above code in the VM somehow?
Oh, I see. I misunderstood somehow..
Thanks
Nicolin
* Re: [PATCH v5 04/14] iommufd: Abstract an iommufd_eventq from iommufd_fault
2025-01-10 17:26 ` Jason Gunthorpe
@ 2025-01-10 20:49 ` Nicolin Chen
0 siblings, 0 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-10 20:49 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 10, 2025 at 01:26:37PM -0400, Jason Gunthorpe wrote:
> +#define INIT_EVENTQ_FOPS(read_op, write_op) \
> + (struct file_operations){ \
> + .owner = THIS_MODULE, \
> + .open = nonseekable_open, \
> + .read = read_op, \
> + .write = write_op, \
> + .poll = iommufd_eventq_fops_poll, \
> + .release = iommufd_eventq_fops_release, \
> + }
checkpatch complained with an ERROR, so I changed it a bit
and squashed it into the previous patch adding iommufd_fault_init:
+#define INIT_FAULT_FOPS(read_op, write_op) \
+ ((const struct file_operations){ \
+ .owner = THIS_MODULE, \
+ .open = nonseekable_open, \
+ .read = read_op, \
+ .write = write_op, \
+ .poll = iommufd_fault_fops_poll, \
+ .release = iommufd_fault_fops_release, \
+ })
Thanks
Nicolin
* Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-10 7:06 ` Tian, Kevin
@ 2025-01-10 21:29 ` Nicolin Chen
2025-01-13 2:52 ` Tian, Kevin
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-10 21:29 UTC (permalink / raw)
To: Tian, Kevin
Cc: jgg@nvidia.com, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
On Fri, Jan 10, 2025 at 07:06:49AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, January 8, 2025 1:10 AM
> > +
> > +int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd)
> > +{
> > + struct iommu_veventq_alloc *cmd = ucmd->cmd;
> > + struct iommufd_veventq *veventq;
> > + struct iommufd_viommu *viommu;
> > + int fdno;
> > + int rc;
> > +
> > + if (cmd->flags || cmd->type == IOMMU_VEVENTQ_TYPE_DEFAULT)
> > + return -EOPNOTSUPP;
> > +
> > + viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> > + if (IS_ERR(viommu))
> > + return PTR_ERR(viommu);
> > +
> > + if (!viommu->ops || !viommu->ops->supports_veventq ||
> > + !viommu->ops->supports_veventq(cmd->type))
> > + return -EOPNOTSUPP;
> > +
>
> I'm not sure about the necessity of the above check. The event queue
> is just a software struct with a user-specified format for the iommu
> driver to report viommu events. The struct itself is not constrained
> by the hardware capability, though I'm not sure of a real usage in
> which a smmu driver wants to report a vtd event. But legitimately
> a user can create any type of event queue which might just
> never be used.
Allowing a random type that a driver will never use for reporting
doesn't sound like it makes a lot of sense to me...
That being said, yea..I guess we could drop the limit here, since
it isn't going to break anything?
> It sounds clearer to do the check when IOPF cap is actually enabled
> on a device contained in the viommu. At that point check whether
> a required type eventqueue has been created. If not then fail the
> iopf enabling.
Hmm, isn't IOPF a different channel?
And a vEVENTQ is per vIOMMU, not necessarily per vDEVICE/device..
> Then it probably reveals another todo in this series. Seems you still
> let the smmu driver statically enable iopf when probing the device.
> Sounds like iommufd_viommu_alloc_hwpt_nested() may accept
> IOMMU_HWPT_FAULT_ID_VALID to refer to an event queue and
> later dynamically enable/disable iopf when attaching a device to the
> hwpt and check the event queue type there. Just like how the fault
> object is handled.
You've lost me here :-/
Thanks
Nicolin
* Re: [PATCH v5 07/14] iommufd/viommu: Add iommufd_viommu_get_vdev_id helper
2025-01-10 7:07 ` Tian, Kevin
@ 2025-01-10 21:35 ` Nicolin Chen
0 siblings, 0 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-10 21:35 UTC (permalink / raw)
To: Tian, Kevin
Cc: jgg@nvidia.com, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
On Fri, Jan 10, 2025 at 07:07:36AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, January 8, 2025 1:10 AM
> > +
> > + xa_lock(&viommu->vdevs);
> > + xa_for_each(&viommu->vdevs, index, vdev) {
> > + if (vdev && vdev->dev == dev) {
> > + vdev_id = (unsigned long)vdev->id;
> > + break;
> > + }
> > + }
>
> Nit. No need to check vdev pointer as commented in last review.
Oh, I missed that. Will drop it.
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Thanks
Nicolin
* Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-10 19:49 ` Jason Gunthorpe
@ 2025-01-10 21:58 ` Nicolin Chen
2025-01-13 19:12 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-10 21:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 10, 2025 at 03:49:50PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 11:27:53AM -0800, Nicolin Chen wrote:
> > On Fri, Jan 10, 2025 at 01:48:42PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 07, 2025 at 09:10:09AM -0800, Nicolin Chen wrote:
> > >
> > > > +static ssize_t iommufd_veventq_fops_read(struct iommufd_eventq *eventq,
> > > > + char __user *buf, size_t count,
> > > > + loff_t *ppos)
> > > > +{
> > > > + size_t done = 0;
> > > > + int rc = 0;
> > > > +
> > > > + if (*ppos)
> > > > + return -ESPIPE;
> > > > +
> > > > + mutex_lock(&eventq->mutex);
> > > > + while (!list_empty(&eventq->deliver) && count > done) {
> > > > + struct iommufd_vevent *cur = list_first_entry(
> > > > + &eventq->deliver, struct iommufd_vevent, node);
> > > > +
> > > > + if (cur->data_len > count - done)
> > > > + break;
> > > > +
> > > > + if (copy_to_user(buf + done, cur->event_data, cur->data_len)) {
> > > > + rc = -EFAULT;
> > > > + break;
> > > > + }
> > >
> > > Now that I look at this more closely, the fault path this is copied
> > > from is not great.
> > >
> > > This copy_to_user() can block while waiting on a page fault, possibly
> > > for a long time. While blocked the mutex is held and we can't add more
> > > entries to the list.
> > >
> > > That will cause the shared IRQ handler in the iommu driver to back up,
> > > which would cause a global DOS.
> > >
> > > This probably wants to be organized to look more like
> > >
> > > while (itm = eventq_get_next_item(eventq)) {
> > > if (..) {
> > > eventq_restore_failed_item(eventq);
> > > return -1;
> > > }
> > > }
> > >
> > > Where the next_item would just be a simple spinlock across the linked
> > > list manipulation.
> >
> > Would it be simpler to just limit one node per read(), i.e.
> > no "while (!list_empty)" and no blocking?
> >
> > The report() adds one node at a time, and wakes up the poll()
> > each time it adds a node. And user space could read one event
> > at a time too?
>
> That doesn't really help; the issue is that it holds the lock over the
> copy_to_user(), which it is doing because it doesn't want to pull the item off
> the list and then try to handle the failure and put it back.
Hmm, it seems that I haven't got your first narrative straight..
Would you mind elaborating on "copy_to_user() can block while waiting
on a page fault"? When would this happen?
Thanks
Nicolin
* RE: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-10 21:29 ` Nicolin Chen
@ 2025-01-13 2:52 ` Tian, Kevin
2025-01-13 4:51 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Tian, Kevin @ 2025-01-13 2:52 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg@nvidia.com, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, January 11, 2025 5:29 AM
>
> On Fri, Jan 10, 2025 at 07:06:49AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, January 8, 2025 1:10 AM
> > > +
> > > +int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd)
> > > +{
> > > + struct iommu_veventq_alloc *cmd = ucmd->cmd;
> > > + struct iommufd_veventq *veventq;
> > > + struct iommufd_viommu *viommu;
> > > + int fdno;
> > > + int rc;
> > > +
> > > + if (cmd->flags || cmd->type == IOMMU_VEVENTQ_TYPE_DEFAULT)
> > > + return -EOPNOTSUPP;
> > > +
> > > + viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> > > + if (IS_ERR(viommu))
> > > + return PTR_ERR(viommu);
> > > +
> > > + if (!viommu->ops || !viommu->ops->supports_veventq ||
> > > + !viommu->ops->supports_veventq(cmd->type))
> > > + return -EOPNOTSUPP;
> > > +
> >
> > I'm not sure about the necessity of the above check. The event queue
> > is just a software struct with a user-specified format for the iommu
> > driver to report viommu events. The struct itself is not constrained
> > by the hardware capability, though I'm not sure of a real usage in
> > which a smmu driver wants to report a vtd event. But legitimately
> > a user can create any type of event queue which might just
> > never be used.
>
> Allowing a random type that a driver will never use for reporting
> doesn't sound like it makes a lot of sense to me...
>
> That being said, yea..I guess we could drop the limit here, since
> it isn't going to break anything?
>
> > It sounds clearer to do the check when IOPF cap is actually enabled
> > on a device contained in the viommu. At that point check whether
> > a required type eventqueue has been created. If not then fail the
> > iopf enabling.
>
> Hmm, isn't IOPF a different channel?
We have a fault queue for delivering IOPF on hwpt, when vIOMMU is
not involved
Now with vIOMMU my understanding was that all events including
IOPF are delivered via the event queue in the vIOMMU. Just echoed
by the documentation patch:
+- IOMMUFD_OBJ_VEVENTQ, representing a software queue for a vIOMMU to report its
+ events such as translation faults occurred to a nested stage-1 and HW-specific
+ events.
>
> And a vEVENTQ is per vIOMMU, not necessarily per vDEVICE/device..
Yes. My point was to verify whether the vEVENTQ type is compatible when
a nested faultable hwpt is created with vIOMMU as the parent. Then when
attaching a device to the nested hwpt we dynamically turn on PRI on the
device just like how it's handled in the fault queue path.
>
> > > Then it probably reveals another todo in this series. Seems you still
> > let the smmu driver statically enable iopf when probing the device.
> > Sounds like iommufd_viommu_alloc_hwpt_nested() may accept
> > > IOMMU_HWPT_FAULT_ID_VALID to refer to an event queue and
> > later dynamically enable/disable iopf when attaching a device to the
> > hwpt and check the event queue type there. Just like how the fault
> > object is handled.
>
> You've lost me here :-/
>
Hope the above explanation makes my point clearer. Then for a nested
hwpt created within a vIOMMU there is an open question whether we want
a per-hwpt option to mark whether it allows faults, or assume that
every nested hwpt (and the devices attached to it) must be faultable
once any vEVENTQ is created in the vIOMMU.
* Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-13 2:52 ` Tian, Kevin
@ 2025-01-13 4:51 ` Nicolin Chen
2025-01-13 8:17 ` Tian, Kevin
2025-01-13 19:10 ` Jason Gunthorpe
0 siblings, 2 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-13 4:51 UTC (permalink / raw)
To: Tian, Kevin
Cc: jgg@nvidia.com, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
On Mon, Jan 13, 2025 at 02:52:32AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, January 11, 2025 5:29 AM
> >
> > On Fri, Jan 10, 2025 at 07:06:49AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Wednesday, January 8, 2025 1:10 AM
> > > > +
> > > > +int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd)
> > > > +{
> > > > + struct iommu_veventq_alloc *cmd = ucmd->cmd;
> > > > + struct iommufd_veventq *veventq;
> > > > + struct iommufd_viommu *viommu;
> > > > + int fdno;
> > > > + int rc;
> > > > +
> > > > + if (cmd->flags || cmd->type == IOMMU_VEVENTQ_TYPE_DEFAULT)
> > > > + return -EOPNOTSUPP;
> > > > +
> > > > + viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> > > > + if (IS_ERR(viommu))
> > > > + return PTR_ERR(viommu);
> > > > +
> > > > + if (!viommu->ops || !viommu->ops->supports_veventq ||
> > > > + !viommu->ops->supports_veventq(cmd->type))
> > > > + return -EOPNOTSUPP;
> > > > +
> > >
> > > I'm not sure about the necessity of the above check. The event queue
> > > is just a software struct with a user-specified format for the iommu
> > > driver to report viommu events. The struct itself is not constrained
> > > by the hardware capability, though I'm not sure of a real usage in
> > > which a smmu driver wants to report a vtd event. But legitimately
> > > a user can create any type of event queue which might just
> > > never be used.
> >
> > Allowing a random type that a driver will never use for reporting
> > doesn't sound like it makes a lot of sense to me...
> >
> > That being said, yea..I guess we could drop the limit here, since
> > it isn't going to break anything?
> >
> > > It sounds clearer to do the check when IOPF cap is actually enabled
> > > on a device contained in the viommu. At that point check whether
> > > a required type eventqueue has been created. If not then fail the
> > > iopf enabling.
> >
> > Hmm, isn't IOPF a different channel?
>
> We have a fault queue for delivering IOPF on hwpt, when vIOMMU is
> not involved
>
> Now with vIOMMU my understanding was that all events including
> IOPF are delivered via the event queue in the vIOMMU. Just echoed
> by the documentation patch:
>
> +- IOMMUFD_OBJ_VEVENTQ, representing a software queue for a vIOMMU to report its
> + events such as translation faults occurred to a nested stage-1 and HW-specific
> + events.
Oh, looks like that line misled you.. It should be the non-PRI type
of fault, e.g. a stage-1 DMA translation error should be forwarded
to the guest. I can make it clearer.
> >
> > And a vEVENTQ is per vIOMMU, not necessarily per vDEVICE/device..
>
> Yes. My point was to verify whether the vEVENTQ type is compatible when
> a nested faultable hwpt is created with vIOMMU as the parent. Then when
> attaching a device to the nested hwpt we dynamically turn on PRI on the
> device just like how it's handled in the fault queue path.
We will still have the fault queue:
if (error is handled by PRI)
report via fault queue; // need response
else if (error is handled by vEVENTQ)
report via vEVENTQ; // no need of response
else
dump unhandled faults;
> > > Then it probably reveals another todo in this series. Seems you still
> > > let the smmu driver statically enable iopf when probing the device.
> > > Sounds like iommufd_viommu_alloc_hwpt_nested() may accept
> > > IOMMU_HWPT_FAULT_ID_VALID to refer to an event queue and
> > > later dynamically enable/disable iopf when attaching a device to the
> > > hwpt and check the event queue type there. Just like how the fault
> > > object is handled.
> >
> > You've lost me here :-/
> >
>
> Hope the above explanation makes my point clearer. Then for a nested
> hwpt created within a vIOMMU there is an open question whether we want
> a per-hwpt option to mark whether it allows faults, or assume that
> every nested hwpt (and the devices attached to it) must be faultable
> once any vEVENTQ is created in the vIOMMU.
A vIOMMU-based nested HWPT should still enable IOPF via the flag
IOMMU_HWPT_FAULT_ID_VALID.
Thanks
Nicolin
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-10 19:51 ` Jason Gunthorpe
2025-01-10 19:56 ` Nicolin Chen
@ 2025-01-13 5:37 ` Nicolin Chen
2025-01-13 19:21 ` Jason Gunthorpe
1 sibling, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-13 5:37 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 10, 2025 at 03:51:14PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 10:38:42AM -0800, Nicolin Chen wrote:
> > > The virtual event queue should behave the same as if the physical
> > > event queue overflows, and that logic should be in the smmu driver -
> > > this should return some Exxx to indicate the queue is filled.
> >
> > Hmm, the driver only screams...
> >
> > static irqreturn_t arm_smmu_evtq_thread(int irq, void *dev)
> > {
> > [...]
> > /*
> > * Not much we can do on overflow, so scream and pretend we're
> > * trying harder.
> > */
> > if (queue_sync_prod_in(q) == -EOVERFLOW)
> > dev_err(smmu->dev, "EVTQ overflow detected -- events lost\n");
>
> Well it must know from the HW somehow that the overflow has happened??
>
> > > I supposed we will need a way to indicate lost events to userspace on
> > > top of this?
> >
> > Perhaps another u32 flag in the arm_smmuv3_vevent struct to report
> > an overflow. That said, what userspace/VMM will need to do with it?
>
> Trigger the above code in the VM somehow?
I found two ways of forwarding an overflow flag:
1. Return -EOVERFLOW to read(). But then it cannot return the number
of bytes read any more:
--------------------------------------------------
@@ -95,2 +95,3 @@ int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
if (atomic_read(&veventq->num_events) == veventq->depth) {
+ set_bit(IOMMUFD_VEVENTQ_ERROR_OVERFLOW, veventq->errors);
rc = -EOVERFLOW;
[..]
@@ -386,2 +386,5 @@ static ssize_t iommufd_veventq_fops_read(struct file *filep, char __user *buf,
+ if (test_bit(IOMMUFD_VEVENTQ_ERROR_OVERFLOW, veventq->errors))
+ rc = -EOVERFLOW;
+
mutex_lock(&eventq->mutex);
@@ -398,2 +401,3 @@ static ssize_t iommufd_veventq_fops_read(struct file *filep, char __user *buf,
}
+ clear_bit(IOMMUFD_VEVENTQ_ERROR_OVERFLOW, veventq->errors);
atomic_dec(&veventq->num_events);
@@ -405,2 +409,4 @@ static ssize_t iommufd_veventq_fops_read(struct file *filep, char __user *buf,
+ if (rc == -EOVERFLOW)
+ return rc;
return done == 0 ? rc : done;
[..]
@@ -554,2 +554,4 @@ struct iommufd_veventq {
atomic_t num_events;
+#define IOMMUFD_VEVENTQ_ERROR_OVERFLOW 0
+ DECLARE_BITMAP(errors, 32);
};
--------------------------------------------------
2. Return EPOLLERR via pollfd.revents. But it cannot use POLLERR
for other errors any more:
--------------------------------------------------
@@ -420,2 +421,4 @@ static __poll_t iommufd_eventq_fops_poll(struct file *filep,
poll_wait(filep, &eventq->wait_queue, wait);
+ if (test_bit(IOMMUFD_VEVENTQ_ERROR_OVERFLOW, veventq->errors))
+ return EPOLLERR;
mutex_lock(&eventq->mutex);
[..]
@@ -1001,2 +1001,5 @@ static int _test_cmd_trigger_vevent(int fd, __u32 dev_id, __u32 event_fd,
+ if (pollfd.revents & POLLERR)
+ return -1;
+
return event.virt_id == virt_id ? 0 : -EINVAL;
--------------------------------------------------
It feels like returning it at read() might be slightly nicer?
Thanks
Nicolin
* RE: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-13 4:51 ` Nicolin Chen
@ 2025-01-13 8:17 ` Tian, Kevin
2025-01-13 19:10 ` Jason Gunthorpe
1 sibling, 0 replies; 77+ messages in thread
From: Tian, Kevin @ 2025-01-13 8:17 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg@nvidia.com, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, January 13, 2025 12:51 PM
>
> On Mon, Jan 13, 2025 at 02:52:32AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Saturday, January 11, 2025 5:29 AM
> > >
> > > On Fri, Jan 10, 2025 at 07:06:49AM +0000, Tian, Kevin wrote:
> > > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > > Sent: Wednesday, January 8, 2025 1:10 AM
> > > > > +
> > > > > +int iommufd_veventq_alloc(struct iommufd_ucmd *ucmd)
> > > > > +{
> > > > > + struct iommu_veventq_alloc *cmd = ucmd->cmd;
> > > > > + struct iommufd_veventq *veventq;
> > > > > + struct iommufd_viommu *viommu;
> > > > > + int fdno;
> > > > > + int rc;
> > > > > +
> > > > > + if (cmd->flags || cmd->type ==
> IOMMU_VEVENTQ_TYPE_DEFAULT)
> > > > > + return -EOPNOTSUPP;
> > > > > +
> > > > > + viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> > > > > + if (IS_ERR(viommu))
> > > > > + return PTR_ERR(viommu);
> > > > > +
> > > > > + if (!viommu->ops || !viommu->ops->supports_veventq ||
> > > > > + !viommu->ops->supports_veventq(cmd->type))
> > > > > + return -EOPNOTSUPP;
> > > > > +
> > > >
> > > > I'm not sure about the necessity of the above check. The event queue
> > > > is just a software struct with a user-specified format for the iommu
> > > > driver to report viommu events. The struct itself is not constrained
> > > > by the hardware capability, though I'm not sure of a real usage in
> > > > which a smmu driver wants to report a vtd event. But legitimately
> > > > a user can create any type of event queue which might just
> > > > never be used.
> > >
> > > Allowing a random type that a driver will never use for reporting
> > > doesn't sound like it makes a lot of sense to me...
> > >
> > > That being said, yea..I guess we could drop the limit here, since
> > > it isn't going to break anything?
> > >
> > > > It sounds clearer to do the check when IOPF cap is actually enabled
> > > > on a device contained in the viommu. At that point check whether
> > > > a required type eventqueue has been created. If not then fail the
> > > > iopf enabling.
> > >
> > > Hmm, isn't IOPF a different channel?
> >
> > We have a fault queue for delivering IOPF on hwpt, when vIOMMU is
> > not involved
> >
> > Now with vIOMMU my understanding was that all events including
> > IOPF are delivered via the event queue in the vIOMMU. Just echoed
> > by the documentation patch:
> >
> > +- IOMMUFD_OBJ_VEVENTQ, representing a software queue for a
> vIOMMU to report its
> > + events such as translation faults occurred to a nested stage-1 and HW-
> specific
> > + events.
>
> Oh, looks like that line misled you.. It should be the non-PRI type
> of fault, e.g. a stage-1 DMA translation error should be forwarded
> to the guest. I can make it clearer.
>
> > >
> > > And a vEVENTQ is per vIOMMU, not necessarily per vDEVICE/device..
> >
> > Yes. My point was to verify whether the vEVENTQ type is compatible when
> > a nested faultable hwpt is created with vIOMMU as the parent. Then when
> > attaching a device to the nested hwpt we dynamically turn on PRI on the
> > device just like how it's handled in the fault queue path.
>
> We will still have the fault queue:
> if (error is handled by PRI)
> report via fault queue; // need response
> else if (error is handled by vEVENTQ)
> report via vEVENTQ; // no need of response
> else
> dump unhandled faults;
>
> > > > Then it probably reveals another todo in this series. Seems you still
> > > > let the smmu driver statically enable iopf when probing the device.
> > > > Sounds like iommufd_viommu_alloc_hwpt_nested() may accept
> > > > IOMMU_HWPT_FAULT_ID_VALID to refer to an event queue and
> > > > later dynamically enable/disable iopf when attaching a device to the
> > > > hwpt and check the event queue type there. Just like how the fault
> > > > object is handled.
> > >
> > > You've lost me here :-/
> > >
> >
> > Hope the above explanation makes my point clearer. Then for a nested
> > hwpt created within a vIOMMU there is an open question whether we want
> > a per-hwpt option to mark whether it allows faults, or assume that
> > every nested hwpt (and the devices attached to it) must be faultable
> > once any vEVENTQ is created in the vIOMMU.
>
> A vIOMMU-based nested HWPT should still enable IOPF via the flag
> IOMMU_HWPT_FAULT_ID_VALID.
>
Thanks for the clarification. Clearly I got it mixed up. 😊
* Re: [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU
2025-01-09 11:04 ` kernel test robot
@ 2025-01-13 19:01 ` Nicolin Chen
2025-01-13 19:06 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-13 19:01 UTC (permalink / raw)
To: will
Cc: jgg, kevin.tian, corbet, oe-kbuild-all, joro,
suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
On Thu, Jan 09, 2025 at 07:04:10PM +0800, kernel test robot wrote:
> sparse warnings: (new ones prefixed by >>)
> >> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:461:21: sparse: sparse: invalid assignment: &=
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:461:21: sparse: left side has type restricted __le64
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:461:21: sparse: right side has type unsigned long long
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:462:21: sparse: sparse: invalid assignment: |=
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:462:21: sparse: left side has type restricted __le64
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:462:21: sparse: right side has type unsigned long long
> >> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:464:23: sparse: sparse: cast from restricted __le64
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c:465:23: sparse: sparse: cast from restricted __le64
I fixed these with the following:
-----------------------------------------------------------------------------------
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 0c7a5894ba07..aa453e842a39 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -457,20 +457,28 @@ struct iommufd_viommu *arm_vsmmu_alloc(struct device *dev,
return &vsmmu->core;
}
+/* This is basically iommu_vevent_arm_smmuv3 in u64 for conversion */
+struct arm_vsmmu_evt {
+ union {
+ u64 evt[EVTQ_ENT_DWORDS];
+ struct iommu_vevent_arm_smmuv3 uevt;
+ };
+};
+
int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt)
{
- struct iommu_vevent_arm_smmuv3 vevt =
- *(struct iommu_vevent_arm_smmuv3 *)evt;
+ struct arm_vsmmu_evt *vevt = (struct arm_vsmmu_evt *)evt;
+ int i;
- vevt.evt[0] &= ~EVTQ_0_SID;
- vevt.evt[0] |= FIELD_PREP(EVTQ_0_SID, vmaster->vsid);
+ vevt->evt[0] &= ~EVTQ_0_SID;
+ vevt->evt[0] |= FIELD_PREP(EVTQ_0_SID, vmaster->vsid);
- vevt.evt[0] = cpu_to_le64(vevt.evt[0]);
- vevt.evt[1] = cpu_to_le64(vevt.evt[1]);
+ for (i = 0; i < EVTQ_ENT_DWORDS; i++)
+ vevt->uevt.evt[i] = cpu_to_le64(vevt->evt[i]);
return iommufd_viommu_report_event(&vmaster->vsmmu->core,
- IOMMU_VEVENTQ_TYPE_ARM_SMMUV3, &vevt,
- sizeof(vevt));
+ IOMMU_VEVENTQ_TYPE_ARM_SMMUV3,
+ &vevt->uevt, sizeof(vevt->uevt));
}
MODULE_IMPORT_NS("IOMMUFD");
-----------------------------------------------------------------------------------
This also fixes the array size from 2 to EVTQ_ENT_DWORDS (4).
@Will,
Would it be possible for you to ack or review the two SMMU patches
in this series? I am still hoping Jason might be able to take this
after I respin a v6.
Thanks!
Nicolin
* Re: [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU
2025-01-13 19:01 ` Nicolin Chen
@ 2025-01-13 19:06 ` Jason Gunthorpe
2025-01-13 19:15 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-13 19:06 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, kevin.tian, corbet, oe-kbuild-all, joro,
suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
On Mon, Jan 13, 2025 at 11:01:10AM -0800, Nicolin Chen wrote:
> +/* This is basically iommu_vevent_arm_smmuv3 in u64 for conversion */
> +struct arm_vsmmu_evt {
> + union {
> + u64 evt[EVTQ_ENT_DWORDS];
> + struct iommu_vevent_arm_smmuv3 uevt;
> + };
> +};
This doesn't seem right, don't make unions like this
> int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt)
> {
> - struct iommu_vevent_arm_smmuv3 vevt =
> - *(struct iommu_vevent_arm_smmuv3 *)evt;
evt is clearly not an iommu_vevent_arm_smmuv3 since it has the wrong
endianness? It should stay in its own type.
struct iommu_vevent_arm_smmuv3 uevt;
uevt.evt[0] = cpu_to_le64((evt[0] & ~EVTQ_0_SID) | FIELD_PREP(EVTQ_0_SID, vmaster->vsid));
for (i = 1; i != EVTQ_ENT_DWORDS; i++)
	uevt.evt[i] = cpu_to_le64(evt[i]);
Jason
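[For reference, a sketch of the whole helper with this conversion
applied, assuming struct iommu_vevent_arm_smmuv3 carries
EVTQ_ENT_DWORDS __le64 words as discussed above:]

	int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt)
	{
		struct iommu_vevent_arm_smmuv3 uevt;
		int i;

		/* Swap the physical SID for the virtual SID in word 0 */
		uevt.evt[0] = cpu_to_le64((evt[0] & ~EVTQ_0_SID) |
					  FIELD_PREP(EVTQ_0_SID, vmaster->vsid));
		for (i = 1; i != EVTQ_ENT_DWORDS; i++)
			uevt.evt[i] = cpu_to_le64(evt[i]);

		return iommufd_viommu_report_event(&vmaster->vsmmu->core,
						   IOMMU_VEVENTQ_TYPE_ARM_SMMUV3,
						   &uevt, sizeof(uevt));
	}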
* Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-13 4:51 ` Nicolin Chen
2025-01-13 8:17 ` Tian, Kevin
@ 2025-01-13 19:10 ` Jason Gunthorpe
1 sibling, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-13 19:10 UTC (permalink / raw)
To: Nicolin Chen
Cc: Tian, Kevin, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
On Sun, Jan 12, 2025 at 08:51:25PM -0800, Nicolin Chen wrote:
> We will still have the fault queue:
> if (error is handled by PRI)
> report via fault queue; // need response
This is an important point: we can't generate a response for the PRI
unless the fault queue is used, so even though the vEVENTQ could carry
the PRI event, it must not be done. If there is no fault handler
available, the kernel must NAK the PRI event internally, never send it
to the vEVENTQ.
Jason
* Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-10 21:58 ` Nicolin Chen
@ 2025-01-13 19:12 ` Jason Gunthorpe
2025-01-13 19:18 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-13 19:12 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 10, 2025 at 01:58:21PM -0800, Nicolin Chen wrote:
> Hmm, it seems that I haven't got your first narrative straight..
>
> Would you mind elaborating on "copy_to_user() can block while waiting
> on a page fault"? When would this happen?
copy_to_user() is a sleeping function that sleeps if the user memory
is non-present. So userspace can cause copy_to_user to copy to
anything, including memory that is non-present and will take a long
time to page fault in. E.g. perhaps by abusing userfaultfd.
We should not allow userspace to globally DOS the iommu driver this
way.
So do not hold locks that are also held by the HW event path across
copy_to_user().
Jason
* Re: [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU
2025-01-13 19:06 ` Jason Gunthorpe
@ 2025-01-13 19:15 ` Nicolin Chen
2025-01-13 19:18 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-13 19:15 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: will, kevin.tian, corbet, oe-kbuild-all, joro,
suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
On Mon, Jan 13, 2025 at 03:06:41PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 13, 2025 at 11:01:10AM -0800, Nicolin Chen wrote:
>
> > +/* This is basically iommu_vevent_arm_smmuv3 in u64 for conversion */
> > +struct arm_vsmmu_evt {
> > + union {
> > + u64 evt[EVTQ_ENT_DWORDS];
> > + struct iommu_vevent_arm_smmuv3 uevt;
> > + };
> > +};
>
> This doesn't seem right, don't make unions like this
This is copied from the invalidate union though... Any reason why
this is no longer good for evt vs. cmd?
> > int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt)
> > {
> > - struct iommu_vevent_arm_smmuv3 vevt =
> > - *(struct iommu_vevent_arm_smmuv3 *)evt;
>
> evt is clearly not a iommu_vevent_arm_smmuv3 since it has the wrong
> endianess? It should stay in its own type.
>
> struct struct iommu_vevent_arm_smmuv3 uevt;
>
> uet.evt[0] = cpu_to_le64((evt[0] & ~EVTQ_0_SID) | FIELD_PREP(EVTQ_0_SID, vmaster->vsid));
> for (i = 1; i != EVTQ_ENT_DWORDS; i++)
> uet.evt[i] = cpu_to_le64(evt[i]);
I will use this.
Thanks!
Nicolin
* Re: [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU
2025-01-13 19:15 ` Nicolin Chen
@ 2025-01-13 19:18 ` Jason Gunthorpe
0 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-13 19:18 UTC (permalink / raw)
To: Nicolin Chen
Cc: will, kevin.tian, corbet, oe-kbuild-all, joro,
suravee.suthikulpanit, robin.murphy, dwmw2, baolu.lu, shuah,
linux-kernel, iommu, linux-arm-kernel, linux-kselftest, linux-doc,
eric.auger, jean-philippe, mdf, mshavit,
shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu, patches
On Mon, Jan 13, 2025 at 11:15:21AM -0800, Nicolin Chen wrote:
> On Mon, Jan 13, 2025 at 03:06:41PM -0400, Jason Gunthorpe wrote:
> > On Mon, Jan 13, 2025 at 11:01:10AM -0800, Nicolin Chen wrote:
> >
> > > +/* This is basically iommu_vevent_arm_smmuv3 in u64 for conversion */
> > > +struct arm_vsmmu_evt {
> > > + union {
> > > + u64 evt[EVTQ_ENT_DWORDS];
> > > + struct iommu_vevent_arm_smmuv3 uevt;
> > > + };
> > > +};
> >
> > This doesn't seem right, don't make unions like this
>
> This is copied from the invalidate union though... Any reason why
> this is no longer good for evt vs. cmd?
In that case it was trying to be really optimal and copy in place;
here we don't care so much..
Jason
* Re: [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC
2025-01-13 19:12 ` Jason Gunthorpe
@ 2025-01-13 19:18 ` Nicolin Chen
0 siblings, 0 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-13 19:18 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Mon, Jan 13, 2025 at 03:12:25PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 10, 2025 at 01:58:21PM -0800, Nicolin Chen wrote:
> > Hmm, it seems that I haven't got your first narrative straight..
> >
> > Would you mind elaborating on "copy_to_user() can block while waiting
> > on a page fault"? When would this happen?
>
> copy_to_user() is a sleeping function that sleeps if the user memory
> is non-present. So userspace can cause copy_to_user to copy to
> anything, including memory that is non-present and will take a long
> time to page fault in. E.g. perhaps by abusing userfaultfd.
>
> We should not allow userspace to globally DOS the iommu driver this
> way.
>
> So do not hold locks that are also held by the HW event path across
> copy_to_user().
I see. Thanks for explaining. I will add a patch fixing the fault
read() and change the veventq read() accordingly.
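Roughly, the read() path would then only hold the lock to detach one
event at a time, e.g. (a sketch only; the lock, list, and struct member
names are assumptions, and the buffer-size checks are elided):
--------------------------------------------------
static ssize_t veventq_read_one(struct iommufd_eventq *eventq,
                                char __user *buf, size_t len)
{
        struct iommufd_vevent *cur;
        ssize_t done;

        /* Hold the lock only long enough to detach one event */
        spin_lock(&eventq->lock);
        cur = list_first_entry_or_null(&eventq->deliver,
                                       struct iommufd_vevent, node);
        if (cur)
                list_del(&cur->node);
        spin_unlock(&eventq->lock);
        if (!cur)
                return -EAGAIN;

        /* copy_to_user() may sleep on a user page fault; no lock held */
        if (copy_to_user(buf, cur->event_data, cur->data_len)) {
                /* Put the unread event back for the next read() */
                spin_lock(&eventq->lock);
                list_add(&cur->node, &eventq->deliver);
                spin_unlock(&eventq->lock);
                return -EFAULT;
        }

        done = cur->data_len;
        kfree(cur);
        return done;
}
--------------------------------------------------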
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-13 5:37 ` Nicolin Chen
@ 2025-01-13 19:21 ` Jason Gunthorpe
2025-01-13 19:47 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-13 19:21 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Sun, Jan 12, 2025 at 09:37:41PM -0800, Nicolin Chen wrote:
> > > > I suppose we will need a way to indicate lost events to userspace on
> > > > top of this?
> > >
> > > Perhaps another u32 flag in the arm_smmuv3_vevent struct to report
> > > an overflow. That said, what will userspace/VMM need to do with it?
> >
> > Trigger the above code in the VM somehow?
>
> I found two ways of forwarding an overflow flag:
>
> 1. Return -EOVERFLOW to read(). But it cannot return the read bytes
> any more:
You could not return any bytes, it would have to be 0 bytes read, ie
immediately return EOVERFLOW and do nothing else.
Returning EOVERFLOW from read would have to also clear the overflow
indicator.
The other approach would be to add a sequence number to each event and
let userspace detect the non-monotonicity. It would require adding a
header to the native ARM evt.
> 2. Return EPOLLERR via pollfd.revents. But it cannot use POLLERR
> for other errors any more:
> --------------------------------------------------
> @@ -420,2 +421,4 @@ static __poll_t iommufd_eventq_fops_poll(struct file *filep,
> poll_wait(filep, &eventq->wait_queue, wait);
> + if (test_bit(IOMMUFD_VEVENTQ_ERROR_OVERFLOW, veventq->errors))
> + return EPOLLERR;
> mutex_lock(&eventq->mutex);
But then how do you clear the error? I've only seen POLLERR used for
fatal conditions so there is no recovery, it is permanent.
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 13/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster
2025-01-07 17:10 ` [PATCH v5 13/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster Nicolin Chen
@ 2025-01-13 19:29 ` Jason Gunthorpe
2025-01-13 19:52 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-13 19:29 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 07, 2025 at 09:10:16AM -0800, Nicolin Chen wrote:
> +int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
> + struct iommu_domain *domain)
> +{
> + struct arm_smmu_nested_domain *nested_domain;
> + struct arm_smmu_vmaster *vmaster;
> + unsigned long vsid;
> + unsigned int cfg;
> +
> + iommu_group_mutex_assert(state->master->dev);
> +
> + if (domain->type != IOMMU_DOMAIN_NESTED)
> + return 0;
> + nested_domain = to_smmu_nested_domain(domain);
> +
> + /* Skip ABORT/BYPASS or invalid vSTE */
> + cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(nested_domain->ste[0]));
> + if (cfg == STRTAB_STE_0_CFG_ABORT || cfg == STRTAB_STE_0_CFG_BYPASS)
> + return 0;
Why? If the VM sets an ABORT vSTE then I would expect that any
protection violation events the VM triggers are captured and forwarded
as well?
Basically any time a vSTE is in place we should capture events that
are affiliated with the SID?
> + if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V)))
> + return 0;
> +
> + vsid = iommufd_viommu_get_vdev_id(&nested_domain->vsmmu->core,
> + state->master->dev);
> > + /* Fail the attach if vSID is not correctly set by user space */
> + if (!vsid)
> + return -ENOENT;
Is it really OK that 0 is being used as invalid here?
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-13 19:21 ` Jason Gunthorpe
@ 2025-01-13 19:47 ` Nicolin Chen
2025-01-13 19:54 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-13 19:47 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Mon, Jan 13, 2025 at 03:21:44PM -0400, Jason Gunthorpe wrote:
> On Sun, Jan 12, 2025 at 09:37:41PM -0800, Nicolin Chen wrote:
>
> > > > > I suppose we will need a way to indicate lost events to userspace on
> > > > > top of this?
> > > >
> > > > Perhaps another u32 flag in the arm_smmuv3_vevent struct to report
> > > > an overflow. That said, what will userspace/VMM need to do with it?
> > >
> > > Trigger the above code in the VM somehow?
> >
> > I found two ways of forwarding an overflow flag:
> >
> > 1. Return -EOVERFLOW to read(). But it cannot return the read bytes
> > any more:
>
> You could not return any bytes, it would have to be 0 bytes read, ie
> immediately return EOVERFLOW and do nothing else.
>
> Returning EOVERFLOW from read would have to also clear the overflow
> indicator.
OK. That means user space should read again for actual events in the
queue, after getting the first EOVERFLOW.
One concern is, if the report() keeps producing events to the queue,
it will always set the EOVERFLOW flag, then user space won't have a
chance to read the events out until the last report(). Wondering if
this would make sense, as I see SMMU driver's arm_smmu_evtq_thread()
reporting an OVERFLOW while allowing SW to continue reading the evtq.
> The other approach would be to add a sequence number to each event and
> let userspace detect the non-monotonicity. It would require adding a
> header to the native ARM evt.
Yea, I thought about that. The tricky thing is that the header will
be a core-level header paired with a driver-level vEVENTQ type and
can never change its length, though we can define a 64-bit flag that
can reserve the other 63 bits for future use?
That being asked, this seems to be the better approach, as user space
can get the overflow flag while doing the read(), which can potentially
clear the overflow flag too, v.s. being blocked until the last report().
> > 2. Return EPOLLERR via pollfd.revents. But it cannot use POLLERR
> > for other errors any more:
> > --------------------------------------------------
> > @@ -420,2 +421,4 @@ static __poll_t iommufd_eventq_fops_poll(struct file *filep,
> > poll_wait(filep, &eventq->wait_queue, wait);
> > + if (test_bit(IOMMUFD_VEVENTQ_ERROR_OVERFLOW, veventq->errors))
> > + return EPOLLERR;
> > mutex_lock(&eventq->mutex);
>
> But then how do you clear the error? I've only seen POLLERR used for
> fatal conditions so there is no recovery, it is permanent.
Overflow means the queue has tons of events for user space to read(),
so user space should read() to clear the error.
I found this piece in the eventfd manual, so I was wondering if we can reuse it:
https://man7.org/linux/man-pages/man2/eventfd.2.html
* If an overflow of the counter value was detected, then
select(2) indicates the file descriptor as being both
readable and writable, and poll(2) returns a POLLERR
event. As noted above, write(2) can never overflow the
counter. However an overflow can occur if 2^64 eventfd
"signal posts" were performed by the KAIO subsystem
(theoretically possible, but practically unlikely). If
an overflow has occurred, then read(2) will return that
maximum uint64_t value (i.e., 0xffffffffffffffff).
Thanks
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 13/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster
2025-01-13 19:29 ` Jason Gunthorpe
@ 2025-01-13 19:52 ` Nicolin Chen
0 siblings, 0 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-13 19:52 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Mon, Jan 13, 2025 at 03:29:27PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 07, 2025 at 09:10:16AM -0800, Nicolin Chen wrote:
>
> > +int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
> > + struct iommu_domain *domain)
> > +{
> > + struct arm_smmu_nested_domain *nested_domain;
> > + struct arm_smmu_vmaster *vmaster;
> > + unsigned long vsid;
> > + unsigned int cfg;
> > +
> > + iommu_group_mutex_assert(state->master->dev);
> > +
> > + if (domain->type != IOMMU_DOMAIN_NESTED)
> > + return 0;
> > + nested_domain = to_smmu_nested_domain(domain);
> > +
> > + /* Skip ABORT/BYPASS or invalid vSTE */
> > + cfg = FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(nested_domain->ste[0]));
> > + if (cfg == STRTAB_STE_0_CFG_ABORT || cfg == STRTAB_STE_0_CFG_BYPASS)
> > + return 0;
>
> Why? If the VM sets an ABORT vSTE then I would expect that any
> protection violation events the VM triggers are captured and forwarded
> as well?
>
> Basically any time a vSTE is in place we should capture events that
> are affiliated with the SID?
I see. I will drop this.
> > + if (!(nested_domain->ste[0] & cpu_to_le64(STRTAB_STE_0_V)))
> > + return 0;
> > +
> > + vsid = iommufd_viommu_get_vdev_id(&nested_domain->vsmmu->core,
> > + state->master->dev);
> > > + /* Fail the attach if vSID is not correctly set by user space */
> > + if (!vsid)
> > + return -ENOENT;
>
> Is it really OK that 0 is being used as invalid here?
Hmm, perhaps it is better to make this an int-returning function that
can return -ENOENT. Will fix this.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-13 19:47 ` Nicolin Chen
@ 2025-01-13 19:54 ` Jason Gunthorpe
2025-01-13 20:44 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-13 19:54 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Mon, Jan 13, 2025 at 11:47:52AM -0800, Nicolin Chen wrote:
> > You could not return any bytes, it would have to be 0 bytes read, ie
> > immediately return EOVERFLOW and do nothing else.
> >
> > Returning EOVERFLOW from read would have to also clear the overflow
> > indicator.
>
> OK. That means user space should read again for actual events in the
> queue, after getting the first EOVERFLOW.
Yes
> One concern is, if the report() keeps producing events to the queue,
> it will always set the EOVERFLOW flag, then user space won't have a
> chance to read the events out until the last report(). Wondering if
> this would make sense, as I see SMMU driver's arm_smmu_evtq_thread()
> reporting an OVERFLOW while allowing SW to continue reading the evtq.
Yes, this issue seems fatal to this idea. You need to report the
overflow at the right point in the queue so that userspace can read
the data out to free up the queue, otherwise it will livelock.
> > The other approach would be to add a sequence number to each event and
> > > let userspace detect the non-monotonicity. It would require adding a
> > header to the native ARM evt.
>
> Yea, I thought about that. The tricky thing is that the header will
> > be a core-level header paired with a driver-level vEVENTQ type and
> can never change its length, though we can define a 64-bit flag that
> can reserve the other 63 bits for future use?
The header format could be revised by changing the driver specific
format tag.
You'd want to push a special event when the first overflow happens and
probably also report a counter so userspace can know how many events
got lost.
This seems most robust and simplest to implement..
I think I'd implement it by having a static overflow list entry so no
memory allocation is needed and just keep moving that entry to the
back of the list every time an event is lost. This way it will cover
lost events due to memory outages too
For old formats like the fault queue you could return EOVERFLOW
whenever the sequence number becomes discontiguous or it sees the
overflow event..
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-13 19:54 ` Jason Gunthorpe
@ 2025-01-13 20:44 ` Nicolin Chen
2025-01-14 13:41 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-13 20:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Mon, Jan 13, 2025 at 03:54:33PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 13, 2025 at 11:47:52AM -0800, Nicolin Chen wrote:
> > > The other approach would be to add a sequence number to each event and
> > > let userspace detect the non-montonicity. It would require adding a
> > > header to the native ARM evt.
> >
> > Yea, I thought about that. The tricky thing is that the header will
> > be a core-level header paired with a driver-level vEVENTQ type and
> > can never change its length, though we can define a 64-bit flag that
> > can reserve the other 63 bits for future use?
>
> The header format could be revised by changing the driver specific
> format tag.
Yea, we need a header format (or "header type") when the vEVENTQ
is allocated.
> You'd want to push a special event when the first overflow happens and
> probably also report a counter so userspace can know how many events
> got lost.
How about this:
enum iommufd_veventq_header_type {
IOMMU_VEVENTQ_HEADER_TYPE_V1,
};
enum iommu_vevent_header_flags {
IOMMU_VEVENT_HEADER_FLAGS_OVERFLOW = (1 << 0),
};
struct iommufd_vevent_header_v1 {
__u64 flags;
__u32 num_events;
__u32 num_overflows; // valid if flag_overflow is set
};
> This seems most robust and simplest to implement..
>
> I think I'd implement it by having a static overflow list entry so no
> memory allocation is needed and just keep moving that entry to the
> back of the list every time an event is lost. This way it will cover
> lost events due to memory outages too
Could the same static node get double-added to the list and
corrupt the list?
I think the vevent_header doesn't need to exactly match the
driver event. So, maybe a vEVENTQ object could hold a header
structure during iommu_veventq_alloc?
struct iommufd_veventq {
...
atomic_t num_events;
atomic_t num_overflows;
DECLARE_BITMAP(errors, 32);
struct iommufd_vevent_header_v1 *header;
};
The header is a bit redundant with the three members above it, though.
> For old formats like the fault queue you could return EOVERFLOW
> whenever the sequence number becomes discontiguous or it sees the
> overflow event..
So, IOMMU_OBJ_FAULT is safe to return EOVERFLOW via read(), as
you mentioned that it is self-limited right?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-13 20:44 ` Nicolin Chen
@ 2025-01-14 13:41 ` Jason Gunthorpe
2025-01-17 22:11 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-14 13:41 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Mon, Jan 13, 2025 at 12:44:37PM -0800, Nicolin Chen wrote:
> > You'd want to push a special event when the first overflow happens and
> > probably also report a counter so userspace can know how many events
> > got lost.
>
> How about this:
>
> enum iommufd_veventq_header_type {
> IOMMU_VEVENTQ_HEADER_TYPE_V1,
> };
You don't need another format tag, just describe it in the driver tag.
> enum iommu_vevent_header_flags {
> IOMMU_VEVENT_HEADER_FLAGS_OVERFLOW = (1 << 0),
> };
>
> struct iommufd_vevent_header_v1 {
> __u64 flags;
> __u32 num_events;
> __u32 num_overflows; // valid if flag_overflow is set
> };
num_overflows is hard, I'd just keep a counter.
> > This seems most robust and simplest to implement..
> >
> > I think I'd implement it by having a static overflow list entry so no
> > memory allocation is needed and just keep moving that entry to the
> > back of the list every time an event is lost. This way it will cover
> > lost events due to memory outages too
>
> Could double-adding the same static node to the list happen and
> corrupt the list?
It has to be done carefully, but under a lock you can make this work
> > For old formats like the fault queue you could return EOVERFLOW
> > whenever the sequence number becomes discontiguous or it sees the
> > overflow event..
>
> So, IOMMU_OBJ_FAULT is safe to return EOVERFLOW via read(), as
> you mentioned that it is self-limited right?
The idea for fault is that we wouldn't hit it, but if we add a limit,
or hit a memory allocation error, then it could happen
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 01/14] iommufd: Keep OBJ/IOCTL lists in an alphabetical order
2025-01-07 17:10 ` [PATCH v5 01/14] iommufd: Keep OBJ/IOCTL lists in an alphabetical order Nicolin Chen
2025-01-10 6:26 ` Tian, Kevin
2025-01-10 17:25 ` Jason Gunthorpe
@ 2025-01-14 19:29 ` Jason Gunthorpe
2 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-14 19:29 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 07, 2025 at 09:10:04AM -0800, Nicolin Chen wrote:
> Reorder the existing OBJ/IOCTL lists.
>
> Also run clang-format for the same coding style at line wrappings.
>
> No functional change.
>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/main.c | 30 ++++++++++++++----------------
> 1 file changed, 14 insertions(+), 16 deletions(-)
Applied to for-next
Thanks,
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-14 13:41 ` Jason Gunthorpe
@ 2025-01-17 22:11 ` Nicolin Chen
2025-01-20 18:18 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-17 22:11 UTC (permalink / raw)
To: Jason Gunthorpe, kevin.tian
Cc: corbet, will, joro, suravee.suthikulpanit, robin.murphy, dwmw2,
baolu.lu, shuah, linux-kernel, iommu, linux-arm-kernel,
linux-kselftest, linux-doc, eric.auger, jean-philippe, mdf,
mshavit, shameerali.kolothum.thodi, smostafa, ddutile, yi.l.liu,
patches
On Tue, Jan 14, 2025 at 09:41:58AM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 13, 2025 at 12:44:37PM -0800, Nicolin Chen wrote:
> > IOMMU_VEVENT_HEADER_FLAGS_OVERFLOW = (1 << 0),
> > };
> >
> > struct iommufd_vevent_header_v1 {
> > __u64 flags;
> > __u32 num_events;
> > __u32 num_overflows; // valid if flag_overflow is set
> > };
>
> num_overflows is hard, I'd just keep a counter.
Ack. How does this look overall?
@@ -1013,6 +1013,33 @@ struct iommu_ioas_change_process {
#define IOMMU_IOAS_CHANGE_PROCESS \
_IO(IOMMUFD_TYPE, IOMMUFD_CMD_IOAS_CHANGE_PROCESS)
+/**
+ * enum iommu_veventq_state - state for struct iommufd_vevent_header
+ * @IOMMU_VEVENTQ_STATE_OK: vEVENTQ is running
+ * @IOMMU_VEVENTQ_STATE_OVERFLOW: vEVENTQ is overflowed
+ */
+enum iommu_veventq_state {
+ IOMMU_VEVENTQ_STATE_OK = (1 << 0),
+ IOMMU_VEVENTQ_STATE_OVERFLOW = (1 << 1),
+};
+
+/**
+ * struct iommufd_vevent_header - Virtual Event Header for a vEVENTQ Status
+ * @state: One of enum iommu_veventq_state
+ * @counter: A counter reflecting the state of the vEVENTQ
+ *
+ * ----------------------------------------------------------------------------
+ * | @state | @counter |
+ * ----------------------------------------------------------------------------
+ * | IOMMU_VEVENTQ_STATE_OK | number of readable vEVENTs in the vEVENTQ |
+ * | IOMMU_VEVENTQ_STATE_OVERFLOW | number of missed vEVENTs since overflow |
+ * ----------------------------------------------------------------------------
+ */
+struct iommufd_vevent_header {
+ __u32 state;
+ __u32 counter;
+};
+
/**
* enum iommu_veventq_type - Virtual Event Queue Type
* @IOMMU_VEVENTQ_TYPE_DEFAULT: Reserved for future use
@@ -1050,6 +1077,13 @@ struct iommu_vevent_arm_smmuv3 {
*
* Explicitly allocate a virtual event queue interface for a vIOMMU. A vIOMMU
* can have multiple FDs for different types, but is confined to one per @type.
+ * User space should open the @out_veventq_fd to read vEVENTs out of a vEVENTQ,
+ * if there are vEVENTs available. A vEVENTQ will overflow if the number of
+ * vEVENTs hits @veventq_depth.
+ *
+ * Each vEVENT in a vEVENTQ encloses a struct iommufd_vevent_header followed by
+ * a type-specific data structure. The iommufd_vevent_header reports the status
+ * of the vEVENTQ when the vEVENT is added to the vEVENTQ.
*/
struct iommu_veventq_alloc {
__u32 size;
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-17 22:11 ` Nicolin Chen
@ 2025-01-20 18:18 ` Jason Gunthorpe
2025-01-20 20:52 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-20 18:18 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Fri, Jan 17, 2025 at 02:11:15PM -0800, Nicolin Chen wrote:
> +/**
> + * struct iommufd_vevent_header - Virtual Event Header for a vEVENTQ Status
> + * @state: One of enum iommu_veventq_state
I'd probably just make this a flag with overflow as the only current flag?
> + * @counter: A counter reflecting the state of the vEVENTQ
> + * ----------------------------------------------------------------------------
> + * | @state | @counter |
> + * ----------------------------------------------------------------------------
> + * | IOMMU_VEVENTQ_STATE_OK | number of readable vEVENTs in the vEVENTQ |
> + * | IOMMU_VEVENTQ_STATE_OVERFLOW | number of missed vEVENTs since overflow |
> + * ----------------------------------------------------------------------------
When I said counter I literally meant a counter of the number of events
that were sent into the queue. So if events are dropped there is a
trivial gap in the count. Very easy to implement
IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events
have been lost and no subsequent events are present. It exists to
ensure timely delivery of the overflow event to userspace. counter
will be the sequence number of the next successful event.
If events are lost in the middle of the queue then flags will remain 0
but counter will become non-monotonic. A counter delta > 1 indicates
that many events have been lost.
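In userspace, consuming such a stream could look like this (purely
illustrative; the vevent_header layout below follows this discussion,
not a final uAPI):
--------------------------------------------------
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct vevent_header {
        uint32_t flags;         /* bit 0: overflow, no payload follows */
        uint32_t sequence;      /* wraps back to 0 after UINT32_MAX */
};
#define VEVENT_OVERFLOW (1u << 0)

/* Walk one read() buffer of fixed-size events, reporting lost events */
static void parse_vevents(const uint8_t *p, size_t len, size_t evt_len,
                          uint32_t *next_seq)
{
        const uint8_t *end = p + len;

        while ((size_t)(end - p) >= sizeof(struct vevent_header)) {
                struct vevent_header hdr;

                memcpy(&hdr, p, sizeof(hdr));
                if (hdr.sequence != *next_seq)
                        fprintf(stderr, "lost %u event(s)\n",
                                (unsigned)(hdr.sequence - *next_seq));
                if (hdr.flags & VEVENT_OVERFLOW) {
                        /* Carries the sequence of the next successful
                         * event and is always last in the buffer */
                        *next_seq = hdr.sequence;
                        break;
                }
                /* The driver-specific payload sits at p + sizeof(hdr) */
                *next_seq = hdr.sequence + 1;
                p += sizeof(hdr) + evt_len;
        }
}
--------------------------------------------------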
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-20 18:18 ` Jason Gunthorpe
@ 2025-01-20 20:52 ` Nicolin Chen
2025-01-21 18:36 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-20 20:52 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Mon, Jan 20, 2025 at 02:18:54PM -0400, Jason Gunthorpe wrote:
> On Fri, Jan 17, 2025 at 02:11:15PM -0800, Nicolin Chen wrote:
> > +/**
> > + * struct iommufd_vevent_header - Virtual Event Header for a vEVENTQ Status
> > + * @state: One of enum iommu_veventq_state
>
> I'd probably just make this a flag with overflow as the only current flag?
Ack.
> > + * @counter: A counter reflecting the state of the vEVENTQ
>
> > + * ----------------------------------------------------------------------------
> > + * | @state | @counter |
> > + * ----------------------------------------------------------------------------
> > + * | IOMMU_VEVENTQ_STATE_OK | number of readable vEVENTs in the vEVENTQ |
> > + * | IOMMU_VEVENTQ_STATE_OVERFLOW | number of missed vEVENTs since overflow |
> > + * ----------------------------------------------------------------------------
>
> When I said counter I literally meant a counter of the number of events
> that were sent into the queue. So if events are dropped there is a
> trivial gap in the count. Very easy to implement
The counter of the number of events in the vEVENTQ could decrease
when userspace reads the queue. But you were saying "the number of
events that were sent into the queue", which is like a PROD index
that would keep growing but wrap to 0 after UINT_MAX?
> IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events
> have been lost and no subsequent events are present. It exists to
> ensure timely delivery of the overflow event to userspace. counter
> will be the sequence number of the next successful event.
So userspace should first read the header to decide whether or not
to read a vEVENT. If the header indicates an overflow, it should skip the vEVENT
struct and read the next header?
> If events are lost in the middle of the queue then flags will remain 0
> > but counter will become non-monotonic. A counter delta > 1 indicates
> that many events have been lost.
I don't quite get the "no subsequent events" v.s. "in the middle of
the queue"..
The producer is the driver calling iommufd_viommu_report_event, which
only produces a single vEVENT at a time. When the number of vEVENTs
in the vEVENTQ hits the @veventq_depth, it won't insert new vEVENTs
but add an overflow (or exception) node to the head of the deliver list
and increase the producer index so the next vEVENT that can find an
empty space in the queue will have an index with a gap (delta >= 1)?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-20 20:52 ` Nicolin Chen
@ 2025-01-21 18:36 ` Jason Gunthorpe
2025-01-21 19:55 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-21 18:36 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Mon, Jan 20, 2025 at 12:52:09PM -0800, Nicolin Chen wrote:
> The counter of the number of events in the vEVENTQ could decrease
> when userspace reads the queue. But you were saying "the number of
> events that were sent into the queue", which is like a PROD index
> that would keep growing but reset to 0 after UINT_MAX?
yes
> > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events
> > have been lost and no subsequent events are present. It exists to
> > ensure timely delivery of the overflow event to userspace. counter
> > will be the sequence number of the next successful event.
>
> So userspace should first read the header to decide whether or not
> > to read a vEVENT. If the header indicates an overflow, it should skip the vEVENT
> struct and read the next header?
Yes, but there won't be a next header. overflow would always be the
last thing in a read() response. If there is another event then
overflow is indicated by non-monotonic count.
> > If events are lost in the middle of the queue then flags will remain 0
> > but counter will become non-monotonic. A counter delta > 1 indicates
> > that many events have been lost.
>
> I don't quite get the "no subsequent events" v.s. "in the middle of
> the queue"..
I mean to suppress specific overflow events to userspace if the counter already
fully indicates overflow.
The purpose of the overflow event is specifically, and only, to
indicate immediately that an overflow occurred at the end of the queue,
and no additional events have been pushed since the overflow.
Without this we could lose an event and userspace may not realize
it for a long time.
> The producer is the driver calling iommufd_viommu_report_event that
> only produces a single vEVENT at a time. When the number of vEVENTs
> in the vEVENTQ hits the @veventq_depth, it won't insert new vEVENTs
> but add an overflow (or exception) node to the head of deliver list
> and increase the producer index so the next vEVENT that can find an
> empty space in the queue will have an index with a gap (delta >= 1)?
Yes, but each new overflow should move the single preallocated
overflow node back to the end of the list, and the read side should
skip the overflow node if it is not the last entry in the list
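On the producer side that could be something like (a sketch only; the
lock and struct members are assumed, and lost->node starts out
INIT_LIST_HEAD()ed so the list_empty() check works):
--------------------------------------------------
void veventq_push(struct iommufd_veventq *veventq,
                  struct iommufd_vevent *vevent)
{
        struct iommufd_vevent *lost = &veventq->lost_events_header;

        lockdep_assert_held(&veventq->common.lock);

        /* vevent is NULL when the allocation itself failed */
        if (vevent && veventq->num_events < veventq->depth) {
                vevent->header.sequence = veventq->sequence++;
                list_add_tail(&vevent->node, &veventq->common.deliver);
                veventq->num_events++;
                return;
        }

        /* A lost event still consumes a sequence number so readers see
         * a gap; move the single preallocated overflow node to the
         * tail, carrying the sequence of the next successful event
         * (the read side must remove it with list_del_init()) */
        veventq->sequence++;
        if (!list_empty(&lost->node))
                list_del(&lost->node);
        lost->header.sequence = veventq->sequence;
        list_add_tail(&lost->node, &veventq->common.deliver);
}
--------------------------------------------------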
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-21 18:36 ` Jason Gunthorpe
@ 2025-01-21 19:55 ` Nicolin Chen
2025-01-21 20:09 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-21 19:55 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 21, 2025 at 02:36:11PM -0400, Jason Gunthorpe wrote:
> On Mon, Jan 20, 2025 at 12:52:09PM -0800, Nicolin Chen wrote:
> > The counter of the number of events in the vEVENTQ could decrease
> > when userspace reads the queue. But you were saying "the number of
> > events that were sent into the queue", which is like a PROD index
> > that would keep growing but reset to 0 after UINT_MAX?
>
> yes
Ack. Then I think we should name it "index", besides a "counter"
indicating the number of events in the queue. Or perhaps a pair
of consumer and producer indexes that wrap at some limit.
> > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events
> > > have been lost and no subsequent events are present. It exists to
> > > ensure timely delivery of the overflow event to userspace. counter
> > > will be the sequence number of the next successful event.
> >
> > So userspace should first read the header to decide whether or not
> > to read a vEVENT. If the header indicates an overflow, it should skip the vEVENT
> > struct and read the next header?
>
> Yes, but there won't be a next header. overflow would always be the
> last thing in a read() response. If there is another event then
> overflow is indicated by non-monotonic count.
I am not 100% sure why "overflow would always be the last thing
in a read() response". I thought that the kernel should immediately
report an overflow to user space when the vEVENTQ is overflowed.
Yet, thinking about this once again: user space actually has its
own queue. There's probably no point in letting it know about an
overflow immediately when the kernel vEVENTQ overflows, until its
own user queue overflows after it reads the entire vEVENTQ, so it
can trigger a vHW event/irq to the VM?
> > > If events are lost in the middle of the queue then flags will remain 0
> > > but counter will become non-monotonic. A counter delta > 1 indicates
> > > that many events have been lost.
> >
> > I don't quite get the "no subsequent events" v.s. "in the middle of
> > the queue"..
>
> I mean to suppress specific overflow events to userspace if the counter already
> fully indicates overflow.
>
> The purpose of the overflow event is specifically, and only, to
> indicate immediately that an overflow occurred at the end of the queue,
> and no additional events have been pushed since the overflow.
>
> Without this we could lose an event and userspace may not realize
> it for a long time.
I see. Because there is no further new event, there would be no
new index to indicate a gap. Thus, we need an overflow node.
> > The producer is the driver calling iommufd_viommu_report_event, which
> > only produces a single vEVENT at a time. When the number of vEVENTs
> > in the vEVENTQ hits the @veventq_depth, it won't insert new vEVENTs
> > but add an overflow (or exception) node to the head of the deliver list
> > and increase the producer index so the next vEVENT that can find an
> > empty space in the queue will have an index with a gap (delta >= 1)?
>
> Yes, but each new overflow should move the single preallocated
> overflow node back to the end of the list, and the read side should
> skip the overflow node if it is not the last entry in the list
If the number of events in the queue drops below @veventq_depth as
userspace consumes the events from the queue, I think a new
iommufd_viommu_report_event call should delete the overflow node
from the end of the list, right? Since it can just insert a new
event to the list, marking it with a non-monotonic index..
Thanks!
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-21 19:55 ` Nicolin Chen
@ 2025-01-21 20:09 ` Jason Gunthorpe
2025-01-21 21:02 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-21 20:09 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 21, 2025 at 11:55:16AM -0800, Nicolin Chen wrote:
> Ack. Then I think we should name it "index", besides a "counter"
> indicating the number of events in the queue. Or perhaps a pair
> of consumer and producer indexes that wrap at some limit.
sequence perhaps would be a good name
> > > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events
> > > > have been lost and no subsequent events are present. It exists to
> > > > ensure timely delivery of the overflow event to userspace. counter
> > > > will be the sequence number of the next successful event.
> > >
> > > So userspace should first read the header to decide whether or not
> > > to read a vEVENT. If the header indicates an overflow, it should skip the vEVENT
> > > struct and read the next header?
> >
> > Yes, but there won't be a next header. overflow would always be the
> > last thing in a read() response. If there is another event then
> > overflow is indicated by non-monotonic count.
>
> I am not 100% sure why "overflow would always be the last thing
> in a read() response". I thought that kernel should immediately
> report an overflow to user space when the vEVENTQ is overflowed.
As below, if you observe overflow then it was at the end of the kernel
queue and there are no further events after it. So it should always end
up last.
Perhaps we could enforce this directly in the kernel's read by making
it the only, first and last, response to read.
> Yet, thinking about this once again: user space actually has its
> own queue. There's probably no point in letting it know about an
> overflow immediately when the kernel vEVENTQ overflows until its
> own user queue overflows after it reads the entire vEVENTQ so it
> can trigger a vHW event/irq to the VM?
The kernel has no idea what userspace is doing; the kernel's job
should be to ensure timely delivery of all events, and if an event is
lost it should ensure timely delivery of the lost event notification.
There is little else it can do.
I suppose userspace has a choice: it could discard events from the
kernel when its virtual HW queue gets full, or it could backpressure
the kernel and stop reading, hoping the kernel queue will buffer it
further.
> > Without this we could lose an event and userspace may not realize
> > it for a long time.
>
> I see. Because there is no further new event, there would be no
> new index to indicate a gap. Thus, we need an overflow node.
yes
> If the number of events in the queue is below @veventq_depth as
> userspace consumed the events from the queue, I think a new
> iommufd_viommu_report_event call should delete the overflow node
> from the end of the list, right?
You can do that, or the read side can ignore a non-end overflow node.
I'm not sure which option will turn out to be easier to implement..
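FWIW, the read-side variant could be as simple as (again a sketch with
assumed names, called under the same assumed eventq lock):
--------------------------------------------------
static struct iommufd_vevent *
veventq_fetch_next(struct iommufd_veventq *veventq)
{
        struct list_head *deliver = &veventq->common.deliver;
        struct iommufd_vevent *cur;

        cur = list_first_entry_or_null(deliver, struct iommufd_vevent,
                                       node);
        if (!cur)
                return NULL;
        if (cur == &veventq->lost_events_header &&
            !list_is_last(&cur->node, deliver)) {
                /* A stale overflow node: newer events were queued
                 * after it, so the sequence gap already tells the
                 * story; drop it and deliver the real event */
                list_del_init(&cur->node);
                cur = list_first_entry_or_null(deliver,
                                               struct iommufd_vevent,
                                               node);
        }
        return cur;
}
--------------------------------------------------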
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-21 20:09 ` Jason Gunthorpe
@ 2025-01-21 21:02 ` Nicolin Chen
2025-01-21 21:14 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-21 21:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 21, 2025 at 04:09:24PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 21, 2025 at 11:55:16AM -0800, Nicolin Chen wrote:
> > > > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events
> > > > > have been lost and no subsequent events are present. It exists to
> > > > > ensure timely delivery of the overflow event to userspace. counter
> > > > > will be the sequence number of the next successful event.
> > > >
> > > > So userspace should first read the header to decide whether or not
> > > > to read a vEVENT. If the header indicates an overflow, it should skip the vEVENT
> > > > struct and read the next header?
> > >
> > > Yes, but there won't be a next header. overflow would always be the
> > > last thing in a read() response. If there is another event then
> > > overflow is indicated by non-monotonic count.
> >
> > I am not 100% sure why "overflow would always be the last thing
> > in a read() response". I thought that kernel should immediately
> > report an overflow to user space when the vEVENTQ is overflowed.
>
> As below, if you observe overflow then it was at the end of the kernel
> > queue and there are no further events after it. So it should always end
> up last.
>
> Perhaps we could enforce this directly in the kernel's read by making
> it the only, first and last, response to read.
Hmm, since the overflow node is the last node in the list, isn't
that already an enforcement that it's the only/first/last node to read?
(Assuming we choose to delete the overflow node from the list when a
new event can be inserted.)
> > Yet, thinking about this once again: user space actually has its
> > own queue. There's probably no point in letting it know about an
> > overflow immediately when the kernel vEVENTQ overflows until its
> > own user queue overflows after it reads the entire vEVENTQ so it
> > can trigger a vHW event/irq to the VM?
>
> The kernel has no idea what userspace is doing, the kernel's job
> should be to ensure timely delivery of all events, if an event is lost
> it should ensure timely delivery of the lost event notification. There
> is little else it can do.
Yet, "timely" means still having an entire-queue-size-long delay
since the overflow node is at the end of the queue, right?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-21 21:02 ` Nicolin Chen
@ 2025-01-21 21:14 ` Jason Gunthorpe
2025-01-21 21:40 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-21 21:14 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 21, 2025 at 01:02:16PM -0800, Nicolin Chen wrote:
> On Tue, Jan 21, 2025 at 04:09:24PM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 21, 2025 at 11:55:16AM -0800, Nicolin Chen wrote:
> > > > > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events
> > > > > > have been lost and no subsequent events are present. It exists to
> > > > > > ensure timely delivery of the overflow event to userspace. counter
> > > > > > will be the sequence number of the next successful event.
> > > > >
> > > > > So userspace should first read the header to decide whether or not
> > > > > to read a vEVENT. If the header indicates an overflow, it should skip the vEVENT
> > > > > struct and read the next header?
> > > >
> > > > Yes, but there won't be a next header. overflow would always be the
> > > > last thing in a read() response. If there is another event then
> > > > overflow is indicated by non-monotonic count.
> > >
> > > I am not 100% sure why "overflow would always be the last thing
> > > in a read() response". I thought that kernel should immediately
> > > report an overflow to user space when the vEVENTQ is overflowed.
> >
> > As below, if you observe overflow then it was at the end of the kernel
> > > queue and there are no further events after it. So it should always end
> > up last.
> >
> > Perhaps we could enforce this directly in the kernel's read by making
> > it the only, first and last, response to read.
>
> Hmm, since the overflow node is the last node in the list, isn't
> that already an enforcement that it's the only/first/last node to read?
> (Assuming we choose to delete the overflow node from the list when a
> new event can be inserted.)
Since we don't hold the spinlock the whole time there is a race where
we could pull the overflow off and then another entry could be pushed
while we do the copy_to_user.
> > > Yet, thinking about this once again: user space actually has its
> > > own queue. There's probably no point in letting it know about an
> > > overflow immediately when the kernel vEVENTQ overflows until its
> > > own user queue overflows after it reads the entire vEVENTQ so it
> > > can trigger a vHW event/irq to the VM?
> >
> > The kernel has no idea what userspace is doing, the kernel's job
> > should be to ensure timely delivery of all events, if an event is lost
> > it should ensure timely delivery of the lost event notification. There
> > is little else it can do.
>
> Yet, "timely" means still having an entire-queue-size-long delay
> since the overflow node is at the end of the queue, right?
Yes, but also in this case the vIOMMU isn't experiencing an overflow
so it doesn't need to know about it.
The main point here is if we somehow lose an event the vIOMMU driver
may need to do something to recover from the lost event. It doesn't
become relevant until the lost event is present in the virtual HW
queue.
There is also the minor detail of what happens if the hypervisor HW
queue overflows - I don't know the answer here. It is a security
concern since the VM can spam DMA errors at a high rate. :|
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-21 21:14 ` Jason Gunthorpe
@ 2025-01-21 21:40 ` Nicolin Chen
2025-01-22 0:21 ` Jason Gunthorpe
2025-01-22 8:05 ` Nicolin Chen
0 siblings, 2 replies; 77+ messages in thread
From: Nicolin Chen @ 2025-01-21 21:40 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 21, 2025 at 05:14:04PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 21, 2025 at 01:02:16PM -0800, Nicolin Chen wrote:
> > On Tue, Jan 21, 2025 at 04:09:24PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 21, 2025 at 11:55:16AM -0800, Nicolin Chen wrote:
> > > > > > > IOMMU_VEVENTQ_STATE_OVERFLOW with a 0 length event is seen if events
> > > > > > > have been lost and no subsequent events are present. It exists to
> > > > > > > ensure timely delivery of the overflow event to userspace. counter
> > > > > > > will be the sequence number of the next successful event.
> > > > > >
> > > > > > So userspace should first read the header to decide whether or not
> > > > > > to read a vEVENT. If header is overflows, it should skip the vEVENT
> > > > > > struct and read the next header?
> > > > >
> > > > > Yes, but there won't be a next header. overflow would always be the
> > > > > last thing in a read() response. If there is another event then
> > > > > overflow is indicated by non-monotonic count.
> > > >
> > > > I am not 100% sure why "overflow would always be the last thing
> > > > in a read() response". I thought that kernel should immediately
> > > > report an overflow to user space when the vEVENTQ is overflowed.
> > >
> > > As below, if you observe overflow then it was at the end of the kernel
> > > queue and there is no further events after it. So it should always end
> > > up last.
> > >
> > > Perhaps we could enforce this directly in the kernel's read by making
> > > it the only, first and last, response to read.
> >
> > Hmm, since the overflow node is the last node in the list, isn't
> > it already an enforcement it's the only/first/last node to read?
> > (Assuming we choose to delete the overflow node from the list if
> > new event can be inserted.)
>
> Since we don't hold the spinlock the whole time there is a race where
> we could pull the overflow off and then another entry could be pushed
> while we do the copy_to_user.
I see. I'll be careful around that. I can imagine that one tricky
thing can be to restore the overflow node back to the list when a
copy_to_user fails..
> > > > Yet, thinking about this once again: user space actually has its
> > > > own queue. There's probably no point in letting it know about an
> > > > overflow immediately when the kernel vEVENTQ overflows until its
> > > > own user queue overflows after it reads the entire vEVENTQ so it
> > > > can trigger a vHW event/irq to the VM?
> > >
> > > The kernel has no idea what userspace is doing, the kernel's job
> > > should be to ensure timely delivery of all events, if an event is lost
> > > it should ensure timely delivery of the lost event notification. There
> > > is little else it can do.
> >
> > Yet, "timely" means still having an entire-queue-size-long delay
> > since the overflow node is at the end of the queue, right?
>
> Yes, but also in this case the vIOMMU isn't experiencing an overflow
> so it doesn't need to know about it.
>
> The main point here is if we somehow lose an event the vIOMMU driver
> may need to do something to recover from the lost event. It doesn't
> become relevant until the lost event is present in the virtual HW
> queue.
Ack.
> There is also the minor detail of what happens if the hypervisor HW
> queue overflows - I don't know the answer here. It is security
> concerning since the VM can spam DMA errors at high rate. :|
In my view, the hypervisor queue is the vHW queue for the VM, so
it should act like HW, which means it's up to the guest kernel
driver to handle the high-rate DMA errors..
The entire thing is complicated since we have a double buffer: a
kernel queue and a hypervisor queue. I am not very sure whether both
queues should have the same depth, or perhaps the kernel one might
have a slightly bigger size to buffer the hypervisor one. If only
the kernel could directly add events to the hypervisor queue and also
set the overflow flag in the hypervisor..
Thanks
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-21 21:40 ` Nicolin Chen
@ 2025-01-22 0:21 ` Jason Gunthorpe
2025-01-22 7:15 ` Nicolin Chen
2025-01-22 8:05 ` Nicolin Chen
1 sibling, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-22 0:21 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 21, 2025 at 01:40:05PM -0800, Nicolin Chen wrote:
> > There is also the minor detail of what happens if the hypervisor HW
> > queue overflows - I don't know the answer here. It is security
> > concerning since the VM can spam DMA errors at high rate. :|
>
> In my view, the hypervisor queue is the vHW queue for the VM, so
> it should act like a HW, which means it's up to the guest kernel
> driver that handles the high rate DMA errors..
I'm mainly wondering what happens if the single physical kernel
event queue overflows because it is DOS'd by a VM and the hypervisor
cannot drain it fast enough?
I haven't looked closely but is there some kind of rate limiting or
otherwise to mitigate DOS attacks on the shared event queue from VMs?
Jason
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-22 0:21 ` Jason Gunthorpe
@ 2025-01-22 7:15 ` Nicolin Chen
2025-01-22 9:33 ` Tian, Kevin
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-22 7:15 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 21, 2025 at 08:21:28PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 21, 2025 at 01:40:05PM -0800, Nicolin Chen wrote:
> > > There is also the minor detail of what happens if the hypervisor HW
> > > queue overflows - I don't know the answer here. It is security
> > > concerning since the VM can spam DMA errors at high rate. :|
> >
> > In my view, the hypervisor queue is the vHW queue for the VM, so
> > it should act like a HW, which means it's up to the guest kernel
> > driver that handles the high rate DMA errors..
>
> I'm mainly wondering what happens if the single physical kernel
> event queue overflows because it is DOS'd by a VM and the hypervisor
> cannot drain it fast enough?
>
> I haven't looked closely but is there some kind of rate limiting or
> otherwise to mitigate DOS attacks on the shared event queue from VMs?
SMMUv3 reads the event out of the physical kernel event queue,
and adds that to the faultq or veventq or prints it out. So it
wouldn't overflow because of a DOS? And all other drivers should
do the same?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-21 21:40 ` Nicolin Chen
2025-01-22 0:21 ` Jason Gunthorpe
@ 2025-01-22 8:05 ` Nicolin Chen
2025-01-22 18:02 ` Nicolin Chen
1 sibling, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-22 8:05 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Tue, Jan 21, 2025 at 01:40:13PM -0800, Nicolin Chen wrote:
> On Tue, Jan 21, 2025 at 05:14:04PM -0400, Jason Gunthorpe wrote:
> > Since we don't hold the spinlock the whole time there is a race where
> > we could pull the overflow off and then another entry could be pushed
> > while we do the copy_to_user.
>
> I see. I'll be careful around that. I can imagine that one tricky
> thing can be to restore the overflow node back to the list when a
> copy_to_user fails..
Hmm, it gets trickier because the overflow node is a preallocated
single node per vEVENTQ. We must not only check list_empty for its
restore, but also protect the overflow->header.sequence from races
between atomic_inc and copy_to_user. However, we can't use a mutex
because copy_to_user might DOS...
A simple solution could be to just duplicate overflow nodes, each
of which contains a different sequence, like a regular vEVENT node.
This certainly changes the uAPI for read(). Though the duplication
of overflow nodes doesn't sound optimal, it's acceptable since the
duplicated nodes would have been regular vEVENT nodes if there was
no overflow (i.e. no extra overhead)?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 77+ messages in thread
* RE: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-22 7:15 ` Nicolin Chen
@ 2025-01-22 9:33 ` Tian, Kevin
2025-01-22 19:54 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Tian, Kevin @ 2025-01-22 9:33 UTC (permalink / raw)
To: Nicolin Chen, Jason Gunthorpe
Cc: corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, January 22, 2025 3:16 PM
>
> On Tue, Jan 21, 2025 at 08:21:28PM -0400, Jason Gunthorpe wrote:
> > On Tue, Jan 21, 2025 at 01:40:05PM -0800, Nicolin Chen wrote:
> > > > There is also the minor detail of what happens if the hypervisor HW
> > > > queue overflows - I don't know the answer here. It is security
> > > > concerning since the VM can spam DMA errors at high rate. :|
> > >
> > > In my view, the hypervisor queue is the vHW queue for the VM, so
> > > it should act like a HW, which means it's up to the guest kernel
> > > driver that handles the high rate DMA errors..
> >
> > I'm mainly wondering what happens if the single physical kernel
> > event queue overflows because it is DOS'd by a VM and the hypervisor
> > cannot drain it fast enough?
> >
> > I haven't looked closely but is there some kind of rate limiting or
> > otherwise to mitigate DOS attacks on the shared event queue from VMs?
>
> SMMUv3 reads the event out of the physical kernel event queue,
> and adds that to faultq or veventq or prints it out. So, it'd
> not overflow because of DOS? And all other drivers should do
> the same?
>
"add that to faultq or eventq" could take time or the irqthread
could be preempted for various reasons then there is always an
window within which an overflow condition could occur due to
the smmu driver incapable of fetching pending events timely.
On VT-d the driver could disable reporting non-recoverable faults
for a given device via a control bit in the PASID entry, but I didn't
see a similar knob for PRQ.
And the chance of overflow on the intel-iommu driver is higher than
on ARM. The irqthread reads head/tail once, batch-reports the
events in between, and then updates the head register...
^ permalink raw reply [flat|nested] 77+ messages in thread
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-22 8:05 ` Nicolin Chen
@ 2025-01-22 18:02 ` Nicolin Chen
2025-01-23 7:02 ` Nicolin Chen
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-22 18:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Wed, Jan 22, 2025 at 12:05:27AM -0800, Nicolin Chen wrote:
> On Tue, Jan 21, 2025 at 01:40:13PM -0800, Nicolin Chen wrote:
> > On Tue, Jan 21, 2025 at 05:14:04PM -0400, Jason Gunthorpe wrote:
> > > Since we don't hold the spinlock the whole time there is a race where
> > > we could pull the overflow off and then another entry could be pushed
> > > while we do the copy_to_user.
> >
> > I see. I'll be careful around that. I can imagine that one tricky
> > thing can be to restore the overflow node back to the list when a
> > copy_to_user fails..
>
> Hmm, it gets trickier because the overflow node is a preallocated
> single node per vEVENTQ. We must not only check list_empty for its
> restore, but also protect the overflow->header.sequence from races
> between atomic_inc and copy_to_user. However, we can't use a mutex
> because copy_to_user might DOS...
>
> A simple solution could be to just duplicate overflow nodes, each
> of which contains a different sequence, like a regular vEVENT node.
> This certainly changes the uAPI for read(). Though the duplication
> of overflow nodes doesn't sound optimal, it's acceptable since the
> duplicated nodes would have been regular vEVENT nodes if there was
> no overflow (i.e. no extra overhead)?
Ah, I didn't think it through last night.. We can't simply add
overflow nodes either, for the rate/memory limit reason that you
were concerned about in the other email. On the other hand, though
certainly not ideal, tolerating the race on the sequence index
might not be that harmful, compared to the extreme of the other
case..
I'll give it another thought today to see if there's some other
way out.
Thanks
Nicolin
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-22 9:33 ` Tian, Kevin
@ 2025-01-22 19:54 ` Nicolin Chen
2025-01-23 13:42 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-22 19:54 UTC (permalink / raw)
To: Tian, Kevin
Cc: Jason Gunthorpe, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
On Wed, Jan 22, 2025 at 09:33:35AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, January 22, 2025 3:16 PM
> >
> > On Tue, Jan 21, 2025 at 08:21:28PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Jan 21, 2025 at 01:40:05PM -0800, Nicolin Chen wrote:
> > > > > There is also the minor detail of what happens if the hypervisor HW
> > > > > queue overflows - I don't know the answer here. It is security
> > > > > concerning since the VM can spam DMA errors at high rate. :|
> > > >
> > > > In my view, the hypervisor queue is the vHW queue for the VM, so
> > > > it should act like a HW, which means it's up to the guest kernel
> > > > driver that handles the high rate DMA errors..
> > >
> > > I'm mainly wondering what happens if the single physical kernel
> > > event queue overflows because it is DOS'd by a VM and the hypervisor
> > > cannot drain it fast enough?
> > >
> > > I haven't looked closely but is there some kind of rate limiting or
> > > otherwise to mitigate DOS attacks on the shared event queue from VMs?
> >
> > SMMUv3 reads the event out of the physical kernel event queue,
> > and adds that to faultq or veventq or prints it out. So, it'd
> > not overflow because of DOS? And all other drivers should do
> > the same?
> >
>
> "add that to faultq or eventq" could take time or the irqthread
> could be preempted for various reasons then there is always an
> window within which an overflow condition could occur due to
> the smmu driver incapable of fetching pending events timely.
Oh, I see..
> On VT-d the driver could disable reporting non-recoverable faults
> for a given device via a control bit in the PASID entry, but I didn't
> see a similar knob for PRQ.
ARM has an event-suppressing CD.R bit to disable event recording
for a device. However, the stage-1 CD is controlled by the guest
kernel, or by a VMM that controls the guest RAM..
ARM seems to also have an interesting event merging feature:
STE.MEV, bit [83]
Merge Events arising from terminated transactions from this stream.
0b0 Do not merge similar fault records
0b1 Permit similar fault records to be merged
The SMMU might be able to reduce the usage of the Event queue by
coalescing fault records that share the same page granule of address,
access type and SubstreamID.
Setting MEV == 1 does not guarantee that faults will be coalesced.
Setting MEV == 0 causes a physical SMMU to prevent coalescing of
fault records, however, a hypervisor might not honour this setting
if it deems a guest to be too verbose.
Note: Software must expect, and be able to deal with, coalesced fault
records even when MEV == 0.
Yet, the driver doesn't seem to set it at this moment.
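(A hypothetical sketch of what setting it could look like; since the
STE is a 512-bit entry of eight 64-bit words, bit [83] is bit 19 of
the second word. Names are made up, not the driver's.)

#define STE_1_MEV       (1UL << (83 - 64))      /* STE.MEV, bit [83] */

static void ste_permit_event_merging(__le64 *ste)
{
        /* Permit the SMMU to coalesce similar fault records */
        ste[1] |= cpu_to_le64(STE_1_MEV);
}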
Thanks
Nicolin
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-22 18:02 ` Nicolin Chen
@ 2025-01-23 7:02 ` Nicolin Chen
2025-01-23 13:43 ` Jason Gunthorpe
0 siblings, 1 reply; 77+ messages in thread
From: Nicolin Chen @ 2025-01-23 7:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Wed, Jan 22, 2025 at 10:02:49AM -0800, Nicolin Chen wrote:
> On Wed, Jan 22, 2025 at 12:05:27AM -0800, Nicolin Chen wrote:
> > On Tue, Jan 21, 2025 at 01:40:13PM -0800, Nicolin Chen wrote:
> > > On Tue, Jan 21, 2025 at 05:14:04PM -0400, Jason Gunthorpe wrote:
> > > > Since we don't hold the spinlock the whole time there is a race where
> > > > we could pull the overflow off and then another entry could be pushed
> > > > while we do the copy_to_user.
> > >
> > > I see. I'll be careful around that. I can imagine that one tricky
> > > thing can be to restore the overflow node back to the list when a
> > > copy_to_user fails..
> >
> > Hmm, it gets trickier because the overflow node is a preallocated
> > single node per vEVENTQ. We must not only check list_empty for its
> > restore, but also protect the overflow->header.sequence from races
> > between atomic_inc and copy_to_user. However, we can't use a mutex
> > because copy_to_user might DOS...
> >
> > A simple solution could be to just duplicate overflow nodes, each
> > of which contains a different sequence, like a regular vEVENT node.
> > This certainly changes the uAPI for read(). Though the duplication
> > of overflow nodes doesn't sound optimal, it's acceptable since the
> > duplicated nodes would have been regular vEVENT nodes if there was
> > no overflow (i.e. no extra overhead)?
>
> Ah, I didn't think it through last night.. We can't simply add
> overflow nodes either, for the rate/memory limit reason that you
> were concerned about in the other email. On the other hand, though
> certainly not ideal, tolerating the race on the sequence index
> might not be that harmful, compared to the extreme of the other
> case..
>
> I'll give it another thought today to see if there's some other
> way out.
I made a change to duplicate the overflow node, under the spinlock,
in the fetch() that is used for copy_to_user. Then the other routine
for vEVENT reporting, also protected by the spinlock, can update the
preallocated overflow node free of races.
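(Roughly like the sketch below; hypothetical names, not the actual
patch.)

/* Sketch: fetch() snapshots the preallocated overflow node while
 * holding the spinlock, so copy_to_user() can run unlocked on the
 * snapshot while the reporting path keeps bumping the live node's
 * sequence under the same lock.
 */
struct vevent {
        struct list_head node;
        u32 flags;
        u64 sequence;
};

struct veventq {
        spinlock_t lock;
        struct list_head deliver;
        struct vevent overflow;         /* single preallocated node */
};

static struct vevent *fetch_vevent(struct veventq *eventq,
                                   struct vevent *snapshot)
{
        struct vevent *cur;

        spin_lock(&eventq->lock);
        cur = list_first_entry_or_null(&eventq->deliver,
                                       struct vevent, node);
        if (cur == &eventq->overflow) {
                /* Duplicate under the lock: the live node's sequence
                 * may still be updated after we drop the lock. */
                *snapshot = *cur;
                INIT_LIST_HEAD(&snapshot->node);
                list_del_init(&cur->node);
                cur = snapshot;
        } else if (cur) {
                list_del(&cur->node);
        }
        spin_unlock(&eventq->lock);
        return cur;     /* caller does copy_to_user() without the lock */
}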
Nicolin
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-22 19:54 ` Nicolin Chen
@ 2025-01-23 13:42 ` Jason Gunthorpe
0 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-23 13:42 UTC (permalink / raw)
To: Nicolin Chen
Cc: Tian, Kevin, corbet@lwn.net, will@kernel.org, joro@8bytes.org,
suravee.suthikulpanit@amd.com, robin.murphy@arm.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com, shuah@kernel.org,
linux-kernel@vger.kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
eric.auger@redhat.com, jean-philippe@linaro.org, mdf@kernel.org,
mshavit@google.com, shameerali.kolothum.thodi@huawei.com,
smostafa@google.com, ddutile@redhat.com, Liu, Yi L,
patches@lists.linux.dev
On Wed, Jan 22, 2025 at 11:54:52AM -0800, Nicolin Chen wrote:
> ARM seems to also have an interesting event merging feature:
> STE.MEV, bit [83]
> Merge Events arising from terminated transactions from this stream.
> 0b0 Do not merge similar fault records
> 0b1 Permit similar fault records to be merged
> The SMMU might be able to reduce the usage of the Event queue by
> coalescing fault records that share the same page granule of address,
> access type and SubstreamID.
> Setting MEV == 1 does not guarantee that faults will be coalesced.
> Setting MEV == 0 causes a physical SMMU to prevent coalescing of
> fault records, however, a hypervisor might not honour this setting
> if it deems a guest to be too verbose.
> Note: Software must expect, and be able to deal with, coalesced fault
> records even when MEV == 0.
>
> Yet, the driver doesn't seem to set it at this moment.
Yeah, we will eventually need to implement whatever DOS mitigations
are included in the IOMMU architectures..
I think DOS testing the event architecture should be part of the
testing/qualification.
It should be quite easy to make a DOS spammer using VFIO in userspace
in the VM to test it with the mlx5 vfio driver.. (though you need VFIO
to work in a VM, which RMR will prevent, but that can be hacked around,
I guess)
Jason
* Re: [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper
2025-01-23 7:02 ` Nicolin Chen
@ 2025-01-23 13:43 ` Jason Gunthorpe
0 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-01-23 13:43 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, joro, suravee.suthikulpanit,
robin.murphy, dwmw2, baolu.lu, shuah, linux-kernel, iommu,
linux-arm-kernel, linux-kselftest, linux-doc, eric.auger,
jean-philippe, mdf, mshavit, shameerali.kolothum.thodi, smostafa,
ddutile, yi.l.liu, patches
On Wed, Jan 22, 2025 at 11:02:59PM -0800, Nicolin Chen wrote:
> I made a change to duplicate the overflow node, under the spinlock,
> in the fetch() that is used for copy_to_user. Then the other routine
> for vEVENT reporting, also protected by the spinlock, can update the
> preallocated overflow node free of races.
That seems like the right answer: copy the few bytes you need onto
the stack under the spinlock.
Jason
Thread overview: 77+ messages
2025-01-07 17:10 [PATCH v5 00/14] iommufd: Add vIOMMU infrastructure (Part-3: vEVENTQ) Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 01/14] iommufd: Keep OBJ/IOCTL lists in an alphabetical order Nicolin Chen
2025-01-10 6:26 ` Tian, Kevin
2025-01-10 17:25 ` Jason Gunthorpe
2025-01-14 19:29 ` Jason Gunthorpe
2025-01-07 17:10 ` [PATCH v5 02/14] iommufd/fault: Add an iommufd_fault_init() helper Nicolin Chen
2025-01-10 17:25 ` Jason Gunthorpe
2025-01-07 17:10 ` [PATCH v5 03/14] iommufd/fault: Move iommufd_fault_iopf_handler() to header Nicolin Chen
2025-01-10 17:25 ` Jason Gunthorpe
2025-01-07 17:10 ` [PATCH v5 04/14] iommufd: Abstract an iommufd_eventq from iommufd_fault Nicolin Chen
2025-01-10 6:26 ` Tian, Kevin
2025-01-10 17:26 ` Jason Gunthorpe
2025-01-10 20:49 ` Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 05/14] iommufd: Rename fault.c to eventq.c Nicolin Chen
2025-01-10 17:27 ` Jason Gunthorpe
2025-01-07 17:10 ` [PATCH v5 06/14] iommufd: Add IOMMUFD_OBJ_VEVENTQ and IOMMUFD_CMD_VEVENTQ_ALLOC Nicolin Chen
2025-01-10 7:06 ` Tian, Kevin
2025-01-10 21:29 ` Nicolin Chen
2025-01-13 2:52 ` Tian, Kevin
2025-01-13 4:51 ` Nicolin Chen
2025-01-13 8:17 ` Tian, Kevin
2025-01-13 19:10 ` Jason Gunthorpe
2025-01-10 17:48 ` Jason Gunthorpe
2025-01-10 19:27 ` Nicolin Chen
2025-01-10 19:49 ` Jason Gunthorpe
2025-01-10 21:58 ` Nicolin Chen
2025-01-13 19:12 ` Jason Gunthorpe
2025-01-13 19:18 ` Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 07/14] iommufd/viommu: Add iommufd_viommu_get_vdev_id helper Nicolin Chen
2025-01-10 7:07 ` Tian, Kevin
2025-01-10 21:35 ` Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 08/14] iommufd/viommu: Add iommufd_viommu_report_event helper Nicolin Chen
2025-01-10 7:12 ` Tian, Kevin
2025-01-10 14:51 ` Jason Gunthorpe
2025-01-10 18:40 ` Nicolin Chen
2025-01-10 17:41 ` Jason Gunthorpe
2025-01-10 18:38 ` Nicolin Chen
2025-01-10 19:51 ` Jason Gunthorpe
2025-01-10 19:56 ` Nicolin Chen
2025-01-13 5:37 ` Nicolin Chen
2025-01-13 19:21 ` Jason Gunthorpe
2025-01-13 19:47 ` Nicolin Chen
2025-01-13 19:54 ` Jason Gunthorpe
2025-01-13 20:44 ` Nicolin Chen
2025-01-14 13:41 ` Jason Gunthorpe
2025-01-17 22:11 ` Nicolin Chen
2025-01-20 18:18 ` Jason Gunthorpe
2025-01-20 20:52 ` Nicolin Chen
2025-01-21 18:36 ` Jason Gunthorpe
2025-01-21 19:55 ` Nicolin Chen
2025-01-21 20:09 ` Jason Gunthorpe
2025-01-21 21:02 ` Nicolin Chen
2025-01-21 21:14 ` Jason Gunthorpe
2025-01-21 21:40 ` Nicolin Chen
2025-01-22 0:21 ` Jason Gunthorpe
2025-01-22 7:15 ` Nicolin Chen
2025-01-22 9:33 ` Tian, Kevin
2025-01-22 19:54 ` Nicolin Chen
2025-01-23 13:42 ` Jason Gunthorpe
2025-01-22 8:05 ` Nicolin Chen
2025-01-22 18:02 ` Nicolin Chen
2025-01-23 7:02 ` Nicolin Chen
2025-01-23 13:43 ` Jason Gunthorpe
2025-01-07 17:10 ` [PATCH v5 09/14] iommufd/selftest: Require vdev_id when attaching to a nested domain Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 10/14] iommufd/selftest: Add IOMMU_TEST_OP_TRIGGER_VEVENT for vEVENTQ coverage Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 11/14] iommufd/selftest: Add IOMMU_VEVENTQ_ALLOC test coverage Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 12/14] Documentation: userspace-api: iommufd: Update FAULT and VEVENTQ Nicolin Chen
2025-01-10 7:13 ` Tian, Kevin
2025-01-07 17:10 ` [PATCH v5 13/14] iommu/arm-smmu-v3: Introduce struct arm_smmu_vmaster Nicolin Chen
2025-01-13 19:29 ` Jason Gunthorpe
2025-01-13 19:52 ` Nicolin Chen
2025-01-07 17:10 ` [PATCH v5 14/14] iommu/arm-smmu-v3: Report events that belong to devices attached to vIOMMU Nicolin Chen
2025-01-09 11:04 ` kernel test robot
2025-01-13 19:01 ` Nicolin Chen
2025-01-13 19:06 ` Jason Gunthorpe
2025-01-13 19:15 ` Nicolin Chen
2025-01-13 19:18 ` Jason Gunthorpe