* [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE)
@ 2025-06-26 19:34 Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 01/28] iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range Nicolin Chen
` (27 more replies)
0 siblings, 28 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
The vIOMMU object is designed to represent a slice of an IOMMU HW for its
virtualization features shared with or passed to user space (mostly a VM)
in a HW-accelerated way. This extends the HWPT-based design to more
advanced virtualization features.
HW QUEUE, introduced by this series as part of the vIOMMU infrastructure,
represents a HW-accelerated queue/buffer for a VM to use exclusively, e.g.
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffer, and PPR Log Buffer
each of which allows its IOMMU HW to directly access a queue memory owned
by a guest VM and allows the guest OS to control the HW queue directly,
avoiding VM Exit overheads and improving performance.
Introduce IOMMUFD_OBJ_HW_QUEUE and its paired IOMMUFD_CMD_HW_QUEUE_ALLOC,
allowing a VMM to forward the IOMMU-specific queue info, such as the queue
base address and size.
Meanwhile, a guest-owned queue needs the guest kernel to control the queue
by reading/writing its consumer and producer indexes via MMIO accesses to
the hardware registers. Introduce an mmap infrastructure in iommufd to
support passing through a piece of MMIO region from the host physical
address space to the guest physical address space. The mmap info (offset/
length) used by an mmap syscall must be pre-allocated and returned to user
space via output driver data during an IOMMUFD_CMD_HW_QUEUE_ALLOC call.
Thus, this requires driver-specific user data support in the vIOMMU
allocation flow.
As a real-world use case, this series implements HW QUEUE support in the
tegra241-cmdqv driver for VCMDQs on the NVIDIA Grace CPU. In other words,
it is also Part-2 (user-space support) of the Tegra CMDQV series, reworked
from the previous RFCv1:
https://lore.kernel.org/all/cover.1712978212.git.nicolinc@nvidia.com/
This enables the HW-accelerated feature on the NVIDIA Grace CPU. Compared
to the standard SMMUv3 operating in nested translation mode and trapping
the CMDQ for TLBI and ATC_INV commands, this gives a huge performance
improvement: 70% to 90% reductions in invalidation time were measured by
various DMA unmap tests running in a guest OS.
// Unmap latencies from "dma_map_benchmark -g @granule -t @threads",
// by toggling "/sys/kernel/debug/iommu/tegra241_cmdqv/bypass_vcmdq"
@granule | @threads | bypass_vcmdq=1 | bypass_vcmdq=0
   4KB   |    1     |    35.7 us     |     5.3 us
  16KB   |    1     |    41.8 us     |     6.8 us
  64KB   |    1     |    68.9 us     |     9.9 us
 128KB   |    1     |   109.0 us     |    12.6 us
 256KB   |    1     |   187.1 us     |    18.0 us
   4KB   |    2     |    96.9 us     |     6.8 us
  16KB   |    2     |    97.8 us     |     7.5 us
  64KB   |    2     |   151.5 us     |    10.7 us
 128KB   |    2     |   257.8 us     |    12.7 us
 256KB   |    2     |   443.0 us     |    17.9 us
This is on GitHub:
https://github.com/nicolinc/iommufd/commits/iommufd_hw_queue-v7
Pairing QEMU branch for testing:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_hw_queue-v7
Changelog
v7
* Rebased on Jason's for-next tree (iommufd_hw_queue-prep series)
* Add Reviewed-by from Baolu, Jason, Pranjal
* Update kdocs and notes
* [iommu] Replace "u32" with "enum iommu_hw_info_type"
* [iommufd] Rename vdev->id to vdev->virt_id
* [iommufd] Replace macros with inline helpers
* [iommufd] Report unmapped_bytes in error path
* [iommufd] Add iommufd_access_is_internal helper
* [iommufd] Do not drop ops->unmap check for mdevs
* [iommufd] Store physical addresses in immap structure
* [iommufd] Reorder access and hw_queue object allocations
* [iommufd] Scan for an internal access before any unmap call
* [iommufd] Drop unused ictx pointer in struct iommufd_hw_queue
* [iommufd] Use kcalloc to avoid failure due to memory fragmentation
* [tegra] Use "else"
* [tegra] Lock destroy() using lvcmdq_mutex
v6
https://lore.kernel.org/all/cover.1749884998.git.nicolinc@nvidia.com/
* Rebase on iommufd_hw_queue-prep-v2
* Add Reviewed-by from Kevin and Jason
* [iommufd] Update kdocs and notes
* [iommufd] Drop redundant pages[i] check
* [iommufd] Allow nesting_parent_iova to be 0
* [iommufd] Add iommufd_hw_queue_alloc_phys()
* [iommufd] Revise iommufd_viommu_alloc/destroy_mmap APIs
* [iommufd] Move destroy ops to vdevice/hw_queue structures
* [iommufd] Add union in hw_info struct to share out_data_type field
* [iommufd] Replace iopt_pin/unpin_pages() with internal access APIs
* [iommufd] Replace vdevice_alloc with vdevice_size and vdevice_init
* [iommufd] Replace hw_queue_alloc with get_hw_queue_size/hw_queue_init
* [iommufd] Replace IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA with init_phys
* [smmu] Drop arm_smmu_domain_ipa_to_pa
* [smmu] Update arm_smmu_impl_ops changes for vsmmu_init
* [tegra] Add a vdev_to_vsid macro
* [tegra] Add lvcmdq_mutex to protect multi queues
* [tegra] Drop duplicated kcalloc for vintf->lvcmdqs (memory leak)
v5
https://lore.kernel.org/all/cover.1747537752.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc6
* Add Reviewed-by from Jason and Kevin
* Correct typos in kdoc and update commit logs
* [iommufd] Add a cosmetic fix
* [iommufd] Drop unused num_pfns
* [iommufd] Drop unnecessary check
* [iommufd] Reorder patch sequence
* [iommufd] Use io_remap_pfn_range()
* [iommufd] Use success oriented flow
* [iommufd] Fix max_npages calculation
* [iommufd] Add more selftest coverage
* [iommufd] Drop redundant static_assert
* [iommufd] Fix mmap pfn range validation
* [iommufd] Reject unmap on pinned iovas
* [iommufd] Drop redundant vm_flags_set()
* [iommufd] Drop iommufd_struct_destroy()
* [iommufd] Drop redundant queue iova test
* [iommufd] Use "mmio_addr" and "mmio_pfn"
* [iommufd] Rename to "nesting_parent_iova"
* [iommufd] Make iopt_pin_pages call option
* [iommufd] Add ictx comparison in depend()
* [iommufd] Add iommufd_object_alloc_ucmd()
* [iommufd] Move kcalloc() after validations
* [iommufd] Replace ictx setting with WARN_ON
* [iommufd] Make hw_info's type bidirectional
* [smmu] Add supported_vsmmu_type in impl_ops
* [smmu] Drop impl report in smmu vendor struct
* [tegra] Add IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV
* [tegra] Replace "number of VINTFs" with a note
* [tegra] Drop the redundant lvcmdq pointer setting
* [tegra] Flag IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA
* [tegra] Use "vintf_alloc_vsid" for vdevice_alloc op
v4
https://lore.kernel.org/all/cover.1746757630.git.nicolinc@nvidia.com/
* Rebase on v6.15-rc5
* Add Reviewed-by from Vasant
* Rename "vQUEUE" to "HW QUEUE"
* Use "offset" and "length" for all mmap-related variables
* [iommufd] Use u64 for guest PA
* [iommufd] Fix typo in uAPI doc
* [iommufd] Rename immap_id to offset
* [iommufd] Drop the partial-size mmap support
* [iommufd] Do not replace WARN_ON with WARN_ON_ONCE
* [iommufd] Use "u64 base_addr" for queue base address
* [iommufd] Use u64 base_pfn/num_pfns for immap structure
* [iommufd] Correct the size passed in to mtree_alloc_range()
* [iommufd] Add IOMMUFD_VIOMMU_FLAG_HW_QUEUE_READS_PA to viommu_ops
v3
https://lore.kernel.org/all/cover.1746139811.git.nicolinc@nvidia.com/
* Add Reviewed-by from Baolu, Pranjal, and Alok
* Revise kdocs, uAPI docs, and commit logs
* Rename "vCMDQ" back to "vQUEUE" for AMD cases
* [tegra] Add tegra241_vcmdq_hw_flush_timeout()
* [tegra] Rename vsmmu_alloc to alloc_vintf_user
* [tegra] Use writel for SID replacement registers
* [tegra] Move mmap removal call to vsmmu_destroy op
* [tegra] Fix revert in tegra241_vintf_alloc_lvcmdq_user()
* [iommufd] Replace "& ~PAGE_MASK" with PAGE_ALIGNED()
* [iommufd] Add an object-type "owner" to immap structure
* [iommufd] Drop the ictx input in the new for-driver APIs
* [iommufd] Add iommufd_vma_ops to keep track of mmap lifecycle
* [iommufd] Add viommu-based iommufd_viommu_alloc/destroy_mmap helpers
* [iommufd] Rename iommufd_ctx_alloc/free_mmap to
_iommufd_alloc/destroy_mmap
v2
https://lore.kernel.org/all/cover.1745646960.git.nicolinc@nvidia.com/
* Add Reviewed-by from Jason
* [smmu] Fix vsmmu initial value
* [smmu] Support impl for hw_info
* [tegra] Rename "slot" to "vsid"
* [tegra] Update kdocs and commit logs
* [tegra] Map/unmap LVCMDQ dynamically
* [tegra] Refcount the previous LVCMDQ
* [tegra] Return -EEXIST if LVCMDQ exists
* [tegra] Simplify VINTF cleanup routine
* [tegra] Use vmid and s2_domain in vsmmu
* [tegra] Rename "mmap_pgoff" to "immap_id"
* [tegra] Add more addr and length validation
* [iommufd] Add more narrative to mmap's kdoc
* [iommufd] Add iommufd_struct_depend/undepend()
* [iommufd] Rename vcmdq_free op to vcmdq_destroy
* [iommufd] Fix bug in iommu_copy_struct_to_user()
* [iommufd] Drop is_io from iommufd_ctx_alloc_mmap()
* [iommufd] Test the queue memory for its contiguity
* [iommufd] Return -ENXIO if address or length fails
* [iommufd] Do not change @min_last in mock_viommu_alloc()
* [iommufd] Generalize TEGRA241_VCMDQ data in core structure
* [iommufd] Add selftest coverage for IOMMUFD_CMD_VCMDQ_ALLOC
* [iommufd] Add iopt_pin_pages() to prevent queue memory from unmapping
v1
https://lore.kernel.org/all/cover.1744353300.git.nicolinc@nvidia.com/
Thanks
Nicolin
Nicolin Chen (28):
iommufd: Report unmapped bytes in the error path of
iopt_unmap_iova_range
iommufd/viommu: Explicitly define vdev->virt_id
iommu: Use enum iommu_hw_info_type for type in hw_info op
iommu: Add iommu_copy_struct_to_user helper
iommu: Pass in a driver-level user data structure to viommu_init op
iommufd/viommu: Allow driver-specific user data for a vIOMMU object
iommufd/selftest: Support user_data in mock_viommu_alloc
iommufd/selftest: Add coverage for viommu data
iommufd/access: Add internal APIs for HW queue to use
iommufd/access: Bypass access->ops->unmap for internal use
iommufd/viommu: Add driver-defined vDEVICE support
iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct
iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers
iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC
iommufd: Add mmap interface
iommufd/selftest: Add coverage for the new mmap interface
Documentation: userspace-api: iommufd: Update HW QUEUE
iommu: Allow an input type in hw_info op
iommufd: Allow an input data_type via iommu_hw_info
iommufd/selftest: Update hw_info coverage for an input data_type
iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/tegra241-cmdqv: Simplify deinit flow in
tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 22 +-
drivers/iommu/iommufd/iommufd_private.h | 50 +-
drivers/iommu/iommufd/iommufd_test.h | 20 +
include/linux/iommu.h | 50 +-
include/linux/iommufd.h | 160 ++++++
include/uapi/linux/iommufd.h | 145 +++++-
tools/testing/selftests/iommu/iommufd_utils.h | 89 +++-
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 28 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 484 +++++++++++++++++-
drivers/iommu/intel/iommu.c | 7 +-
drivers/iommu/iommufd/device.c | 90 +++-
drivers/iommu/iommufd/driver.c | 81 ++-
drivers/iommu/iommufd/io_pagetable.c | 17 +-
drivers/iommu/iommufd/main.c | 69 +++
drivers/iommu/iommufd/selftest.c | 153 +++++-
drivers/iommu/iommufd/viommu.c | 208 +++++++-
tools/testing/selftests/iommu/iommufd.c | 143 +++++-
.../selftests/iommu/iommufd_fail_nth.c | 15 +-
Documentation/userspace-api/iommufd.rst | 12 +
19 files changed, 1736 insertions(+), 107 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 67+ messages in thread
* [PATCH v7 01/28] iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-07-02 9:39 ` Tian, Kevin
2025-07-04 12:59 ` Jason Gunthorpe
2025-06-26 19:34 ` [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id Nicolin Chen
` (26 subsequent siblings)
27 siblings, 2 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
There are callers that read the unmapped bytes even when rc != 0. Thus,
report them in the error path too.
Fixes: 8d40205f6093 ("iommufd: Add kAPI toward external drivers for kernel access")
Cc: stable@vger.kernel.org
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/io_pagetable.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 13d010f19ed1..22fc3a12109f 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -743,8 +743,10 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
iommufd_access_notify_unmap(iopt, area_first, length);
/* Something is not responding to unmap requests. */
tries++;
- if (WARN_ON(tries > 100))
- return -EDEADLOCK;
+ if (WARN_ON(tries > 100)) {
+ rc = -EDEADLOCK;
+ goto out_unmapped;
+ }
goto again;
}
@@ -766,6 +768,7 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
out_unlock_iova:
up_write(&iopt->iova_rwsem);
up_read(&iopt->domains_rwsem);
+out_unmapped:
if (unmapped)
*unmapped = unmapped_bytes;
return rc;
--
2.43.0
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 01/28] iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-07-01 12:30 ` Pranjal Shrivastava
` (2 more replies)
2025-06-26 19:34 ` [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op Nicolin Chen
` (25 subsequent siblings)
27 siblings, 3 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
The name "id" is too general to convey its meaning easily. Rename it
explicitly to "virt_id" and update the kdocs for readability. No
functional changes.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 7 ++++++-
drivers/iommu/iommufd/driver.c | 2 +-
drivers/iommu/iommufd/viommu.c | 4 ++--
3 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 4f5e8cd99c96..09f895638f68 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -634,7 +634,12 @@ struct iommufd_vdevice {
struct iommufd_object obj;
struct iommufd_viommu *viommu;
struct device *dev;
- u64 id; /* per-vIOMMU virtual ID */
+
+ /*
+ * Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID of
+ * AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
+ */
+ u64 virt_id;
};
#ifdef CONFIG_IOMMUFD_TEST
diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c
index 2fee399a148e..887719016804 100644
--- a/drivers/iommu/iommufd/driver.c
+++ b/drivers/iommu/iommufd/driver.c
@@ -30,7 +30,7 @@ int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu,
xa_lock(&viommu->vdevs);
xa_for_each(&viommu->vdevs, index, vdev) {
if (vdev->dev == dev) {
- *vdev_id = vdev->id;
+ *vdev_id = vdev->virt_id;
rc = 0;
break;
}
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 25ac08fbb52a..bc8796e6684e 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -111,7 +111,7 @@ void iommufd_vdevice_destroy(struct iommufd_object *obj)
struct iommufd_viommu *viommu = vdev->viommu;
/* xa_cmpxchg is okay to fail if alloc failed xa_cmpxchg previously */
- xa_cmpxchg(&viommu->vdevs, vdev->id, vdev, NULL, GFP_KERNEL);
+ xa_cmpxchg(&viommu->vdevs, vdev->virt_id, vdev, NULL, GFP_KERNEL);
refcount_dec(&viommu->obj.users);
put_device(vdev->dev);
}
@@ -150,7 +150,7 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd)
goto out_put_idev;
}
- vdev->id = virt_id;
+ vdev->virt_id = virt_id;
vdev->dev = idev->dev;
get_device(idev->dev);
vdev->viommu = viommu;
--
2.43.0
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 01/28] iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-07-01 12:48 ` Pranjal Shrivastava
` (3 more replies)
2025-06-26 19:34 ` [PATCH v7 04/28] iommu: Add iommu_copy_struct_to_user helper Nicolin Chen
` (24 subsequent siblings)
27 siblings, 4 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
Replace the u32 with enum iommu_hw_info_type to make the meaning clear.
No functional changes.
Also simplify the kdoc, since the enum type itself is clear enough.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 6 +++---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 3 ++-
drivers/iommu/intel/iommu.c | 3 ++-
drivers/iommu/iommufd/selftest.c | 3 ++-
4 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 04548b18df28..b87c2841e6bc 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -563,8 +563,7 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size,
* @capable: check capability
* @hw_info: report iommu hardware information. The data buffer returned by this
* op is allocated in the iommu driver and freed by the caller after
- * use. The information type is one of enum iommu_hw_info_type defined
- * in include/uapi/linux/iommufd.h.
+ * use.
* @domain_alloc: Do not use in new drivers
* @domain_alloc_identity: allocate an IDENTITY domain. Drivers should prefer to
* use identity_domain instead. This should only be used
@@ -623,7 +622,8 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size,
*/
struct iommu_ops {
bool (*capable)(struct device *dev, enum iommu_cap);
- void *(*hw_info)(struct device *dev, u32 *length, u32 *type);
+ void *(*hw_info)(struct device *dev, u32 *length,
+ enum iommu_hw_info_type *type);
/* Domain allocation and freeing by the iommu driver */
#if IS_ENABLED(CONFIG_FSL_PAMU)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 9f59c95a254c..69bbe39e28de 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -7,7 +7,8 @@
#include "arm-smmu-v3.h"
-void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
+void *arm_smmu_hw_info(struct device *dev, u32 *length,
+ enum iommu_hw_info_type *type)
{
struct arm_smmu_master *master = dev_iommu_priv_get(dev);
struct iommu_hw_info_arm_smmuv3 *info;
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 7aa3932251b2..850f1a6f548c 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4091,7 +4091,8 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
return ret;
}
-static void *intel_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
+static void *intel_iommu_hw_info(struct device *dev, u32 *length,
+ enum iommu_hw_info_type *type)
{
struct device_domain_info *info = dev_iommu_priv_get(dev);
struct intel_iommu *iommu = info->iommu;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 74ca955a766e..7a9abe3f47d5 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -287,7 +287,8 @@ static struct iommu_domain mock_blocking_domain = {
.ops = &mock_blocking_ops,
};
-static void *mock_domain_hw_info(struct device *dev, u32 *length, u32 *type)
+static void *mock_domain_hw_info(struct device *dev, u32 *length,
+ enum iommu_hw_info_type *type)
{
struct iommu_test_hw_info *info;
--
2.43.0
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v7 04/28] iommu: Add iommu_copy_struct_to_user helper
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (2 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 05/28] iommu: Pass in a driver-level user data structure to viommu_init op Nicolin Chen
` (23 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
Similar to the iommu_copy_struct_from_user helper that receives data from
user space, add an iommu_copy_struct_to_user helper to report output data
back to a user space data pointer.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b87c2841e6bc..fd7319706684 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -558,6 +558,46 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size,
return 0;
}
+/**
+ * __iommu_copy_struct_to_user - Report iommu driver specific user space data
+ * @dst_data: Pointer to a struct iommu_user_data for user space data location
+ * @src_data: Pointer to an iommu driver specific user data that is defined in
+ * include/uapi/linux/iommufd.h
+ * @data_type: The data type of the @src_data. Must match with @dst_data.type
+ * @data_len: Length of current user data structure, i.e. sizeof(struct _src)
+ * @min_len: Initial length of user data structure for backward compatibility.
+ * This should be offsetofend using the last member in the user data
+ * struct that was initially added to include/uapi/linux/iommufd.h
+ */
+static inline int
+__iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
+ void *src_data, unsigned int data_type,
+ size_t data_len, size_t min_len)
+{
+ if (WARN_ON(!dst_data || !src_data))
+ return -EINVAL;
+ if (dst_data->type != data_type)
+ return -EINVAL;
+ if (dst_data->len < min_len || data_len < dst_data->len)
+ return -EINVAL;
+ return copy_struct_to_user(dst_data->uptr, dst_data->len, src_data,
+ data_len, NULL);
+}
+
+/**
+ * iommu_copy_struct_to_user - Report iommu driver specific user space data
+ * @user_data: Pointer to a struct iommu_user_data for user space data location
+ * @ksrc: Pointer to an iommu driver specific user data that is defined in
+ * include/uapi/linux/iommufd.h
+ * @data_type: The data type of the @ksrc. Must match with @user_data->type
+ * @min_last: The last member of the data structure @ksrc points in the initial
+ * version.
+ * Return 0 for success, otherwise -error.
+ */
+#define iommu_copy_struct_to_user(user_data, ksrc, data_type, min_last) \
+ __iommu_copy_struct_to_user(user_data, ksrc, data_type, sizeof(*ksrc), \
+ offsetofend(typeof(*ksrc), min_last))
+
/**
* struct iommu_ops - iommu ops and capabilities
* @capable: check capability
--
2.43.0
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v7 05/28] iommu: Pass in a driver-level user data structure to viommu_init op
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (3 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 04/28] iommu: Add iommu_copy_struct_to_user helper Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 06/28] iommufd/viommu: Allow driver-specific user data for a vIOMMU object Nicolin Chen
` (22 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
The new type of vIOMMU for tegra241-cmdqv allows a user space VM to use
one of its virtual command queue HW resources exclusively. This requires
user space to mmap the corresponding MMIO page from kernel space for
direct HW control.
To forward the mmap info (offset and length), iommufd should add a
driver-specific data structure to the IOMMUFD_CMD_VIOMMU_ALLOC ioctl, for
the driver to output the info back to user space during the vIOMMU
initialization.
Similar to the existing ioctls and their IOMMU handlers, add a user_data
argument to the viommu_init op to bridge between iommufd and drivers.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 3 ++-
include/linux/iommu.h | 3 ++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 3 ++-
drivers/iommu/iommufd/selftest.c | 3 ++-
drivers/iommu/iommufd/viommu.c | 2 +-
5 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index bb39af84e6b0..7eed5c8c72dd 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1037,7 +1037,8 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type);
size_t arm_smmu_get_viommu_size(struct device *dev,
enum iommu_viommu_type viommu_type);
int arm_vsmmu_init(struct iommufd_viommu *viommu,
- struct iommu_domain *parent_domain);
+ struct iommu_domain *parent_domain,
+ const struct iommu_user_data *user_data);
int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
struct arm_smmu_nested_domain *nested_domain);
void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index fd7319706684..e06a0fbe4bc7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -700,7 +700,8 @@ struct iommu_ops {
size_t (*get_viommu_size)(struct device *dev,
enum iommu_viommu_type viommu_type);
int (*viommu_init)(struct iommufd_viommu *viommu,
- struct iommu_domain *parent_domain);
+ struct iommu_domain *parent_domain,
+ const struct iommu_user_data *user_data);
const struct iommu_domain_ops *default_domain_ops;
unsigned long pgsize_bitmap;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 69bbe39e28de..170d69162848 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -419,7 +419,8 @@ size_t arm_smmu_get_viommu_size(struct device *dev,
}
int arm_vsmmu_init(struct iommufd_viommu *viommu,
- struct iommu_domain *parent_domain)
+ struct iommu_domain *parent_domain,
+ const struct iommu_user_data *user_data)
{
struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
struct arm_smmu_device *smmu =
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 7a9abe3f47d5..0d896a89ace7 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -779,7 +779,8 @@ static size_t mock_get_viommu_size(struct device *dev,
}
static int mock_viommu_init(struct iommufd_viommu *viommu,
- struct iommu_domain *parent_domain)
+ struct iommu_domain *parent_domain,
+ const struct iommu_user_data *user_data)
{
struct mock_iommu_device *mock_iommu = container_of(
viommu->iommu_dev, struct mock_iommu_device, iommu_dev);
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index bc8796e6684e..2009a421efae 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -84,7 +84,7 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
*/
viommu->iommu_dev = __iommu_get_iommu_dev(idev->dev);
- rc = ops->viommu_init(viommu, hwpt_paging->common.domain);
+ rc = ops->viommu_init(viommu, hwpt_paging->common.domain, NULL);
if (rc)
goto out_put_hwpt;
--
2.43.0
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v7 06/28] iommufd/viommu: Allow driver-specific user data for a vIOMMU object
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (4 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 05/28] iommu: Pass in a driver-level user data structure to viommu_init op Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 07/28] iommufd/selftest: Support user_data in mock_viommu_alloc Nicolin Chen
` (21 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
The new type of vIOMMU for the tegra241-cmdqv driver needs driver-specific
user data. So, add data_len/data_uptr to the iommu_viommu_alloc uAPI and
pass them in via the viommu_init iommu op.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Acked-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/uapi/linux/iommufd.h | 6 ++++++
drivers/iommu/iommufd/viommu.c | 8 +++++++-
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index f29b6c44655e..272da7324a2b 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -965,6 +965,9 @@ enum iommu_viommu_type {
* @dev_id: The device's physical IOMMU will be used to back the virtual IOMMU
* @hwpt_id: ID of a nesting parent HWPT to associate to
* @out_viommu_id: Output virtual IOMMU ID for the allocated object
+ * @data_len: Length of the type specific data
+ * @__reserved: Must be 0
+ * @data_uptr: User pointer to a driver-specific virtual IOMMU data
*
* Allocate a virtual IOMMU object, representing the underlying physical IOMMU's
* virtualization support that is a security-isolated slice of the real IOMMU HW
@@ -985,6 +988,9 @@ struct iommu_viommu_alloc {
__u32 dev_id;
__u32 hwpt_id;
__u32 out_viommu_id;
+ __u32 data_len;
+ __u32 __reserved;
+ __aligned_u64 data_uptr;
};
#define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC)
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 2009a421efae..c0365849f849 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -17,6 +17,11 @@ void iommufd_viommu_destroy(struct iommufd_object *obj)
int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
{
struct iommu_viommu_alloc *cmd = ucmd->cmd;
+ const struct iommu_user_data user_data = {
+ .type = cmd->type,
+ .uptr = u64_to_user_ptr(cmd->data_uptr),
+ .len = cmd->data_len,
+ };
struct iommufd_hwpt_paging *hwpt_paging;
struct iommufd_viommu *viommu;
struct iommufd_device *idev;
@@ -84,7 +89,8 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd)
*/
viommu->iommu_dev = __iommu_get_iommu_dev(idev->dev);
- rc = ops->viommu_init(viommu, hwpt_paging->common.domain, NULL);
+ rc = ops->viommu_init(viommu, hwpt_paging->common.domain,
+ user_data.len ? &user_data : NULL);
if (rc)
goto out_put_hwpt;
--
2.43.0
* [PATCH v7 07/28] iommufd/selftest: Support user_data in mock_viommu_alloc
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (5 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 06/28] iommufd/viommu: Allow driver-specific user data for a vIOMMU object Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 08/28] iommufd/selftest: Add coverage for viommu data Nicolin Chen
` (20 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
Add a simple user_data for an input-to-output loopback test.
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_test.h | 13 +++++++++++++
drivers/iommu/iommufd/selftest.c | 15 +++++++++++++++
2 files changed, 28 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 1cd7e8394129..fbf9ecb35a13 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -227,6 +227,19 @@ struct iommu_hwpt_invalidate_selftest {
#define IOMMU_VIOMMU_TYPE_SELFTEST 0xdeadbeef
+/**
+ * struct iommu_viommu_selftest - vIOMMU data for Mock driver
+ * (IOMMU_VIOMMU_TYPE_SELFTEST)
+ * @in_data: Input random data from user space
+ * @out_data: Output data (matching @in_data) to user space
+ *
+ * Simply set @out_data=@in_data for a loopback test
+ */
+struct iommu_viommu_selftest {
+ __u32 in_data;
+ __u32 out_data;
+};
+
/* Should not be equal to any defined value in enum iommu_viommu_invalidate_data_type */
#define IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST 0xdeadbeef
#define IOMMU_VIOMMU_INVALIDATE_DATA_SELFTEST_INVALID 0xdadbeef
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 0d896a89ace7..38066dfeb2e7 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -784,6 +784,21 @@ static int mock_viommu_init(struct iommufd_viommu *viommu,
{
struct mock_iommu_device *mock_iommu = container_of(
viommu->iommu_dev, struct mock_iommu_device, iommu_dev);
+ struct iommu_viommu_selftest data;
+ int rc;
+
+ if (user_data) {
+ rc = iommu_copy_struct_from_user(
+ &data, user_data, IOMMU_VIOMMU_TYPE_SELFTEST, out_data);
+ if (rc)
+ return rc;
+
+ data.out_data = data.in_data;
+ rc = iommu_copy_struct_to_user(
+ user_data, &data, IOMMU_VIOMMU_TYPE_SELFTEST, out_data);
+ if (rc)
+ return rc;
+ }
refcount_inc(&mock_iommu->users);
viommu->ops = &mock_viommu_ops;
--
2.43.0
* [PATCH v7 08/28] iommufd/selftest: Add coverage for viommu data
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (6 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 07/28] iommufd/selftest: Support user_data in mock_viommu_alloc Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 09/28] iommufd/access: Add internal APIs for HW queue to use Nicolin Chen
` (19 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
Extend the existing test_cmd/err_viommu_alloc helpers to accept optional
user data, and add a TEST_F for a loopback test.
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
tools/testing/selftests/iommu/iommufd_utils.h | 21 ++++++++-----
tools/testing/selftests/iommu/iommufd.c | 30 ++++++++++++++-----
.../selftests/iommu/iommufd_fail_nth.c | 5 ++--
3 files changed, 39 insertions(+), 17 deletions(-)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 72f6636e5d90..a5d4cbd089ba 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -897,7 +897,8 @@ static int _test_cmd_trigger_iopf(int fd, __u32 device_id, __u32 pasid,
pasid, fault_fd))
static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id,
- __u32 type, __u32 flags, __u32 *viommu_id)
+ __u32 flags, __u32 type, void *data,
+ __u32 data_len, __u32 *viommu_id)
{
struct iommu_viommu_alloc cmd = {
.size = sizeof(cmd),
@@ -905,6 +906,8 @@ static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id,
.type = type,
.dev_id = device_id,
.hwpt_id = hwpt_id,
+ .data_uptr = (uint64_t)data,
+ .data_len = data_len,
};
int ret;
@@ -916,13 +919,15 @@ static int _test_cmd_viommu_alloc(int fd, __u32 device_id, __u32 hwpt_id,
return 0;
}
-#define test_cmd_viommu_alloc(device_id, hwpt_id, type, viommu_id) \
- ASSERT_EQ(0, _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, \
- type, 0, viommu_id))
-#define test_err_viommu_alloc(_errno, device_id, hwpt_id, type, viommu_id) \
- EXPECT_ERRNO(_errno, \
- _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, \
- type, 0, viommu_id))
+#define test_cmd_viommu_alloc(device_id, hwpt_id, type, data, data_len, \
+ viommu_id) \
+ ASSERT_EQ(0, _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, 0, \
+ type, data, data_len, viommu_id))
+#define test_err_viommu_alloc(_errno, device_id, hwpt_id, type, data, \
+ data_len, viommu_id) \
+ EXPECT_ERRNO(_errno, \
+ _test_cmd_viommu_alloc(self->fd, device_id, hwpt_id, 0, \
+ type, data, data_len, viommu_id))
static int _test_cmd_vdevice_alloc(int fd, __u32 viommu_id, __u32 idev_id,
__u64 virt_id, __u32 *vdev_id)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 1a8e85afe9aa..f22388dfacad 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -2688,7 +2688,7 @@ FIXTURE_SETUP(iommufd_viommu)
/* Allocate a vIOMMU taking refcount of the parent hwpt */
test_cmd_viommu_alloc(self->device_id, self->hwpt_id,
- IOMMU_VIOMMU_TYPE_SELFTEST,
+ IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0,
&self->viommu_id);
/* Allocate a regular nested hwpt */
@@ -2727,24 +2727,27 @@ TEST_F(iommufd_viommu, viommu_negative_tests)
if (self->device_id) {
/* Negative test -- invalid hwpt (hwpt_id=0) */
test_err_viommu_alloc(ENOENT, device_id, 0,
- IOMMU_VIOMMU_TYPE_SELFTEST, NULL);
+ IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0,
+ NULL);
/* Negative test -- not a nesting parent hwpt */
test_cmd_hwpt_alloc(device_id, ioas_id, 0, &hwpt_id);
test_err_viommu_alloc(EINVAL, device_id, hwpt_id,
- IOMMU_VIOMMU_TYPE_SELFTEST, NULL);
+ IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0,
+ NULL);
test_ioctl_destroy(hwpt_id);
/* Negative test -- unsupported viommu type */
test_err_viommu_alloc(EOPNOTSUPP, device_id, self->hwpt_id,
- 0xdead, NULL);
+ 0xdead, NULL, 0, NULL);
EXPECT_ERRNO(EBUSY,
_test_ioctl_destroy(self->fd, self->hwpt_id));
EXPECT_ERRNO(EBUSY,
_test_ioctl_destroy(self->fd, self->viommu_id));
} else {
test_err_viommu_alloc(ENOENT, self->device_id, self->hwpt_id,
- IOMMU_VIOMMU_TYPE_SELFTEST, NULL);
+ IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0,
+ NULL);
}
}
@@ -2791,6 +2794,20 @@ TEST_F(iommufd_viommu, viommu_alloc_nested_iopf)
}
}
+TEST_F(iommufd_viommu, viommu_alloc_with_data)
+{
+ struct iommu_viommu_selftest data = {
+ .in_data = 0xbeef,
+ };
+
+ if (self->device_id) {
+ test_cmd_viommu_alloc(self->device_id, self->hwpt_id,
+ IOMMU_VIOMMU_TYPE_SELFTEST, &data,
+ sizeof(data), &self->viommu_id);
+ ASSERT_EQ(data.out_data, data.in_data);
+ }
+}
+
TEST_F(iommufd_viommu, vdevice_alloc)
{
uint32_t viommu_id = self->viommu_id;
@@ -3105,8 +3122,7 @@ TEST_F(iommufd_device_pasid, pasid_attach)
/* Allocate a regular nested hwpt based on viommu */
test_cmd_viommu_alloc(self->device_id, parent_hwpt_id,
- IOMMU_VIOMMU_TYPE_SELFTEST,
- &viommu_id);
+ IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0, &viommu_id);
test_cmd_hwpt_alloc_nested(self->device_id, viommu_id,
IOMMU_HWPT_ALLOC_PASID,
&nested_hwpt_id[2],
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index e11ec4b121fc..f7ccf1822108 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -688,8 +688,9 @@ TEST_FAIL_NTH(basic_fail_nth, device)
IOMMU_HWPT_DATA_NONE, 0, 0))
return -1;
- if (_test_cmd_viommu_alloc(self->fd, idev_id, hwpt_id,
- IOMMU_VIOMMU_TYPE_SELFTEST, 0, &viommu_id))
+ if (_test_cmd_viommu_alloc(self->fd, idev_id, hwpt_id, 0,
+ IOMMU_VIOMMU_TYPE_SELFTEST, NULL, 0,
+ &viommu_id))
return -1;
if (_test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, 0, &vdev_id))
--
2.43.0
* [PATCH v7 09/28] iommufd/access: Add internal APIs for HW queue to use
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (7 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 08/28] iommufd/selftest: Add coverage for viommu data Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-07-02 9:42 ` Tian, Kevin
2025-07-04 13:08 ` Jason Gunthorpe
2025-06-26 19:34 ` [PATCH v7 10/28] iommufd/access: Bypass access->ops->unmap for internal use Nicolin Chen
` (18 subsequent siblings)
27 siblings, 2 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
The new HW queue object, as an internal iommufd object, wants to reuse the
struct iommufd_access to pin some iova range in the iopt.
However, an access normally takes a refcount on the ictx. In this internal
case, that creates a reference cycle: releasing the ictx has to wait for
the release of the hw_queue object, which waits for the release of the
access, which in turn holds a refcount on the ictx:
ictx --releases--> hw_queue --releases--> access
^ |
|_________________releases________________v
To address this, add a set of lightweight internal APIs to unlink the ictx
and the access, i.e. no ictx refcounting by the access:
ictx --releases--> hw_queue --releases--> access
Then, there's no point in setting access->ictx. So simply treat !ictx as
the flag for internal use and add an inline helper.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 23 ++++++++++
drivers/iommu/iommufd/device.c | 59 +++++++++++++++++++++----
2 files changed, 73 insertions(+), 9 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 09f895638f68..9d1f55deb9ca 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -484,6 +484,29 @@ void iopt_remove_access(struct io_pagetable *iopt,
struct iommufd_access *access, u32 iopt_access_list_id);
void iommufd_access_destroy_object(struct iommufd_object *obj);
+/* iommufd_access for internal use */
+static inline bool iommufd_access_is_internal(struct iommufd_access *access)
+{
+ return !access->ictx;
+}
+
+struct iommufd_access *iommufd_access_create_internal(struct iommufd_ctx *ictx);
+
+static inline void
+iommufd_access_destroy_internal(struct iommufd_ctx *ictx,
+ struct iommufd_access *access)
+{
+ iommufd_object_destroy_user(ictx, &access->obj);
+}
+
+int iommufd_access_attach_internal(struct iommufd_access *access,
+ struct iommufd_ioas *ioas);
+
+static inline void iommufd_access_detach_internal(struct iommufd_access *access)
+{
+ iommufd_access_detach(access);
+}
+
struct iommufd_eventq {
struct iommufd_object obj;
struct iommufd_ctx *ictx;
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index e9b6ca47095c..07a4ff753c12 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -1084,7 +1084,39 @@ void iommufd_access_destroy_object(struct iommufd_object *obj)
if (access->ioas)
WARN_ON(iommufd_access_change_ioas(access, NULL));
mutex_unlock(&access->ioas_lock);
- iommufd_ctx_put(access->ictx);
+ if (!iommufd_access_is_internal(access))
+ iommufd_ctx_put(access->ictx);
+}
+
+static struct iommufd_access *__iommufd_access_create(struct iommufd_ctx *ictx)
+{
+ struct iommufd_access *access;
+
+ /*
+ * There is no uAPI for the access object, but to keep things symmetric
+ * use the object infrastructure anyhow.
+ */
+ access = iommufd_object_alloc(ictx, access, IOMMUFD_OBJ_ACCESS);
+ if (IS_ERR(access))
+ return access;
+
+ /* The calling driver is a user until iommufd_access_destroy() */
+ refcount_inc(&access->obj.users);
+ mutex_init(&access->ioas_lock);
+ return access;
+}
+
+struct iommufd_access *iommufd_access_create_internal(struct iommufd_ctx *ictx)
+{
+ struct iommufd_access *access;
+
+ access = __iommufd_access_create(ictx);
+ if (IS_ERR(access))
+ return access;
+ access->iova_alignment = PAGE_SIZE;
+
+ iommufd_object_finalize(ictx, &access->obj);
+ return access;
}
/**
@@ -1106,11 +1138,7 @@ iommufd_access_create(struct iommufd_ctx *ictx,
{
struct iommufd_access *access;
- /*
- * There is no uAPI for the access object, but to keep things symmetric
- * use the object infrastructure anyhow.
- */
- access = iommufd_object_alloc(ictx, access, IOMMUFD_OBJ_ACCESS);
+ access = __iommufd_access_create(ictx);
if (IS_ERR(access))
return access;
@@ -1122,13 +1150,10 @@ iommufd_access_create(struct iommufd_ctx *ictx,
else
access->iova_alignment = 1;
- /* The calling driver is a user until iommufd_access_destroy() */
- refcount_inc(&access->obj.users);
access->ictx = ictx;
iommufd_ctx_get(ictx);
iommufd_object_finalize(ictx, &access->obj);
*id = access->obj.id;
- mutex_init(&access->ioas_lock);
return access;
}
EXPORT_SYMBOL_NS_GPL(iommufd_access_create, "IOMMUFD");
@@ -1173,6 +1198,22 @@ int iommufd_access_attach(struct iommufd_access *access, u32 ioas_id)
}
EXPORT_SYMBOL_NS_GPL(iommufd_access_attach, "IOMMUFD");
+int iommufd_access_attach_internal(struct iommufd_access *access,
+ struct iommufd_ioas *ioas)
+{
+ int rc;
+
+ mutex_lock(&access->ioas_lock);
+ if (WARN_ON(access->ioas)) {
+ mutex_unlock(&access->ioas_lock);
+ return -EINVAL;
+ }
+
+ rc = iommufd_access_change_ioas(access, ioas);
+ mutex_unlock(&access->ioas_lock);
+ return rc;
+}
+
int iommufd_access_replace(struct iommufd_access *access, u32 ioas_id)
{
int rc;
--
2.43.0
* [PATCH v7 10/28] iommufd/access: Bypass access->ops->unmap for internal use
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (8 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 09/28] iommufd/access: Add internal APIs for HW queue to use Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-07-02 9:45 ` Tian, Kevin
2025-06-26 19:34 ` [PATCH v7 11/28] iommufd/viommu: Add driver-defined vDEVICE support Nicolin Chen
` (17 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
The access object has been used externally by VFIO mdev devices, allowing
them to pin/unpin physical pages (via needs_pin_pages). Meanwhile, a racy
unmap can occur in this case, so these devices usually implement an unmap
handler, invoked by iommufd_access_notify_unmap().
The new HW queue object will need the same pin/unpin feature, although it
(unlike the mdev case) wants to reject any unmap attempt during its life
cycle, so it does not implement an unmap handler. Thus, bypass any
access->ops->unmap call when the access is marked as internal. Also,
error out in iommufd_access_notify_unmap() when an internal access is
present, rejecting the unmap operation and propagating the errno upwards.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 4 ++--
drivers/iommu/iommufd/device.c | 21 +++++++++++++++++----
drivers/iommu/iommufd/io_pagetable.c | 10 +++++++++-
3 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 9d1f55deb9ca..b849099e804b 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -111,8 +111,8 @@ int iopt_read_and_clear_dirty_data(struct io_pagetable *iopt,
int iopt_set_dirty_tracking(struct io_pagetable *iopt,
struct iommu_domain *domain, bool enable);
-void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
- unsigned long length);
+int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+ unsigned long length);
int iopt_table_add_domain(struct io_pagetable *iopt,
struct iommu_domain *domain);
void iopt_table_remove_domain(struct io_pagetable *iopt,
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 07a4ff753c12..8f078fda795a 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -1048,7 +1048,7 @@ static int iommufd_access_change_ioas(struct iommufd_access *access,
}
if (cur_ioas) {
- if (access->ops->unmap) {
+ if (!iommufd_access_is_internal(access) && access->ops->unmap) {
mutex_unlock(&access->ioas_lock);
access->ops->unmap(access->data, 0, ULONG_MAX);
mutex_lock(&access->ioas_lock);
@@ -1245,15 +1245,24 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_replace, "IOMMUFD");
* run in the future. Due to this a driver must not create locking that prevents
* unmap to complete while iommufd_access_destroy() is running.
*/
-void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
- unsigned long length)
+int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
+ unsigned long length)
{
struct iommufd_ioas *ioas =
container_of(iopt, struct iommufd_ioas, iopt);
struct iommufd_access *access;
unsigned long index;
+ int ret = 0;
xa_lock(&ioas->iopt.access_list);
+ /* Bypass any unmap if there is an internal access */
+ xa_for_each(&ioas->iopt.access_list, index, access) {
+ if (iommufd_access_is_internal(access)) {
+ ret = -EBUSY;
+ goto unlock;
+ }
+ }
+
xa_for_each(&ioas->iopt.access_list, index, access) {
if (!iommufd_lock_obj(&access->obj))
continue;
@@ -1264,7 +1273,9 @@ void iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long iova,
iommufd_put_object(access->ictx, &access->obj);
xa_lock(&ioas->iopt.access_list);
}
+unlock:
xa_unlock(&ioas->iopt.access_list);
+ return ret;
}
/**
@@ -1362,7 +1373,9 @@ int iommufd_access_pin_pages(struct iommufd_access *access, unsigned long iova,
/* Driver's ops don't support pin_pages */
if (IS_ENABLED(CONFIG_IOMMUFD_TEST) &&
- WARN_ON(access->iova_alignment != PAGE_SIZE || !access->ops->unmap))
+ WARN_ON(access->iova_alignment != PAGE_SIZE ||
+ (!iommufd_access_is_internal(access) &&
+ !access->ops->unmap)))
return -EINVAL;
if (!length)
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 22fc3a12109f..6b8477b1f94b 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -740,7 +740,15 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
up_write(&iopt->iova_rwsem);
up_read(&iopt->domains_rwsem);
- iommufd_access_notify_unmap(iopt, area_first, length);
+ rc = iommufd_access_notify_unmap(iopt, area_first,
+ length);
+ if (rc) {
+ down_read(&iopt->domains_rwsem);
+ down_write(&iopt->iova_rwsem);
+ area->prevent_access = false;
+ goto out_unlock_iova;
+ }
+
/* Something is not responding to unmap requests. */
tries++;
if (WARN_ON(tries > 100)) {
--
2.43.0
* [PATCH v7 11/28] iommufd/viommu: Add driver-defined vDEVICE support
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (9 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 10/28] iommufd/access: Bypass access->ops->unmap for internal use Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 12/28] iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct Nicolin Chen
` (16 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
The NVIDIA VCMDQ driver will have a driver-defined vDEVICE structure and
will do some HW configuration with it.
To allow IOMMU drivers to define their own vDEVICE structures, move the
struct iommufd_vdevice to the public header and provide a pair of viommu
ops, similar to get_viommu_size and viommu_init.
Doing this, however, opens a new window between the vDEVICE allocation
and its driver-level initialization, during which an abort could happen;
the abort path can't invoke a driver destroy function from the struct
viommu_ops, since the driver structure isn't initialized yet. The vIOMMU
object doesn't have this problem, since its destroy op is set via the
viommu_ops by the driver's viommu_init function. Thus, have vDEVICE do
something similar: add a destroy function pointer inside the struct
iommufd_vdevice instead of the struct iommufd_viommu_ops.
Note that a type-dependent vDEVICE is an unlikely use case, so a static
vdevice_size is probably enough for the near term, instead of a
get_vdevice_size function op.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 12 ----------
include/linux/iommufd.h | 31 +++++++++++++++++++++++++
drivers/iommu/iommufd/viommu.c | 26 ++++++++++++++++++++-
3 files changed, 56 insertions(+), 13 deletions(-)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index b849099e804b..4db8c6890f4c 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -653,18 +653,6 @@ void iommufd_viommu_destroy(struct iommufd_object *obj);
int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd);
void iommufd_vdevice_destroy(struct iommufd_object *obj);
-struct iommufd_vdevice {
- struct iommufd_object obj;
- struct iommufd_viommu *viommu;
- struct device *dev;
-
- /*
- * Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID of
- * AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
- */
- u64 virt_id;
-};
-
#ifdef CONFIG_IOMMUFD_TEST
int iommufd_test(struct iommufd_ucmd *ucmd);
void iommufd_selftest_destroy(struct iommufd_object *obj);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 2d1bf2f97ee3..3709b264af28 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -104,6 +104,21 @@ struct iommufd_viommu {
enum iommu_viommu_type type;
};
+struct iommufd_vdevice {
+ struct iommufd_object obj;
+ struct iommufd_viommu *viommu;
+ struct device *dev;
+
+ /*
+ * Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID of
+ * AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
+ */
+ u64 virt_id;
+
+ /* Clean up all driver-specific parts of an iommufd_vdevice */
+ void (*destroy)(struct iommufd_vdevice *vdev);
+};
+
/**
* struct iommufd_viommu_ops - vIOMMU specific operations
* @destroy: Clean up all driver-specific parts of an iommufd_viommu. The memory
@@ -120,6 +135,14 @@ struct iommufd_viommu {
* array->entry_num to report the number of handled requests.
* The data structure of the array entry must be defined in
* include/uapi/linux/iommufd.h
+ * @vdevice_size: Size of the driver-defined vDEVICE structure per this vIOMMU
+ * @vdevice_init: Initialize the driver-level structure of a vDEVICE object, or
+ * related HW procedure. @vdev is already initialized by iommufd
+ * core: vdev->dev and vdev->viommu pointers; vdev->id carries a
+ * per-vIOMMU virtual ID (refer to struct iommu_vdevice_alloc in
+ * include/uapi/linux/iommufd.h)
+ * If driver has a deinit function to revert what vdevice_init op
+ * does, it should set it to the @vdev->destroy function pointer
*/
struct iommufd_viommu_ops {
void (*destroy)(struct iommufd_viommu *viommu);
@@ -128,6 +151,8 @@ struct iommufd_viommu_ops {
const struct iommu_user_data *user_data);
int (*cache_invalidate)(struct iommufd_viommu *viommu,
struct iommu_user_data_array *array);
+ const size_t vdevice_size;
+ int (*vdevice_init)(struct iommufd_vdevice *vdev);
};
#if IS_ENABLED(CONFIG_IOMMUFD)
@@ -224,4 +249,10 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
BUILD_BUG_ON_ZERO(offsetof(drv_struct, member)) + \
BUILD_BUG_ON_ZERO(!__same_type(struct iommufd_viommu, \
((drv_struct *)NULL)->member)))
+
+#define VDEVICE_STRUCT_SIZE(drv_struct, member) \
+ (sizeof(drv_struct) + \
+ BUILD_BUG_ON_ZERO(offsetof(drv_struct, member)) + \
+ BUILD_BUG_ON_ZERO(!__same_type(struct iommufd_vdevice, \
+ ((drv_struct *)NULL)->member)))
#endif
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index c0365849f849..081ee6697a11 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -116,6 +116,8 @@ void iommufd_vdevice_destroy(struct iommufd_object *obj)
container_of(obj, struct iommufd_vdevice, obj);
struct iommufd_viommu *viommu = vdev->viommu;
+ if (vdev->destroy)
+ vdev->destroy(vdev);
/* xa_cmpxchg is okay to fail if alloc failed xa_cmpxchg previously */
xa_cmpxchg(&viommu->vdevs, vdev->virt_id, vdev, NULL, GFP_KERNEL);
refcount_dec(&viommu->obj.users);
@@ -126,6 +128,7 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd)
{
struct iommu_vdevice_alloc *cmd = ucmd->cmd;
struct iommufd_vdevice *vdev, *curr;
+ size_t vdev_size = sizeof(*vdev);
struct iommufd_viommu *viommu;
struct iommufd_device *idev;
u64 virt_id = cmd->virt_id;
@@ -150,7 +153,22 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd)
goto out_put_idev;
}
- vdev = iommufd_object_alloc_ucmd(ucmd, vdev, IOMMUFD_OBJ_VDEVICE);
+ if (viommu->ops && viommu->ops->vdevice_size) {
+ /*
+ * It is a driver bug for:
+ * - ops->vdevice_size smaller than the core structure size
+ * - not implementing a pairing ops->vdevice_init op
+ */
+ if (WARN_ON_ONCE(viommu->ops->vdevice_size < vdev_size ||
+ !viommu->ops->vdevice_init)) {
+ rc = -EOPNOTSUPP;
+ goto out_put_idev;
+ }
+ vdev_size = viommu->ops->vdevice_size;
+ }
+
+ vdev = (struct iommufd_vdevice *)_iommufd_object_alloc_ucmd(
+ ucmd, vdev_size, IOMMUFD_OBJ_VDEVICE);
if (IS_ERR(vdev)) {
rc = PTR_ERR(vdev);
goto out_put_idev;
@@ -168,6 +186,12 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd)
goto out_put_idev;
}
+ if (viommu->ops && viommu->ops->vdevice_init) {
+ rc = viommu->ops->vdevice_init(vdev);
+ if (rc)
+ goto out_put_idev;
+ }
+
cmd->out_vdevice_id = vdev->obj.id;
rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
--
2.43.0
* [PATCH v7 12/28] iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (10 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 11/28] iommufd/viommu: Add driver-defined vDEVICE support Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 13/28] iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl Nicolin Chen
` (15 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
Add IOMMUFD_OBJ_HW_QUEUE with an iommufd_hw_queue structure, representing
a HW-accelerated type of IOMMU physical queue that can be passed through
to a user space VM for direct hardware control, such as:
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffers, and PPR Log Buffers
Add new viommu ops for iommufd to communicate with IOMMU drivers, to fetch
the supported HW queue structure size and to forward user space ioctls to
the IOMMU drivers for initialization/destruction.
Among the existing HWs, NVIDIA's VCMDQs access the guest memory via
physical addresses, while AMD's Buffers access the guest memory via guest
physical addresses (i.e. the iova of the nesting parent HWPT). Provide
two mutually exclusive ops, hw_queue_init and hw_queue_init_phys, to
indicate whether a vIOMMU HW accesses the guest queue in the guest
physical space (via iova) or in the host physical space (via pa).
In the latter case, the iommufd core will validate the physical pages of
a given guest queue, to ensure the underlying physical pages are
contiguous and pinned.
Since this is introduced with NVIDIA's VCMDQs, add hw_queue_init_phys for
now, and leave some notes for hw_queue_init in the near future (for AMD).
Both NVIDIA's and AMD's HWs follow a multi-queue model: NVIDIA's will
have only one type in enum iommu_hw_queue_type, while AMD's will have
three different types (two of which will have multiple queues). Compared
to letting the core manage multiple queues of three types per vIOMMU
object, it is easier for the driver to manage them by keeping three
driver-structure arrays per vIOMMU object. Thus, pass the index into the
init op.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommufd.h | 42 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/iommufd.h | 9 ++++++++
2 files changed, 51 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 3709b264af28..808536ed97e0 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -37,6 +37,7 @@ enum iommufd_object_type {
IOMMUFD_OBJ_VIOMMU,
IOMMUFD_OBJ_VDEVICE,
IOMMUFD_OBJ_VEVENTQ,
+ IOMMUFD_OBJ_HW_QUEUE,
#ifdef CONFIG_IOMMUFD_TEST
IOMMUFD_OBJ_SELFTEST,
#endif
@@ -119,6 +120,19 @@ struct iommufd_vdevice {
void (*destroy)(struct iommufd_vdevice *vdev);
};
+struct iommufd_hw_queue {
+ struct iommufd_object obj;
+ struct iommufd_viommu *viommu;
+
+ u64 base_addr; /* in guest physical address space */
+ size_t length;
+
+ enum iommu_hw_queue_type type;
+
+ /* Clean up all driver-specific parts of an iommufd_hw_queue */
+ void (*destroy)(struct iommufd_hw_queue *hw_queue);
+};
+
/**
* struct iommufd_viommu_ops - vIOMMU specific operations
* @destroy: Clean up all driver-specific parts of an iommufd_viommu. The memory
@@ -143,6 +157,22 @@ struct iommufd_vdevice {
* include/uapi/linux/iommufd.h)
* If driver has a deinit function to revert what vdevice_init op
* does, it should set it to the @vdev->destroy function pointer
+ * @get_hw_queue_size: Get the size of a driver-defined HW queue structure for a
+ * given @viommu corresponding to @queue_type. Driver should
 + * return 0 if the HW queue type isn't supported. It is
+ * required for driver to use the HW_QUEUE_STRUCT_SIZE macro
+ * to sanitize the driver-level HW queue structure related
+ * to the core one
+ * @hw_queue_init_phys: Initialize the driver-level structure of a HW queue that
+ * is initialized with its core-level structure that holds
+ * all the info about a guest queue memory.
+ * Driver providing this op indicates that HW accesses the
+ * guest queue memory via physical addresses.
+ * @index carries the logical HW QUEUE ID per vIOMMU in a
+ * guest VM, for a multi-queue model. @base_addr_pa carries
+ * the physical location of the guest queue
+ * If driver has a deinit function to revert what this op
+ * does, it should set it to the @hw_queue->destroy pointer
*/
struct iommufd_viommu_ops {
void (*destroy)(struct iommufd_viommu *viommu);
@@ -153,6 +183,11 @@ struct iommufd_viommu_ops {
struct iommu_user_data_array *array);
const size_t vdevice_size;
int (*vdevice_init)(struct iommufd_vdevice *vdev);
+ size_t (*get_hw_queue_size)(struct iommufd_viommu *viommu,
+ enum iommu_hw_queue_type queue_type);
+ /* AMD's HW will add hw_queue_init simply using @hw_queue->base_addr */
+ int (*hw_queue_init_phys)(struct iommufd_hw_queue *hw_queue, u32 index,
+ phys_addr_t base_addr_pa);
};
#if IS_ENABLED(CONFIG_IOMMUFD)
@@ -255,4 +290,11 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
BUILD_BUG_ON_ZERO(offsetof(drv_struct, member)) + \
BUILD_BUG_ON_ZERO(!__same_type(struct iommufd_vdevice, \
((drv_struct *)NULL)->member)))
+
+#define HW_QUEUE_STRUCT_SIZE(drv_struct, member) \
+ (sizeof(drv_struct) + \
+ BUILD_BUG_ON_ZERO(offsetof(drv_struct, member)) + \
+ BUILD_BUG_ON_ZERO(!__same_type(struct iommufd_hw_queue, \
+ ((drv_struct *)NULL)->member)))
+
#endif
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 272da7324a2b..971aa19c0ba1 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -1147,4 +1147,13 @@ struct iommu_veventq_alloc {
__u32 __reserved;
};
#define IOMMU_VEVENTQ_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VEVENTQ_ALLOC)
+
+/**
+ * enum iommu_hw_queue_type - HW Queue Type
+ * @IOMMU_HW_QUEUE_TYPE_DEFAULT: Reserved for future use
+ */
+enum iommu_hw_queue_type {
+ IOMMU_HW_QUEUE_TYPE_DEFAULT = 0,
+};
+
#endif
--
2.43.0
* [PATCH v7 13/28] iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (11 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 12/28] iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-07-04 13:26 ` Jason Gunthorpe
2025-06-26 19:34 ` [PATCH v7 14/28] iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers Nicolin Chen
` (14 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
Introduce a new IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl for user space to allocate
a HW QUEUE object for a vIOMMU specific HW-accelerated queue, e.g.:
- NVIDIA's Virtual Command Queue
- AMD vIOMMU's Command Buffer, Event Log Buffers, and PPR Log Buffers
Since this is introduced with NVIDIA's VCMDQs that access the guest
memory in the physical address space, add an iommufd_hw_queue_alloc_phys()
helper that creates an access object to the queue memory in the IOAS, to
prevent the mappings of the guest memory from being unmapped during the
life cycle of the HW queue object.
AMD's HW will need a hw_queue_init op that is mutually exclusive with the
hw_queue_init_phys op, and its case will bypass the access part, i.e. no
iommufd_hw_queue_alloc_phys() call.
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 2 +
include/linux/iommufd.h | 1 +
include/uapi/linux/iommufd.h | 33 +++++
drivers/iommu/iommufd/main.c | 6 +
drivers/iommu/iommufd/viommu.c | 170 ++++++++++++++++++++++++
5 files changed, 212 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 4db8c6890f4c..936387bcd0d0 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -652,6 +652,8 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd);
void iommufd_viommu_destroy(struct iommufd_object *obj);
int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd);
void iommufd_vdevice_destroy(struct iommufd_object *obj);
+int iommufd_hw_queue_alloc_ioctl(struct iommufd_ucmd *ucmd);
+void iommufd_hw_queue_destroy(struct iommufd_object *obj);
#ifdef CONFIG_IOMMUFD_TEST
int iommufd_test(struct iommufd_ucmd *ucmd);
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 808536ed97e0..c98edfce5081 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -123,6 +123,7 @@ struct iommufd_vdevice {
struct iommufd_hw_queue {
struct iommufd_object obj;
struct iommufd_viommu *viommu;
+ struct iommufd_access *access;
u64 base_addr; /* in guest physical address space */
size_t length;
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 971aa19c0ba1..f091ea072c5f 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -56,6 +56,7 @@ enum {
IOMMUFD_CMD_VDEVICE_ALLOC = 0x91,
IOMMUFD_CMD_IOAS_CHANGE_PROCESS = 0x92,
IOMMUFD_CMD_VEVENTQ_ALLOC = 0x93,
+ IOMMUFD_CMD_HW_QUEUE_ALLOC = 0x94,
};
/**
@@ -1156,4 +1157,36 @@ enum iommu_hw_queue_type {
IOMMU_HW_QUEUE_TYPE_DEFAULT = 0,
};
+/**
+ * struct iommu_hw_queue_alloc - ioctl(IOMMU_HW_QUEUE_ALLOC)
+ * @size: sizeof(struct iommu_hw_queue_alloc)
+ * @flags: Must be 0
+ * @viommu_id: Virtual IOMMU ID to associate the HW queue with
+ * @type: One of enum iommu_hw_queue_type
+ * @index: The logical index to the HW queue per virtual IOMMU for a multi-queue
+ * model
+ * @out_hw_queue_id: The ID of the new HW queue
+ * @nesting_parent_iova: Base address of the queue memory in the guest physical
+ * address space
+ * @length: Length of the queue memory
+ *
+ * Allocate a HW queue object for a vIOMMU-specific HW-accelerated queue, which
+ * allows HW to access a guest queue memory described using @nesting_parent_iova
+ * and @length.
+ *
+ * A vIOMMU can allocate multiple queues, but it must use a different @index per
+ * type to separate each allocation, e.g.
+ * Type1 HW queue0, Type1 HW queue1, Type2 HW queue0, ...
+ */
+struct iommu_hw_queue_alloc {
+ __u32 size;
+ __u32 flags;
+ __u32 viommu_id;
+ __u32 type;
+ __u32 index;
+ __u32 out_hw_queue_id;
+ __aligned_u64 nesting_parent_iova;
+ __aligned_u64 length;
+};
+#define IOMMU_HW_QUEUE_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_HW_QUEUE_ALLOC)
#endif
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 778694d7c207..4e8dbbfac890 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -354,6 +354,7 @@ union ucmd_buffer {
struct iommu_destroy destroy;
struct iommu_fault_alloc fault;
struct iommu_hw_info info;
+ struct iommu_hw_queue_alloc hw_queue;
struct iommu_hwpt_alloc hwpt;
struct iommu_hwpt_get_dirty_bitmap get_dirty_bitmap;
struct iommu_hwpt_invalidate cache;
@@ -396,6 +397,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = {
struct iommu_fault_alloc, out_fault_fd),
IOCTL_OP(IOMMU_GET_HW_INFO, iommufd_get_hw_info, struct iommu_hw_info,
__reserved),
+ IOCTL_OP(IOMMU_HW_QUEUE_ALLOC, iommufd_hw_queue_alloc_ioctl,
+ struct iommu_hw_queue_alloc, length),
IOCTL_OP(IOMMU_HWPT_ALLOC, iommufd_hwpt_alloc, struct iommu_hwpt_alloc,
__reserved),
IOCTL_OP(IOMMU_HWPT_GET_DIRTY_BITMAP, iommufd_hwpt_get_dirty_bitmap,
@@ -559,6 +562,9 @@ static const struct iommufd_object_ops iommufd_object_ops[] = {
[IOMMUFD_OBJ_FAULT] = {
.destroy = iommufd_fault_destroy,
},
+ [IOMMUFD_OBJ_HW_QUEUE] = {
+ .destroy = iommufd_hw_queue_destroy,
+ },
[IOMMUFD_OBJ_HWPT_PAGING] = {
.destroy = iommufd_hwpt_paging_destroy,
.abort = iommufd_hwpt_paging_abort,
diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
index 081ee6697a11..ce509a827721 100644
--- a/drivers/iommu/iommufd/viommu.c
+++ b/drivers/iommu/iommufd/viommu.c
@@ -201,3 +201,173 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd)
iommufd_put_object(ucmd->ictx, &viommu->obj);
return rc;
}
+
+static void iommufd_hw_queue_destroy_access(struct iommufd_ctx *ictx,
+ struct iommufd_access *access,
+ u64 base_iova, size_t length)
+{
+ iommufd_access_unpin_pages(access, base_iova, length);
+ iommufd_access_detach_internal(access);
+ iommufd_access_destroy_internal(ictx, access);
+}
+
+void iommufd_hw_queue_destroy(struct iommufd_object *obj)
+{
+ struct iommufd_hw_queue *hw_queue =
+ container_of(obj, struct iommufd_hw_queue, obj);
+
+ if (hw_queue->destroy)
+ hw_queue->destroy(hw_queue);
+ if (hw_queue->access)
+ iommufd_hw_queue_destroy_access(hw_queue->viommu->ictx,
+ hw_queue->access,
+ hw_queue->base_addr,
+ hw_queue->length);
+ if (hw_queue->viommu)
+ refcount_dec(&hw_queue->viommu->obj.users);
+}
+
+/*
+ * When the HW accesses the guest queue via physical addresses, the underlying
+ * physical pages of the guest queue must be contiguous. Also, for the security
+ * concern that IOMMUFD_CMD_IOAS_UNMAP could potentially remove the mappings of
+ * the guest queue from the nesting parent iopt while the HW is still accessing
+ * the guest queue memory physically, such a HW queue must require an access to
+ * pin the underlying pages and prevent that from happening.
+ */
+static struct iommufd_access *
+iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd,
+ struct iommufd_viommu *viommu, phys_addr_t *base_pa)
+{
+ struct iommufd_access *access;
+ struct page **pages;
+ int max_npages, i;
+ u64 offset;
+ int rc;
+
+ offset = cmd->nesting_parent_iova -
+ PAGE_ALIGN_DOWN(cmd->nesting_parent_iova);
+ max_npages = DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE);
+
+ /*
+ * Use kvcalloc() to avoid memory fragmentation for a large page array.
+ * Set __GFP_NOWARN to avoid syzkaller blowups
+ */
+ pages = kvcalloc(max_npages, sizeof(*pages), GFP_KERNEL | __GFP_NOWARN);
+ if (!pages)
+ return ERR_PTR(-ENOMEM);
+
+ access = iommufd_access_create_internal(viommu->ictx);
+ if (IS_ERR(access)) {
+ rc = PTR_ERR(access);
+ goto out_free;
+ }
+
+ rc = iommufd_access_attach_internal(access, viommu->hwpt->ioas);
+ if (rc)
+ goto out_destroy;
+
+ rc = iommufd_access_pin_pages(access, cmd->nesting_parent_iova,
+ cmd->length, pages, 0);
+ if (rc)
+ goto out_detach;
+
+ /* Validate if the underlying physical pages are contiguous */
+ for (i = 1; i < max_npages; i++) {
+ if (page_to_pfn(pages[i]) == page_to_pfn(pages[i - 1]) + 1)
+ continue;
+ rc = -EFAULT;
+ goto out_unpin;
+ }
+
+ *base_pa = page_to_pfn(pages[0]) << PAGE_SHIFT;
+ kvfree(pages);
+ return access;
+
+out_unpin:
+ iommufd_access_unpin_pages(access, cmd->nesting_parent_iova,
+ cmd->length);
+out_detach:
+ iommufd_access_detach_internal(access);
+out_destroy:
+ iommufd_access_destroy_internal(viommu->ictx, access);
+out_free:
+ kvfree(pages);
+ return ERR_PTR(rc);
+}
+
+int iommufd_hw_queue_alloc_ioctl(struct iommufd_ucmd *ucmd)
+{
+ struct iommu_hw_queue_alloc *cmd = ucmd->cmd;
+ struct iommufd_hw_queue *hw_queue;
+ struct iommufd_viommu *viommu;
+ struct iommufd_access *access;
+ size_t hw_queue_size;
+ phys_addr_t base_pa;
+ u64 last;
+ int rc;
+
+ if (cmd->flags || cmd->type == IOMMU_HW_QUEUE_TYPE_DEFAULT)
+ return -EOPNOTSUPP;
+ if (!cmd->length)
+ return -EINVAL;
+ if (check_add_overflow(cmd->nesting_parent_iova, cmd->length - 1,
+ &last))
+ return -EOVERFLOW;
+
+ viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
+ if (IS_ERR(viommu))
+ return PTR_ERR(viommu);
+
+ if (!viommu->ops || !viommu->ops->get_hw_queue_size ||
+ !viommu->ops->hw_queue_init_phys) {
+ rc = -EOPNOTSUPP;
+ goto out_put_viommu;
+ }
+
+ hw_queue_size = viommu->ops->get_hw_queue_size(viommu, cmd->type);
+ if (!hw_queue_size) {
+ rc = -EOPNOTSUPP;
+ goto out_put_viommu;
+ }
+
+ /*
+ * It is a driver bug for providing a hw_queue_size smaller than the
+ * core HW queue structure size
+ */
+ if (WARN_ON_ONCE(hw_queue_size < sizeof(*hw_queue))) {
+ rc = -EOPNOTSUPP;
+ goto out_put_viommu;
+ }
+
+ hw_queue = (struct iommufd_hw_queue *)_iommufd_object_alloc_ucmd(
+ ucmd, hw_queue_size, IOMMUFD_OBJ_HW_QUEUE);
+ if (IS_ERR(hw_queue)) {
+ rc = PTR_ERR(hw_queue);
+ goto out_put_viommu;
+ }
+
+ access = iommufd_hw_queue_alloc_phys(cmd, viommu, &base_pa);
+ if (IS_ERR(access)) {
+ rc = PTR_ERR(access);
+ goto out_put_viommu;
+ }
+
+ hw_queue->viommu = viommu;
+ refcount_inc(&viommu->obj.users);
+ hw_queue->access = access;
+ hw_queue->type = cmd->type;
+ hw_queue->length = cmd->length;
+ hw_queue->base_addr = cmd->nesting_parent_iova;
+
+ rc = viommu->ops->hw_queue_init_phys(hw_queue, cmd->index, base_pa);
+ if (rc)
+ goto out_put_viommu;
+
+ cmd->out_hw_queue_id = hw_queue->obj.id;
+ rc = iommufd_ucmd_respond(ucmd, sizeof(*cmd));
+
+out_put_viommu:
+ iommufd_put_object(ucmd->ictx, &viommu->obj);
+ return rc;
+}
--
2.43.0
* [PATCH v7 14/28] iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (12 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 13/28] iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 15/28] iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC Nicolin Chen
` (13 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
NVIDIA Virtual Command Queue is one of the iommufd users exposing vIOMMU
features to user space VMs. Its hardware has a strict rule when mapping
and unmapping multiple global CMDQVs to/from a VM-owned VINTF, requiring
mappings in ascending order and unmappings in descending order.
The tegra241-cmdqv driver can enforce the rule for a mapping in the
LVCMDQ allocation handler. However, it can't do the same for an
unmapping, since user space could issue destroy calls in a random order
that breaks the rule, while the destroy op at the driver level can't
reject a destroy call as it returns void.
Add iommufd_hw_queue_depend/undepend for-driver helpers, allowing the
LVCMDQ allocator to refcount_inc() a sibling LVCMDQ object and the
LVCMDQ destroyer to refcount_dec(), so that the iommufd core will help
block a random destroy call that breaks the rule.
This is a bit of a compromise, because a driver might end up abusing the
API and deadlocking the objects. So restrict the API to a dependency
between two driver-allocated objects of the same type, as iommufd would
be unlikely to build any core-level dependency in this case. Also,
encourage the use of the macro version, which currently supports the HW
QUEUE objects only.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommufd.h | 44 ++++++++++++++++++++++++++++++++++
drivers/iommu/iommufd/driver.c | 28 ++++++++++++++++++++++
2 files changed, 72 insertions(+)
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index c98edfce5081..563bdd30a8bb 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -251,6 +251,10 @@ static inline int iommufd_vfio_compat_set_no_iommu(struct iommufd_ctx *ictx)
#endif /* CONFIG_IOMMUFD */
#if IS_ENABLED(CONFIG_IOMMUFD_DRIVER_CORE)
+int _iommufd_object_depend(struct iommufd_object *obj_dependent,
+ struct iommufd_object *obj_depended);
+void _iommufd_object_undepend(struct iommufd_object *obj_dependent,
+ struct iommufd_object *obj_depended);
struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu,
unsigned long vdev_id);
int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu,
@@ -259,6 +263,18 @@ int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
enum iommu_veventq_type type, void *event_data,
size_t data_len);
#else /* !CONFIG_IOMMUFD_DRIVER_CORE */
+static inline int _iommufd_object_depend(struct iommufd_object *obj_dependent,
+ struct iommufd_object *obj_depended)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void
+_iommufd_object_undepend(struct iommufd_object *obj_dependent,
+ struct iommufd_object *obj_depended)
+{
+}
+
static inline struct device *
iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
{
@@ -298,4 +314,32 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
BUILD_BUG_ON_ZERO(!__same_type(struct iommufd_hw_queue, \
((drv_struct *)NULL)->member)))
+/*
+ * Helpers for IOMMU driver to build/destroy a dependency between two sibling
+ * structures created by one of the allocators above
+ */
+#define iommufd_hw_queue_depend(dependent, depended, member) \
+ ({ \
+ int ret = -EINVAL; \
+ \
+ static_assert(__same_type(struct iommufd_hw_queue, \
+ dependent->member)); \
+ static_assert(__same_type(typeof(*dependent), *depended)); \
+ if (!WARN_ON_ONCE(dependent->member.viommu != \
+ depended->member.viommu)) \
+ ret = _iommufd_object_depend(&dependent->member.obj, \
+ &depended->member.obj); \
+ ret; \
+ })
+
+#define iommufd_hw_queue_undepend(dependent, depended, member) \
+ ({ \
+ static_assert(__same_type(struct iommufd_hw_queue, \
+ dependent->member)); \
+ static_assert(__same_type(typeof(*dependent), *depended)); \
+ WARN_ON_ONCE(dependent->member.viommu != \
+ depended->member.viommu); \
+ _iommufd_object_undepend(&dependent->member.obj, \
+ &depended->member.obj); \
+ })
#endif
diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c
index 887719016804..e578ef32d30c 100644
--- a/drivers/iommu/iommufd/driver.c
+++ b/drivers/iommu/iommufd/driver.c
@@ -3,6 +3,34 @@
*/
#include "iommufd_private.h"
+/* Driver should use a per-structure helper in include/linux/iommufd.h */
+int _iommufd_object_depend(struct iommufd_object *obj_dependent,
+ struct iommufd_object *obj_depended)
+{
+ /* Reject self dependency that dead locks */
+ if (obj_dependent == obj_depended)
+ return -EINVAL;
+ /* Only support dependency between two objects of the same type */
+ if (obj_dependent->type != obj_depended->type)
+ return -EINVAL;
+
+ refcount_inc(&obj_depended->users);
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(_iommufd_object_depend, "IOMMUFD");
+
+/* Driver should use a per-structure helper in include/linux/iommufd.h */
+void _iommufd_object_undepend(struct iommufd_object *obj_dependent,
+ struct iommufd_object *obj_depended)
+{
+ if (WARN_ON_ONCE(obj_dependent == obj_depended ||
+ obj_dependent->type != obj_depended->type))
+ return;
+
+ refcount_dec(&obj_depended->users);
+}
+EXPORT_SYMBOL_NS_GPL(_iommufd_object_undepend, "IOMMUFD");
+
/* Caller should xa_lock(&viommu->vdevs) to protect the return value */
struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu,
unsigned long vdev_id)
--
2.43.0
* [PATCH v7 15/28] iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (13 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 14/28] iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 16/28] iommufd: Add mmap interface Nicolin Chen
` (12 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
Add some simple tests for the IOMMUFD_CMD_HW_QUEUE_ALLOC infrastructure,
covering the new iommufd_hw_queue_depend/undepend() helpers.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_test.h | 3 +
tools/testing/selftests/iommu/iommufd_utils.h | 31 ++++++
drivers/iommu/iommufd/selftest.c | 97 +++++++++++++++++++
tools/testing/selftests/iommu/iommufd.c | 59 +++++++++++
.../selftests/iommu/iommufd_fail_nth.c | 6 ++
5 files changed, 196 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index fbf9ecb35a13..51cd744a354f 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -265,4 +265,7 @@ struct iommu_viommu_event_selftest {
__u32 virt_id;
};
+#define IOMMU_HW_QUEUE_TYPE_SELFTEST 0xdeadbeef
+#define IOMMU_TEST_HW_QUEUE_MAX 2
+
#endif
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index a5d4cbd089ba..9a556f99d992 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -956,6 +956,37 @@ static int _test_cmd_vdevice_alloc(int fd, __u32 viommu_id, __u32 idev_id,
_test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, \
virt_id, vdev_id))
+static int _test_cmd_hw_queue_alloc(int fd, __u32 viommu_id, __u32 type,
+ __u32 idx, __u64 base_addr, __u64 length,
+ __u32 *hw_queue_id)
+{
+ struct iommu_hw_queue_alloc cmd = {
+ .size = sizeof(cmd),
+ .viommu_id = viommu_id,
+ .type = type,
+ .index = idx,
+ .nesting_parent_iova = base_addr,
+ .length = length,
+ };
+ int ret;
+
+ ret = ioctl(fd, IOMMU_HW_QUEUE_ALLOC, &cmd);
+ if (ret)
+ return ret;
+ if (hw_queue_id)
+ *hw_queue_id = cmd.out_hw_queue_id;
+ return 0;
+}
+
+#define test_cmd_hw_queue_alloc(viommu_id, type, idx, base_addr, len, out_qid) \
+ ASSERT_EQ(0, _test_cmd_hw_queue_alloc(self->fd, viommu_id, type, idx, \
+ base_addr, len, out_qid))
+#define test_err_hw_queue_alloc(_errno, viommu_id, type, idx, base_addr, len, \
+ out_qid) \
+ EXPECT_ERRNO(_errno, \
+ _test_cmd_hw_queue_alloc(self->fd, viommu_id, type, idx, \
+ base_addr, len, out_qid))
+
static int _test_cmd_veventq_alloc(int fd, __u32 viommu_id, __u32 type,
__u32 *veventq_id, __u32 *veventq_fd)
{
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 38066dfeb2e7..2189e9b119ee 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -150,6 +150,8 @@ to_mock_nested(struct iommu_domain *domain)
struct mock_viommu {
struct iommufd_viommu core;
struct mock_iommu_domain *s2_parent;
+ struct mock_hw_queue *hw_queue[IOMMU_TEST_HW_QUEUE_MAX];
+ struct mutex queue_mutex;
};
static inline struct mock_viommu *to_mock_viommu(struct iommufd_viommu *viommu)
@@ -157,6 +159,19 @@ static inline struct mock_viommu *to_mock_viommu(struct iommufd_viommu *viommu)
return container_of(viommu, struct mock_viommu, core);
}
+struct mock_hw_queue {
+ struct iommufd_hw_queue core;
+ struct mock_viommu *mock_viommu;
+ struct mock_hw_queue *prev;
+ u16 index;
+};
+
+static inline struct mock_hw_queue *
+to_mock_hw_queue(struct iommufd_hw_queue *hw_queue)
+{
+ return container_of(hw_queue, struct mock_hw_queue, core);
+}
+
enum selftest_obj_type {
TYPE_IDEV,
};
@@ -670,9 +685,11 @@ static void mock_viommu_destroy(struct iommufd_viommu *viommu)
{
struct mock_iommu_device *mock_iommu = container_of(
viommu->iommu_dev, struct mock_iommu_device, iommu_dev);
+ struct mock_viommu *mock_viommu = to_mock_viommu(viommu);
if (refcount_dec_and_test(&mock_iommu->users))
complete(&mock_iommu->complete);
+ mutex_destroy(&mock_viommu->queue_mutex);
/* iommufd core frees mock_viommu and viommu */
}
@@ -764,10 +781,86 @@ static int mock_viommu_cache_invalidate(struct iommufd_viommu *viommu,
return rc;
}
+static size_t mock_viommu_get_hw_queue_size(struct iommufd_viommu *viommu,
+ enum iommu_hw_queue_type queue_type)
+{
+ if (queue_type != IOMMU_HW_QUEUE_TYPE_SELFTEST)
+ return 0;
+ return HW_QUEUE_STRUCT_SIZE(struct mock_hw_queue, core);
+}
+
+static void mock_hw_queue_destroy(struct iommufd_hw_queue *hw_queue)
+{
+ struct mock_hw_queue *mock_hw_queue = to_mock_hw_queue(hw_queue);
+ struct mock_viommu *mock_viommu = mock_hw_queue->mock_viommu;
+
+ mutex_lock(&mock_viommu->queue_mutex);
+ mock_viommu->hw_queue[mock_hw_queue->index] = NULL;
+ if (mock_hw_queue->prev)
+ iommufd_hw_queue_undepend(mock_hw_queue, mock_hw_queue->prev,
+ core);
+ mutex_unlock(&mock_viommu->queue_mutex);
+}
+
+/* Test iommufd_hw_queue_depend/undepend() */
+static int mock_hw_queue_init_phys(struct iommufd_hw_queue *hw_queue, u32 index,
+ phys_addr_t base_addr_pa)
+{
+ struct mock_viommu *mock_viommu = to_mock_viommu(hw_queue->viommu);
+ struct mock_hw_queue *mock_hw_queue = to_mock_hw_queue(hw_queue);
+ struct mock_hw_queue *prev = NULL;
+ int rc = 0;
+
+ if (index >= IOMMU_TEST_HW_QUEUE_MAX)
+ return -EINVAL;
+
+ mutex_lock(&mock_viommu->queue_mutex);
+
+ if (mock_viommu->hw_queue[index]) {
+ rc = -EEXIST;
+ goto unlock;
+ }
+
+ if (index) {
+ prev = mock_viommu->hw_queue[index - 1];
+ if (!prev) {
+ rc = -EIO;
+ goto unlock;
+ }
+ }
+
+ /*
+ * Test to catch a kernel bug if the core converted the physical address
+ * incorrectly. Let mock_domain_iova_to_phys() WARN_ON if it fails.
+ */
+ if (base_addr_pa != iommu_iova_to_phys(&mock_viommu->s2_parent->domain,
+ hw_queue->base_addr)) {
+ rc = -EFAULT;
+ goto unlock;
+ }
+
+ if (prev) {
+ rc = iommufd_hw_queue_depend(mock_hw_queue, prev, core);
+ if (rc)
+ goto unlock;
+ }
+
+ mock_hw_queue->prev = prev;
+ mock_hw_queue->mock_viommu = mock_viommu;
+ mock_viommu->hw_queue[index] = mock_hw_queue;
+
+ hw_queue->destroy = &mock_hw_queue_destroy;
+unlock:
+ mutex_unlock(&mock_viommu->queue_mutex);
+ return rc;
+}
+
static struct iommufd_viommu_ops mock_viommu_ops = {
.destroy = mock_viommu_destroy,
.alloc_domain_nested = mock_viommu_alloc_domain_nested,
.cache_invalidate = mock_viommu_cache_invalidate,
+ .get_hw_queue_size = mock_viommu_get_hw_queue_size,
+ .hw_queue_init_phys = mock_hw_queue_init_phys,
};
static size_t mock_get_viommu_size(struct device *dev,
@@ -784,6 +877,7 @@ static int mock_viommu_init(struct iommufd_viommu *viommu,
{
struct mock_iommu_device *mock_iommu = container_of(
viommu->iommu_dev, struct mock_iommu_device, iommu_dev);
+ struct mock_viommu *mock_viommu = to_mock_viommu(viommu);
struct iommu_viommu_selftest data;
int rc;
@@ -801,6 +895,9 @@ static int mock_viommu_init(struct iommufd_viommu *viommu,
}
refcount_inc(&mock_iommu->users);
+ mutex_init(&mock_viommu->queue_mutex);
+ mock_viommu->s2_parent = to_mock_domain(parent_domain);
+
viommu->ops = &mock_viommu_ops;
return 0;
}
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index f22388dfacad..0da72a2388e9 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -3031,6 +3031,65 @@ TEST_F(iommufd_viommu, vdevice_cache)
}
}
+TEST_F(iommufd_viommu, hw_queue)
+{
+ uint32_t viommu_id = self->viommu_id;
+ __u64 iova = MOCK_APERTURE_START;
+ uint32_t hw_queue_id[2];
+
+ if (viommu_id) {
+ /* Fail IOMMU_HW_QUEUE_TYPE_DEFAULT */
+ test_err_hw_queue_alloc(EOPNOTSUPP, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_DEFAULT, 0, iova,
+ PAGE_SIZE, &hw_queue_id[0]);
+ /* Fail queue addr and length */
+ test_err_hw_queue_alloc(EINVAL, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova,
+ 0, &hw_queue_id[0]);
+ test_err_hw_queue_alloc(EOVERFLOW, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_SELFTEST, 0,
+ ~(uint64_t)0, PAGE_SIZE,
+ &hw_queue_id[0]);
+ /* Fail missing iova */
+ test_err_hw_queue_alloc(ENOENT, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova,
+ PAGE_SIZE, &hw_queue_id[0]);
+
+ /* Map iova */
+ test_ioctl_ioas_map(buffer, PAGE_SIZE, &iova);
+
+ /* Fail index=1 and =MAX; must start from index=0 */
+ test_err_hw_queue_alloc(EIO, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_SELFTEST, 1, iova,
+ PAGE_SIZE, &hw_queue_id[0]);
+ test_err_hw_queue_alloc(EINVAL, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_SELFTEST,
+ IOMMU_TEST_HW_QUEUE_MAX, iova,
+ PAGE_SIZE, &hw_queue_id[0]);
+
+ /* Allocate index=0, declare ownership of the iova */
+ test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST,
+ 0, iova, PAGE_SIZE, &hw_queue_id[0]);
+ /* Fail duplicate */
+ test_err_hw_queue_alloc(EEXIST, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova,
+ PAGE_SIZE, &hw_queue_id[0]);
+ /* Fail unmap, due to iova ownership */
+ test_err_ioctl_ioas_unmap(EBUSY, iova, PAGE_SIZE);
+
+ /* Allocate index=1 */
+ test_cmd_hw_queue_alloc(viommu_id, IOMMU_HW_QUEUE_TYPE_SELFTEST,
+ 1, iova, PAGE_SIZE, &hw_queue_id[1]);
+ /* Fail to destroy, due to dependency */
+ EXPECT_ERRNO(EBUSY,
+ _test_ioctl_destroy(self->fd, hw_queue_id[0]));
+
+ /* Destroy in descending order */
+ test_ioctl_destroy(hw_queue_id[1]);
+ test_ioctl_destroy(hw_queue_id[0]);
+ }
+}
+
FIXTURE(iommufd_device_pasid)
{
int fd;
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index f7ccf1822108..41c685bbd252 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -634,6 +634,7 @@ TEST_FAIL_NTH(basic_fail_nth, device)
uint32_t idev_id;
uint32_t hwpt_id;
uint32_t viommu_id;
+ uint32_t hw_queue_id;
uint32_t vdev_id;
__u64 iova;
@@ -696,6 +697,11 @@ TEST_FAIL_NTH(basic_fail_nth, device)
if (_test_cmd_vdevice_alloc(self->fd, viommu_id, idev_id, 0, &vdev_id))
return -1;
+ if (_test_cmd_hw_queue_alloc(self->fd, viommu_id,
+ IOMMU_HW_QUEUE_TYPE_SELFTEST, 0, iova,
+ PAGE_SIZE, &hw_queue_id))
+ return -1;
+
if (_test_ioctl_fault_alloc(self->fd, &fault_id, &fault_fd))
return -1;
close(fault_fd);
--
2.43.0
* [PATCH v7 16/28] iommufd: Add mmap interface
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
For a vIOMMU that passes through HW resources to user space (VMs), allowing
a VM to control the passed-through HW directly by accessing its hardware
registers, add an mmap infrastructure to map the physical MMIO pages to
user space.
Maintain a per-ictx maple tree as a translation table for mmappable regions,
mapping an allocated for-user mmap offset to an iommufd_mmap struct that
stores the real physical address range for io_remap_pfn_range().
Track the lifecycle of an mmappable region by holding a refcount on its
owner, enforcing that user space unmaps the region before it can destroy the
owner object.
To let an IOMMU driver add and delete mmappable regions in the maple tree,
add iommufd_viommu_alloc/destroy_mmap helpers.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_private.h | 14 ++++++
include/linux/iommufd.h | 42 +++++++++++++++++
drivers/iommu/iommufd/driver.c | 51 ++++++++++++++++++++
drivers/iommu/iommufd/main.c | 63 +++++++++++++++++++++++++
4 files changed, 170 insertions(+)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 936387bcd0d0..ebac6a4b3538 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -7,6 +7,7 @@
#include <linux/iommu.h>
#include <linux/iommufd.h>
#include <linux/iova_bitmap.h>
+#include <linux/maple_tree.h>
#include <linux/rwsem.h>
#include <linux/uaccess.h>
#include <linux/xarray.h>
@@ -44,6 +45,7 @@ struct iommufd_ctx {
struct xarray groups;
wait_queue_head_t destroy_wait;
struct rw_semaphore ioas_creation_lock;
+ struct maple_tree mt_mmap;
struct mutex sw_msi_lock;
struct list_head sw_msi_list;
@@ -55,6 +57,18 @@ struct iommufd_ctx {
struct iommufd_ioas *vfio_ioas;
};
+/* Entry for iommufd_ctx::mt_mmap */
+struct iommufd_mmap {
+ struct iommufd_object *owner;
+
+ /* Page-shifted start position in mt_mmap to validate vma->vm_pgoff */
+ unsigned long vm_pgoff;
+
+ /* Physical range for io_remap_pfn_range() */
+ phys_addr_t mmio_addr;
+ size_t length;
+};
+
/*
* The IOVA to PFN map. The map automatically copies the PFNs into multiple
* domains and permits sharing of PFNs between io_pagetable instances. This
diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h
index 563bdd30a8bb..7ab9e3e928b3 100644
--- a/include/linux/iommufd.h
+++ b/include/linux/iommufd.h
@@ -255,6 +255,11 @@ int _iommufd_object_depend(struct iommufd_object *obj_dependent,
struct iommufd_object *obj_depended);
void _iommufd_object_undepend(struct iommufd_object *obj_dependent,
struct iommufd_object *obj_depended);
+int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, struct iommufd_object *owner,
+ phys_addr_t mmio_addr, size_t length,
+ unsigned long *offset);
+void _iommufd_destroy_mmap(struct iommufd_ctx *ictx,
+ struct iommufd_object *owner, unsigned long offset);
struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu,
unsigned long vdev_id);
int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu,
@@ -275,6 +280,20 @@ _iommufd_object_undepend(struct iommufd_object *obj_dependent,
{
}
+static inline int _iommufd_alloc_mmap(struct iommufd_ctx *ictx,
+ struct iommufd_object *owner,
+ phys_addr_t mmio_addr, size_t length,
+ unsigned long *offset)
+{
+ return -EOPNOTSUPP;
+}
+
+static inline void _iommufd_destroy_mmap(struct iommufd_ctx *ictx,
+ struct iommufd_object *owner,
+ unsigned long offset)
+{
+}
+
static inline struct device *
iommufd_viommu_find_dev(struct iommufd_viommu *viommu, unsigned long vdev_id)
{
@@ -342,4 +361,27 @@ static inline int iommufd_viommu_report_event(struct iommufd_viommu *viommu,
_iommufd_object_undepend(&dependent->member.obj, \
&depended->member.obj); \
})
+
+/*
+ * Helpers for IOMMU driver to alloc/destroy an mmapable area for a structure.
+ *
+ * To support an mmappable MMIO region, kernel driver must first register it to
+ * iommufd core to allocate an @offset, during a driver-structure initialization
+ * (e.g. viommu_init op). Then, it should report to user space this @offset and
+ * the @length of the MMIO region for mmap syscall.
+ */
+static inline int iommufd_viommu_alloc_mmap(struct iommufd_viommu *viommu,
+ phys_addr_t mmio_addr,
+ size_t length,
+ unsigned long *offset)
+{
+ return _iommufd_alloc_mmap(viommu->ictx, &viommu->obj, mmio_addr,
+ length, offset);
+}
+
+static inline void iommufd_viommu_destroy_mmap(struct iommufd_viommu *viommu,
+ unsigned long offset)
+{
+ _iommufd_destroy_mmap(viommu->ictx, &viommu->obj, offset);
+}
#endif
diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c
index e578ef32d30c..2c9af93217f1 100644
--- a/drivers/iommu/iommufd/driver.c
+++ b/drivers/iommu/iommufd/driver.c
@@ -31,6 +31,57 @@ void _iommufd_object_undepend(struct iommufd_object *obj_dependent,
}
EXPORT_SYMBOL_NS_GPL(_iommufd_object_undepend, "IOMMUFD");
+/*
+ * Allocate an @offset to return to user space to use for an mmap() syscall
+ *
+ * Driver should use a per-structure helper in include/linux/iommufd.h
+ */
+int _iommufd_alloc_mmap(struct iommufd_ctx *ictx, struct iommufd_object *owner,
+ phys_addr_t mmio_addr, size_t length,
+ unsigned long *offset)
+{
+ struct iommufd_mmap *immap;
+ unsigned long startp;
+ int rc;
+
+ if (!PAGE_ALIGNED(mmio_addr))
+ return -EINVAL;
+ if (!length || !PAGE_ALIGNED(length))
+ return -EINVAL;
+
+ immap = kzalloc(sizeof(*immap), GFP_KERNEL);
+ if (!immap)
+ return -ENOMEM;
+ immap->owner = owner;
+ immap->length = length;
+ immap->mmio_addr = mmio_addr;
+
+ rc = mtree_alloc_range(&ictx->mt_mmap, &startp, immap, immap->length, 0,
+ PHYS_ADDR_MAX, GFP_KERNEL);
+ if (rc < 0) {
+ kfree(immap);
+ return rc;
+ }
+
+ /* mmap() syscall will right-shift the offset in vma->vm_pgoff too */
+ immap->vm_pgoff = startp >> PAGE_SHIFT;
+ *offset = startp;
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(_iommufd_alloc_mmap, "IOMMUFD");
+
+/* Driver should use a per-structure helper in include/linux/iommufd.h */
+void _iommufd_destroy_mmap(struct iommufd_ctx *ictx,
+ struct iommufd_object *owner, unsigned long offset)
+{
+ struct iommufd_mmap *immap;
+
+ immap = mtree_erase(&ictx->mt_mmap, offset >> PAGE_SHIFT);
+ WARN_ON_ONCE(!immap || immap->owner != owner);
+ kfree(immap);
+}
+EXPORT_SYMBOL_NS_GPL(_iommufd_destroy_mmap, "IOMMUFD");
+
/* Caller should xa_lock(&viommu->vdevs) to protect the return value */
struct device *iommufd_viommu_find_dev(struct iommufd_viommu *viommu,
unsigned long vdev_id)
diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c
index 4e8dbbfac890..0fb81a905cb1 100644
--- a/drivers/iommu/iommufd/main.c
+++ b/drivers/iommu/iommufd/main.c
@@ -275,6 +275,7 @@ static int iommufd_fops_open(struct inode *inode, struct file *filp)
xa_init_flags(&ictx->objects, XA_FLAGS_ALLOC1 | XA_FLAGS_ACCOUNT);
xa_init(&ictx->groups);
ictx->file = filp;
+ mt_init_flags(&ictx->mt_mmap, MT_FLAGS_ALLOC_RANGE);
init_waitqueue_head(&ictx->destroy_wait);
mutex_init(&ictx->sw_msi_lock);
INIT_LIST_HEAD(&ictx->sw_msi_list);
@@ -479,11 +480,73 @@ static long iommufd_fops_ioctl(struct file *filp, unsigned int cmd,
return ret;
}
+static void iommufd_fops_vma_open(struct vm_area_struct *vma)
+{
+ struct iommufd_mmap *immap = vma->vm_private_data;
+
+ refcount_inc(&immap->owner->users);
+}
+
+static void iommufd_fops_vma_close(struct vm_area_struct *vma)
+{
+ struct iommufd_mmap *immap = vma->vm_private_data;
+
+ refcount_dec(&immap->owner->users);
+}
+
+static const struct vm_operations_struct iommufd_vma_ops = {
+ .open = iommufd_fops_vma_open,
+ .close = iommufd_fops_vma_close,
+};
+
+/* The vm_pgoff must be pre-allocated from mt_mmap, and given to user space */
+static int iommufd_fops_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+ struct iommufd_ctx *ictx = filp->private_data;
+ size_t length = vma->vm_end - vma->vm_start;
+ struct iommufd_mmap *immap;
+ int rc;
+
+ if (!PAGE_ALIGNED(length))
+ return -EINVAL;
+ if (!(vma->vm_flags & VM_SHARED))
+ return -EINVAL;
+ if (vma->vm_flags & VM_EXEC)
+ return -EPERM;
+
+ /* vma->vm_pgoff carries a page-shifted start position to an immap */
+ immap = mtree_load(&ictx->mt_mmap, vma->vm_pgoff << PAGE_SHIFT);
+ if (!immap)
+ return -ENXIO;
+	/*
+	 * mtree_load() returns the immap for any contained mmio_addr, so only
+	 * allow the exact registered region to be mapped
+	 */
+ if (vma->vm_pgoff != immap->vm_pgoff || length != immap->length)
+ return -ENXIO;
+
+ vma->vm_pgoff = 0;
+ vma->vm_private_data = immap;
+ vma->vm_ops = &iommufd_vma_ops;
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+ rc = io_remap_pfn_range(vma, vma->vm_start,
+ immap->mmio_addr >> PAGE_SHIFT, length,
+ vma->vm_page_prot);
+ if (rc)
+ return rc;
+
+ /* vm_ops.open won't be called for mmap itself. */
+ refcount_inc(&immap->owner->users);
+ return rc;
+}
+
static const struct file_operations iommufd_fops = {
.owner = THIS_MODULE,
.open = iommufd_fops_open,
.release = iommufd_fops_release,
.unlocked_ioctl = iommufd_fops_ioctl,
+ .mmap = iommufd_fops_mmap,
};
/**
--
2.43.0
* [PATCH v7 17/28] iommufd/selftest: Add coverage for the new mmap interface
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
Extend the loopback test to cover the new mmap interface: mmap the region
reported by the mock viommu and verify the loopback data through the mapped
page.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/iommufd/iommufd_test.h | 4 +++
tools/testing/selftests/iommu/iommufd_utils.h | 4 +++
drivers/iommu/iommufd/selftest.c | 33 ++++++++++++++++++-
tools/testing/selftests/iommu/iommufd.c | 22 +++++++++++++
4 files changed, 62 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/iommufd/iommufd_test.h b/drivers/iommu/iommufd/iommufd_test.h
index 51cd744a354f..8fc618b2bcf9 100644
--- a/drivers/iommu/iommufd/iommufd_test.h
+++ b/drivers/iommu/iommufd/iommufd_test.h
@@ -232,12 +232,16 @@ struct iommu_hwpt_invalidate_selftest {
* (IOMMU_VIOMMU_TYPE_SELFTEST)
* @in_data: Input random data from user space
* @out_data: Output data (matching @in_data) to user space
+ * @out_mmap_offset: The offset argument for mmap syscall
+ * @out_mmap_length: The length argument for mmap syscall
*
* Simply set @out_data=@in_data for a loopback test
*/
struct iommu_viommu_selftest {
__u32 in_data;
__u32 out_data;
+ __aligned_u64 out_mmap_offset;
+ __aligned_u64 out_mmap_length;
};
/* Should not be equal to any defined value in enum iommu_viommu_invalidate_data_type */
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 9a556f99d992..4a1b2bade018 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -56,6 +56,10 @@ static unsigned long PAGE_SIZE;
#define offsetofend(TYPE, MEMBER) \
(offsetof(TYPE, MEMBER) + sizeof_field(TYPE, MEMBER))
+#define test_err_mmap(_errno, length, offset) \
+ EXPECT_ERRNO(_errno, (long)mmap(NULL, length, PROT_READ | PROT_WRITE, \
+ MAP_SHARED, self->fd, offset))
+
static inline void *memfd_mmap(size_t length, int prot, int flags, int *mfd_p)
{
int mfd_flags = (flags & MAP_HUGETLB) ? MFD_HUGETLB : 0;
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 2189e9b119ee..8b2c44b32530 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -152,6 +152,9 @@ struct mock_viommu {
struct mock_iommu_domain *s2_parent;
struct mock_hw_queue *hw_queue[IOMMU_TEST_HW_QUEUE_MAX];
struct mutex queue_mutex;
+
+ unsigned long mmap_offset;
+ u32 *page; /* Mmap page to test u32 type of in_data */
};
static inline struct mock_viommu *to_mock_viommu(struct iommufd_viommu *viommu)
@@ -689,6 +692,10 @@ static void mock_viommu_destroy(struct iommufd_viommu *viommu)
if (refcount_dec_and_test(&mock_iommu->users))
complete(&mock_iommu->complete);
+ if (mock_viommu->mmap_offset)
+ iommufd_viommu_destroy_mmap(&mock_viommu->core,
+ mock_viommu->mmap_offset);
+ free_page((unsigned long)mock_viommu->page);
mutex_destroy(&mock_viommu->queue_mutex);
/* iommufd core frees mock_viommu and viommu */
@@ -887,11 +894,28 @@ static int mock_viommu_init(struct iommufd_viommu *viommu,
if (rc)
return rc;
+ /* Allocate two pages */
+ mock_viommu->page =
+ (u32 *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1);
+ if (!mock_viommu->page)
+ return -ENOMEM;
+
+ rc = iommufd_viommu_alloc_mmap(&mock_viommu->core,
+ __pa(mock_viommu->page),
+ PAGE_SIZE * 2,
+ &mock_viommu->mmap_offset);
+ if (rc)
+ goto err_free_page;
+
+ /* For loopback tests on both the page and out_data */
+ *mock_viommu->page = data.in_data;
data.out_data = data.in_data;
+ data.out_mmap_length = PAGE_SIZE * 2;
+ data.out_mmap_offset = mock_viommu->mmap_offset;
rc = iommu_copy_struct_to_user(
user_data, &data, IOMMU_VIOMMU_TYPE_SELFTEST, out_data);
if (rc)
- return rc;
+ goto err_destroy_mmap;
}
refcount_inc(&mock_iommu->users);
@@ -900,6 +924,13 @@ static int mock_viommu_init(struct iommufd_viommu *viommu,
viommu->ops = &mock_viommu_ops;
return 0;
+
+err_destroy_mmap:
+ iommufd_viommu_destroy_mmap(&mock_viommu->core,
+ mock_viommu->mmap_offset);
+err_free_page:
+ free_page((unsigned long)mock_viommu->page);
+ return rc;
}
static const struct iommu_ops mock_ops = {
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 0da72a2388e9..235504ee05e3 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -2799,12 +2799,34 @@ TEST_F(iommufd_viommu, viommu_alloc_with_data)
struct iommu_viommu_selftest data = {
.in_data = 0xbeef,
};
+ uint32_t *test;
if (self->device_id) {
test_cmd_viommu_alloc(self->device_id, self->hwpt_id,
IOMMU_VIOMMU_TYPE_SELFTEST, &data,
sizeof(data), &self->viommu_id);
ASSERT_EQ(data.out_data, data.in_data);
+
+ /* Negative mmap tests -- offset and length cannot be changed */
+ test_err_mmap(ENXIO, data.out_mmap_length,
+ data.out_mmap_offset + PAGE_SIZE);
+ test_err_mmap(ENXIO, data.out_mmap_length,
+ data.out_mmap_offset + PAGE_SIZE * 2);
+ test_err_mmap(ENXIO, data.out_mmap_length / 2,
+ data.out_mmap_offset);
+ test_err_mmap(ENXIO, data.out_mmap_length * 2,
+ data.out_mmap_offset);
+
+ /* Now do a correct mmap for a loopback test */
+ test = mmap(NULL, data.out_mmap_length, PROT_READ | PROT_WRITE,
+ MAP_SHARED, self->fd, data.out_mmap_offset);
+ ASSERT_NE(MAP_FAILED, test);
+ ASSERT_EQ(data.in_data, *test);
+
+ /* The owner of the mmap region should be blocked */
+ EXPECT_ERRNO(EBUSY,
+ _test_ioctl_destroy(self->fd, self->viommu_id));
+ munmap(test, data.out_mmap_length);
}
}
--
2.43.0
* [PATCH v7 18/28] Documentation: userspace-api: iommufd: Update HW QUEUE
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
With the introduction of the new HW QUEUE object and its mmap infrastructure,
update the documentation to reflect them.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
Documentation/userspace-api/iommufd.rst | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/Documentation/userspace-api/iommufd.rst b/Documentation/userspace-api/iommufd.rst
index b0df15865dec..03f7510384d2 100644
--- a/Documentation/userspace-api/iommufd.rst
+++ b/Documentation/userspace-api/iommufd.rst
@@ -124,6 +124,17 @@ Following IOMMUFD objects are exposed to userspace:
used to allocate a vEVENTQ. Each vIOMMU can support multiple types of vEVENTS,
but is confined to one vEVENTQ per vEVENTQ type.
+- IOMMUFD_OBJ_HW_QUEUE, representing a hardware accelerated queue, as a subset
+ of an IOMMU's virtualization features, for the IOMMU HW to directly read or
+ write the virtual queue memory owned by a guest OS. This HW acceleration lets
+ a VM work with the IOMMU HW directly without a VM Exit, reducing hypercall
+ overhead. Along with the HW QUEUE object, iommufd provides user space an mmap
+ interface for the VMM to map a physical MMIO region from the host physical
+ address space to the guest physical address space, allowing the guest OS to
+ directly control the allocated HW QUEUE. Thus, when allocating a HW QUEUE,
+ the VMM must request a pair of mmap info (offset/length) and pass it exactly
+ to an mmap syscall via its offset and length arguments.
+
All user-visible objects are destroyed via the IOMMU_DESTROY uAPI.
The diagrams below show relationships between user-visible objects and kernel
@@ -270,6 +281,7 @@ User visible objects are backed by following datastructures:
- iommufd_viommu for IOMMUFD_OBJ_VIOMMU.
- iommufd_vdevice for IOMMUFD_OBJ_VDEVICE.
- iommufd_veventq for IOMMUFD_OBJ_VEVENTQ.
+- iommufd_hw_queue for IOMMUFD_OBJ_HW_QUEUE.
Several terminologies when looking at these datastructures:
--
2.43.0
* [PATCH v7 19/28] iommu: Allow an input type in hw_info op
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
The hw_info uAPI will support a bidirectional data_type field that user space
can use as an input to request a specific info data type.
To prepare for the uAPI update, change the iommu layer first:
- Add a new IOMMU_HW_INFO_TYPE_DEFAULT as an input, for which a driver can
  output its only (or first) supported type
- Update the kdoc accordingly
- Roll out the type validation in the existing drivers
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/linux/iommu.h | 3 ++-
include/uapi/linux/iommufd.h | 4 +++-
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 4 ++++
drivers/iommu/intel/iommu.c | 4 ++++
drivers/iommu/iommufd/device.c | 3 +++
drivers/iommu/iommufd/selftest.c | 4 ++++
6 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e06a0fbe4bc7..e8b59ef54e48 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -603,7 +603,8 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
* @capable: check capability
* @hw_info: report iommu hardware information. The data buffer returned by this
* op is allocated in the iommu driver and freed by the caller after
- * use.
+ * use. @type can input a requested type and output a supported type.
+ * Driver should reject an unsupported data @type input
* @domain_alloc: Do not use in new drivers
* @domain_alloc_identity: allocate an IDENTITY domain. Drivers should prefer to
* use identity_domain instead. This should only be used
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index f091ea072c5f..6ad361ff9b06 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -593,13 +593,15 @@ struct iommu_hw_info_arm_smmuv3 {
/**
* enum iommu_hw_info_type - IOMMU Hardware Info Types
- * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
+ * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
* info
+ * @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request a default type
* @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
* @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
*/
enum iommu_hw_info_type {
IOMMU_HW_INFO_TYPE_NONE = 0,
+ IOMMU_HW_INFO_TYPE_DEFAULT = 0,
IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
};
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 170d69162848..eb9fe1f6311a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -15,6 +15,10 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length,
u32 __iomem *base_idr;
unsigned int i;
+ if (*type != IOMMU_HW_INFO_TYPE_DEFAULT &&
+ *type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
+ return ERR_PTR(-EOPNOTSUPP);
+
info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
return ERR_PTR(-ENOMEM);
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 850f1a6f548c..5f75faffca15 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4098,6 +4098,10 @@ static void *intel_iommu_hw_info(struct device *dev, u32 *length,
struct intel_iommu *iommu = info->iommu;
struct iommu_hw_info_vtd *vtd;
+ if (*type != IOMMU_HW_INFO_TYPE_DEFAULT &&
+ *type != IOMMU_HW_INFO_TYPE_INTEL_VTD)
+ return ERR_PTR(-EOPNOTSUPP);
+
vtd = kzalloc(sizeof(*vtd), GFP_KERNEL);
if (!vtd)
return ERR_PTR(-ENOMEM);
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 8f078fda795a..64a51993e6a1 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -1519,6 +1519,9 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
cmd->__reserved[2])
return -EOPNOTSUPP;
+ /* Clear the type field since drivers don't support a random input */
+ cmd->out_data_type = IOMMU_HW_INFO_TYPE_DEFAULT;
+
idev = iommufd_get_device(ucmd, cmd->dev_id);
if (IS_ERR(idev))
return PTR_ERR(idev);
diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
index 8b2c44b32530..a5dc36219a90 100644
--- a/drivers/iommu/iommufd/selftest.c
+++ b/drivers/iommu/iommufd/selftest.c
@@ -310,6 +310,10 @@ static void *mock_domain_hw_info(struct device *dev, u32 *length,
{
struct iommu_test_hw_info *info;
+ if (*type != IOMMU_HW_INFO_TYPE_DEFAULT &&
+ *type != IOMMU_HW_INFO_TYPE_SELFTEST)
+ return ERR_PTR(-EOPNOTSUPP);
+
info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
return ERR_PTR(-ENOMEM);
--
2.43.0
* [PATCH v7 20/28] iommufd: Allow an input data_type via iommu_hw_info
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
The iommu_hw_info struct can output, via its out_data_type field, the vendor
data type from a driver, but this allows a driver to report only one data
type.
Now that SMMUv3 has a Tegra241 CMDQV implementation, there are two sets of
types and data structs to report.
One way to support that is to use the same type field bidirectionally: reuse
the field by adding an "in_data_type", allowing user space to request a
specific type and get the corresponding data.
For backward compatibility, since the ioctl handler has never checked the
input value, add an IOMMU_HW_INFO_FLAG_INPUT_TYPE to switch between the old
output-only field and the new bidirectional field.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/uapi/linux/iommufd.h | 20 +++++++++++++++++++-
drivers/iommu/iommufd/device.c | 9 ++++++---
2 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 6ad361ff9b06..6ae9d2102154 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -628,6 +628,15 @@ enum iommufd_hw_capabilities {
IOMMU_HW_CAP_PCI_PASID_PRIV = 1 << 2,
};
+/**
+ * enum iommufd_hw_info_flags - Flags for iommu_hw_info
+ * @IOMMU_HW_INFO_FLAG_INPUT_TYPE: If set, @in_data_type carries an input type
+ * for user space to request for a specific info
+ */
+enum iommufd_hw_info_flags {
+ IOMMU_HW_INFO_FLAG_INPUT_TYPE = 1 << 0,
+};
+
/**
* struct iommu_hw_info - ioctl(IOMMU_GET_HW_INFO)
* @size: sizeof(struct iommu_hw_info)
@@ -637,6 +646,12 @@ enum iommufd_hw_capabilities {
* data that kernel supports
* @data_uptr: User pointer to a user-space buffer used by the kernel to fill
* the iommu type specific hardware information data
+ * @in_data_type: This shares the same field with @out_data_type, making it a
+ * bidirectional field. When IOMMU_HW_INFO_FLAG_INPUT_TYPE is
+ * set, the input type carried via this @in_data_type field is
+ * valid, requesting the info data of the given type. If
+ * IOMMU_HW_INFO_FLAG_INPUT_TYPE is unset, any input value will
+ * be treated as IOMMU_HW_INFO_TYPE_DEFAULT
* @out_data_type: Output the iommu hardware info type as defined in the enum
* iommu_hw_info_type.
* @out_capabilities: Output the generic iommu capability info type as defined
@@ -666,7 +681,10 @@ struct iommu_hw_info {
__u32 dev_id;
__u32 data_len;
__aligned_u64 data_uptr;
- __u32 out_data_type;
+ union {
+ __u32 in_data_type;
+ __u32 out_data_type;
+ };
__u8 out_max_pasid_log2;
__u8 __reserved[3];
__aligned_u64 out_capabilities;
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 64a51993e6a1..cbd86aabdd1c 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -1506,6 +1506,7 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_rw, "IOMMUFD");
int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
{
+ const u32 SUPPORTED_FLAGS = IOMMU_HW_INFO_FLAG_INPUT_TYPE;
struct iommu_hw_info *cmd = ucmd->cmd;
void __user *user_ptr = u64_to_user_ptr(cmd->data_uptr);
const struct iommu_ops *ops;
@@ -1515,12 +1516,14 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
void *data;
int rc;
- if (cmd->flags || cmd->__reserved[0] || cmd->__reserved[1] ||
- cmd->__reserved[2])
+ if (cmd->flags & ~SUPPORTED_FLAGS)
+ return -EOPNOTSUPP;
+ if (cmd->__reserved[0] || cmd->__reserved[1] || cmd->__reserved[2])
return -EOPNOTSUPP;
/* Clear the type field since drivers don't support a random input */
- cmd->out_data_type = IOMMU_HW_INFO_TYPE_DEFAULT;
+ if (!(cmd->flags & IOMMU_HW_INFO_FLAG_INPUT_TYPE))
+ cmd->in_data_type = IOMMU_HW_INFO_TYPE_DEFAULT;
idev = iommufd_get_device(ucmd, cmd->dev_id);
if (IS_ERR(idev))
--
2.43.0
* [PATCH v7 21/28] iommufd/selftest: Update hw_info coverage for an input data_type
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
Test both IOMMU_HW_INFO_TYPE_DEFAULT and IOMMU_HW_INFO_TYPE_SELFTEST, and
add a negative test for an unsupported type.
Also drop the unused mask argument in test_cmd_get_hw_capabilities(), which
checkpatch complains about.
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
tools/testing/selftests/iommu/iommufd_utils.h | 33 +++++++++++--------
tools/testing/selftests/iommu/iommufd.c | 32 +++++++++++++-----
.../selftests/iommu/iommufd_fail_nth.c | 4 +--
3 files changed, 46 insertions(+), 23 deletions(-)
diff --git a/tools/testing/selftests/iommu/iommufd_utils.h b/tools/testing/selftests/iommu/iommufd_utils.h
index 4a1b2bade018..5384852ce038 100644
--- a/tools/testing/selftests/iommu/iommufd_utils.h
+++ b/tools/testing/selftests/iommu/iommufd_utils.h
@@ -761,20 +761,24 @@ static void teardown_iommufd(int fd, struct __test_metadata *_metadata)
#endif
/* @data can be NULL */
-static int _test_cmd_get_hw_info(int fd, __u32 device_id, void *data,
- size_t data_len, uint32_t *capabilities,
- uint8_t *max_pasid)
+static int _test_cmd_get_hw_info(int fd, __u32 device_id, __u32 data_type,
+ void *data, size_t data_len,
+ uint32_t *capabilities, uint8_t *max_pasid)
{
struct iommu_test_hw_info *info = (struct iommu_test_hw_info *)data;
struct iommu_hw_info cmd = {
.size = sizeof(cmd),
.dev_id = device_id,
.data_len = data_len,
+ .in_data_type = data_type,
.data_uptr = (uint64_t)data,
.out_capabilities = 0,
};
int ret;
+ if (data_type != IOMMU_HW_INFO_TYPE_DEFAULT)
+ cmd.flags |= IOMMU_HW_INFO_FLAG_INPUT_TYPE;
+
ret = ioctl(fd, IOMMU_GET_HW_INFO, &cmd);
if (ret)
return ret;
@@ -817,20 +821,23 @@ static int _test_cmd_get_hw_info(int fd, __u32 device_id, void *data,
return 0;
}
-#define test_cmd_get_hw_info(device_id, data, data_len) \
- ASSERT_EQ(0, _test_cmd_get_hw_info(self->fd, device_id, data, \
- data_len, NULL, NULL))
+#define test_cmd_get_hw_info(device_id, data_type, data, data_len) \
+ ASSERT_EQ(0, _test_cmd_get_hw_info(self->fd, device_id, data_type, \
+ data, data_len, NULL, NULL))
-#define test_err_get_hw_info(_errno, device_id, data, data_len) \
- EXPECT_ERRNO(_errno, _test_cmd_get_hw_info(self->fd, device_id, data, \
- data_len, NULL, NULL))
+#define test_err_get_hw_info(_errno, device_id, data_type, data, data_len) \
+ EXPECT_ERRNO(_errno, \
+ _test_cmd_get_hw_info(self->fd, device_id, data_type, \
+ data, data_len, NULL, NULL))
-#define test_cmd_get_hw_capabilities(device_id, caps, mask) \
- ASSERT_EQ(0, _test_cmd_get_hw_info(self->fd, device_id, NULL, \
+#define test_cmd_get_hw_capabilities(device_id, caps) \
+ ASSERT_EQ(0, _test_cmd_get_hw_info(self->fd, device_id, \
+ IOMMU_HW_INFO_TYPE_DEFAULT, NULL, \
0, &caps, NULL))
-#define test_cmd_get_hw_info_pasid(device_id, max_pasid) \
- ASSERT_EQ(0, _test_cmd_get_hw_info(self->fd, device_id, NULL, \
+#define test_cmd_get_hw_info_pasid(device_id, max_pasid) \
+ ASSERT_EQ(0, _test_cmd_get_hw_info(self->fd, device_id, \
+ IOMMU_HW_INFO_TYPE_DEFAULT, NULL, \
0, NULL, max_pasid))
static int _test_ioctl_fault_alloc(int fd, __u32 *fault_id, __u32 *fault_fd)
diff --git a/tools/testing/selftests/iommu/iommufd.c b/tools/testing/selftests/iommu/iommufd.c
index 235504ee05e3..fda93a195e26 100644
--- a/tools/testing/selftests/iommu/iommufd.c
+++ b/tools/testing/selftests/iommu/iommufd.c
@@ -764,19 +764,34 @@ TEST_F(iommufd_ioas, get_hw_info)
uint8_t max_pasid = 0;
/* Provide a zero-size user_buffer */
- test_cmd_get_hw_info(self->device_id, NULL, 0);
+ test_cmd_get_hw_info(self->device_id,
+ IOMMU_HW_INFO_TYPE_DEFAULT, NULL, 0);
/* Provide a user_buffer with exact size */
- test_cmd_get_hw_info(self->device_id, &buffer_exact, sizeof(buffer_exact));
+ test_cmd_get_hw_info(self->device_id,
+ IOMMU_HW_INFO_TYPE_DEFAULT, &buffer_exact,
+ sizeof(buffer_exact));
+
+ /* Request for a wrong data_type, and a correct one */
+ test_err_get_hw_info(EOPNOTSUPP, self->device_id,
+ IOMMU_HW_INFO_TYPE_SELFTEST + 1,
+ &buffer_exact, sizeof(buffer_exact));
+ test_cmd_get_hw_info(self->device_id,
+ IOMMU_HW_INFO_TYPE_SELFTEST, &buffer_exact,
+ sizeof(buffer_exact));
/*
* Provide a user_buffer with size larger than the exact size to check if
* kernel zero the trailing bytes.
*/
- test_cmd_get_hw_info(self->device_id, &buffer_larger, sizeof(buffer_larger));
+ test_cmd_get_hw_info(self->device_id,
+ IOMMU_HW_INFO_TYPE_DEFAULT, &buffer_larger,
+ sizeof(buffer_larger));
/*
* Provide a user_buffer with size smaller than the exact size to check if
* the fields within the size range still gets updated.
*/
- test_cmd_get_hw_info(self->device_id, &buffer_smaller, sizeof(buffer_smaller));
+ test_cmd_get_hw_info(self->device_id,
+ IOMMU_HW_INFO_TYPE_DEFAULT,
+ &buffer_smaller, sizeof(buffer_smaller));
test_cmd_get_hw_info_pasid(self->device_id, &max_pasid);
ASSERT_EQ(0, max_pasid);
if (variant->pasid_capable) {
@@ -786,9 +801,11 @@ TEST_F(iommufd_ioas, get_hw_info)
}
} else {
test_err_get_hw_info(ENOENT, self->device_id,
- &buffer_exact, sizeof(buffer_exact));
+ IOMMU_HW_INFO_TYPE_DEFAULT, &buffer_exact,
+ sizeof(buffer_exact));
test_err_get_hw_info(ENOENT, self->device_id,
- &buffer_larger, sizeof(buffer_larger));
+ IOMMU_HW_INFO_TYPE_DEFAULT, &buffer_larger,
+ sizeof(buffer_larger));
}
}
@@ -2175,8 +2192,7 @@ TEST_F(iommufd_dirty_tracking, device_dirty_capability)
test_cmd_hwpt_alloc(self->idev_id, self->ioas_id, 0, &hwpt_id);
test_cmd_mock_domain(hwpt_id, &stddev_id, NULL, NULL);
- test_cmd_get_hw_capabilities(self->idev_id, caps,
- IOMMU_HW_CAP_DIRTY_TRACKING);
+ test_cmd_get_hw_capabilities(self->idev_id, caps);
ASSERT_EQ(IOMMU_HW_CAP_DIRTY_TRACKING,
caps & IOMMU_HW_CAP_DIRTY_TRACKING);
diff --git a/tools/testing/selftests/iommu/iommufd_fail_nth.c b/tools/testing/selftests/iommu/iommufd_fail_nth.c
index 41c685bbd252..651fc9f13c08 100644
--- a/tools/testing/selftests/iommu/iommufd_fail_nth.c
+++ b/tools/testing/selftests/iommu/iommufd_fail_nth.c
@@ -667,8 +667,8 @@ TEST_FAIL_NTH(basic_fail_nth, device)
&self->stdev_id, NULL, &idev_id))
return -1;
- if (_test_cmd_get_hw_info(self->fd, idev_id, &info,
- sizeof(info), NULL, NULL))
+ if (_test_cmd_get_hw_info(self->fd, idev_id, IOMMU_HW_INFO_TYPE_DEFAULT,
+ &info, sizeof(info), NULL, NULL))
return -1;
if (_test_cmd_hwpt_alloc(self->fd, idev_id, ioas_id, 0,
--
2.43.0
* [PATCH v7 22/28] iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (20 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 21/28] iommufd/selftest: Update hw_info coverage for an input data_type Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 23/28] iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops Nicolin Chen
` (5 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
An impl driver may want to allocate its own type of vIOMMU object, or the
standard IOMMU_VIOMMU_TYPE_ARM_SMMUV3 with its own SW/HW bits set up, as the
tegra241-cmdqv driver will do with IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV.
Add vsmmu_size/type and vsmmu_init to struct arm_smmu_impl_ops. Prioritize
them in arm_smmu_get_viommu_size() and arm_vsmmu_init().
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 5 +++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 8 ++++++++
2 files changed, 13 insertions(+)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 7eed5c8c72dd..07589350b2a1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -16,6 +16,7 @@
#include <linux/sizes.h>
struct arm_smmu_device;
+struct arm_vsmmu;
/* MMIO registers */
#define ARM_SMMU_IDR0 0x0
@@ -720,6 +721,10 @@ struct arm_smmu_impl_ops {
int (*init_structures)(struct arm_smmu_device *smmu);
struct arm_smmu_cmdq *(*get_secondary_cmdq)(
struct arm_smmu_device *smmu, struct arm_smmu_cmdq_ent *ent);
+ const size_t vsmmu_size;
+ const enum iommu_viommu_type vsmmu_type;
+ int (*vsmmu_init)(struct arm_vsmmu *vsmmu,
+ const struct iommu_user_data *user_data);
};
/* An SMMUv3 instance */
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index eb9fe1f6311a..2ab1c6cf4aac 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -416,6 +416,10 @@ size_t arm_smmu_get_viommu_size(struct device *dev,
!(smmu->features & ARM_SMMU_FEAT_S2FWB))
return 0;
+ if (smmu->impl_ops && smmu->impl_ops->vsmmu_size &&
+ viommu_type == smmu->impl_ops->vsmmu_type)
+ return smmu->impl_ops->vsmmu_size;
+
if (viommu_type != IOMMU_VIOMMU_TYPE_ARM_SMMUV3)
return 0;
@@ -439,6 +443,10 @@ int arm_vsmmu_init(struct iommufd_viommu *viommu,
/* FIXME Move VMID allocation from the S2 domain allocation to here */
vsmmu->vmid = s2_parent->s2_cfg.vmid;
+ if (smmu->impl_ops && smmu->impl_ops->vsmmu_init &&
+ viommu->type == smmu->impl_ops->vsmmu_type)
+ return smmu->impl_ops->vsmmu_init(vsmmu, user_data);
+
viommu->ops = &arm_vsmmu_ops;
return 0;
}
--
2.43.0
* [PATCH v7 23/28] iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (21 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 22/28] iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-07-01 12:24 ` Pranjal Shrivastava
2025-06-26 19:34 ` [PATCH v7 24/28] iommu/tegra241-cmdqv: Use request_threaded_irq Nicolin Chen
` (4 subsequent siblings)
27 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
This will be used by the Tegra241 CMDQV implementation to report non-default
HW info data.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 +++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 8 ++++++--
2 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 07589350b2a1..836d5556008e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -721,6 +721,13 @@ struct arm_smmu_impl_ops {
int (*init_structures)(struct arm_smmu_device *smmu);
struct arm_smmu_cmdq *(*get_secondary_cmdq)(
struct arm_smmu_device *smmu, struct arm_smmu_cmdq_ent *ent);
+ /*
+ * An implementation should define its own type other than the default
+ * IOMMU_HW_INFO_TYPE_ARM_SMMUV3. And it must validate the input @type
+ * to return its own structure.
+ */
+ void *(*hw_info)(struct arm_smmu_device *smmu, u32 *length,
+ enum iommu_hw_info_type *type);
const size_t vsmmu_size;
const enum iommu_viommu_type vsmmu_type;
int (*vsmmu_init)(struct arm_vsmmu *vsmmu,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 2ab1c6cf4aac..1cf9646e776f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -11,13 +11,17 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length,
enum iommu_hw_info_type *type)
{
struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+ const struct arm_smmu_impl_ops *impl_ops = master->smmu->impl_ops;
struct iommu_hw_info_arm_smmuv3 *info;
u32 __iomem *base_idr;
unsigned int i;
if (*type != IOMMU_HW_INFO_TYPE_DEFAULT &&
- *type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
- return ERR_PTR(-EOPNOTSUPP);
+ *type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
+ if (!impl_ops || !impl_ops->hw_info)
+ return ERR_PTR(-EOPNOTSUPP);
+ return impl_ops->hw_info(master->smmu, length, type);
+ }
info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info)
--
2.43.0
* [PATCH v7 24/28] iommu/tegra241-cmdqv: Use request_threaded_irq
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (22 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 23/28] iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 25/28] iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf() Nicolin Chen
` (3 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
A vEVENT can be reported only from a threaded IRQ context. Switch to
request_threaded_irq() to support that.
Acked-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index dd7d030d2e89..ba029f7d24ce 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -824,8 +824,9 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res,
cmdqv->dev = smmu->impl_dev;
if (cmdqv->irq > 0) {
- ret = request_irq(irq, tegra241_cmdqv_isr, 0, "tegra241-cmdqv",
- cmdqv);
+ ret = request_threaded_irq(irq, NULL, tegra241_cmdqv_isr,
+ IRQF_ONESHOT, "tegra241-cmdqv",
+ cmdqv);
if (ret) {
dev_err(cmdqv->dev, "failed to request irq (%d): %d\n",
cmdqv->irq, ret);
--
2.43.0
* [PATCH v7 25/28] iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf()
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (23 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 24/28] iommu/tegra241-cmdqv: Use request_threaded_irq Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 26/28] iommu/tegra241-cmdqv: Do not statically map LVCMDQs Nicolin Chen
` (2 subsequent siblings)
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
The current flow of tegra241_cmdqv_remove_vintf() is:
1. For each LVCMDQ, tegra241_vintf_remove_lvcmdq():
a. Disable the LVCMDQ HW
b. Release the LVCMDQ SW resource
2. For current VINTF, tegra241_vintf_hw_deinit():
c. Disable all LVCMDQ HWs
d. Disable VINTF HW
Steps 1.a and 2.c are obviously redundant.
Since tegra241_vintf_hw_deinit() disables all of its LVCMDQ HWs, the flow in
tegra241_cmdqv_remove_vintf() can be simplified by calling it first:
1. For current VINTF, tegra241_vintf_hw_deinit():
a. Disable all LVCMDQ HWs
b. Disable VINTF HW
2. Release all LVCMDQ SW resources
Drop tegra241_vintf_remove_lvcmdq(), and move tegra241_vintf_free_lvcmdq()
to the new step 2.
Acked-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index ba029f7d24ce..8d418c131b1b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -628,24 +628,17 @@ static int tegra241_cmdqv_init_vintf(struct tegra241_cmdqv *cmdqv, u16 max_idx,
/* Remove Helpers */
-static void tegra241_vintf_remove_lvcmdq(struct tegra241_vintf *vintf, u16 lidx)
-{
- tegra241_vcmdq_hw_deinit(vintf->lvcmdqs[lidx]);
- tegra241_vintf_free_lvcmdq(vintf, lidx);
-}
-
static void tegra241_cmdqv_remove_vintf(struct tegra241_cmdqv *cmdqv, u16 idx)
{
struct tegra241_vintf *vintf = cmdqv->vintfs[idx];
u16 lidx;
+ tegra241_vintf_hw_deinit(vintf);
+
/* Remove LVCMDQ resources */
for (lidx = 0; lidx < vintf->cmdqv->num_lvcmdqs_per_vintf; lidx++)
if (vintf->lvcmdqs[lidx])
- tegra241_vintf_remove_lvcmdq(vintf, lidx);
-
- /* Remove VINTF resources */
- tegra241_vintf_hw_deinit(vintf);
+ tegra241_vintf_free_lvcmdq(vintf, lidx);
dev_dbg(cmdqv->dev, "VINTF%u: deallocated\n", vintf->idx);
tegra241_cmdqv_deinit_vintf(cmdqv, idx);
--
2.43.0
* [PATCH v7 26/28] iommu/tegra241-cmdqv: Do not statically map LVCMDQs
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (24 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 25/28] iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf() Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 28/28] iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support Nicolin Chen
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
To simplify the mappings from global VCMDQs to VINTFs' LVCMDQs, the design
chose to do static allocations and mappings in the global reset function.
However, with the user-owned VINTF support, this exposes a security concern:
if a user space VM only wants one LVCMDQ for a VINTF, statically mapping two
or more LVCMDQs creates a hidden VCMDQ that user space could DoS-attack by
writing random data to it, overwhelming the kernel with unhandleable IRQs.
Thus, to support the user-owned VINTF feature, an LVCMDQ mapping has to be
done dynamically.
HW allows pre-assigning global VCMDQs in the CMDQ_ALLOC registers without
finalizing the mappings, by keeping CMDQV_CMDQ_ALLOCATED=0. So, add a pair
of map/unmap helpers that simply set/clear that bit.
For kernel-owned VINTF0, move LVCMDQ mappings to tegra241_vintf_hw_init(),
and the unmappings to tegra241_vintf_hw_deinit().
For user-owned VINTFs to be added later, the mappings/unmappings will be done
on demand upon an LVCMDQ allocation from user space.
However, dynamic LVCMDQ mapping/unmapping can complicate the timing of calling
tegra241_vcmdq_hw_init/deinit(), which write the LVCMDQ address space and thus
require the LVCMDQ to be mapped. Highlight that with a note at the top of each
of them.
Acked-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 37 +++++++++++++++++--
1 file changed, 33 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index 8d418c131b1b..869c90b660c1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -351,6 +351,7 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
/* HW Reset Functions */
+/* This function is for LVCMDQ, so @vcmdq must not be unmapped yet */
static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
{
char header[64], *h = lvcmdq_error_header(vcmdq, header, 64);
@@ -379,6 +380,7 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
}
+/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
{
char header[64], *h = lvcmdq_error_header(vcmdq, header, 64);
@@ -404,16 +406,42 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
return 0;
}
+/* Unmap a global VCMDQ from the pre-assigned LVCMDQ */
+static void tegra241_vcmdq_unmap_lvcmdq(struct tegra241_vcmdq *vcmdq)
+{
+ u32 regval = readl(REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx)));
+ char header[64], *h = lvcmdq_error_header(vcmdq, header, 64);
+
+ writel(regval & ~CMDQV_CMDQ_ALLOCATED,
+ REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx)));
+ dev_dbg(vcmdq->cmdqv->dev, "%sunmapped\n", h);
+}
+
static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
{
- u16 lidx;
+ u16 lidx = vintf->cmdqv->num_lvcmdqs_per_vintf;
- for (lidx = 0; lidx < vintf->cmdqv->num_lvcmdqs_per_vintf; lidx++)
- if (vintf->lvcmdqs && vintf->lvcmdqs[lidx])
+ /* HW requires to unmap LVCMDQs in descending order */
+ while (lidx--) {
+ if (vintf->lvcmdqs && vintf->lvcmdqs[lidx]) {
tegra241_vcmdq_hw_deinit(vintf->lvcmdqs[lidx]);
+ tegra241_vcmdq_unmap_lvcmdq(vintf->lvcmdqs[lidx]);
+ }
+ }
vintf_write_config(vintf, 0);
}
+/* Map a global VCMDQ to the pre-assigned LVCMDQ */
+static void tegra241_vcmdq_map_lvcmdq(struct tegra241_vcmdq *vcmdq)
+{
+ u32 regval = readl(REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx)));
+ char header[64], *h = lvcmdq_error_header(vcmdq, header, 64);
+
+ writel(regval | CMDQV_CMDQ_ALLOCATED,
+ REG_CMDQV(vcmdq->cmdqv, CMDQ_ALLOC(vcmdq->idx)));
+ dev_dbg(vcmdq->cmdqv->dev, "%smapped\n", h);
+}
+
static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own)
{
u32 regval;
@@ -441,8 +469,10 @@ static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own)
*/
vintf->hyp_own = !!(VINTF_HYP_OWN & readl(REG_VINTF(vintf, CONFIG)));
+ /* HW requires to map LVCMDQs in ascending order */
for (lidx = 0; lidx < vintf->cmdqv->num_lvcmdqs_per_vintf; lidx++) {
if (vintf->lvcmdqs && vintf->lvcmdqs[lidx]) {
+ tegra241_vcmdq_map_lvcmdq(vintf->lvcmdqs[lidx]);
ret = tegra241_vcmdq_hw_init(vintf->lvcmdqs[lidx]);
if (ret) {
tegra241_vintf_hw_deinit(vintf);
@@ -476,7 +506,6 @@ static int tegra241_cmdqv_hw_reset(struct arm_smmu_device *smmu)
for (lidx = 0; lidx < cmdqv->num_lvcmdqs_per_vintf; lidx++) {
regval = FIELD_PREP(CMDQV_CMDQ_ALLOC_VINTF, idx);
regval |= FIELD_PREP(CMDQV_CMDQ_ALLOC_LVCMDQ, lidx);
- regval |= CMDQV_CMDQ_ALLOCATED;
writel_relaxed(regval,
REG_CMDQV(cmdqv, CMDQ_ALLOC(qidx++)));
}
--
2.43.0
* [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (25 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 26/28] iommu/tegra241-cmdqv: Do not statically map LVCMDQs Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
2025-07-01 16:02 ` Pranjal Shrivastava
2025-06-26 19:34 ` [PATCH v7 28/28] iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support Nicolin Chen
27 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
The CMDQV HW supports user-space use for virtualization cases. It allows
a VM to issue guest-level TLBI or ATC_INV commands directly to the queue
and executes them without a VMEXIT, as HW will replace the VMID field in a
TLBI command and the SID field in an ATC_INV command with the preset VMID
and SID.
This is built upon the vIOMMU infrastructure by allowing VMM to allocate a
VINTF (as a vIOMMU object) and assign VCMDQs (HW QUEUE objs) to the VINTF.
So first, replace the standard vSMMU model with the VINTF implementation
but reuse the standard cache_invalidate op (for unsupported commands) and
the standard alloc_domain_nested op (for standard nested STE).
Each VINTF has two 64KB MMIO pages (128B per logical VCMDQ):
- Page0 (directly accessed by guest) has all the control and status bits.
- Page1 (trapped by VMM) has guest-owned queue memory location/size info.
The VMM should trap the guest VM's emulated VINTF0 page1 for the guest-level
VCMDQ location/size info and forward it to the kernel, which translates it to
a physical memory location to program the VCMDQ HW during an allocation call.
Then, the VMM should mmap the assigned VINTF's page0 to the guest VM's VINTF0
page0. This allows the guest OS to read and write the guest-owned VINTF's
page0 for direct control of the VCMDQ HW.
ATC invalidation commands hold an SID, which requires all devices to register
their virtual SIDs in the SID_MATCH registers and their physical SIDs in the
pairing SID_REPLACE registers, so that HW can use them as a lookup table to
replace the virtual SIDs with the correct physical SIDs.
Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid
structure to allocate SID_REPLACE and to program the SIDs accordingly.
This enables the HW-accelerated feature for the NVIDIA Grace CPU. Compared to
the standard SMMUv3 operating in nested translation mode and trapping CMDQ
for TLBI and ATC_INV commands, this gives a huge performance improvement:
70% to 90% reductions in invalidation time were measured by various DMA
unmap tests running in a guest OS.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 +
include/uapi/linux/iommufd.h | 58 +++
.../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 6 +-
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 407 +++++++++++++++++-
4 files changed, 471 insertions(+), 7 deletions(-)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 836d5556008e..aa25156e04a3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -1056,10 +1056,17 @@ int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state);
void arm_smmu_master_clear_vmaster(struct arm_smmu_master *master);
int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt);
+struct iommu_domain *
+arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
+ const struct iommu_user_data *user_data);
+int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
+ struct iommu_user_data_array *array);
#else
#define arm_smmu_get_viommu_size NULL
#define arm_smmu_hw_info NULL
#define arm_vsmmu_init NULL
+#define arm_vsmmu_alloc_domain_nested NULL
+#define arm_vsmmu_cache_invalidate NULL
static inline int
arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 6ae9d2102154..1c9e486113e3 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -591,6 +591,27 @@ struct iommu_hw_info_arm_smmuv3 {
__u32 aidr;
};
+/**
+ * iommu_hw_info_tegra241_cmdqv - NVIDIA Tegra241 CMDQV Hardware Information
+ * (IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV)
+ * @flags: Must be 0
+ * @version: Version number for the CMDQ-V HW for PARAM bits[03:00]
+ * @log2vcmdqs: Log2 of the total number of VCMDQs for PARAM bits[07:04]
+ * @log2vsids: Log2 of the total number of SID replacements for PARAM bits[15:12]
+ * @__reserved: Must be 0
+ *
+ * VMM can use these fields directly in its emulated global PARAM register. Note
+ * that only one Virtual Interface (VINTF) should be exposed to a VM, i.e. PARAM
+ * bits[11:08] should be set to 0 for log2 of the total number of VINTFs.
+ */
+struct iommu_hw_info_tegra241_cmdqv {
+ __u32 flags;
+ __u8 version;
+ __u8 log2vcmdqs;
+ __u8 log2vsids;
+ __u8 __reserved;
+};
+
/**
* enum iommu_hw_info_type - IOMMU Hardware Info Types
* @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
@@ -598,12 +619,15 @@ struct iommu_hw_info_arm_smmuv3 {
* @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request for a default type
* @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
* @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
+ * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
+ * SMMUv3) info type
*/
enum iommu_hw_info_type {
IOMMU_HW_INFO_TYPE_NONE = 0,
IOMMU_HW_INFO_TYPE_DEFAULT = 0,
IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
+ IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV = 3,
};
/**
@@ -972,10 +996,29 @@ struct iommu_fault_alloc {
* enum iommu_viommu_type - Virtual IOMMU Type
* @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use
* @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type
+ * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
+ * SMMUv3) Virtual Interface (VINTF)
*/
enum iommu_viommu_type {
IOMMU_VIOMMU_TYPE_DEFAULT = 0,
IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1,
+ IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 2,
+};
+
+/**
+ * struct iommu_viommu_tegra241_cmdqv - NVIDIA Tegra241 CMDQV Virtual Interface
+ * (IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV)
+ * @out_vintf_mmap_offset: mmap offset argument for VINTF's page0
+ * @out_vintf_mmap_length: mmap length argument for VINTF's page0
+ *
+ * Both @out_vintf_mmap_offset and @out_vintf_mmap_length are reported by kernel
+ * for user space to mmap the VINTF page0 from the host physical address space
+ * to the guest physical address space so that a guest kernel can directly R/W
+ * access to the VINTF page0 in order to control its virtual command queues.
+ */
+struct iommu_viommu_tegra241_cmdqv {
+ __aligned_u64 out_vintf_mmap_offset;
+ __aligned_u64 out_vintf_mmap_length;
};
/**
@@ -1172,9 +1215,24 @@ struct iommu_veventq_alloc {
/**
* enum iommu_hw_queue_type - HW Queue Type
* @IOMMU_HW_QUEUE_TYPE_DEFAULT: Reserved for future use
+ * @IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
+ * SMMUv3) Virtual Command Queue (VCMDQ)
*/
enum iommu_hw_queue_type {
IOMMU_HW_QUEUE_TYPE_DEFAULT = 0,
+ /*
+ * TEGRA241_CMDQV requirements (otherwise, allocation will fail)
+ * - alloc starts from the lowest @index=0 in ascending order
+ * - destroy starts from the last allocated @index in descending order
+ * - @base_addr must be aligned to @length in bytes and mapped in IOAS
+ * - @length must be a power of 2, with a minimum of 32 bytes and a
+ *   maximum of 2 ^ idr[1].CMDQS * 16 bytes (use the GET_HW_INFO call to
+ *   read idr[1] from struct iommu_hw_info_arm_smmuv3)
+ * - it is suggested to back the queue memory with contiguous physical
+ *   pages or a single huge page aligned to the queue size, and to limit
+ *   the emulated vSMMU's IDR1.CMDQS to log2(huge page size / 16 bytes)
+ */
+ IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV = 1,
};
/**
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
index 1cf9646e776f..d9bea8f1f636 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
@@ -225,7 +225,7 @@ static int arm_smmu_validate_vste(struct iommu_hwpt_arm_smmuv3 *arg,
return 0;
}
-static struct iommu_domain *
+struct iommu_domain *
arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
const struct iommu_user_data *user_data)
{
@@ -336,8 +336,8 @@ static int arm_vsmmu_convert_user_cmd(struct arm_vsmmu *vsmmu,
return 0;
}
-static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
- struct iommu_user_data_array *array)
+int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
+ struct iommu_user_data_array *array)
{
struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
struct arm_smmu_device *smmu = vsmmu->smmu;
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index 869c90b660c1..e073b64553d5 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -8,7 +8,9 @@
#include <linux/dma-mapping.h>
#include <linux/interrupt.h>
#include <linux/iommu.h>
+#include <linux/iommufd.h>
#include <linux/iopoll.h>
+#include <uapi/linux/iommufd.h>
#include <acpi/acpixf.h>
@@ -26,8 +28,10 @@
#define CMDQV_EN BIT(0)
#define TEGRA241_CMDQV_PARAM 0x0004
+#define CMDQV_NUM_SID_PER_VM_LOG2 GENMASK(15, 12)
#define CMDQV_NUM_VINTF_LOG2 GENMASK(11, 8)
#define CMDQV_NUM_VCMDQ_LOG2 GENMASK(7, 4)
+#define CMDQV_VER GENMASK(3, 0)
#define TEGRA241_CMDQV_STATUS 0x0008
#define CMDQV_ENABLED BIT(0)
@@ -53,6 +57,9 @@
#define VINTF_STATUS GENMASK(3, 1)
#define VINTF_ENABLED BIT(0)
+#define TEGRA241_VINTF_SID_MATCH(s) (0x0040 + 0x4*(s))
+#define TEGRA241_VINTF_SID_REPLACE(s) (0x0080 + 0x4*(s))
+
#define TEGRA241_VINTF_LVCMDQ_ERR_MAP_64(m) \
(0x00C0 + 0x8*(m))
#define LVCMDQ_ERR_MAP_NUM_64 2
@@ -114,16 +121,20 @@ MODULE_PARM_DESC(bypass_vcmdq,
/**
* struct tegra241_vcmdq - Virtual Command Queue
+ * @core: Embedded iommufd_hw_queue structure
* @idx: Global index in the CMDQV
* @lidx: Local index in the VINTF
* @enabled: Enable status
* @cmdqv: Parent CMDQV pointer
* @vintf: Parent VINTF pointer
+ * @prev: Previous LVCMDQ to depend on
* @cmdq: Command Queue struct
* @page0: MMIO Page0 base address
* @page1: MMIO Page1 base address
*/
struct tegra241_vcmdq {
+ struct iommufd_hw_queue core;
+
u16 idx;
u16 lidx;
@@ -131,22 +142,30 @@ struct tegra241_vcmdq {
struct tegra241_cmdqv *cmdqv;
struct tegra241_vintf *vintf;
+ struct tegra241_vcmdq *prev;
struct arm_smmu_cmdq cmdq;
void __iomem *page0;
void __iomem *page1;
};
+#define hw_queue_to_vcmdq(v) container_of(v, struct tegra241_vcmdq, core)
/**
* struct tegra241_vintf - Virtual Interface
+ * @vsmmu: Embedded arm_vsmmu structure
* @idx: Global index in the CMDQV
* @enabled: Enable status
* @hyp_own: Owned by hypervisor (in-kernel)
* @cmdqv: Parent CMDQV pointer
* @lvcmdqs: List of logical VCMDQ pointers
+ * @lvcmdq_mutex: Lock to serialize user-allocated lvcmdqs
* @base: MMIO base address
+ * @mmap_offset: Offset argument for mmap() syscall
+ * @sids: Stream ID replacement resources
*/
struct tegra241_vintf {
+ struct arm_vsmmu vsmmu;
+
u16 idx;
bool enabled;
@@ -154,19 +173,41 @@ struct tegra241_vintf {
struct tegra241_cmdqv *cmdqv;
struct tegra241_vcmdq **lvcmdqs;
+ struct mutex lvcmdq_mutex; /* user space race */
void __iomem *base;
+ unsigned long mmap_offset;
+
+ struct ida sids;
+};
+#define viommu_to_vintf(v) container_of(v, struct tegra241_vintf, vsmmu.core)
+
+/**
+ * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
+ * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
+ * @vintf: Parent VINTF pointer
+ * @sid: Physical Stream ID
+ * @idx: Replacement index in the VINTF
+ */
+struct tegra241_vintf_sid {
+ struct iommufd_vdevice core;
+ struct tegra241_vintf *vintf;
+ u32 sid;
+ u8 idx;
};
+#define vdev_to_vsid(v) container_of(v, struct tegra241_vintf_sid, core)
/**
* struct tegra241_cmdqv - CMDQ-V for SMMUv3
* @smmu: SMMUv3 device
* @dev: CMDQV device
* @base: MMIO base address
+ * @base_phys: MMIO physical base address, for mmap
* @irq: IRQ number
* @num_vintfs: Total number of VINTFs
* @num_vcmdqs: Total number of VCMDQs
* @num_lvcmdqs_per_vintf: Number of logical VCMDQs per VINTF
+ * @num_sids_per_vintf: Total number of SID replacements per VINTF
* @vintf_ids: VINTF id allocator
* @vintfs: List of VINTFs
*/
@@ -175,12 +216,14 @@ struct tegra241_cmdqv {
struct device *dev;
void __iomem *base;
+ phys_addr_t base_phys;
int irq;
/* CMDQV Hardware Params */
u16 num_vintfs;
u16 num_vcmdqs;
u16 num_lvcmdqs_per_vintf;
+ u16 num_sids_per_vintf;
struct ida vintf_ids;
@@ -351,6 +394,29 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
/* HW Reset Functions */
+/*
+ * If a guest-owned VCMDQ is disabled while an ATC_INV command at the end of
+ * the guest queue has timed out without a following CMD_SYNC, the TIMEOUT will
+ * not be reported until this VCMDQ gets assigned to the next VM. That would be
+ * a false alarm, potentially causing unwanted behavior in the new VM. Thus, a
+ * guest-owned VCMDQ must flush any pending TIMEOUT when it gets disabled,
+ * which can be done by simply issuing a CMD_SYNC to the SMMU CMDQ.
+ */
+static void tegra241_vcmdq_hw_flush_timeout(struct tegra241_vcmdq *vcmdq)
+{
+ struct arm_smmu_device *smmu = &vcmdq->cmdqv->smmu;
+ u64 cmd_sync[CMDQ_ENT_DWORDS] = {};
+
+ cmd_sync[0] = FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
+ FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_NONE);
+
+ /*
+ * It does not hurt to insert another CMD_SYNC, taking advantage of
+ * arm_smmu_cmdq_issue_cmdlist(), which waits for the CMD_SYNC completion.
+ */
+ arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, cmd_sync, 1, true);
+}
+
/* This function is for LVCMDQ, so @vcmdq must not be unmapped yet */
static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
{
@@ -364,6 +430,8 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR)),
readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, CONS)));
}
+ tegra241_vcmdq_hw_flush_timeout(vcmdq);
+
writel_relaxed(0, REG_VCMDQ_PAGE0(vcmdq, PROD));
writel_relaxed(0, REG_VCMDQ_PAGE0(vcmdq, CONS));
writeq_relaxed(0, REG_VCMDQ_PAGE1(vcmdq, BASE));
@@ -380,6 +448,12 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
}
+/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
+static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
+{
+ writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
+}
+
/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
{
@@ -390,7 +464,7 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
tegra241_vcmdq_hw_deinit(vcmdq);
/* Configure and enable VCMDQ */
- writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
+ _tegra241_vcmdq_hw_init(vcmdq);
ret = vcmdq_write_config(vcmdq, VCMDQ_EN);
if (ret) {
@@ -420,6 +494,7 @@ static void tegra241_vcmdq_unmap_lvcmdq(struct tegra241_vcmdq *vcmdq)
static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
{
u16 lidx = vintf->cmdqv->num_lvcmdqs_per_vintf;
+ int sidx;
/* HW requires to unmap LVCMDQs in descending order */
while (lidx--) {
@@ -429,6 +504,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
}
}
vintf_write_config(vintf, 0);
+ for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) {
+ writel(0, REG_VINTF(vintf, SID_MATCH(sidx)));
+ writel(0, REG_VINTF(vintf, SID_REPLACE(sidx)));
+ }
}
/* Map a global VCMDQ to the pre-assigned LVCMDQ */
@@ -457,7 +536,8 @@ static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own)
* whether enabling it here or not, as !HYP_OWN cmdq HWs only support a
* restricted set of supported commands.
*/
- regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own);
+ regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own) |
+ FIELD_PREP(VINTF_VMID, vintf->vsmmu.vmid);
writel(regval, REG_VINTF(vintf, CONFIG));
ret = vintf_write_config(vintf, regval | VINTF_EN);
@@ -584,7 +664,9 @@ static void tegra241_vintf_free_lvcmdq(struct tegra241_vintf *vintf, u16 lidx)
dev_dbg(vintf->cmdqv->dev,
"%sdeallocated\n", lvcmdq_error_header(vcmdq, header, 64));
- kfree(vcmdq);
+ /* A guest-owned VCMDQ is freed along with the hw_queue by the iommufd core */
+ if (vcmdq->vintf->hyp_own)
+ kfree(vcmdq);
}
static struct tegra241_vcmdq *
@@ -671,7 +753,13 @@ static void tegra241_cmdqv_remove_vintf(struct tegra241_cmdqv *cmdqv, u16 idx)
dev_dbg(cmdqv->dev, "VINTF%u: deallocated\n", vintf->idx);
tegra241_cmdqv_deinit_vintf(cmdqv, idx);
- kfree(vintf);
+ if (!vintf->hyp_own) {
+ mutex_destroy(&vintf->lvcmdq_mutex);
+ ida_destroy(&vintf->sids);
+ /* A guest-owned VINTF is freed along with the viommu by the iommufd core */
+ } else {
+ kfree(vintf);
+ }
}
static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu)
@@ -699,10 +787,45 @@ static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu)
put_device(cmdqv->dev); /* smmu->impl_dev */
}
+static int
+tegra241_cmdqv_init_vintf_user(struct arm_vsmmu *vsmmu,
+ const struct iommu_user_data *user_data);
+
+static void *tegra241_cmdqv_hw_info(struct arm_smmu_device *smmu, u32 *length,
+ enum iommu_hw_info_type *type)
+{
+ struct tegra241_cmdqv *cmdqv =
+ container_of(smmu, struct tegra241_cmdqv, smmu);
+ struct iommu_hw_info_tegra241_cmdqv *info;
+ u32 regval;
+
+ if (*type != IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info)
+ return ERR_PTR(-ENOMEM);
+
+ regval = readl_relaxed(REG_CMDQV(cmdqv, PARAM));
+ info->log2vcmdqs = ilog2(cmdqv->num_lvcmdqs_per_vintf);
+ info->log2vsids = ilog2(cmdqv->num_sids_per_vintf);
+ info->version = FIELD_GET(CMDQV_VER, regval);
+
+ *length = sizeof(*info);
+ *type = IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV;
+ return info;
+}
+
static struct arm_smmu_impl_ops tegra241_cmdqv_impl_ops = {
+ /* For in-kernel use */
.get_secondary_cmdq = tegra241_cmdqv_get_cmdq,
.device_reset = tegra241_cmdqv_hw_reset,
.device_remove = tegra241_cmdqv_remove,
+ /* For user-space use */
+ .hw_info = tegra241_cmdqv_hw_info,
+ .vsmmu_size = VIOMMU_STRUCT_SIZE(struct tegra241_vintf, vsmmu.core),
+ .vsmmu_type = IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
+ .vsmmu_init = tegra241_cmdqv_init_vintf_user,
};
/* Probe Functions */
@@ -844,6 +967,7 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res,
cmdqv->irq = irq;
cmdqv->base = base;
cmdqv->dev = smmu->impl_dev;
+ cmdqv->base_phys = res->start;
if (cmdqv->irq > 0) {
ret = request_threaded_irq(irq, NULL, tegra241_cmdqv_isr,
@@ -860,6 +984,8 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res,
cmdqv->num_vintfs = 1 << FIELD_GET(CMDQV_NUM_VINTF_LOG2, regval);
cmdqv->num_vcmdqs = 1 << FIELD_GET(CMDQV_NUM_VCMDQ_LOG2, regval);
cmdqv->num_lvcmdqs_per_vintf = cmdqv->num_vcmdqs / cmdqv->num_vintfs;
+ cmdqv->num_sids_per_vintf =
+ 1 << FIELD_GET(CMDQV_NUM_SID_PER_VM_LOG2, regval);
cmdqv->vintfs =
kcalloc(cmdqv->num_vintfs, sizeof(*cmdqv->vintfs), GFP_KERNEL);
@@ -913,3 +1039,276 @@ struct arm_smmu_device *tegra241_cmdqv_probe(struct arm_smmu_device *smmu)
put_device(smmu->impl_dev);
return ERR_PTR(-ENODEV);
}
+
+/* User space VINTF and VCMDQ Functions */
+
+static size_t tegra241_vintf_get_vcmdq_size(struct iommufd_viommu *viommu,
+ enum iommu_hw_queue_type queue_type)
+{
+ if (queue_type != IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV)
+ return 0;
+ return HW_QUEUE_STRUCT_SIZE(struct tegra241_vcmdq, core);
+}
+
+static int tegra241_vcmdq_hw_init_user(struct tegra241_vcmdq *vcmdq)
+{
+ char header[64];
+
+ /* Configure the vcmdq only; user space does the enabling */
+ _tegra241_vcmdq_hw_init(vcmdq);
+
+ dev_dbg(vcmdq->cmdqv->dev, "%sinited at host PA 0x%llx size 0x%lx\n",
+ lvcmdq_error_header(vcmdq, header, 64),
+ vcmdq->cmdq.q.q_base & VCMDQ_ADDR,
+ 1UL << (vcmdq->cmdq.q.q_base & VCMDQ_LOG2SIZE));
+ return 0;
+}
+
+static void
+tegra241_vintf_destroy_lvcmdq_user(struct iommufd_hw_queue *hw_queue)
+{
+ struct tegra241_vcmdq *vcmdq = hw_queue_to_vcmdq(hw_queue);
+
+ mutex_lock(&vcmdq->vintf->lvcmdq_mutex);
+ tegra241_vcmdq_hw_deinit(vcmdq);
+ tegra241_vcmdq_unmap_lvcmdq(vcmdq);
+ tegra241_vintf_free_lvcmdq(vcmdq->vintf, vcmdq->lidx);
+ if (vcmdq->prev)
+ iommufd_hw_queue_undepend(vcmdq, vcmdq->prev, core);
+ mutex_unlock(&vcmdq->vintf->lvcmdq_mutex);
+}
+
+static int tegra241_vintf_alloc_lvcmdq_user(struct iommufd_hw_queue *hw_queue,
+ u32 lidx, phys_addr_t base_addr_pa)
+{
+ struct tegra241_vintf *vintf = viommu_to_vintf(hw_queue->viommu);
+ struct tegra241_vcmdq *vcmdq = hw_queue_to_vcmdq(hw_queue);
+ struct tegra241_cmdqv *cmdqv = vintf->cmdqv;
+ struct arm_smmu_device *smmu = &cmdqv->smmu;
+ struct tegra241_vcmdq *prev = NULL;
+ u32 log2size, max_n_shift;
+ char header[64];
+ int ret;
+
+ if (hw_queue->type != IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV)
+ return -EOPNOTSUPP;
+ if (lidx >= cmdqv->num_lvcmdqs_per_vintf)
+ return -EINVAL;
+
+ mutex_lock(&vintf->lvcmdq_mutex);
+
+ if (vintf->lvcmdqs[lidx]) {
+ ret = -EEXIST;
+ goto unlock;
+ }
+
+ /*
+ * HW requires LVCMDQs to be mapped in ascending order, so reject the
+ * allocation if the previous LVCMDQ is not allocated yet.
+ */
+ if (lidx) {
+ prev = vintf->lvcmdqs[lidx - 1];
+ if (!prev) {
+ ret = -EIO;
+ goto unlock;
+ }
+ }
+
+ /*
+ * hw_queue->length must be a power of 2, in range of
+ * [ 32, 2 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) ]
+ */
+ max_n_shift = FIELD_GET(IDR1_CMDQS,
+ readl_relaxed(smmu->base + ARM_SMMU_IDR1));
+ if (!is_power_of_2(hw_queue->length) || hw_queue->length < 32 ||
+ hw_queue->length > (1 << (max_n_shift + CMDQ_ENT_SZ_SHIFT))) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+ log2size = ilog2(hw_queue->length) - CMDQ_ENT_SZ_SHIFT;
+
+ /* base_addr_pa must be aligned to hw_queue->length */
+ if (base_addr_pa & ~VCMDQ_ADDR ||
+ base_addr_pa & (hw_queue->length - 1)) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ /*
+ * HW requires LVCMDQs to be unmapped in descending order, so destroy()
+ * must follow this rule. Set a dependency on the previous LVCMDQ so the
+ * iommufd core will help enforce it.
+ */
+ if (prev) {
+ ret = iommufd_hw_queue_depend(vcmdq, prev, core);
+ if (ret)
+ goto unlock;
+ }
+ vcmdq->prev = prev;
+
+ ret = tegra241_vintf_init_lvcmdq(vintf, lidx, vcmdq);
+ if (ret)
+ goto undepend_vcmdq;
+
+ dev_dbg(cmdqv->dev, "%sallocated\n",
+ lvcmdq_error_header(vcmdq, header, 64));
+
+ tegra241_vcmdq_map_lvcmdq(vcmdq);
+
+ vcmdq->cmdq.q.q_base = base_addr_pa & VCMDQ_ADDR;
+ vcmdq->cmdq.q.q_base |= log2size;
+
+ ret = tegra241_vcmdq_hw_init_user(vcmdq);
+ if (ret)
+ goto unmap_lvcmdq;
+
+ hw_queue->destroy = &tegra241_vintf_destroy_lvcmdq_user;
+ mutex_unlock(&vintf->lvcmdq_mutex);
+ return 0;
+
+unmap_lvcmdq:
+ tegra241_vcmdq_unmap_lvcmdq(vcmdq);
+ tegra241_vintf_deinit_lvcmdq(vintf, lidx);
+undepend_vcmdq:
+ if (vcmdq->prev)
+ iommufd_hw_queue_undepend(vcmdq, vcmdq->prev, core);
+unlock:
+ mutex_unlock(&vintf->lvcmdq_mutex);
+ return ret;
+}
+
+static void tegra241_cmdqv_destroy_vintf_user(struct iommufd_viommu *viommu)
+{
+ struct tegra241_vintf *vintf = viommu_to_vintf(viommu);
+
+ if (vintf->mmap_offset)
+ iommufd_viommu_destroy_mmap(&vintf->vsmmu.core,
+ vintf->mmap_offset);
+ tegra241_cmdqv_remove_vintf(vintf->cmdqv, vintf->idx);
+}
+
+static void tegra241_vintf_destroy_vsid(struct iommufd_vdevice *vdev)
+{
+ struct tegra241_vintf_sid *vsid = vdev_to_vsid(vdev);
+ struct tegra241_vintf *vintf = vsid->vintf;
+
+ writel(0, REG_VINTF(vintf, SID_MATCH(vsid->idx)));
+ writel(0, REG_VINTF(vintf, SID_REPLACE(vsid->idx)));
+ ida_free(&vintf->sids, vsid->idx);
+ dev_dbg(vintf->cmdqv->dev,
+ "VINTF%u: deallocated SID_REPLACE%d for pSID=%x\n", vintf->idx,
+ vsid->idx, vsid->sid);
+}
+
+static int tegra241_vintf_init_vsid(struct iommufd_vdevice *vdev)
+{
+ struct arm_smmu_master *master = dev_iommu_priv_get(vdev->dev);
+ struct tegra241_vintf *vintf = viommu_to_vintf(vdev->viommu);
+ struct tegra241_vintf_sid *vsid = vdev_to_vsid(vdev);
+ struct arm_smmu_stream *stream = &master->streams[0];
+ u64 virt_sid = vdev->virt_id;
+ int sidx;
+
+ if (virt_sid > UINT_MAX)
+ return -EINVAL;
+
+ WARN_ON_ONCE(master->num_streams != 1);
+
+ /* Find an empty pair of SID_REPLACE and SID_MATCH */
+ sidx = ida_alloc_max(&vintf->sids, vintf->cmdqv->num_sids_per_vintf - 1,
+ GFP_KERNEL);
+ if (sidx < 0)
+ return sidx;
+
+ writel(stream->id, REG_VINTF(vintf, SID_REPLACE(sidx)));
+ writel(virt_sid << 1 | 0x1, REG_VINTF(vintf, SID_MATCH(sidx)));
+ dev_dbg(vintf->cmdqv->dev,
+ "VINTF%u: allocated SID_REPLACE%d for pSID=%x, vSID=%x\n",
+ vintf->idx, sidx, stream->id, (u32)virt_sid);
+
+ vsid->idx = sidx;
+ vsid->vintf = vintf;
+ vsid->sid = stream->id;
+
+ vdev->destroy = &tegra241_vintf_destroy_vsid;
+ return 0;
+}
+
+static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
+ .destroy = tegra241_cmdqv_destroy_vintf_user,
+ .alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
+ .cache_invalidate = arm_vsmmu_cache_invalidate,
+ .vdevice_size = VDEVICE_STRUCT_SIZE(struct tegra241_vintf_sid, core),
+ .vdevice_init = tegra241_vintf_init_vsid,
+ .get_hw_queue_size = tegra241_vintf_get_vcmdq_size,
+ .hw_queue_init_phys = tegra241_vintf_alloc_lvcmdq_user,
+};
+
+static int
+tegra241_cmdqv_init_vintf_user(struct arm_vsmmu *vsmmu,
+ const struct iommu_user_data *user_data)
+{
+ struct tegra241_cmdqv *cmdqv =
+ container_of(vsmmu->smmu, struct tegra241_cmdqv, smmu);
+ struct tegra241_vintf *vintf = viommu_to_vintf(&vsmmu->core);
+ struct iommu_viommu_tegra241_cmdqv data;
+ phys_addr_t page0_base;
+ int ret;
+
+ if (!user_data)
+ return -EINVAL;
+
+ ret = iommu_copy_struct_from_user(&data, user_data,
+ IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
+ out_vintf_mmap_length);
+ if (ret)
+ return ret;
+
+ ret = tegra241_cmdqv_init_vintf(cmdqv, cmdqv->num_vintfs - 1, vintf);
+ if (ret < 0) {
+ dev_err(cmdqv->dev, "no more available vintf\n");
+ return ret;
+ }
+
+ /*
+ * Initialize the user-owned VINTF without an LVCMDQ, since for security
+ * reasons an LVCMDQ cannot be pre-allocated until user space requests one.
+ * This differs from the kernel-owned VINTF0, which has pre-assigned and
+ * pre-allocated global VCMDQs that are mapped to its LVCMDQs by the
+ * tegra241_vintf_hw_init() call.
+ */
+ ret = tegra241_vintf_hw_init(vintf, false);
+ if (ret)
+ goto deinit_vintf;
+
+ page0_base = cmdqv->base_phys + TEGRA241_VINTFi_PAGE0(vintf->idx);
+ ret = iommufd_viommu_alloc_mmap(&vintf->vsmmu.core, page0_base, SZ_64K,
+ &vintf->mmap_offset);
+ if (ret)
+ goto hw_deinit_vintf;
+
+ data.out_vintf_mmap_length = SZ_64K;
+ data.out_vintf_mmap_offset = vintf->mmap_offset;
+ ret = iommu_copy_struct_to_user(user_data, &data,
+ IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
+ out_vintf_mmap_length);
+ if (ret)
+ goto free_mmap;
+
+ ida_init(&vintf->sids);
+ mutex_init(&vintf->lvcmdq_mutex);
+
+ dev_dbg(cmdqv->dev, "VINTF%u: allocated with vmid (%d)\n", vintf->idx,
+ vintf->vsmmu.vmid);
+
+ vsmmu->core.ops = &tegra241_cmdqv_viommu_ops;
+ return 0;
+
+free_mmap:
+ iommufd_viommu_destroy_mmap(&vintf->vsmmu.core, vintf->mmap_offset);
+hw_deinit_vintf:
+ tegra241_vintf_hw_deinit(vintf);
+deinit_vintf:
+ tegra241_cmdqv_deinit_vintf(cmdqv, vintf->idx);
+ return ret;
+}
--
2.43.0
^ permalink raw reply related [flat|nested] 67+ messages in thread
* [PATCH v7 28/28] iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
` (26 preceding siblings ...)
2025-06-26 19:34 ` [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support Nicolin Chen
@ 2025-06-26 19:34 ` Nicolin Chen
27 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-06-26 19:34 UTC (permalink / raw)
To: jgg, kevin.tian, corbet, will
Cc: bagasdotme, robin.murphy, joro, thierry.reding, vdumpa, jonathanh,
shuah, jsnitsel, nathan, peterz, yi.l.liu, mshavit, praan,
zhangzekun11, iommu, linux-doc, linux-kernel, linux-arm-kernel,
linux-tegra, linux-kselftest, patches, mochs, alok.a.tiwari,
vasant.hegde, dwmw2, baolu.lu
Add a new vEVENTQ type for VINTFs that are assigned to user space.
Simply report the two 64-bit LVCMDQ_ERR_MAP register values.
Reviewed-by: Alok Tiwari <alok.a.tiwari@oracle.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
include/uapi/linux/iommufd.h | 15 +++++++++++++
.../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 22 +++++++++++++++++++
2 files changed, 37 insertions(+)
diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
index 1c9e486113e3..a2840beefa8c 100644
--- a/include/uapi/linux/iommufd.h
+++ b/include/uapi/linux/iommufd.h
@@ -1145,10 +1145,12 @@ struct iommufd_vevent_header {
* enum iommu_veventq_type - Virtual Event Queue Type
* @IOMMU_VEVENTQ_TYPE_DEFAULT: Reserved for future use
* @IOMMU_VEVENTQ_TYPE_ARM_SMMUV3: ARM SMMUv3 Virtual Event Queue
+ * @IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV Extension IRQ
*/
enum iommu_veventq_type {
IOMMU_VEVENTQ_TYPE_DEFAULT = 0,
IOMMU_VEVENTQ_TYPE_ARM_SMMUV3 = 1,
+ IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV = 2,
};
/**
@@ -1172,6 +1174,19 @@ struct iommu_vevent_arm_smmuv3 {
__aligned_le64 evt[4];
};
+/**
+ * struct iommu_vevent_tegra241_cmdqv - Tegra241 CMDQV IRQ
+ * (IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV)
+ * @lvcmdq_err_map: 128-bit logical VCMDQ error map, little-endian
+ * (refer to the LVCMDQ_ERR_MAP registers per VINTF)
+ *
+ * The 128-bit register value from HW exclusively reflects the error bits for
+ * the Virtual Interface represented by a vIOMMU object. It is read and
+ * reported directly.
+ */
+struct iommu_vevent_tegra241_cmdqv {
+ __aligned_le64 lvcmdq_err_map[2];
+};
+
/**
* struct iommu_veventq_alloc - ioctl(IOMMU_VEVENTQ_ALLOC)
* @size: sizeof(struct iommu_veventq_alloc)
diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
index e073b64553d5..d57a3bea948c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
+++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
@@ -295,6 +295,20 @@ static inline int vcmdq_write_config(struct tegra241_vcmdq *vcmdq, u32 regval)
/* ISR Functions */
+static void tegra241_vintf_user_handle_error(struct tegra241_vintf *vintf)
+{
+ struct iommufd_viommu *viommu = &vintf->vsmmu.core;
+ struct iommu_vevent_tegra241_cmdqv vevent_data;
+ int i;
+
+ for (i = 0; i < LVCMDQ_ERR_MAP_NUM_64; i++)
+ vevent_data.lvcmdq_err_map[i] =
+ readq_relaxed(REG_VINTF(vintf, LVCMDQ_ERR_MAP_64(i)));
+
+ iommufd_viommu_report_event(viommu, IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV,
+ &vevent_data, sizeof(vevent_data));
+}
+
static void tegra241_vintf0_handle_error(struct tegra241_vintf *vintf)
{
int i;
@@ -340,6 +354,14 @@ static irqreturn_t tegra241_cmdqv_isr(int irq, void *devid)
vintf_map &= ~BIT_ULL(0);
}
+ /* Handle other user VINTFs and their LVCMDQs */
+ while (vintf_map) {
+ unsigned long idx = __ffs64(vintf_map);
+
+ tegra241_vintf_user_handle_error(cmdqv->vintfs[idx]);
+ vintf_map &= ~BIT_ULL(idx);
+ }
+
return IRQ_HANDLED;
}
* Re: [PATCH v7 23/28] iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
2025-06-26 19:34 ` [PATCH v7 23/28] iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops Nicolin Chen
@ 2025-07-01 12:24 ` Pranjal Shrivastava
0 siblings, 0 replies; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 12:24 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:54PM -0700, Nicolin Chen wrote:
> This will be used by the Tegra241 CMDQV implementation to report
> non-default HW info data.
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 +++++++
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 8 ++++++--
> 2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 07589350b2a1..836d5556008e 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -721,6 +721,13 @@ struct arm_smmu_impl_ops {
> int (*init_structures)(struct arm_smmu_device *smmu);
> struct arm_smmu_cmdq *(*get_secondary_cmdq)(
> struct arm_smmu_device *smmu, struct arm_smmu_cmdq_ent *ent);
> + /*
> + * An implementation should define its own type other than the default
> + * IOMMU_HW_INFO_TYPE_ARM_SMMUV3. And it must validate the input @type
> + * to return its own structure.
> + */
> + void *(*hw_info)(struct arm_smmu_device *smmu, u32 *length,
> + enum iommu_hw_info_type *type);
Thanks for adding the comment, this looks good.
> const size_t vsmmu_size;
> const enum iommu_viommu_type vsmmu_type;
> int (*vsmmu_init)(struct arm_vsmmu *vsmmu,
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> index 2ab1c6cf4aac..1cf9646e776f 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> @@ -11,13 +11,17 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length,
> enum iommu_hw_info_type *type)
> {
> struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> + const struct arm_smmu_impl_ops *impl_ops = master->smmu->impl_ops;
> struct iommu_hw_info_arm_smmuv3 *info;
> u32 __iomem *base_idr;
> unsigned int i;
>
> if (*type != IOMMU_HW_INFO_TYPE_DEFAULT &&
> - *type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> - return ERR_PTR(-EOPNOTSUPP);
> + *type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
> + if (!impl_ops || !impl_ops->hw_info)
> + return ERR_PTR(-EOPNOTSUPP);
> + return impl_ops->hw_info(master->smmu, length, type);
> + }
>
> info = kzalloc(sizeof(*info), GFP_KERNEL);
> if (!info)
Reviewed-by: Pranjal Shrivastava <praan@google.com>
* Re: [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id
2025-06-26 19:34 ` [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id Nicolin Chen
@ 2025-07-01 12:30 ` Pranjal Shrivastava
2025-07-02 9:40 ` Tian, Kevin
2025-07-04 12:59 ` Jason Gunthorpe
2 siblings, 0 replies; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 12:30 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:33PM -0700, Nicolin Chen wrote:
> The "id" is too general to convey its meaning easily. Rename it explicitly to
> "virt_id" and update the kdocs for readability. No functional changes.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 7 ++++++-
> drivers/iommu/iommufd/driver.c | 2 +-
> drivers/iommu/iommufd/viommu.c | 4 ++--
> 3 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
> index 4f5e8cd99c96..09f895638f68 100644
> --- a/drivers/iommu/iommufd/iommufd_private.h
> +++ b/drivers/iommu/iommufd/iommufd_private.h
> @@ -634,7 +634,12 @@ struct iommufd_vdevice {
> struct iommufd_object obj;
> struct iommufd_viommu *viommu;
> struct device *dev;
> - u64 id; /* per-vIOMMU virtual ID */
> +
> + /*
> + * Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID of
> + * AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
> + */
> + u64 virt_id;
> };
>
> #ifdef CONFIG_IOMMUFD_TEST
> diff --git a/drivers/iommu/iommufd/driver.c b/drivers/iommu/iommufd/driver.c
> index 2fee399a148e..887719016804 100644
> --- a/drivers/iommu/iommufd/driver.c
> +++ b/drivers/iommu/iommufd/driver.c
> @@ -30,7 +30,7 @@ int iommufd_viommu_get_vdev_id(struct iommufd_viommu *viommu,
> xa_lock(&viommu->vdevs);
> xa_for_each(&viommu->vdevs, index, vdev) {
> if (vdev->dev == dev) {
> - *vdev_id = vdev->id;
> + *vdev_id = vdev->virt_id;
> rc = 0;
> break;
> }
> diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c
> index 25ac08fbb52a..bc8796e6684e 100644
> --- a/drivers/iommu/iommufd/viommu.c
> +++ b/drivers/iommu/iommufd/viommu.c
> @@ -111,7 +111,7 @@ void iommufd_vdevice_destroy(struct iommufd_object *obj)
> struct iommufd_viommu *viommu = vdev->viommu;
>
> /* xa_cmpxchg is okay to fail if alloc failed xa_cmpxchg previously */
> - xa_cmpxchg(&viommu->vdevs, vdev->id, vdev, NULL, GFP_KERNEL);
> + xa_cmpxchg(&viommu->vdevs, vdev->virt_id, vdev, NULL, GFP_KERNEL);
> refcount_dec(&viommu->obj.users);
> put_device(vdev->dev);
> }
> @@ -150,7 +150,7 @@ int iommufd_vdevice_alloc_ioctl(struct iommufd_ucmd *ucmd)
> goto out_put_idev;
> }
>
> - vdev->id = virt_id;
> + vdev->virt_id = virt_id;
> vdev->dev = idev->dev;
> get_device(idev->dev);
> vdev->viommu = viommu;
Reviewed-by: Pranjal Shrivastava <praan@google.com>
* Re: [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op
2025-06-26 19:34 ` [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op Nicolin Chen
@ 2025-07-01 12:48 ` Pranjal Shrivastava
2025-07-01 12:51 ` Pranjal Shrivastava
` (2 subsequent siblings)
3 siblings, 0 replies; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 12:48 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:34PM -0700, Nicolin Chen wrote:
> Replace u32 to make it clear. No functional changes.
>
> Also simplify the kdoc since the type itself is clear enough.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> include/linux/iommu.h | 6 +++---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 3 ++-
> drivers/iommu/intel/iommu.c | 3 ++-
> drivers/iommu/iommufd/selftest.c | 3 ++-
> 4 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 04548b18df28..b87c2841e6bc 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -563,8 +563,7 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size,
> * @capable: check capability
> * @hw_info: report iommu hardware information. The data buffer returned by this
> * op is allocated in the iommu driver and freed by the caller after
> - * use. The information type is one of enum iommu_hw_info_type defined
> - * in include/uapi/linux/iommufd.h.
> + * use.
> * @domain_alloc: Do not use in new drivers
> * @domain_alloc_identity: allocate an IDENTITY domain. Drivers should prefer to
> * use identity_domain instead. This should only be used
> @@ -623,7 +622,8 @@ iommu_copy_struct_from_full_user_array(void *kdst, size_t kdst_entry_size,
> */
> struct iommu_ops {
> bool (*capable)(struct device *dev, enum iommu_cap);
> - void *(*hw_info)(struct device *dev, u32 *length, u32 *type);
> + void *(*hw_info)(struct device *dev, u32 *length,
> + enum iommu_hw_info_type *type);
>
> /* Domain allocation and freeing by the iommu driver */
> #if IS_ENABLED(CONFIG_FSL_PAMU)
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> index 9f59c95a254c..69bbe39e28de 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> @@ -7,7 +7,8 @@
>
> #include "arm-smmu-v3.h"
>
> -void *arm_smmu_hw_info(struct device *dev, u32 *length, u32 *type)
> +void *arm_smmu_hw_info(struct device *dev, u32 *length,
> + enum iommu_hw_info_type *type)
> {
> struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> struct iommu_hw_info_arm_smmuv3 *info;
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 7aa3932251b2..850f1a6f548c 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4091,7 +4091,8 @@ static int intel_iommu_set_dev_pasid(struct iommu_domain *domain,
> return ret;
> }
>
> -static void *intel_iommu_hw_info(struct device *dev, u32 *length, u32 *type)
> +static void *intel_iommu_hw_info(struct device *dev, u32 *length,
> + enum iommu_hw_info_type *type)
> {
> struct device_domain_info *info = dev_iommu_priv_get(dev);
> struct intel_iommu *iommu = info->iommu;
> diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
> index 74ca955a766e..7a9abe3f47d5 100644
> --- a/drivers/iommu/iommufd/selftest.c
> +++ b/drivers/iommu/iommufd/selftest.c
> @@ -287,7 +287,8 @@ static struct iommu_domain mock_blocking_domain = {
> .ops = &mock_blocking_ops,
> };
>
> -static void *mock_domain_hw_info(struct device *dev, u32 *length, u32 *type)
> +static void *mock_domain_hw_info(struct device *dev, u32 *length,
> + enum iommu_hw_info_type *type)
> {
> struct iommu_test_hw_info *info;
>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 67+ messages in thread
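[Editor's note: the signature change in the patch above can be modeled outside the kernel. The sketch below is a hypothetical user-space stand-in (fake_hw_info_op and struct fake_hw_info are invented names, not kernel code) showing why taking `enum iommu_hw_info_type *` instead of `u32 *` makes the op's contract clearer to both readers and the compiler.]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors enum iommu_hw_info_type from include/uapi/linux/iommufd.h */
enum iommu_hw_info_type {
	IOMMU_HW_INFO_TYPE_NONE = 0,
	IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
};

/* Hypothetical stand-in for a driver's per-type info struct */
struct fake_hw_info {
	uint32_t idr0;
};

/*
 * After the patch, the op reports its type through a typed pointer
 * rather than a bare u32, so a mismatched assignment is visible.
 */
void *fake_hw_info_op(uint32_t *length, enum iommu_hw_info_type *type)
{
	struct fake_hw_info *info = calloc(1, sizeof(*info));

	if (!info)
		return NULL;
	*length = sizeof(*info);
	*type = IOMMU_HW_INFO_TYPE_ARM_SMMUV3; /* driver reports its type */
	return info;
}
```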
* Re: [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op
2025-06-26 19:34 ` [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op Nicolin Chen
2025-07-01 12:48 ` Pranjal Shrivastava
@ 2025-07-01 12:51 ` Pranjal Shrivastava
2025-07-02 9:41 ` Tian, Kevin
2025-07-04 13:00 ` Jason Gunthorpe
3 siblings, 0 replies; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 12:51 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:34PM -0700, Nicolin Chen wrote:
> [ quoted patch identical to the previous reply; trimmed ]
Reviewed-by: Pranjal Shrivastava <praan@google.com>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 19/28] iommu: Allow an input type in hw_info op
2025-06-26 19:34 ` [PATCH v7 19/28] iommu: Allow an input type in hw_info op Nicolin Chen
@ 2025-07-01 12:54 ` Pranjal Shrivastava
0 siblings, 0 replies; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 12:54 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:50PM -0700, Nicolin Chen wrote:
> The hw_info uAPI will support a bidirectional data_type field that can be
> used as an input field for user space to request for a specific info data.
>
> To prepare for the uAPI update, change the iommu layer first:
> - Add a new IOMMU_HW_INFO_TYPE_DEFAULT as an input, for which driver can
> output its only (or firstly) supported type
> - Update the kdoc accordingly
> - Roll out the type validation in the existing drivers
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> include/linux/iommu.h | 3 ++-
> include/uapi/linux/iommufd.h | 4 +++-
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 4 ++++
> drivers/iommu/intel/iommu.c | 4 ++++
> drivers/iommu/iommufd/device.c | 3 +++
> drivers/iommu/iommufd/selftest.c | 4 ++++
> 6 files changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index e06a0fbe4bc7..e8b59ef54e48 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -603,7 +603,8 @@ __iommu_copy_struct_to_user(const struct iommu_user_data *dst_data,
> * @capable: check capability
> * @hw_info: report iommu hardware information. The data buffer returned by this
> * op is allocated in the iommu driver and freed by the caller after
> - * use.
> + * use. @type can input a requested type and output a supported type.
> + * Driver should reject an unsupported data @type input
> * @domain_alloc: Do not use in new drivers
> * @domain_alloc_identity: allocate an IDENTITY domain. Drivers should prefer to
> * use identity_domain instead. This should only be used
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index f091ea072c5f..6ad361ff9b06 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -593,13 +593,15 @@ struct iommu_hw_info_arm_smmuv3 {
>
> /**
> * enum iommu_hw_info_type - IOMMU Hardware Info Types
> - * @IOMMU_HW_INFO_TYPE_NONE: Used by the drivers that do not report hardware
> + * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
> * info
> + * @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request for a default type
> * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
> */
> enum iommu_hw_info_type {
> IOMMU_HW_INFO_TYPE_NONE = 0,
> + IOMMU_HW_INFO_TYPE_DEFAULT = 0,
> IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
> IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
> };
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> index 170d69162848..eb9fe1f6311a 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> @@ -15,6 +15,10 @@ void *arm_smmu_hw_info(struct device *dev, u32 *length,
> u32 __iomem *base_idr;
> unsigned int i;
>
> + if (*type != IOMMU_HW_INFO_TYPE_DEFAULT &&
> + *type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3)
> + return ERR_PTR(-EOPNOTSUPP);
> +
> info = kzalloc(sizeof(*info), GFP_KERNEL);
> if (!info)
> return ERR_PTR(-ENOMEM);
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 850f1a6f548c..5f75faffca15 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -4098,6 +4098,10 @@ static void *intel_iommu_hw_info(struct device *dev, u32 *length,
> struct intel_iommu *iommu = info->iommu;
> struct iommu_hw_info_vtd *vtd;
>
> + if (*type != IOMMU_HW_INFO_TYPE_DEFAULT &&
> + *type != IOMMU_HW_INFO_TYPE_INTEL_VTD)
> + return ERR_PTR(-EOPNOTSUPP);
> +
> vtd = kzalloc(sizeof(*vtd), GFP_KERNEL);
> if (!vtd)
> return ERR_PTR(-ENOMEM);
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 8f078fda795a..64a51993e6a1 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -1519,6 +1519,9 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
> cmd->__reserved[2])
> return -EOPNOTSUPP;
>
> + /* Clear the type field since drivers don't support a random input */
> + cmd->out_data_type = IOMMU_HW_INFO_TYPE_DEFAULT;
> +
> idev = iommufd_get_device(ucmd, cmd->dev_id);
> if (IS_ERR(idev))
> return PTR_ERR(idev);
> diff --git a/drivers/iommu/iommufd/selftest.c b/drivers/iommu/iommufd/selftest.c
> index 8b2c44b32530..a5dc36219a90 100644
> --- a/drivers/iommu/iommufd/selftest.c
> +++ b/drivers/iommu/iommufd/selftest.c
> @@ -310,6 +310,10 @@ static void *mock_domain_hw_info(struct device *dev, u32 *length,
> {
> struct iommu_test_hw_info *info;
>
> + if (*type != IOMMU_HW_INFO_TYPE_DEFAULT &&
> + *type != IOMMU_HW_INFO_TYPE_SELFTEST)
> + return ERR_PTR(-EOPNOTSUPP);
> +
> info = kzalloc(sizeof(*info), GFP_KERNEL);
> if (!info)
> return ERR_PTR(-ENOMEM);
Reviewed-by: Pranjal Shrivastava <praan@google.com>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 67+ messages in thread
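[Editor's note: the type validation rolled out to each driver in the patch above follows one pattern. The sketch below factors that pattern into a hypothetical user-space helper (check_hw_info_type is an invented name; the kernel open-codes this check per driver) to show the intended semantics of IOMMU_HW_INFO_TYPE_DEFAULT.]

```c
#include <assert.h>
#include <errno.h>

/* Mirrors enum iommu_hw_info_type after this patch; DEFAULT aliases 0 */
enum iommu_hw_info_type {
	IOMMU_HW_INFO_TYPE_NONE = 0,
	IOMMU_HW_INFO_TYPE_DEFAULT = 0,
	IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
};

/*
 * A driver accepts either the DEFAULT request (caller takes whatever the
 * driver supports) or an exact match of its own type; anything else is
 * rejected with -EOPNOTSUPP, as each hw_info op does in the patch.
 */
int check_hw_info_type(enum iommu_hw_info_type requested,
		       enum iommu_hw_info_type supported)
{
	if (requested != IOMMU_HW_INFO_TYPE_DEFAULT && requested != supported)
		return -EOPNOTSUPP;
	return 0;
}
```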
* Re: [PATCH v7 20/28] iommufd: Allow an input data_type via iommu_hw_info
2025-06-26 19:34 ` [PATCH v7 20/28] iommufd: Allow an input data_type via iommu_hw_info Nicolin Chen
@ 2025-07-01 12:58 ` Pranjal Shrivastava
0 siblings, 0 replies; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 12:58 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:51PM -0700, Nicolin Chen wrote:
> The iommu_hw_info can output via the out_data_type field the vendor data
> type from a driver, but this only allows driver to report one data type.
>
> Now, with SMMUv3 having a Tegra241 CMDQV implementation, it has two sets
> of types and data structs to report.
>
> One way to support that is to use the same type field bidirectionally.
>
> Reuse the same field by adding an "in_data_type", allowing user space to
> request for a specific type and to get the corresponding data.
>
> For backward compatibility, since the ioctl handler has never checked an
> input value, add an IOMMU_HW_INFO_FLAG_INPUT_TYPE to switch between the
> old output-only field and the new bidirectional field.
>
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> include/uapi/linux/iommufd.h | 20 +++++++++++++++++++-
> drivers/iommu/iommufd/device.c | 9 ++++++---
> 2 files changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 6ad361ff9b06..6ae9d2102154 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -628,6 +628,15 @@ enum iommufd_hw_capabilities {
> IOMMU_HW_CAP_PCI_PASID_PRIV = 1 << 2,
> };
>
> +/**
> + * enum iommufd_hw_info_flags - Flags for iommu_hw_info
> + * @IOMMU_HW_INFO_FLAG_INPUT_TYPE: If set, @in_data_type carries an input type
> + * for user space to request for a specific info
> + */
> +enum iommufd_hw_info_flags {
> + IOMMU_HW_INFO_FLAG_INPUT_TYPE = 1 << 0,
> +};
> +
> /**
> * struct iommu_hw_info - ioctl(IOMMU_GET_HW_INFO)
> * @size: sizeof(struct iommu_hw_info)
> @@ -637,6 +646,12 @@ enum iommufd_hw_capabilities {
> * data that kernel supports
> * @data_uptr: User pointer to a user-space buffer used by the kernel to fill
> * the iommu type specific hardware information data
> + * @in_data_type: This shares the same field with @out_data_type, making it be
> + * a bidirectional field. When IOMMU_HW_INFO_FLAG_INPUT_TYPE is
> + * set, an input type carried via this @in_data_type field will
> + * be valid, requesting for the info data to the given type. If
> + * IOMMU_HW_INFO_FLAG_INPUT_TYPE is unset, any input value will
> + * be seen as IOMMU_HW_INFO_TYPE_DEFAULT
> * @out_data_type: Output the iommu hardware info type as defined in the enum
> * iommu_hw_info_type.
> * @out_capabilities: Output the generic iommu capability info type as defined
> @@ -666,7 +681,10 @@ struct iommu_hw_info {
> __u32 dev_id;
> __u32 data_len;
> __aligned_u64 data_uptr;
> - __u32 out_data_type;
> + union {
> + __u32 in_data_type;
> + __u32 out_data_type;
> + };
> __u8 out_max_pasid_log2;
> __u8 __reserved[3];
> __aligned_u64 out_capabilities;
> diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
> index 64a51993e6a1..cbd86aabdd1c 100644
> --- a/drivers/iommu/iommufd/device.c
> +++ b/drivers/iommu/iommufd/device.c
> @@ -1506,6 +1506,7 @@ EXPORT_SYMBOL_NS_GPL(iommufd_access_rw, "IOMMUFD");
>
> int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
> {
> + const u32 SUPPORTED_FLAGS = IOMMU_HW_INFO_FLAG_INPUT_TYPE;
> struct iommu_hw_info *cmd = ucmd->cmd;
> void __user *user_ptr = u64_to_user_ptr(cmd->data_uptr);
> const struct iommu_ops *ops;
> @@ -1515,12 +1516,14 @@ int iommufd_get_hw_info(struct iommufd_ucmd *ucmd)
> void *data;
> int rc;
>
> - if (cmd->flags || cmd->__reserved[0] || cmd->__reserved[1] ||
> - cmd->__reserved[2])
> + if (cmd->flags & ~SUPPORTED_FLAGS)
> + return -EOPNOTSUPP;
> + if (cmd->__reserved[0] || cmd->__reserved[1] || cmd->__reserved[2])
> return -EOPNOTSUPP;
>
> /* Clear the type field since drivers don't support a random input */
> - cmd->out_data_type = IOMMU_HW_INFO_TYPE_DEFAULT;
> + if (!(cmd->flags & IOMMU_HW_INFO_FLAG_INPUT_TYPE))
> + cmd->in_data_type = IOMMU_HW_INFO_TYPE_DEFAULT;
>
> idev = iommufd_get_device(ucmd, cmd->dev_id);
> if (IS_ERR(idev))
Reviewed-by: Pranjal Shrivastava <praan@google.com>
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 67+ messages in thread
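[Editor's note: the backward-compatibility logic for the bidirectional type field in the patch above can be sketched as follows. This is a hypothetical user-space model (resolve_hw_info_type is an invented helper, not a kernel function) of how iommufd_get_hw_info() treats in_data_type depending on IOMMU_HW_INFO_FLAG_INPUT_TYPE.]

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define IOMMU_HW_INFO_FLAG_INPUT_TYPE (1u << 0)

enum iommu_hw_info_type {
	IOMMU_HW_INFO_TYPE_DEFAULT = 0,
	IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
	IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
};

/*
 * Without the flag, old user space may pass garbage in the union, so the
 * input is overridden with DEFAULT; with the flag, the input is honored.
 * Unknown flags are rejected, matching the SUPPORTED_FLAGS check.
 */
int resolve_hw_info_type(uint32_t flags, uint32_t in_data_type,
			 enum iommu_hw_info_type *out)
{
	const uint32_t supported = IOMMU_HW_INFO_FLAG_INPUT_TYPE;

	if (flags & ~supported)
		return -EOPNOTSUPP;
	if (!(flags & IOMMU_HW_INFO_FLAG_INPUT_TYPE))
		*out = IOMMU_HW_INFO_TYPE_DEFAULT;
	else
		*out = (enum iommu_hw_info_type)in_data_type;
	return 0;
}
```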
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-06-26 19:34 ` [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support Nicolin Chen
@ 2025-07-01 16:02 ` Pranjal Shrivastava
2025-07-01 19:42 ` Nicolin Chen
0 siblings, 1 reply; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 16:02 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
> The CMDQV HW supports a user-space use for virtualization cases. It allows
> the VM to issue guest-level TLBI or ATC_INV commands directly to the queue
> and executes them without a VMEXIT, as HW will replace the VMID field in a
> TLBI command and the SID field in an ATC_INV command with the preset VMID
> and SID.
>
> This is built upon the vIOMMU infrastructure by allowing VMM to allocate a
> VINTF (as a vIOMMU object) and assign VCMDQs (HW QUEUE objs) to the VINTF.
>
> So firstly, replace the standard vSMMU model with the VINTF implementation
> but reuse the standard cache_invalidate op (for unsupported commands) and
> the standard alloc_domain_nested op (for standard nested STE).
>
> Each VINTF has two 64KB MMIO pages (128B per logical VCMDQ):
> - Page0 (directly accessed by guest) has all the control and status bits.
> - Page1 (trapped by VMM) has guest-owned queue memory location/size info.
>
> VMM should trap the emulated VINTF0's page1 of the guest VM for the guest-
> level VCMDQ location/size info and forward that to the kernel to translate
> to a physical memory location to program the VCMDQ HW during an allocation
> call. Then, it should mmap the assigned VINTF's page0 to the VINTF0 page0
> of the guest VM. This allows the guest OS to read and write the guest-own
> VINTF's page0 for direct control of the VCMDQ HW.
>
> For ATC invalidation commands that hold an SID, it requires all devices to
> register their virtual SIDs to the SID_MATCH registers and their physical
> SIDs to the pairing SID_REPLACE registers, so that HW can use those as a
> lookup table to replace those virtual SIDs with the correct physical SIDs.
> Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid
> structure to allocate SID_REPLACE and to program the SIDs accordingly.
>
> This enables the HW accelerated feature for NVIDIA Grace CPU. Compared to
> the standard SMMUv3 operating in the nested translation mode trapping CMDQ
> for TLBI and ATC_INV commands, this gives a huge performance improvement:
> 70% to 90% reductions of invalidation time were measured by various DMA
> unmap tests running in a guest OS.
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 7 +
> include/uapi/linux/iommufd.h | 58 +++
> .../arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 6 +-
> .../iommu/arm/arm-smmu-v3/tegra241-cmdqv.c | 407 +++++++++++++++++-
> 4 files changed, 471 insertions(+), 7 deletions(-)
>
Reviewed-by: Pranjal Shrivastava <praan@google.com>, with the following
nits:
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 836d5556008e..aa25156e04a3 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -1056,10 +1056,17 @@ int arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
> void arm_smmu_attach_commit_vmaster(struct arm_smmu_attach_state *state);
> void arm_smmu_master_clear_vmaster(struct arm_smmu_master *master);
> int arm_vmaster_report_event(struct arm_smmu_vmaster *vmaster, u64 *evt);
> +struct iommu_domain *
> +arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
> + const struct iommu_user_data *user_data);
> +int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
> + struct iommu_user_data_array *array);
> #else
> #define arm_smmu_get_viommu_size NULL
> #define arm_smmu_hw_info NULL
> #define arm_vsmmu_init NULL
> +#define arm_vsmmu_alloc_domain_nested NULL
> +#define arm_vsmmu_cache_invalidate NULL
>
> static inline int
> arm_smmu_attach_prepare_vmaster(struct arm_smmu_attach_state *state,
> diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h
> index 6ae9d2102154..1c9e486113e3 100644
> --- a/include/uapi/linux/iommufd.h
> +++ b/include/uapi/linux/iommufd.h
> @@ -591,6 +591,27 @@ struct iommu_hw_info_arm_smmuv3 {
> __u32 aidr;
> };
>
> +/**
> + * iommu_hw_info_tegra241_cmdqv - NVIDIA Tegra241 CMDQV Hardware Information
> + * (IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV)
> + * @flags: Must be 0
> + * @version: Version number for the CMDQ-V HW for PARAM bits[03:00]
> + * @log2vcmdqs: Log2 of the total number of VCMDQs for PARAM bits[07:04]
> + * @log2vsids: Log2 of the total number of SID replacements for PARAM bits[15:12]
> + * @__reserved: Must be 0
> + *
> + * VMM can use these fields directly in its emulated global PARAM register. Note
> + * that only one Virtual Interface (VINTF) should be exposed to a VM, i.e. PARAM
> + * bits[11:08] should be set to 0 for log2 of the total number of VINTFs.
> + */
> +struct iommu_hw_info_tegra241_cmdqv {
> + __u32 flags;
> + __u8 version;
> + __u8 log2vcmdqs;
> + __u8 log2vsids;
> + __u8 __reserved;
> +};
> +
> /**
> * enum iommu_hw_info_type - IOMMU Hardware Info Types
> * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
> @@ -598,12 +619,15 @@ struct iommu_hw_info_arm_smmuv3 {
> * @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request for a default type
> * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
> + * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> + * SMMUv3) info type
I know the goal here is to note that the Tegra241 CMDQV is an extension
of Arm SMMUv3, but this comment could be misread as saying the "type"
itself is an extension of IOMMU_HW_INFO_TYPE_ARM_SMMUV3. How about we
drop the word "extension" and use something like:
* @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV info type (NVIDIA's
* implementation of Arm SMMUv3 with CMDQ
* Virtualization)
Sorry to be nit-picky here; I know the code is clear, but I've seen that
people often read no further than the uAPI descriptions. Maybe we could
re-word this comment, here and everywhere else?
> */
> enum iommu_hw_info_type {
> IOMMU_HW_INFO_TYPE_NONE = 0,
> IOMMU_HW_INFO_TYPE_DEFAULT = 0,
> IOMMU_HW_INFO_TYPE_INTEL_VTD = 1,
> IOMMU_HW_INFO_TYPE_ARM_SMMUV3 = 2,
> + IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV = 3,
> };
>
> /**
> @@ -972,10 +996,29 @@ struct iommu_fault_alloc {
> * enum iommu_viommu_type - Virtual IOMMU Type
> * @IOMMU_VIOMMU_TYPE_DEFAULT: Reserved for future use
> * @IOMMU_VIOMMU_TYPE_ARM_SMMUV3: ARM SMMUv3 driver specific type
> + * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> + * SMMUv3) Virtual Interface (VINTF)
> */
> enum iommu_viommu_type {
> IOMMU_VIOMMU_TYPE_DEFAULT = 0,
> IOMMU_VIOMMU_TYPE_ARM_SMMUV3 = 1,
> + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV = 2,
> +};
> +
> +/**
> + * struct iommu_viommu_tegra241_cmdqv - NVIDIA Tegra241 CMDQV Virtual Interface
> + * (IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV)
> + * @out_vintf_mmap_offset: mmap offset argument for VINTF's page0
> + * @out_vintf_mmap_length: mmap length argument for VINTF's page0
> + *
> + * Both @out_vintf_mmap_offset and @out_vintf_mmap_length are reported by kernel
> + * for user space to mmap the VINTF page0 from the host physical address space
> + * to the guest physical address space so that a guest kernel can directly R/W
> + * access to the VINTF page0 in order to control its virtual command queues.
> + */
> +struct iommu_viommu_tegra241_cmdqv {
> + __aligned_u64 out_vintf_mmap_offset;
> + __aligned_u64 out_vintf_mmap_length;
> };
>
> /**
> @@ -1172,9 +1215,24 @@ struct iommu_veventq_alloc {
> /**
> * enum iommu_hw_queue_type - HW Queue Type
> * @IOMMU_HW_QUEUE_TYPE_DEFAULT: Reserved for future use
> + * @IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> + * SMMUv3) Virtual Command Queue (VCMDQ)
> */
> enum iommu_hw_queue_type {
> IOMMU_HW_QUEUE_TYPE_DEFAULT = 0,
> + /*
> + * TEGRA241_CMDQV requirements (otherwise, allocation will fail)
> + * - alloc starts from the lowest @index=0 in ascending order
> + * - destroy starts from the last allocated @index in descending order
> + * - @base_addr must be aligned to @length in bytes and mapped in IOAS
> + * - @length must be a power of 2, with a minimum 32 bytes and a maximum
> + * 2 ^ idr[1].CMDQS * 16 bytes (use GET_HW_INFO call to read idr[1]
> + * from struct iommu_hw_info_arm_smmuv3)
> + * - suggest to back the queue memory with contiguous physical pages or
> + * a single huge page with alignment of the queue size, and limit the
> + * emulated vSMMU's IDR1.CMDQS to log2(huge page size / 16 bytes)
> + */
> + IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV = 1,
> };
>
> /**
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> index 1cf9646e776f..d9bea8f1f636 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c
> @@ -225,7 +225,7 @@ static int arm_smmu_validate_vste(struct iommu_hwpt_arm_smmuv3 *arg,
> return 0;
> }
>
> -static struct iommu_domain *
> +struct iommu_domain *
> arm_vsmmu_alloc_domain_nested(struct iommufd_viommu *viommu, u32 flags,
> const struct iommu_user_data *user_data)
> {
> @@ -336,8 +336,8 @@ static int arm_vsmmu_convert_user_cmd(struct arm_vsmmu *vsmmu,
> return 0;
> }
>
> -static int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
> - struct iommu_user_data_array *array)
> +int arm_vsmmu_cache_invalidate(struct iommufd_viommu *viommu,
> + struct iommu_user_data_array *array)
> {
> struct arm_vsmmu *vsmmu = container_of(viommu, struct arm_vsmmu, core);
> struct arm_smmu_device *smmu = vsmmu->smmu;
> diff --git a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
> index 869c90b660c1..e073b64553d5 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c
> @@ -8,7 +8,9 @@
> #include <linux/dma-mapping.h>
> #include <linux/interrupt.h>
> #include <linux/iommu.h>
> +#include <linux/iommufd.h>
> #include <linux/iopoll.h>
> +#include <uapi/linux/iommufd.h>
>
> #include <acpi/acpixf.h>
>
> @@ -26,8 +28,10 @@
> #define CMDQV_EN BIT(0)
>
> #define TEGRA241_CMDQV_PARAM 0x0004
> +#define CMDQV_NUM_SID_PER_VM_LOG2 GENMASK(15, 12)
> #define CMDQV_NUM_VINTF_LOG2 GENMASK(11, 8)
> #define CMDQV_NUM_VCMDQ_LOG2 GENMASK(7, 4)
> +#define CMDQV_VER GENMASK(3, 0)
>
> #define TEGRA241_CMDQV_STATUS 0x0008
> #define CMDQV_ENABLED BIT(0)
> @@ -53,6 +57,9 @@
> #define VINTF_STATUS GENMASK(3, 1)
> #define VINTF_ENABLED BIT(0)
>
> +#define TEGRA241_VINTF_SID_MATCH(s) (0x0040 + 0x4*(s))
> +#define TEGRA241_VINTF_SID_REPLACE(s) (0x0080 + 0x4*(s))
> +
> #define TEGRA241_VINTF_LVCMDQ_ERR_MAP_64(m) \
> (0x00C0 + 0x8*(m))
> #define LVCMDQ_ERR_MAP_NUM_64 2
> @@ -114,16 +121,20 @@ MODULE_PARM_DESC(bypass_vcmdq,
>
> /**
> * struct tegra241_vcmdq - Virtual Command Queue
> + * @core: Embedded iommufd_hw_queue structure
> * @idx: Global index in the CMDQV
> * @lidx: Local index in the VINTF
> * @enabled: Enable status
> * @cmdqv: Parent CMDQV pointer
> * @vintf: Parent VINTF pointer
> + * @prev: Previous LVCMDQ to depend on
> * @cmdq: Command Queue struct
> * @page0: MMIO Page0 base address
> * @page1: MMIO Page1 base address
> */
> struct tegra241_vcmdq {
> + struct iommufd_hw_queue core;
> +
> u16 idx;
> u16 lidx;
>
> @@ -131,22 +142,30 @@ struct tegra241_vcmdq {
>
> struct tegra241_cmdqv *cmdqv;
> struct tegra241_vintf *vintf;
> + struct tegra241_vcmdq *prev;
> struct arm_smmu_cmdq cmdq;
>
> void __iomem *page0;
> void __iomem *page1;
> };
> +#define hw_queue_to_vcmdq(v) container_of(v, struct tegra241_vcmdq, core)
>
> /**
> * struct tegra241_vintf - Virtual Interface
> + * @vsmmu: Embedded arm_vsmmu structure
> * @idx: Global index in the CMDQV
> * @enabled: Enable status
> * @hyp_own: Owned by hypervisor (in-kernel)
> * @cmdqv: Parent CMDQV pointer
> * @lvcmdqs: List of logical VCMDQ pointers
> + * @lvcmdq_mutex: Lock to serialize user-allocated lvcmdqs
> * @base: MMIO base address
> + * @mmap_offset: Offset argument for mmap() syscall
> + * @sids: Stream ID replacement resources
> */
> struct tegra241_vintf {
> + struct arm_vsmmu vsmmu;
> +
> u16 idx;
>
> bool enabled;
> @@ -154,19 +173,41 @@ struct tegra241_vintf {
>
> struct tegra241_cmdqv *cmdqv;
> struct tegra241_vcmdq **lvcmdqs;
> + struct mutex lvcmdq_mutex; /* user space race */
>
> void __iomem *base;
> + unsigned long mmap_offset;
> +
> + struct ida sids;
> +};
> +#define viommu_to_vintf(v) container_of(v, struct tegra241_vintf, vsmmu.core)
> +
> +/**
> + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> + * @vintf: Parent VINTF pointer
> + * @sid: Physical Stream ID
> + * @idx: Replacement index in the VINTF
> + */
> +struct tegra241_vintf_sid {
> + struct iommufd_vdevice core;
> + struct tegra241_vintf *vintf;
> + u32 sid;
> + u8 idx;
> };
AFAIU, this seems to be a handle for the sid -> vintf mapping. If so,
I'm not sure that "Virtual Interface Stream ID Replacement" makes that
clear?
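[Editor's note: the SID_MATCH/SID_REPLACE pairing discussed here can be modeled as a small lookup table. The sketch below is a hypothetical user-space model (sid_slot and sid_replace are invented names) of how the HW conceptually rewrites the virtual SID in an ATC_INV command with the preset physical SID; it is not the driver's register-programming code.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sid_slot {
	uint32_t match;   /* virtual SID, as programmed into SID_MATCH */
	uint32_t replace; /* physical SID, as programmed into SID_REPLACE */
	int used;
};

/*
 * Walk the per-VINTF slots; on a SID_MATCH hit, substitute the paired
 * SID_REPLACE value. No hit means the command cannot be translated.
 */
int sid_replace(const struct sid_slot *slots, size_t n, uint32_t vsid,
		uint32_t *psid)
{
	for (size_t i = 0; i < n; i++) {
		if (slots[i].used && slots[i].match == vsid) {
			*psid = slots[i].replace;
			return 0;
		}
	}
	return -1;
}
```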
> +#define vdev_to_vsid(v) container_of(v, struct tegra241_vintf_sid, core)
>
> /**
> * struct tegra241_cmdqv - CMDQ-V for SMMUv3
> * @smmu: SMMUv3 device
> * @dev: CMDQV device
> * @base: MMIO base address
> + * @base_phys: MMIO physical base address, for mmap
> * @irq: IRQ number
> * @num_vintfs: Total number of VINTFs
> * @num_vcmdqs: Total number of VCMDQs
> * @num_lvcmdqs_per_vintf: Number of logical VCMDQs per VINTF
> + * @num_sids_per_vintf: Total number of SID replacements per VINTF
> * @vintf_ids: VINTF id allocator
> * @vintfs: List of VINTFs
> */
> @@ -175,12 +216,14 @@ struct tegra241_cmdqv {
> struct device *dev;
>
> void __iomem *base;
> + phys_addr_t base_phys;
> int irq;
>
> /* CMDQV Hardware Params */
> u16 num_vintfs;
> u16 num_vcmdqs;
> u16 num_lvcmdqs_per_vintf;
> + u16 num_sids_per_vintf;
>
> struct ida vintf_ids;
>
> @@ -351,6 +394,29 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
>
> /* HW Reset Functions */
>
> +/*
> + * When a guest-owned VCMDQ is disabled, if the guest did not enqueue a CMD_SYNC
> + * following an ATC_INV command at the end of the guest queue while this ATC_INV
> + * is timed out, the TIMEOUT will not be reported until this VCMDQ gets assigned
> + * to the next VM, which will be a false alarm potentially causing some unwanted
> + * behavior in the new VM. Thus, a guest-owned VCMDQ must flush the TIMEOUT when
> + * it gets disabled. This can be done by just issuing a CMD_SYNC to SMMU CMDQ.
> + */
> +static void tegra241_vcmdq_hw_flush_timeout(struct tegra241_vcmdq *vcmdq)
> +{
> + struct arm_smmu_device *smmu = &vcmdq->cmdqv->smmu;
> + u64 cmd_sync[CMDQ_ENT_DWORDS] = {};
> +
> + cmd_sync[0] = FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
> + FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_NONE);
> +
> + /*
> + * It does not hurt to insert another CMD_SYNC, taking advantage of the
> + * arm_smmu_cmdq_issue_cmdlist() that waits for the CMD_SYNC completion.
> + */
> + arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, cmd_sync, 1, true);
> +}
If I'm getting this right, it issues a CMD_SYNC to the Host's CMDQ i.e.
the non-CMDQV CMDQ, the main CMDQ of the SMMUv3? (i.e. the CMDQ present
without the Tegra241 CMDQV extension?)
so.. basically on every VM switch, there would be an additional CMD_SYNC
issued to the non-CMDQV CMDQ to flush the TIMEOUT and we'll poll for
its completion?
> +
> /* This function is for LVCMDQ, so @vcmdq must not be unmapped yet */
> static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
> {
> @@ -364,6 +430,8 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
> readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, GERROR)),
> readl_relaxed(REG_VCMDQ_PAGE0(vcmdq, CONS)));
> }
> + tegra241_vcmdq_hw_flush_timeout(vcmdq);
> +
> writel_relaxed(0, REG_VCMDQ_PAGE0(vcmdq, PROD));
> writel_relaxed(0, REG_VCMDQ_PAGE0(vcmdq, CONS));
> writeq_relaxed(0, REG_VCMDQ_PAGE1(vcmdq, BASE));
> @@ -380,6 +448,12 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
> dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
> }
>
> +/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
> +static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
> +{
> + writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
> +}
> +
Not sure why we broke this off into a separate function; will there be more
stuff here, or is this just to reuse it in tegra241_vcmdq_hw_init_user as well?
> /* This function is for LVCMDQ, so @vcmdq must be mapped prior */
> static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
> {
> @@ -390,7 +464,7 @@ static int tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
> tegra241_vcmdq_hw_deinit(vcmdq);
>
> /* Configure and enable VCMDQ */
> - writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
> + _tegra241_vcmdq_hw_init(vcmdq);
>
> ret = vcmdq_write_config(vcmdq, VCMDQ_EN);
> if (ret) {
> @@ -420,6 +494,7 @@ static void tegra241_vcmdq_unmap_lvcmdq(struct tegra241_vcmdq *vcmdq)
> static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
> {
> u16 lidx = vintf->cmdqv->num_lvcmdqs_per_vintf;
> + int sidx;
>
> /* HW requires to unmap LVCMDQs in descending order */
> while (lidx--) {
> @@ -429,6 +504,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
> }
> }
> vintf_write_config(vintf, 0);
> + for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) {
> + writel(0, REG_VINTF(vintf, SID_MATCH(sidx)));
> + writel(0, REG_VINTF(vintf, SID_REPLACE(sidx)));
> + }
> }
I'm assuming we call the de-init while switching VMs and hence we need
to clear these to avoid spurious SID replacements in the new VM? Or do
they not reset to 0 when the HW is reset?
>
> /* Map a global VCMDQ to the pre-assigned LVCMDQ */
> @@ -457,7 +536,8 @@ static int tegra241_vintf_hw_init(struct tegra241_vintf *vintf, bool hyp_own)
> * whether enabling it here or not, as !HYP_OWN cmdq HWs only support a
> * restricted set of supported commands.
> */
> - regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own);
> + regval = FIELD_PREP(VINTF_HYP_OWN, hyp_own) |
> + FIELD_PREP(VINTF_VMID, vintf->vsmmu.vmid);
> writel(regval, REG_VINTF(vintf, CONFIG));
>
> ret = vintf_write_config(vintf, regval | VINTF_EN);
> @@ -584,7 +664,9 @@ static void tegra241_vintf_free_lvcmdq(struct tegra241_vintf *vintf, u16 lidx)
>
> dev_dbg(vintf->cmdqv->dev,
> "%sdeallocated\n", lvcmdq_error_header(vcmdq, header, 64));
> - kfree(vcmdq);
> + /* Guest-owned VCMDQ is free-ed with hw_queue by iommufd core */
> + if (vcmdq->vintf->hyp_own)
> + kfree(vcmdq);
> }
>
> static struct tegra241_vcmdq *
> @@ -671,7 +753,13 @@ static void tegra241_cmdqv_remove_vintf(struct tegra241_cmdqv *cmdqv, u16 idx)
>
> dev_dbg(cmdqv->dev, "VINTF%u: deallocated\n", vintf->idx);
> tegra241_cmdqv_deinit_vintf(cmdqv, idx);
> - kfree(vintf);
> + if (!vintf->hyp_own) {
> + mutex_destroy(&vintf->lvcmdq_mutex);
> + ida_destroy(&vintf->sids);
> + /* Guest-owned VINTF is free-ed with viommu by iommufd core */
> + } else {
> + kfree(vintf);
> + }
> }
>
> static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu)
> @@ -699,10 +787,45 @@ static void tegra241_cmdqv_remove(struct arm_smmu_device *smmu)
> put_device(cmdqv->dev); /* smmu->impl_dev */
> }
>
> +static int
> +tegra241_cmdqv_init_vintf_user(struct arm_vsmmu *vsmmu,
> + const struct iommu_user_data *user_data);
> +
> +static void *tegra241_cmdqv_hw_info(struct arm_smmu_device *smmu, u32 *length,
> + enum iommu_hw_info_type *type)
> +{
> + struct tegra241_cmdqv *cmdqv =
> + container_of(smmu, struct tegra241_cmdqv, smmu);
> + struct iommu_hw_info_tegra241_cmdqv *info;
> + u32 regval;
> +
> + if (*type != IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV)
> + return ERR_PTR(-EOPNOTSUPP);
> +
> + info = kzalloc(sizeof(*info), GFP_KERNEL);
> + if (!info)
> + return ERR_PTR(-ENOMEM);
> +
> + regval = readl_relaxed(REG_CMDQV(cmdqv, PARAM));
> + info->log2vcmdqs = ilog2(cmdqv->num_lvcmdqs_per_vintf);
> + info->log2vsids = ilog2(cmdqv->num_sids_per_vintf);
> + info->version = FIELD_GET(CMDQV_VER, regval);
> +
> + *length = sizeof(*info);
> + *type = IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV;
> + return info;
> +}
> +
> static struct arm_smmu_impl_ops tegra241_cmdqv_impl_ops = {
> + /* For in-kernel use */
> .get_secondary_cmdq = tegra241_cmdqv_get_cmdq,
> .device_reset = tegra241_cmdqv_hw_reset,
> .device_remove = tegra241_cmdqv_remove,
> + /* For user-space use */
> + .hw_info = tegra241_cmdqv_hw_info,
> + .vsmmu_size = VIOMMU_STRUCT_SIZE(struct tegra241_vintf, vsmmu.core),
> + .vsmmu_type = IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
> + .vsmmu_init = tegra241_cmdqv_init_vintf_user,
> };
>
> /* Probe Functions */
> @@ -844,6 +967,7 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res,
> cmdqv->irq = irq;
> cmdqv->base = base;
> cmdqv->dev = smmu->impl_dev;
> + cmdqv->base_phys = res->start;
>
> if (cmdqv->irq > 0) {
> ret = request_threaded_irq(irq, NULL, tegra241_cmdqv_isr,
> @@ -860,6 +984,8 @@ __tegra241_cmdqv_probe(struct arm_smmu_device *smmu, struct resource *res,
> cmdqv->num_vintfs = 1 << FIELD_GET(CMDQV_NUM_VINTF_LOG2, regval);
> cmdqv->num_vcmdqs = 1 << FIELD_GET(CMDQV_NUM_VCMDQ_LOG2, regval);
> cmdqv->num_lvcmdqs_per_vintf = cmdqv->num_vcmdqs / cmdqv->num_vintfs;
> + cmdqv->num_sids_per_vintf =
> + 1 << FIELD_GET(CMDQV_NUM_SID_PER_VM_LOG2, regval);
>
> cmdqv->vintfs =
> kcalloc(cmdqv->num_vintfs, sizeof(*cmdqv->vintfs), GFP_KERNEL);
> @@ -913,3 +1039,276 @@ struct arm_smmu_device *tegra241_cmdqv_probe(struct arm_smmu_device *smmu)
> put_device(smmu->impl_dev);
> return ERR_PTR(-ENODEV);
> }
> +
> +/* User space VINTF and VCMDQ Functions */
> +
> +static size_t tegra241_vintf_get_vcmdq_size(struct iommufd_viommu *viommu,
> + enum iommu_hw_queue_type queue_type)
> +{
> + if (queue_type != IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV)
> + return 0;
> + return HW_QUEUE_STRUCT_SIZE(struct tegra241_vcmdq, core);
> +}
> +
> +static int tegra241_vcmdq_hw_init_user(struct tegra241_vcmdq *vcmdq)
> +{
> + char header[64];
> +
> + /* Configure the vcmdq only; User space does the enabling */
> + _tegra241_vcmdq_hw_init(vcmdq);
> +
> + dev_dbg(vcmdq->cmdqv->dev, "%sinited at host PA 0x%llx size 0x%lx\n",
> + lvcmdq_error_header(vcmdq, header, 64),
> + vcmdq->cmdq.q.q_base & VCMDQ_ADDR,
> + 1UL << (vcmdq->cmdq.q.q_base & VCMDQ_LOG2SIZE));
> + return 0;
> +}
> +
> +static void
> +tegra241_vintf_destroy_lvcmdq_user(struct iommufd_hw_queue *hw_queue)
> +{
> + struct tegra241_vcmdq *vcmdq = hw_queue_to_vcmdq(hw_queue);
> +
> + mutex_lock(&vcmdq->vintf->lvcmdq_mutex);
> + tegra241_vcmdq_hw_deinit(vcmdq);
> + tegra241_vcmdq_unmap_lvcmdq(vcmdq);
> + tegra241_vintf_free_lvcmdq(vcmdq->vintf, vcmdq->lidx);
> + if (vcmdq->prev)
> + iommufd_hw_queue_undepend(vcmdq, vcmdq->prev, core);
> + mutex_unlock(&vcmdq->vintf->lvcmdq_mutex);
> +}
> +
> +static int tegra241_vintf_alloc_lvcmdq_user(struct iommufd_hw_queue *hw_queue,
> + u32 lidx, phys_addr_t base_addr_pa)
> +{
> + struct tegra241_vintf *vintf = viommu_to_vintf(hw_queue->viommu);
> + struct tegra241_vcmdq *vcmdq = hw_queue_to_vcmdq(hw_queue);
> + struct tegra241_cmdqv *cmdqv = vintf->cmdqv;
> + struct arm_smmu_device *smmu = &cmdqv->smmu;
> + struct tegra241_vcmdq *prev = NULL;
> + u32 log2size, max_n_shift;
> + char header[64];
> + int ret;
> +
> + if (hw_queue->type != IOMMU_HW_QUEUE_TYPE_TEGRA241_CMDQV)
> + return -EOPNOTSUPP;
> + if (lidx >= cmdqv->num_lvcmdqs_per_vintf)
> + return -EINVAL;
> +
> + mutex_lock(&vintf->lvcmdq_mutex);
> +
> + if (vintf->lvcmdqs[lidx]) {
> + ret = -EEXIST;
> + goto unlock;
> + }
> +
> + /*
> + * HW requires to map LVCMDQs in ascending order, so reject if the
> + * previous lvcmdqs is not allocated yet.
> + */
> + if (lidx) {
> + prev = vintf->lvcmdqs[lidx - 1];
> + if (!prev) {
> + ret = -EIO;
> + goto unlock;
> + }
> + }
> +
> + /*
> + * hw_queue->length must be a power of 2, in range of
> + * [ 32, 2 ^ (idr[1].CMDQS + CMDQ_ENT_SZ_SHIFT) ]
> + */
> + max_n_shift = FIELD_GET(IDR1_CMDQS,
> + readl_relaxed(smmu->base + ARM_SMMU_IDR1));
> + if (!is_power_of_2(hw_queue->length) || hw_queue->length < 32 ||
> + hw_queue->length > (1 << (max_n_shift + CMDQ_ENT_SZ_SHIFT))) {
> + ret = -EINVAL;
> + goto unlock;
> + }
> + log2size = ilog2(hw_queue->length) - CMDQ_ENT_SZ_SHIFT;
> +
> + /* base_addr_pa must be aligned to hw_queue->length */
> + if (base_addr_pa & ~VCMDQ_ADDR ||
> + base_addr_pa & (hw_queue->length - 1)) {
> + ret = -EINVAL;
> + goto unlock;
> + }
> +
> + /*
> + * HW requires to unmap LVCMDQs in descending order, so destroy() must
> + * follow this rule. Set a dependency on its previous LVCMDQ so iommufd
> + * core will help enforce it.
> + */
> + if (prev) {
> + ret = iommufd_hw_queue_depend(vcmdq, prev, core);
> + if (ret)
> + goto unlock;
> + }
> + vcmdq->prev = prev;
> +
> + ret = tegra241_vintf_init_lvcmdq(vintf, lidx, vcmdq);
> + if (ret)
> + goto undepend_vcmdq;
> +
> + dev_dbg(cmdqv->dev, "%sallocated\n",
> + lvcmdq_error_header(vcmdq, header, 64));
> +
> + tegra241_vcmdq_map_lvcmdq(vcmdq);
> +
> + vcmdq->cmdq.q.q_base = base_addr_pa & VCMDQ_ADDR;
> + vcmdq->cmdq.q.q_base |= log2size;
> +
> + ret = tegra241_vcmdq_hw_init_user(vcmdq);
> + if (ret)
> + goto unmap_lvcmdq;
> +
> + hw_queue->destroy = &tegra241_vintf_destroy_lvcmdq_user;
> + mutex_unlock(&vintf->lvcmdq_mutex);
> + return 0;
> +
> +unmap_lvcmdq:
> + tegra241_vcmdq_unmap_lvcmdq(vcmdq);
> + tegra241_vintf_deinit_lvcmdq(vintf, lidx);
> +undepend_vcmdq:
> + if (vcmdq->prev)
> + iommufd_hw_queue_undepend(vcmdq, vcmdq->prev, core);
> +unlock:
> + mutex_unlock(&vintf->lvcmdq_mutex);
> + return ret;
> +}
> +
> +static void tegra241_cmdqv_destroy_vintf_user(struct iommufd_viommu *viommu)
> +{
> + struct tegra241_vintf *vintf = viommu_to_vintf(viommu);
> +
> + if (vintf->mmap_offset)
> + iommufd_viommu_destroy_mmap(&vintf->vsmmu.core,
> + vintf->mmap_offset);
> + tegra241_cmdqv_remove_vintf(vintf->cmdqv, vintf->idx);
> +}
> +
> +static void tegra241_vintf_destroy_vsid(struct iommufd_vdevice *vdev)
> +{
> + struct tegra241_vintf_sid *vsid = vdev_to_vsid(vdev);
> + struct tegra241_vintf *vintf = vsid->vintf;
> +
> + writel(0, REG_VINTF(vintf, SID_MATCH(vsid->idx)));
> + writel(0, REG_VINTF(vintf, SID_REPLACE(vsid->idx)));
> + ida_free(&vintf->sids, vsid->idx);
> + dev_dbg(vintf->cmdqv->dev,
> + "VINTF%u: deallocated SID_REPLACE%d for pSID=%x\n", vintf->idx,
> + vsid->idx, vsid->sid);
> +}
> +
> +static int tegra241_vintf_init_vsid(struct iommufd_vdevice *vdev)
> +{
> + struct arm_smmu_master *master = dev_iommu_priv_get(vdev->dev);
> + struct tegra241_vintf *vintf = viommu_to_vintf(vdev->viommu);
> + struct tegra241_vintf_sid *vsid = vdev_to_vsid(vdev);
> + struct arm_smmu_stream *stream = &master->streams[0];
> + u64 virt_sid = vdev->virt_id;
> + int sidx;
> +
> + if (virt_sid > UINT_MAX)
> + return -EINVAL;
> +
> + WARN_ON_ONCE(master->num_streams != 1);
> +
> + /* Find an empty pair of SID_REPLACE and SID_MATCH */
> + sidx = ida_alloc_max(&vintf->sids, vintf->cmdqv->num_sids_per_vintf - 1,
> + GFP_KERNEL);
> + if (sidx < 0)
> + return sidx;
> +
> + writel(stream->id, REG_VINTF(vintf, SID_REPLACE(sidx)));
> + writel(virt_sid << 1 | 0x1, REG_VINTF(vintf, SID_MATCH(sidx)));
> + dev_dbg(vintf->cmdqv->dev,
> + "VINTF%u: allocated SID_REPLACE%d for pSID=%x, vSID=%x\n",
> + vintf->idx, sidx, stream->id, (u32)virt_sid);
> +
> + vsid->idx = sidx;
> + vsid->vintf = vintf;
> + vsid->sid = stream->id;
> +
> + vdev->destroy = &tegra241_vintf_destroy_vsid;
> + return 0;
> +}
> +
> +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> + .destroy = tegra241_cmdqv_destroy_vintf_user,
> + .alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> + .cache_invalidate = arm_vsmmu_cache_invalidate,
I see that we currently use the main cmdq to issue these cache
invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
hoping for this series to change that but I'm assuming there's another
series coming for that?
Meanwhile, I guess it'd be good to call that out for folks who have
Grace and start trying out this feature.. I'm assuming they won't see
as much perf improvement with this series alone since we're still using
the main CMDQ in the upstream code?
> + .vdevice_size = VDEVICE_STRUCT_SIZE(struct tegra241_vintf_sid, core),
> + .vdevice_init = tegra241_vintf_init_vsid,
> + .get_hw_queue_size = tegra241_vintf_get_vcmdq_size,
> + .hw_queue_init_phys = tegra241_vintf_alloc_lvcmdq_user,
> +};
> +
> +static int
> +tegra241_cmdqv_init_vintf_user(struct arm_vsmmu *vsmmu,
> + const struct iommu_user_data *user_data)
> +{
> + struct tegra241_cmdqv *cmdqv =
> + container_of(vsmmu->smmu, struct tegra241_cmdqv, smmu);
> + struct tegra241_vintf *vintf = viommu_to_vintf(&vsmmu->core);
> + struct iommu_viommu_tegra241_cmdqv data;
> + phys_addr_t page0_base;
> + int ret;
> +
> + if (!user_data)
> + return -EINVAL;
> +
> + ret = iommu_copy_struct_from_user(&data, user_data,
> + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
> + out_vintf_mmap_length);
> + if (ret)
> + return ret;
> +
> + ret = tegra241_cmdqv_init_vintf(cmdqv, cmdqv->num_vintfs - 1, vintf);
> + if (ret < 0) {
> + dev_err(cmdqv->dev, "no more available vintf\n");
> + return ret;
> + }
> +
> + /*
> + * Initialize the user-owned VINTF without a LVCMDQ, as it cannot pre-
> + * allocate a LVCMDQ until user space wants one, for security reasons.
> + * It is different than the kernel-owned VINTF0, which had pre-assigned
> + * and pre-allocated global VCMDQs that would be mapped to the LVCMDQs
> + * by the tegra241_vintf_hw_init() call.
> + */
> + ret = tegra241_vintf_hw_init(vintf, false);
> + if (ret)
> + goto deinit_vintf;
> +
> + page0_base = cmdqv->base_phys + TEGRA241_VINTFi_PAGE0(vintf->idx);
> + ret = iommufd_viommu_alloc_mmap(&vintf->vsmmu.core, page0_base, SZ_64K,
> + &vintf->mmap_offset);
> + if (ret)
> + goto hw_deinit_vintf;
> +
> + data.out_vintf_mmap_length = SZ_64K;
> + data.out_vintf_mmap_offset = vintf->mmap_offset;
> + ret = iommu_copy_struct_to_user(user_data, &data,
> + IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV,
> + out_vintf_mmap_length);
> + if (ret)
> + goto free_mmap;
> +
> + ida_init(&vintf->sids);
> + mutex_init(&vintf->lvcmdq_mutex);
> +
> + dev_dbg(cmdqv->dev, "VINTF%u: allocated with vmid (%d)\n", vintf->idx,
> + vintf->vsmmu.vmid);
> +
> + vsmmu->core.ops = &tegra241_cmdqv_viommu_ops;
> + return 0;
> +
> +free_mmap:
> + iommufd_viommu_destroy_mmap(&vintf->vsmmu.core, vintf->mmap_offset);
> +hw_deinit_vintf:
> + tegra241_vintf_hw_deinit(vintf);
> +deinit_vintf:
> + tegra241_cmdqv_deinit_vintf(cmdqv, vintf->idx);
> + return ret;
> +}
Thanks,
Praan
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-01 16:02 ` Pranjal Shrivastava
@ 2025-07-01 19:42 ` Nicolin Chen
2025-07-01 20:03 ` Pranjal Shrivastava
0 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-07-01 19:42 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Tue, Jul 01, 2025 at 04:02:35PM +0000, Pranjal Shrivastava wrote:
> On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
> > /**
> > * enum iommu_hw_info_type - IOMMU Hardware Info Types
> > * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
> > @@ -598,12 +619,15 @@ struct iommu_hw_info_arm_smmuv3 {
> > * @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request for a default type
> > * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> > * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
> > + * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> > + * SMMUv3) info type
>
> I know that the goal here is to mention that Tegra241 CMDQV is an
> extension for Arm SMMUv3, but this comment could be misunderstood as the
> "type" being an extension to IOMMU_HW_INFO_TYPE_ARM_SMMUV3. How about we
IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV only reports the CMDQV structure.
The VMM still needs to poll IOMMU_HW_INFO_TYPE_ARM_SMMUV3. It's
basically working as a "type being an extension".
> Sorry to be nit-picky here, I know that the code is clear, but I've seen
> people don't care to read more than the uapi descriptions. Maybe we
> could re-write this comment, here and everywhere else?
I can change this, though:
+ * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
+ * SMMUv3) enabled ARM SMMUv3 type
> > +/**
> > + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> > + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> > + * @vintf: Parent VINTF pointer
> > + * @sid: Physical Stream ID
> > + * @idx: Replacement index in the VINTF
> > + */
> > +struct tegra241_vintf_sid {
> > + struct iommufd_vdevice core;
> > + struct tegra241_vintf *vintf;
> > + u32 sid;
> > + u8 idx;
> > };
>
> AFAIU, this seems to be a handle for the sid -> vintf mapping.. if yes, then
> I'm not sure if "Virtual Interface Stream ID Replacement" clarifies that?
No. It's for vSID to pSID mappings. I had it explained in the commit log:
For ATC invalidation commands that hold an SID, it requires all devices to
register their virtual SIDs to the SID_MATCH registers and their physical
SIDs to the pairing SID_REPLACE registers, so that HW can use those as a
lookup table to replace those virtual SIDs with the correct physical SIDs.
Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid
structure to allocate SID_REPLACE and to program the SIDs accordingly.
> > @@ -351,6 +394,29 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
> >
> > /* HW Reset Functions */
> >
> > +/*
> > + * When a guest-owned VCMDQ is disabled, if the guest did not enqueue a CMD_SYNC
> > + * following an ATC_INV command at the end of the guest queue while this ATC_INV
> > + * is timed out, the TIMEOUT will not be reported until this VCMDQ gets assigned
> > + * to the next VM, which will be a false alarm potentially causing some unwanted
> > + * behavior in the new VM. Thus, a guest-owned VCMDQ must flush the TIMEOUT when
> > + * it gets disabled. This can be done by just issuing a CMD_SYNC to SMMU CMDQ.
> > + */
> > +static void tegra241_vcmdq_hw_flush_timeout(struct tegra241_vcmdq *vcmdq)
> > +{
> > + struct arm_smmu_device *smmu = &vcmdq->cmdqv->smmu;
> > + u64 cmd_sync[CMDQ_ENT_DWORDS] = {};
> > +
> > + cmd_sync[0] = FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
> > + FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_NONE);
> > +
> > + /*
> > + * It does not hurt to insert another CMD_SYNC, taking advantage of the
> > + * arm_smmu_cmdq_issue_cmdlist() that waits for the CMD_SYNC completion.
> > + */
> > + arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, cmd_sync, 1, true);
> > +}
>
> If I'm getting this right, it issues a CMD_SYNC to the Host's CMDQ i.e.
> the non-CMDQV CMDQ, the main CMDQ of the SMMUv3? (i.e. the CMDQ present
> without the Tegra241 CMDQV extension?)
>
> so.. basically on every VM switch, there would be an additional CMD_SYNC
> issued to the non-CMDQV CMDQ to flush the TIMEOUT and we'll poll for
> its completion?
The main CMDQ exists regardless of whether the CMDQV extension is there
or not. The CMD_SYNC can be issued to any (v)CMDQ. The smmu->cmdq is
just the easiest one to use here.
> > @@ -380,6 +448,12 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
> > dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
> > }
> >
> > +/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
> > +static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
> > +{
> > + writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
> > +}
> > +
>
> Not sure why we broke this off to a function, will there be more stuff
> here or is this just to use it in tegra241_vcmdq_hw_init_user as well?
I can take it off.
> > @@ -429,6 +504,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
> > }
> > }
> > vintf_write_config(vintf, 0);
> > + for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) {
> > + writel(0, REG_VINTF(vintf, SID_MATCH(sidx)));
> > + writel(0, REG_VINTF(vintf, SID_REPLACE(sidx)));
> > + }
> > }
>
> I'm assuming we call the de-init while switching VMs and hence we need
> to clear these to avoid spurious SID replacements in the new VM? Or do
> they not reset to 0 when the HW is reset?
The driver does not reset HW when tearing down a VM, but only sets
VINTF's enable bit to 0. So, it should just set other configuration
bits to 0 as well.
> > +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> > + .destroy = tegra241_cmdqv_destroy_vintf_user,
> > + .alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> > + .cache_invalidate = arm_vsmmu_cache_invalidate,
>
> I see that we currently use the main cmdq to issue these cache
> invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
> hoping for this series to change that but I'm assuming there's another
> series coming for that?
>
> Meanwhile, I guess it'd be good to call that out for folks who have
> Grace and start trying out this feature.. I'm assuming they won't see
> as much perf improvement with this series alone since we're still using
> the main CMDQ in the upstream code?
VCMDQ only accelerates invalidation commands.
That is for non-invalidation commands that VCMDQ doesn't support,
so they still have to go in the standard nesting pathway.
Let's add a line:
/* for non-invalidation commands use */
Nicolin
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-01 19:42 ` Nicolin Chen
@ 2025-07-01 20:03 ` Pranjal Shrivastava
2025-07-01 20:23 ` Nicolin Chen
0 siblings, 1 reply; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 20:03 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Tue, Jul 01, 2025 at 12:42:32PM -0700, Nicolin Chen wrote:
> On Tue, Jul 01, 2025 at 04:02:35PM +0000, Pranjal Shrivastava wrote:
> > On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
> > > /**
> > > * enum iommu_hw_info_type - IOMMU Hardware Info Types
> > > * @IOMMU_HW_INFO_TYPE_NONE: Output by the drivers that do not report hardware
> > > @@ -598,12 +619,15 @@ struct iommu_hw_info_arm_smmuv3 {
> > > * @IOMMU_HW_INFO_TYPE_DEFAULT: Input to request for a default type
> > > * @IOMMU_HW_INFO_TYPE_INTEL_VTD: Intel VT-d iommu info type
> > > * @IOMMU_HW_INFO_TYPE_ARM_SMMUV3: ARM SMMUv3 iommu info type
> > > + * @IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> > > + * SMMUv3) info type
> >
> > I know that the goal here is to mention that Tegra241 CMDQV is an
> > extension for Arm SMMUv3, but this comment could be misunderstood as the
> > "type" being an extension to IOMMU_HW_INFO_TYPE_ARM_SMMUV3. How about we
>
> IOMMU_HW_INFO_TYPE_TEGRA241_CMDQV only reports the CMDQV structure.
> The VMM still needs to poll IOMMU_HW_INFO_TYPE_ARM_SMMUV3. It's
> basically working as a "type being an extension".
>
Ohh okay, I see.. I thought we were describing the HW.
> > Sorry to be nit-picky here, I know that the code is clear, but I've seen
> > people don't care to read more than the uapi descriptions. Maybe we
> > could re-write this comment, here and everywhere else?
>
> I can change this, though:
>
> + * @IOMMU_VIOMMU_TYPE_TEGRA241_CMDQV: NVIDIA Tegra241 CMDQV (extension for ARM
> + * SMMUv3) enabled ARM SMMUv3 type
>
Yes, that helps, thanks!
> > > +/**
> > > + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> > > + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> > > + * @vintf: Parent VINTF pointer
> > > + * @sid: Physical Stream ID
> > > + * @idx: Replacement index in the VINTF
> > > + */
> > > +struct tegra241_vintf_sid {
> > > + struct iommufd_vdevice core;
> > > + struct tegra241_vintf *vintf;
> > > + u32 sid;
> > > + u8 idx;
> > > };
> >
> > AFAIU, this seems to be a handle for the sid -> vintf mapping.. if yes, then
> > I'm not sure if "Virtual Interface Stream ID Replacement" clarifies that?
>
> No. It's for vSID to pSID mappings. I had it explained in the commit log:
>
I get that, it's for vSID -> pSID mapping which also "happens to" point
to the vintf.. all I wanted to say was that the description is unclear..
We could've described it as "Vintf SID map" or something, but I guess
it's fine the way it is too.. your call.
> For ATC invalidation commands that hold an SID, it requires all devices to
> register their virtual SIDs to the SID_MATCH registers and their physical
> SIDs to the pairing SID_REPLACE registers, so that HW can use those as a
> lookup table to replace those virtual SIDs with the correct physical SIDs.
> Thus, implement the driver-allocated vDEVICE op with a tegra241_vintf_sid
> structure to allocate SID_REPLACE and to program the SIDs accordingly.
>
> > > @@ -351,6 +394,29 @@ tegra241_cmdqv_get_cmdq(struct arm_smmu_device *smmu,
> > >
> > > /* HW Reset Functions */
> > >
> > > +/*
> > > + * When a guest-owned VCMDQ is disabled, if the guest did not enqueue a CMD_SYNC
> > > + * following an ATC_INV command at the end of the guest queue while this ATC_INV
> > > + * is timed out, the TIMEOUT will not be reported until this VCMDQ gets assigned
> > > + * to the next VM, which will be a false alarm potentially causing some unwanted
> > > + * behavior in the new VM. Thus, a guest-owned VCMDQ must flush the TIMEOUT when
> > > + * it gets disabled. This can be done by just issuing a CMD_SYNC to SMMU CMDQ.
> > > + */
> > > +static void tegra241_vcmdq_hw_flush_timeout(struct tegra241_vcmdq *vcmdq)
> > > +{
> > > + struct arm_smmu_device *smmu = &vcmdq->cmdqv->smmu;
> > > + u64 cmd_sync[CMDQ_ENT_DWORDS] = {};
> > > +
> > > + cmd_sync[0] = FIELD_PREP(CMDQ_0_OP, CMDQ_OP_CMD_SYNC) |
> > > + FIELD_PREP(CMDQ_SYNC_0_CS, CMDQ_SYNC_0_CS_NONE);
> > > +
> > > + /*
> > > + * It does not hurt to insert another CMD_SYNC, taking advantage of the
> > > + * arm_smmu_cmdq_issue_cmdlist() that waits for the CMD_SYNC completion.
> > > + */
> > > + arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, cmd_sync, 1, true);
> > > +}
> >
> > If I'm getting this right, it issues a CMD_SYNC to the Host's CMDQ i.e.
> > the non-CMDQV CMDQ, the main CMDQ of the SMMUv3? (i.e. the CMDQ present
> > without the Tegra241 CMDQV extension?)
> >
> > so.. basically on every VM switch, there would be an additional CMD_SYNC
> > issued to the non-CMDQV CMDQ to flush the TIMEOUT and we'll poll for
> > its completion?
>
> The main CMDQ exists regardless of whether the CMDQV extension is there
> or not. The CMD_SYNC can be issued to any (v)CMDQ. The smmu->cmdq is
> just the easiest one to use here.
>
I see. Thanks!
> > > @@ -380,6 +448,12 @@ static void tegra241_vcmdq_hw_deinit(struct tegra241_vcmdq *vcmdq)
> > > dev_dbg(vcmdq->cmdqv->dev, "%sdeinited\n", h);
> > > }
> > >
> > > +/* This function is for LVCMDQ, so @vcmdq must be mapped prior */
> > > +static void _tegra241_vcmdq_hw_init(struct tegra241_vcmdq *vcmdq)
> > > +{
> > > + writeq_relaxed(vcmdq->cmdq.q.q_base, REG_VCMDQ_PAGE1(vcmdq, BASE));
> > > +}
> > > +
> >
> > Not sure why we broke this off to a function, will there be more stuff
> > here or is this just to use it in tegra241_vcmdq_hw_init_user as well?
>
> I can take it off.
>
Nah, that's okay, I was just curious.
> > > @@ -429,6 +504,10 @@ static void tegra241_vintf_hw_deinit(struct tegra241_vintf *vintf)
> > > }
> > > }
> > > vintf_write_config(vintf, 0);
> > > + for (sidx = 0; sidx < vintf->cmdqv->num_sids_per_vintf; sidx++) {
> > > + writel(0, REG_VINTF(vintf, SID_MATCH(sidx)));
> > > + writel(0, REG_VINTF(vintf, SID_REPLACE(sidx)));
> > > + }
> > > }
> >
> > I'm assuming we call the de-init while switching VMs and hence we need
> > to clear these to avoid spurious SID replacements in the new VM? Or do
> > they not reset to 0 when the HW is reset?
>
> The driver does not reset HW when tearing down a VM, but only sets
> VINTF's enable bit to 0. So, it should just set other configuration
> bits to 0 as well.
>
> > > +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> > > + .destroy = tegra241_cmdqv_destroy_vintf_user,
> > > + .alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> > > + .cache_invalidate = arm_vsmmu_cache_invalidate,
> >
> > I see that we currently use the main cmdq to issue these cache
> > invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
> > hoping for this series to change that but I'm assuming there's another
> > series coming for that?
> >
> > Meanwhile, I guess it'd be good to call that out for folks who have
> > Grace and start trying out this feature.. I'm assuming they won't see
> > as much perf improvement with this series alone since we're still using
> > the main CMDQ in the upstream code?
>
> VCMDQ only accelerates invalidation commands.
>
I get that.. but I see we're using `arm_vsmmu_cache_invalidate` here
from arm-smmu-v3-iommufd.c which seems to issue all commands to
smmu->cmdq as of now (the code has a FIXME as well), per the code:
/* FIXME always uses the main cmdq rather than trying to group by type */
ret = arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, last->cmd,
cur - last, true);
I was hoping for this FIXME to be addressed in this series..
> That is for non-invalidation commands that VCMDQ doesn't support,
> so they still have to go in the standard nesting pathway.
>
> Let's add a line:
> /* for non-invalidation commands use */
Umm.. I was talking about the cache_invalidate op? I think there's some
misunderstanding here? What am I missing?
>
> Nicolin
Thanks
Praan
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-01 20:03 ` Pranjal Shrivastava
@ 2025-07-01 20:23 ` Nicolin Chen
2025-07-01 20:43 ` Pranjal Shrivastava
0 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-07-01 20:23 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Tue, Jul 01, 2025 at 08:03:35PM +0000, Pranjal Shrivastava wrote:
> On Tue, Jul 01, 2025 at 12:42:32PM -0700, Nicolin Chen wrote:
> > On Tue, Jul 01, 2025 at 04:02:35PM +0000, Pranjal Shrivastava wrote:
> > > On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
> > > > +/**
> > > > + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> > > > + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> > > > + * @vintf: Parent VINTF pointer
> > > > + * @sid: Physical Stream ID
> > > > + * @idx: Replacement index in the VINTF
> > > > + */
> > > > +struct tegra241_vintf_sid {
> > > > +	struct iommufd_vdevice core;
> > > > +	struct tegra241_vintf *vintf;
> > > > +	u32 sid;
> > > > +	u8 idx;
> > > > };
> > >
> > > AFAIU, this seems to be a handle for sid -> vintf mapping.. if yes, then
> > > I'm not sure if "Virtual Interface Stream ID Replacement" clarifies that?
> >
> > No. It's for vSID to pSID mappings. I had it explained in commit log:
> >
>
> I get that, it's for vSID -> pSID mapping which also "happens to" point
> to the vintf.. all I wanted to say was that the description is unclear..
> We could've described it as "Vintf SID map" or something, but I guess
> it's fine the way it is too.. your call.
The "replace" word is borrowed from the "SID_REPLACE" HW register.
But I think it's okay to call it just "mapping", if that makes it
clearer.
> > > > +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> > > > +	.destroy = tegra241_cmdqv_destroy_vintf_user,
> > > > +	.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> > > > +	.cache_invalidate = arm_vsmmu_cache_invalidate,
> > >
> > > I see that we currently use the main cmdq to issue these cache
> > > invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
> > > hoping for this series to change that but I'm assuming there's another
> > > series coming for that?
> > >
> > > Meanwhile, I guess it'd be good to call that out for folks who have
> > > Grace and start trying out this feature.. I'm assuming they won't see
> > > as much perf improvement with this series alone since we're still using
> > > the main CMDQ in the upstream code?
> >
> > VCMDQ only accelerates invalidation commands.
> >
>
> I get that.. but I see we're using `arm_vsmmu_cache_invalidate` here
> from arm-smmu-v3-iommufd.c which seems to issue all commands to
> smmu->cmdq as of now (the code has a FIXME as well), per the code:
>
> /* FIXME always uses the main cmdq rather than trying to group by type */
> 	ret = arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, last->cmd,
> 					  cur - last, true);
>
> I was hoping this FIXME to be addressed in this series..
Oh, that's not related.
The main goal of this series is to route all invalidation commands
to the VCMDQ HW. And this is where Grace users can see perf gains
mentioned in the cover letter or commit log, from eliminating the
VM Exits at those most frequently used commands.
Any non-invalidation commands will just reuse what we have with the
standard SMMU nesting. And even if we did something to that FIXME,
there is no significant perf gain as it's going down the trapping
pathway, i.e. the VM Exits are always there.
> > That is for non-invalidation commands that VCMDQ doesn't support,
> > so they still have to go in the standard nesting pathway.
> >
> > Let's add a line:
> > /* for non-invalidation commands use */
>
> Umm.. I was talking about the cache_invalidate op? I think there's some
> misunderstanding here? What am I missing?
That line is exactly for cache_invalidate. All the non-invalidation
commands will be sent to the arm_vsmmu_cache_invalidate() by the VMM,
as the comment means.
Or perhaps calling them "non-accelerated commands" would be nicer.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-01 20:23 ` Nicolin Chen
@ 2025-07-01 20:43 ` Pranjal Shrivastava
2025-07-01 22:07 ` Nicolin Chen
0 siblings, 1 reply; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 20:43 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> On Tue, Jul 01, 2025 at 08:03:35PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Jul 01, 2025 at 12:42:32PM -0700, Nicolin Chen wrote:
> > > On Tue, Jul 01, 2025 at 04:02:35PM +0000, Pranjal Shrivastava wrote:
> > > > On Thu, Jun 26, 2025 at 12:34:58PM -0700, Nicolin Chen wrote:
> > > > > +/**
> > > > > + * struct tegra241_vintf_sid - Virtual Interface Stream ID Replacement
> > > > > + * @core: Embedded iommufd_vdevice structure, holding virtual Stream ID
> > > > > + * @vintf: Parent VINTF pointer
> > > > > + * @sid: Physical Stream ID
> > > > > + * @idx: Replacement index in the VINTF
> > > > > + */
> > > > > +struct tegra241_vintf_sid {
> > > > > +	struct iommufd_vdevice core;
> > > > > +	struct tegra241_vintf *vintf;
> > > > > +	u32 sid;
> > > > > +	u8 idx;
> > > > > };
> > > >
> > > > AFAIU, this seems to be a handle for sid -> vintf mapping.. if yes, then
> > > > I'm not sure if "Virtual Interface Stream ID Replacement" clarifies that?
> > >
> > > No. It's for vSID to pSID mappings. I had it explained in commit log:
> > >
> >
> > I get that, it's for vSID -> pSID mapping which also "happens to" point
> > to the vintf.. all I wanted to say was that the description is unclear..
> > We could've described it as "Vintf SID map" or something, but I guess
> > it's fine the way it is too.. your call.
>
> The "replace" word is borrowed from the "SID_REPLACE" HW register.
>
> But I think it's okay to call it just "mapping", if that makes it
> clearer.
>
Anything works. Maybe let it be as is.
> > > > > +static struct iommufd_viommu_ops tegra241_cmdqv_viommu_ops = {
> > > > > +	.destroy = tegra241_cmdqv_destroy_vintf_user,
> > > > > +	.alloc_domain_nested = arm_vsmmu_alloc_domain_nested,
> > > > > +	.cache_invalidate = arm_vsmmu_cache_invalidate,
> > > >
> > > > I see that we currently use the main cmdq to issue these cache
> > > > invalidations (there's a FIXME in arm_vsmmu_cache_invalidate). I was
> > > > hoping for this series to change that but I'm assuming there's another
> > > > series coming for that?
> > > >
> > > > Meanwhile, I guess it'd be good to call that out for folks who have
> > > > Grace and start trying out this feature.. I'm assuming they won't see
> > > > as much perf improvement with this series alone since we're still using
> > > > the main CMDQ in the upstream code?
> > >
> > > VCMDQ only accelerates invalidation commands.
> > >
> >
> > I get that.. but I see we're using `arm_vsmmu_cache_invalidate` here
> > from arm-smmu-v3-iommufd.c which seems to issue all commands to
> > smmu->cmdq as of now (the code has a FIXME as well), per the code:
> >
> > /* FIXME always uses the main cmdq rather than trying to group by type */
> > 	ret = arm_smmu_cmdq_issue_cmdlist(smmu, &smmu->cmdq, last->cmd,
> > 					  cur - last, true);
> >
> > I was hoping this FIXME to be addressed in this series..
>
> Oh, that's not related.
>
> The main goal of this series is to route all invalidation commands
> to the VCMDQ HW. And this is where Grace users can see perf gains
> mentioned in the cover letter or commit log, from eliminating the
> VM Exits at those most frequently used commands.
>
> Any non-invalidation commands will just reuse what we have with the
> standard SMMU nesting. And even if we did something to that FIXME,
> there is no significant perf gain as it's going down the trapping
> pathway, i.e. the VM Exits are always there.
>
> > > That is for non-invalidation commands that VCMDQ doesn't support,
> > > so they still have to go in the standard nesting pathway.
> > >
> > > Let's add a line:
> > > /* for non-invalidation commands use */
> >
> > Umm.. I was talking about the cache_invalidate op? I think there's some
> > misunderstanding here? What am I missing?
>
> That line is exactly for cache_invalidate. All the non-invalidation
> commands will be sent to the arm_vsmmu_cache_invalidate() by the VMM,
> as the comment means.
>
> Or perhaps calling them "non-accelerated commands" would be nicer.
Uhh okay, so there'll be a separate driver in the VM issuing invalidation
commands directly to the CMDQV, thus we don't see any part of it here?
AND for non-invalidation commands, we trap out and the VMM ends up
calling the `cache_invalidate` op of the viommu?
Is that understanding correct?
>
> Thanks
> Nicolin
Thanks
Praan
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-01 20:43 ` Pranjal Shrivastava
@ 2025-07-01 22:07 ` Nicolin Chen
2025-07-01 22:51 ` Pranjal Shrivastava
0 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-07-01 22:07 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Tue, Jul 01, 2025 at 08:43:30PM +0000, Pranjal Shrivastava wrote:
> On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> > Or perhaps calling them "non-accelerated commands" would be nicer.
>
> Uhh okay, so there'll be a separate driver in the VM issuing invalidation
> commands directly to the CMDQV, thus we don't see any part of it here?
That's how it works. The VM must run a guest-level VCMDQ driver that
separates accelerated and non-accelerated commands, as it already
does:
accelerated commands => VCMDQ (HW)
non-accelerated commands => SMMU CMDQ (SW) =iommufd=> SMMU CMDQ (HW)
Nicolin
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-01 22:07 ` Nicolin Chen
@ 2025-07-01 22:51 ` Pranjal Shrivastava
2025-07-01 23:01 ` Nicolin Chen
0 siblings, 1 reply; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-01 22:51 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Tue, Jul 01, 2025 at 03:07:57PM -0700, Nicolin Chen wrote:
> On Tue, Jul 01, 2025 at 08:43:30PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> > > Or perhaps calling them "non-accelerated commands" would be nicer.
> >
> > Uhh okay, so there'll be a separate driver in the VM issuing invalidation
> > commands directly to the CMDQV, thus we don't see any part of it here?
>
> That's how it works. VM must run a guest-level VCMDQ driver that
> separates accelerated and non-accelerated commands as it already
> does:
>
> accelerated commands => VCMDQ (HW)
> non-accelerated commands => SMMU CMDQ (SW) =iommufd=> SMMU CMDQ (HW)
>
Right, exactly what got me confused. I was assuming the same CMDQV
driver would run in the Guest kernel, but it seems like there's another
driver for the Guest that's not in tree yet, or maybe is a purely
user-space thing?
And the weird part was that "invalidation" commands are accelerated but
we use the .cache_invalidate viommu op for `non-invalidation` commands.
But I guess what you meant there could be non-accelerated invalidation
commands (maybe something stage 2 TLBIs?) which would go through the
.cache_invalidate op, right?
> Nicolin
Praan
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-01 22:51 ` Pranjal Shrivastava
@ 2025-07-01 23:01 ` Nicolin Chen
2025-07-02 0:14 ` Pranjal Shrivastava
0 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-07-01 23:01 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Tue, Jul 01, 2025 at 10:51:20PM +0000, Pranjal Shrivastava wrote:
> On Tue, Jul 01, 2025 at 03:07:57PM -0700, Nicolin Chen wrote:
> > On Tue, Jul 01, 2025 at 08:43:30PM +0000, Pranjal Shrivastava wrote:
> > > On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> > > > Or perhaps calling them "non-accelerated commands" would be nicer.
> > >
> > > Uhh okay, so there'll be a separate driver in the VM issuing invalidation
> > > commands directly to the CMDQV, thus we don't see any part of it here?
> >
> > That's how it works. VM must run a guest-level VCMDQ driver that
> > separates accelerated and non-accelerated commands as it already
> > does:
> >
> > accelerated commands => VCMDQ (HW)
> > non-accelerated commands => SMMU CMDQ (SW) =iommufd=> SMMU CMDQ (HW)
> >
>
> Right exactly what got me confused. I was assuming the same CMDQV driver
> would run in the Guest kernel but seems like there's another driver for
> the Guest that's not in tree yet or maybe is a purely user-space thing?
It's the same tegra241-cmdqv.c in the kernel, which is already
a part of mainline Linux. Both host and guest run the same copy
of software. The host kernel just has the user VINTF part (via
iommufd) additional to what the guest already has.
> And the weird part was that "invalidation" commands are accelerated but
> we use the .cache_invalidate viommu op for `non-invalidation` commands.
> But I guess what you meant there could be non-accelerated invalidation
> commands (maybe something stage 2 TLBIs?) which would go through the
> .cache_invalidate op, right?
I am talking about this:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c?h=v6.16-rc4#n305
The commands for which it returns "false" will be issued to smmu->cmdq
in a guest VM, which will be trapped by the VMM as standard SMMU nesting
and further forwarded via iommufd to the host kernel, which will invoke
this cache_invalidate op in the arm-smmu-v3 driver.
The commands for which it returns "true" will be issued to vcmdq->cmdq,
which is the HW-accelerated queue (set up by the VMM via iommufd's
hw_queue/mmap).
Nicolin
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-01 23:01 ` Nicolin Chen
@ 2025-07-02 0:14 ` Pranjal Shrivastava
2025-07-02 0:46 ` Nicolin Chen
0 siblings, 1 reply; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-02 0:14 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Tue, Jul 01, 2025 at 04:01:34PM -0700, Nicolin Chen wrote:
> On Tue, Jul 01, 2025 at 10:51:20PM +0000, Pranjal Shrivastava wrote:
> > On Tue, Jul 01, 2025 at 03:07:57PM -0700, Nicolin Chen wrote:
> > > On Tue, Jul 01, 2025 at 08:43:30PM +0000, Pranjal Shrivastava wrote:
> > > > On Tue, Jul 01, 2025 at 01:23:17PM -0700, Nicolin Chen wrote:
> > > > > Or perhaps calling them "non-accelerated commands" would be nicer.
> > > >
> > > > Uhh okay, so there'll be a separate driver in the VM issuing invalidation
> > > > commands directly to the CMDQV, thus we don't see any part of it here?
> > >
> > > That's how it works. VM must run a guest-level VCMDQ driver that
> > > separates accelerated and non-accelerated commands as it already
> > > does:
> > >
> > > accelerated commands => VCMDQ (HW)
> > > non-accelerated commands => SMMU CMDQ (SW) =iommufd=> SMMU CMDQ (HW)
> > >
> >
> > Right exactly what got me confused. I was assuming the same CMDQV driver
> > would run in the Guest kernel but seems like there's another driver for
> > the Guest that's not in tree yet or maybe is a purely user-space thing?
>
> It's the same tegra241-cmdqv.c in the kernel, which is already
> a part of mainline Linux. Both host and guest run the same copy
> of software. The host kernel just has the user VINTF part (via
> iommufd) additional to what the guest already has.
>
> > And the weird part was that "invalidation" commands are accelerated but
> > we use the .cache_invalidate viommu op for `non-invalidation` commands.
> > But I guess what you meant there could be non-accelerated invalidation
> > commands (maybe something stage 2 TLBIs?) which would go through the
> > .cache_invalidate op, right?
>
> I am talking about this:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/arm/arm-smmu-v3/tegra241-cmdqv.c?h=v6.16-rc4#n305
>
> Those commands returned "false" will be issued to smmu->cmdq in a
> guest VM, which will be trapped by VMM as a standard SMMU nesting
> and will be further forwarded via iommufd to the host kernel that
> will invoke this cache_invalidate op in the arm-smmu-v3 driver.
>
> Those commands returned "true" will be issued to vcmdq->cmdq that
> is HW-accelerated queue (setup by VMM via iommufd's hw_queue/mmap).
Right, this brings me back to my original understanding: the arm-smmu-v3
driver checks for "supported commands" and figures out which queue they
shall be issued to.. now there are "non-invalidation" commands that are
non-accelerated, like CMD_PRI_RESP, which would be issued through the
trap => .cache_invalidate path.
Thus, coming back to the two initial points:
1) Issuing "non-invalidation" commands through .cache_invalidate could
be confusing, I'm not asking to change the op name here, but if we
plan to label it, let's label them as "Trapped commands" OR
"non-accelerated" commands as you suggested.
2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
mentions we'd like to "fix" the issuing of commands through the main
cmdq and would instead like to group by "type". If that "type" is the
queue type (which I assume it is because IOMMU_TYPE has to be
arm-smmu-v3), what do we plan to do differently there, given that the
trapped commands handled by the op *have* to go through the main CMDQ?
If we were planning to do something based on the queue type, I was
hoping for it to be addressed in this series as we've introduced the
Tegra CMDQ type.
That's all I wanted to say, sorry if this was confusing.
>
> Nicolin
Thanks,
Praan
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-02 0:14 ` Pranjal Shrivastava
@ 2025-07-02 0:46 ` Nicolin Chen
2025-07-02 1:38 ` Pranjal Shrivastava
0 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-07-02 0:46 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Wed, Jul 02, 2025 at 12:14:28AM +0000, Pranjal Shrivastava wrote:
> Thus, coming back to the two initial points:
>
> 1) Issuing "non-invalidation" commands through .cache_invalidate could
> be confusing, I'm not asking to change the op name here, but if we
> plan to label it, let's label them as "Trapped commands" OR
> "non-accelerated" commands as you suggested.
VCMDQ only accelerates a limited set of invalidation commands, not
all of them: STE cache invalidation and CD cache invalidation commands
still go down to that op.
> 2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
> mentions we'd like to "fix" the issuing of commands through the main
> cmdq and instead like to group by "type", if that "type" is the queue
> type (which I assume it is because IOMMU_TYPE has to be arm-smmu-v3),
I recall that FIXME was noted by Jason at the time. And it should
be interpreted as "group by opcode", IIUIC.
The thing is that for a host kernel that has enabled in-kernel VCMDQs,
those trapped user commands can just be issued to the smmu->cmdq
or a vcmdq (picked via the get_secondary_cmdq impl_op).
> what do we plan to do differently there, given that the op is only
> for trapped commands *have* to go through the main CMDQ?
If we do something differently there, it could just do a one-time
get_secondary_cmdq call to pick an in-kernel vcmdq over smmu->cmdq
to fill in all the trapped commands.
And this is not related to this series at all.
Nicolin
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-02 0:46 ` Nicolin Chen
@ 2025-07-02 1:38 ` Pranjal Shrivastava
2025-07-02 18:05 ` Jason Gunthorpe
0 siblings, 1 reply; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-02 1:38 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg, kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Tue, Jul 01, 2025 at 05:46:06PM -0700, Nicolin Chen wrote:
> On Wed, Jul 02, 2025 at 12:14:28AM +0000, Pranjal Shrivastava wrote:
> > Thus, coming back to the two initial points:
> >
> > 1) Issuing "non-invalidation" commands through .cache_invalidate could
> > be confusing, I'm not asking to change the op name here, but if we
> > plan to label it, let's label them as "Trapped commands" OR
> > "non-accelerated" commands as you suggested.
>
> VCMDQ only accelerates limited invalidation commands, not all of
> them: STE cache invalidation and CD cache invalidation commands
> still go down to that op.
>
Right, I'm just saying the "other" non-accelerated commands that are
NOT invalidations also go down that op. So, if we add a comment, let's
not call them "non-invalidation" commands.
> > 2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
> > mentions we'd like to "fix" the issuing of commands through the main
> > cmdq and instead like to group by "type", if that "type" is the queue
> > type (which I assume it is because IOMMU_TYPE has to be arm-smmu-v3),
>
> I recall that FIXME is noted by Jason at that time. And it should
> be interpreted as "group by opcode", IIUIC.
I see.. I misunderstood that..
>
> The thing is that for a host kernel that enabled in-kernel VCMDQs,
> those trapped user commands can be just issued to the smmu->cmdq
> or a vcmdq (picked via the get_secondary_cmdq impl_op).
>
Ohh.. so maybe some sort of a load balancing thing?
> > what do we plan to do differently there, given that the trapped
> > commands handled by the op *have* to go through the main CMDQ?
>
> If we do something differently there, it could just do a one-time
> get_secondary_cmdq call to pick an in-kernel vcmdq over smmu->cmdq
> to fill in all the trapped commands.
>
Alright.
> And this is not related to this series at all.
Agreed, sorry for the confusion then.. I thought that the "type" meant
the queue type.. I guess it's all done then. I have no further questions.
Thanks for the clarification!
>
> Nicolin
Praan
^ permalink raw reply [flat|nested] 67+ messages in thread
* RE: [PATCH v7 01/28] iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range
2025-06-26 19:34 ` [PATCH v7 01/28] iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range Nicolin Chen
@ 2025-07-02 9:39 ` Tian, Kevin
2025-07-04 12:59 ` Jason Gunthorpe
1 sibling, 0 replies; 67+ messages in thread
From: Tian, Kevin @ 2025-07-02 9:39 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: bagasdotme@gmail.com, robin.murphy@arm.com, joro@8bytes.org,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
shuah@kernel.org, jsnitsel@redhat.com, nathan@kernel.org,
peterz@infradead.org, Liu, Yi L, mshavit@google.com,
praan@google.com, zhangzekun11@huawei.com, iommu@lists.linux.dev,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
mochs@nvidia.com, alok.a.tiwari@oracle.com, vasant.hegde@amd.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, June 27, 2025 3:35 AM
>
> There are callers that read the unmapped bytes even when rc != 0. Thus, do
> not forget to report it in the error path too.
>
> Fixes: 8d40205f6093 ("iommufd: Add kAPI toward external drivers for kernel
> access")
> Cc: stable@vger.kernel.org
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 67+ messages in thread
* RE: [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id
2025-06-26 19:34 ` [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id Nicolin Chen
2025-07-01 12:30 ` Pranjal Shrivastava
@ 2025-07-02 9:40 ` Tian, Kevin
2025-07-02 19:59 ` Nicolin Chen
2025-07-04 12:59 ` Jason Gunthorpe
2 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2025-07-02 9:40 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: bagasdotme@gmail.com, robin.murphy@arm.com, joro@8bytes.org,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
shuah@kernel.org, jsnitsel@redhat.com, nathan@kernel.org,
peterz@infradead.org, Liu, Yi L, mshavit@google.com,
praan@google.com, zhangzekun11@huawei.com, iommu@lists.linux.dev,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
mochs@nvidia.com, alok.a.tiwari@oracle.com, vasant.hegde@amd.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, June 27, 2025 3:35 AM
> +
> +	/*
> +	 * Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID
> +	 * of AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
> +	 */
> +	u64 virt_id;
Just "vRID of Intel VT-d"? the current description is not very clear
to me.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 67+ messages in thread
* RE: [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op
2025-06-26 19:34 ` [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op Nicolin Chen
2025-07-01 12:48 ` Pranjal Shrivastava
2025-07-01 12:51 ` Pranjal Shrivastava
@ 2025-07-02 9:41 ` Tian, Kevin
2025-07-04 13:00 ` Jason Gunthorpe
3 siblings, 0 replies; 67+ messages in thread
From: Tian, Kevin @ 2025-07-02 9:41 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: bagasdotme@gmail.com, robin.murphy@arm.com, joro@8bytes.org,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
shuah@kernel.org, jsnitsel@redhat.com, nathan@kernel.org,
peterz@infradead.org, Liu, Yi L, mshavit@google.com,
praan@google.com, zhangzekun11@huawei.com, iommu@lists.linux.dev,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
mochs@nvidia.com, alok.a.tiwari@oracle.com, vasant.hegde@amd.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, June 27, 2025 3:35 AM
>
> Replace u32 to make it clear. No functional changes.
>
> Also simplify the kdoc since the type itself is clear enough.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 67+ messages in thread
* RE: [PATCH v7 09/28] iommufd/access: Add internal APIs for HW queue to use
2025-06-26 19:34 ` [PATCH v7 09/28] iommufd/access: Add internal APIs for HW queue to use Nicolin Chen
@ 2025-07-02 9:42 ` Tian, Kevin
2025-07-04 13:08 ` Jason Gunthorpe
1 sibling, 0 replies; 67+ messages in thread
From: Tian, Kevin @ 2025-07-02 9:42 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: bagasdotme@gmail.com, robin.murphy@arm.com, joro@8bytes.org,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
shuah@kernel.org, jsnitsel@redhat.com, nathan@kernel.org,
peterz@infradead.org, Liu, Yi L, mshavit@google.com,
praan@google.com, zhangzekun11@huawei.com, iommu@lists.linux.dev,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
mochs@nvidia.com, alok.a.tiwari@oracle.com, vasant.hegde@amd.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, June 27, 2025 3:35 AM
>
> The new HW queue object, as an internal iommufd object, wants to reuse
> the struct iommufd_access to pin some iova range in the iopt.
>
> However, an access generally takes the refcount of an ictx. So, in such an
> internal case, a deadlock could happen when the release of the ictx has to
> wait for the release of the access first when releasing a hw_queue object,
> which could wait for the release of the ictx that is refcounted:
> ictx --releases--> hw_queue --releases--> access
>  ^                                          |
>  |________________releases_________________v
>
> To address this, add a set of lightweight internal APIs to unlink the ictx
> and the access, i.e. no ictx refcounting by the access:
> ictx --releases--> hw_queue --releases--> access
>
> Then, there's no point in setting the access->ictx. So simply define !ictx
> as a flag for internal use and add an inline helper.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 67+ messages in thread
* RE: [PATCH v7 10/28] iommufd/access: Bypass access->ops->unmap for internal use
2025-06-26 19:34 ` [PATCH v7 10/28] iommufd/access: Bypass access->ops->unmap for internal use Nicolin Chen
@ 2025-07-02 9:45 ` Tian, Kevin
2025-07-02 20:12 ` Nicolin Chen
0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2025-07-02 9:45 UTC (permalink / raw)
To: Nicolin Chen, jgg@nvidia.com, corbet@lwn.net, will@kernel.org
Cc: bagasdotme@gmail.com, robin.murphy@arm.com, joro@8bytes.org,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
shuah@kernel.org, jsnitsel@redhat.com, nathan@kernel.org,
peterz@infradead.org, Liu, Yi L, mshavit@google.com,
praan@google.com, zhangzekun11@huawei.com, iommu@lists.linux.dev,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
mochs@nvidia.com, alok.a.tiwari@oracle.com, vasant.hegde@amd.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, June 27, 2025 3:35 AM
>
> +int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long
> iova,
> + unsigned long length)
> {
> struct iommufd_ioas *ioas =
> container_of(iopt, struct iommufd_ioas, iopt);
> struct iommufd_access *access;
> unsigned long index;
> + int ret = 0;
>
> xa_lock(&ioas->iopt.access_list);
> + /* Bypass any unmap if there is an internal access */
> + xa_for_each(&ioas->iopt.access_list, index, access) {
> + if (iommufd_access_is_internal(access)) {
> + ret = -EBUSY;
> + goto unlock;
> + }
> + }
> +
hmm all those checks are per iopt. Could do one-off check in
iopt_unmap_iova_range() and store the result in a local flag.
Then use that flag to decide whether to return -EBUSY if
area->num_accesses is true in the loop.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-02 1:38 ` Pranjal Shrivastava
@ 2025-07-02 18:05 ` Jason Gunthorpe
2025-07-03 14:46 ` Pranjal Shrivastava
0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2025-07-02 18:05 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: Nicolin Chen, kevin.tian, corbet, will, bagasdotme, robin.murphy,
joro, thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Wed, Jul 02, 2025 at 01:38:33AM +0000, Pranjal Shrivastava wrote:
> On Tue, Jul 01, 2025 at 05:46:06PM -0700, Nicolin Chen wrote:
> > On Wed, Jul 02, 2025 at 12:14:28AM +0000, Pranjal Shrivastava wrote:
> > > Thus, coming back to the two initial points:
> > >
> > > 1) Issuing "non-invalidation" commands through .cache_invalidate could
> > > be confusing, I'm not asking to change the op name here, but if we
> > > plan to label it, let's label them as "Trapped commands" OR
> > > "non-accelerated" commands as you suggested.
> >
> > VCMDQ only accelerates limited invalidation commands, not all of
> > them: STE cache invalidation and CD cache invalidation commands
> > still go down to that op.
> >
>
> Right, I'm just saying the "other" non-accelerated commands that are
> NOT invalidations also go down that op. So, if we add a comment, let's
> not call them "non-invalidation" commands.
There are no non-invalidation commands:
static int arm_vsmmu_convert_user_cmd(struct arm_vsmmu *vsmmu,
struct arm_vsmmu_invalidation_cmd *cmd)
{
switch (cmd->cmd[0] & CMDQ_0_OP) {
case CMDQ_OP_TLBI_NSNH_ALL:
case CMDQ_OP_TLBI_NH_VA:
case CMDQ_OP_TLBI_NH_VAA:
case CMDQ_OP_TLBI_NH_ALL:
case CMDQ_OP_TLBI_NH_ASID:
case CMDQ_OP_ATC_INV:
case CMDQ_OP_CFGI_CD:
case CMDQ_OP_CFGI_CD_ALL:
Those are only invalidations.
CD invalidation can't go through the vCMDQ path.
> > > 2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
> > > mentions we'd like to "fix" the issuing of commands through the main
> > > cmdq and instead like to group by "type", if that "type" is the queue
> > > type (which I assume it is because IOMMU_TYPE has to be arm-smmu-v3),
> >
> > I recall that FIXME is noted by Jason at that time. And it should
> > be interpreted as "group by opcode", IIUIC.
>
> I see.. I misunderstood that..
Yes, we could use the vCMDQ in the SMMU driver for invalidations which
would give some minor locking advantage. But it is not really
important to anyone.
> > The thing is that for a host kernel that enabled in-kernel VCMDQs,
> > those trapped user commands can be just issued to the smmu->cmdq
> > or a vcmdq (picked via the get_secondary_cmdq impl_op).
>
> Ohh.. so maybe some sort of a load balancing thing?
The goal of the SMMU driver when it detects CMDQV support is to route
all supported invalidations to CMDQV queues and then balance those
queues across CPUs to reduce lock contention.
Jason
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id
2025-07-02 9:40 ` Tian, Kevin
@ 2025-07-02 19:59 ` Nicolin Chen
0 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-07-02 19:59 UTC (permalink / raw)
To: Tian, Kevin
Cc: jgg@nvidia.com, corbet@lwn.net, will@kernel.org,
bagasdotme@gmail.com, robin.murphy@arm.com, joro@8bytes.org,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
shuah@kernel.org, jsnitsel@redhat.com, nathan@kernel.org,
peterz@infradead.org, Liu, Yi L, mshavit@google.com,
praan@google.com, zhangzekun11@huawei.com, iommu@lists.linux.dev,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
mochs@nvidia.com, alok.a.tiwari@oracle.com, vasant.hegde@amd.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com
On Wed, Jul 02, 2025 at 09:40:50AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Friday, June 27, 2025 3:35 AM
> > +
> > + /*
> > + * Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3,
> > vDeviceID of
> > + * AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
> > + */
> > + u64 virt_id;
>
> Just "vRID of Intel VT-d"? the current description is not very clear
> to me.
Looks like we use "vRID of Intel VT-d to a Context Table" in the
Documentation/userspace-api/iommufd.rst, but forgot to change in
the uAPI header:
* @virt_id: Virtual device ID per vIOMMU, e.g. vSID of ARM SMMUv3, vDeviceID
* of AMD IOMMU, and vRID of a nested Intel VT-d to a Context Table
Let me correct both of them.
Thanks
Nicolin
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 10/28] iommufd/access: Bypass access->ops->unmap for internal use
2025-07-02 9:45 ` Tian, Kevin
@ 2025-07-02 20:12 ` Nicolin Chen
2025-07-03 4:57 ` Tian, Kevin
0 siblings, 1 reply; 67+ messages in thread
From: Nicolin Chen @ 2025-07-02 20:12 UTC (permalink / raw)
To: Tian, Kevin
Cc: jgg@nvidia.com, corbet@lwn.net, will@kernel.org,
bagasdotme@gmail.com, robin.murphy@arm.com, joro@8bytes.org,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
shuah@kernel.org, jsnitsel@redhat.com, nathan@kernel.org,
peterz@infradead.org, Liu, Yi L, mshavit@google.com,
praan@google.com, zhangzekun11@huawei.com, iommu@lists.linux.dev,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
mochs@nvidia.com, alok.a.tiwari@oracle.com, vasant.hegde@amd.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com
On Wed, Jul 02, 2025 at 09:45:26AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Friday, June 27, 2025 3:35 AM
> >
> > +int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned long
> > iova,
> > + unsigned long length)
> > {
> > struct iommufd_ioas *ioas =
> > container_of(iopt, struct iommufd_ioas, iopt);
> > struct iommufd_access *access;
> > unsigned long index;
> > + int ret = 0;
> >
> > xa_lock(&ioas->iopt.access_list);
> > + /* Bypass any unmap if there is an internal access */
> > + xa_for_each(&ioas->iopt.access_list, index, access) {
> > + if (iommufd_access_is_internal(access)) {
> > + ret = -EBUSY;
> > + goto unlock;
> > + }
> > + }
> > +
>
> hmm all those checks are per iopt. Could do one-off check in
> iopt_unmap_iova_range() and store the result in a local flag.
>
> Then use that flag to decide whether to return -EBUSY if
> area->num_accesses is true in the loop.
I don't quite follow this...
Do you suggest moving this xa_for_each to iopt_unmap_iova_range?
What's that local flag used for?
Thanks
Nicolin
^ permalink raw reply [flat|nested] 67+ messages in thread
* RE: [PATCH v7 10/28] iommufd/access: Bypass access->ops->unmap for internal use
2025-07-02 20:12 ` Nicolin Chen
@ 2025-07-03 4:57 ` Tian, Kevin
2025-07-04 4:08 ` Nicolin Chen
0 siblings, 1 reply; 67+ messages in thread
From: Tian, Kevin @ 2025-07-03 4:57 UTC (permalink / raw)
To: Nicolin Chen
Cc: jgg@nvidia.com, corbet@lwn.net, will@kernel.org,
bagasdotme@gmail.com, robin.murphy@arm.com, joro@8bytes.org,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
shuah@kernel.org, jsnitsel@redhat.com, nathan@kernel.org,
peterz@infradead.org, Liu, Yi L, mshavit@google.com,
praan@google.com, zhangzekun11@huawei.com, iommu@lists.linux.dev,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
mochs@nvidia.com, alok.a.tiwari@oracle.com, vasant.hegde@amd.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, July 3, 2025 4:12 AM
>
> On Wed, Jul 02, 2025 at 09:45:26AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Friday, June 27, 2025 3:35 AM
> > >
> > > +int iommufd_access_notify_unmap(struct io_pagetable *iopt, unsigned
> long
> > > iova,
> > > + unsigned long length)
> > > {
> > > struct iommufd_ioas *ioas =
> > > container_of(iopt, struct iommufd_ioas, iopt);
> > > struct iommufd_access *access;
> > > unsigned long index;
> > > + int ret = 0;
> > >
> > > xa_lock(&ioas->iopt.access_list);
> > > + /* Bypass any unmap if there is an internal access */
> > > + xa_for_each(&ioas->iopt.access_list, index, access) {
> > > + if (iommufd_access_is_internal(access)) {
> > > + ret = -EBUSY;
> > > + goto unlock;
> > > + }
> > > + }
> > > +
> >
> > hmm all those checks are per iopt. Could do one-off check in
> > iopt_unmap_iova_range() and store the result in a local flag.
> >
> > Then use that flag to decide whether to return -EBUSY if
> > area->num_accesses is true in the loop.
>
> I don't quite follow this...
>
> Do you suggest to move this xa_for_each to iopt_unmap_iova_range?
yes
>
> What's that local flag used for?
>
I meant something like below:
iopt_unmap_iova_range()
{
bool internal_access = false;
down_read(&iopt->domains_rwsem);
down_write(&iopt->iova_rwsem);
/* Bypass any unmap if there is an internal access */
xa_for_each(&iopt->access_list, index, access) {
if (iommufd_access_is_internal(access)) {
internal_access = true;
break;
}
}
while ((area = iopt_area_iter_first(iopt, start, last))) {
if (area->num_access) {
if (internal_access) {
rc = -EBUSY;
goto out_unlock_iova;
}
up_write(&iopt->iova_rwsem);
up_read(&iopt->domains_rwsem);
iommufd_access_notify_unmap(iopt, area_first, length);
}
}
}
it checks the access_list in the common path, but the cost should be
negligible when there is no access attached to this iopt. The upside
is that now unmap is denied explicitly in the area loop instead of
still trying to unmap and then handling errors.
but the current way is also fine. After another thought I'm neutral
to it. 😊
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-02 18:05 ` Jason Gunthorpe
@ 2025-07-03 14:46 ` Pranjal Shrivastava
2025-07-03 17:55 ` Jason Gunthorpe
0 siblings, 1 reply; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-03 14:46 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, kevin.tian, corbet, will, bagasdotme, robin.murphy,
joro, thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Wed, Jul 02, 2025 at 03:05:41PM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 02, 2025 at 01:38:33AM +0000, Pranjal Shrivastava wrote:
> > On Tue, Jul 01, 2025 at 05:46:06PM -0700, Nicolin Chen wrote:
> > > On Wed, Jul 02, 2025 at 12:14:28AM +0000, Pranjal Shrivastava wrote:
> > > > Thus, coming back to the two initial points:
> > > >
> > > > 1) Issuing "non-invalidation" commands through .cache_invalidate could
> > > > be confusing, I'm not asking to change the op name here, but if we
> > > > plan to label it, let's label them as "Trapped commands" OR
> > > > "non-accelerated" commands as you suggested.
> > >
> > > VCMDQ only accelerates limited invalidation commands, not all of
> > > them: STE cache invalidation and CD cache invalidation commands
> > > still go down to that op.
> > >
> >
> > Right, I'm just saying the "other" non-accelerated commands that are
> > NOT invalidations also go down that op. So, if we add a comment, let's
> > not call them "non-invalidation" commands.
>
> There are no non-invalidation commands:
>
> static int arm_vsmmu_convert_user_cmd(struct arm_vsmmu *vsmmu,
> struct arm_vsmmu_invalidation_cmd *cmd)
> {
> switch (cmd->cmd[0] & CMDQ_0_OP) {
> case CMDQ_OP_TLBI_NSNH_ALL:
> case CMDQ_OP_TLBI_NH_VA:
> case CMDQ_OP_TLBI_NH_VAA:
> case CMDQ_OP_TLBI_NH_ALL:
> case CMDQ_OP_TLBI_NH_ASID:
> case CMDQ_OP_ATC_INV:
> case CMDQ_OP_CFGI_CD:
> case CMDQ_OP_CFGI_CD_ALL:
>
> Those are only invalidations.
>
> CD invalidation can't go through the vCMDQ path.
>
Right.. I was however hoping we'd also trap commands like CMD_PRI_RESP
and CMD_RESUME...I'm not sure if they should be accelerated via CMDQV..
I guess I'll need to look and understand a little more if they are..
> > > > 2) The "FIXME" confusion: The comment in arm_vsmmu_cache_invalidate
> > > > mentions we'd like to "fix" the issuing of commands through the main
> > > > cmdq and instead like to group by "type", if that "type" is the queue
> > > > type (which I assume it is because IOMMU_TYPE has to be arm-smmu-v3),
> > >
> > > I recall that FIXME is noted by Jason at that time. And it should
> > > be interpreted as "group by opcode", IIUIC.
> >
> > I see.. I misunderstood that..
>
> Yes, we could use the vCMDQ in the SMMU driver for invalidations which
> would give some minor locking advantage. But it is not really
> important to anyone.
>
Alright, I see. Makes sense. Thanks for the clarification.
> > > The thing is that for a host kernel that enabled in-kernel VCMDQs,
> > > those trapped user commands can be just issued to the smmu->cmdq
> > > or a vcmdq (picked via the get_secondary_cmdq impl_op).
> >
> > Ohh.. so maybe some sort of a load balancing thing?
>
> The goal of the SMMU driver when it detects CMDQV support is to route
> all supported invalidations to CMDQV queues and then balance those
> queues across CPUs to reduce lock contention.
>
I see.. that makes sense.. so it's a relatively small gain (but a nice
one). Thanks for clarifying!
> Jason
Praan
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-03 14:46 ` Pranjal Shrivastava
@ 2025-07-03 17:55 ` Jason Gunthorpe
2025-07-03 18:48 ` Pranjal Shrivastava
0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2025-07-03 17:55 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: Nicolin Chen, kevin.tian, corbet, will, bagasdotme, robin.murphy,
joro, thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jul 03, 2025 at 02:46:03PM +0000, Pranjal Shrivastava wrote:
> Right.. I was however hoping we'd also trap commands like CMD_PRI_RESP
> and CMD_RESUME...I'm not sure if they should be accelerated via CMDQV..
> I guess I'll need to look and understand a little more if they are..
Right now these commands are not supported by vSMMUv3 in Linux.
They probably should be trapped, but completing a PRI (or resuming a
stall which we will treat the same) will go through the PRI/page fault
logic in iommufd not the cache invalidate.
> > The goal of the SMMU driver when it detects CMDQV support is to route
> > all supported invalidations to CMDQV queues and then balance those
> > queues across CPUs to reduce lock contention.
>
> I see.. that makes sense.. so it's a relatively small gain (but a nice
> one). Thanks for clarifying!
On bare metal the gain is small (due to locking and balancing), while
on virtualization the gain is huge (due to no trapping).
Regardless the SMMU driver uses cmdqv support if the HW says it is
there.
Jason
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-03 17:55 ` Jason Gunthorpe
@ 2025-07-03 18:48 ` Pranjal Shrivastava
2025-07-04 12:50 ` Jason Gunthorpe
0 siblings, 1 reply; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-03 18:48 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, kevin.tian, corbet, will, bagasdotme, robin.murphy,
joro, thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jul 03, 2025 at 02:55:32PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 03, 2025 at 02:46:03PM +0000, Pranjal Shrivastava wrote:
>
> > Right.. I was however hoping we'd also trap commands like CMD_PRI_RESP
> > and CMD_RESUME...I'm not sure if they should be accelerated via CMDQV..
> > I guess I'll need to look and understand a little more if they are..
>
> Right now these commands are not supported by vSMMUv3 in Linux.
>
> They probably should be trapped, but completing a PRI (or resuming a
> stall which we will treat the same) will go through the PRI/page fault
> logic in iommufd not the cache invalidate.
>
Ahh, thanks for this, that saved a lot of my time! And yes, I see some
functions in eventq.c calling the iopf_group_response which settles the
CMD_RESUME. So.. I assume these resume commands would be trapped and
*actually* executed through this or a similar path for vPRI.
Meh, I had been putting off reading up the fault parts of iommufd,
I guess I'll go through that too, now :)
> > > The goal of the SMMU driver when it detects CMDQV support is to route
> > > all supported invalidations to CMDQV queues and then balance those
> > > queues across CPUs to reduce lock contention.
> >
> > I see.. that makes sense.. so it's a relatively small gain (but a nice
> > one). Thanks for clarifying!
>
> On bare metal the gain is small (due to locking and balancing), while
> on virtualization the gain is huge (due to no trapping).
>
Ohh yes, I meant the bare metal gains here.. for virtualization, it's
definitely huge (as reported too).
> Regardless the SMMU driver uses cmdqv support if the HW says it is
> there.
>
> Jason
>
Thanks!
Praan
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 10/28] iommufd/access: Bypass access->ops->unmap for internal use
2025-07-03 4:57 ` Tian, Kevin
@ 2025-07-04 4:08 ` Nicolin Chen
0 siblings, 0 replies; 67+ messages in thread
From: Nicolin Chen @ 2025-07-04 4:08 UTC (permalink / raw)
To: Tian, Kevin
Cc: jgg@nvidia.com, corbet@lwn.net, will@kernel.org,
bagasdotme@gmail.com, robin.murphy@arm.com, joro@8bytes.org,
thierry.reding@gmail.com, vdumpa@nvidia.com, jonathanh@nvidia.com,
shuah@kernel.org, jsnitsel@redhat.com, nathan@kernel.org,
peterz@infradead.org, Liu, Yi L, mshavit@google.com,
praan@google.com, zhangzekun11@huawei.com, iommu@lists.linux.dev,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-tegra@vger.kernel.org,
linux-kselftest@vger.kernel.org, patches@lists.linux.dev,
mochs@nvidia.com, alok.a.tiwari@oracle.com, vasant.hegde@amd.com,
dwmw2@infradead.org, baolu.lu@linux.intel.com
On Thu, Jul 03, 2025 at 04:57:34AM +0000, Tian, Kevin wrote:
> I meant something like below:
>
> iopt_unmap_iova_range()
> {
> bool internal_access = false;
>
> down_read(&iopt->domains_rwsem);
> down_write(&iopt->iova_rwsem);
> /* Bypass any unmap if there is an internal access */
> xa_for_each(&iopt->access_list, index, access) {
> if (iommufd_access_is_internal(access)) {
> internal_access = true;
> break;
> }
> }
>
> while ((area = iopt_area_iter_first(iopt, start, last))) {
> if (area->num_access) {
> if (internal_access) {
> rc = -EBUSY;
> goto out_unlock_iova;
> }
> up_write(&iopt->iova_rwsem);
> up_read(&iopt->domains_rwsem);
> iommufd_access_notify_unmap(iopt, area_first, length);
> }
> }
> }
>
> it checks the access_list in the common path, but the cost should be
> negligible when there is no access attached to this iopt. The upside
> is that now unmap is denied explicitly in the area loop instead of
> still trying to unmap and then handling errors.
Hmm, I realized that either way might be incorrect, as it iterates
the entire iopt for any internal access regardless of its iova range.
What we really want is to reject an unmap against the same range that
was pinned by an internal access, i.e. unmaps of other ranges should
still be allowed.
So, doing it at this level isn't enough. I think we should still go
down to struct iopt_area as my v5 did:
https://lore.kernel.org/all/3ddc8c678406772a8358a265912bb1c064f4c796.1747537752.git.nicolinc@nvidia.com/
We'd only need to rename to num_locked as you suggested, i.e.
@@ -719,6 +719,12 @@ static int iopt_unmap_iova_range(struct io_pagetable *iopt, unsigned long start,
goto out_unlock_iova;
}
+ /* The area is locked by an object that has not been destroyed */
+ if (area->num_locked) {
+ rc = -EBUSY;
+ goto out_unlock_iova;
+ }
+
if (area_first < start || area_last > last) {
rc = -ENOENT;
goto out_unlock_iova;
Thanks
Nicolin
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-03 18:48 ` Pranjal Shrivastava
@ 2025-07-04 12:50 ` Jason Gunthorpe
2025-07-10 9:04 ` Pranjal Shrivastava
0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2025-07-04 12:50 UTC (permalink / raw)
To: Pranjal Shrivastava
Cc: Nicolin Chen, kevin.tian, corbet, will, bagasdotme, robin.murphy,
joro, thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jul 03, 2025 at 06:48:42PM +0000, Pranjal Shrivastava wrote:
> Ahh, thanks for this, that saved a lot of my time! And yes, I see some
> functions in eventq.c calling the iopf_group_response which settles the
> CMD_RESUME. So.. I assume these resume commands would be trapped and
> *actually* executed through this or a similar path for vPRI.
Yes, that is what Intel did. PRI has to be tracked in the kernel
because we have to ack requests eventually. If the VMM crashes the
kernel has to ack everything and try to clean up.
Also SMMUv3 does not support PRI today, just stall.
Jason
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 01/28] iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range
2025-06-26 19:34 ` [PATCH v7 01/28] iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range Nicolin Chen
2025-07-02 9:39 ` Tian, Kevin
@ 2025-07-04 12:59 ` Jason Gunthorpe
1 sibling, 0 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2025-07-04 12:59 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, praan, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:32PM -0700, Nicolin Chen wrote:
> There are callers that read the unmapped bytes even when rc != 0. Thus, do
> not forget to report it in the error path too.
>
> Fixes: 8d40205f6093 ("iommufd: Add kAPI toward external drivers for kernel access")
> Cc: stable@vger.kernel.org
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/io_pagetable.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id
2025-06-26 19:34 ` [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id Nicolin Chen
2025-07-01 12:30 ` Pranjal Shrivastava
2025-07-02 9:40 ` Tian, Kevin
@ 2025-07-04 12:59 ` Jason Gunthorpe
2 siblings, 0 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2025-07-04 12:59 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, praan, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:33PM -0700, Nicolin Chen wrote:
> The "id" is too general to get its meaning easily. Rename it explicitly to
> "virt_id" and update the kdocs for readability. No functional changes.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 7 ++++++-
> drivers/iommu/iommufd/driver.c | 2 +-
> drivers/iommu/iommufd/viommu.c | 4 ++--
> 3 files changed, 9 insertions(+), 4 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op
2025-06-26 19:34 ` [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op Nicolin Chen
` (2 preceding siblings ...)
2025-07-02 9:41 ` Tian, Kevin
@ 2025-07-04 13:00 ` Jason Gunthorpe
3 siblings, 0 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2025-07-04 13:00 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, praan, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:34PM -0700, Nicolin Chen wrote:
> Replace u32 to make it clear. No functional changes.
>
> Also simplify the kdoc since the type itself is clear enough.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> include/linux/iommu.h | 6 +++---
> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-iommufd.c | 3 ++-
> drivers/iommu/intel/iommu.c | 3 ++-
> drivers/iommu/iommufd/selftest.c | 3 ++-
> 4 files changed, 9 insertions(+), 6 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 09/28] iommufd/access: Add internal APIs for HW queue to use
2025-06-26 19:34 ` [PATCH v7 09/28] iommufd/access: Add internal APIs for HW queue to use Nicolin Chen
2025-07-02 9:42 ` Tian, Kevin
@ 2025-07-04 13:08 ` Jason Gunthorpe
1 sibling, 0 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2025-07-04 13:08 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, praan, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:40PM -0700, Nicolin Chen wrote:
> The new HW queue object, as an internal iommufd object, wants to reuse the
> struct iommufd_access to pin some iova range in the iopt.
>
> However, an access generally takes the refcount of an ictx. So, in such an
> internal case, a deadlock could happen when the release of the ictx has to
> wait for the release of the access first when releasing a hw_queue object,
> which could wait for the release of the ictx that is refcounted:
> ictx --releases--> hw_queue --releases--> access
> ^ |
> |_________________releases________________v
>
> To address this, add a set of lightweight internal APIs to unlink the ictx
> and the access, i.e. no ictx refcounting by the access:
> ictx --releases--> hw_queue --releases--> access
>
> Then, there's no point in setting the access->ictx. So simply define !ictx
> as a flag for internal use and add an inline helper.
>
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
> drivers/iommu/iommufd/iommufd_private.h | 23 ++++++++++
> drivers/iommu/iommufd/device.c | 59 +++++++++++++++++++++----
> 2 files changed, 73 insertions(+), 9 deletions(-)
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 13/28] iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
2025-06-26 19:34 ` [PATCH v7 13/28] iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl Nicolin Chen
@ 2025-07-04 13:26 ` Jason Gunthorpe
2025-07-04 13:33 ` Jason Gunthorpe
0 siblings, 1 reply; 67+ messages in thread
From: Jason Gunthorpe @ 2025-07-04 13:26 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, praan, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Thu, Jun 26, 2025 at 12:34:44PM -0700, Nicolin Chen wrote:
> +static struct iommufd_access *
> +iommufd_hw_queue_alloc_phys(struct iommu_hw_queue_alloc *cmd,
> + struct iommufd_viommu *viommu, phys_addr_t *base_pa)
> +{
> + struct iommufd_access *access;
> + struct page **pages;
> + int max_npages, i;
These types are not int..
> + u64 offset;
> + int rc;
> +
> + offset =
> + cmd->nesting_parent_iova - PAGE_ALIGN(cmd->nesting_parent_iova);
This is a u64
> + max_npages = DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE);
Length is a u64
It should be
/* DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */
if (check_add_overflow(offset, cmd->length, &length))
return -ERANGE;
if (check_add_overflow(length, PAGE_SIZE-1, &length))
return -ERANGE;
if (length > SIZE_MAX)
return -ERANGE;
max_npages = length / PAGE_SIZE;
And then max_npages and i should be size_t.
Otherwise it looks OK
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Jason
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 13/28] iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl
2025-07-04 13:26 ` Jason Gunthorpe
@ 2025-07-04 13:33 ` Jason Gunthorpe
0 siblings, 0 replies; 67+ messages in thread
From: Jason Gunthorpe @ 2025-07-04 13:33 UTC (permalink / raw)
To: Nicolin Chen
Cc: kevin.tian, corbet, will, bagasdotme, robin.murphy, joro,
thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, praan, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Fri, Jul 04, 2025 at 10:26:02AM -0300, Jason Gunthorpe wrote:
> /* DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */
> if (check_add_overflow(offset, cmd->length, &length))
> return -ERANGE;
> if (check_add_overflow(length, PAGE_SIZE-1, &length))
> return -ERANGE;
> if (length > SIZE_MAX)
> return -ERANGE;
> max_npages = length / PAGE_SIZE;
Actually I see now that overflow.h supports mixed types, so this can
be simplified:
size_t max_npages;
size_t length;
u64 offset;
size_t i;
offset = cmd->nesting_parent_iova - PAGE_ALIGN(cmd->nesting_parent_iova);
/* DIV_ROUND_UP(offset + cmd->length, PAGE_SIZE) */
if (check_add_overflow(offset, cmd->length, &length))
return -ERANGE;
if (check_add_overflow(length, PAGE_SIZE-1, &length))
return -ERANGE;
max_npages = length / PAGE_SIZE;
Then the kcvalloc takes in size_t:
kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
So there is no silent cast and truncation.
Jason
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support
2025-07-04 12:50 ` Jason Gunthorpe
@ 2025-07-10 9:04 ` Pranjal Shrivastava
0 siblings, 0 replies; 67+ messages in thread
From: Pranjal Shrivastava @ 2025-07-10 9:04 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Nicolin Chen, kevin.tian, corbet, will, bagasdotme, robin.murphy,
joro, thierry.reding, vdumpa, jonathanh, shuah, jsnitsel, nathan,
peterz, yi.l.liu, mshavit, zhangzekun11, iommu, linux-doc,
linux-kernel, linux-arm-kernel, linux-tegra, linux-kselftest,
patches, mochs, alok.a.tiwari, vasant.hegde, dwmw2, baolu.lu
On Fri, Jul 04, 2025 at 09:50:12AM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 03, 2025 at 06:48:42PM +0000, Pranjal Shrivastava wrote:
>
> > Ahh, thanks for this, that saved a lot of my time! And yes, I see some
> > functions in eventq.c calling the iopf_group_response which settles the
> > CMD_RESUME. So.. I assume these resume commands would be trapped and
> > *actually* executed through this or a similar path for vPRI.
>
> Yes, that is what Intel did. PRI has to be tracked in the kernel
> because we have to ack requests eventually. If the VMM crashes the
> kernel has to ack everything and try to clean up.
>
I see.. thanks for clarifying!
> Also SMMUv3 does not support PRI today, just stall.
>
Ack. Thanks!
> Jason
Praan
end of thread, other threads:[~2025-07-10 9:04 UTC | newest]
Thread overview: 67+ messages
2025-06-26 19:34 [PATCH v7 00/28] iommufd: Add vIOMMU infrastructure (Part-4 HW QUEUE) Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 01/28] iommufd: Report unmapped bytes in the error path of iopt_unmap_iova_range Nicolin Chen
2025-07-02 9:39 ` Tian, Kevin
2025-07-04 12:59 ` Jason Gunthorpe
2025-06-26 19:34 ` [PATCH v7 02/28] iommufd/viommu: Explicitly define vdev->virt_id Nicolin Chen
2025-07-01 12:30 ` Pranjal Shrivastava
2025-07-02 9:40 ` Tian, Kevin
2025-07-02 19:59 ` Nicolin Chen
2025-07-04 12:59 ` Jason Gunthorpe
2025-06-26 19:34 ` [PATCH v7 03/28] iommu: Use enum iommu_hw_info_type for type in hw_info op Nicolin Chen
2025-07-01 12:48 ` Pranjal Shrivastava
2025-07-01 12:51 ` Pranjal Shrivastava
2025-07-02 9:41 ` Tian, Kevin
2025-07-04 13:00 ` Jason Gunthorpe
2025-06-26 19:34 ` [PATCH v7 04/28] iommu: Add iommu_copy_struct_to_user helper Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 05/28] iommu: Pass in a driver-level user data structure to viommu_init op Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 06/28] iommufd/viommu: Allow driver-specific user data for a vIOMMU object Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 07/28] iommufd/selftest: Support user_data in mock_viommu_alloc Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 08/28] iommufd/selftest: Add coverage for viommu data Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 09/28] iommufd/access: Add internal APIs for HW queue to use Nicolin Chen
2025-07-02 9:42 ` Tian, Kevin
2025-07-04 13:08 ` Jason Gunthorpe
2025-06-26 19:34 ` [PATCH v7 10/28] iommufd/access: Bypass access->ops->unmap for internal use Nicolin Chen
2025-07-02 9:45 ` Tian, Kevin
2025-07-02 20:12 ` Nicolin Chen
2025-07-03 4:57 ` Tian, Kevin
2025-07-04 4:08 ` Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 11/28] iommufd/viommu: Add driver-defined vDEVICE support Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 12/28] iommufd/viommu: Introduce IOMMUFD_OBJ_HW_QUEUE and its related struct Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 13/28] iommufd/viommu: Add IOMMUFD_CMD_HW_QUEUE_ALLOC ioctl Nicolin Chen
2025-07-04 13:26 ` Jason Gunthorpe
2025-07-04 13:33 ` Jason Gunthorpe
2025-06-26 19:34 ` [PATCH v7 14/28] iommufd/driver: Add iommufd_hw_queue_depend/undepend() helpers Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 15/28] iommufd/selftest: Add coverage for IOMMUFD_CMD_HW_QUEUE_ALLOC Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 16/28] iommufd: Add mmap interface Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 17/28] iommufd/selftest: Add coverage for the new " Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 18/28] Documentation: userspace-api: iommufd: Update HW QUEUE Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 19/28] iommu: Allow an input type in hw_info op Nicolin Chen
2025-07-01 12:54 ` Pranjal Shrivastava
2025-06-26 19:34 ` [PATCH v7 20/28] iommufd: Allow an input data_type via iommu_hw_info Nicolin Chen
2025-07-01 12:58 ` Pranjal Shrivastava
2025-06-26 19:34 ` [PATCH v7 21/28] iommufd/selftest: Update hw_info coverage for an input data_type Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 22/28] iommu/arm-smmu-v3-iommufd: Add vsmmu_size/type and vsmmu_init impl ops Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 23/28] iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops Nicolin Chen
2025-07-01 12:24 ` Pranjal Shrivastava
2025-06-26 19:34 ` [PATCH v7 24/28] iommu/tegra241-cmdqv: Use request_threaded_irq Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 25/28] iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf() Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 26/28] iommu/tegra241-cmdqv: Do not statically map LVCMDQs Nicolin Chen
2025-06-26 19:34 ` [PATCH v7 27/28] iommu/tegra241-cmdqv: Add user-space use support Nicolin Chen
2025-07-01 16:02 ` Pranjal Shrivastava
2025-07-01 19:42 ` Nicolin Chen
2025-07-01 20:03 ` Pranjal Shrivastava
2025-07-01 20:23 ` Nicolin Chen
2025-07-01 20:43 ` Pranjal Shrivastava
2025-07-01 22:07 ` Nicolin Chen
2025-07-01 22:51 ` Pranjal Shrivastava
2025-07-01 23:01 ` Nicolin Chen
2025-07-02 0:14 ` Pranjal Shrivastava
2025-07-02 0:46 ` Nicolin Chen
2025-07-02 1:38 ` Pranjal Shrivastava
2025-07-02 18:05 ` Jason Gunthorpe
2025-07-03 14:46 ` Pranjal Shrivastava
2025-07-03 17:55 ` Jason Gunthorpe
2025-07-03 18:48 ` Pranjal Shrivastava
2025-07-04 12:50 ` Jason Gunthorpe
2025-07-10 9:04 ` Pranjal Shrivastava
2025-06-26 19:34 ` [PATCH v7 28/28] iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support Nicolin Chen